Data exploration is a crucial step in any data analysis or data science project. It helps you understand your data, find patterns and relationships, and spot potential issues or outliers.
One of the most popular tools for data exploration is the Python library Pandas. The library provides a powerful set of tools for working with data, including data cleaning, transformation, and visualization. However, even with the powerful capabilities of Pandas, data exploration can still be a time-consuming and tedious task. That's where Pandas Profiling comes in.
With Pandas Profiling, you can easily generate detailed reports of your data, including summary statistics, missing values, and correlations, making data exploration faster and more efficient. This article will explore how Pandas Profiling can help you improve your data exploration process and make it easier to understand your data.
- What is Pandas Profiling?
- Installation of Pandas Profiling
- Pandas Profiling in Action
- Drawbacks of Pandas Profiling & How to Overcome Them
Pandas Profiling is a Python library that generates a comprehensive report of a DataFrame, including the number of rows and columns, missing values, data types, and other statistics. You can use it to quickly identify potential issues or outliers, and to produce summary statistics and visualizations of the data.
The report generated by Pandas Profiling typically includes a variety of information about the dataset, including:
- Overview: Summary statistics for all columns, including the number of rows, missing values, and data types.
- Variables: Information about each column, including the number of unique values, missing values, and the top frequent values.
- Correlations: Correlation matrix and heatmap, showing the relationship between different variables.
- Distribution: Histograms and kernel density plots for each column, showing the distribution of values.
- Categorical Variables: Bar plots for categorical variables, showing the frequency of each category.
- Numerical Variables: Box plots for numerical variables, showing the distribution of values and outliers.
- Text: Information about text columns, including the number of characters and words.
- File: Information about file columns, including the number of files, and the size of each file.
- High-Cardinality: Information about high-cardinality categorical variables, including their most frequent values.
- Sample: A sample of the data, with the first and last few rows displayed.
It is worth noting that the report is interactive and you can drill down on each section for more details.
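To make the Overview and Variables sections more concrete, here is a minimal sketch of how a few of those numbers could be computed by hand with plain pandas. The `toy_df` DataFrame and its values are invented for illustration, and this is not how Pandas Profiling computes its report internally; it only shows the kind of statistics the report gathers for you.

```python
import pandas as pd

# Toy data, invented purely for illustration.
toy_df = pd.DataFrame(
    {
        "name": ["SOPHIA", "CHLOE", "EMILY", None, "EMMA"],
        "count": [119, 106, 93, 89, 75],
        "gender": ["FEMALE"] * 5,
    }
)

# Overview: dataset-level numbers.
overview = {
    "rows": len(toy_df),
    "columns": toy_df.shape[1],
    "missing_cells": int(toy_df.isna().sum().sum()),
}

# Variables: one row of statistics per column.
variables = pd.DataFrame(
    {
        "dtype": toy_df.dtypes.astype(str),
        "unique": toy_df.nunique(),     # distinct non-missing values
        "missing": toy_df.isna().sum(),  # missing values per column
    }
)

print(overview)
print(variables)
```

Writing this by hand for every column and every statistic is exactly the repetitive work the report automates.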
To install pandas-profiling, run the following commands in a Jupyter notebook cell:
import sys
!"{sys.executable}" -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension
Successfully installed PyWavelets-1.3.0 htmlmin-0.1.12 imagehash-4.3.1 multimethod-1.9.1 networkx-2.6.3 packaging-23.0 pandas-profiling-3.6.3 phik-0.12.3 pydantic-1.10.4 statsmodels-0.13.5 tangled-up-in-unicode-0.2.0 typeguard-2.13.3 visions-0.7.5
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: ok
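If you want to confirm which version ended up in your environment, a small check like the one below may help. It is a sketch that assumes Python 3.8 or newer (where `importlib.metadata` is in the standard library); `installed_version` is a hypothetical helper name, not part of Pandas Profiling.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("pandas-profiling"))
```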
import pandas as pd
import pandas_profiling as pp
Let's put Pandas Profiling into action and see how it works. We will use the Popular Baby Names dataset.
Popular_baby_names_df = pd.read_csv('Popular_Baby_Names.csv')
Popular_baby_names_df.head()
| | Year of Birth | Gender | Ethnicity | Child's First Name | Count | Rank |
|---|---|---|---|---|---|---|
| 0 | 2011 | FEMALE | ASIAN AND PACIFIC ISLANDER | SOPHIA | 119 | 1 |
| 1 | 2011 | FEMALE | ASIAN AND PACIFIC ISLANDER | CHLOE | 106 | 2 |
| 2 | 2011 | FEMALE | ASIAN AND PACIFIC ISLANDER | EMILY | 93 | 3 |
| 3 | 2011 | FEMALE | ASIAN AND PACIFIC ISLANDER | OLIVIA | 89 | 4 |
| 4 | 2011 | FEMALE | ASIAN AND PACIFIC ISLANDER | EMMA | 75 | 5 |
profile = pp.ProfileReport(Popular_baby_names_df, title='Pandas Profiling Report')
# display the report
profile.to_widgets()
Pandas Profiling is a great tool for quickly generating detailed reports of your data, but it does have some drawbacks. One of the main drawbacks is that it can be memory intensive, especially for large datasets. This can cause the tool to run slowly or even crash if you don't have enough memory available.
Another drawback is that Pandas Profiling can only be used with Pandas DataFrames. This means that if you're working with data in a different format, such as a CSV file or a SQL database, you'll need to first convert it to a Pandas DataFrame before you can use Pandas Profiling.
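For data that lives in a SQL database, the conversion step is usually short. The sketch below uses an in-memory SQLite database with a hypothetical `baby_names` table and made-up column names, then loads the query result into a DataFrame with `pd.read_sql_query`. Your own connection and query will differ.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baby_names (name TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO baby_names VALUES (?, ?)",
    [("SOPHIA", 119), ("CHLOE", 106), ("EMILY", 93)],
)
conn.commit()

# Load the query result into a DataFrame, ready to be profiled.
names_df = pd.read_sql_query("SELECT name, total FROM baby_names", conn)
conn.close()

print(names_df.shape)
```

Once the data is in `names_df`, it can be passed to `ProfileReport` like any other DataFrame.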
Additionally, Pandas Profiling generates a lot of information, which can be overwhelming to digest if you don't know what you're looking for. The interactive widgets also only work inside a notebook, so to share or present the report you have to export it to a file such as HTML or JSON.
To overcome these limitations, you can try the following:
- Use Pandas Profiling on a sample of your data rather than the entire dataset to reduce memory usage.
- Use Pandas to convert your data to a DataFrame before using Pandas Profiling.
- Use the options in Pandas Profiling to customize the report and only include the information that you need.
- Complement the report with visualization libraries like Matplotlib and Seaborn to build focused plots that make specific findings easier to understand.
- Use the report as a starting point for your data exploration, and then use other tools and techniques to dive deeper into your data.
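The first and third suggestions can be sketched as follows. `sample_for_profiling` and `build_minimal_report` are hypothetical helper names, the 10,000-row default is an arbitrary assumption, and `big_df` is synthetic stand-in data. `minimal=True` is the `ProfileReport` option that turns off the more expensive computations; the report helper imports `pandas_profiling` lazily and is only defined here, not called.

```python
import numpy as np
import pandas as pd


def sample_for_profiling(df, max_rows=10_000, seed=42):
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def build_minimal_report(df, title="Pandas Profiling Report (sample)"):
    """Build a lighter report; requires pandas-profiling to be installed."""
    import pandas_profiling as pp  # imported lazily so sampling works without it

    return pp.ProfileReport(df, title=title, minimal=True)


# Synthetic stand-in for a large dataset.
big_df = pd.DataFrame({"value": np.arange(50_000), "group": np.arange(50_000) % 7})
sampled_df = sample_for_profiling(big_df, max_rows=5_000)
print(len(big_df), "->", len(sampled_df))
```

Keep in mind that a random sample can miss rare categories or extreme outliers, so treat a sampled report as a first look rather than a full audit.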
profile.to_file("your_report.html")