feat: report comparisons (#1069)
* ci: check for flake8 comprehensions

* fix(config): configuration order is now respected

* fix: index is no longer automatically added to dataframe

* feat: correlation alerts show the name of the correlation

* fix: strip tags from the title of the web report

* feat: comparing two or more datasets (see docs)

* docs(comparison): feature description

* docs(readme): include reference to the dataset comparison use case

* refactor: config private attribute

* refactor: config update, exclude defaults

* refactor: include style attribute in timeseries code

* refactor: include style attribute in templates

* test(comparisons): add tests for report comparison

* refactor: overall correlation lowercase

* refactor: frequency table kwargs

* refactor: frequency table styling

* refactor: fixing renderable tests

* refactor: fixing renderable tests

* style: formatting

* refactor: sensitive test

* refactor: pass style argument

* feat: check for empty dataframe

* refactor: namespace invariant type check

* refactor: ipywidgets fixes

* refactor: ipywidgets no comparison support yet

* refactor: process feedback

* fix: comparison bugs (#1137)

* fix: refactoring bugs

* fix: update protected var labels for comparison

* fix: add support to timeseries comparison

* fix: style changes for readability

* test: add simple run test

* fix: reword comparison report doc (#1136)

* fix: rewording

Co-authored-by: Aarni Koskela <[email protected]>

* feat: add comparison validations (#1143)

* feat: add comparison validations

* feat: replace missing plots to avoid dependencies' conflicts (#1148)

* feat: add new missing histogram plot

* feat: add new missing matrix plot

* feat: add new missing heatmap plot

* feat: remove dendrogram

* feat: ignore columns not present on the base report (#1150)

* feat: select only the left side of the comparison

* chore: pre-commit fixes

* fix: not intersection of columns

* [skip ci] Code formatting

* fix: missing plots columns order

* [skip ci] Code formatting

* fix: interactions/missing plot colors

* fix: code formatting

Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
Co-authored-by: alexbarros <[email protected]>
4 people committed Nov 19, 2022
1 parent 7b73e8a commit 51a8e9f
Showing 77 changed files with 1,920 additions and 568 deletions.
5 changes: 3 additions & 2 deletions .pre-commit-config.yaml
@@ -52,9 +52,10 @@ repos:
hooks:
- id: flake8
name: flake8-annotations
-args: [ "--select=ANN001,ANN201,ANN202,ANN205,ANN206,ANN301" ]
+args: [ "--select=ANN001,ANN201,ANN202,ANN205,ANN206,ANN301,C4" ]
additional_dependencies:
- flake8-annotations
+- flake8-comprehensions
exclude: |
(?x)(
^tests/|
@@ -71,7 +72,7 @@ repos:
hooks:
- id: rst-backticks
- repo: https://github.com/pre-commit/mirrors-mypy
-rev: 'v0.931'
+rev: 'v0.982'
hooks:
- id: mypy
additional_dependencies:
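
Note on the `C4` selector added in this hunk: it enables the checks from the flake8-comprehensions plugin. As a rough illustration only (the variable names are invented for this sketch and do not come from the repository), these are the generator-in-constructor patterns the plugin reports, each next to the comprehension it suggests instead:

```python
# Patterns reported by flake8-comprehensions, each followed by the suggested rewrite.
# The data below is illustrative and not taken from the pandas-profiling code base.
names = ["bar", "matrix", "heatmap"]

# C400: a generator passed to list() can be a list comprehension.
flagged_list = list(n.upper() for n in names)
preferred_list = [n.upper() for n in names]

# C401: a generator passed to set() can be a set comprehension.
flagged_set = set(len(n) for n in names)
preferred_set = {len(n) for n in names}

# C402: a generator of pairs passed to dict() can be a dict comprehension.
flagged_dict = dict((n, True) for n in names)
preferred_dict = {n: True for n in names}

print(preferred_list, preferred_set, preferred_dict)
```

Both spellings build equal objects; the check is about readability and skipping the intermediate generator.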
13 changes: 7 additions & 6 deletions README.md
@@ -37,7 +37,7 @@ For each column, the following information (whenever relevant for the column typ
- **Most frequent and extreme values**
- **Histograms**: categorical and numerical
- **Correlations**: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
-- **Missing values**: through counts, matrix, heatmap and dendrograms
+- **Missing values**: through counts, matrix and heatmap
- **Duplicate rows**: list of the most common duplicated rows
- **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
- **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
@@ -200,11 +200,12 @@ You need [Python 3](https://python3statement.org/) to run the package. Other dep

The documentation includes guides, tips and tricks for tackling common use cases:

-| Use case | Description |
-|---|---|
-| [Profiling large datasets](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/big_data.html ) | Tips on how to prepare data and configure `pandas-profiling` for working with large datasets |
-| [Handling sensitive data](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/sensitive_data.html ) | Generating reports which are mindful about sensitive data in the input dataset |
-| [Dataset metadata and data dictionaries](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/metadata.html) | Complementing the report with dataset details and column-specific data dictionaries |
+| Use case | Description |
+|-------------------------------------------------------------------------------------------------------------------------------------|--|
+| [Profiling large datasets](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/big_data.html ) | Tips on how to prepare data and configure `pandas-profiling` for working with large datasets |
+| [Handling sensitive data](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/sensitive_data.html ) | Generating reports which are mindful about sensitive data in the input dataset |
+| [Comparing datasets](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/comparing_datasets.html ) | Comparing multiple versions of the same dataset |
+| [Dataset metadata and data dictionaries](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/metadata.html) | Complementing the report with dataset details and column-specific data dictionaries |
| [Customizing the report's appearance](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/custom_report_appearance.html ) | Changing the appearance of the report's page and of the contained visualizations |

## 🔗 Integrations
Binary file removed docsrc/assets/qt.png
Binary file not shown.
13 changes: 7 additions & 6 deletions docsrc/source/index.rst
@@ -3,31 +3,32 @@
.. toctree::
:maxdepth: 3
:caption: Getting started
-:hidden:
+:hidden:

pages/getting_started/overview
pages/getting_started/installation
pages/getting_started/quickstart
pages/getting_started/concepts
pages/getting_started/examples

.. toctree::
:maxdepth: 3
:caption: Use cases
:hidden:

pages/use_cases/big_data
pages/use_cases/sensitive_data
+pages/use_cases/comparing_datasets
pages/use_cases/metadata
pages/use_cases/custom_report_appearance


.. toctree::
:maxdepth: 3
:caption: Integrations
:hidden:

-pages/integrations/other_dataframe_libraries
+pages/integrations/other_dataframe_libraries
pages/integrations/great_expectations
pages/integrations/data_apps
pages/integrations/pipelines
@@ -57,7 +58,7 @@
pages/support_contrib/help_troubleshoot
pages/support_contrib/common_issues
pages/support_contrib/contribution_guidelines

.. toctree::
:maxdepth: 3
:caption: Reference
@@ -68,4 +69,4 @@
pages/reference/history
pages/reference/announcements
pages/reference/resources

5 changes: 1 addition & 4 deletions docsrc/source/pages/advanced_usage/available_settings.rst
@@ -62,18 +62,15 @@ Settings related with the missing data section and the visualizations it can inc
:header-rows: 1

.. code-block:: python
-:caption: Configuration example: disable heatmap and dendrogram for large datasets
+:caption: Configuration example: disable heatmap for large datasets
profile = df.profile_report(
missing_diagrams={
"heatmap": False,
-"dendrogram": False,
}
)
profile.to_file("report.html")
-The missing data diagrams are generated by the `missingno <https://github.com/ResidentMario/missingno>`_ package.
Correlations
------------
2 changes: 1 addition & 1 deletion docsrc/source/pages/getting_started/overview.rst
@@ -38,7 +38,7 @@ For each column, the following information (whenever relevant for the column typ
* **Most frequent and extreme values**
* **Histograms:** categorical and numerical
* **Correlations**: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér's V, Phik, Auto)
-* **Missing values**: through counts, matrix, heatmap and dendrograms
+* **Missing values**: through counts, matrix and heatmap
* **Duplicate rows**: list of the most common duplicated rows
* **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
* **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
@@ -15,7 +15,6 @@

get_font_size
plot_missing_bar
-plot_missing_dendrogram
plot_missing_heatmap
plot_missing_matrix

3 changes: 1 addition & 2 deletions docsrc/source/pages/tables/config_missing.csv
@@ -1,5 +1,4 @@
Parameter,Type,Default,Description
``missing_diagrams.bar``,boolean,``True``,"Display a bar chart with counts of missing values for each column."
``missing_diagrams.matrix``,boolean,``True``,"Display a matrix of missing values. Similar to the bar chart, but might provide overview of the co-occurrence of missing values in rows."
-``missing_diagrams.heatmap``,boolean,``True``,"Display a heatmap of missing values, that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another)."
-``missing_diagrams.dendrogram``,boolean,``True``,"Display a dendrogram. Provides insight in the co-occurrence of missing values (i.e. columns that are both filled or both none)."
+``missing_diagrams.heatmap``,boolean,``True``,"Display a heatmap of missing values, that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another)."
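
To make "nullity correlation" concrete, here is a minimal sketch that assumes it means the Pearson correlation between the missing-value indicators of two columns. The function name and the sample columns are hypothetical, and the sketch uses only the standard library; it is not the code the report uses to draw the heatmap.

```python
from math import sqrt


def nullity_correlation(col_a, col_b):
    """Pearson correlation between the missing-value indicators of two columns."""
    # 1.0 marks a missing entry (None), 0.0 a present one.
    a = [1.0 if value is None else 0.0 for value in col_a]
    b = [1.0 if value is None else 0.0 for value in col_b]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A column that is never (or always) missing has no defined correlation.
        return float("nan")
    return cov / sqrt(var_a * var_b)


# Illustrative data: age and income are missing in the same rows,
# city is missing exactly where age is present.
age = [31, None, 45, None, 27]
income = [52000, None, 61000, None, 48000]
city = [None, "Oslo", None, "Lima", None]

corr_together = nullity_correlation(age, income)
corr_opposite = nullity_correlation(age, city)
print(corr_together, corr_opposite)
```

Values near 1 indicate columns that tend to be missing together, values near -1 columns where one is missing when the other is present.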
46 changes: 46 additions & 0 deletions docsrc/source/pages/use_cases/comparing_datasets.rst
@@ -0,0 +1,46 @@
==================
Dataset Comparison
==================

``pandas-profiling`` can be used to compare multiple versions of the same dataset.
This is useful when comparing data from multiple time periods, such as two years.
Another common scenario is to view the dataset profile for training, validation and test sets in machine learning.

The following syntax can be used to compare two datasets:

.. code-block:: python

    import pandas as pd
    from pandas_profiling import ProfileReport

    train_df = pd.read_csv("train.csv")
    train_report = ProfileReport(train_df, title="Train")

    test_df = pd.read_csv("test.csv")
    test_report = ProfileReport(test_df, title="Test")

    comparison_report = train_report.compare(test_report)
    comparison_report.to_file("comparison.html")

The comparison report uses the ``title`` attribute from ``Settings`` as the label for each dataset throughout.
The colors are configured in ``settings.html.style.primary_colors``.
Lowering the numeric precision parameter ``settings.report.precision`` frees up some additional space in reports.


In order to compare more than two reports, the following syntax can be used:

.. code-block:: python

    from pandas_profiling import ProfileReport, compare

    comparison_report = compare([train_report, validation_report, test_report])

    # Obtain merged statistics
    statistics = comparison_report.get_description()

    # Save report to file
    comparison_report.to_file("comparison.html")

Note that full support is only guaranteed for comparing two datasets.
With more than two reports, the statistics can still be obtained, but the report may have formatting issues.
One of the settings that can be changed is ``settings.report.precision``.
As a rule of thumb, the value 10 can be used for a single report and 8 for comparing two reports.
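
To see why a lower precision frees up space, the sketch below assumes the setting is interpreted as a number of significant digits, in the same way as Python's ``g`` format specifier. ``format_stat`` is a hypothetical helper written for this illustration and is not part of ``pandas-profiling``.

```python
def format_stat(value: float, precision: int) -> str:
    """Format a statistic with `precision` significant digits (hypothetical helper)."""
    return f"{value:.{precision}g}"


mean = 3.14159265358979

# Rule of thumb from the text: 10 for a single report, 8 when comparing two.
single = format_stat(mean, 10)
compared = format_stat(mean, 8)

print(single, len(single))
print(compared, len(compared))
```

Each value shown in a side-by-side comparison becomes a couple of characters narrower, which adds up across the columns of a comparison table.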
2 changes: 0 additions & 2 deletions requirements.txt
@@ -9,8 +9,6 @@ numpy>=1.16.0,<1.24
# Could be optional
# Related to HTML report
htmlmin==0.1.12
-# Missing values
-missingno>=0.4.2, <0.6
# Correlations
phik>=0.11.1,<0.13
# Examples
2 changes: 2 additions & 0 deletions src/pandas_profiling/__init__.py
@@ -3,6 +3,7 @@
.. include:: ../../README.md
"""

+from pandas_profiling.compare_reports import compare
from pandas_profiling.controller import pandas_decorator
from pandas_profiling.profile_report import ProfileReport
from pandas_profiling.version import __version__
@@ -15,4 +16,5 @@
"pandas_decorator",
"ProfileReport",
"__version__",
+"compare",
]
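
The commit message says the newly exported `compare` ignores columns that are not present on the base report. One possible reading of that rule, sketched in plain Python: the function name and column lists are hypothetical, and this is not the body of `compare_reports.compare`, which is not shown on this page.

```python
from typing import List


def columns_to_compare(base: List[str], other: List[str]) -> List[str]:
    """Keep the base report's columns that also exist in the other dataset, in base order."""
    present = set(other)
    return [col for col in base if col in present]


# Illustrative column names: the base (left-hand) report decides what is compared and in which order.
train_columns = ["age", "income", "city"]
test_columns = ["city", "age", "signup_date"]

selected = columns_to_compare(train_columns, test_columns)
print(selected)
```

Under this reading, `signup_date` is dropped because the base report does not have it, and the remaining columns follow the base report's order.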