feat: report comparisons (#1069)

* ci: check for flake8 comprehensions
* fix(config): configuration order is now respected
* fix: index is no longer automatically added to dataframe
* feat: correlation alerts show the name of the correlation
* fix: strip tags from the title of the web report
* feat: comparing two or more datasets (see docs)
* docs(comparison): feature description
* docs(readme): include reference to the dataset comparison use case
* refactor: config private attribute
* refactor: config update, exclude defaults
* refactor: include style attribute in timeseries code
* refactor: include style attribute in templates
* test(comparisons): add tests for report comparison
* refactor: overall correlation lowercase
* refactor: frequency table kwargs
* refactor: frequency table styling
* refactor: fixing renderable tests
* refactor: fixing renderable tests
* style: formatting
* refactor: sensitive test
* refactor: pass style argument
* feat: check for empty dataframe
* refactor: namespace invariant type check
* refactor: ipywidgets fixes
* refactor: ipywidgets no comparison support yet
* refactor: process feedback
* fix: comparison bugs (#1137)
* fix: refactoring bugs
* fix: update protected var labels for comparison
* fix: add support to timeseries comparison
* fix: style changes for readability
* test: add simple run test
* fix: reword comparison report doc (#1136)
* fix: rewording
Co-authored-by: Aarni Koskela <[email protected]>
* feat: add comparison validations (#1143)
* feat: add comparison validations
* feat: replace missing plots to avoid dependencies' conflicts (#1148)
* feat: add new missing histogram plot
* feat: add new missing matrix plot
* feat: add new missing heatmap plot
* feat: remove dendrogram
* feat: ignore columns not present on the base report (#1150)
* feat: select only the left side of the comparison
* chore: pre-commit fixes
* fix: not intersection of columns
* [skip ci] Code formatting
* fix: missing plots columns order
* [skip ci] Code formatting
* fix: interactions/missing plot colors
* fix: code formatting
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
Co-authored-by: alexbarros <[email protected]>
ydataai · Nov 19, 2022 · 51a8e9f
1 parent 7b73e8a
commit 51a8e9f
Showing 77 changed files with 1,920 additions and 568 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -52,9 +52,10 @@ repos:
     hooks:
     -   id: flake8
         name: flake8-annotations
-        args: [ "--select=ANN001,ANN201,ANN202,ANN205,ANN206,ANN301" ]
+        args: [ "--select=ANN001,ANN201,ANN202,ANN205,ANN206,ANN301,C4" ]
         additional_dependencies:
           - flake8-annotations
+          - flake8-comprehensions
         exclude: |
           (?x)(
             ^tests/|
@@ -71,7 +72,7 @@ repos:
     hooks:
     -   id: rst-backticks
 -   repo: https://github.com/pre-commit/mirrors-mypy
-    rev: 'v0.931'
+    rev: 'v0.982'
     hooks:
     -   id: mypy
         additional_dependencies:
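
The `C4` selector and the `flake8-comprehensions` dependency added to the flake8 hook in `.pre-commit-config.yaml` turn on that plugin's checks. As a rough illustration (the sample data and variable names are hypothetical and not taken from this commit), these are the kinds of rewrites the C4 rules ask for:

```python
# Hypothetical sample data, for illustration only.
columns = ["age", "income", "city"]

# C400: unnecessary generator passed to list();
# a list comprehension is preferred.
upper_before = list(c.upper() for c in columns)
upper_after = [c.upper() for c in columns]

# C402: unnecessary generator passed to dict();
# a dict comprehension is preferred.
lengths_before = dict((c, len(c)) for c in columns)
lengths_after = {c: len(c) for c in columns}

# C408: unnecessary dict() call; a literal is preferred.
options_before = dict(heatmap=False)
options_after = {"heatmap": False}
```

Each pair builds the same object, so the rewrite changes style only, not behaviour.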

diff --git a/README.md b/README.md
@@ -37,7 +37,7 @@ For each column, the following information (whenever relevant for the column typ
 - **Most frequent and extreme values**
 - **Histograms**: categorical and numerical
 - **Correlations**: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
-- **Missing values**: through counts, matrix, heatmap and dendrograms
+- **Missing values**: through counts, matrix and heatmap
 - **Duplicate rows**: list of the most common duplicated rows
 - **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
 - **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
@@ -200,11 +200,12 @@ You need [Python 3](https://python3statement.org/) to run the package. Other dep
 
 The documentation includes guides, tips and tricks for tackling common use cases:
 
-| Use case | Description |
-|---|---|
-| [Profiling large datasets](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/big_data.html ) | Tips on how to prepare data and configure `pandas-profiling` for working with large datasets |
-| [Handling sensitive data](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/sensitive_data.html ) | Generating reports which are mindful about sensitive data in the input dataset |
-| [Dataset metadata and data dictionaries](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/metadata.html) | Complementing the report with dataset details and column-specific data dictionaries |
+| Use case                                                                                                                            | Description |
+|-------------------------------------------------------------------------------------------------------------------------------------|--|
+| [Profiling large datasets](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/big_data.html )                            | Tips on how to prepare data and configure `pandas-profiling` for working with large datasets |
+| [Handling sensitive data](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/sensitive_data.html )                       | Generating reports which are mindful about sensitive data in the input dataset |
+| [Comparing datasets](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/comparing_datasets.html )                        | Comparing multiple versions of the same dataset |
+| [Dataset metadata and data dictionaries](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/metadata.html)               | Complementing the report with dataset details and column-specific data dictionaries |
 | [Customizing the report's appearance](https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/custom_report_appearance.html ) | Changing the appearance of the report's page and of the contained visualizations |
 
 ## 🔗 Integrations

diff --git a/docsrc/assets/qt.png b/docsrc/assets/qt.png
diff --git a/docsrc/source/index.rst b/docsrc/source/index.rst
@@ -3,31 +3,32 @@
 .. toctree::
    :maxdepth: 3
    :caption: Getting started
-   :hidden: 
+   :hidden:
 
    pages/getting_started/overview
    pages/getting_started/installation
    pages/getting_started/quickstart
    pages/getting_started/concepts
    pages/getting_started/examples
-   
+
 .. toctree::
    :maxdepth: 3
    :caption: Use cases
    :hidden:
 
    pages/use_cases/big_data
    pages/use_cases/sensitive_data
+   pages/use_cases/comparing_datasets
    pages/use_cases/metadata
    pages/use_cases/custom_report_appearance
-  
+
 
 .. toctree::
    :maxdepth: 3
    :caption: Integrations
    :hidden:
 
-   pages/integrations/other_dataframe_libraries  
+   pages/integrations/other_dataframe_libraries
    pages/integrations/great_expectations
    pages/integrations/data_apps
    pages/integrations/pipelines
@@ -57,7 +58,7 @@
    pages/support_contrib/help_troubleshoot
    pages/support_contrib/common_issues
    pages/support_contrib/contribution_guidelines
-   
+
 .. toctree::
    :maxdepth: 3
    :caption: Reference
@@ -68,4 +69,4 @@
    pages/reference/history
    pages/reference/announcements
    pages/reference/resources
-   
+
diff --git a/docsrc/source/pages/advanced_usage/available_settings.rst b/docsrc/source/pages/advanced_usage/available_settings.rst
@@ -62,18 +62,15 @@ Settings related with the missing data section and the visualizations it can inc
    :header-rows: 1
 
 .. code-block:: python
-  :caption: Configuration example: disable heatmap and dendrogram for large datasets
+  :caption: Configuration example: disable heatmap for large datasets
 
   profile = df.profile_report(
       missing_diagrams={
           "heatmap": False,
-          "dendrogram": False,
       }
   )
   profile.to_file("report.html")
 
-The missing data diagrams are generated by the `missingno <https://github.com/ResidentMario/missingno>`_ package.
-
 Correlations
 ------------
 

diff --git a/docsrc/source/pages/getting_started/overview.rst b/docsrc/source/pages/getting_started/overview.rst
@@ -38,7 +38,7 @@ For each column, the following information (whenever relevant for the column typ
 * **Most frequent and extreme values**
 * **Histograms:** categorical and numerical
 * **Correlations**: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér's V, Phik, Auto)
-* **Missing values**: through counts, matrix, heatmap and dendrograms
+* **Missing values**: through counts, matrix and heatmap
 * **Duplicate rows**: list of the most common duplicated rows
 * **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
 * **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata

diff --git a/...rce/pages/reference/api/_autosummary/pandas_profiling.visualisation.missing.rst b/...rce/pages/reference/api/_autosummary/pandas_profiling.visualisation.missing.rst
@@ -15,7 +15,6 @@
 
       get_font_size
       plot_missing_bar
-      plot_missing_dendrogram
       plot_missing_heatmap
       plot_missing_matrix
 

diff --git a/docsrc/source/pages/tables/config_missing.csv b/docsrc/source/pages/tables/config_missing.csv
@@ -1,5 +1,4 @@
 Parameter,Type,Default,Description
 ``missing_diagrams.bar``,boolean,``True``,"Display a bar chart with counts of missing values for each column."
 ``missing_diagrams.matrix``,boolean,``True``,"Display a matrix of missing values. Similar to the bar chart, but might provide overview of the co-occurrence of missing values in rows."
-``missing_diagrams.heatmap``,boolean,``True``,"Display a heatmap of missing values, that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another)."
-``missing_diagrams.dendrogram``,boolean,``True``,"Display a dendrogram. Provides insight in the co-occurrence of missing values (i.e. columns that are both filled or both none)."
+``missing_diagrams.heatmap``,boolean,``True``,"Display a heatmap of missing values, that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another)."
diff --git a/docsrc/source/pages/use_cases/comparing_datasets.rst b/docsrc/source/pages/use_cases/comparing_datasets.rst
@@ -0,0 +1,46 @@
+==================
+Dataset Comparison
+==================
+
+``pandas-profiling`` can be used to compare multiple versions of the same dataset.
+This is useful when comparing data from multiple time periods, such as two years.
+Another common scenario is to view the dataset profile for training, validation and test sets in machine learning.
+
+The following syntax can be used to compare two datasets:
+
+.. code-block:: python
+
+    import pandas as pd
+    from pandas_profiling import ProfileReport
+    train_df = pd.read_csv("train.csv")
+    train_report = ProfileReport(train_df, title="Train")
+
+    test_df = pd.read_csv("test.csv")
+    test_report = ProfileReport(test_df, title="Test")
+
+    comparison_report = train_report.compare(test_report)
+    comparison_report.to_file("comparison.html")
+
+The comparison report uses the ``title`` attribute from ``Settings`` as the label for each dataset throughout the report.
+The colors are configured in ``settings.html.style.primary_colors``.
+The numeric precision parameter ``settings.report.precision`` can be lowered to free up space in the report.
+
+
+In order to compare more than two reports, the following syntax can be used:
+
+.. code-block:: python
+
+    from pandas_profiling import ProfileReport, compare
+
+    comparison_report = compare([train_report, validation_report, test_report])
+
+    # Obtain merged statistics
+    statistics = comparison_report.get_description()
+
+    # Save report to file
+    comparison_report.to_file("comparison.html")
+
+Note that the rendered report is only guaranteed to work when comparing two datasets.
+With more than two, the merged statistics can still be obtained, but the report may have formatting issues.
+One setting that can help with this is ``settings.report.precision``.
+As a rule of thumb, the value 10 can be used for a single report and 8 for comparing two reports.
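
To see why a lower precision frees up room in a side-by-side report, here is a minimal sketch. It assumes that the precision setting counts significant digits, in the way Python's `g` format specifier does; `format_number` is a hypothetical helper for illustration, not part of `pandas-profiling`.

```python
def format_number(value: float, precision: int) -> str:
    """Format value with `precision` significant digits (assumed behaviour)."""
    return f"{value:.{precision}g}"


value = 3.14159265358979

single = format_number(value, 10)   # rule-of-thumb precision for one report
compared = format_number(value, 8)  # rule-of-thumb precision for two reports

# Print each formatted value together with its width in characters.
print(single, len(single))
print(compared, len(compared))
```

For this value the string is two characters shorter at precision 8, which adds up when every statistic is shown once per dataset.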
diff --git a/requirements.txt b/requirements.txt
@@ -9,8 +9,6 @@ numpy>=1.16.0,<1.24
 # Could be optional
 # Related to HTML report
 htmlmin==0.1.12
-# Missing values
-missingno>=0.4.2, <0.6
 # Correlations
 phik>=0.11.1,<0.13
 # Examples

diff --git a/src/pandas_profiling/__init__.py b/src/pandas_profiling/__init__.py
@@ -3,6 +3,7 @@
 .. include:: ../../README.md
 """
 
+from pandas_profiling.compare_reports import compare
 from pandas_profiling.controller import pandas_decorator
 from pandas_profiling.profile_report import ProfileReport
 from pandas_profiling.version import __version__
@@ -15,4 +16,5 @@
     "pandas_decorator",
     "ProfileReport",
     "__version__",
+    "compare",
 ]