feat: introduce auto parameter for correlations #1095

jtook · 2022-10-05T15:05:02Z

The purpose of this feature is to automatically choose the most suitable 'correlation'/'association' metric for each pair of columns. The auto setting is an easily interpretable pairwise column metric that reuses previously implemented metrics according to the following mapping:

vartype-vartype         : method
categorical-categorical : Cramer's V
numerical-categorical   : Cramer's V (using a discretized numerical column)
numerical-numerical     : Spearman's ρ
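
As a rough illustration of the mapping above, the selection step could be sketched as follows. This is not the code from the PR: the names `AUTO_METHODS`, `choose_method` and `discretize` are hypothetical, and equal-width binning is only one possible way to discretize a numerical column.

```python
from typing import List, Sequence

# Hypothetical mapping from a (sorted) pair of variable types to a metric name.
AUTO_METHODS = {
    ("categorical", "categorical"): "cramers_v",
    ("categorical", "numerical"): "cramers_v",  # numerical column is discretized first
    ("numerical", "numerical"): "spearman",
}


def choose_method(type_a: str, type_b: str) -> str:
    """Return the metric for a pair of column types, ignoring their order."""
    key = tuple(sorted((type_a, type_b)))
    return AUTO_METHODS[key]


def discretize(values: Sequence[float], n_bins: int = 10) -> List[int]:
    """Assign each value to one of n_bins equal-width bins (labels 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

With such a helper, a numerical column would be turned into bin labels before being compared to a categorical column with Cramer's V.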

fabclmnt

Don't forget to add this to the documentation.

e.g. https://pandas-profiling.ydata.ai/docs/master/pages/reference/api/_autosummary/pandas_profiling.model.correlations.html?highlight=correlation#module-pandas_profiling.model.correlations

https://pandas-profiling.ydata.ai/docs/master/pages/advanced_usage/corr_mat_access.html?highlight=correlation

docsrc/source/pages/reference/api/_autosummary/pandas_profiling.model.summary_algorithms.rst

docsrc/source/pages/reference/api/_autosummary/pandas_profiling.model.alerts.rst

src/pandas_profiling/config_default.yaml

src/pandas_profiling/config.py

sbrugman · 2022-10-05T21:54:50Z

Had to look into the previous PRs to see that this is superseding the previous ones. @jtook It would be better to create a feature request issue where we can discuss the broader goal, naming, etc. of the feature, so that the PR can be linked for implementation.

For instance, the name auto could be confusing as there is also autocorrelation. We'd rather not have a discussion about the optimal name here. Similarly, we can document there the pros and cons of adding this method (PhiK is also able to provide coefficients for mixed-type variables).

On the implementation side: for performance it would make sense to reuse the correlation coefficients that have already been computed for categorical-categorical and numerical-numerical (if enabled).
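
To make the reuse idea concrete, here is a minimal sketch under stated assumptions. The function `assemble_auto`, the `cached` layout and the `compute` callback are all hypothetical and do not reflect how the package stores its results; the point is only that cached Spearman or Cramer's V values are looked up first and computed only when missing.

```python
from typing import Callable, Dict, Tuple

Matrix = Dict[Tuple[str, str], float]


def assemble_auto(
    col_types: Dict[str, str],
    cached: Dict[str, Matrix],
    compute: Callable[[str, str, str], float],
) -> Matrix:
    """Build the auto matrix, reusing cached coefficients when present.

    cached maps a method name to already computed pairwise values;
    compute(method, a, b) is the fallback for pairs that are not cached.
    """
    result: Matrix = {}
    cols = list(col_types)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            both_numeric = col_types[a] == col_types[b] == "numerical"
            method = "spearman" if both_numeric else "cramers_v"
            known = cached.get(method, {})
            if (a, b) in known:
                value = known[(a, b)]
            elif (b, a) in known:
                value = known[(b, a)]
            else:
                value = compute(method, a, b)
            # Both metrics are symmetric, so fill both orientations.
            result[(a, b)] = result[(b, a)] = value
    return result
```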

sbrugman

See other comment

fabclmnt · 2022-10-06T17:53:38Z

> Had to look into the previous PRs to see that this is superseding the previous ones. @jtook It would be better to create a feature request issue where we can discuss the broader goal, naming, etc. of the feature, so that the PR can be linked for implementation.
>
> For instance, the name auto could be confusing as there is also autocorrelation. We'd rather not have a discussion about the optimal name here. Similarly, we can document there the pros and cons of adding this method (PhiK is also able to provide coefficients for mixed-type variables).
>
> On the implementation side: for performance it would make sense to reuse the correlation coefficients that have already been computed for categorical-categorical and numerical-numerical (if enabled).

Although I do agree that's important, we are still establishing best practices for our workflows, and given that the PR already has a short description of the feature, it will be fairly easy to follow the progress.

I don't think auto will be confused with autocorrelation, given the way the configuration file is structured. This was discussed and, for now, we will move forward with auto, a decision in line with what we see in other packages.

Regarding the documentation, I do think we can improve it and add an extensive explanation of the differences between the association metrics (which is missing for all of them anyway). My suggestion is to open a docs issue with that detail and address it separately.

Correlation Docs: #1100

codecov-commenter · 2022-10-06T20:40:04Z

Codecov Report

Base: 90.91% // Head: 91.10% // Increases project coverage by +0.18% 🎉

Coverage data is based on head (609df81) compared to base (b891c40).
Patch coverage: 99.13% of modified lines in pull request are covered.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1095      +/-   ##
===========================================
+ Coverage    90.91%   91.10%   +0.18%     
===========================================
  Files          174      177       +3     
  Lines         4933     5048     +115     
===========================================
+ Hits          4485     4599     +114     
- Misses         448      449       +1

| Flag | Coverage Δ | |
|---|---|---|
| py3.8-ubuntu-latest-pandas | `91.10% <99.13%> (+0.18%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/pandas_profiling/model/correlations.py | `87.87% <80.00%> (-0.65%)` | ⬇️ |
| src/pandas_profiling/config.py | `100.00% <100.00%> (ø)` | |
| ...ndas_profiling/model/pandas/correlations_pandas.py | `100.00% <100.00%> (ø)` | |
| ...pandas_profiling/model/pandas/discretize_pandas.py | `100.00% <100.00%> (ø)` | |
| .../pandas_profiling/report/structure/correlations.py | `96.42% <100.00%> (+0.13%)` | ⬆️ |
| tests/unit/test_pandas/test_correlations.py | `100.00% <100.00%> (ø)` | |
| tests/unit/test_pandas/test_discretize.py | `100.00% <100.00%> (ø)` | |

src/pandas_profiling/config_default.yaml

sbrugman · 2022-10-09T22:15:46Z

src/pandas_profiling/model/pandas/correlations_pandas.py

+    config: Settings,
+    df: pd.DataFrame,
+    summary: dict,
+    n_bins: int = 10,


Could we document how the user can change the n_bins argument?
(Personally I'd add it as an option under auto, just below threshold in config.yml, with an additional field added to the Correlation class in config.py)

Does the overload on Auto.compute still work when passing n_bins?

jtook · 2022-10-11

I have taken your approach and implemented it as you recommended.

I have tested calling the pandas_auto_compute function multiple times, with and without the n_bins parameter, without failure. For instance, it passes with the following example:

auto_result = pandas_auto_compute(test_config, df, summary, n_bins=12)
auto_result = pandas_auto_compute(test_config, df, summary)
auto_result = pandas_auto_compute(test_config, df, summary, n_bins=10)

Let me know if this is what you mean!
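
For readers following the thread, the idea of an n_bins option that lives in the configuration but can still be overridden per call might look roughly like the sketch below. `AutoCorrelationSettings` and `resolve_n_bins` are hypothetical stand-ins, not the classes in config.py, and the default values shown are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoCorrelationSettings:
    """Hypothetical stand-in for the 'auto' entry of the correlation config."""

    calculate: bool = True
    threshold: float = 0.5  # assumed default, for illustration only
    n_bins: int = 10  # number of bins used to discretize numerical columns


def resolve_n_bins(
    settings: AutoCorrelationSettings, n_bins: Optional[int] = None
) -> int:
    """Prefer an explicit n_bins argument, else fall back to the config value."""
    chosen = settings.n_bins if n_bins is None else n_bins
    if chosen < 1:
        raise ValueError("n_bins must be a positive integer")
    return chosen
```

Calling `resolve_n_bins(settings)` uses the configured value, while `resolve_n_bins(settings, n_bins=12)` mirrors the explicit-argument calls in the example above.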

jtook · 2022-10-11T17:48:12Z

> Had to look into the previous PRs to see that this is superseding the previous ones. @jtook It would be better to create a feature request issue where we can discuss the broader goal, naming, etc. of the feature, so that the PR can be linked for implementation.
>
> For instance, the name auto could be confusing as there is also autocorrelation. We'd rather not have a discussion about the optimal name here. Similarly, we can document there the pros and cons of adding this method (PhiK is also able to provide coefficients for mixed-type variables).
>
> On the implementation side: for performance it would make sense to reuse the correlation coefficients that have already been computed for categorical-categorical and numerical-numerical (if enabled).

The idea behind this PR is that the 'auto' correlation should be the default setting for assessing correlations, with all other correlations disabled. If users want more control over how the correlations are calculated, they should disable the 'auto' correlation and choose the correlation metric(s) that best suit their use case.
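
A small sketch of what "auto on, everything else off" could look like as a configuration override. The helper `correlation_overrides` is hypothetical, and the metric keys listed in `KNOWN_METRICS` are assumed to match the entries of the correlations section in config_default.yaml.

```python
from typing import Dict, Iterable

# Metric keys assumed to exist in the correlations config section.
KNOWN_METRICS = ("auto", "pearson", "spearman", "kendall", "phi_k", "cramers")


def correlation_overrides(
    enabled: Iterable[str] = ("auto",),
) -> Dict[str, Dict[str, bool]]:
    """Build a config override that enables only the requested metrics."""
    enabled_set = set(enabled)
    unknown = enabled_set - set(KNOWN_METRICS)
    if unknown:
        raise ValueError(f"unknown correlation metrics: {sorted(unknown)}")
    return {name: {"calculate": name in enabled_set} for name in KNOWN_METRICS}
```

A user who wants more control could call `correlation_overrides(["spearman", "cramers"])`, which turns auto off and enables only the chosen metrics; the resulting dict would then be handed to the report configuration, assuming it accepts per-metric overrides of this shape.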

aquemy · 2022-10-18T07:31:42Z

Hi @sbrugman,

I don't think there is any blocker on this, but as we want to release ASAP and still need to test more, I will merge this.

We want to release this feature and need it merged for further testing.

* feat: introduce discretization capabilities
* feat: introduce 'auto' parameter to correlations
* docs: make documentation after adding 'auto' option to the correlation metrics
* feat: introduce n_bins as a parameter to 'auto' correlation
* feat: introduce option for user to change n_bins argument for 'auto' correlation

jtook requested a review from aquemy October 5, 2022 15:05

jtook changed the base branch from master to develop October 5, 2022 15:05

fabclmnt self-requested a review October 5, 2022 15:09

aquemy requested a review from sbrugman October 5, 2022 15:09

aquemy approved these changes Oct 5, 2022

fabclmnt requested changes Oct 5, 2022

jtook requested a review from fabclmnt October 5, 2022 16:18

fabclmnt reviewed Oct 5, 2022

docsrc/source/pages/reference/api/_autosummary/pandas_profiling.model.summary_algorithms.rst (outdated)

fabclmnt reviewed Oct 5, 2022

docsrc/source/pages/reference/api/_autosummary/pandas_profiling.model.alerts.rst (outdated)

fabclmnt reviewed Oct 5, 2022

src/pandas_profiling/config_default.yaml

sbrugman reviewed Oct 5, 2022

src/pandas_profiling/config.py (outdated)

sbrugman previously requested changes Oct 5, 2022

fabclmnt self-requested a review October 7, 2022 14:38

fabclmnt approved these changes Oct 7, 2022

sbrugman force-pushed the feat/correlation_auto_parameter branch from 7106bbc to 540fd85 Compare October 9, 2022 22:05

sbrugman reviewed Oct 9, 2022

src/pandas_profiling/config_default.yaml (outdated)

sbrugman reviewed Oct 9, 2022

fabclmnt requested a review from sbrugman October 12, 2022 04:39

jtook added 5 commits October 18, 2022 09:46

feat: introduce discretization capabilities

49e70ab

feat: introduce 'auto' parameter to correlations

95c37a9

docs: make documentation after adding 'auto' option to the correlatio…

f459c95

…n metrics

feat: introduce n_bins as a parameter to 'auto' correlation

04f59be

feat: introduce option for user to change n_bins argument for ‘auto’ …

6823a96

…correlation

aquemy force-pushed the feat/correlation_auto_parameter branch from 609df81 to 6823a96 Compare October 18, 2022 07:46

aquemy merged commit 1f1a905 into develop Oct 18, 2022

aquemy deleted the feat/correlation_auto_parameter branch October 18, 2022 07:47
