Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce auto parameter for correlations #1095

Merged
merged 5 commits into from
Oct 18, 2022

Conversation

jtook
Copy link
Contributor

@jtook jtook commented Oct 5, 2022

The purpose of this feature is to automatically choose the most suitable 'correlation'/'association' metric for each pair of columns .The auto setting is an easily interpretable pairwise column metric that uses previously implemented metrics of the following mapping:

vartype-vartype             : method, 
categorical-categorical : Cramer's V, 
numerical-categorical   : Cramer's V (using a discretized numerical column), 
numerical-numerical     : Spearman's ρ. 

@jtook jtook requested a review from aquemy October 5, 2022 15:05
@jtook jtook changed the base branch from master to develop October 5, 2022 15:05
@fabclmnt fabclmnt self-requested a review October 5, 2022 15:09
@aquemy aquemy requested a review from sbrugman October 5, 2022 15:09
Copy link
Contributor

@fabclmnt fabclmnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtook jtook requested a review from fabclmnt October 5, 2022 16:18
@sbrugman
Copy link
Collaborator

sbrugman commented Oct 5, 2022

Had to look into the previous PRs to see that this is superseding the previous ones. @jtook It would be better to create a feature request issue where we can discuss the broader goal, naming etc. of the feature, so that the PR can be linked for implementation.

For instance, the name auto could be confusing as there is also autocorrelation. We'd rather not have a discussion for optimal name here. Similarly, we can document there the pro's and con's for adding this method (PhiK is also able to provide coefficients for mixed type variables).

On the implementation side: for performance it would make sense to reuse the correlation coefficients that have been already computed for categorical-categorical and numerical-numerical (if enabled).

Copy link
Collaborator

@sbrugman sbrugman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment

@fabclmnt
Copy link
Contributor

fabclmnt commented Oct 6, 2022

Had to look into the previous PRs to see that this is superseding the previous ones. @jtook It would be better to create a feature request issue where we can discuss the broader goal, naming etc. of the feature, so that the PR can be linked for implementation.

For instance, the name auto could be confusing as there is also autocorrelation. We'd rather not have a discussion for optimal name here. Similarly, we can document there the pro's and cons for adding this method (PhiK is also able to provide coefficients for mixed-type variables).

On the implementation side: for performance it would make sense to reuse the correlation coefficients that have been already computed for categorical-categorical and numerical-numerical (if enabled).

Although I do agree that's important, we are still setting the best practices for the flows, and given that the PR has already a small description of the feature it will be rather easy to follow up on the progress.

I think auto is not confusing with autocorrelation given the way the configuration file is structured. This was discussed and for now, we will move forward with auto - which is a decision in line with what we see in other packages.

Regarding the documentation, I do think we can have it improved, and add to the documentation an extensive understanding of the differences between the different association metrics (as it is missing for all of them anyway). My suggestion is to open a docs issue with that detail and address it separately.

Correlation Docs: #1100

@codecov-commenter
Copy link

codecov-commenter commented Oct 6, 2022

Codecov Report

Base: 90.91% // Head: 91.10% // Increases project coverage by +0.18% 🎉

Coverage data is based on head (609df81) compared to base (b891c40).
Patch coverage: 99.13% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1095      +/-   ##
===========================================
+ Coverage    90.91%   91.10%   +0.18%     
===========================================
  Files          174      177       +3     
  Lines         4933     5048     +115     
===========================================
+ Hits          4485     4599     +114     
- Misses         448      449       +1     
Flag Coverage Δ
py3.8-ubuntu-latest-pandas 91.10% <99.13%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pandas_profiling/model/correlations.py 87.87% <80.00%> (-0.65%) ⬇️
src/pandas_profiling/config.py 100.00% <100.00%> (ø)
...ndas_profiling/model/pandas/correlations_pandas.py 100.00% <100.00%> (ø)
...pandas_profiling/model/pandas/discretize_pandas.py 100.00% <100.00%> (ø)
.../pandas_profiling/report/structure/correlations.py 96.42% <100.00%> (+0.13%) ⬆️
tests/unit/test_pandas/test_correlations.py 100.00% <100.00%> (ø)
tests/unit/test_pandas/test_discretize.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@fabclmnt fabclmnt self-requested a review October 7, 2022 14:38
@sbrugman sbrugman force-pushed the feat/correlation_auto_parameter branch from 7106bbc to 540fd85 Compare October 9, 2022 22:05
config: Settings,
df: pd.DataFrame,
summary: dict,
n_bins: int = 10,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we document how the user can change the n_bins argument?
(Personally I'd add it as an option under auto, just below threshold in config.yml, with an additional field added to the Correlation class in config.py)

Does the overload on Auto.compute still work when passing n_bins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I have taken your approach and implemented this as you have recommended.
  2. I have tested calling the pandas_auto_compute function multiple times with/without n_bins parameter without failure. For instance it passes with the following example:
auto_result = pandas_auto_compute(test_config, df, summary, n_bins = 12)
auto_result = pandas_auto_compute(test_config, df, summary)
auto_result = pandas_auto_compute(test_config, df, summary, n_bins = 10)

Let me know if this is what you mean!

@jtook
Copy link
Contributor Author

jtook commented Oct 11, 2022

Had to look into the previous PRs to see that this is superseding the previous ones. @jtook It would be better to create a feature request issue where we can discuss the broader goal, naming etc. of the feature, so that the PR can be linked for implementation.

For instance, the name auto could be confusing as there is also autocorrelation. We'd rather not have a discussion for optimal name here. Similarly, we can document there the pro's and con's for adding this method (PhiK is also able to provide coefficients for mixed type variables).

On the implementation side: for performance it would make sense to reuse the correlation coefficients that have been already computed for categorical-categorical and numerical-numerical (if enabled).

The idea behind this PR is that the ‘auto’ correlation should be the default setting (with all the other correlations disabled) for assessing correlations. If the user wants more control over calculating the correlations, they should disable the 'auto' correlation and choose the correlation metric/s that better suits their use case.

@fabclmnt fabclmnt requested a review from sbrugman October 12, 2022 04:39
@aquemy
Copy link
Contributor

aquemy commented Oct 18, 2022

Hi @sbrugman,

I don't think there is any blocker on this but as we want to release asap and we still need to test more, I will merge this.

@aquemy aquemy force-pushed the feat/correlation_auto_parameter branch from 609df81 to 6823a96 Compare October 18, 2022 07:46
@aquemy aquemy dismissed sbrugman’s stale review October 18, 2022 07:46

We want to release this feature and need it merged for further testing.

@aquemy aquemy merged commit 1f1a905 into develop Oct 18, 2022
@aquemy aquemy deleted the feat/correlation_auto_parameter branch October 18, 2022 07:47
vascoalramos pushed a commit that referenced this pull request Oct 20, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
portellaa pushed a commit that referenced this pull request Oct 20, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
vascoalramos pushed a commit that referenced this pull request Oct 20, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
portellaa pushed a commit that referenced this pull request Oct 20, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
portellaa pushed a commit that referenced this pull request Oct 20, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
vascoalramos pushed a commit that referenced this pull request Oct 21, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
vascoalramos pushed a commit that referenced this pull request Oct 21, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
vascoalramos pushed a commit that referenced this pull request Oct 21, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
chanedwin pushed a commit that referenced this pull request Oct 30, 2022
* feat: introduce discretization capabilities

* feat: introduce 'auto' parameter to correlations

* docs: make documentation after adding 'auto' option to the correlation metrics

* feat: introduce n_bins as a parameter to 'auto' correlation

* feat: introduce option for user to change n_bins argument for ‘auto’ correlation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants