feat: introduce auto parameter for correlations #1095
Don't forget to add to the documentation.
docsrc/source/pages/reference/api/_autosummary/pandas_profiling.model.summary_algorithms.rst
docsrc/source/pages/reference/api/_autosummary/pandas_profiling.model.alerts.rst
I had to look into the previous PRs to see that this one supersedes them. @jtook It would be better to create a feature request issue where we can discuss the broader goal, naming, etc. of the feature, so that the PR can be linked for implementation. For instance, the name
On the implementation side: for performance, it would make sense to reuse the correlation coefficients that have already been computed for categorical-categorical and numerical-numerical pairs (if enabled).
See other comment
Although I do agree that's important, we are still setting the best practices for the flows, and given that the PR already has a small description of the feature, it will be rather easy to follow up on the progress. I think
Regarding the documentation, I do think we can improve it and add an extensive explanation of the differences between the association metrics (it is missing for all of them anyway). My suggestion is to open a docs issue with that detail and address it separately.
Correlation Docs: #1100
Codecov Report
Base: 90.91% // Head: 91.10% // Increases project coverage by
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1095      +/-   ##
===========================================
+ Coverage    90.91%   91.10%   +0.18%
===========================================
  Files          174      177       +3
  Lines         4933     5048     +115
===========================================
+ Hits          4485     4599     +114
- Misses         448      449       +1
7106bbc to 540fd85
config: Settings,
df: pd.DataFrame,
summary: dict,
n_bins: int = 10,
Could we document how the user can change the n_bins argument?
(Personally I'd add it as an option under auto, just below threshold in config.yml, with an additional field added to the Correlation class in config.py)
Does the overload on Auto.compute still work when passing n_bins?
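A minimal sketch of the suggestion above, assuming a plain dataclass as a stand-in for the real Correlation class in config.py, whose base class and full field list are not shown in this thread. Only threshold and n_bins come from the discussion; `CorrelationSketch`, `from_config_dict`, the `calculate` flag, and the threshold default are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class CorrelationSketch:
    """Hypothetical stand-in for the Correlation settings class."""

    calculate: bool = True  # assumed flag, not taken from this thread
    threshold: float = 0.5  # assumed default, for illustration only
    n_bins: int = 10        # mirrors `n_bins: int = 10` in the reviewed diff

    @classmethod
    def from_config_dict(cls, raw: dict) -> "CorrelationSketch":
        """Build settings from a parsed config section, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in raw.items() if key in known})


# What a parsed `auto` section of config.yml might look like (assumed layout).
auto_section = {"calculate": True, "threshold": 0.5, "n_bins": 12}
auto_settings = CorrelationSketch.from_config_dict(auto_section)
```

With a field like this, a user who omits n_bins keeps the default of 10, while a value set under auto overrides it.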
- I have taken your approach and implemented this as you have recommended.
- I have tested calling the pandas_auto_compute function multiple times with/without n_bins parameter without failure. For instance it passes with the following example:
auto_result = pandas_auto_compute(test_config, df, summary, n_bins=12)
auto_result = pandas_auto_compute(test_config, df, summary)
auto_result = pandas_auto_compute(test_config, df, summary, n_bins=10)
Let me know if this is what you mean!
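To illustrate what an optional n_bins argument with a default of 10 could control, here is a rough, standard-library-only sketch of discretizing a numerical column. Equal-width binning is an assumption, since the discretization strategy used by the PR is not described in this thread, and `discretize_equal_width` is a hypothetical helper, not the library's function.

```python
from typing import List, Sequence


def discretize_equal_width(values: Sequence[float], n_bins: int = 10) -> List[int]:
    """Assign each value a bin index in [0, n_bins - 1] using equal-width bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant column collapses into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 35, 41, 58, 63, 79]
default_bins = discretize_equal_width(ages)            # uses n_bins=10
custom_bins = discretize_equal_width(ages, n_bins=12)  # explicit override
```

Calling the helper with and without the keyword mirrors the three test calls above: the default applies when n_bins is omitted, and an explicit value takes precedence.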
The idea behind this PR is that the 'auto' correlation should be the default setting for assessing correlations, with all the other correlations disabled. If users want more control over how correlations are calculated, they should disable the 'auto' correlation and choose the correlation metric(s) that better suit their use case.
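The intended default can be sketched as a small helper that builds per-metric overrides. The metric names in `MANUAL_METRICS`, the `{"calculate": ...}` shape, and the helper `correlation_overrides` are assumptions for illustration and should be checked against the library's actual configuration.

```python
from typing import Dict, Iterable

# Assumed metric names, for illustration only.
MANUAL_METRICS = ("pearson", "spearman", "kendall", "phi_k", "cramers")


def correlation_overrides(manual: Iterable[str] = ()) -> Dict[str, Dict[str, bool]]:
    """Enable 'auto' alone by default, or disable it when metrics are chosen by hand."""
    chosen = set(manual)
    unknown = chosen - set(MANUAL_METRICS)
    if unknown:
        raise ValueError(f"unknown correlation metrics: {sorted(unknown)}")
    overrides = {name: {"calculate": name in chosen} for name in MANUAL_METRICS}
    # 'auto' is on only when the user has not picked specific metrics.
    overrides["auto"] = {"calculate": not chosen}
    return overrides


default_setup = correlation_overrides()
manual_setup = correlation_overrides(["spearman", "cramers"])
```

A dictionary of this shape could presumably be handed to the report's correlation settings, though that wiring is not shown in this thread.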
Hi @sbrugman, I don't think there is any blocker on this, but as we want to release asap and still need to test more, I will merge this.
609df81 to 6823a96
We want to release this feature and need it merged for further testing.
* feat: introduce discretization capabilities
* feat: introduce 'auto' parameter to correlations
* docs: make documentation after adding 'auto' option to the correlation metrics
* feat: introduce n_bins as a parameter to 'auto' correlation
* feat: introduce option for user to change n_bins argument for 'auto' correlation
The purpose of this feature is to automatically choose the most suitable 'correlation'/'association' metric for each pair of columns. The auto setting is an easily interpretable pairwise column metric that uses previously implemented metrics according to the following mapping: