Add size check before trying to split for GMeans #732
Hello, thanks a lot for the PR! I think this makes sense in the extreme cases as you point out. I am rebasing the changes to the new code.
Codecov Report
@@ Coverage Diff @@
## master #732 +/- ##
==========================================
- Coverage 85.59% 85.48% -0.12%
==========================================
Files 130 130
Lines 10365 10375 +10
==========================================
- Hits 8872 8869 -3
- Misses 1493 1506 +13
Thanks a lot @pberba for this contribution and @franchuterivera for rebasing this!
* MAINT cleanup readme and remove old service yaml file (.landscape.yaml)
* MAINT bump to dev version
* move from fork to spawn
* FIX_1061 (automl#1063)
* FIX_1061
* Fxi type of target
* Moving to classes_
* classes_ should be np.ndarray
* Force float before nan
* Pynisher context is passed to metafeatures (automl#1076)
* Pynisher context to metafeatures
* Update test_smbo.py
Co-authored-by: Matthias Feurer
* Calculate loss support (automl#1075)
* Calculate loss support
* Relaxed log loss test for individual models
* Feedback from automl#1075
* Missing loss in comment
* Revert back test as well
* Fix rank for metrics for which greater value is not good (automl#1079)
* Enable Mypy in evaluation (except Train Evaluator) (automl#1077)
* Almost all files for evaluation
* Feedback from PR
* Feedback from comments
* Solving rebase artifacts
* Revert bytes
* Automatically update the Copyright when building the html (automl#1074)
* update the year automatically
* Fixes for new numpy
* Revert test
* Prepare new release (automl#1081)
* prepare new release
* fix unit test
* bump version number
* Fix 1072 (automl#1073)
* Improve selector checking
* Remove copy error
* Rebase changes to development
* No .cache and check selector path
* Missing params in signature (automl#1084)
* Add size check before trying to split for GMeans (automl#732)
* Add size check before trying to split
* Rebase to new code
Co-authored-by: chico
* Fxi broken links in docs and update parallel docs (automl#1088)
* Fxi broken links
* Feedback from comments
* Update manual.rst
Co-authored-by: Matthias Feurer
* automl#660 Enable Power Transformations Update (automl#1086)
* Power Transformer
* Correct typo
* ADD_630
* PEP8 compliance
* Fix target type
Co-authored-by: MaxGreil
* Stale Support (automl#1090)
* Stale Support
* Enhanced criteria for stale
* Enable weekly cron job
* test
Co-authored-by: Matthias Feurer
Co-authored-by: Matthias Feurer
Co-authored-by: Rohit Agarwal
Co-authored-by: Pepe Berba
Co-authored-by: MaxGreil
This adds a simple check before trying to split the current cluster into two. Since we have the `minimum_samples_per_cluster` parameter, there is no point in trying to split a cluster with fewer than `self.minimum_samples_per_cluster * 2` samples: at least one of the two resulting clusters would fall below the minimum.
This also prevents the algorithm from running into the state where it is trying to split a cluster with just one sample, which could happen if there are some outliers in your data.
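The check described above can be sketched roughly as follows. This is only an illustration of the idea, not the code from the PR: the helper names `can_split` and `partition_candidates` and the toy data are hypothetical, and only the `minimum_samples_per_cluster` parameter comes from the description.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def can_split(cluster: Sequence[Point], minimum_samples_per_cluster: int) -> bool:
    """Hypothetical guard: a split produces two children, and each child
    needs at least `minimum_samples_per_cluster` samples."""
    return len(cluster) >= 2 * minimum_samples_per_cluster


def partition_candidates(
    clusters: Sequence[Sequence[Point]], minimum_samples_per_cluster: int
) -> Tuple[List[Sequence[Point]], List[Sequence[Point]]]:
    """Separate clusters worth testing for a split from those kept as they are."""
    to_test: List[Sequence[Point]] = []
    final: List[Sequence[Point]] = []
    for cluster in clusters:
        if can_split(cluster, minimum_samples_per_cluster):
            to_test.append(cluster)
        else:
            # Too small to yield two valid children, so no split is attempted.
            final.append(cluster)
    return to_test, final


# Toy data (assumption): one reasonably sized cluster and a single-sample outlier.
dense = [(float(i), 0.0) for i in range(50)]
outlier = [(100.0, 100.0)]

to_test, final = partition_candidates([dense, outlier], minimum_samples_per_cluster=20)
print(len(to_test), len(final))
```

With a minimum of 20 samples per cluster, the 50-sample cluster is still considered for a split, while the one-sample outlier cluster is kept as is and the split is never attempted on it.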
You can recreate this problem using the sample data from HDBSCAN
This may result in the following error: