
Add size check before trying to split for GMeans #732

Merged: 2 commits merged into automl:development on Feb 24, 2021

Conversation

@pberba (Contributor) commented Oct 19, 2019

This adds a simple check before trying to split the current cluster into two. Since we have the minimum_samples_per_cluster parameter, there is no point in trying to split a cluster with fewer than self.minimum_samples_per_cluster * 2 samples.

This also prevents the algorithm from reaching a state where it tries to split a cluster containing just one sample, which can happen when there are outliers in your data.
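
For illustration, here is a minimal sketch of what such a guard could look like inside GMeans.fit; the surrounding loop and the cluster_indices name are assumptions based on the traceback below, not the actual diff:

# Hypothetical sketch; X_, KMeans_, and minimum_samples_per_cluster follow
# the traceback below, while cluster_indices is an illustrative placeholder.
X_ = X[cluster_indices]  # samples assigned to the cluster being considered

# Splitting into two children is only worthwhile if each child can still
# satisfy the minimum size constraint, so the parent needs at least
# 2 * minimum_samples_per_cluster samples.
if X_.shape[0] < self.minimum_samples_per_cluster * 2:
    continue  # keep the cluster whole instead of attempting a 2-means split

predictions = KMeans_.fit_predict(X_)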

You can reproduce this problem using the sample data from HDBSCAN:

from autosklearn.metalearning.metalearning.clustering.gmeans import GMeans
import numpy as np

data = np.load('clusterable_data.npy')
# Turn the first sample into an outlier
data[0][0] = -1
data[0][1] = -1

gmeans = GMeans()
clusters = gmeans.fit_predict(data)

This may result in the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-10-e61faf5588d4> in <module>
      1 gmeans = GMeans()
----> 2 clusters = gmeans.fit_predict(data)
      3 clusters

<ipython-input-9-5b649f70193c> in fit_predict(self, X)
     82 
     83     def fit_predict(self, X):
---> 84         self.fit(X)
     85         predictions = self.KMeans.predict(X)
     86         return predictions

<ipython-input-9-5b649f70193c> in fit(self, X)
     37                                                          n_init=self.n_init,
     38                                                          random_state=self.random_state)
---> 39                         predictions = KMeans_.fit_predict(X_)
     40                         bins = np.bincount(predictions)
     41                         minimum = np.min(bins)

~/anaconda3/envs/tm/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in fit_predict(self, X, y, sample_weight)
    996             Index of the cluster each sample belongs to.
    997         """
--> 998         return self.fit(X, sample_weight=sample_weight).labels_
    999 
   1000     def fit_transform(self, X, y=None, sample_weight=None):

~/anaconda3/envs/tm/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y, sample_weight)
    970                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
    971                 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 972                 return_n_iter=True)
    973         return self
    974 

~/anaconda3/envs/tm/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, sample_weight, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
    314     if _num_samples(X) < n_clusters:
    315         raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
--> 316             _num_samples(X), n_clusters))
    317 
    318     tol = _tolerance(X, tol)

ValueError: n_samples=1 should be >= n_clusters=2
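
For context, the error comes from scikit-learn's own input validation: KMeans refuses to fit when there are fewer samples than clusters. A standalone illustration of just that failure, independent of this PR:

from sklearn.cluster import KMeans
import numpy as np

# One sample cannot be partitioned into two clusters, so fit() raises
# "ValueError: n_samples=1 should be >= n_clusters=2"
# (exact wording may vary across scikit-learn versions).
KMeans(n_clusters=2).fit(np.array([[0.0, 0.0]]))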

@franchuterivera (Contributor) commented

Hello, thanks a lot for the PR!

I think this makes sense for the extreme cases you point out. I am rebasing the changes onto the new code.

@codecov (bot) commented Feb 18, 2021

Codecov Report

Merging #732 (66ed20c) into master (a10b384) will decrease coverage by 0.11%.
The diff coverage is 62.50%.

@@            Coverage Diff             @@
##           master     #732      +/-   ##
==========================================
- Coverage   85.59%   85.48%   -0.12%     
==========================================
  Files         130      130              
  Lines       10365    10375      +10     
==========================================
- Hits         8872     8869       -3     
- Misses       1493     1506      +13     
Impacted Files                                           Coverage            Δ
...arn/metalearning/metalearning/clustering/gmeans.py     0.00% <0.00%>      (ø)
autosklearn/experimental/askl2.py                        80.19% <75.00%>     (-1.72%) ⬇️
autosklearn/__version__.py                              100.00% <100.00%>    (ø)
...eline/components/feature_preprocessing/fast_ica.py    91.30% <0.00%>      (-6.53%) ⬇️
...mponents/feature_preprocessing/nystroem_sampler.py    85.29% <0.00%>      (-5.89%) ⬇️
...ine/components/feature_preprocessing/kernel_pca.py    94.73% <0.00%>      (-1.76%) ⬇️
..._preprocessing/select_percentile_classification.py    87.93% <0.00%>      (-1.73%) ⬇️
autosklearn/ensemble_builder.py                          76.65% <0.00%>      (-0.41%) ⬇️
autosklearn/util/backend.py                              76.15% <0.00%>      (+1.42%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a10b384...e9ecfa0.

@mfeurer changed the base branch from master to development on February 24, 2021 at 20:48
@mfeurer merged commit d3aa95e into automl:development on Feb 24, 2021

@mfeurer (Contributor) commented Feb 24, 2021

Thanks a lot @pberba for this contribution and @franchuterivera for rebasing this!

franchuterivera added a commit to franchuterivera/auto-sklearn that referenced this pull request on Mar 11, 2021:
* MAINT cleanup readme and remove old service yaml file (.landscape.yaml)

* MAINT bump to dev version

* move from fork to spawn

* FIX_1061 (automl#1063)

* FIX_1061

* Fxi type of target

* Moving to classes_

* classes_ should be np.ndarray

* Force float before nan

* Pynisher context is passed to metafeatures (automl#1076)

* Pynisher context to metafeatures

* Update test_smbo.py

Co-authored-by: Matthias Feurer <[email protected]>

* Calculate loss support (automl#1075)

* Calculate loss support

* Relaxed log loss test for individual models

* Feedback from automl#1075

* Missing loss in comment

* Revert back test as well

* Fix rank for metrics for which greater value is not good (automl#1079)

* Enable Mypy in evaluation (except Train Evaluator) (automl#1077)

* Almost all files for evaluation

* Feedback from PR

* Feedback from comments

* Solving rebase artifacts

* Revert bytes

* Automatically update the Copyright when building the html (automl#1074)

* update the year automatically

* Fixes for new numpy

* Revert test

* Prepare new release (automl#1081)

* prepare new release

* fix unit test

* bump version number

* Fix 1072 (automl#1073)

* Improve selector checking

* Remove copy error

* Rebase changes to development

* No .cache and check selector path

* Missing params in signature (automl#1084)

* Add size check before trying to split for GMeans (automl#732)

* Add size check before trying to split

* Rebase to new code

Co-authored-by: chico <[email protected]>

* Fxi broken links in docs and update parallel docs (automl#1088)

* Fxi broken links

* Feedback from comments

* Update manual.rst

Co-authored-by: Matthias Feurer <[email protected]>

* automl#660 Enable Power Transformations Update (automl#1086)

* Power Transformer

* Correct typo

* ADD_630

* PEP8 compliance

* Fix target type

Co-authored-by: MaxGreil <[email protected]>

* Stale Support (automl#1090)

* Stale Support

* Enhanced criteria for stale

* Enable weekly cron job

* test

Co-authored-by: Matthias Feurer <[email protected]>
Co-authored-by: Matthias Feurer <[email protected]>
Co-authored-by: Rohit Agarwal <[email protected]>
Co-authored-by: Pepe Berba <[email protected]>
Co-authored-by: MaxGreil <[email protected]>