
Add size check before trying to split for GMeans #732

Merged: 2 commits merged into automl:development on Feb 24, 2021

Conversation

@pberba (Contributor) commented Oct 19, 2019

This adds a simple check before trying to split the current cluster into two. Since we have the minimum_samples_per_cluster parameter, there is no point in trying to split a cluster with fewer than self.minimum_samples_per_cluster * 2 samples.

This also prevents the algorithm from reaching a state where it tries to split a cluster containing just one sample, which can happen when there are outliers in your data.
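
For illustration, here is a minimal sketch of what such a guard could look like inside GMeans.fit; the surrounding loop and the cluster_indices name are assumptions based on the traceback below, not the actual diff:

# Hypothetical sketch; X_, KMeans_, and minimum_samples_per_cluster follow
# the traceback below, while cluster_indices is an illustrative placeholder.
X_ = X[cluster_indices]  # samples assigned to the cluster being considered

# Splitting into two children is only worthwhile if each child can still
# satisfy the minimum size constraint, so the parent needs at least
# 2 * minimum_samples_per_cluster samples.
if X_.shape[0] < self.minimum_samples_per_cluster * 2:
    continue  # keep the cluster whole instead of attempting a 2-means split

predictions = KMeans_.fit_predict(X_)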

You can reproduce this problem using the sample data from HDBSCAN:

from autosklearn.metalearning.metalearning.clustering.gmeans import GMeans
import numpy as np

data = np.load('clusterable_data.npy')
# Turn the first sample into an outlier
data[0][0] = -1
data[0][1] = -1

gmeans = GMeans()
clusters = gmeans.fit_predict(data)

This may result in the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-10-e61faf5588d4> in <module>
      1 gmeans = GMeans()
----> 2 clusters = gmeans.fit_predict(data)
      3 clusters

<ipython-input-9-5b649f70193c> in fit_predict(self, X)
     82 
     83     def fit_predict(self, X):
---> 84         self.fit(X)
     85         predictions = self.KMeans.predict(X)
     86         return predictions

<ipython-input-9-5b649f70193c> in fit(self, X)
     37                                                          n_init=self.n_init,
     38                                                          random_state=self.random_state)
---> 39                         predictions = KMeans_.fit_predict(X_)
     40                         bins = np.bincount(predictions)
     41                         minimum = np.min(bins)

~/anaconda3/envs/tm/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in fit_predict(self, X, y, sample_weight)
    996             Index of the cluster each sample belongs to.
    997         """
--> 998         return self.fit(X, sample_weight=sample_weight).labels_
    999 
   1000     def fit_transform(self, X, y=None, sample_weight=None):

~/anaconda3/envs/tm/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y, sample_weight)
    970                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
    971                 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 972                 return_n_iter=True)
    973         return self
    974 

~/anaconda3/envs/tm/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, sample_weight, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
    314     if _num_samples(X) < n_clusters:
    315         raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
--> 316             _num_samples(X), n_clusters))
    317 
    318     tol = _tolerance(X, tol)

ValueError: n_samples=1 should be >= n_clusters=2
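
For context, the error comes from scikit-learn's own input validation: KMeans refuses to fit when there are fewer samples than clusters. A standalone illustration of just that failure, independent of this PR:

from sklearn.cluster import KMeans
import numpy as np

# One sample cannot be partitioned into two clusters, so fit() raises
# "ValueError: n_samples=1 should be >= n_clusters=2"
# (exact wording may vary across scikit-learn versions).
KMeans(n_clusters=2).fit(np.array([[0.0, 0.0]]))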

@franchuterivera (Contributor) commented

Hello, thanks a lot for the PR!

I think this makes sense for the extreme cases you point out. I am rebasing the changes onto the new code.

@codecov (bot) commented Feb 18, 2021

Codecov Report

Merging #732 (66ed20c) into master (a10b384) will decrease coverage by 0.11%.
The diff coverage is 62.50%.

@@            Coverage Diff             @@
##           master     #732      +/-   ##
==========================================
- Coverage   85.59%   85.48%   -0.12%     
==========================================
  Files         130      130              
  Lines       10365    10375      +10     
==========================================
- Hits         8872     8869       -3     
- Misses       1493     1506      +13     
Impacted Files                                           Coverage            Δ
...arn/metalearning/metalearning/clustering/gmeans.py     0.00% <0.00%>      (ø)
autosklearn/experimental/askl2.py                        80.19% <75.00%>     (-1.72%) ⬇️
autosklearn/__version__.py                              100.00% <100.00%>    (ø)
...eline/components/feature_preprocessing/fast_ica.py    91.30% <0.00%>      (-6.53%) ⬇️
...mponents/feature_preprocessing/nystroem_sampler.py    85.29% <0.00%>      (-5.89%) ⬇️
...ine/components/feature_preprocessing/kernel_pca.py    94.73% <0.00%>      (-1.76%) ⬇️
..._preprocessing/select_percentile_classification.py    87.93% <0.00%>      (-1.73%) ⬇️
autosklearn/ensemble_builder.py                          76.65% <0.00%>      (-0.41%) ⬇️
autosklearn/util/backend.py                              76.15% <0.00%>      (+1.42%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a10b384...e9ecfa0.

@mfeurer changed the base branch from master to development on February 24, 2021 at 20:48
@mfeurer merged commit d3aa95e into automl:development on Feb 24, 2021

@mfeurer (Contributor) commented Feb 24, 2021

Thanks a lot @pberba for this contribution and @franchuterivera for rebasing this!

franchuterivera added a commit to franchuterivera/auto-sklearn that referenced this pull request on Mar 11, 2021:
* MAINT cleanup readme and remove old service yaml file (.landscape.yaml)

* MAINT bump to dev version

* move from fork to spawn

* FIX_1061 (automl#1063)

* FIX_1061

* Fxi type of target

* Moving to classes_

* classes_ should be np.ndarray

* Force float before nan

* Pynisher context is passed to metafeatures (automl#1076)

* Pynisher context to metafeatures

* Update test_smbo.py

Co-authored-by: Matthias Feurer <[email protected]>

* Calculate loss support (automl#1075)

* Calculate loss support

* Relaxed log loss test for individual models

* Feedback from automl#1075

* Missing loss in comment

* Revert back test as well

* Fix rank for metrics for which greater value is not good (automl#1079)

* Enable Mypy in evaluation (except Train Evaluator) (automl#1077)

* Almost all files for evaluation

* Feedback from PR

* Feedback from comments

* Solving rebase artifacts

* Revert bytes

* Automatically update the Copyright when building the html (automl#1074)

* update the year automatically

* Fixes for new numpy

* Revert test

* Prepare new release (automl#1081)

* prepare new release

* fix unit test

* bump version number

* Fix 1072 (automl#1073)

* Improve selector checking

* Remove copy error

* Rebase changes to development

* No .cache and check selector path

* Missing params in signature (automl#1084)

* Add size check before trying to split for GMeans (automl#732)

* Add size check before trying to split

* Rebase to new code

Co-authored-by: chico <[email protected]>

* Fxi broken links in docs and update parallel docs (automl#1088)

* Fxi broken links

* Feedback from comments

* Update manual.rst

Co-authored-by: Matthias Feurer <[email protected]>

* automl#660 Enable Power Transformations Update (automl#1086)

* Power Transformer

* Correct typo

* ADD_630

* PEP8 compliance

* Fix target type

Co-authored-by: MaxGreil <[email protected]>

* Stale Support (automl#1090)

* Stale Support

* Enhanced criteria for stale

* Enable weekly cron job

* test

Co-authored-by: Matthias Feurer <[email protected]>
Co-authored-by: Matthias Feurer <[email protected]>
Co-authored-by: Rohit Agarwal <[email protected]>
Co-authored-by: Pepe Berba <[email protected]>
Co-authored-by: MaxGreil <[email protected]>