-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Providing intrinsic clustering quality indices #257
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #257 +/- ##
==========================================
+ Coverage 95.01% 95.40% +0.39%
==========================================
Files 19 20 +1
Lines 1385 1503 +118
==========================================
+ Hits 1316 1434 +118
Misses 69 69 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution!
These metrics (I find the term index a bit confusing in this context) are indeed very useful, but the code needs some work.
I've indicated in the code suggestions that we need to export the single function clustering_quality().
I've just reviewed the first calinski_harabasz()
method, but for the other ones please update the formatting, variable names and linear algebra code in the similar way.
Also please add the proper unit tests (you have them already, but they are in the wrong source file)
Let's try to make the new code in Clustering.jl index base-agnostic. It should not be so "irritating": using eachindex()/eachslice() should address most of the issues (and potentially even simplify & speed-up the code). |
Thank you about the thoughtful comments, it is a chance to learn something new along the way. Few thoughts about broad topics:
|
If there's a consensus to use index in the literature, let's call it index (quality_index kwarg in clustering_quality() function).
The elements should be of |
Attempt at making |
Hello everyone, I'm currently working on intrinsic cluster validity indices and wanted to propose a new feature for the package, but I found this thread open that is more or less what I was thinking. The development seems to have paused a couple of months ago; I would like to contribute. Before diving in, I would like to know if help is indeed needed here? |
Hi! The current state is that the commit is fully functional and I am waiting for more detailed re-review by @alyst . You can check the files, few most popular quality measures are implemented there. |
Conflict resolved. The code is mostly index-agnostic; I learned that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, there was a big progress since the last version!
I've left some comments/suggestions to the docs and the code.
I have not made these suggestions everywhere, but please use space after comma in the code and in the docs.
At the moment clustering_quality.jl
tests are not enabled in the CI, so it is not clear whether the code runs and gives the correct results, and what is the coverage. Please include this file in the runtests.jl
. By looking at the existing tests I would say that ideally we would also need the tests with non-0,1 weights and with varying fuzzyness.
If the CI results look ok, I think we are close to merging it after the final review.
@test_throws ArgumentError clustering_quality(zeros(2,2),zeros(2,2), [1, 2], quality_index = :calinski_harabasz) | ||
@test_throws DimensionMismatch clustering_quality([1,2,3], zeros(2,2), quality_index = :dunn) | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also test corner cases to make sure they are handled correctly (ideally, the correct value according to the quality index definition should be returned):
- degenerated input data: 0 points (and 0 clusters)
- trivial clustering: single cluster
- 0-dimensional data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quality indices are generally undefined for single cluster. The reason is that the cluster spread is undefined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, silhouettes() actually throws an error for the degenerated clustering, so the other indices should follow that behaviour.
The other corner case that silhouettes() checks are empty clusters (gaps in assignments
values).
See test/silhouette.jl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the cluster is empty than it's center is undefined, which means that the user gave nonsense arguments (or inconsistent number of assignments/centers, this is checked). Only silhouettes and Dunn index do not require them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally we need to test that the behavior in such cases meets our expectations, because this will happen (bugs in the user code). I'd say: let's add the tests, see how clustering_quality
handles it and then decide if we need to adjust the behavior.
src/clustering_quality.jl
Outdated
_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix, | ||
assignments::AbstractVector{<:Integer}) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix, | |
assignments::AbstractVector{<:Integer}) = | |
_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix, | |
assignments::AbstractVector{<:Integer}, fuzziness::Nothing) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can calculate inner_inertia without using _gather_samples
as you suggest, but colwise
can be more efficient than calling chosen distance within the loop. This makes possible gains unclear and distance-dependent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pairwise()
is for sure more efficient when we are calculating pairwise distances (for certain metrics) between N and M points.
But in this case it's a distance between N points and 1 point (in _gather_samples()
implementation).
Is it also getting significant performance boost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what I understand, possible gains are entirely metric-dependent. For the most popular one, in
using BenchmarkTools
X = randn(1000,1000)
C = randn(1000)
@benchmark sum(colwise(Euclidean(),X,C))
@benchmark sum(Euclidean()(x,C) for x in eachcol(X))
on my machine I get colwise
being faster by 60% and taking 10x less memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking. Could you please add your benchmark results here?
I guess the performance gain is due to the use of BLAS matrix multiplication for Euclidean/SqEuclidean (so for the other metrics the gain should be lower).
The memory inefficiency is a bit unfortunate, I'm not sure it's because of eachcol(X)
vs view(X, :, i)
, but due to internal arrays allocation in the generic metric()
code.
Anyway, thanks again for checking -- you can revert it to the old version (otherwise I will do it), just keep the signature of inner_inertia()
the same and call gather_samples()
inside it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, colwise
is
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 132.700 μs … 4.223 ms ┊ GC (min … max): 0.00% … 96.37%
Time (median): 137.900 μs ┊ GC (median): 0.00%
Time (mean ± σ): 144.589 μs ± 47.773 μs ┊ GC (mean ± σ): 0.28% ± 0.96%
▃▇█▆▅▅▃▃▂▂▁▁▁ ▁ ▁ ▂
██████████████████████████▇▇▆▆▆▇▆▆▆▆▆▅▃▅▅▄▁▄▄▁▄▅▃▁▄▄▃▁▃▄▄▁▅▅ █
133 μs Histogram: log(frequency) by time 256 μs <
Memory estimate: 7.95 KiB, allocs estimate: 2.
Euclidean()(x,C)
is
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 197.400 μs … 3.395 ms ┊ GC (min … max): 0.00% … 87.95%
Time (median): 203.200 μs ┊ GC (median): 0.00%
Time (mean ± σ): 211.891 μs ± 122.278 μs ┊ GC (mean ± σ): 2.17% ± 3.57%
▃█▇▇▁
▁▂▅█████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
197 μs Histogram: frequency by time 254 μs <
Memory estimate: 78.17 KiB, allocs estimate: 3001.
I've checked view(X,:,i)
quickly, it is even worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked view(X,:,i) quickly, it is even worse.
Worse in terms of memory or performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the same machine
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 325.000 μs … 2.099 ms ┊ GC (min … max): 0.00% … 69.74%
Time (median): 339.600 μs ┊ GC (median): 0.00%
Time (mean ± σ): 348.629 μs ± 85.286 μs ┊ GC (mean ± σ): 1.09% ± 3.81%
▁▆█▆▅▄▃▁▁▂▂▁ ▁ ▂
▇▃▁█████████████████▇▇▇▇▆▆▅▅▅▅▅▆▅▅▅▄▅▅▁▅▄▃▃▃▄▅▄▁▁▃▃▁▄▃▃▄▄▁▁▄ █
325 μs Histogram: log(frequency) by time 478 μs <
Memory estimate: 116.66 KiB, allocs estimate: 5466.
@jaksle We have reached the point when the CI tests are passing, thank you! |
@alyst You mentioned wanting to reshape documentation. It may be a good idea to divide "Evaluation & Validation" into 2 sections. Clustering quality and silhouettes are about intrinsic validation, the rest is about measuring similarity between clusterings, cross validation you could say. |
Yes, that's what I had in mind: separate intrinsic quality metrics and the comparison between the clusterings (cross-validation is one use case). |
@jaksle I think we're almost done with the PR. But it would be nice to improve the coverage. |
@alyst Yes, of course. For |
@jaksle Yes, I think no comprehensive tests are required for silhouettes. And there is also Dunn method that is not covered in that clustering_quality() method variant. |
I added testing of |
@jaksle I've updated the tests to fix the coverage (also, I've enhanced the exception logic -- it should throw a correct error message if the index name is valid, but not compatible with input arguments). Plus, I've reshuffled the docs, so that clustering comparison and quality evaluation are the two big subsections. I think this PR is ready to be merged -- but let me know if you have some questions/concerns about my recent changes. And thank you for all your hard work! |
This looks nice. I think this is a very good clean-up. I also think the PR is ready to merge. Thank you too for all the corrections! |
Continuation of issue #253. It introduces 4 indices: Calinsky-Harabasz, Davies-Bouldin, Xie-Beni and Dunn. CH, DB, XB require that the space is linear, e.g. we can talk about average positions, k-means and fuzzy c-means work under the same assumptions. Only CH and XB are defined for fuzzy clustering. D is the most computationally complex, but is the most general, only distances between data points are required, as such works also for k-medoids.
Issues: