Clustering metrics #2003

Closed
12 of 13 tasks
SkafteNicki opened this issue Aug 17, 2023 · 16 comments
Labels: enhancement, good first issue, New metric

@SkafteNicki
Member

SkafteNicki commented Aug 17, 2023

🚀 Feature

Let's add clustering metrics to TM:

Motivation

In Supervised Learning, the labels are known and evaluation can be done by calculating the degree of correctness by comparing the predicted values against the labels. However, in Unsupervised Learning, the labels are not known, which makes it hard to evaluate the degree of correctness as there is no ground truth.

That being said, it still holds that a good clustering algorithm produces clusters with small within-cluster variance (data points in a cluster are similar to each other) and large between-cluster variance (clusters are dissimilar to other clusters).

ref: https://towardsdatascience.com/7-evaluation-metrics-for-clustering-algorithms-bdc537ff54d2
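To make the within-cluster versus between-cluster idea concrete, here is a minimal sketch that decomposes the total variance of a toy dataset for a given labeling. The function name `within_between_variance` and the toy data are illustrative assumptions, not part of TorchMetrics.

```python
import torch


def within_between_variance(data: torch.Tensor, labels: torch.Tensor):
    """Return (within-cluster, between-cluster) variance, each averaged over all samples.

    Hypothetical helper for illustration only.
    """
    overall_mean = data.mean(dim=0)
    within = torch.zeros(())
    between = torch.zeros(())
    for k in torch.unique(labels):
        members = data[labels == k]
        centroid = members.mean(dim=0)
        # Spread of the members around their own centroid
        within = within + ((members - centroid) ** 2).sum()
        # Spread of the centroid around the global mean, weighted by cluster size
        between = between + len(members) * ((centroid - overall_mean) ** 2).sum()
    n = data.shape[0]
    return within / n, between / n


data = torch.tensor([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
good = torch.tensor([0, 0, 1, 1])  # groups nearby points together
bad = torch.tensor([0, 1, 0, 1])   # mixes the two groups

w_good, b_good = within_between_variance(data, good)
w_bad, b_bad = within_between_variance(data, bad)
```

On this toy data the sensible labeling has a small within-cluster term and a large between-cluster term, while the mixed labeling reverses the two; their sum (the total variance) is the same in both cases.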

CTA

Please also check our contribution guide: https://torchmetrics.readthedocs.io/en/stable/generated/CONTRIBUTING.html

@SkafteNicki added the enhancement and New metric labels on Aug 17, 2023
@github-actions

Hi! Thanks for your contribution! Great first issue!

@SkafteNicki added this to the future milestone on Aug 17, 2023
@matsumotosan
Member

matsumotosan commented Aug 19, 2023

Hi @SkafteNicki, I'd like to give Mutual Info Score a shot!

Should the implementation be done in a new directory like torchmetrics/clustering and torchmetrics/functional/clustering?

@SkafteNicki
Member Author

@matsumotosan cool thanks for wanting to contribute :]
Yes, please add new directories as I consider this a new domain of metrics.

@SkafteNicki
Member Author

Also, if it helps, I was fooling around with some code over the weekend, trying to calculate the contingency matrix and the pair matrix, which seem to be two common objects used for calculating clustering metrics:

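import torch

p = torch.tensor([0, 0, 1, 2])  # predicted cluster labels
t = torch.tensor([0, 0, 1, 1])  # target cluster labels

# Number of distinct clusters; this assumes labels are contiguous integers 0..k-1
p_u = len(torch.unique(p))
t_u = len(torch.unique(t))

# Encode each (p, t) pair as a single index and count occurrences.
# Rows index predicted clusters, columns index target clusters.
contingency_matrix = torch.bincount(t + t_u * p, minlength=p_u * t_u).reshape(p_u, t_u)

n_c = contingency_matrix.sum(dim=1)  # samples per predicted cluster (row sums)
n_k = contingency_matrix.sum(dim=0)  # samples per target cluster (column sums)
sum_squared = (contingency_matrix ** 2).sum()

# 2x2 matrix of ordered sample pairs: first index is 1 if the pair shares a predicted
# cluster, second index is 1 if the pair shares a target cluster
pair_matrix = torch.zeros(2, 2)
pair_matrix[1, 1] = sum_squared - len(p)  # together in both labelings
pair_matrix[0, 1] = (contingency_matrix * n_k).sum() - sum_squared  # together in target only
pair_matrix[1, 0] = (contingency_matrix.T * n_c).sum() - sum_squared  # together in preds only
pair_matrix[0, 0] = len(p) ** 2 - pair_matrix[0, 1] - pair_matrix[1, 0] - sum_squared  # apart in both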

It probably needs to be improved, but maybe it is useful.
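One possible refinement of the snippet above, offered as a sketch rather than the implementation that was eventually merged: remapping labels with `torch.unique(..., return_inverse=True)` removes the assumption that labels are contiguous integers starting at 0, and the Rand score then follows directly from the pair matrix. The function names here are illustrative assumptions.

```python
import torch


def contingency_matrix(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Co-occurrence counts; rows index predicted clusters, columns index target clusters."""
    # return_inverse maps arbitrary label values onto 0..k-1
    p_vals, p_idx = torch.unique(preds, return_inverse=True)
    t_vals, t_idx = torch.unique(target, return_inverse=True)
    n_p, n_t = len(p_vals), len(t_vals)
    flat = torch.bincount(p_idx * n_t + t_idx, minlength=n_p * n_t)
    return flat.reshape(n_p, n_t)


def pair_matrix(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """2x2 counts of ordered sample pairs, indexed [same in preds, same in target]."""
    cont = contingency_matrix(preds, target)
    n = preds.numel()
    n_c = cont.sum(dim=1)  # row sums: predicted cluster sizes
    n_k = cont.sum(dim=0)  # column sums: target cluster sizes
    sum_squared = (cont ** 2).sum()
    pm = torch.zeros(2, 2, dtype=torch.long)
    pm[1, 1] = sum_squared - n
    pm[0, 1] = (cont * n_k).sum() - sum_squared
    pm[1, 0] = (cont.T * n_c).sum() - sum_squared
    pm[0, 0] = n ** 2 - pm[0, 1] - pm[1, 0] - sum_squared
    return pm


def rand_score(preds: torch.Tensor, target: torch.Tensor) -> float:
    """Fraction of sample pairs on which the two labelings agree."""
    pm = pair_matrix(preds, target)
    return ((pm[0, 0] + pm[1, 1]) / pm.sum()).item()


p = torch.tensor([0, 0, 1, 2])
t = torch.tensor([0, 0, 1, 1])
score = rand_score(p, t)
```

For the example labels, five of the six unordered sample pairs are treated consistently by both labelings, so the score is 5/6.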

@matsumotosan
Member

I tried calculating the contingency matrix based on sklearn's implementation. I'll compare it to the method you suggested since it does seem to be a key computation.
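As an illustration of why the contingency matrix is such a central object, the mutual information between two labelings can be computed from it alone. The sketch below uses the textbook definition with the natural logarithm; the function name is an assumption and this is not necessarily how the eventual PR implemented it.

```python
import torch


def mutual_info_from_contingency(cont: torch.Tensor) -> torch.Tensor:
    """Mutual information (in nats) of two labelings, given their contingency matrix."""
    cont = cont.double()
    n = cont.sum()
    # Product of marginal counts for every cell: shape (rows, 1) * (1, cols)
    outer = cont.sum(dim=1, keepdim=True) * cont.sum(dim=0, keepdim=True)
    nz = cont > 0  # empty cells contribute nothing (0 * log 0 is taken as 0)
    return (cont[nz] / n * torch.log(n * cont[nz] / outer[nz])).sum()


# Contingency matrix for preds [0, 0, 1, 2] and target [0, 0, 1, 1]
cont = torch.tensor([[2, 0], [0, 1], [0, 1]])
mi = mutual_info_from_contingency(cont)
```

Here the predicted labels fully determine the target labels, so the mutual information equals the entropy of the target labeling, log 2.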

@Borda pinned this issue on Aug 24, 2023
@Borda modified the milestones: future, v1.2.0 on Aug 24, 2023
@Borda added the good first issue label on Aug 24, 2023
@matsumotosan
Member

I'd like to work on normalized mutual info and Dunn index next!

@SkafteNicki
Member Author

> I'd like to work on normalized mutual info and Dunn index next!

Assigned :]

@matsumotosan
Member

@SkafteNicki Would it be useful if we accumulated all the inputs for clustering-related tests in a separate file similar to inputs.py for classification? I saw that your rand_score tests used the same test cases that I used for mutual_info and normalized_mutual_info. I could also add cases for Dunn index that may be useful for other metrics in the future.

@stancld
Contributor

stancld commented Sep 2, 2023

@SkafteNicki I'll take Silhouette

@SkafteNicki
Member Author

> @SkafteNicki Would it be useful if we accumulated all the inputs for clustering-related tests in a separate file similar to inputs.py for classification? I saw that your rand_score tests used the same test cases that I used for mutual_info and normalized_mutual_info. I could also add cases for Dunn index that may be useful for other metrics in the future.

@matsumotosan I think that is a good idea; anything that helps reduce replicated code is good.
Can you send a PR with this? Otherwise I'll do it.
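A shared inputs module along the lines discussed could look roughly like the sketch below. Every name in it (`ClusteringInput`, `FeatureInput`, the size constants, and the variable names) is hypothetical and chosen for illustration; the actual file in the repository may be organized differently.

```python
from typing import NamedTuple

import torch

# Hypothetical sizes for the shared test cases
NUM_BATCHES, BATCH_SIZE, NUM_CLUSTERS, NUM_FEATURES = 4, 32, 5, 3


class ClusteringInput(NamedTuple):
    """Label-vs-label case, e.g. for Rand score or mutual information."""

    preds: torch.Tensor
    target: torch.Tensor


class FeatureInput(NamedTuple):
    """Data-vs-label case, e.g. for metrics such as the Dunn index that need the features."""

    data: torch.Tensor
    labels: torch.Tensor


# Random cluster assignments shared by all label-based metric tests
shared_label_case = ClusteringInput(
    preds=torch.randint(high=NUM_CLUSTERS, size=(NUM_BATCHES, BATCH_SIZE)),
    target=torch.randint(high=NUM_CLUSTERS, size=(NUM_BATCHES, BATCH_SIZE)),
)

# Random features with cluster labels for metrics computed on the data itself
shared_feature_case = FeatureInput(
    data=torch.randn(NUM_BATCHES, BATCH_SIZE, NUM_FEATURES),
    labels=torch.randint(high=NUM_CLUSTERS, size=(NUM_BATCHES, BATCH_SIZE)),
)
```

Test files for individual metrics could then import these cases instead of redefining the same tensors.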

@matsumotosan
Member

I'll send a draft PR in a bit.

@matsumotosan
Member

@SkafteNicki I'll take adjusted mutual info score as well.

@m0saan

m0saan commented Sep 3, 2023

Hello @SkafteNicki, which of these are not yet claimed?

@SkafteNicki
Member Author

> Hello @SkafteNicki, which of these are not yet claimed?

Hi @m0saan,
Currently, the Fowlkes-Mallows score and the Davies–Bouldin index are not yet claimed.

@matsumotosan
Member

@SkafteNicki I can take Fowlkes-Mallows.

@Borda
Member

Borda commented Aug 2, 2024

Most have been done. Thank you all for your contributions! 🐿️

6 participants
@Borda @SkafteNicki @matsumotosan @stancld @m0saan and others