Clustering metrics #2003
Comments
Hi! Thanks for your contribution, great first issue!
Hi @SkafteNicki, I'd like to give Mutual Info Score a shot! Should the implementation be done in a new directory?
@matsumotosan cool, thanks for wanting to contribute :]
Also, if it helps, I was fooling around with some code over the weekend, trying to calculate the contingency matrix and the pair matrix, which seem to be two common objects used for calculating clustering metrics:

```python
import torch

p = torch.tensor([0, 0, 1, 2])  # predicted cluster labels
t = torch.tensor([0, 0, 1, 1])  # target cluster labels
p_u = len(torch.unique(p))
t_u = len(torch.unique(t))

# contingency matrix: entry [i, j] counts samples with prediction i and target j
# (assumes labels are already 0..k-1)
contingency_matrix = torch.bincount(t + t_u * p, minlength=p_u * t_u).reshape(p_u, t_u)
n_c = contingency_matrix.sum(dim=1)  # samples per predicted cluster
n_k = contingency_matrix.sum(dim=0)  # samples per target cluster
sum_squared = (contingency_matrix ** 2).sum()

# 2x2 pair matrix counting ordered sample pairs (self-pairs excluded),
# split by whether a pair shares a predicted cluster and/or a target cluster
pair_matrix = torch.zeros(2, 2)
pair_matrix[1, 1] = sum_squared - len(p)
pair_matrix[0, 1] = (contingency_matrix * n_k).sum() - sum_squared
pair_matrix[1, 0] = (contingency_matrix.T * n_c).sum() - sum_squared
pair_matrix[0, 0] = len(p) ** 2 - pair_matrix[0, 1] - pair_matrix[1, 0] - sum_squared
```

Probably needs to be improved, but maybe useful.
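To show what those pair counts buy us: the Rand index is just the fraction of sample pairs treated consistently by the two labelings. A brute-force pure-Python sketch (O(n²), the helper name is hypothetical, not the eventual TorchMetrics API):

```python
from itertools import combinations

def rand_index(preds, target):
    """Rand index via brute-force pair counting: fraction of pairs on which
    the predicted and target labelings agree (same/same or different/different)."""
    agree = 0
    pairs = list(combinations(range(len(preds)), 2))
    for i, j in pairs:
        same_pred = preds[i] == preds[j]
        same_true = target[i] == target[j]
        if same_pred == same_true:  # pair treated consistently by both labelings
            agree += 1
    return agree / len(pairs)

# same example as the snippet above: only the pair (2, 3) disagrees
print(rand_index([0, 0, 1, 2], [0, 0, 1, 1]))  # 5 of 6 pairs agree -> 0.8333...
```

An optimized version would read the same counts off the pair matrix instead of looping over pairs.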
I tried calculating the contingency matrix based on sklearn's implementation. I'll compare it to the method you suggested, since it does seem to be a key computation.
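Once the contingency matrix is in hand, mutual information follows directly from its cells. A minimal pure-Python sketch under the standard definition (the helper name is hypothetical, not the eventual TorchMetrics API):

```python
import math
from collections import Counter

def mutual_info_from_labels(preds, target):
    """MI = sum_ij (n_ij / n) * log(n * n_ij / (a_i * b_j)), summed over the
    non-zero contingency cells n_ij with row sums a_i and column sums b_j."""
    n = len(preds)
    joint = Counter(zip(preds, target))  # contingency counts n_ij
    a = Counter(preds)                   # row sums a_i
    b = Counter(target)                  # column sums b_j
    return sum(
        (n_ij / n) * math.log(n * n_ij / (a[i] * b[j]))
        for (i, j), n_ij in joint.items()
    )

print(mutual_info_from_labels([0, 0, 1, 1], [0, 0, 1, 1]))  # log(2) ~ 0.6931
```

A tensor implementation would compute the same sum from the contingency matrix above, masking out zero cells.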
I'd like to work on normalized mutual info and Dunn index next!
Assigned :]
@SkafteNicki Would it be useful if we accumulated all the inputs for clustering-related tests in a separate file?
@SkafteNicki I'll take Silhouette.
@matsumotosan I think that is a good idea; anything that helps reduce replicated code is always good.
I'll send a draft PR in a bit.
@SkafteNicki I'll take adjusted mutual info score as well.
Hello @SkafteNicki, which of these are not yet claimed?
Hi @m0saan,
@SkafteNicki I can take Fowlkes-Mallows.
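For orientation, Fowlkes-Mallows also reduces to pair counts: over all unordered sample pairs, it is TP / sqrt((TP + FP)(TP + FN)). A brute-force pure-Python sketch (the helper name is hypothetical, not the eventual TorchMetrics implementation):

```python
import math
from itertools import combinations

def fowlkes_mallows(preds, target):
    """Fowlkes-Mallows index from pair counts: a pair is TP if it shares a
    cluster in both labelings, FP if only in preds, FN if only in target."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(preds)), 2):
        same_pred = preds[i] == preds[j]
        same_true = target[i] == target[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    return tp / math.sqrt((tp + fp) * (tp + fn))

# same example as above: TP = 1 (pair 0,1), FN = 1 (pair 2,3), FP = 0
print(fowlkes_mallows([0, 0, 1, 2], [0, 0, 1, 1]))  # 1 / sqrt(2) ~ 0.7071
```

A vectorized version would take TP, FP, and FN straight from the pair matrix entries.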
Most have been done. Thank you all for your contributions! 🐿️
🚀 Feature
Let's add clustering metrics to TM:
Rand Score
#2025
Motivation
In supervised learning, the labels are known, and performance can be evaluated by comparing predictions against them. In unsupervised learning, however, the labels are unknown, which makes it hard to evaluate correctness since there is no ground truth.
Even so, a good clustering algorithm should produce clusters with small within-cluster variance (data points in a cluster are similar to each other) and large between-cluster variance (clusters are dissimilar to one another).
ref: https://towardsdatascience.com/7-evaluation-metrics-for-clustering-algorithms-bdc537ff54d2
CTA
Please also check our contribution guide: https://torchmetrics.readthedocs.io/en/stable/generated/CONTRIBUTING.html