
[BUG] Some metrics don't work with int64 data type arrays unlike their sklearn counterparts #4784

Open
shaswat-indian opened this issue Jun 23, 2022 · 2 comments
Labels
? - Needs Triage (need team to review and classify) · bug (something isn't working) · inactive-30d · inactive-90d

Comments

@shaswat-indian
Contributor

Describe the bug
Currently, some metrics such as entropy accept only int32 arrays. Even after allowing the int64 dtype by modifying this line, the output was still wrong, for the reason described below.

Metrics such as homogeneity, which rely on the underlying entropy metric, return incorrect output for int64 arrays (or when no dtype is specified). The cause is a bug in the cub module used by the entropy metric, specifically at this line, which does not return the correct counts from the histogram.

The root cause appears to be this issue in the cub repo.
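For context, Shannon entropy over cluster labels is computed from histogram counts, so an incorrect count from the histogram step directly corrupts the final value. A minimal NumPy sketch of the reference computation (a host-side illustration, not cuML's CUDA implementation):

```python
import numpy as np

def reference_entropy(labels: np.ndarray) -> float:
    """Shannon entropy (natural log) of a cluster-label array."""
    counts = np.bincount(labels)              # histogram of label frequencies
    probs = counts[counts > 0] / labels.size  # drop empty bins, normalize
    return float(-np.sum(probs * np.log(probs)))

labels = np.array([0, 0, 1, 1], dtype=np.int64)
print(reference_entropy(labels))  # ln(2) ≈ 0.6931
```

If the histogram bin counts are wrong (as reported for the cub path with int64 input), the probabilities and thus the entropy come out wrong as well.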

Steps/Code to reproduce bug

import numpy as np
from sklearn.metrics.cluster import entropy as sk_entropy
from cuml.metrics.cluster import entropy as cuml_entropy

# int64 labels trigger the bug; both calls should print the same value
cluster_1 = np.array([0, 0, 1, 1], dtype=np.int64)
print(sk_entropy(cluster_1))    # correct: ln(2) ≈ 0.6931
print(cuml_entropy(cluster_1))  # wrong output (or rejected) for int64 input

Expected behavior
The API should accept int64 arrays as well, matching the sklearn API.
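Until the underlying cub issue is fixed, one possible caller-side workaround (an assumption on my part, not a confirmed fix) is to downcast labels to int32 before calling the cuML metric, since int32 is the path the code currently supports. A sketch with a hypothetical `as_int32_labels` helper that guards against lossy casts:

```python
import numpy as np

def as_int32_labels(labels: np.ndarray) -> np.ndarray:
    """Downcast a label array to int32, raising if any value does not fit.

    Cluster labels are small non-negative integers in practice, so the
    cast is normally lossless; the round-trip check catches overflow.
    """
    out = labels.astype(np.int32)
    if not np.array_equal(out, labels):
        raise OverflowError("labels do not fit in int32")
    return out

labels64 = np.array([0, 0, 1, 1], dtype=np.int64)
labels32 = as_int32_labels(labels64)  # pass this to cuml_entropy instead
```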

Environment details (please complete the following information):

  • Linux Distro/Architecture: [Ubuntu 18.04 amd64]
  • CUDA: [11.5]
  • On branch branch-22.08
@shaswat-indian added the ? - Needs Triage and bug labels on Jun 23, 2022
rapids-bot bot pushed a commit that referenced this issue Jul 14, 2022
This PR resolves #802 by adding a Python API for `v_measure_score`.

Also came across an [issue](#4784) while working on this.

Authors:
  - Shaswat Anand (https://github.com/shaswat-indian)

Approvers:
  - Micka (https://github.com/lowener)
  - William Hicks (https://github.com/wphicks)

URL: #4785
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023