[FEA] "Precomputed" Distance Matrix in (some) Clustering Algorithms #4516

Open
Mortom123 opened this issue Jan 25, 2022 · 5 comments
Labels: ? - Needs Triage, feature request, inactive-30d, inactive-90d


@Mortom123

Sometimes we have no point representations in space, only the distances between points.
It would therefore be great if some algorithms (I'm especially interested in HDBSCAN and Agglomerative Clustering) could work on precomputed (sparse) distance matrices, similar to the "precomputed" metric available in many sklearn algorithms.

Personally, I'm working with biological, structural data, so I only have differences in structure, not points in space.

Several issues also relate to this FEA: #4475, #4460 (#1192, #4409). For DBSCAN, for example, this was already implemented via issue #3302.
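
To illustrate why a distance matrix alone is enough for this family of algorithms, here is a minimal pure-Python sketch of single-linkage agglomerative clustering that never looks at coordinates. It is not cuML or sklearn code; the function name `single_linkage` and the toy matrix `D` are made up for illustration, and the O(n^3) loop is only meant for tiny inputs.

```python
def single_linkage(dist, n_clusters):
    """Agglomerative single-linkage clustering driven only by a distance matrix.

    dist: symmetric list of lists of pairwise distances (no coordinates needed).
    Returns one cluster label per item.
    """
    n = len(dist)
    clusters = [{i} for i in range(n)]  # every item starts in its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the closest cross-cluster pair.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]  # merge the two closest clusters
        del clusters[b]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical structural dissimilarities between four items.
D = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.95],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.95, 0.2, 0.0],
]
labels = single_linkage(D, n_clusters=2)
print(labels)
```

Items 0 and 1 end up together, as do items 2 and 3, using nothing but the pairwise values in `D`.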

Mortom123 added the ? - Needs Triage and feature request labels on Jan 25, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@erke-apoqlar

Hi,

Are there any updates on making precomputed matrices available for HDBSCAN?

@SnzFor16Min

I just attempted to run HDBSCAN on a cupyx.scipy.sparse._csr.csr_matrix, and it failed immediately on the sparse input:

    hdb.fit(D)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 687, in cuml.internals.base.UniversalBase.dispatch_func
  File "hdbscan.pyx", line 762, in cuml.cluster.hdbscan.hdbscan.HDBSCAN.fit
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/input_utils.py", line 380, in input_to_cuml_array
    arr = CumlArray.from_input(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 1114, in from_input
    arr = cls(X, index=index, order=requested_order, validate=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 292, in __init__
    new_data = cur_xpy.asarray(data, dtype=dtype)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cupy/_creation/from_data.py", line 88, in asarray
    return _core.array(a, dtype, False, order, blocking=blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "cupy/_core/core.pyx", line 2379, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2406, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2541, in cupy._core.core._array_default
ValueError: setting an array element with a sequence.

I noticed there is a SparseCumlArray class, but surprisingly HDBSCAN does not accept it either:

    hdb.fit(D)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 687, in cuml.internals.base.UniversalBase.dispatch_func
  File "hdbscan.pyx", line 762, in cuml.cluster.hdbscan.hdbscan.HDBSCAN.fit
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/input_utils.py", line 380, in input_to_cuml_array
    arr = CumlArray.from_input(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 1114, in from_input
    arr = cls(X, index=index, order=requested_order, validate=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 292, in __init__
    new_data = cur_xpy.asarray(data, dtype=dtype)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cupy/_creation/from_data.py", line 88, in asarray
    return _core.array(a, dtype, False, order, blocking=blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "cupy/_core/core.pyx", line 2379, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2406, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2541, in cupy._core.core._array_default
TypeError: float() argument must be a string or a real number, not 'SparseCumlArray'

I'm looking forward to any suggestions or a support schedule for this, as precomputed sparse distance matrices are common inputs to clustering algorithms.
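
Both tracebacks end inside cupy's `asarray`, which suggests the fit path tries to build a dense array from whatever it is given. One thing worth keeping in mind for any workaround or eventual implementation is what an unstored entry in a sparse distance matrix means: usually "unknown or far apart", not "distance 0". The pure-Python sketch below shows that convention with a hypothetical helper, `densify_sparse_distances`, and a plain dict standing in for the sparse matrix. It is not a cuML API, and whether cuML's HDBSCAN would accept such a dense matrix as precomputed distances is not confirmed anywhere in this thread.

```python
import math


def densify_sparse_distances(n, entries):
    """Expand sparse {(i, j): distance} entries into a dense n x n matrix.

    Unstored pairs become inf (unknown / far apart) rather than 0, so they
    are never mistaken for identical points. The diagonal is set to 0.
    """
    dense = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dense[i][i] = 0.0
    for (i, j), d in entries.items():
        dense[i][j] = d
        dense[j][i] = d  # keep the matrix symmetric
    return dense


# Hypothetical sparse input: only two of the six pairs are known.
sparse_entries = {(0, 1): 0.1, (2, 3): 0.2}
dense = densify_sparse_distances(4, sparse_entries)
for row in dense:
    print(row)
```

A plain sparse-to-dense conversion would instead fill the missing pairs with zeros, which a clustering algorithm would read as "these points coincide".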

@KanishkT123

KanishkT123 commented Sep 19, 2024

@cjnolet, what would it take to get this made and merged in? I'm happy to take a shot at it, with no promises as to how far I get. I'm working with some data right now that would very much benefit from 'cosine', and failing that, 'precomputed' is a good option for getting a lot of different metrics working.

I would just need some guidance on where to start.
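
As a rough sketch of how 'precomputed' would unlock other metrics: the distance matrix is computed up front with whatever metric is needed (cosine here), and the clustering step only ever sees that matrix. The helper name `cosine_distance_matrix` and the toy data are illustrative assumptions, written in pure Python rather than against any cuML API, and the helper assumes no zero-length vectors.

```python
import math


def cosine_distance_matrix(vectors):
    """Return the dense matrix of pairwise cosine distances (1 - cosine similarity).

    Assumes every vector has a non-zero norm.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            d = 1.0 - dot / (norms[i] * norms[j])
            dist[i][j] = dist[j][i] = d  # symmetric by construction
    return dist


# Toy data: the first two vectors point the same way, the third is orthogonal.
X = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
D_cos = cosine_distance_matrix(X)
for row in D_cos:
    print(row)
```

Any estimator that accepted a precomputed matrix could then be handed `D_cos` directly, which is why a single 'precomputed' option covers many metrics at once.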
