Improvements of UMAP/TSNE precomputed KNN feature #4865

Merged: 20 commits merged into rapidsai:branch-23.02 on Feb 3, 2023

@viclafargue viclafargue commented Aug 10, 2022

The X input in the fit and fit_transform functions is unnecessary when a KNN graph is provided. This PR adds a precomputed boolean parameter that specifies whether X serves as a classic input or as a precomputed KNN graph. Additionally, the transform function is modified so that it can no longer take a KNN graph.

This PR does the following:

  1. Adds a precomputed_knn parameter to the UMAP and tSNE constructors.
    It can be provided as one of:
    • a tuple (distances, indices)
    • a pairwise distance matrix of shape (n_samples, n_samples)
    • a KNN graph in CSR/COO/CSC format
  2. Makes the legacy knn_graph parameter of the UMAP and tSNE fit methods accept all of the forms listed above. When provided, knn_graph takes precedence over precomputed_knn.
  3. Removes the knn_graph parameter from the UMAP transform method, as it had no effect.
  4. Adds precomputed KNN support to the UMAP fit method in the sparse case.
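
As a rough illustration of how the three accepted forms relate to one another, the sketch below derives a (distances, indices) tuple and CSR-style arrays from a small pairwise distance matrix. It uses plain Python lists so that it runs without a GPU. The helper names (`pairwise_distances`, `knn_tuple_from_pairwise`, `knn_csr_from_tuple`) are hypothetical and not part of cuML, and treating each point as its own first neighbor is an assumption.

```python
import math


def pairwise_distances(points):
    """Dense (n_samples, n_samples) Euclidean distance matrix as nested lists."""
    return [[math.dist(p, q) for q in points] for p in points]


def knn_tuple_from_pairwise(dist, k):
    """Convert a pairwise distance matrix into a (distances, indices) tuple."""
    distances, indices = [], []
    for row in dist:
        # Sort column indices by distance; ties break on the lower index.
        order = sorted(range(len(row)), key=lambda j: (row[j], j))[:k]
        indices.append(order)
        distances.append([row[j] for j in order])
    return distances, indices


def knn_csr_from_tuple(distances, indices):
    """Flatten a (distances, indices) tuple into CSR arrays (data, indices, indptr)."""
    data = [d for row in distances for d in row]
    cols = [j for row in indices for j in row]
    indptr = [0]
    for row in indices:
        indptr.append(indptr[-1] + len(row))
    return data, cols, indptr


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
dist = pairwise_distances(points)                       # form 2: pairwise matrix
knn_dists, knn_inds = knn_tuple_from_pairwise(dist, 2)  # form 1: (distances, indices)
csr_data, csr_cols, csr_indptr = knn_csr_from_tuple(knn_dists, knn_inds)  # form 3

# With cuML installed (GPU required), the PR describes passing any of these forms
# to the constructor via precomputed_knn. That call is not executed here, and the
# plain lists above would first need converting to array types the library accepts.
```

All three forms carry the same neighbor information; the tuple and the CSR arrays simply drop the entries that are not among the k nearest.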

@divyegala
Member

@viclafargue Can we also do this for tSNE?

@viclafargue viclafargue changed the title Fix UMAP KNN graph feature Fix UMAP/TSNE KNN graph feature Aug 18, 2022
@dantegd dantegd added the 2 - In Progress Currently a work in progress label Aug 30, 2022
@cjnolet
Member

cjnolet commented Sep 27, 2022

We discussed this offline a little while back, but the user here is really requesting that we accept a pairwise distance matrix directly. While looking into the changes in this PR, I discovered that UMAP actually does support a knn graph as input now; however, it's configured at the estimator level and not passed in on the fit() or transform() functions. We should follow UMAP's API here, but instead of introducing breaking changes, we should perhaps deprecate the knn_graph option in fit() and transform() and add the option to configure it on the estimator. I think these changes should be done almost entirely at the Python layer, as the C++ side itself doesn't need to manage that state.
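
One way the suggestion above could look at the Python layer is sketched here. `SketchEstimator` is a hypothetical stand-in, not cuML's implementation: the constructor holds `precomputed_knn`, the legacy `knn_graph` argument to `fit()` still wins when given, and the `FutureWarning` is an assumption about how a deprecation might be signalled.

```python
import warnings


class SketchEstimator:
    """Hypothetical estimator illustrating estimator-level KNN configuration."""

    def __init__(self, precomputed_knn=None):
        # Estimator-level configuration, as proposed in the comment above.
        self.precomputed_knn = precomputed_knn

    def _resolve_knn(self, knn_graph=None):
        # The legacy fit() argument takes precedence over the constructor value.
        if knn_graph is not None:
            warnings.warn(
                "knn_graph in fit() is a legacy option; "
                "prefer precomputed_knn on the estimator",
                FutureWarning,
            )
            return knn_graph
        return self.precomputed_knn

    def fit(self, X, knn_graph=None):
        # Only record which KNN source would be used; no embedding is computed.
        self.knn_used_ = self._resolve_knn(knn_graph)
        return self
```

Keeping the resolution in one Python helper means the C++ side only ever receives the final KNN data and never has to track where it came from.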

@dantegd dantegd changed the base branch from branch-22.10 to branch-22.12 October 28, 2022 20:54
@viclafargue viclafargue added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 8, 2022
@viclafargue viclafargue added breaking Breaking change and removed non-breaking Non-breaking change labels Dec 8, 2022
@viclafargue
Contributor Author

viclafargue commented Dec 8, 2022

I tried to keep C++ changes to a minimum, but some were still necessary.

@github-actions github-actions bot added CMake conda conda issue labels Dec 8, 2022
@github-actions github-actions bot removed conda conda issue CMake labels Jan 5, 2023
Member

@cjnolet cjnolet left a comment

Looking good so far! Mostly minor things at this point.

@@ -199,6 +199,15 @@ class TSNE(Base,
'sqeuclidean' metric, the distances will still be squared when True.
Note: This argument should likely be set to False for distance metrics
other than 'euclidean' and 'l2'.
precomputed_knn : array / sparse array / tuple, optional (device or host)
Either one of :
- Tuple (distances, indices) of arrays of
Member

We should consider also supporting the knn_search_index option (third tuple element) which is supported in the reference implementation: https://github.com/lmcinnes/umap/blob/3f19ce19584de4cf99e3d0ae779ba13a57472cd9/umap/umap_.py#L1626. We can push that feature off but we should at least create an issue for it to keep it on our radar.

Contributor Author

Sure, I think it would be better to work on this in a follow-up PR. I just opened an issue for it: #5118.
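
For the follow-up, accepting an optional third tuple element could be handled with a small normalization step like the one sketched below. `split_precomputed_knn` is a hypothetical helper, not an existing cuML function, and the (distances, indices) ordering follows this PR's description; the search index is returned untouched because its type is not specified here.

```python
def split_precomputed_knn(value):
    """Split a 2- or 3-element tuple into (distances, indices, search_index)."""
    if not isinstance(value, tuple) or len(value) not in (2, 3):
        raise ValueError(
            "expected (distances, indices) or (distances, indices, search_index)"
        )
    distances, indices = value[0], value[1]
    # The third element is optional; None means no search index was supplied.
    search_index = value[2] if len(value) == 3 else None
    return distances, indices, search_index
```

Existing callers that pass a two-element tuple would keep working, since the missing element simply resolves to `None`.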

Member

@cjnolet cjnolet left a comment

This LGTM. Thanks for adding this feature, @viclafargue!

Will merge pending conflicts/CI

@cjnolet cjnolet added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Jan 11, 2023
@codecov-commenter

Codecov Report

Base: 67.12% // Head: 67.26% // Increases project coverage by +0.13% 🎉

Coverage data is based on head (bb5f8a8) compared to base (f6abf81).
Patch coverage: 86.04% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.02    #4865      +/-   ##
================================================
+ Coverage         67.12%   67.26%   +0.13%     
================================================
  Files               192      192              
  Lines             12396    12394       -2     
================================================
+ Hits               8321     8337      +16     
+ Misses             4075     4057      -18     
Impacted Files Coverage Δ
python/cuml/common/sparsefuncs.py 92.24% <86.04%> (-0.15%) ⬇️
python/cuml/dask/common/input_utils.py 34.37% <0.00%> (-1.51%) ⬇️
python/cuml/dask/ensemble/randomforestregressor.py 33.87% <0.00%> (-1.05%) ⬇️
...ython/cuml/dask/ensemble/randomforestclassifier.py 29.88% <0.00%> (-0.80%) ⬇️
python/cuml/preprocessing/TargetEncoder.py 85.22% <0.00%> (-0.08%) ⬇️
...on/cuml/benchmark/automated/bench_random_forest.py 0.00% <0.00%> (ø)
python/cuml/dask/preprocessing/label.py 40.00% <0.00%> (+0.71%) ⬆️
...ython/cuml/dask/neighbors/kneighbors_classifier.py 23.42% <0.00%> (+1.57%) ⬆️
python/cuml/dask/neighbors/nearest_neighbors.py 28.75% <0.00%> (+2.00%) ⬆️
python/cuml/dask/neighbors/kneighbors_regressor.py 32.75% <0.00%> (+2.11%) ⬆️

@dantegd
Copy link
Member

dantegd commented Feb 3, 2023

/merge

@rapids-bot rapids-bot bot merged commit a1a6e21 into rapidsai:branch-23.02 Feb 3, 2023
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
~~The X input in the `fit` and `fit_transform` functions is unnecessary when a KNN graph is provided. This PR adds a `precomputed` boolean parameter that specifies whether X serves as a classic input or as a precomputed KNN graph. Additionally, the `transform` function is modified so that it can no longer take a KNN graph.~~

This PR does the following :
1) Provides a `precomputed_knn` parameter to UMAP and tSNE constructor.
It can be provided in the form of a :
    - tuple (distances, indices)
    - pairwise distance matrix of shape (n_samples, n_samples)
    - KNN graph in CSR/COO/CSC format
2) Makes the legacy `knn_graph` parameter of the UMAP and tSNE fit method capable of taking in all the forms aforementioned. The `knn_graph` parameter when provided would take precedence over the `precomputed_knn` parameter.
3) Removes `knn_graph` parameter from UMAP transform method as it wasn't actually doing anything.
4) Adds precomputed KNN capabilities for the UMAP fit method in the sparse case.

Authors:
  - Victor Lafargue (https://github.com/viclafargue)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4865
Labels
4 - Waiting on Author Waiting for author to respond to review breaking Breaking change CUDA/C++ Cython / Python Cython or Python issue gpuCI gpuCI issue improvement Improvement / enhancement to an existing function
6 participants