[FEA] Allow cosine distance metric in dbscan #4776

Contributor

@tarang-jain tarang-jain commented Jun 13, 2022

closes #4210
Added a cosine distance metric for computing the epsilon neighborhood in DBSCAN. The cosine distance is computed from the L2 distance between L2-normalized vectors, and the epsilon value is adjusted accordingly.
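
The adjustment rests on an identity for unit vectors: the squared L2 distance between two L2-normalized vectors equals twice their cosine distance. A minimal Python sketch of that identity, assuming plain lists as vectors (the helper names are illustrative and are not part of cuML or RAFT):

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a, b = [3.0, 4.0], [1.0, 0.0]
d_cos = cosine_distance(a, b)
d_l2_sq = squared_l2(l2_normalize(a), l2_normalize(b))

# For unit vectors: ||a_hat - b_hat||^2 = 2 * (1 - cos) = 2 * cosine distance
assert math.isclose(d_l2_sq, 2 * d_cos)
```

Under this identity, a cosine threshold `eps` corresponds to a squared-L2 threshold of `2 * eps` on the normalized data, which is consistent with the `eps2 = 2 * data.eps` line in the diff under review.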

@tarang-jain tarang-jain changed the title from "Allow cosine distance metric in dbscan" to "[FEA] Allow cosine distance metric in dbscan" Jun 13, 2022
@tarang-jain tarang-jain marked this pull request as ready for review June 13, 2022 18:38
@tarang-jain tarang-jain requested review from a team as code owners June 13, 2022 18:38
Member

@cjnolet cjnolet left a comment

Looking good so far! A couple small(-ish) suggestions about the general design, mostly for maintainability.

int algo_vd;
if (metric == raft::distance::Precomputed) {
algo_vd = 2;
} else if (metric == raft::distance::CosineExpanded) {
Member

Rather than duplicating the call to the epsilon neighborhood primitive for the cosine case, I'd prefer to pass the metric through directly when metric != precomputed and normalize the input conditionally in the case where metric == cosine.
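
One way to picture this suggestion is a single neighborhood routine that takes the metric as a parameter and only normalizes when the metric is cosine. The following is a pure-Python sketch under that assumption; `eps_neighborhood` is a hypothetical stand-in, not the RAFT epsilon neighborhood primitive or its signature:

```python
import math

def eps_neighborhood(points, eps, metric="euclidean"):
    """Return adjacency lists: adj[i] holds every j with dist(i, j) <= eps.

    Hypothetical helper: one code path for both metrics. Cosine is handled
    by normalizing the rows first and converting eps to a squared-L2 bound.
    Assumes no row is the zero vector when metric is "cosine".
    """
    if metric == "cosine":
        rows = []
        for p in points:
            norm = math.sqrt(sum(x * x for x in p))
            rows.append([x / norm for x in p])
        threshold_sq = 2.0 * eps  # ||a - b||^2 = 2 * cosine distance
    elif metric == "euclidean":
        rows = points
        threshold_sq = eps * eps
    else:
        raise ValueError(f"unsupported metric: {metric}")

    # Shared brute-force comparison on squared L2 distance.
    adj = []
    for a in rows:
        adj.append([
            j for j, b in enumerate(rows)
            if sum((x - y) ** 2 for x, y in zip(a, b)) <= threshold_sq
        ])
    return adj

points = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0]]
cosine_adj = eps_neighborhood(points, eps=0.01, metric="cosine")
euclid_adj = eps_neighborhood(points, eps=0.5, metric="euclidean")
```

The first two points have nearly the same direction but different lengths, so they are neighbors under cosine and not under Euclidean distance. Only the preprocessing and the threshold differ between the two branches; the comparison loop is shared.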

value_t eps2 = 2 * data.eps;

rmm::device_uvector<value_t> rowNorms(m, stream);
rmm::device_uvector<value_t> l2Normalized(m * n, stream);
Member

We have two options for supporting cosine: either we normalize the input or we perform the normalization in the computation. If we normalize the input, we should do so directly on the input and then revert the values afterwards, because allocating a second copy of the input, as done here, is very expensive.
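
A rough Python sketch of the normalize-in-place-then-revert idea, assuming mutable rows and no zero-norm rows (the function names are hypothetical). Only the per-row norms are stored, so the extra memory is one value per row rather than a full copy of the matrix:

```python
import math

def normalize_rows_inplace(rows):
    """Divide each row by its L2 norm in place; return the norms for later."""
    norms = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        norms.append(norm)
        for k in range(len(row)):
            row[k] /= norm
    return norms

def restore_rows_inplace(rows, norms):
    """Undo normalize_rows_inplace by multiplying each row by its saved norm."""
    for row, norm in zip(rows, norms):
        for k in range(len(row)):
            row[k] *= norm

data = [[3.0, 4.0], [0.1, 0.2]]
original = [row[:] for row in data]  # copy kept only to check the round trip

norms = normalize_rows_inplace(data)
# ... the epsilon-neighborhood computation would run on `data` here ...
restore_rows_inplace(data, norms)
```

Because the round trip divides and then multiplies in floating point, the restored values are only guaranteed to be close to the originals, not bit-identical.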

Contributor Author

There is a problem with attempting to modify the input: the output array address in raft::linalg::matrixVectorOp cannot be a const float *. Note that data.x is of the type const float *.

@@ -47,6 +48,9 @@ void run(const raft::handle_t& handle,
case 2:
Precomputed::launcher<Type_f, Index_>(handle, data, start_vertex_id, batch_size, stream);
break;
case 3:
Cosine::launcher<Type_f, Index_>(handle, data, start_vertex_id, batch_size, stream);
Member

As mentioned above, I prefer to use the same "launcher" as the other metrics and pass the metric in directly. This will also make it much easier to support other metrics in the future, rather than having to duplicate the launcher each time.

Contributor Author

So should we also pass algo_vd as an argument to the launcher?

Member

I don't think so. Ideally Algo::launcher would just accept the distance type.

@@ -147,7 +147,7 @@ class DBSCAN(Base,
min_samples : int (default = 5)
The number of samples in a neighborhood such that this group can be
considered as an important core point (including the point itself).
-    metric: {'euclidean', 'precomputed'}, default = 'euclidean'
+    metric: {'euclidean', 'precomputed', 'cosine'}, default = 'euclidean'
Member

Just a little nitpick: I would prefer if we kept precomputed either before or after the actual distance metrics for clarity. We should also add a little note to the docs here that the input will be modified temporarily when cosine distance is used (and might not match completely afterwards due to numerical rounding).

@@ -107,6 +107,41 @@ def test_dbscan_precomputed(datatype, nrows, max_mbytes_per_batch, out_dtype):
algorithm="brute")
sk_labels = sk_dbscan.fit_predict(X_dist)

print("cu_labels:", cu_labels)
Member

Reminder to remove debug prints from tests

@@ -267,6 +267,7 @@ class DBSCAN(Base,
"L2": DistanceType.L2SqrtUnexpanded,
"euclidean": DistanceType.L2SqrtUnexpanded,
"precomputed": DistanceType.Precomputed,
"cosine": DistanceType.CosineExpanded
Member

Also move this, maybe after euclidean, for readability.

@@ -0,0 +1,90 @@
/*
* Copyright (c) 2018-2022, NVIDIA CORPORATION.
Member

We should only use the current year for new files.

Contributor Author

So should this be:
Copyright (c) 2021-2022, NVIDIA CORPORATION ?

Member

No, for new files it would just be 2022.

@cjnolet cjnolet added the improvement and non-breaking labels Jun 16, 2022
Member

cjnolet commented Jun 16, 2022

@tarang-jain tarang-jain reopened this Jun 16, 2022
Member

@cjnolet cjnolet left a comment

LGTM. Pending CI

Member

cjnolet commented Jun 29, 2022

@tarang-jain can you merge the current upstream (22.08) into your PR branch? The CI failures here have since been fixed.

Member

cjnolet commented Jul 7, 2022

rerun tests

Member

cjnolet commented Jul 7, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c8aebc3 into rapidsai:branch-22.08 Jul 7, 2022
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023

Authors:
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4776
Labels: CUDA/C++, Cython / Python, improvement, non-breaking
Development

Successfully merging this pull request may close these issues.

[FEA] Add Cosine Distance metric to DBSCAN