-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Allow cosine distance metric in dbscan #4776
[FEA] Allow cosine distance metric in dbscan #4776
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far! A couple small(-ish) suggestions about the general design, mostly for maintainability.
cpp/src/dbscan/dbscan.cuh
Outdated
int algo_vd; | ||
if (metric == raft::distance::Precomputed) { | ||
algo_vd = 2; | ||
} else if (metric == raft::distance::CosineExpanded) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than duplicating the call to the epsilon neighborhood primitive for the cosine case, I'd prefer to pass the metric through directly when metric != precomputed
and normalize the input conditionally in the case where metric == cosine
.
cpp/src/dbscan/vertexdeg/cosine.cuh
Outdated
value_t eps2 = 2 * data.eps; | ||
|
||
rmm::device_uvector<value_t> rowNorms(m, stream); | ||
rmm::device_uvector<value_t> l2Normalized(m * n, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two options for supporting cosine- either we normalize the input or we perform the normalization in the computation. If we normalize the input, we should do so directly to the input and then revert the values back afterwords because this is very expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a problem with attempting to modify the input: the output array address in raft::linalg::matrixVectorOp cannot be a const float *. Note that data.x is of the type const float *.
cpp/src/dbscan/vertexdeg/runner.cuh
Outdated
@@ -47,6 +48,9 @@ void run(const raft::handle_t& handle, | |||
case 2: | |||
Precomputed::launcher<Type_f, Index_>(handle, data, start_vertex_id, batch_size, stream); | |||
break; | |||
case 3: | |||
Cosine::launcher<Type_f, Index_>(handle, data, start_vertex_id, batch_size, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, I prefer to use the same "launcher" as the other metrics and pass the metric in directly. This will also make it much easier to support other metrics in the future, rather than having to duplicate the launcher each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should we also pass algo_vd
as an argument to the launcher?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Ideally Algo::launcher
would just accept the distance type.
python/cuml/cluster/dbscan.pyx
Outdated
@@ -147,7 +147,7 @@ class DBSCAN(Base, | |||
min_samples : int (default = 5) | |||
The number of samples in a neighborhood such that this group can be | |||
considered as an important core point (including the point itself). | |||
metric: {'euclidean', 'precomputed'}, default = 'euclidean' | |||
metric: {'euclidean', 'precomputed', 'cosine'}, default = 'euclidean' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a little nitpick: I would prefer if we kept precomputed
either before or after the actual distance metrics for clarity. We should also add a little note to the docs here that the input will be modified temporarily when cosine distance is used (and might not match completely afterwards due to numerical rounding).
python/cuml/tests/test_dbscan.py
Outdated
@@ -107,6 +107,41 @@ def test_dbscan_precomputed(datatype, nrows, max_mbytes_per_batch, out_dtype): | |||
algorithm="brute") | |||
sk_labels = sk_dbscan.fit_predict(X_dist) | |||
|
|||
print("cu_labels:", cu_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder to remove debug prints from tests
python/cuml/cluster/dbscan.pyx
Outdated
@@ -267,6 +267,7 @@ class DBSCAN(Base, | |||
"L2": DistanceType.L2SqrtUnexpanded, | |||
"euclidean": DistanceType.L2SqrtUnexpanded, | |||
"precomputed": DistanceType.Precomputed, | |||
"cosine": DistanceType.CosineExpanded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also move this, maybe after euclidean
, for readability.
cpp/src/dbscan/vertexdeg/cosine.cuh
Outdated
@@ -0,0 +1,90 @@ | |||
/* | |||
* Copyright (c) 2018-2022, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only use the current year for new files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should this be:
Copyright (c) 2021-2022, NVIDIA CORPORATION
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, for new files it would just be 2022
.
@tarang-jain I think this PR is waiting on a style fix: https://gpuci.gpuopenanalytics.com/blue/organizations/jenkins/rapidsai%2Fgpuci%2Fcuml%2Fprb%2Fcuml-style/detail/cuml-style/10390/pipeline#log-119 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Pending CI
@tarang-jain can you merge the current upstream (22.08) into your PR branch? The CI failures here have since been fixed. |
…cosine-distance-metric-DBSCAN
rerun tests |
@gpucibot merge |
closes rapidsai#4210 Added cosine distance metric for computing epsilon neighborhood in DBSCAN. The cosine distance computed as L2 norm of L2 normalized vectors and the epsilon value is adjusted accordingly. Authors: - Tarang Jain (https://github.com/tarang-jain) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#4776
closes #4210
Added cosine distance metric for computing epsilon neighborhood in DBSCAN. The cosine distance computed as L2 norm of L2 normalized vectors and the epsilon value is adjusted accordingly.