
[FEA] reduce memory pressure in membership vector computation #5268

Merged

Conversation

@tarang-jain tarang-jain commented Mar 13, 2023

Batching is done only while computing the pairwise distance matrix between the dataset and the set of exemplar points.

Closes #4879
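The batching idea described above can be sketched in plain NumPy. This is an illustrative sketch only: the PR implements it in CUDA/C++ inside cuML, and the function name `batched_pairwise_dist`, its signature, and the Euclidean metric choice are assumptions for the example, not the PR's actual code.

```python
import numpy as np

def batched_pairwise_dist(X, exemplars, batch_size=1000):
    """Compute the (n_samples, n_exemplars) Euclidean distance matrix
    in row batches, so only a (batch_size, n_exemplars) block of
    intermediate differences is materialized at a time."""
    n = X.shape[0]
    out = np.empty((n, exemplars.shape[0]), dtype=X.dtype)
    for start in range(0, n, batch_size):
        batch = X[start:start + batch_size]
        # squared Euclidean distances for this batch only
        d2 = ((batch[:, None, :] - exemplars[None, :, :]) ** 2).sum(-1)
        out[start:start + batch_size] = np.sqrt(d2)
    return out
```

The peak temporary memory now scales with `batch_size * n_exemplars * n_features` rather than `n_samples * n_exemplars * n_features`, which is the memory saving the PR targets.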

@tarang-jain tarang-jain requested a review from a team as a code owner March 17, 2023 19:43
      break;
    case raft::distance::DistanceType::CosineExpanded:
      raft::distance::
        distance<raft::distance::DistanceType::CosineExpanded, value_t, value_t, value_t, int>(
-         handle, X, exemplars_dense.data(), dist.data(), m, n_exemplars, n, true);
+         handle, query + batch_offset * n, exemplars_dense.data(), dist.data(), samples_per_batch, n_exemplars, n, true);
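The `batch_offset` / `samples_per_batch` bookkeeping in the diff above can be illustrated with a small Python generator. The names are borrowed from the C++ snippet; the actual loop bounds in cuML may differ.

```python
def batch_ranges(n_rows, batch_size):
    # Yield (batch_offset, samples_per_batch) pairs covering all rows;
    # the final batch may be smaller than batch_size.
    for offset in range(0, n_rows, batch_size):
        yield offset, min(batch_size, n_rows - offset)
```

Each pair selects the row block starting at `query + batch_offset * n` (row-major, `n` features per row) containing `samples_per_batch` rows.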
Member left a comment

Very nice! This is a good solution to provide some significant savings in memory.

@cjnolet cjnolet left a comment

These changes LGTM at a glance but I'm going to request changes for now and give a more in-depth review after #5247 is merged.

@tarang-jain tarang-jain force-pushed the fea-new-reduce-memory-pressure-apmv branch from 91c2d6a to 7a95bfe Compare March 29, 2023 22:41
@cjnolet cjnolet left a comment

Looking good! Just a couple minor things, really.

had ``prediction_data=True`` set.

batch_size : int, optional, default=0
Distance-based membership is computed in batches to fit on the device. If not specified, or set to 0, distance-based membership is computed at once for all points in the training data.
Member left a comment
I think this could be worded a little bit better to highlight the purpose of this setting, maybe mention "points in the training data" closer to the beginning? Something like "Lowers memory requirement by computing distance-based membership in smaller batches of points in the training data. Batch size of 0 uses all of the training points, batch size of 1000 computes distances for 1000 points at a time."

n_prediction_points,
clusterer.min_samples,
_metrics_mapping[clusterer.metric],
<float*> membership_vec_ptr,
Member left a comment
Just to avoid any potential future issues, we should probably validate that this is actually positive (in Python) before we pass it down.
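A minimal sketch of the kind of Python-side check being requested. The helper name and the "0 means all points" convention are taken from the docstring above; this is not cuML's actual validation code.

```python
def validate_batch_size(batch_size, n_rows):
    # Reject non-integers and negatives before the value reaches the
    # C++/CUDA layer, where a bad size could cause subtler failures.
    if not isinstance(batch_size, int) or batch_size < 0:
        raise ValueError(
            f"batch_size must be a non-negative int, got {batch_size!r}")
    # Per the docstring, 0 means "compute membership for all points at once".
    return n_rows if batch_size == 0 else min(batch_size, n_rows)
```

Clamping to `n_rows` also keeps an oversized batch size from allocating more than a single full-size pass would.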

@tarang-jain tarang-jain added labels non-breaking (Non-breaking change) and feature request (New feature or request) on Mar 31, 2023
@cjnolet cjnolet commented Mar 31, 2023

/merge

@rapids-bot rapids-bot bot merged commit aeb01ea into rapidsai:branch-23.04 Mar 31, 2023
Labels: CMake, CUDA/C++, Cython / Python (Cython or Python issue), feature request (New feature or request), non-breaking (Non-breaking change)
Development

Successfully merging this pull request may close these issues.

[FEA] Reduce memory pressure in HDBSCAN all_points_membership_vectors (or provide/link to best practices)
2 participants