Improvements to DBSCAN fitting #301

Merged
merged 54 commits into from
Feb 16, 2024

nickjcroucher
Collaborator

Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:

  • Indistinct clustering criterion: the previous check was very strict (the within-strain and between-strain clusters had to be separated on both axes) and rejected some sensible fits. It is now relaxed so that separation is required on only one axis. Let me know if you think the stricter option should still be available through a flag, or if you're happy with the change across the board.
  • Slow fitting to large datasets: I have implemented a GPU version of DBSCAN, which is fast. The bottleneck then becomes assigning all distances. Because the model takes up a lot of GPU memory, batches of distances have to be copied into the variable amount of GPU memory that remains (customisable with the new --assign-subsample option), and this negates the speed-up of the initial fit.
  • Slow assignment of distances to the model fit: this step is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. I have added a --no-assign flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points.
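
To make the first point concrete, here is a minimal sketch of the difference between the strict and relaxed criteria. It is not the code in this PR: the function names, the `strict` argument, and the definition of "separated" (every within-strain point lies below every between-strain point on an axis) are assumptions chosen for illustration only.

```python
import numpy as np


def separated_axes(within, between):
    """Per-axis check (illustrative definition, not the PR's code).

    Returns a boolean array with one entry per axis, True where the
    largest within-strain distance is smaller than the smallest
    between-strain distance on that axis.
    """
    within = np.asarray(within, dtype=float)
    between = np.asarray(between, dtype=float)
    return within.max(axis=0) < between.min(axis=0)


def clusters_distinct(within, between, strict=False):
    """Hypothetical helper: strict needs separation on both axes,
    relaxed needs separation on at least one axis."""
    sep = separated_axes(within, between)
    return bool(sep.all()) if strict else bool(sep.any())


# Toy (core, accessory) distances: the clusters are separated on the
# core axis but overlap on the accessory axis.
within = np.array([[0.001, 0.10], [0.002, 0.30]])
between = np.array([[0.010, 0.20], [0.020, 0.40]])

relaxed_ok = clusters_distinct(within, between)              # one axis is enough
strict_ok = clusters_distinct(within, between, strict=True)  # both axes required
```

With these toy values the relaxed check accepts the fit and the strict check rejects it, which is the kind of case the change is meant to let through.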

If you approve these changes conceptually, then I'll add tests and docs. At the moment, local tests fail on the mandrake clustering step. I don't know if this is related to the failing tests for mandrake or to a local installation problem, so I will see what the CI outcomes are. Hence the slightly early-stage PR, sorry!

@nickjcroucher nickjcroucher marked this pull request as draft February 16, 2024 10:31
@nickjcroucher nickjcroucher changed the base branch from master to gpu_dbscan February 16, 2024 14:55
@nickjcroucher nickjcroucher marked this pull request as ready for review February 16, 2024 14:55
@nickjcroucher nickjcroucher merged commit c080e2a into bacpop:gpu_dbscan Feb 16, 2024
1 check passed