[REVIEW] HDBSCAN #3821

Merged
merged 209 commits into from
Jun 3, 2021
Changes from 199 commits
c0a1b8b
HDBSCAN
cjnolet Apr 20, 2021
3800f76
Fixing style
cjnolet Apr 20, 2021
362d429
Progress
cjnolet Apr 21, 2021
abf2267
Checking in
cjnolet Apr 22, 2021
c68bd8d
Checking in todo stubs of remaining work for initial implementation
cjnolet Apr 22, 2021
78a8432
Creating output condensed hierarchy
cjnolet Apr 22, 2021
b502e1d
Resizing rmm arrays
cjnolet Apr 22, 2021
146ac00
Adding missing semicolon
cjnolet Apr 22, 2021
3cee98b
initial ideas to compute stabilities
divyegala Apr 23, 2021
60253ce
Checking in
cjnolet Apr 23, 2021
e4668d3
Updating
cjnolet Apr 23, 2021
1e9f557
separating kernels into detail
divyegala Apr 23, 2021
0331697
Merge branch 'fea-020-hdbscan' of github.com:cjnolet/cuml into fea-02…
divyegala Apr 23, 2021
fd77b48
allocating stabilities through caller
divyegala Apr 23, 2021
7edb4a3
cleaning stabilities a bit
divyegala Apr 23, 2021
0efd58b
Initial stub for eom is done.
cjnolet Apr 23, 2021
bbb35db
Adding data to epilogue
cjnolet Apr 23, 2021
83d8ac3
COuple small updates
cjnolet Apr 23, 2021
f563aca
Adding get_stability scores
cjnolet Apr 23, 2021
bbe1ffe
tests building
divyegala Apr 23, 2021
d135645
cleaning up some comments
cjnolet Apr 23, 2021
0fce1bb
Fixing style
cjnolet Apr 23, 2021
7052836
merging
divyegala Apr 23, 2021
b6048ce
merge and compile
divyegala Apr 23, 2021
f05d545
Updates to get hdbscan to compile
cjnolet Apr 23, 2021
1628a36
Getting remaining stuff in extract_clusters to compile
cjnolet Apr 23, 2021
463f757
Using zip iterators for copy_if
cjnolet Apr 23, 2021
dcc83db
compiling compute stabilities
divyegala Apr 23, 2021
4829b21
stabilities and merge building
divyegala Apr 23, 2021
8814181
Sorry!
cjnolet Apr 23, 2021
025c5d4
Renaming and moving a few things around
cjnolet Apr 23, 2021
5335252
Debugging mutual reachability through gtest
cjnolet Apr 26, 2021
2844a11
Mutual reachability and dendrogram re-labeling executes
cjnolet Apr 26, 2021
d4dfc07
Fixing style
cjnolet Apr 26, 2021
c8c9e55
Condensed hierarchy executes without error. Output needs to be verifi…
cjnolet Apr 26, 2021
5619959
Fixing formatting
cjnolet Apr 26, 2021
624fdb0
Some light prints show the condensing is at least performing the bfs.
cjnolet Apr 26, 2021
de4569c
Fixing style
cjnolet Apr 26, 2021
6cc8d74
condensed clusters
cjnolet Apr 27, 2021
4865e64
Some updates. Fixing off by 1 error
cjnolet Apr 27, 2021
aa87e5b
correcting stabilities calculation
divyegala Apr 27, 2021
1532e2a
Adding gtest for condensing
cjnolet Apr 27, 2021
88716d3
Pushing
cjnolet Apr 27, 2021
269d828
condensing *should* be correct
cjnolet Apr 27, 2021
de0c7b3
Cleaning up cluster condensing. Should be mostly correct
cjnolet Apr 27, 2021
741df66
more updates
divyegala Apr 28, 2021
52563df
Merge branch 'fea-020-hdbscan' of github.com:cjnolet/cuml into fea-02…
divyegala Apr 28, 2021
5095f9f
Filling in union find & host-based labeling
cjnolet Apr 28, 2021
c389a7a
Fixing style
cjnolet Apr 28, 2021
0b5d3e7
monotonic parent child and stabilities running
divyegala Apr 28, 2021
071b899
merge
divyegala Apr 28, 2021
13c8672
Adding gtest for eom
cjnolet Apr 28, 2021
1cc4dbf
Adding gtest for eom
cjnolet Apr 28, 2021
2f17cf8
updated rmm, so removed ->on(stream)
divyegala Apr 29, 2021
a9d1058
working stabilities gtest
divyegala Apr 29, 2021
68a9344
Adding iris to cluster condensing tests
cjnolet Apr 29, 2021
c3dbd8c
Code path through stabilities, eom, and labeling seems to work.
cjnolet Apr 29, 2021
89deff8
Merge branch 'branch-0.20' into fea-020-hdbscan
cjnolet Apr 29, 2021
b143a55
Fixing reachability
cjnolet Apr 29, 2021
dcd590c
Adding support for max_cluster_size
cjnolet Apr 29, 2021
699183d
Making labels monotonic
cjnolet Apr 29, 2021
e78b48a
sketching probabilities
divyegala Apr 29, 2021
b6a85ef
Merge branch 'fea-020-hdbscan' of github.com:cjnolet/cuml into fea-02…
divyegala Apr 29, 2021
ea759ef
passing probabilities gtest
divyegala Apr 30, 2021
0818d10
fixing compute-sanitizer issue in excess of mass
cjnolet Apr 30, 2021
998c8d4
Cleaning up API for inputs and outputs
cjnolet Apr 30, 2021
fb67441
Fixing memory error
cjnolet Apr 30, 2021
e000f41
cluster epsilon search start
divyegala May 1, 2021
f41330a
C++ changes for python wrapper
cjnolet May 3, 2021
3bb021e
Addiing hdbscan python wrapper and empty classes to aid in plotting.
cjnolet May 3, 2021
ebfa61d
C++ changes for python wrapper
cjnolet May 3, 2021
ce9f809
Merge branch 'fea-020-hdbscan' into fea-020-hdbscan_py
cjnolet May 3, 2021
21c39ae
C++ changes
cjnolet May 3, 2021
9088549
Cython changes
cjnolet May 3, 2021
069b1b5
C++ changes
cjnolet May 3, 2021
f26ca59
Merge branch 'fea-020-hdbscan' into fea-020-hdbscan_py
cjnolet May 3, 2021
2f624ef
Using rmm allocator directly in thrust for now
cjnolet May 3, 2021
38734d6
Using rmm allocator directly in thrust for now
cjnolet May 3, 2021
1f7c9ad
Merge branch 'fea-020-hdbscan' into fea-020-hdbscan_py
cjnolet May 3, 2021
4a5b426
Fixing python style
cjnolet May 3, 2021
90d7de6
Fixing cpp style
cjnolet May 3, 2021
8d560a4
Merge branch 'fea-020-hdbscan' into fea-020-hdbscan_py
cjnolet May 3, 2021
8408c78
merge
divyegala May 3, 2021
babf10b
Merge branch 'fea-020-hdbscan' of github.com:cjnolet/cuml into fea-02…
divyegala May 3, 2021
ee3c07b
introducing cluster tree creation
divyegala May 4, 2021
5f11ea3
correcting eom test with cluster tree
divyegala May 4, 2021
3694173
Adding labeling for robust single linkage (and dbscan)
cjnolet May 4, 2021
f6cb0b5
Merge branch 'fea-020-hdbscan' into fea-020-hdbscan_py
cjnolet May 4, 2021
8266f3c
stubbing out robust single linkage
cjnolet May 4, 2021
7332199
Small changes to support robust single linkage
cjnolet May 4, 2021
a4b560c
cluster selection travel upwards working
divyegala May 4, 2021
6a6a9df
separating bfs out
divyegala May 4, 2021
93c106c
added cluster negation to epsilon search
divyegala May 4, 2021
ee90094
merge
divyegala May 4, 2021
b7d67dd
style check
divyegala May 4, 2021
bec62e7
Separating extract into smaller files.
cjnolet May 4, 2021
e063a68
styl
cjnolet May 4, 2021
01c9904
Merge branch 'fea-020-hdbscan' into fea-020-hdbscan_py
cjnolet May 4, 2021
7b57cd0
leaf selection method
divyegala May 4, 2021
0ef2b86
Cleaning up prints for hdbscan and fixing small segfault
cjnolet May 5, 2021
e9a6a4e
Updates to python
cjnolet May 5, 2021
5e7b850
Cleaning up prints for hdbscan and fixing small segfault
cjnolet May 5, 2021
4f45468
Testing updates
cjnolet May 5, 2021
3ae88eb
Checking in
cjnolet May 5, 2021
5ac0a8f
merging hdbscan cpp
divyegala May 5, 2021
affeb86
Merge branch 'fea-020-hdbscan_py' of github.com:cjnolet/cuml into fea…
divyegala May 5, 2021
00b0b39
Working through debugging
cjnolet May 5, 2021
a0caa05
Working through debugging
cjnolet May 5, 2021
8ff536b
Fixign swapped args
cjnolet May 5, 2021
acd4c5a
Updates to pytest
cjnolet May 5, 2021
c097722
C++ changes
cjnolet May 5, 2021
1b7a638
Adding debugging info
cjnolet May 5, 2021
150b76e
Updates to C++
cjnolet May 5, 2021
1d94a6a
Python / cython updates
cjnolet May 5, 2021
afe8c60
C++ updates for debugging
cjnolet May 5, 2021
433afe9
Python updates
cjnolet May 5, 2021
6e6b2d7
fixing stabilities computation for root
divyegala May 5, 2021
cbfc359
Merge branch 'fea-020-hdbscan_py' of github.com:cjnolet/cuml into fea…
divyegala May 5, 2021
4e4f93e
make monotonic filter noise
divyegala May 5, 2021
c7fdaf4
C++ changes
cjnolet May 6, 2021
8fa37a8
Python changes
cjnolet May 6, 2021
c9717b3
sorting condensed parents and children
divyegala May 6, 2021
9032027
merge
divyegala May 6, 2021
a153abd
Flipiing
cjnolet May 6, 2021
d63bb72
+ changes
cjnolet May 6, 2021
d35db43
pytthon changes
cjnolet May 6, 2021
baffb8b
C++ updates
cjnolet May 6, 2021
70c18ed
Python updates
cjnolet May 6, 2021
8efd3f2
Properly propagating lambda during cluster condensing so noise points…
cjnolet May 6, 2021
463f76f
Python updates
cjnolet May 6, 2021
f969ee8
Removing prints
cjnolet May 6, 2021
342f805
Adding different hyperparams to pytests for easier debugging.
cjnolet May 6, 2021
c999ecd
Adding additional hyperparams to pytest
cjnolet May 6, 2021
13b5314
Adding allow single cluster to params
cjnolet May 6, 2021
4b4e47e
C++ changes
cjnolet May 7, 2021
518cb79
Updates to unify logic for RSL and HDBSCAN. Added pytest for common c…
cjnolet May 7, 2021
c1eb5ff
C++ updates, fixing root cluster case. Leaf selection method fixes
cjnolet May 7, 2021
a20e13f
Correspoding pytest updates
cjnolet May 7, 2021
584487d
Debugging info
cjnolet May 10, 2021
1f20381
Updating python test
cjnolet May 10, 2021
addc71e
Verified excess of mass
cjnolet May 10, 2021
ef4c3af
Updating tests.
cjnolet May 10, 2021
4f2f99b
debugging through stabilities
divyegala May 11, 2021
9b4bd9d
pytest updates
cjnolet May 11, 2021
b93870a
pytest updates
cjnolet May 11, 2021
6c4977d
Adding mst generated from reference impl to condensing gtest
cjnolet May 11, 2021
46ca653
Adding c++ gtest
cjnolet May 11, 2021
b3457c5
Updates to pytest
cjnolet May 11, 2021
daae422
Querying for additional points for mutual reachability
cjnolet May 12, 2021
414788c
Updates to pytest to over-query knn graph for mutual reachability
cjnolet May 12, 2021
29000f2
testing through mst and knn
divyegala May 12, 2021
9d6daf2
Updates to pytests
cjnolet May 12, 2021
acba5e8
Removing knn print
cjnolet May 12, 2021
8fce905
Adding condensed tree for plotting
cjnolet May 12, 2021
372ccc5
Plotting min span tree and dendrogram
cjnolet May 12, 2021
d556d47
Pulling plots fom tests
cjnolet May 12, 2021
8beee00
merge
divyegala May 13, 2021
98800dc
Moving HDBSCAN and robust single linkage to experimental
cjnolet May 13, 2021
d8f1c19
Checking in cpp stuff
cjnolet May 14, 2021
a531ae1
Checkking in pytest
cjnolet May 14, 2021
0e4fb0b
New knn is running but neighborhoods are not yet completely correct
cjnolet May 15, 2021
915d34d
Pytest
cjnolet May 15, 2021
2b3d229
C++ changes
cjnolet May 17, 2021
5964cab
pytest changes
cjnolet May 17, 2021
0ac5b69
intermediate merge
divyegala May 17, 2021
69a48aa
more merge
divyegala May 17, 2021
9d181e1
testing leaf method
divyegala May 17, 2021
d14dd38
c++ updates
cjnolet May 17, 2021
27f4c7b
Python updates
cjnolet May 17, 2021
4da0fe6
Updates to c++
cjnolet May 19, 2021
cca3222
Style updates (c++)
cjnolet May 19, 2021
7fdad14
Supporting n_clusters =1. Adding test
cjnolet May 4, 2021
f1caec2
Fixing python style
cjnolet May 19, 2021
cf835fe
Merge branch 'branch-0.20' into fea-020-hdbscan_py
cjnolet May 19, 2021
3ba064c
Cleaining up debug prints
cjnolet May 19, 2021
e1f8bb1
Copyright checkjer
cjnolet May 20, 2021
76b392e
debuging epsilong search
divyegala May 20, 2021
7b2c082
testing all parameters epsilon
divyegala May 20, 2021
166dc04
Adding doxygen docs
cjnolet May 20, 2021
0cd10c8
Fixing build errors
cjnolet May 20, 2021
64e3475
no need to sort for probabilities since condensed hierarchy is sorted…
divyegala May 20, 2021
a475062
separating kernels out
divyegala May 20, 2021
8db6d14
Fixing pytest for hdbscan
cjnolet May 20, 2021
4f72996
Updates to cython
cjnolet May 20, 2021
e4a0139
Using sizeof and renaming stabilities to clustr_persistence
cjnolet May 20, 2021
1da875d
refactor gtests and get assertions running
divyegala May 21, 2021
0dfb9d1
Merge branch 'fea-020-hdbscan_py' of github.com:cjnolet/cuml into fea…
divyegala May 21, 2021
2ad39bc
cluster selection epsilon for eom gtest
divyegala May 21, 2021
8e302db
style check
divyegala May 21, 2021
6fc604f
allow_single_cluster=True gtest for eom
divyegala May 21, 2021
b1f551b
finalizing hdbscan gtests
divyegala May 24, 2021
54b8c10
digits end-to-end gtest
divyegala May 24, 2021
a6ac3e7
Review feedback so far
cjnolet May 24, 2021
70d19f3
Removing robust single linkage for 21.06. Will add in a future release
cjnolet May 24, 2021
fe55fb9
Review updates
cjnolet May 24, 2021
5e5d99f
review feedback
divyegala May 26, 2021
9bdcbf6
Merge branch 'fea-020-hdbscan_py' of github.com:cjnolet/cuml into fea…
divyegala May 26, 2021
804bc9c
more review comments
divyegala May 26, 2021
e94d54d
merge upstream
divyegala May 26, 2021
03ce6d3
remove cudaMalloc from cub function
divyegala May 26, 2021
85b2282
Final review items
cjnolet May 27, 2021
9c38782
not running cluster condensing gtest
divyegala May 27, 2021
af28c9f
Merge branch 'branch-21.06' of github.com:rapidsai/cuml into fea-020-…
divyegala May 27, 2021
6769356
Updates based on final review items.
cjnolet May 28, 2021
d143f05
Adding hdbscan to build dependencies (for pytests)
cjnolet May 29, 2021
13dca59
Removing hdbscan change from HDBSCAN PR
cjnolet Jun 1, 2021
095dcea
Fixing docs and addig param
cjnolet Jun 2, 2021
444cffb
Fixing base test
cjnolet Jun 3, 2021
1519478
Fixing doxygen
cjnolet Jun 3, 2021
2 changes: 2 additions & 0 deletions cpp/CMakeLists.txt
@@ -225,6 +225,8 @@ if(BUILD_CUML_CPP_LIBRARY)
src/glm/glm.cu
src/genetic/genetic.cu
src/genetic/node.cu
src/hdbscan/hdbscan.cu
src/hdbscan/condensed_hierarchy.cu
src/holtwinters/holtwinters.cu
src/kmeans/kmeans.cu
src/knn/knn.cu
4 changes: 2 additions & 2 deletions cpp/cmake/thirdparty/get_raft.cmake
@@ -25,8 +25,8 @@ function(find_and_configure_raft)
BUILD_EXPORT_SET cuml-exports
INSTALL_EXPORT_SET cuml-exports
CPM_ARGS
- GIT_REPOSITORY https://github.com/${PKG_FORK}/raft.git
- GIT_TAG ${PKG_PINNED_TAG}
+ GIT_REPOSITORY https://github.com/cjnolet/raft.git
+ GIT_TAG fea-020-hdbscan
SOURCE_SUBDIR cpp
OPTIONS
"BUILD_TESTS OFF"
301 changes: 301 additions & 0 deletions cpp/include/cuml/cluster/hdbscan.hpp
@@ -0,0 +1,301 @@
/*
* Copyright (c) 2018-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <raft/linalg/distance_type.h>
#include <raft/handle.hpp>

#include <rmm/device_uvector.hpp>

#include <cstddef>

namespace ML {
namespace HDBSCAN {
namespace Common {

template <typename value_idx, typename value_t>
class CondensedHierarchy {
public:
/**
* Constructs an empty condensed hierarchy object which requires
* condense() to be called in order to populate the state.
* @param handle_ raft handle for ordering cuda operations
* @param n_leaves_ number of leaves (data points)
*/
CondensedHierarchy(const raft::handle_t &handle_, size_t n_leaves_);

/**
* Constructs a condensed hierarchy object with existing arrays
* which already contain a condensed hierarchy.
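* @param handle_ raft handle for ordering cuda operations
* @param n_leaves_ number of leaves (data points)
* @param n_edges_ number of edges in the condensed hierarchy
* @param parents_ parents array of the condensed hierarchy
* @param children_ children array of the condensed hierarchy
* @param lambdas_ lambdas array of the condensed hierarchy
* @param sizes_ sizes array of the condensed hierarchy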
*/
CondensedHierarchy(const raft::handle_t &handle_, size_t n_leaves_,
int n_edges_, value_idx *parents_, value_idx *children_,
value_t *lambdas_, value_idx *sizes_);

/**
* Constructs a condensed hierarchy object by moving
* rmm::device_uvector. Used to construct cluster trees
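* @param handle_ raft handle for ordering cuda operations
* @param n_leaves_ number of leaves (data points)
* @param n_edges_ number of edges in the condensed hierarchy
* @param n_clusters_ number of clusters in the condensed hierarchy
* @param parents_ parents array of the condensed hierarchy
* @param children_ children array of the condensed hierarchy
* @param lambdas_ lambdas array of the condensed hierarchy
* @param sizes_ sizes array of the condensed hierarchy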
*/
CondensedHierarchy(const raft::handle_t &handle_, size_t n_leaves_,
int n_edges_, int n_clusters_,
rmm::device_uvector<value_idx> &&parents_,
rmm::device_uvector<value_idx> &&children_,
rmm::device_uvector<value_t> &&lambdas_,
rmm::device_uvector<value_idx> &&sizes_);
/**
* Populates the condensed hierarchy object with the output
* from Condense::build_condensed_hierarchy. First, it reverses
* values in the parent array since root has the largest value.
* Then, it makes the combined parent and children arrays monotonic.
* Finally, it sorts the lambda array as values, using
* (parent, children, sizes) as the key.
* @param full_parents parents array from build_condensed_hierarchy
* @param full_children children array from build_condensed_hierarchy
* @param full_lambdas lambdas array from build_condensed_hierarchy
* @param full_sizes sizes array from build_condensed_hierarchy
*/
void condense(value_idx *full_parents, value_idx *full_children,
value_t *full_lambdas, value_idx *full_sizes,
value_idx size = -1);

value_idx get_cluster_tree_edges();

value_idx *get_parents() { return parents.data(); }
value_idx *get_children() { return children.data(); }
value_t *get_lambdas() { return lambdas.data(); }
value_idx *get_sizes() { return sizes.data(); }
value_idx get_n_edges() { return n_edges; }
int get_n_clusters() { return n_clusters; }
value_idx get_n_leaves() const { return n_leaves; }

private:
const raft::handle_t &handle;

rmm::device_uvector<value_idx> parents;
rmm::device_uvector<value_idx> children;
rmm::device_uvector<value_t> lambdas;
rmm::device_uvector<value_idx> sizes;

size_t n_edges;
size_t n_leaves;
int n_clusters;
value_idx root_cluster;
};

enum CLUSTER_SELECTION_METHOD { EOM = 0, LEAF = 1 };

class RobustSingleLinkageParams {
public:
int k = 5;
int min_samples = 5;
int min_cluster_size = 5;
int max_cluster_size = 0;

float cluster_selection_epsilon = 0.0;

bool allow_single_cluster = false;

float alpha = 1.0;
};

class HDBSCANParams : public RobustSingleLinkageParams {
public:
CLUSTER_SELECTION_METHOD cluster_selection_method =
CLUSTER_SELECTION_METHOD::EOM;
};

/**
* Container object for output information common between
* robust single linkage variants.
* @tparam value_idx
* @tparam value_t
*/
template <typename value_idx, typename value_t>
class robust_single_linkage_output {
public:
/**
* Construct output object with empty device arrays of
* known size.
* @param handle_ raft handle for ordering cuda operations
* @param n_leaves_ number of data points
* @param labels_ labels array on device (size n_leaves)
* @param children_ dendrogram src/dst array (size n_leaves - 1, 2)
* @param sizes_ dendrogram cluster sizes array (size n_leaves - 1)
* @param deltas_ dendrogram distances array (size n_leaves - 1)
* @param mst_src_ min spanning tree source array (size n_leaves - 1)
* @param mst_dst_ min spanning tree destination array (size n_leaves - 1)
* @param mst_weights_ min spanning tree distances array (size n_leaves - 1)
*/
robust_single_linkage_output(const raft::handle_t &handle_, int n_leaves_,
value_idx *labels_, value_idx *children_,
value_idx *sizes_, value_t *deltas_,
value_idx *mst_src_, value_idx *mst_dst_,
value_t *mst_weights_)
: handle(handle_),
n_leaves(n_leaves_),
n_clusters(0),
labels(labels_),
children(children_),
sizes(sizes_),
deltas(deltas_),
mst_src(mst_src_),
mst_dst(mst_dst_),
mst_weights(mst_weights_) {}

int get_n_leaves() const { return n_leaves; }
int get_n_clusters() const { return n_clusters; }
value_idx *get_labels() { return labels; }
value_idx *get_children() { return children; }
value_idx *get_sizes() { return sizes; }
value_t *get_deltas() { return deltas; }
value_idx *get_mst_src() { return mst_src; }
value_idx *get_mst_dst() { return mst_dst; }
value_t *get_mst_weights() { return mst_weights; }

/**
* The number of clusters is set by the algorithm once it is known.
* @param n_clusters_ number of resulting clusters
*/
void set_n_clusters(int n_clusters_) { n_clusters = n_clusters_; }

protected:
const raft::handle_t &get_handle() { return handle; }

const raft::handle_t &handle;

int n_leaves;
int n_clusters;

value_idx *labels; // size n_leaves

// Dendrogram
value_idx *children; // size n_leaves * 2
value_idx *sizes; // size n_leaves
value_t *deltas; // size n_leaves

// MST (size n_leaves - 1).
value_idx *mst_src;
value_idx *mst_dst;
value_t *mst_weights;
};

/**
* Plain old container object to consolidate output
* arrays. This object is intentionally kept simple
* and straightforward in order to ease its use
* in the Python layer. For this reason, the MST
* arrays and renumbered dendrogram array, as well
* as its aggregated distances/cluster sizes, are
* kept separate. The condensed hierarchy is computed
* and populated in a separate object because its size
* is not known ahead of time. An RMM device vector is
* held privately and stabilities initialized explicitly
* since that size is also not known ahead of time.
* @tparam value_idx
* @tparam value_t
*/
template <typename value_idx, typename value_t>
class hdbscan_output : public robust_single_linkage_output<value_idx, value_t> {
public:
hdbscan_output(const raft::handle_t &handle_, int n_leaves_,
value_idx *labels_, value_t *probabilities_,
value_idx *children_, value_idx *sizes_, value_t *deltas_,
value_idx *mst_src_, value_idx *mst_dst_,
value_t *mst_weights_)
: robust_single_linkage_output<value_idx, value_t>(
handle_, n_leaves_, labels_, children_, sizes_, deltas_, mst_src_,
mst_dst_, mst_weights_),
probabilities(probabilities_),
stabilities(0, handle_.get_stream()),
condensed_tree(handle_, n_leaves_) {}

// Using getters here, making the members private and forcing
// consistent state with the constructor. This should make
// it much easier to use / debug.
value_t *get_probabilities() { return probabilities; }
value_t *get_stabilities() {
ASSERT(stabilities.size() > 0, "stabilities needs to be initialized");
return stabilities.data();
}

/**
* Once n_clusters is known, the stabilities array
* can be initialized.
* @param n_clusters_
*/
void set_n_clusters(int n_clusters_) {
robust_single_linkage_output<value_idx, value_t>::set_n_clusters(
n_clusters_);
stabilities.resize(
n_clusters_,
robust_single_linkage_output<value_idx, value_t>::get_handle()
.get_stream());
}

CondensedHierarchy<value_idx, value_t> &get_condensed_tree() {
return condensed_tree;
}

private:
value_t *probabilities; // size n_leaves

// Size not known ahead of time. Resized
// by `set_n_clusters()` once n_clusters is known.
rmm::device_uvector<value_t> stabilities;

// Use condensed hierarchy to wrap
// condensed tree outputs since we do not
// know the size ahead of time.
CondensedHierarchy<value_idx, value_t> condensed_tree;
};

template class CondensedHierarchy<int, float>;

} // namespace Common
} // namespace HDBSCAN

/**
* Executes HDBSCAN clustering on an mxn-dimensional input array, X.
* @param[in] handle raft handle for resource reuse
* @param[in] X array (size m, n) on device in row-major format
* @param[in] m number of rows in X
* @param[in] n number of columns in X
* @param[in] metric distance metric to use
* @param[in] params struct of configuration hyper-parameters
* @param[out] out struct of output data and arrays on device
*/
void hdbscan(const raft::handle_t &handle, const float *X, size_t m, size_t n,
raft::distance::DistanceType metric,
HDBSCAN::Common::HDBSCANParams &params,
HDBSCAN::Common::hdbscan_output<int, float> &out);
} // END namespace ML
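
Note on `CondensedHierarchy::condense()`: the doc comment in hdbscan.hpp describes three post-processing steps (reverse the parent values, make the combined parent/children ids monotonic, sort by key). The host-only sketch below is one possible reading of those steps, not the code in this PR, which runs on the device. The `CondensedEdges` struct, the `relabel_and_sort` and `example` functions, and the id convention (ids below `n_leaves` are points, ids at or above `n_leaves` are clusters, the root has the largest raw id) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <numeric>
#include <tuple>
#include <vector>

// Hypothetical host-side stand-in for the four condensed-hierarchy arrays.
struct CondensedEdges {
  std::vector<int> parents;
  std::vector<int> children;
  std::vector<float> lambdas;
  std::vector<int> sizes;
};

// Assumed convention: ids < n_leaves are data points, ids >= n_leaves are
// clusters, and the root cluster carries the largest raw id.
CondensedEdges relabel_and_sort(CondensedEdges e, int n_leaves) {
  assert(e.parents.size() == e.children.size() &&
         e.parents.size() == e.lambdas.size() &&
         e.parents.size() == e.sizes.size());

  // Steps 1 and 2: gather the distinct cluster ids in descending order so the
  // root comes first, then map them to n_leaves, n_leaves + 1, ...
  std::vector<int> ids(e.parents);
  for (int c : e.children)
    if (c >= n_leaves) ids.push_back(c);
  std::sort(ids.begin(), ids.end(), std::greater<int>());
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());

  std::map<int, int> remap;
  for (std::size_t i = 0; i < ids.size(); ++i)
    remap[ids[i]] = n_leaves + static_cast<int>(i);

  for (int &p : e.parents) p = remap.at(p);
  for (int &c : e.children)
    if (c >= n_leaves) c = remap.at(c);  // leaf children keep their ids

  // Step 3: sort edge indices by the (parent, child, size) key and carry the
  // lambda values along as the payload.
  std::vector<std::size_t> order(e.parents.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return std::tie(e.parents[a], e.children[a], e.sizes[a]) <
           std::tie(e.parents[b], e.children[b], e.sizes[b]);
  });

  CondensedEdges out;
  for (std::size_t i : order) {
    out.parents.push_back(e.parents[i]);
    out.children.push_back(e.children[i]);
    out.lambdas.push_back(e.lambdas[i]);
    out.sizes.push_back(e.sizes[i]);
  }
  return out;
}

// Toy input: 4 points, raw cluster ids 9 (root), 7 and 5.
CondensedEdges example() {
  CondensedEdges raw{{7, 9, 5, 9, 7, 5},
                     {0, 7, 2, 5, 1, 3},
                     {0.5f, 0.1f, 0.8f, 0.1f, 0.6f, 0.9f},
                     {1, 2, 1, 2, 1, 1}};
  return relabel_and_sort(raw, 4);
}
```

Calling `example()` from a `main` relabels the raw ids 9, 7, 5 to 4, 5, 6 and returns parents `{4, 4, 5, 5, 6, 6}` with children `{5, 6, 0, 1, 2, 3}`. Sorting an index permutation rather than the arrays themselves keeps the four columns aligned, which is roughly what a key/value sort does on the device.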
1 change: 1 addition & 0 deletions cpp/include/cuml/cluster/linkage.hpp
@@ -18,6 +18,7 @@

#include <raft/linalg/distance_type.h>
#include <raft/sparse/hierarchy/common.h>
#include <raft/handle.hpp>

namespace raft {
class handle_t;