[WIP] HDBSCAN Clustering #3546

Closed
wants to merge 85 commits into from
85 commits
43a8118
Checking in
cjnolet Dec 15, 2020
335e1f9
Getting MST to return results
cjnolet Dec 15, 2020
8dc58ba
Still trying to figure out why MST isn't returning expected results
cjnolet Dec 15, 2020
cef178a
Adding symmetrization to linkage
cjnolet Dec 16, 2020
1f00cbf
Fixing style
cjnolet Dec 16, 2020
209733b
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Dec 16, 2020
957192f
Test is executing end-to-end, need to verify results
cjnolet Dec 17, 2020
20e32e8
Adding new symmetrization
cjnolet Dec 17, 2020
107b34d
Checking in
cjnolet Dec 25, 2020
5dd72ba
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Jan 11, 2021
3fe27c7
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Jan 11, 2021
48fbd1a
Adding final cluster extraction
cjnolet Jan 12, 2021
98dbf47
Fixing style
cjnolet Jan 12, 2021
567ab06
Fixing symmetrization bug
cjnolet Jan 12, 2021
7635b25
Output matches sklearn
cjnolet Jan 12, 2021
f3f45eb
Updating include check for test
cjnolet Jan 12, 2021
02125d6
Fixing style
cjnolet Jan 12, 2021
c8f1a58
Checking in
cjnolet Jan 12, 2021
20f57a4
Cleaning up logging
cjnolet Jan 12, 2021
3b08d79
Updating raft commit
cjnolet Jan 13, 2021
3e95e4b
Fixing style
cjnolet Jan 13, 2021
9525b80
Fixing style
cjnolet Jan 13, 2021
33b1948
Cleaning up log statements & fixing small bug in inherit_labels
cjnolet Jan 13, 2021
2ad0580
Adding test for outliers
cjnolet Jan 13, 2021
067ca4b
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Jan 14, 2021
c69d965
Updating test cmakelists
cjnolet Jan 14, 2021
7f78358
Adding benchmark for linkage
cjnolet Jan 14, 2021
e608d2b
Changes from benchmarking
cjnolet Jan 15, 2021
0a9cf0d
Updating changes
cjnolet Jan 22, 2021
f012f35
Checking in
cjnolet Jan 27, 2021
5f12acf
Fixing c++ style
cjnolet Feb 8, 2021
e63ad1d
Updating raft hash
cjnolet Feb 8, 2021
35d8a23
Removing sparse prims since they've been moved to raft
cjnolet Feb 9, 2021
5636caa
Merge branch 'branch-0.18' into imp-019-remove_sparse_prims
cjnolet Feb 9, 2021
31f1afc
Updating copyrights
cjnolet Feb 9, 2021
d564696
Updating raft hash
cjnolet Feb 10, 2021
2dfb89d
Merge branch 'branch-0.18' into imp-019-remove_sparse_prims
cjnolet Feb 10, 2021
5ca044b
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 10, 2021
cae115a
Setting libcumprims to 0.18 for now
cjnolet Feb 10, 2021
9492296
Getting a start on connected knn graph construction
cjnolet Feb 10, 2021
a27ce76
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 11, 2021
46af3a6
Making progress on fix connectivities
cjnolet Feb 11, 2021
4e58700
Making progress
cjnolet Feb 11, 2021
cefe5e8
Getting there
cjnolet Feb 12, 2021
7f2ce4e
Making progress on connectivity fixing
cjnolet Feb 16, 2021
ba24734
Checking in
cjnolet Feb 18, 2021
367b3c4
Debugging knn graph impl
cjnolet Feb 18, 2021
1de0f5b
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 18, 2021
54d1288
Very close.
cjnolet Feb 19, 2021
433cdea
knn graph connection algorithm runs end to end.
cjnolet Feb 20, 2021
560048b
Fixing style
cjnolet Feb 20, 2021
f51a078
Style update
cjnolet Feb 20, 2021
27eab25
Removing HDBSCAN to isolate changeset to SLHC
cjnolet Feb 22, 2021
390d2a5
Revert "Removing HDBSCAN to isolate changeset to SLHC"
cjnolet Feb 22, 2021
87f4a87
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 22, 2021
1f4b90d
Merge branch 'imp-019-remove_sparse_prims' into fea-019-slhc
cjnolet Feb 22, 2021
d7e2d31
Fixing style
cjnolet Feb 22, 2021
903e85f
updating import
cjnolet Feb 22, 2021
c16012e
Fixes
cjnolet Feb 23, 2021
105424a
Using fused l2 nn from raft
cjnolet Feb 24, 2021
0c8663b
Fixing style
cjnolet Feb 24, 2021
4e37c30
Updating copyright
cjnolet Feb 24, 2021
0f158a8
Fixing style
cjnolet Feb 24, 2021
9f34bbd
Updating copyright years
cjnolet Feb 24, 2021
309ef27
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 24, 2021
8a081d2
Merge branch 'imp-019-remove_sparse_prims' into imp-019-use_raft_fuse…
cjnolet Feb 24, 2021
4bf102b
Merge branch 'imp-019-use_raft_fused_l2_nn' into fea-019-slhc
cjnolet Feb 24, 2021
70148d5
Updating raft hash so ci will build
cjnolet Feb 24, 2021
e546579
Using raft hash to make CI build
cjnolet Feb 24, 2021
538891c
Moving cumlprims conda recipe back to minor_version
cjnolet Feb 24, 2021
3b43aad
Merge branch 'branch-0.19' into imp-019-use_raft_fused_l2_nn
cjnolet Mar 3, 2021
279d631
Updating style
cjnolet Mar 3, 2021
e0c9b1d
Updating raft hash to point to my branch until raft pr is merged
cjnolet Mar 3, 2021
8f0f709
Merge branch 'imp-19-use_raft_fused_l2_nn_2' into fea-019-slhc
cjnolet Mar 3, 2021
8652239
Removing tests that are no longer needed
cjnolet Mar 4, 2021
23893c3
Merge branch 'imp-19-use_raft_fused_l2_nn_2' into fea-019-slhc
cjnolet Mar 4, 2021
032887d
Merge branch 'branch-0.19' into fea-019-slhc
cjnolet Mar 4, 2021
427ebc6
Updating raft hash to branch-0.19
cjnolet Mar 4, 2021
72b3f10
Merge remote-tracking branch 'rapids/branch-0.19' into imp-19-use_raf…
cjnolet Mar 5, 2021
2039567
Updating raft hash
cjnolet Mar 6, 2021
7121a1c
Merge branch 'branch-0.19' into imp-19-use_raft_fused_l2_nn_2
cjnolet Mar 11, 2021
2ca9f49
Merge branch 'imp-19-use_raft_fused_l2_nn_2' into fea-019-slhc
cjnolet Mar 15, 2021
26b53db
Merge branch 'fea-019-slhc' into fea-019-hdbscan
cjnolet Mar 15, 2021
ebde06c
Removing fix_connectivities since that's already in raft
cjnolet Mar 15, 2021
4317f5a
Merge branch 'fea-019-slhc' into fea-019-hdbscan
cjnolet Mar 15, 2021
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -424,6 +424,7 @@ if(BUILD_CUML_CPP_LIBRARY)
src/kmeans/kmeans.cu
src/knn/knn.cu
src/knn/knn_sparse.cu
+src/hierarchy/linkage.cu
src/metrics/accuracy_score.cu
src/metrics/adjusted_rand_index.cu
src/metrics/completeness_score.cu
1 change: 1 addition & 0 deletions cpp/bench/CMakeLists.txt
@@ -25,6 +25,7 @@ if(BUILD_CUML_BENCH)
sg/arima_loglikelihood.cu
sg/dbscan.cu
sg/kmeans.cu
+sg/linkage.cu
sg/main.cpp
sg/rf_classifier.cu
# FIXME: RF Regressor is having an issue where the tests now seem to take
13 changes: 7 additions & 6 deletions cpp/bench/prims/fused_l2_nn.cu
@@ -1,5 +1,5 @@
/*
-* Copyright (c) 2019-2020, NVIDIA CORPORATION.
+* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -15,8 +15,8 @@
*/

#include <raft/cudart_utils.h>
-#include <distance/fused_l2_nn.cuh>
#include <limits>
+#include <raft/distance/fused_l2_nn.cuh>
#include <raft/linalg/norm.cuh>
#include <raft/random/rng.cuh>
#include "../common/ml_benchmark.hpp"
@@ -52,7 +52,7 @@ struct FusedL2NN : public Fixture {
raft::linalg::rowNorm(yn, y, params.k, params.n, raft::linalg::L2Norm, true,
stream);
auto blks = raft::ceildiv(params.m, 256);
-MLCommon::Distance::initKernel<T, cub::KeyValuePair<int, T>, int>
+raft::distance::initKernel<T, cub::KeyValuePair<int, T>, int>
<<<blks, 256, 0, stream>>>(out, params.m, std::numeric_limits<T>::max(),
op);
}
@@ -69,9 +69,9 @@ struct FusedL2NN : public Fixture {
void runBenchmark(::benchmark::State& state) override {
loopOnState(state, [this]() {
// it is enough to only benchmark the L2-squared metric
-MLCommon::Distance::fusedL2NN<T, cub::KeyValuePair<int, T>, int>(
+raft::distance::fusedL2NN<T, cub::KeyValuePair<int, T>, int>(
out, x, y, xn, yn, params.m, params.n, params.k, (void*)workspace, op,
-false, false, stream);
+pairRedOp, false, false, stream);
});
}

@@ -80,7 +80,8 @@ struct FusedL2NN : public Fixture {
T *x, *y, *xn, *yn;
cub::KeyValuePair<int, T>* out;
int* workspace;
-MLCommon::Distance::MinAndDistanceReduceOp<int, T> op;
+raft::distance::KVPMinReduce<int, T> pairRedOp;
+raft::distance::MinAndDistanceReduceOp<int, T> op;
}; // struct FusedL2NN

static std::vector<FLNParams> getInputs() {
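For readers unfamiliar with the primitive being benchmarked here, the following is a minimal host-side sketch of what a fused L2 nearest-neighbor search computes: for each row of `x`, the index and squared L2 distance of the closest row of `y`, using precomputed row norms and the expanded form ||x||² + ||y||² − 2·x·y. The names `row_sq_norms` and `fused_l2_nn_reference` are hypothetical and are not part of raft or cuML; the real kernel runs on the GPU and fuses the distance computation with the min reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Squared L2 norm of every row of a row-major (rows x k) matrix.
std::vector<float> row_sq_norms(const std::vector<float>& a, std::size_t rows,
                                std::size_t k) {
  std::vector<float> norms(rows, 0.0f);
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t d = 0; d < k; ++d) norms[i] += a[i * k + d] * a[i * k + d];
  return norms;
}

// Hypothetical host reference (NOT the raft API): for each of the m rows of x,
// returns (index, squared distance) of the nearest of the n rows of y.
std::vector<std::pair<int, float>> fused_l2_nn_reference(
    const std::vector<float>& x, const std::vector<float>& y, std::size_t m,
    std::size_t n, std::size_t k) {
  const std::vector<float> xn = row_sq_norms(x, m, k);
  const std::vector<float> yn = row_sq_norms(y, n, k);
  // Start every output at (-1, max) so any real distance replaces it.
  std::vector<std::pair<int, float>> out(
      m, std::make_pair(-1, std::numeric_limits<float>::max()));
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float dot = 0.0f;
      for (std::size_t d = 0; d < k; ++d) dot += x[i * k + d] * y[j * k + d];
      // Expanded squared L2 distance built from the precomputed norms.
      const float dist = xn[i] + yn[j] - 2.0f * dot;
      // Min reduction over (index, distance) pairs.
      if (dist < out[i].second) out[i] = std::make_pair(static_cast<int>(j), dist);
    }
  }
  return out;
}
```

A reference like this is mainly useful for checking GPU output on small inputs; it makes no attempt to match the kernel's tie-breaking or floating-point rounding.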
103 changes: 103 additions & 0 deletions cpp/bench/sg/linkage.cu
@@ -0,0 +1,103 @@
/*
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <raft/linalg/distance_type.h>
#include <raft/sparse/hierarchy/common.h>
#include <cuml/cluster/linkage.hpp>
#include <cuml/common/logger.hpp>
#include <cuml/cuml.hpp>
#include <utility>
#include "benchmark.cuh"

namespace ML {
namespace Bench {
namespace linkage {

struct Params {
DatasetParams data;
BlobsParams blobs;
};

template <typename D>
class Linkage : public BlobsFixture<D> {
public:
Linkage(const std::string& name, const Params& p)
: BlobsFixture<D>(name, p.data, p.blobs) {}

protected:
void runBenchmark(::benchmark::State& state) override {
using MLCommon::Bench::CudaEventTimer;
if (!this->params.rowMajor) {
state.SkipWithError("Single-Linkage only supports row-major inputs");
}

this->loopOnState(state, [this]() {
out_arrs.labels = labels;
out_arrs.children = out_children;

printf("RUNNING!\n");

Logger::get().setLevel(CUML_LEVEL_WARN);
ML::single_linkage_pairwise(
*this->handle, this->data.X, this->params.nrows, this->params.ncols,
raft::distance::DistanceType::L2Expanded, &out_arrs, 50, 50);
});
}

void allocateTempBuffers(const ::benchmark::State& state) override {
this->alloc(labels, this->params.nrows);
this->alloc(out_children, (this->params.nrows - 1) * 2);
}

void deallocateTempBuffers(const ::benchmark::State& state) override {
this->dealloc(labels, this->params.nrows);
this->dealloc(out_children, (this->params.nrows - 1) * 2);
}

private:
int* labels;
int* out_children;
raft::hierarchy::linkage_output<int, D> out_arrs;
};

std::vector<Params> getInputs() {
std::vector<Params> out;
Params p;
p.data.rowMajor = true;
p.blobs.cluster_std = 5.0;
p.blobs.shuffle = false;
p.blobs.center_box_min = -10.0;
p.blobs.center_box_max = 10.0;
p.blobs.seed = 12345ULL;
std::vector<std::pair<int, int>> rowcols = {
{35000, 128}, {16384, 128}, {12288, 128}, {8192, 128}, {4096, 128},
};
for (auto& rc : rowcols) {
p.data.nrows = rc.first;
p.data.ncols = rc.second;
for (auto nclass : std::vector<int>({1})) {
p.data.nclasses = nclass;
out.push_back(p);
}
}
return out;
}

ML_BENCH_REGISTER(Params, Linkage<float>, "blobs", getInputs());

} // namespace linkage
} // end namespace Bench
} // end namespace ML
2 changes: 1 addition & 1 deletion cpp/cmake/Dependencies.cmake
@@ -39,7 +39,7 @@ else(DEFINED ENV{RAFT_PATH})

ExternalProject_Add(raft
GIT_REPOSITORY https://github.com/rapidsai/raft.git
-GIT_TAG 4a79adcb0c0e87964dcdc9b9122f242b5235b702
+GIT_TAG 6455e05b3889db2b495cf3189b33c2b07bfbebf2
PREFIX ${RAFT_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
65 changes: 65 additions & 0 deletions cpp/include/cuml/cluster/hdbscan.hpp
@@ -0,0 +1,65 @@
/*
* Copyright (c) 2018-2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <raft/linalg/distance_type.h>

#include <cuml/common/logger.hpp>
#include <cuml/cuml.hpp>

namespace ML {

template <typename value_idx, typename value_t>
struct hdbscan_output {
value_idx m;
value_idx *labels; // size: m
value_t *probabilities; // size: m

value_idx *mst_src; // size: m-1
value_idx *mst_dst; // size: m-1
value_t *mst_data; // size: m-1

value_idx *linkage_parents; // size: m
value_idx *linkage_children; // size: m
value_idx *linkage_deltas; // size: m
value_idx *linkage_sizes; // size: m
};

struct hdbscan_output_float : public hdbscan_output<int, float> {};

/**
* @defgroup HdbscanCpp C++ implementation of the HDBSCAN algorithm
* @{
* @brief Fits an HDBSCAN model on an input feature matrix and outputs the labels,
* dendrogram, and minimum spanning tree.

* @param[in] handle
* @param[in] X
* @param[in] m
* @param[in] n
* @param[in] metric
* @param[in] k
* @param[in] min_pts
* @param[in] alpha
* @param[out] out
*/
void hdbscan(const raft::handle_t &handle, const float *X, int m, int n,
raft::distance::DistanceType metric, int k, int min_pts,
float alpha, hdbscan_output<int, float> *out);

/** @} */

} // namespace ML
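Because `hdbscan_output` holds raw pointers, the caller has to size every buffer before calling `hdbscan`. The sketch below mirrors the sizes given in the struct's comments (`m` for labels, probabilities, and the linkage arrays; `m - 1` for the MST arrays). It is a host-side illustration only: `hdbscan_output_view` and `host_hdbscan_buffers` are hypothetical names, the field spelling `linkage_parents` is assumed, and a real caller would presumably allocate device memory rather than `std::vector` storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative copy of the fields declared in the header (hypothetical type).
template <typename value_idx, typename value_t>
struct hdbscan_output_view {
  value_idx m;
  value_idx* labels;         // size: m
  value_t* probabilities;    // size: m
  value_idx* mst_src;        // size: m-1
  value_idx* mst_dst;        // size: m-1
  value_t* mst_data;         // size: m-1
  value_idx* linkage_parents;   // size: m
  value_idx* linkage_children;  // size: m
  value_idx* linkage_deltas;    // size: m
  value_idx* linkage_sizes;     // size: m
};

// Hypothetical owner that sizes each buffer and hands out a pointer view.
template <typename value_idx, typename value_t>
struct host_hdbscan_buffers {
  std::vector<value_idx> labels;
  std::vector<value_t> probabilities;
  std::vector<value_idx> mst_src, mst_dst;
  std::vector<value_t> mst_data;
  std::vector<value_idx> linkage_parents, linkage_children, linkage_deltas,
      linkage_sizes;

  explicit host_hdbscan_buffers(std::size_t m)
      : labels(m),
        probabilities(m),
        // A spanning tree over m points has m - 1 edges (0 when m == 0).
        mst_src(m > 0 ? m - 1 : 0),
        mst_dst(m > 0 ? m - 1 : 0),
        mst_data(m > 0 ? m - 1 : 0),
        linkage_parents(m),
        linkage_children(m),
        linkage_deltas(m),
        linkage_sizes(m) {}

  // Non-owning view; valid only while this object is alive and unmoved.
  hdbscan_output_view<value_idx, value_t> view() {
    hdbscan_output_view<value_idx, value_t> v;
    v.m = static_cast<value_idx>(labels.size());
    v.labels = labels.data();
    v.probabilities = probabilities.data();
    v.mst_src = mst_src.data();
    v.mst_dst = mst_dst.data();
    v.mst_data = mst_data.data();
    v.linkage_parents = linkage_parents.data();
    v.linkage_children = linkage_children.data();
    v.linkage_deltas = linkage_deltas.data();
    v.linkage_sizes = linkage_sizes.data();
    return v;
  }
};
```

Keeping ownership in one object and passing only the view avoids mismatched allocation and deallocation sizes.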
60 changes: 60 additions & 0 deletions cpp/include/cuml/cluster/linkage.hpp
@@ -0,0 +1,60 @@
/*
* Copyright (c) 2018-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <raft/linalg/distance_type.h>
#include <raft/sparse/hierarchy/common.h>

#include <cuml/cuml.hpp>

namespace ML {

/**
* @defgroup LinkageCpp C++ implementation of single-linkage clustering
* @{
* @brief Computes a single-linkage hierarchical clustering of an input feature
* matrix and outputs the labels and dendrogram.
* TODO: Use a separate type to represent number of edges so we can scale up
* number of edges without having to use 64-bit ints for vertices.

* @param[in] handle
* @param[in] X
* @param[in] m
* @param[in] n
* @param[in] metric
* @param[out] out
*/
void single_linkage_pairwise(const raft::handle_t &handle, const float *X,
size_t m, size_t n,
raft::distance::DistanceType metric,
raft::hierarchy::linkage_output<int, float> *out,
int c = 15, int n_clusters = 5);

void single_linkage_neighbors(const raft::handle_t &handle, const float *X,
size_t m, size_t n,
raft::distance::DistanceType metric,
raft::hierarchy::linkage_output<int, float> *out,
int c = 15, int n_clusters = 5);

void single_linkage_pairwise(
const raft::handle_t &handle, const float *X, size_t m, size_t n,
raft::distance::DistanceType metric,
raft::hierarchy::linkage_output<int64_t, float> *out, int c = 15,
int n_clusters = 5);

/** @} */

} // namespace ML
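As a point of comparison for the declarations above, here is a small host-side sketch of single-linkage clustering built from a minimum spanning tree: sort all pairwise edges, merge components Kruskal-style with a union-find, record each merge as a pair of children, and stop merging for the flat labels once `n_clusters` components remain. This is an O(m² log m) reference under stated assumptions, not the GPU implementation in this PR. `LinkageResult`, `find_root`, and `single_linkage_reference` are hypothetical names, the merge numbering (new node id = `m + merge index`) follows the convention scikit-learn uses, and the `c` parameter of the real API is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct LinkageResult {
  std::vector<int> children;  // (m - 1) * 2 entries, one pair per merge
  std::vector<float> deltas;  // m - 1 merge distances (squared L2)
  std::vector<int> labels;    // m flat cluster labels
};

// Union-find root lookup with path halving.
int find_root(std::vector<int>& parent, int v) {
  while (parent[v] != v) {
    parent[v] = parent[parent[v]];
    v = parent[v];
  }
  return v;
}

// X is row-major (m x n); assumes 1 <= n_clusters <= m.
LinkageResult single_linkage_reference(const std::vector<float>& X,
                                       std::size_t m, std::size_t n,
                                       int n_clusters) {
  struct Edge { float w; int a; int b; };
  std::vector<Edge> edges;
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = i + 1; j < m; ++j) {
      float w = 0.0f;
      for (std::size_t d = 0; d < n; ++d) {
        const float diff = X[i * n + d] - X[j * n + d];
        w += diff * diff;
      }
      edges.push_back({w, static_cast<int>(i), static_cast<int>(j)});
    }
  }
  // Stable sort keeps ties in (i, j) generation order for determinism.
  std::stable_sort(edges.begin(), edges.end(),
                   [](const Edge& l, const Edge& r) { return l.w < r.w; });

  std::vector<int> parent(m), node_id(m), flat(m);
  std::iota(parent.begin(), parent.end(), 0);    // full dendrogram union-find
  std::iota(node_id.begin(), node_id.end(), 0);  // dendrogram node per root
  std::iota(flat.begin(), flat.end(), 0);        // union-find cut at n_clusters

  LinkageResult res;
  std::size_t merges = 0;
  for (const Edge& e : edges) {
    const int ra = find_root(parent, e.a);
    const int rb = find_root(parent, e.b);
    if (ra == rb) continue;  // edge is not part of the spanning tree
    res.children.push_back(node_id[ra]);
    res.children.push_back(node_id[rb]);
    res.deltas.push_back(e.w);
    // m - merges components exist now; keep merging flat labels until
    // only n_clusters remain.
    if (m - merges > static_cast<std::size_t>(n_clusters)) {
      const int fa = find_root(flat, e.a);
      const int fb = find_root(flat, e.b);
      flat[fb] = fa;
    }
    parent[rb] = ra;
    node_id[ra] = static_cast<int>(m + merges);
    ++merges;
  }

  // Relabel flat roots to 0..n_clusters-1 in order of first appearance.
  res.labels.assign(m, -1);
  std::vector<int> root_label(m, -1);
  int next = 0;
  for (std::size_t i = 0; i < m; ++i) {
    const int r = find_root(flat, static_cast<int>(i));
    if (root_label[r] < 0) root_label[r] = next++;
    res.labels[i] = root_label[r];
  }
  return res;
}
```

The `children` array has `(m - 1) * 2` entries, which is the same size the benchmark in this PR allocates for `out_children`.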
2 changes: 1 addition & 1 deletion cpp/src/dbscan/adjgraph/algo.cuh
@@ -23,7 +23,7 @@
#include "../common.cuh"
#include "pack.h"

-#include <sparse/convert/csr.cuh>
+#include <raft/sparse/convert/csr.cuh>

using namespace thrust;

2 changes: 1 addition & 1 deletion cpp/src/dbscan/runner.cuh
@@ -21,7 +21,7 @@
#include <cuml/common/device_buffer.hpp>
#include <label/classlabels.cuh>
#include <raft/cuda_utils.cuh>
-#include <sparse/csr.cuh>
+#include <raft/sparse/csr.cuh>
#include "adjgraph/runner.cuh"
#include "corepoints/compute.cuh"
#include "corepoints/exchange.cuh"
36 changes: 36 additions & 0 deletions cpp/src/hdbscan/hdbscan.cu
@@ -0,0 +1,36 @@
/*
* Copyright (c) 2018-2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cuml/cuml_api.h>
#include <common/cumlHandle.hpp>
#include <cuml/cluster/hdbscan.hpp>

#include <hdbscan/runner.h>

namespace ML {

template <typename value_idx = int64_t, typename value_t = float>
void hdbscan(const raft::handle_t &handle, value_t *X, size_t m, size_t n,
raft::distance::DistanceType metric, int k, int min_pts,
float alpha, hdbscan_output<value_idx, value_t> *out) {
HDBSCAN::_fit<value_idx, value_t>(handle, X, m, n, metric, k, min_pts, alpha);
}

void hdbscan(const raft::handle_t &handle, const float *X, size_t m, size_t n,
raft::distance::DistanceType metric, int k, int min_pts,
float alpha, hdbscan_output<int, float> *out);

} // end namespace ML
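In this work-in-progress file, the templated `hdbscan` does not appear to forward `out` to `HDBSCAN::_fit`, and the `float` overload is only declared (with `size_t` dimensions, where the public header uses `int`), so nothing defines the symbol the header promises. One way the wiring could look is sketched below with stand-in types. Everything in the `sketch` namespace is hypothetical, including the placeholder `detail::fit`, which simply marks every point as noise; the actual `_fit` signature lives in `hdbscan/runner.h`, which is not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Stand-in for hdbscan_output, reduced to two fields for illustration.
template <typename value_idx, typename value_t>
struct output {
  value_idx m;
  value_idx* labels;  // size: m
};

namespace detail {
// Placeholder for the real templated fit: labels every point as noise (-1).
template <typename value_idx, typename value_t>
void fit(const value_t* X, std::size_t m, std::size_t n,
         output<value_idx, value_t>* out) {
  (void)X;  // unused in this placeholder
  (void)n;
  out->m = static_cast<value_idx>(m);
  for (std::size_t i = 0; i < m; ++i) out->labels[i] = -1;
}
}  // namespace detail

// Non-template entry point. It is defined (not just declared), its parameter
// types match what a public header would declare, and it forwards `out`.
void hdbscan(const float* X, int m, int n, output<int, float>* out) {
  detail::fit<int, float>(X, static_cast<std::size_t>(m),
                          static_cast<std::size_t>(n), out);
}

}  // namespace sketch
```

Keeping the template in an implementation namespace and exposing only concrete overloads lets the public header stay free of CUDA-specific includes while the `.cu` file instantiates the types it supports.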