Skip to content

Density-Based Spatial Clustering of Applications with Noise

License

Notifications You must be signed in to change notification settings

superliuxz/DBSCAN

Repository files navigation

DBSCAN

My multi-threaded, GPU-enabled implementation of Density-Based Spatial Clustering of Applications with Noise.

The GPU implementation has achieved >20x speedup against de facto SKLearn; multi-threaded CPU implementation is 40% faster than SKLearn with one thread, and scalable with multiple threads.

image info

image info

image info

Requirements

C++17; GCC 7.5; CUDA 10.2; Thrust; pthread.

How to build

  • For the first time: git submodule update --init --recursive

Build main

  • cmake -DCMAKE_BUILD_TYPE=None -Bbuild -H.
    • For cpu-main
      • set environment variable AVX=1 to enable AVX;
      • set environment variable BIT_ADJ=1 if on average, each vertex has more than |V|/64 number of neighbours.
    • For gpu-main
      • Modify gpu/CMakeLists.txt, change the architecture code to fit your hardware.
  • cmake --build build --target cpu-main/gpu-main

Build tests

  • cmake -DCMAKE_BUILD_TYPE=Debug -Bbuild -H.
    • For cpu-test set AVX=1 and BIT_ADJ=1 correspondingly.
    • For gpu-test modify gpu/CMakeLists.txt correspondingly.
  • cmake --build build --target cpu-test/gpu-test

How to run

Existing test data under test_inputs/.

  • [optional] Generate an example using the Python script: python3 generate_dateset.py --n-samples=20000 --cluster-std=3.0

    • 20,000 points with 3.0 stddiv;
  • [optional] Cluster the input and visualize it using Sklearn: python3 dbscan.py --input-name=test_input.txt --eps=0.1 --min-pts=12

    • 0.1 radius and 12 neighbour points for clustering.

CPU algorithm

  • ./build/bin/cpu-main --input=<path_to_input> --eps=<eps> --min-pts=<P>.
    • Append --print to see the cluster ids.
    • Append --num-threads=K to speed up the processing.

GPU algorithm

  • ./build/bin/gpu-main --input=<path_to_input> --eps=<eps> --min-pts=<P>.
    • Append --print to see the cluster ids.

Test data

The query parameters are selected to emulate 10% Noises.

test_input_20k.txt is generated using --cluster-std=2.2 --n-samples=20000, with 4 clusters. Should use --eps=0.15 --min-pts=180 to query.

test_input_50k.txt is generated using --cluster-std=2.2 --n-samples=50000, with 20 clusters. Should use --eps=0.09 --min-pts=110 to query.

test_input_100k.txt is generated using --cluster-std=2.2 --n-samples=100000, with 20 clusters. Should use --eps=0.07 --min-pts=100 to query.

test_input_200k.txt is generated using --cluster-std=2.2 --n-samples=200000, with 20 clusters. Should use --eps=0.05 --min-pts=100 to query.

test_input_300k.txt is generated using --cluster-std=2.2 --n-samples=300000, with 20 clusters. Should use --eps=0.044 --min-pts=100 to query.

test_input_400k.txt is generated using --cluster-std=2.2 --n-samples=400000, with 20 clusters. Should use --eps=0.04 --min-pts=105 to query.

test_input_500k.txt is generated using --cluster-std=2.2 --n-samples=500000, with 20 clusters. Should use --eps=0.035 --min-pts=115 to query.

test_input_600k.txt is generated using --cluster-std=2.2 --n-samples=600000, with 20 clusters. Should use --eps=0.035 --min-pts=120 to query.

test_input_700k.txt is generated using --cluster-std=2.2 --n-samples=600000, with 20 clusters. Should use --eps=0.033 --min-pts=122 to query.

test_input_800k.txt is generated using --cluster-std=2.2 --n-samples=800000, with 20 clusters. Should use --eps=0.03 --min-pts=120 to query.

About

Density-Based Spatial Clustering of Applications with Noise

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published