diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index c9540a35..745da955 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2023-12-19T04:10:28","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2023-12-19T04:33:06","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/dev/affprop.html b/dev/affprop.html index 717b5e95..7ddeaade 100644 --- a/dev/affprop.html +++ b/dev/affprop.html @@ -1,3 +1,3 @@ Affinity Propagation · Clustering.jl

Affinity Propagation

Affinity propagation is a clustering algorithm based on message passing between data points. Similar to K-medoids, it looks at the (dis)similarities in the data, picks one exemplar data point for each cluster, and assigns every point in the data set to the cluster with the closest exemplar.

Clustering.affinitypropFunction
affinityprop(S::AbstractMatrix; [maxiter=200], [tol=1e-6], [damp=0.5],
-             [display=:none]) -> AffinityPropResult

Perform affinity propagation clustering based on a similarity matrix S.

$S_{ij}$ ($i ≠ j$) is the similarity (or the negated distance) between the $i$-th and $j$-th points, $S_{ii}$ defines the availability of the $i$-th point as an exemplar.

Arguments

  • damp::Real: the dampening coefficient, $0 ≤ \mathrm{damp} < 1$. Larger values indicate slower (and probably more stable) update. $\mathrm{damp} = 0$ disables dampening.
  • maxiter, tol, display: see common options

References

Brendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.

source
Clustering.AffinityPropResultType
AffinityPropResult <: ClusteringResult

The output of affinity propagation clustering (affinityprop).

Fields

  • exemplars::Vector{Int}: indices of exemplars (cluster centers)
  • assignments::Vector{Int}: cluster assignments for each data point
  • iterations::Int: number of iterations executed
  • converged::Bool: converged or not
source
+ [display=:none]) -> AffinityPropResult

Perform affinity propagation clustering based on a similarity matrix S.

$S_{ij}$ ($i ≠ j$) is the similarity (or the negated distance) between the $i$-th and $j$-th points, $S_{ii}$ defines the availability of the $i$-th point as an exemplar.

Arguments

  • damp::Real: the dampening coefficient, $0 ≤ \mathrm{damp} < 1$. Larger values indicate slower (and probably more stable) update. $\mathrm{damp} = 0$ disables dampening.
  • maxiter, tol, display: see common options

References

Brendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.

source
Clustering.AffinityPropResultType
AffinityPropResult <: ClusteringResult

The output of affinity propagation clustering (affinityprop).

Fields

  • exemplars::Vector{Int}: indices of exemplars (cluster centers)
  • assignments::Vector{Int}: cluster assignments for each data point
  • iterations::Int: number of iterations executed
  • converged::Bool: converged or not
source
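For orientation, here is a minimal usage sketch of affinityprop (not part of the generated page above); the random data matrix, the use of Distances.jl, and the median-based self-similarity are assumptions made for illustration:

using Clustering, Distances, LinearAlgebra, Statistics
X = rand(5, 300)                         # hypothetical 5×300 data matrix
S = -pairwise(SqEuclidean(), X, dims=2)  # similarities as negated squared distances
S[diagind(S)] .= median(S)               # self-similarity S[i,i]; larger values tend to yield more exemplars
R = affinityprop(S)
R.exemplars, counts(R)                   # exemplar indices and cluster sizes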
diff --git a/dev/algorithms.html b/dev/algorithms.html index 2936859d..0d0dd805 100644 --- a/dev/algorithms.html +++ b/dev/algorithms.html @@ -1,3 +1,3 @@ -Basics · Clustering.jl

Basics

The package implements a variety of clustering algorithms:

Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Data matrix $X$ of size $d \times n$, the $i$-th column of $X$ (X[:, i]) is a data point (data sample) in $d$-dimensional space.
  • Distance matrix $D$ of size $n \times n$, where $D_{ij}$ is the distance between the $i$-th and $j$-th points, or the cost of assigning them to the same cluster.

Common Options

Many clustering algorithms are iterative procedures. The functions share the basic options for controlling the iterations:

  • maxiter::Integer: maximum number of iterations.
  • tol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.
  • display::Symbol: the level of information to be displayed. It may take one of the following values:
    • :none: nothing is shown
    • :final: only shows a brief summary when the algorithm ends
    • :iter: shows the progress at each iteration

Results

A clustering function returns an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and information about the clustering algorithm (e.g. the number of iterations and whether it converged).

The following generic methods are supported by any subtype of ClusteringResult:

StatsBase.countsMethod
counts(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster sizes.

counts(R)[k] is the number of points assigned to the $k$-th cluster.

source
Clustering.wcountsMethod
wcounts(R::ClusteringResult) -> Vector{Float64}
-wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source
Clustering.assignmentsMethod
assignments(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster indices for each point.

assignments(R)[i] is the index of the cluster to which the $i$-th point is assigned.

source
+Basics · Clustering.jl

Basics

The package implements a variety of clustering algorithms:

Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Data matrix $X$ of size $d \times n$, the $i$-th column of $X$ (X[:, i]) is a data point (data sample) in $d$-dimensional space.
  • Distance matrix $D$ of size $n \times n$, where $D_{ij}$ is the distance between the $i$-th and $j$-th points, or the cost of assigning them to the same cluster.
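To illustrate the two input forms, a sketch of deriving a distance matrix $D$ from a data matrix $X$ (using Distances.jl here is an assumption, not a requirement of the clustering functions):

using Distances
X = rand(3, 100)                      # d×n data matrix (hypothetical)
D = pairwise(Euclidean(), X, dims=2)  # n×n pairwise distance matrix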

Common Options

Many clustering algorithms are iterative procedures. The functions share the basic options for controlling the iterations:

  • maxiter::Integer: maximum number of iterations.
  • tol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.
  • display::Symbol: the level of information to be displayed. It may take one of the following values:
    • :none: nothing is shown
    • :final: only shows a brief summary when the algorithm ends
    • :iter: shows the progress at each iteration

Results

A clustering function returns an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and information about the clustering algorithm (e.g. the number of iterations and whether it converged).

The following generic methods are supported by any subtype of ClusteringResult:

StatsBase.countsMethod
counts(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster sizes.

counts(R)[k] is the number of points assigned to the $k$-th cluster.

source
Clustering.wcountsMethod
wcounts(R::ClusteringResult) -> Vector{Float64}
+wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source
Clustering.assignmentsMethod
assignments(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster indices for each point.

assignments(R)[i] is the index of the cluster to which the $i$-th point is assigned.

source
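As a brief sketch of how these accessors combine (the kmeans call and random data are assumptions; K-means is described in its own section):

using Clustering
X = rand(5, 1000)   # hypothetical data matrix
R = kmeans(X, 4)
assignments(R)      # cluster index of each point
counts(R)           # number of points per cluster
wcounts(R)          # weighted sizes; equals counts(R) for unweighted clusterings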
diff --git a/dev/dbscan.html b/dev/dbscan.html index 35dac2cf..2ab94b62 100644 --- a/dev/dbscan.html +++ b/dev/dbscan.html @@ -4,4 +4,4 @@ [min_neighbors=1], [min_cluster_size=1], [nntree_kwargs...]) -> DbscanResult

Cluster points using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

Arguments

Optional keyword arguments to control the algorithm:

Example

points = randn(3, 10000)
 # DBSCAN clustering, clusters with less than 20 points will be discarded:
-clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

source
Clustering.DbscanResultType
DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

  • clusters::Vector{DbscanCluster}: clusters, length K
  • seeds::Vector{Int}: indices of the first points of each cluster's core, length K
  • counts::Vector{Int}: cluster sizes (number of assigned points), length K
  • assignments::Vector{Int}: vector of cluster indices to which each point was assigned, length N
source
Clustering.DbscanClusterType
DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

  • size::Int: number of points in a cluster (core + boundary)
  • core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
  • boundary_indices::Vector{Int}: indices of the cluster points outside of core
source
+clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

source
Clustering.DbscanResultType
DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

  • clusters::Vector{DbscanCluster}: clusters, length K
  • seeds::Vector{Int}: indices of the first points of each cluster's core, length K
  • counts::Vector{Int}: cluster sizes (number of assigned points), length K
  • assignments::Vector{Int}: vector of cluster indices to which each point was assigned, length N
source
Clustering.DbscanClusterType
DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

  • size::Int: number of points in a cluster (core + boundary)
  • core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
  • boundary_indices::Vector{Int}: indices of the cluster points outside of core
source
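For illustration, a sketch of inspecting the documented result fields, reusing the docstring example above:

using Clustering
points = randn(3, 10000)
clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
largest = clustering.clusters[argmax(clustering.counts)]  # the biggest cluster
largest.size == length(largest.core_indices) + length(largest.boundary_indices)  # size = core + boundary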
diff --git a/dev/fuzzycmeans.html b/dev/fuzzycmeans.html index 6069d28a..ada98f55 100644 --- a/dev/fuzzycmeans.html +++ b/dev/fuzzycmeans.html @@ -1,8 +1,8 @@ Fuzzy C-means · Clustering.jl

Fuzzy C-means

Fuzzy C-means is a clustering method that provides cluster membership weights instead of "hard" classification (e.g. K-means).

From a mathematical standpoint, fuzzy C-means solves the following optimization problem:

\[\arg\min_\mathcal{C} \ \sum_{i=1}^n \sum_{j=1}^C w_{ij}^\mu \| \mathbf{x}_i - \mathbf{c}_j \|^2, \ \text{where}\ w_{ij} = \left(\sum_{k=1}^{C} \left(\frac{\left\|\mathbf{x}_i - \mathbf{c}_j \right\|}{\left\|\mathbf{x}_i - \mathbf{c}_k \right\|}\right)^{\frac{2}{\mu-1}}\right)^{-1}\]

Here, $\mathbf{c}_j$ is the center of the $j$-th cluster, $w_{ij}$ is the membership weight of the $i$-th point in the $j$-th cluster, and $\mu > 1$ is a user-defined fuzziness parameter.

Clustering.fuzzy_cmeansFunction
fuzzy_cmeans(data::AbstractMatrix, C::Integer, fuzziness::Real;
-             [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult

Perform Fuzzy C-means clustering over the given data.

Arguments

  • data::AbstractMatrix: $d×n$ data matrix. Each column represents one $d$-dimensional data point.
  • C::Integer: the number of fuzzy clusters, $2 ≤ C < n$.
  • fuzziness::Real: clusters fuzziness ($μ$ in the mathematical formulation), $μ > 1$.

Optional keyword arguments:

  • dist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points
  • maxiter, tol, display, rng: see common options
source
Clustering.FuzzyCMeansResultType
FuzzyCMeansResult{T<:AbstractFloat}

The output of fuzzy_cmeans function.

Fields

  • centers::Matrix{T}: the $d×C$ matrix with columns being the centers of resulting fuzzy clusters
  • weights::Matrix{Float64}: the $n×C$ matrix of assignment weights ($\mathrm{weights}_{ij}$ is the weight (probability) of assigning $i$-th point to the $j$-th cluster)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
Clustering.wcountsFunction
wcounts(R::ClusteringResult) -> Vector{Float64}
-wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source

Examples

using Clustering
+             [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult

Perform Fuzzy C-means clustering over the given data.

Arguments

  • data::AbstractMatrix: $d×n$ data matrix. Each column represents one $d$-dimensional data point.
  • C::Integer: the number of fuzzy clusters, $2 ≤ C < n$.
  • fuzziness::Real: clusters fuzziness ($μ$ in the mathematical formulation), $μ > 1$.

Optional keyword arguments:

  • dist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points
  • maxiter, tol, display, rng: see common options
source
Clustering.FuzzyCMeansResultType
FuzzyCMeansResult{T<:AbstractFloat}

The output of fuzzy_cmeans function.

Fields

  • centers::Matrix{T}: the $d×C$ matrix with columns being the centers of resulting fuzzy clusters
  • weights::Matrix{Float64}: the $n×C$ matrix of assignment weights ($\mathrm{weights}_{ij}$ is the weight (probability) of assigning $i$-th point to the $j$-th cluster)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
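Since the weights field stores soft memberships rather than hard labels, a common post-processing step is to pick the most likely cluster per point; a sketch, where R is a FuzzyCMeansResult such as the one produced in the Examples below:

# most probable cluster for each point, derived from the n×C weights matrix
hard_assignments = [argmax(view(R.weights, i, :)) for i in 1:size(R.weights, 1)]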
Clustering.wcountsFunction
wcounts(R::ClusteringResult) -> Vector{Float64}
+wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source

Examples

using Clustering
 
 # make a random dataset with 1000 points
 # each point is a 5-dimensional vector
@@ -21,23 +21,23 @@
 # get the point memberships over all the clusters
 # memberships is a 1000x3 matrix
 memberships = R.weights
1000×3 Matrix{Float64}:
- 0.3349    0.334459  0.330641
- 0.333347  0.334727  0.331926
- 0.331388  0.333858  0.334754
- 0.332156  0.334126  0.333717
- 0.331455  0.333581  0.334964
- 0.334799  0.332813  0.332388
- 0.332468  0.332904  0.334628
- 0.33594   0.332559  0.331501
- 0.332488  0.33498   0.332532
- 0.335633  0.330689  0.333677
+ 0.33406   0.334528  0.331412
+ 0.332505  0.332657  0.334838
+ 0.33481   0.33336   0.33183
+ 0.332275  0.332665  0.335059
+ 0.327807  0.33451   0.337683
+ 0.334801  0.334398  0.330801
+ 0.333061  0.334498  0.332442
+ 0.333149  0.334944  0.331906
+ 0.3305    0.332512  0.336988
+ 0.331442  0.333599  0.334958
  ⋮                   
- 0.33581   0.331057  0.333134
- 0.33406   0.332529  0.333411
- 0.333504  0.332745  0.333752
- 0.334141  0.334817  0.331042
- 0.33557   0.333355  0.331075
- 0.334093  0.332344  0.333564
- 0.330815  0.334822  0.334363
- 0.330679  0.332207  0.337114
- 0.33304   0.332917  0.334043
+ 0.332512  0.332994  0.334494
+ 0.333461  0.332953  0.333586
+ 0.329862  0.334123  0.336016
+ 0.332913  0.333622  0.333465
+ 0.331471  0.333205  0.335324
+ 0.330328  0.333264  0.336408
+ 0.336638  0.333544  0.329819
+ 0.335453  0.332626  0.331921
+ 0.336177  0.333567  0.330256
diff --git a/dev/hclust.html b/dev/hclust.html index 038db61b..54b31dbe 100644 --- a/dev/hclust.html +++ b/dev/hclust.html @@ -1,5 +1,5 @@ -Hierarchical Clustering · Clustering.jl

Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):

Clustering.hclustFunction
hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust

Perform hierarchical clustering using the distance matrix d and the cluster linkage function.

Returns the dendrogram as a Hclust object.

Arguments

  • d::AbstractMatrix: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.
  • linkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
    • :single (the default): use the minimum distance between any of the cluster members
    • :average: use the mean distance between any of the cluster members
    • :complete: use the maximum distance between any of the members
    • :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
    • :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
  • uplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.
  • branchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:
    • :r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)
    • :barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et al., Bioinformatics (2001)
source
Clustering.HclustType
Hclust{T<:Real}

The output of hclust, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's hclust class.

Fields

  • merges::Matrix{Int}: $N×2$ matrix encoding subtree merges:
    • each row specifies the left and right subtrees (referenced by their $id$s) that are merged
    • negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
    • positive $id$ denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)
  • linkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)
  • heights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage
  • order::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: hclust.

source

Single-linkage clustering using distance matrix:

using Clustering
+Hierarchical Clustering · Clustering.jl

Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):

Clustering.hclustFunction
hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust

Perform hierarchical clustering using the distance matrix d and the cluster linkage function.

Returns the dendrogram as a Hclust object.

Arguments

  • d::AbstractMatrix: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.
  • linkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
    • :single (the default): use the minimum distance between any of the cluster members
    • :average: use the mean distance between any of the cluster members
    • :complete: use the maximum distance between any of the members
    • :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
    • :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
  • uplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.
  • branchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:
    • :r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)
    • :barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et al., Bioinformatics (2001)
source
Clustering.HclustType
Hclust{T<:Real}

The output of hclust, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's hclust class.

Fields

  • merges::Matrix{Int}: $N×2$ matrix encoding subtree merges:
    • each row specifies the left and right subtrees (referenced by their $id$s) that are merged
    • negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
    • positive $id$ denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)
  • linkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)
  • heights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage
  • order::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: hclust.

source

Single-linkage clustering using distance matrix:

using Clustering
 D = rand(1000, 1000);
 D += D'; # symmetric distance matrix (optional)
-result = hclust(D, linkage=:single)
Hclust{Float64}([-445 -956; -21 1; … ; -52 997; -394 998], [0.0015087699301117308, 0.0017070081709188445, 0.0021077449998320175, 0.00225195556476232, 0.003479193937789171, 0.00388347435463543, 0.004590043575440239, 0.005592465080951903, 0.0058577090508917795, 0.005925830373306518  …  0.09427721839399683, 0.09541184944109415, 0.09682170929188849, 0.09695705555178735, 0.10056365753929242, 0.1047564688521172, 0.10838606616722912, 0.10999980968664125, 0.11283359682455474, 0.11447052174134031], [394, 52, 491, 667, 704, 646, 609, 859, 938, 222  …  536, 18, 582, 181, 972, 339, 45, 408, 399, 599], :single)

The resulting dendrogram could be converted into disjoint clusters with the help of cutree function.

Clustering.cutreeFunction
cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

Cut the hclu dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

Arguments

  • k::Integer (optional) the number of desired clusters.
  • h::Real (optional) the height at which the tree is cut.

If both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.

See also: hclust

source
+result = hclust(D, linkage=:single)
Hclust{Float64}([-690 -813; -554 -732; … ; -16 997; -195 998], [0.0035044882382555542, 0.004548714784891494, 0.004950483509694847, 0.005373743972241107, 0.005886254023239723, 0.0063835889565238, 0.0065741127822909196, 0.006700805844352842, 0.006809675623588918, 0.007200475155618946  …  0.09944908633303073, 0.1004498060257667, 0.10136046307512614, 0.10594225340631724, 0.10744167473659694, 0.10969608920802298, 0.11276149705686223, 0.11297511526379667, 0.12343594323017404, 0.12775784522521183], [195, 16, 735, 657, 987, 367, 339, 142, 8, 844  …  190, 603, 18, 903, 755, 944, 120, 148, 128, 995], :single)

The resulting dendrogram could be converted into disjoint clusters with the help of cutree function.

Clustering.cutreeFunction
cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

Cut the hclu dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

Arguments

  • k::Integer (optional) the number of desired clusters.
  • h::Real (optional) the height at which the tree is cut.

If both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.

See also: hclust

source
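A short sketch of cutting the dendrogram obtained in the example above (the choice of k = 20 is arbitrary):

assignments20 = cutree(result, k=20)  # cluster index for each of the 1000 points
length(unique(assignments20))         # number of resulting clusters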
diff --git a/dev/index.html b/dev/index.html index c337a564..337bf824 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Introduction · Clustering.jl
+Introduction · Clustering.jl
diff --git a/dev/init.html b/dev/init.html index 336fbde0..6548de37 100644 --- a/dev/init.html +++ b/dev/init.html @@ -1,6 +1,6 @@ -Initialization · Clustering.jl

Initialization

A clustering algorithm usually requires initialization before it can be started.

Seeding

Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).

Each seeding algorithm implemented by Clustering.jl is a subtype of SeedingAlgorithm:

Clustering.initseeds!Function
initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
-           X::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.

source
Clustering.initseeds_by_costs!Function
initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
-                    costs::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

source

There are several seeding methods described in the literature. Clustering.jl implements three popular ones:

Clustering.KmppAlgType
KmppAlg <: SeedingAlgorithm

Kmeans++ seeding (:kmpp).

Chooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.

References

D. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.

source
Clustering.KmCentralityAlgType
KmCentralityAlg <: SeedingAlgorithm

K-medoids initialization based on centrality (:kmcen).

Choose the $k$ points with the highest centrality as seeds.

References

Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. doi:10.1016/j.eswa.2008.01.039

source
Clustering.RandSeedAlgType
RandSeedAlg <: SeedingAlgorithm

Random seeding (:rand).

Chooses an arbitrary subset of $k$ data points as cluster seeds.

source

In practice, we have found that Kmeans++ is the most effective choice.

For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:

Clustering.initseedsFunction
initseeds(alg::Union{SeedingAlgorithm, Symbol},
-          X::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from a $d×n$ data matrix X using the alg algorithm.

alg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.

Returns the vector of k seed indices.

source
Clustering.initseeds_by_costsFunction
initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},
-                   costs::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from the $n×n$ costs matrix using algorithm alg.

Here, costs[i, j] is the cost of assigning points i and j to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

Returns the vector of k seed indices.

source
+Initialization · Clustering.jl

Initialization

A clustering algorithm usually requires initialization before it can be started.

Seeding

Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).

Each seeding algorithm implemented by Clustering.jl is a subtype of SeedingAlgorithm:

Clustering.initseeds!Function
initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
+           X::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.

source
Clustering.initseeds_by_costs!Function
initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
+                    costs::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

source

There are several seeding methods described in the literature. Clustering.jl implements three popular ones:

Clustering.KmppAlgType
KmppAlg <: SeedingAlgorithm

Kmeans++ seeding (:kmpp).

Chooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.

References

D. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.

source
Clustering.KmCentralityAlgType
KmCentralityAlg <: SeedingAlgorithm

K-medoids initialization based on centrality (:kmcen).

Choose the $k$ points with the highest centrality as seeds.

References

Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. doi:10.1016/j.eswa.2008.01.039

source
Clustering.RandSeedAlgType
RandSeedAlg <: SeedingAlgorithm

Random seeding (:rand).

Chooses an arbitrary subset of $k$ data points as cluster seeds.

source

In practice, we have found that Kmeans++ is the most effective choice.

For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:

Clustering.initseedsFunction
initseeds(alg::Union{SeedingAlgorithm, Symbol},
+          X::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from a $d×n$ data matrix X using the alg algorithm.

alg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.

Returns the vector of k seed indices.

source
Clustering.initseeds_by_costsFunction
initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},
+                   costs::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from the $n×n$ costs matrix using algorithm alg.

Here, costs[i, j] is the cost of assigning points i and j to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

Returns the vector of k seed indices.

source
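As an illustration of the wrappers, a sketch that picks k-means++ seeds and reuses the selected points as explicit initial centers for kmeans! (the data matrix is an assumption; kmeans! is documented in the K-means section):

using Clustering
X = rand(5, 500)                 # hypothetical d×n data
iseeds = initseeds(:kmpp, X, 8)  # indices of 8 seed points chosen by k-means++
centers = X[:, iseeds]           # copy the seed points to use as initial centers
R = kmeans!(X, centers)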
diff --git a/dev/kmeans-018c1a08.svg b/dev/kmeans-ccfdcd40.svg similarity index 70% rename from dev/kmeans-018c1a08.svg rename to dev/kmeans-ccfdcd40.svg index d29d4a15..7b15f6ef 100644 --- a/dev/kmeans-018c1a08.svg +++ b/dev/kmeans-ccfdcd40.svg @@ -1,196 +1,196 @@ [regenerated scatter-plot SVG for the K-means example; raw path data omitted]
diff --git a/dev/kmeans.html b/dev/kmeans.html index c7e9050a..3726d012 100644 --- a/dev/kmeans.html +++ b/dev/kmeans.html @@ -1,5 +1,5 @@ -K-means · Clustering.jl

K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is an index of the cluster for $i$-th point $\mathbf{x}_i$.

Clustering.kmeansFunction
kmeans(X, k, [...]) -> KmeansResult

K-means clustering of the $d×n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.

Arguments

  • init (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:
    • a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
    • an instance of SeedingAlgorithm;
    • an integer vector of length $k$ that provides the indices of points to use as initial seeds.
  • weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
  • maxiter, tol, display: see common options
source
Clustering.KmeansResultType
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight
source

If you already have a set of initial center vectors, kmeans! could be used:

Clustering.kmeans!Function
kmeans!(X, centers; [kwargs...]) -> KmeansResult

Update the current cluster centers ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix X (each column of X is a $d$-dimensional data point).

See kmeans for the description of optional kwargs.

source

Examples

using Clustering
+K-means · Clustering.jl

K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is an index of the cluster for $i$-th point $\mathbf{x}_i$.

Clustering.kmeansFunction
kmeans(X, k, [...]) -> KmeansResult

K-means clustering of the $d×n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.

Arguments

  • init (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:
    • a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
    • an instance of SeedingAlgorithm;
    • an integer vector of length $k$ that provides the indices of points to use as initial seeds.
  • weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
  • maxiter, tol, display: see common options
source
Clustering.KmeansResultType
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight
source

If you already have a set of initial center vectors, kmeans! could be used:

Clustering.kmeans!Function
kmeans!(X, centers; [kwargs...]) -> KmeansResult

Update the current cluster centers ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix X (each column of X is a $d$-dimensional data point).

See kmeans for the description of optional kwargs.

source
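For instance, a sketch of warm-starting kmeans! from explicitly chosen data columns (the data and seed indices are assumptions):

using Clustering
X = rand(5, 1000)              # hypothetical data
centers = X[:, [1, 500, 999]]  # copy three data points to act as initial centers (k = 3)
R = kmeans!(X, centers; maxiter = 100)
counts(R)                      # sizes of the three resulting clusters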

Examples

using Clustering
 
 # make a random dataset with 1000 random 5-dimensional points
 X = rand(5, 1000)
@@ -12,11 +12,11 @@
 a = assignments(R) # get the assignments of points to clusters
 c = counts(R) # get the cluster sizes
 M = R.centers # get the cluster centers
5×20 Matrix{Float64}:
- 0.233921  0.248751  0.681185  0.339047  …  0.199974  0.781437  0.171391
- 0.78364   0.236461  0.22542   0.738529     0.24972   0.699164  0.454819
- 0.793358  0.229916  0.254547  0.310724     0.287097  0.227963  0.387556
- 0.719073  0.797893  0.593535  0.77088      0.164585  0.801478  0.700597
- 0.364485  0.364264  0.787123  0.285578     0.447572  0.831655  0.79441

Scatter plot of the K-means clustering results:

using RDatasets, Clustering, Plots
+ 0.2117    0.491851  0.758457  0.602743  …  0.217038  0.852979  0.748562
+ 0.770759  0.309261  0.347395  0.692755     0.5091    0.213795  0.752339
+ 0.69697   0.730276  0.182796  0.707557     0.67861   0.692465  0.528834
+ 0.319476  0.263525  0.726214  0.257265     0.789839  0.729913  0.220093
+ 0.403965  0.148082  0.294128  0.788793     0.215303  0.208538  0.204926

Scatter plot of the K-means clustering results:

using RDatasets, Clustering, Plots
 iris = dataset("datasets", "iris"); # load the data
 
 features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
@@ -24,4 +24,4 @@
 
 # plot with the point color mapped to the assigned cluster index
 scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
-        color=:lightrainbow, legend=false)
Example block output
+ color=:lightrainbow, legend=false)
Example block output
diff --git a/dev/kmedoids.html b/dev/kmedoids.html index 69a1d922..44f27e65 100644 --- a/dev/kmedoids.html +++ b/dev/kmedoids.html @@ -1,3 +1,3 @@ -K-medoids · Clustering.jl

K-medoids

K-medoids is a clustering algorithm that works by finding $k$ data points (called medoids) such that the total distance between each data point and the closest medoid is minimal.

Clustering.kmedoidsFunction
kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult

Perform K-medoids clustering of $n$ points into k clusters, given the dist matrix ($n×n$, dist[i, j] is the distance between the j-th and i-th points).

Arguments

  • init (defaults to :kmpp): how medoids should be initialized, could be one of the following:
    • a Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).
    • an integer vector of length k that provides the indices of points to use as initial medoids.
  • maxiter, tol, display: see common options

Note

The function implements a K-means-style algorithm instead of PAM (Partitioning Around Medoids). The K-means-style algorithm converges in fewer iterations, but was shown to produce worse results (10-20% higher total costs); see e.g. Schubert & Rousseeuw (2019).

source
Clustering.kmedoids!Function
kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};
-          [kwargs...]) -> KmedoidsResult

Update the current cluster medoids using the dist matrix.

The medoids field of the returned KmedoidsResult points to the same array as medoids argument.

See kmedoids for the description of optional kwargs.

source
Clustering.KmedoidsResultType
KmedoidsResult{T} <: ClusteringResult

The output of kmedoids function.

Fields

  • medoids::Vector{Int}: the indices of $k$ medoids
  • assignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the $i$-th point
  • costs::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning $i$-th point to its medoid
  • counts::Vector{Int}: cluster sizes
  • totalcost::Float64: total assignment cost (the sum of costs)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source

References

  1. Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955
  2. Schubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16
+K-medoids · Clustering.jl

K-medoids

K-medoids is a clustering algorithm that works by finding $k$ data points (called medoids) such that the total distance between each data point and the closest medoid is minimal.

Clustering.kmedoidsFunction
kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult

Perform K-medoids clustering of $n$ points into k clusters, given the dist matrix ($n×n$, dist[i, j] is the distance between the j-th and i-th points).

Arguments

  • init (defaults to :kmpp): how medoids should be initialized, could be one of the following:
    • a Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).
    • an integer vector of length k that provides the indices of points to use as initial medoids.
  • maxiter, tol, display: see common options

Note

The function implements a K-means-style algorithm instead of PAM (Partitioning Around Medoids). The K-means-style algorithm converges in fewer iterations, but was shown to produce worse results (10-20% higher total costs); see e.g. Schubert & Rousseeuw (2019).

source
Clustering.kmedoids!Function
kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};
+          [kwargs...]) -> KmedoidsResult

Update the current cluster medoids using the dist matrix.

The medoids field of the returned KmedoidsResult points to the same array as medoids argument.

See kmedoids for the description of optional kwargs.

source
Clustering.KmedoidsResultType
KmedoidsResult{T} <: ClusteringResult

The output of kmedoids function.

Fields

  • medoids::Vector{Int}: the indices of $k$ medoids
  • assignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the $i$-th point
  • costs::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning $i$-th point to its medoid
  • counts::Vector{Int}: cluster sizes
  • totalcost::Float64: total assignment cost (the sum of costs)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
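For illustration, a sketch of running kmedoids on a precomputed distance matrix (the data and the use of Distances.jl are assumptions):

using Clustering, Distances
X = rand(5, 300)                           # hypothetical data
dist = pairwise(SqEuclidean(), X, dims=2)  # n×n pairwise distances
R = kmedoids(dist, 10)                     # 10 clusters represented by medoid points
X[:, R.medoids]                            # coordinates of the medoids
R.totalcost                                # total assignment cost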

References

  1. Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955
  2. Schubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16
diff --git a/dev/mcl.html b/dev/mcl.html index 7c053fe9..4700348d 100644 --- a/dev/mcl.html +++ b/dev/mcl.html @@ -1,2 +1,2 @@ -MCL (Markov Cluster Algorithm) · Clustering.jl

MCL (Markov Cluster Algorithm)

The Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. the clusters).

Clustering.mclFunction
mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult

Perform MCL (Markov Cluster Algorithm) clustering using $n×n$ adjacency (points similarity) matrix adj.

Arguments

Keyword arguments to control the MCL algorithm:

  • add_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph
  • expansion::Number (defaults to 2): MCL expansion constant
  • inflation::Number (defaults to 2): MCL inflation constant
  • save_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge
  • prune_tol::Number: pruning threshold
  • display, maxiter, tol: see common options

References

Stijn van Dongen, "Graph clustering by flow simulation", 2001

Original MCL implementation.

source
Clustering.MCLResultType
MCLResult <: ClusteringResult

The output of mcl function.

Fields

  • mcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled
  • assignments::Vector{Int}: indices of the points clusters. assignments[i] is the index of the cluster for the $i$-th point ($0$ if unassigned)
  • counts::Vector{Int}: the $k$-length vector of cluster sizes
  • nunassigned::Int: the number of standalone points not assigned to any cluster
  • iterations::Int: the number of elapsed iterations
  • rel_Δ::Float64: the final relative Δ
  • converged::Bool: whether the method converged
source
+MCL (Markov Cluster Algorithm) · Clustering.jl

MCL (Markov Cluster Algorithm)

The Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. the clusters).

Clustering.mclFunction
mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult

Perform MCL (Markov Cluster Algorithm) clustering using $n×n$ adjacency (points similarity) matrix adj.

Arguments

Keyword arguments to control the MCL algorithm:

  • add_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph
  • expansion::Number (defaults to 2): MCL expansion constant
  • inflation::Number (defaults to 2): MCL inflation constant
  • save_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge
  • prune_tol::Number: pruning threshold
  • display, maxiter, tol: see common options

References

Stijn van Dongen, "Graph clustering by flow simulation", 2001

Original MCL implementation.

source
Clustering.MCLResultType
MCLResult <: ClusteringResult

The output of mcl function.

Fields

  • mcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled
  • assignments::Vector{Int}: indices of the points clusters. assignments[i] is the index of the cluster for the $i$-th point ($0$ if unassigned)
  • counts::Vector{Int}: the $k$-length vector of cluster sizes
  • nunassigned::Int: the number of standalone points not assigned to any cluster
  • iterations::Int: the number of elapsed iterations
  • rel_Δ::Float64: the final relative Δ
  • converged::Bool: whether the method converged
source
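A small sketch of running mcl on a hand-built symmetric adjacency matrix (the matrix values are made up for illustration):

using Clustering
adj = [1.0 0.9 0.1;
       0.9 1.0 0.1;
       0.1 0.1 1.0]  # hypothetical pairwise similarities
R = mcl(adj, display=:none)
assignments(R), R.converged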
diff --git a/dev/search_index.js b/dev/search_index.js index 9d662241..8b2ee63a 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"hclust.html#Hierarchical-Clustering","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"","category":"section"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"hclust\nHclust","category":"page"},{"location":"hclust.html#Clustering.hclust","page":"Hierarchical Clustering","title":"Clustering.hclust","text":"hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust\n\nPerform hierarchical clustering using the distance matrix d and the cluster linkage function.\n\nReturns the dendrogram as a Hclust object.\n\nArguments\n\nd::AbstractMatrix: the pairwise distance matrix. d_ij is the distance between i-th and j-th points.\nlinkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:\n:single (the default): use the minimum distance between any of the cluster members\n:average: use the mean distance between any of the cluster members\n:complete: use the maximum distance between any of the members\n:ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters\n:ward_presquared: same as :ward, but assumes that the distances in d are already squared.\nuplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.\nbranchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:\n:r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)\n:barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the \"fast optimal leaf ordering\" algorithm from Bar-Joseph et. al. 
Bioinformatics (2001)\n\n\n\n\n\n","category":"function"},{"location":"hclust.html#Clustering.Hclust","page":"Hierarchical Clustering","title":"Clustering.Hclust","text":"Hclust{T<:Real}\n\nThe output of hclust, hierarchical clustering of data points.\n\nProvides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.\n\nThis type mostly follows R's hclust class.\n\nFields\n\nmerges::Matrix{Int}: N2 matrix encoding subtree merges:\neach row specifies the left and right subtrees (referenced by their ids) that are merged\nnegative subtree id denotes the leaf node and corresponds to the data point at position -id\npositive id denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)\nlinkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)\nheights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage\norder::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.\n\nSee also: hclust.\n\n\n\n\n\n","category":"type"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"Single-linkage clustering using distance matrix:","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"using Clustering\nD = rand(1000, 1000);\nD += D'; # symmetric distance matrix (optional)\nresult = hclust(D, linkage=:single)","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"The resulting dendrogram could be converted into disjoint clusters with the help of cutree function.","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"cutree","category":"page"},{"location":"hclust.html#Clustering.cutree","page":"Hierarchical Clustering","title":"Clustering.cutree","text":"cutree(hclu::Hclust; [k], [h]) -> Vector{Int}\n\nCut the hclu dendrogram to produce clusters at the specified level of granularity.\n\nReturns the cluster assignments vector z (z_i is the index of the cluster for the i-th data point).\n\nArguments\n\nk::Integer (optional) the number of desired clusters.\nh::Real (optional) the height at which the tree is cut.\n\nIf both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.\n\nSee also: hclust\n\n\n\n\n\n","category":"function"},{"location":"init.html#clu_algo_init","page":"Initialization","title":"Initialization","text":"","category":"section"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"A clustering algorithm usually requires initialization before it could be started.","category":"page"},{"location":"init.html#Seeding","page":"Initialization","title":"Seeding","text":"","category":"section"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"Each seeding algorithm implemented by Clustering.jl is a subtype of 
SeedingAlgorithm:","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"SeedingAlgorithm\ninitseeds!\ninitseeds_by_costs!","category":"page"},{"location":"init.html#Clustering.SeedingAlgorithm","page":"Initialization","title":"Clustering.SeedingAlgorithm","text":"SeedingAlgorithm\n\nBase type for all seeding algorithms.\n\nEach seeding algorithm should implement the two functions: initseeds! and initseeds_by_costs!.\n\n\n\n\n\n","category":"type"},{"location":"init.html#Clustering.initseeds!","page":"Initialization","title":"Clustering.initseeds!","text":"initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,\n X::AbstractMatrix) -> iseeds\n\nInitialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.\n\n\n\n\n\n","category":"function"},{"location":"init.html#Clustering.initseeds_by_costs!","page":"Initialization","title":"Clustering.initseeds_by_costs!","text":"initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,\n costs::AbstractMatrix) -> iseeds\n\nInitialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.\n\nHere, costs[i, j] is the cost of assigning points i and j to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.\n\n\n\n\n\n","category":"function"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"There are several seeding methods described in the literature. Clustering.jl implements three popular ones:","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"KmppAlg\nKmCentralityAlg\nRandSeedAlg","category":"page"},{"location":"init.html#Clustering.KmppAlg","page":"Initialization","title":"Clustering.KmppAlg","text":"KmppAlg <: SeedingAlgorithm\n\nKmeans++ seeding (:kmpp).\n\nChooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.\n\nReferences\n\nD. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.\n\n\n\n\n\n","category":"type"},{"location":"init.html#Clustering.KmCentralityAlg","page":"Initialization","title":"Clustering.KmCentralityAlg","text":"KmCentralityAlg <: SeedingAlgorithm\n\nK-medoids initialization based on centrality (:kmcen).\n\nChoose the k points with the highest centrality as seeds.\n\nReferences\n\nHae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. 
doi:10.1016/j.eswa.2008.01.039\n\n\n\n\n\n","category":"type"},{"location":"init.html#Clustering.RandSeedAlg","page":"Initialization","title":"Clustering.RandSeedAlg","text":"RandSeedAlg <: SeedingAlgorithm\n\nRandom seeding (:rand).\n\nChooses an arbitrary subset of k data points as cluster seeds.\n\n\n\n\n\n","category":"type"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"In practice, we have found that Kmeans++ is the most effective choice.","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"initseeds\ninitseeds_by_costs","category":"page"},{"location":"init.html#Clustering.initseeds","page":"Initialization","title":"Clustering.initseeds","text":"initseeds(alg::Union{SeedingAlgorithm, Symbol},\n X::AbstractMatrix, k::Integer) -> Vector{Int}\n\nSelect k seeds from a dn data matrix X using the alg algorithm.\n\nalg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.\n\nReturns the vector of k seed indices.\n\n\n\n\n\n","category":"function"},{"location":"init.html#Clustering.initseeds_by_costs","page":"Initialization","title":"Clustering.initseeds_by_costs","text":"initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},\n costs::AbstractMatrix, k::Integer) -> Vector{Int}\n\nSelect k seeds from the nn costs matrix using algorithm alg.\n\nHere, costs[i, j] is the cost of assigning points iandj` to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.\n\nReturns the vector of k seed indices.\n\n\n\n\n\n","category":"function"},{"location":"dbscan.html#DBSCAN","page":"DBSCAN","title":"DBSCAN","text":"","category":"section"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.","category":"page"},{"location":"dbscan.html#Density-Reachability","page":"DBSCAN","title":"Density Reachability","text":"","category":"section"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"DBSCAN's definition of a cluster is based on the concept of density reachability: a point q is said to be directly density reachable by another point p if the distance between them is below a specified threshold epsilon and p is surrounded by sufficiently many points. 
Then, q is considered to be density reachable by p if there exists a sequence p_1 p_2 ldots p_n such that p_1 = p and p_i+1 is directly density reachable from p_i.","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"The points within DBSCAN clusters are categorized into core (or seeds) and boundary:","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"All points of the cluster core are mutually density-connected, meaning that for any two distinct points p and q in a core, there exists a point o such that both p and q are density reachable from o.\nIf a point is density-connected to any point of a cluster core, it is also part of the core.\nAll points within the epsilon-neighborhood of any core point, but not belonging to that core (i.e. not density reachable from the core), are considered cluster boundary.","category":"page"},{"location":"dbscan.html#Interface","page":"DBSCAN","title":"Interface","text":"","category":"section"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"The implementation of DBSCAN algorithm provided by dbscan function supports the two ways of specifying clustering data:","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"The d times n matrix of point coordinates. This is the preferred method as it uses memory- and time-efficient neighboring points queries via NearestNeighbors.jl package.\nThe ntimes n matrix of precalculated pairwise point distances. It requires O(n^2) memory and time to run.","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"dbscan\nDbscanResult\nDbscanCluster","category":"page"},{"location":"dbscan.html#Clustering.dbscan","page":"DBSCAN","title":"Clustering.dbscan","text":"dbscan(points::AbstractMatrix, radius::Real;\n [metric=Euclidean()],\n [min_neighbors=1], [min_cluster_size=1],\n [nntree_kwargs...]) -> DbscanResult\n\nCluster points using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.\n\nArguments\n\npoints: when metric is specified, the d×n matrix, where each column is a d-dimensional coordinate of a point; when metric=nothing, the n×n matrix of pairwise distances between the points\nradius::Real: neighborhood radius; points within this distance are considered neighbors\n\nOptional keyword arguments to control the algorithm:\n\nmetric (defaults to Euclidean()): the points distance metric to use, nothing means points is the n×n precalculated distance matrix\nmin_neighbors::Integer (defaults to 1): the minimal number of neighbors required to assign a point to a cluster \"core\"\nmin_cluster_size::Integer (defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discarded\nnntree_kwargs...: parameters (like leafsize) for the KDTree constructor\n\nExample\n\npoints = randn(3, 10000)\n# DBSCAN clustering, clusters with less than 20 points will be discarded:\nclustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)\n\nReferences:\n\nMartin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, \"A density-based algorithm for discovering clusters in large spatial databases with noise\", KDD-1996, pp. 226–231.\nErich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, \"DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN\", ACM Transactions on Database Systems, Vol.42(3)3, pp. 
1–21, https://doi.org/10.1145/3068335\n\n\n\n\n\n","category":"function"},{"location":"dbscan.html#Clustering.DbscanResult","page":"DBSCAN","title":"Clustering.DbscanResult","text":"DbscanResult <: ClusteringResult\n\nThe output of dbscan function.\n\nFields\n\nclusters::Vector{DbscanCluster}: clusters, length K\nseeds::Vector{Int}: indices of the first points of each cluster's core, length K\ncounts::Vector{Int}: cluster sizes (number of assigned points), length K\nassignments::Vector{Int}: vector of clusters indices, where each point was assigned to, length N\n\n\n\n\n\n","category":"type"},{"location":"dbscan.html#Clustering.DbscanCluster","page":"DBSCAN","title":"Clustering.DbscanCluster","text":"DbscanCluster\n\nDBSCAN cluster, part of DbscanResult returned by dbscan function.\n\nFields\n\nsize::Int: number of points in a cluster (core + boundary)\ncore_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)\nboundary_indices::Vector{Int}: indices of the cluster points outside of core\n\n\n\n\n\n","category":"type"},{"location":"kmeans.html#K-means","page":"K-means","title":"K-means","text":"","category":"section"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"textminimize sum_i=1^n mathbfx_i - boldsymbolmu_z_i ^2 textwrt (boldsymbolmu z)","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"Here, boldsymbolmu_k is the center of the k-th cluster, and z_i is an index of the cluster for i-th point mathbfx_i.","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"kmeans\nKmeansResult","category":"page"},{"location":"kmeans.html#Clustering.kmeans","page":"K-means","title":"Clustering.kmeans","text":"kmeans(X, k, [...]) -> KmeansResult\n\nK-means clustering of the dn data matrix X (each column of X is a d-dimensional data point) into k clusters.\n\nArguments\n\ninit (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:\na Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);\nan instance of SeedingAlgorithm;\nan integer vector of length k that provides the indices of points to use as initial seeds.\nweights: n-element vector of point weights (the cluster centers are the weighted means of cluster members)\nmaxiter, tol, display: see common options\n\n\n\n\n\n","category":"function"},{"location":"kmeans.html#Clustering.KmeansResult","page":"K-means","title":"Clustering.KmeansResult","text":"KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult\n\nThe output of kmeans and kmeans!.\n\nType parameters\n\nC<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix\nD<:Real: type of the assignment cost\nWC<:Real: type of the cluster weight\n\n\n\n\n\n","category":"type"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"If you already have a set of initial center vectors, kmeans! 
could be used:","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"kmeans!","category":"page"},{"location":"kmeans.html#Clustering.kmeans!","page":"K-means","title":"Clustering.kmeans!","text":"kmeans!(X, centers; [kwargs...]) -> KmeansResult\n\nUpdate the current cluster centers (dk matrix, where d is the dimension and k the number of centroids) using the dn data matrix X (each column of X is a d-dimensional data point).\n\nSee kmeans for the description of optional kwargs.\n\n\n\n\n\n","category":"function"},{"location":"kmeans.html#Examples","page":"K-means","title":"Examples","text":"","category":"section"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"using Clustering\n\n# make a random dataset with 1000 random 5-dimensional points\nX = rand(5, 1000)\n\n# cluster X into 20 clusters using K-means\nR = kmeans(X, 20; maxiter=200, display=:iter)\n\n@assert nclusters(R) == 20 # verify the number of clusters\n\na = assignments(R) # get the assignments of points to clusters\nc = counts(R) # get the cluster sizes\nM = R.centers # get the cluster centers","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"Scatter plot of the K-means clustering results:","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"using RDatasets, Clustering, Plots\niris = dataset(\"datasets\", \"iris\"); # load the data\n\nfeatures = collect(Matrix(iris[:, 1:4])'); # features to use for clustering\nresult = kmeans(features, 3); # run K-means for the 3 clusters\n\n# plot with the point color mapped to the assigned cluster index\nscatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,\n color=:lightrainbow, legend=false)","category":"page"},{"location":"algorithms.html#clu_algo_basics","page":"Basics","title":"Basics","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"The package implements a variety of clustering algorithms:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Pages = [\"kmeans.md\", \"kmedoids.md\", \"hclust.md\", \"mcl.md\",\n \"affprop.md\", \"dbscan.md\", \"fuzzycmeans.md\"]","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.","category":"page"},{"location":"algorithms.html#Inputs","page":"Basics","title":"Inputs","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Data matrix X of size d times n, the i-th column of X (X[:, i]) is a data point (data sample) in d-dimensional space.\nDistance matrix D of size n times n, where D_ij is the distance between the i-th and j-th points, or the cost of assigning them to the same cluster.","category":"page"},{"location":"algorithms.html#common_options","page":"Basics","title":"Common Options","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Many clustering algorithms are iterative procedures. 
The functions share the basic options for controlling the iterations:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"maxiter::Integer: maximum number of iterations.\ntol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.\ndisplay::Symbol: the level of information to be displayed. It may take one of the following values:\n:none: nothing is shown\n:final: only shows a brief summary when the algorithm ends\n:iter: shows the progress at each iteration","category":"page"},{"location":"algorithms.html#Results","page":"Basics","title":"Results","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"A clustering function would return an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and the information about the clustering algorithm (e.g. the number of iterations and whether it converged).","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"ClusteringResult","category":"page"},{"location":"algorithms.html#Clustering.ClusteringResult","page":"Basics","title":"Clustering.ClusteringResult","text":"ClusteringResult\n\nBase type for the output of clustering algorithm.\n\n\n\n\n\n","category":"type"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"The following generic methods are supported by any subtype of ClusteringResult:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"nclusters(::ClusteringResult)\ncounts(::ClusteringResult)\nwcounts(::ClusteringResult)\nassignments(::ClusteringResult)","category":"page"},{"location":"algorithms.html#Clustering.nclusters-Tuple{ClusteringResult}","page":"Basics","title":"Clustering.nclusters","text":"nclusters(R::ClusteringResult) -> Int\n\nGet the number of clusters.\n\n\n\n\n\n","category":"method"},{"location":"algorithms.html#StatsBase.counts-Tuple{ClusteringResult}","page":"Basics","title":"StatsBase.counts","text":"counts(R::ClusteringResult) -> Vector{Int}\n\nGet the vector of cluster sizes.\n\ncounts(R)[k] is the number of points assigned to the k-th cluster.\n\n\n\n\n\n","category":"method"},{"location":"algorithms.html#Clustering.wcounts-Tuple{ClusteringResult}","page":"Basics","title":"Clustering.wcounts","text":"wcounts(R::ClusteringResult) -> Vector{Float64}\nwcounts(R::FuzzyCMeansResult) -> Vector{Float64}\n\nGet the weighted cluster sizes as the sum of weights of points assigned to each cluster.\n\nFor non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).\n\n\n\n\n\n","category":"method"},{"location":"algorithms.html#Clustering.assignments-Tuple{ClusteringResult}","page":"Basics","title":"Clustering.assignments","text":"assignments(R::ClusteringResult) -> Vector{Int}\n\nGet the vector of cluster indices for each point.\n\nassignments(R)[i] is the index of the cluster to which the i-th point is assigned.\n\n\n\n\n\n","category":"method"},{"location":"kmedoids.html#K-medoids","page":"K-medoids","title":"K-medoids","text":"","category":"section"},{"location":"kmedoids.html","page":"K-medoids","title":"K-medoids","text":"K-medoids is a clustering algorithm that works by finding k data points (called medoids) such that the total 
distance between each data point and the closest medoid is minimal.","category":"page"},{"location":"kmedoids.html","page":"K-medoids","title":"K-medoids","text":"kmedoids\nkmedoids!\nKmedoidsResult","category":"page"},{"location":"kmedoids.html#Clustering.kmedoids","page":"K-medoids","title":"Clustering.kmedoids","text":"kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult\n\nPerform K-medoids clustering of n points into k clusters, given the dist matrix (nn, dist[i, j] is the distance between the j-th and i-th points).\n\nArguments\n\ninit (defaults to :kmpp): how medoids should be initialized, could be one of the following:\na Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).\nan integer vector of length k that provides the indices of points to use as initial medoids.\nmaxiter, tol, display: see common options\n\nNote\n\nThe function implements a K-means style algorithm instead of PAM (Partitioning Around Medoids). K-means style algorithm converges in fewer iterations, but was shown to produce worse (10-20% higher total costs) results (see e.g. Schubert & Rousseeuw (2019)).\n\n\n\n\n\n","category":"function"},{"location":"kmedoids.html#Clustering.kmedoids!","page":"K-medoids","title":"Clustering.kmedoids!","text":"kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};\n [kwargs...]) -> KmedoidsResult\n\nUpdate the current cluster medoids using the dist matrix.\n\nThe medoids field of the returned KmedoidsResult points to the same array as medoids argument.\n\nSee kmedoids for the description of optional kwargs.\n\n\n\n\n\n","category":"function"},{"location":"kmedoids.html#Clustering.KmedoidsResult","page":"K-medoids","title":"Clustering.KmedoidsResult","text":"KmedoidsResult{T} <: ClusteringResult\n\nThe output of kmedoids function.\n\nFields\n\nmedoids::Vector{Int}: the indices of k medoids\nassignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the i-th point\ncosts::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning i-th point to its medoid\ncounts::Vector{Int}: cluster sizes\ntotalcost::Float64: total assignment cost (the sum of costs)\niterations::Int: the number of executed algorithm iterations\nconverged::Bool: whether the procedure converged\n\n\n\n\n\n","category":"type"},{"location":"kmedoids.html#kmedoid_refs","page":"K-medoids","title":"References","text":"","category":"section"},{"location":"kmedoids.html","page":"K-medoids","title":"K-medoids","text":"Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955\nSchubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16","category":"page"},{"location":"affprop.html#Affinity-Propagation","page":"Affinity Propagation","title":"Affinity Propagation","text":"","category":"section"},{"location":"affprop.html","page":"Affinity Propagation","title":"Affinity Propagation","text":"Affinity propagation is a clustering algorithm based on message passing between data points. 
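A minimal sketch of K-medoids on a precomputed distance matrix, following the kmedoids docstring above; the toy data, the use of Distances.jl's pairwise to build the matrix, and the choice of 4 clusters are illustrative assumptions, not part of the API.

using Clustering, Distances

X = rand(5, 300)                       # toy data, one point per column
D = pairwise(Euclidean(), X, dims=2)   # 300×300 matrix of pairwise distances
R = kmedoids(D, 4; maxiter=200)
R.medoids        # indices of the 4 medoid points
R.assignments    # cluster index assigned to every point
R.totalcost      # total assignment cost (sum of point-to-medoid distances)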
Similar to K-medoids, it looks at the (dis)similarities in the data, picks one exemplar data point for each cluster, and assigns every point in the data set to the cluster with the closest exemplar.","category":"page"},{"location":"affprop.html","page":"Affinity Propagation","title":"Affinity Propagation","text":"affinityprop\nAffinityPropResult","category":"page"},{"location":"affprop.html#Clustering.affinityprop","page":"Affinity Propagation","title":"Clustering.affinityprop","text":"affinityprop(S::AbstractMatrix; [maxiter=200], [tol=1e-6], [damp=0.5],\n [display=:none]) -> AffinityPropResult\n\nPerform affinity propagation clustering based on a similarity matrix S.\n\nS_ij (i j) is the similarity (or the negated distance) between the i-th and j-th points, S_ii defines the availability of the i-th point as an exemplar.\n\nArguments\n\ndamp::Real: the dampening coefficient, 0 mathrmdamp 1. Larger values indicate slower (and probably more stable) update. mathrmdamp = 0 disables dampening.\nmaxiter, tol, display: see common options\n\nReferences\n\nBrendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.\n\n\n\n\n\n","category":"function"},{"location":"affprop.html#Clustering.AffinityPropResult","page":"Affinity Propagation","title":"Clustering.AffinityPropResult","text":"AffinityPropResult <: ClusteringResult\n\nThe output of affinity propagation clustering (affinityprop).\n\nFields\n\nexemplars::Vector{Int}: indices of exemplars (cluster centers)\nassignments::Vector{Int}: cluster assignments for each data point\niterations::Int: number of iterations executed\nconverged::Bool: converged or not\n\n\n\n\n\n","category":"type"},{"location":"validate.html#clu_validate","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Clustering.jl package provides a number of methods to evaluate the results of a clustering algorithm and/or to validate its correctness.","category":"page"},{"location":"validate.html#Cross-tabulation","page":"Evaluation & Validation","title":"Cross tabulation","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Cross tabulation, or contingency matrix, is a basis for many clustering quality measures. 
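A minimal sketch of affinity propagation on negated Euclidean distances, based on the affinityprop docstring above; using the median similarity as the exemplar preference (the diagonal of S) and damp=0.9 are illustrative assumptions, not package defaults.

using Clustering, Distances, Statistics

X = rand(5, 200)
S = -pairwise(Euclidean(), X, dims=2)   # S[i,j]: similarity = negated distance
p = median(S)                           # rough exemplar preference, an illustrative choice
for i in axes(S, 1)
    S[i, i] = p                         # S[i,i]: availability of point i as an exemplar
end
R = affinityprop(S; damp=0.9, maxiter=500)
R.exemplars                             # indices of the chosen exemplars, one per cluster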
It shows how similar are the two clusterings on a cluster level.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Clustering.jl extends StatsBase.counts() with methods that accept ClusteringResult arguments:","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"counts(a::ClusteringResult, b::ClusteringResult)","category":"page"},{"location":"validate.html#StatsBase.counts-Tuple{ClusteringResult, ClusteringResult}","page":"Evaluation & Validation","title":"StatsBase.counts","text":"counts(a::ClusteringResult, b::ClusteringResult) -> Matrix{Int}\ncounts(a::ClusteringResult, b::AbstractVector{<:Integer}) -> Matrix{Int}\ncounts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}\n\nCalculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.\n\nReturns the n_a n_b matrix C, where n_a and n_b are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.\n\nThe clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.\n\n\n\n\n\n","category":"method"},{"location":"validate.html#Rand-index","page":"Evaluation & Validation","title":"Rand index","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Rand index is a measure of the similarity between the two data clusterings. From a mathematical standpoint, Rand index is related to the prediction accuracy, but is applicable even when the original class labels are not used.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"randindex","category":"page"},{"location":"validate.html#Clustering.randindex","page":"Evaluation & Validation","title":"Clustering.randindex","text":"randindex(a, b) -> NTuple{4, Float64}\n\nCompute the tuple of Rand-related indices between the clusterings c1 and c2.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nReturns a tuple of indices:\n\nHubert & Arabie Adjusted Rand index\nRand index (agreement probability)\nMirkin's index (disagreement probability)\nHubert's index (P(mathrmagree) - P(mathrmdisagree))\n\nReferences\n\nLawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218\n\nMeila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.\n\nSteinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Silhouettes","page":"Evaluation & Validation","title":"Silhouettes","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Silhouettes is a method for evaluating the quality of clustering. 
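A short sketch of comparing two clusterings of the same data with the cross-tabulation and Rand indices described above; the two kmeans runs and the cluster counts are arbitrary examples.

using Clustering

X = rand(5, 500)
a = kmeans(X, 3)
b = kmeans(X, 4)

C = counts(a, b)                           # 3×4 contingency matrix of the two clusterings
ari, ri, mirkin, hubert = randindex(a, b)  # Adjusted Rand, Rand, Mirkin's and Hubert's indices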
Particularly, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"The Silhouette value for the i-th data point is:","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"s_i = fracb_i - a_imax(a_i b_i) textwhere","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"a_i is the average distance from the i-th point to the other points in the same cluster z_i,\nb_i min_k ne z_i b_ik, where b_ik is the average distance from the i-th point to the points in the k-th cluster.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Note that s_i le 1, and that s_i is close to 1 when the i-th point lies well within its own cluster. This property allows using mean(silhouettes(assignments, counts, X)) as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"silhouettes","category":"page"},{"location":"validate.html#Clustering.silhouettes","page":"Evaluation & Validation","title":"Clustering.silhouettes","text":"silhouettes(assignments::Union{AbstractVector, ClusteringResult}, point_dists::Matrix) -> Vector{Float64}\nsilhouettes(assignments::Union{AbstractVector, ClusteringResult}, points::Matrix;\n metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}\n\nCompute silhouette values for individual points w.r.t. given clustering.\n\nReturns the n-length vector of silhouette values for each individual point.\n\nArguments\n\nassignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)\npoints::AbstractMatrix: if metric is nothing it is an nn matrix of pairwise distances between the points, otherwise it is an dn matrix of d dimensional clustered data points.\nmetric::Union{SemiMetric, Nothing}: an instance of Distances Metric object or nothing, indicating the distance metric used for calculating point distances.\nbatch_size::Union{Integer, Nothing}: if integer is given, calculate silhouettes in batches of batch_size points each, throws DimensionMismatch if batched calculation is not supported by given metric.\n\nReferences\n\nPeter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65. Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Variation-of-Information","page":"Evaluation & Validation","title":"Variation of Information","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Variation of information (also known as shared information distance) is a measure of the distance between the two clusterings. It is devised from the mutual information, but it is a true metric, i.e. 
it is symmetric and satisfies the triangle inequality.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"varinfo","category":"page"},{"location":"validate.html#Clustering.varinfo","page":"Evaluation & Validation","title":"Clustering.varinfo","text":"varinfo(a, b) -> Float64\n\nCompute the variation of information between the two clusterings of the same data points.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nReferences\n\nMeila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.\n\n\n\n\n\n","category":"function"},{"location":"validate.html#V-measure","page":"Evaluation & Validation","title":"V-measure","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"V-measure can be used to compare the clustering results with the existing class labels of data points or with the alternative clustering. It is defined as the harmonic mean of homogeneity (h) and completeness (c) of the clustering:","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"V_beta = (1+beta)frach cdot cbeta cdot h + c","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Both h and c can be expressed in terms of the mutual information and entropy measures from the information theory. Homogeneity (h) is maximized when each cluster contains elements of as few different classes as possible. Completeness (c) aims to put all elements of each class in single clusters. The beta parameter (beta 0) could used to control the weights of h and c in the final measure. If beta 1, completeness has more weight, and when beta 1 it's homogeneity.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"vmeasure","category":"page"},{"location":"validate.html#Clustering.vmeasure","page":"Evaluation & Validation","title":"Clustering.vmeasure","text":"vmeasure(a, b; [β = 1.0]) -> Float64\n\nV-measure between the two clusterings.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nThe β parameter defines trade-off between homogeneity and completeness:\n\nif β 1, completeness is weighted more strongly,\nif β 1, homogeneity is weighted more strongly.\n\nReferences\n\nAndrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Mutual-information","page":"Evaluation & Validation","title":"Mutual information","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Mutual information quantifies the \"amount of information\" obtained about one random variable through observing the other random variable. 
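A minimal sketch that combines the evaluation measures above: the mean silhouette value of a clustering plus its V-measure and variation of information against reference labels. The toy data, the random "truth" labels, and the pairwise distance matrix are illustrative assumptions.

using Clustering, Distances, Statistics

X = rand(5, 400)
R = kmeans(X, 3)
truth = rand(1:3, 400)                   # hypothetical reference labels, illustration only

D = pairwise(Euclidean(), X, dims=2)     # pairwise point distances for silhouettes
mean(silhouettes(R, D))                  # average silhouette value of the clustering
vmeasure(R, truth)                       # V-measure against the reference labels
varinfo(R, truth)                        # variation of information (smaller means closer)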
It is used in determining the similarity of two different clusterings of a dataset.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"mutualinfo","category":"page"},{"location":"validate.html#Clustering.mutualinfo","page":"Evaluation & Validation","title":"Clustering.mutualinfo","text":"mutualinfo(a, b; normed=true) -> Float64\n\nCompute the mutual information between the two clusterings of the same data points.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nIf normed parameter is true the return value is the normalized mutual information (symmetric uncertainty), see \"Data Mining Practical Machine Tools and Techniques\", Witten & Frank 2005.\n\nReferences\n\nVinh, Epps, and Bailey, (2009). “Information theoretic measures for clusterings comparison”.\n\nProceedings of the 26th Annual International Conference on Machine Learning - ICML ‘09.\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Confusion-matrix","page":"Evaluation & Validation","title":"Confusion matrix","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Pair confusion matrix arising from two clusterings is a 2×2 contingency table representation of the partition co-occurrence, see counts.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"confusion","category":"page"},{"location":"validate.html#Clustering.confusion","page":"Evaluation & Validation","title":"Clustering.confusion","text":"confusion([T = Int],\n a::Union{ClusteringResult, AbstractVector},\n b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}\n\nCalculate the confusion matrix of the two clusterings.\n\nReturns the 2×2 confusion matrix C of type T (Int by default) that represents partition co-occurrence or similarity matrix between two clusterings a and b by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters.\n\nConsidering a pair of samples that is in the same group as a positive pair, and a pair is in the different group as a negative pair, then the count of true positives is C₁₁, false negatives is C₁₂, false positives C₂₁, and true negatives is C₂₂:\n\n Positive Negative\nPositive C₁₁ C₁₂\nNegative C₂₁ C₂₂\n\n\n\n\n\n","category":"function"},{"location":"mcl.html#MCL-(Markov-Cluster-Algorithm)","page":"MCL (Markov Cluster Algorithm)","title":"MCL (Markov Cluster Algorithm)","text":"","category":"section"},{"location":"mcl.html","page":"MCL (Markov Cluster Algorithm)","title":"MCL (Markov Cluster Algorithm)","text":"Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. 
the clusters).","category":"page"},{"location":"mcl.html","page":"MCL (Markov Cluster Algorithm)","title":"MCL (Markov Cluster Algorithm)","text":"mcl\nMCLResult","category":"page"},{"location":"mcl.html#Clustering.mcl","page":"MCL (Markov Cluster Algorithm)","title":"Clustering.mcl","text":"mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult\n\nPerform MCL (Markov Cluster Algorithm) clustering using nn adjacency (points similarity) matrix adj.\n\nArguments\n\nKeyword arguments to control the MCL algorithm:\n\nadd_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph\nexpansion::Number (defaults to 2): MCL expansion constant\ninflation::Number (defaults to 2): MCL inflation constant\nsave_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge\nprune_tol::Number: pruning threshold\ndisplay, maxiter, tol: see common options\n\nReferences\n\nStijn van Dongen, \"Graph clustering by flow simulation\", 2001\n\nOriginal MCL implementation.\n\n\n\n\n\n","category":"function"},{"location":"mcl.html#Clustering.MCLResult","page":"MCL (Markov Cluster Algorithm)","title":"Clustering.MCLResult","text":"MCLResult <: ClusteringResult\n\nThe output of mcl function.\n\nFields\n\nmcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled\nassignments::Vector{Int}: indices of the points clusters. assignments[i] is the index of the cluster for the i-th point (0 if unassigned)\ncounts::Vector{Int}: the k-length vector of cluster sizes\nnunassigned::Int: the number of standalone points not assigned to any cluster\niterations::Int: the number of elapsed iterations\nrel_Δ::Float64: the final relative Δ\nconverged::Bool: whether the method converged\n\n\n\n\n\n","category":"type"},{"location":"index.html#Clustering.jl-package","page":"Introduction","title":"Clustering.jl package","text":"","category":"section"},{"location":"index.html","page":"Introduction","title":"Introduction","text":"Clustering.jl is a Julia package for data clustering. It covers the two aspects of data clustering:","category":"page"},{"location":"index.html","page":"Introduction","title":"Introduction","text":"Clustering Algorithms, e.g. K-means, K-medoids, Affinity propagation, and DBSCAN, etc.\nClustering Evaluation, e.g. Silhouettes and variational information.","category":"page"},{"location":"fuzzycmeans.html#fuzzy_cmeans_def","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"","category":"section"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"Fuzzy C-means is a clustering method that provides cluster membership weights instead of \"hard\" classification (e.g. 
K-means).","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"From a mathematical standpoint, fuzzy C-means solves the following optimization problem:","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"argmin_mathcalC sum_i=1^n sum_j=1^C w_ij^mu mathbfx_i - mathbfc_j ^2 \ntextwhere w_ij = left(sum_k=1^C left(fracleftmathbfx_i - mathbfc_j rightleftmathbfx_i - mathbfc_k rightright)^frac2mu-1right)^-1","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"Here, mathbfc_j is the center of the j-th cluster, w_ij is the membership weight of the i-th point in the j-th cluster, and mu 1 is a user-defined fuzziness parameter.","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"fuzzy_cmeans\nFuzzyCMeansResult\nwcounts","category":"page"},{"location":"fuzzycmeans.html#Clustering.fuzzy_cmeans","page":"Fuzzy C-means","title":"Clustering.fuzzy_cmeans","text":"fuzzy_cmeans(data::AbstractMatrix, C::Integer, fuzziness::Real;\n [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult\n\nPerform Fuzzy C-means clustering over the given data.\n\nArguments\n\ndata::AbstractMatrix: dn data matrix. Each column represents one d-dimensional data point.\nC::Integer: the number of fuzzy clusters, 2 C n.\nfuzziness::Real: clusters fuzziness (μ in the mathematical formulation), μ 1.\n\nOptional keyword arguments:\n\ndist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points\nmaxiter, tol, display, rng: see common options\n\n\n\n\n\n","category":"function"},{"location":"fuzzycmeans.html#Clustering.FuzzyCMeansResult","page":"Fuzzy C-means","title":"Clustering.FuzzyCMeansResult","text":"FuzzyCMeansResult{T<:AbstractFloat}\n\nThe output of fuzzy_cmeans function.\n\nFields\n\ncenters::Matrix{T}: the dC matrix with columns being the centers of resulting fuzzy clusters\nweights::Matrix{Float64}: the nC matrix of assignment weights (mathrmweights_ij is the weight (probability) of assigning i-th point to the j-th cluster)\niterations::Int: the number of executed algorithm iterations\nconverged::Bool: whether the procedure converged\n\n\n\n\n\n","category":"type"},{"location":"fuzzycmeans.html#Clustering.wcounts","page":"Fuzzy C-means","title":"Clustering.wcounts","text":"wcounts(R::ClusteringResult) -> Vector{Float64}\nwcounts(R::FuzzyCMeansResult) -> Vector{Float64}\n\nGet the weighted cluster sizes as the sum of weights of points assigned to each cluster.\n\nFor non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).\n\n\n\n\n\n","category":"function"},{"location":"fuzzycmeans.html#Examples","page":"Fuzzy C-means","title":"Examples","text":"","category":"section"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"using Clustering\n\n# make a random dataset with 1000 points\n# each point is a 5-dimensional vector\nX = rand(5, 1000)\n\n# performs Fuzzy C-means over X, trying to group them into 3 clusters\n# with a fuzziness factor of 2. Set maximum number of iterations to 200\n# set display to :iter, so it shows progressive info at each iteration\nR = fuzzy_cmeans(X, 3, 2, maxiter=200, display=:iter)\n\n# get the centers (i.e. 
weighted mean vectors)\n# M is a 5x3 matrix\n# M[:, k] is the center of the k-th cluster\nM = R.centers\n\n# get the point memberships over all the clusters\n# memberships is a 20x3 matrix\nmemberships = R.weights","category":"page"}] +[{"location":"hclust.html#Hierarchical-Clustering","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"","category":"section"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"hclust\nHclust","category":"page"},{"location":"hclust.html#Clustering.hclust","page":"Hierarchical Clustering","title":"Clustering.hclust","text":"hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust\n\nPerform hierarchical clustering using the distance matrix d and the cluster linkage function.\n\nReturns the dendrogram as a Hclust object.\n\nArguments\n\nd::AbstractMatrix: the pairwise distance matrix. d_ij is the distance between i-th and j-th points.\nlinkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:\n:single (the default): use the minimum distance between any of the cluster members\n:average: use the mean distance between any of the cluster members\n:complete: use the maximum distance between any of the members\n:ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters\n:ward_presquared: same as :ward, but assumes that the distances in d are already squared.\nuplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.\nbranchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:\n:r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)\n:barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the \"fast optimal leaf ordering\" algorithm from Bar-Joseph et. al. 
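Following the fuzzy C-means example, a small sketch of turning the membership weights into hard labels; picking the cluster with the largest weight per point is a common post-processing step assumed here for illustration, not part of the fuzzy_cmeans API.

using Clustering

X = rand(5, 1000)
R = fuzzy_cmeans(X, 3, 2.0; maxiter=200)

# hard labels: for every point, pick the cluster with the largest membership weight
hard = [argmax(view(R.weights, i, :)) for i in 1:size(R.weights, 1)]
wcounts(R)    # weighted cluster sizes (sums of the membership weights per cluster)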
Bioinformatics (2001)\n\n\n\n\n\n","category":"function"},{"location":"hclust.html#Clustering.Hclust","page":"Hierarchical Clustering","title":"Clustering.Hclust","text":"Hclust{T<:Real}\n\nThe output of hclust, hierarchical clustering of data points.\n\nProvides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.\n\nThis type mostly follows R's hclust class.\n\nFields\n\nmerges::Matrix{Int}: N2 matrix encoding subtree merges:\neach row specifies the left and right subtrees (referenced by their ids) that are merged\nnegative subtree id denotes the leaf node and corresponds to the data point at position -id\npositive id denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)\nlinkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)\nheights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage\norder::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.\n\nSee also: hclust.\n\n\n\n\n\n","category":"type"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"Single-linkage clustering using distance matrix:","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"using Clustering\nD = rand(1000, 1000);\nD += D'; # symmetric distance matrix (optional)\nresult = hclust(D, linkage=:single)","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"The resulting dendrogram could be converted into disjoint clusters with the help of cutree function.","category":"page"},{"location":"hclust.html","page":"Hierarchical Clustering","title":"Hierarchical Clustering","text":"cutree","category":"page"},{"location":"hclust.html#Clustering.cutree","page":"Hierarchical Clustering","title":"Clustering.cutree","text":"cutree(hclu::Hclust; [k], [h]) -> Vector{Int}\n\nCut the hclu dendrogram to produce clusters at the specified level of granularity.\n\nReturns the cluster assignments vector z (z_i is the index of the cluster for the i-th data point).\n\nArguments\n\nk::Integer (optional) the number of desired clusters.\nh::Real (optional) the height at which the tree is cut.\n\nIf both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.\n\nSee also: hclust\n\n\n\n\n\n","category":"function"},{"location":"init.html#clu_algo_init","page":"Initialization","title":"Initialization","text":"","category":"section"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"A clustering algorithm usually requires initialization before it could be started.","category":"page"},{"location":"init.html#Seeding","page":"Initialization","title":"Seeding","text":"","category":"section"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"Each seeding algorithm implemented by Clustering.jl is a subtype of 
SeedingAlgorithm:","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"SeedingAlgorithm\ninitseeds!\ninitseeds_by_costs!","category":"page"},{"location":"init.html#Clustering.SeedingAlgorithm","page":"Initialization","title":"Clustering.SeedingAlgorithm","text":"SeedingAlgorithm\n\nBase type for all seeding algorithms.\n\nEach seeding algorithm should implement the two functions: initseeds! and initseeds_by_costs!.\n\n\n\n\n\n","category":"type"},{"location":"init.html#Clustering.initseeds!","page":"Initialization","title":"Clustering.initseeds!","text":"initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,\n X::AbstractMatrix) -> iseeds\n\nInitialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.\n\n\n\n\n\n","category":"function"},{"location":"init.html#Clustering.initseeds_by_costs!","page":"Initialization","title":"Clustering.initseeds_by_costs!","text":"initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,\n costs::AbstractMatrix) -> iseeds\n\nInitialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.\n\nHere, costs[i, j] is the cost of assigning points i and j to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.\n\n\n\n\n\n","category":"function"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"There are several seeding methods described in the literature. Clustering.jl implements three popular ones:","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"KmppAlg\nKmCentralityAlg\nRandSeedAlg","category":"page"},{"location":"init.html#Clustering.KmppAlg","page":"Initialization","title":"Clustering.KmppAlg","text":"KmppAlg <: SeedingAlgorithm\n\nKmeans++ seeding (:kmpp).\n\nChooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.\n\nReferences\n\nD. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.\n\n\n\n\n\n","category":"type"},{"location":"init.html#Clustering.KmCentralityAlg","page":"Initialization","title":"Clustering.KmCentralityAlg","text":"KmCentralityAlg <: SeedingAlgorithm\n\nK-medoids initialization based on centrality (:kmcen).\n\nChoose the k points with the highest centrality as seeds.\n\nReferences\n\nHae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. 
doi:10.1016/j.eswa.2008.01.039\n\n\n\n\n\n","category":"type"},{"location":"init.html#Clustering.RandSeedAlg","page":"Initialization","title":"Clustering.RandSeedAlg","text":"RandSeedAlg <: SeedingAlgorithm\n\nRandom seeding (:rand).\n\nChooses an arbitrary subset of k data points as cluster seeds.\n\n\n\n\n\n","category":"type"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"In practice, we have found that Kmeans++ is the most effective choice.","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:","category":"page"},{"location":"init.html","page":"Initialization","title":"Initialization","text":"initseeds\ninitseeds_by_costs","category":"page"},{"location":"init.html#Clustering.initseeds","page":"Initialization","title":"Clustering.initseeds","text":"initseeds(alg::Union{SeedingAlgorithm, Symbol},\n X::AbstractMatrix, k::Integer) -> Vector{Int}\n\nSelect k seeds from a dn data matrix X using the alg algorithm.\n\nalg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.\n\nReturns the vector of k seed indices.\n\n\n\n\n\n","category":"function"},{"location":"init.html#Clustering.initseeds_by_costs","page":"Initialization","title":"Clustering.initseeds_by_costs","text":"initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},\n costs::AbstractMatrix, k::Integer) -> Vector{Int}\n\nSelect k seeds from the nn costs matrix using algorithm alg.\n\nHere, costs[i, j] is the cost of assigning points iandj` to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.\n\nReturns the vector of k seed indices.\n\n\n\n\n\n","category":"function"},{"location":"dbscan.html#DBSCAN","page":"DBSCAN","title":"DBSCAN","text":"","category":"section"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.","category":"page"},{"location":"dbscan.html#Density-Reachability","page":"DBSCAN","title":"Density Reachability","text":"","category":"section"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"DBSCAN's definition of a cluster is based on the concept of density reachability: a point q is said to be directly density reachable by another point p if the distance between them is below a specified threshold epsilon and p is surrounded by sufficiently many points. 
Then, q is considered to be density reachable by p if there exists a sequence p_1 p_2 ldots p_n such that p_1 = p and p_i+1 is directly density reachable from p_i.","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"The points within DBSCAN clusters are categorized into core (or seeds) and boundary:","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"All points of the cluster core are mutually density-connected, meaning that for any two distinct points p and q in a core, there exists a point o such that both p and q are density reachable from o.\nIf a point is density-connected to any point of a cluster core, it is also part of the core.\nAll points within the epsilon-neighborhood of any core point, but not belonging to that core (i.e. not density reachable from the core), are considered cluster boundary.","category":"page"},{"location":"dbscan.html#Interface","page":"DBSCAN","title":"Interface","text":"","category":"section"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"The implementation of DBSCAN algorithm provided by dbscan function supports the two ways of specifying clustering data:","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"The d times n matrix of point coordinates. This is the preferred method as it uses memory- and time-efficient neighboring points queries via NearestNeighbors.jl package.\nThe ntimes n matrix of precalculated pairwise point distances. It requires O(n^2) memory and time to run.","category":"page"},{"location":"dbscan.html","page":"DBSCAN","title":"DBSCAN","text":"dbscan\nDbscanResult\nDbscanCluster","category":"page"},{"location":"dbscan.html#Clustering.dbscan","page":"DBSCAN","title":"Clustering.dbscan","text":"dbscan(points::AbstractMatrix, radius::Real;\n [metric=Euclidean()],\n [min_neighbors=1], [min_cluster_size=1],\n [nntree_kwargs...]) -> DbscanResult\n\nCluster points using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.\n\nArguments\n\npoints: when metric is specified, the d×n matrix, where each column is a d-dimensional coordinate of a point; when metric=nothing, the n×n matrix of pairwise distances between the points\nradius::Real: neighborhood radius; points within this distance are considered neighbors\n\nOptional keyword arguments to control the algorithm:\n\nmetric (defaults to Euclidean()): the points distance metric to use, nothing means points is the n×n precalculated distance matrix\nmin_neighbors::Integer (defaults to 1): the minimal number of neighbors required to assign a point to a cluster \"core\"\nmin_cluster_size::Integer (defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discarded\nnntree_kwargs...: parameters (like leafsize) for the KDTree constructor\n\nExample\n\npoints = randn(3, 10000)\n# DBSCAN clustering, clusters with less than 20 points will be discarded:\nclustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)\n\nReferences:\n\nMartin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, \"A density-based algorithm for discovering clusters in large spatial databases with noise\", KDD-1996, pp. 226–231.\nErich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, \"DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN\", ACM Transactions on Database Systems, Vol.42(3)3, pp. 
1–21, https://doi.org/10.1145/3068335\n\n\n\n\n\n","category":"function"},{"location":"dbscan.html#Clustering.DbscanResult","page":"DBSCAN","title":"Clustering.DbscanResult","text":"DbscanResult <: ClusteringResult\n\nThe output of dbscan function.\n\nFields\n\nclusters::Vector{DbscanCluster}: clusters, length K\nseeds::Vector{Int}: indices of the first points of each cluster's core, length K\ncounts::Vector{Int}: cluster sizes (number of assigned points), length K\nassignments::Vector{Int}: vector of clusters indices, where each point was assigned to, length N\n\n\n\n\n\n","category":"type"},{"location":"dbscan.html#Clustering.DbscanCluster","page":"DBSCAN","title":"Clustering.DbscanCluster","text":"DbscanCluster\n\nDBSCAN cluster, part of DbscanResult returned by dbscan function.\n\nFields\n\nsize::Int: number of points in a cluster (core + boundary)\ncore_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)\nboundary_indices::Vector{Int}: indices of the cluster points outside of core\n\n\n\n\n\n","category":"type"},{"location":"kmeans.html#K-means","page":"K-means","title":"K-means","text":"","category":"section"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"textminimize sum_i=1^n mathbfx_i - boldsymbolmu_z_i ^2 textwrt (boldsymbolmu z)","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"Here, boldsymbolmu_k is the center of the k-th cluster, and z_i is an index of the cluster for i-th point mathbfx_i.","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"kmeans\nKmeansResult","category":"page"},{"location":"kmeans.html#Clustering.kmeans","page":"K-means","title":"Clustering.kmeans","text":"kmeans(X, k, [...]) -> KmeansResult\n\nK-means clustering of the dn data matrix X (each column of X is a d-dimensional data point) into k clusters.\n\nArguments\n\ninit (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:\na Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);\nan instance of SeedingAlgorithm;\nan integer vector of length k that provides the indices of points to use as initial seeds.\nweights: n-element vector of point weights (the cluster centers are the weighted means of cluster members)\nmaxiter, tol, display: see common options\n\n\n\n\n\n","category":"function"},{"location":"kmeans.html#Clustering.KmeansResult","page":"K-means","title":"Clustering.KmeansResult","text":"KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult\n\nThe output of kmeans and kmeans!.\n\nType parameters\n\nC<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix\nD<:Real: type of the assignment cost\nWC<:Real: type of the cluster weight\n\n\n\n\n\n","category":"type"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"If you already have a set of initial center vectors, kmeans! 
could be used:","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"kmeans!","category":"page"},{"location":"kmeans.html#Clustering.kmeans!","page":"K-means","title":"Clustering.kmeans!","text":"kmeans!(X, centers; [kwargs...]) -> KmeansResult\n\nUpdate the current cluster centers (dk matrix, where d is the dimension and k the number of centroids) using the dn data matrix X (each column of X is a d-dimensional data point).\n\nSee kmeans for the description of optional kwargs.\n\n\n\n\n\n","category":"function"},{"location":"kmeans.html#Examples","page":"K-means","title":"Examples","text":"","category":"section"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"using Clustering\n\n# make a random dataset with 1000 random 5-dimensional points\nX = rand(5, 1000)\n\n# cluster X into 20 clusters using K-means\nR = kmeans(X, 20; maxiter=200, display=:iter)\n\n@assert nclusters(R) == 20 # verify the number of clusters\n\na = assignments(R) # get the assignments of points to clusters\nc = counts(R) # get the cluster sizes\nM = R.centers # get the cluster centers","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"Scatter plot of the K-means clustering results:","category":"page"},{"location":"kmeans.html","page":"K-means","title":"K-means","text":"using RDatasets, Clustering, Plots\niris = dataset(\"datasets\", \"iris\"); # load the data\n\nfeatures = collect(Matrix(iris[:, 1:4])'); # features to use for clustering\nresult = kmeans(features, 3); # run K-means for the 3 clusters\n\n# plot with the point color mapped to the assigned cluster index\nscatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,\n color=:lightrainbow, legend=false)","category":"page"},{"location":"algorithms.html#clu_algo_basics","page":"Basics","title":"Basics","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"The package implements a variety of clustering algorithms:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Pages = [\"kmeans.md\", \"kmedoids.md\", \"hclust.md\", \"mcl.md\",\n \"affprop.md\", \"dbscan.md\", \"fuzzycmeans.md\"]","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.","category":"page"},{"location":"algorithms.html#Inputs","page":"Basics","title":"Inputs","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Data matrix X of size d times n, the i-th column of X (X[:, i]) is a data point (data sample) in d-dimensional space.\nDistance matrix D of size n times n, where D_ij is the distance between the i-th and j-th points, or the cost of assigning them to the same cluster.","category":"page"},{"location":"algorithms.html#common_options","page":"Basics","title":"Common Options","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"Many clustering algorithms are iterative procedures. 
The functions share the basic options for controlling the iterations:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"maxiter::Integer: maximum number of iterations.\ntol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.\ndisplay::Symbol: the level of information to be displayed. It may take one of the following values:\n:none: nothing is shown\n:final: only shows a brief summary when the algorithm ends\n:iter: shows the progress at each iteration","category":"page"},{"location":"algorithms.html#Results","page":"Basics","title":"Results","text":"","category":"section"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"A clustering function would return an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and the information about the clustering algorithm (e.g. the number of iterations and whether it converged).","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"ClusteringResult","category":"page"},{"location":"algorithms.html#Clustering.ClusteringResult","page":"Basics","title":"Clustering.ClusteringResult","text":"ClusteringResult\n\nBase type for the output of clustering algorithm.\n\n\n\n\n\n","category":"type"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"The following generic methods are supported by any subtype of ClusteringResult:","category":"page"},{"location":"algorithms.html","page":"Basics","title":"Basics","text":"nclusters(::ClusteringResult)\ncounts(::ClusteringResult)\nwcounts(::ClusteringResult)\nassignments(::ClusteringResult)","category":"page"},{"location":"algorithms.html#Clustering.nclusters-Tuple{ClusteringResult}","page":"Basics","title":"Clustering.nclusters","text":"nclusters(R::ClusteringResult) -> Int\n\nGet the number of clusters.\n\n\n\n\n\n","category":"method"},{"location":"algorithms.html#StatsBase.counts-Tuple{ClusteringResult}","page":"Basics","title":"StatsBase.counts","text":"counts(R::ClusteringResult) -> Vector{Int}\n\nGet the vector of cluster sizes.\n\ncounts(R)[k] is the number of points assigned to the k-th cluster.\n\n\n\n\n\n","category":"method"},{"location":"algorithms.html#Clustering.wcounts-Tuple{ClusteringResult}","page":"Basics","title":"Clustering.wcounts","text":"wcounts(R::ClusteringResult) -> Vector{Float64}\nwcounts(R::FuzzyCMeansResult) -> Vector{Float64}\n\nGet the weighted cluster sizes as the sum of weights of points assigned to each cluster.\n\nFor non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).\n\n\n\n\n\n","category":"method"},{"location":"algorithms.html#Clustering.assignments-Tuple{ClusteringResult}","page":"Basics","title":"Clustering.assignments","text":"assignments(R::ClusteringResult) -> Vector{Int}\n\nGet the vector of cluster indices for each point.\n\nassignments(R)[i] is the index of the cluster to which the i-th point is assigned.\n\n\n\n\n\n","category":"method"},{"location":"kmedoids.html#K-medoids","page":"K-medoids","title":"K-medoids","text":"","category":"section"},{"location":"kmedoids.html","page":"K-medoids","title":"K-medoids","text":"K-medoids is a clustering algorithm that works by finding k data points (called medoids) such that the total 
distance between each data point and the closest medoid is minimal.","category":"page"},{"location":"kmedoids.html","page":"K-medoids","title":"K-medoids","text":"kmedoids\nkmedoids!\nKmedoidsResult","category":"page"},{"location":"kmedoids.html#Clustering.kmedoids","page":"K-medoids","title":"Clustering.kmedoids","text":"kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult\n\nPerform K-medoids clustering of n points into k clusters, given the dist matrix (nn, dist[i, j] is the distance between the j-th and i-th points).\n\nArguments\n\ninit (defaults to :kmpp): how medoids should be initialized, could be one of the following:\na Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).\nan integer vector of length k that provides the indices of points to use as initial medoids.\nmaxiter, tol, display: see common options\n\nNote\n\nThe function implements a K-means style algorithm instead of PAM (Partitioning Around Medoids). K-means style algorithm converges in fewer iterations, but was shown to produce worse (10-20% higher total costs) results (see e.g. Schubert & Rousseeuw (2019)).\n\n\n\n\n\n","category":"function"},{"location":"kmedoids.html#Clustering.kmedoids!","page":"K-medoids","title":"Clustering.kmedoids!","text":"kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};\n [kwargs...]) -> KmedoidsResult\n\nUpdate the current cluster medoids using the dist matrix.\n\nThe medoids field of the returned KmedoidsResult points to the same array as medoids argument.\n\nSee kmedoids for the description of optional kwargs.\n\n\n\n\n\n","category":"function"},{"location":"kmedoids.html#Clustering.KmedoidsResult","page":"K-medoids","title":"Clustering.KmedoidsResult","text":"KmedoidsResult{T} <: ClusteringResult\n\nThe output of kmedoids function.\n\nFields\n\nmedoids::Vector{Int}: the indices of k medoids\nassignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the i-th point\ncosts::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning i-th point to its medoid\ncounts::Vector{Int}: cluster sizes\ntotalcost::Float64: total assignment cost (the sum of costs)\niterations::Int: the number of executed algorithm iterations\nconverged::Bool: whether the procedure converged\n\n\n\n\n\n","category":"type"},{"location":"kmedoids.html#kmedoid_refs","page":"K-medoids","title":"References","text":"","category":"section"},{"location":"kmedoids.html","page":"K-medoids","title":"K-medoids","text":"Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955\nSchubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16","category":"page"},{"location":"affprop.html#Affinity-Propagation","page":"Affinity Propagation","title":"Affinity Propagation","text":"","category":"section"},{"location":"affprop.html","page":"Affinity Propagation","title":"Affinity Propagation","text":"Affinity propagation is a clustering algorithm based on message passing between data points. 
Similar to K-medoids, it looks at the (dis)similarities in the data, picks one exemplar data point for each cluster, and assigns every point in the data set to the cluster with the closest exemplar.","category":"page"},{"location":"affprop.html","page":"Affinity Propagation","title":"Affinity Propagation","text":"affinityprop\nAffinityPropResult","category":"page"},{"location":"affprop.html#Clustering.affinityprop","page":"Affinity Propagation","title":"Clustering.affinityprop","text":"affinityprop(S::AbstractMatrix; [maxiter=200], [tol=1e-6], [damp=0.5],\n [display=:none]) -> AffinityPropResult\n\nPerform affinity propagation clustering based on a similarity matrix S.\n\nS_ij (i j) is the similarity (or the negated distance) between the i-th and j-th points, S_ii defines the availability of the i-th point as an exemplar.\n\nArguments\n\ndamp::Real: the dampening coefficient, 0 mathrmdamp 1. Larger values indicate slower (and probably more stable) update. mathrmdamp = 0 disables dampening.\nmaxiter, tol, display: see common options\n\nReferences\n\nBrendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.\n\n\n\n\n\n","category":"function"},{"location":"affprop.html#Clustering.AffinityPropResult","page":"Affinity Propagation","title":"Clustering.AffinityPropResult","text":"AffinityPropResult <: ClusteringResult\n\nThe output of affinity propagation clustering (affinityprop).\n\nFields\n\nexemplars::Vector{Int}: indices of exemplars (cluster centers)\nassignments::Vector{Int}: cluster assignments for each data point\niterations::Int: number of iterations executed\nconverged::Bool: converged or not\n\n\n\n\n\n","category":"type"},{"location":"validate.html#clu_validate","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Clustering.jl package provides a number of methods to evaluate the results of a clustering algorithm and/or to validate its correctness.","category":"page"},{"location":"validate.html#Cross-tabulation","page":"Evaluation & Validation","title":"Cross tabulation","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Cross tabulation, or contingency matrix, is a basis for many clustering quality measures. 
It shows how similar are the two clusterings on a cluster level.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Clustering.jl extends StatsBase.counts() with methods that accept ClusteringResult arguments:","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"counts(a::ClusteringResult, b::ClusteringResult)","category":"page"},{"location":"validate.html#StatsBase.counts-Tuple{ClusteringResult, ClusteringResult}","page":"Evaluation & Validation","title":"StatsBase.counts","text":"counts(a::ClusteringResult, b::ClusteringResult) -> Matrix{Int}\ncounts(a::ClusteringResult, b::AbstractVector{<:Integer}) -> Matrix{Int}\ncounts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}\n\nCalculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.\n\nReturns the n_a n_b matrix C, where n_a and n_b are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.\n\nThe clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.\n\n\n\n\n\n","category":"method"},{"location":"validate.html#Rand-index","page":"Evaluation & Validation","title":"Rand index","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Rand index is a measure of the similarity between the two data clusterings. From a mathematical standpoint, Rand index is related to the prediction accuracy, but is applicable even when the original class labels are not used.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"randindex","category":"page"},{"location":"validate.html#Clustering.randindex","page":"Evaluation & Validation","title":"Clustering.randindex","text":"randindex(a, b) -> NTuple{4, Float64}\n\nCompute the tuple of Rand-related indices between the clusterings c1 and c2.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nReturns a tuple of indices:\n\nHubert & Arabie Adjusted Rand index\nRand index (agreement probability)\nMirkin's index (disagreement probability)\nHubert's index (P(mathrmagree) - P(mathrmdisagree))\n\nReferences\n\nLawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218\n\nMeila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.\n\nSteinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Silhouettes","page":"Evaluation & Validation","title":"Silhouettes","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Silhouettes is a method for evaluating the quality of clustering. 
Particularly, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"The Silhouette value for the i-th data point is:","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"s_i = fracb_i - a_imax(a_i b_i) textwhere","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"a_i is the average distance from the i-th point to the other points in the same cluster z_i,\nb_i min_k ne z_i b_ik, where b_ik is the average distance from the i-th point to the points in the k-th cluster.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Note that s_i le 1, and that s_i is close to 1 when the i-th point lies well within its own cluster. This property allows using mean(silhouettes(assignments, counts, X)) as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"silhouettes","category":"page"},{"location":"validate.html#Clustering.silhouettes","page":"Evaluation & Validation","title":"Clustering.silhouettes","text":"silhouettes(assignments::Union{AbstractVector, ClusteringResult}, point_dists::Matrix) -> Vector{Float64}\nsilhouettes(assignments::Union{AbstractVector, ClusteringResult}, points::Matrix;\n metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}\n\nCompute silhouette values for individual points w.r.t. given clustering.\n\nReturns the n-length vector of silhouette values for each individual point.\n\nArguments\n\nassignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)\npoints::AbstractMatrix: if metric is nothing it is an nn matrix of pairwise distances between the points, otherwise it is an dn matrix of d dimensional clustered data points.\nmetric::Union{SemiMetric, Nothing}: an instance of Distances Metric object or nothing, indicating the distance metric used for calculating point distances.\nbatch_size::Union{Integer, Nothing}: if integer is given, calculate silhouettes in batches of batch_size points each, throws DimensionMismatch if batched calculation is not supported by given metric.\n\nReferences\n\nPeter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65. Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Variation-of-Information","page":"Evaluation & Validation","title":"Variation of Information","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Variation of information (also known as shared information distance) is a measure of the distance between the two clusterings. It is devised from the mutual information, but it is a true metric, i.e. 
it is symmetric and satisfies the triangle inequality.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"varinfo","category":"page"},{"location":"validate.html#Clustering.varinfo","page":"Evaluation & Validation","title":"Clustering.varinfo","text":"varinfo(a, b) -> Float64\n\nCompute the variation of information between the two clusterings of the same data points.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nReferences\n\nMeila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.\n\n\n\n\n\n","category":"function"},{"location":"validate.html#V-measure","page":"Evaluation & Validation","title":"V-measure","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"V-measure can be used to compare the clustering results with the existing class labels of data points or with the alternative clustering. It is defined as the harmonic mean of homogeneity (h) and completeness (c) of the clustering:","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"V_beta = (1+beta)frach cdot cbeta cdot h + c","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Both h and c can be expressed in terms of the mutual information and entropy measures from the information theory. Homogeneity (h) is maximized when each cluster contains elements of as few different classes as possible. Completeness (c) aims to put all elements of each class in single clusters. The beta parameter (beta 0) could used to control the weights of h and c in the final measure. If beta 1, completeness has more weight, and when beta 1 it's homogeneity.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"vmeasure","category":"page"},{"location":"validate.html#Clustering.vmeasure","page":"Evaluation & Validation","title":"Clustering.vmeasure","text":"vmeasure(a, b; [β = 1.0]) -> Float64\n\nV-measure between the two clusterings.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nThe β parameter defines trade-off between homogeneity and completeness:\n\nif β 1, completeness is weighted more strongly,\nif β 1, homogeneity is weighted more strongly.\n\nReferences\n\nAndrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Mutual-information","page":"Evaluation & Validation","title":"Mutual information","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Mutual information quantifies the \"amount of information\" obtained about one random variable through observing the other random variable. 
It is used in determining the similarity of two different clusterings of a dataset.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"mutualinfo","category":"page"},{"location":"validate.html#Clustering.mutualinfo","page":"Evaluation & Validation","title":"Clustering.mutualinfo","text":"mutualinfo(a, b; normed=true) -> Float64\n\nCompute the mutual information between the two clusterings of the same data points.\n\na and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).\n\nIf normed parameter is true the return value is the normalized mutual information (symmetric uncertainty), see \"Data Mining Practical Machine Tools and Techniques\", Witten & Frank 2005.\n\nReferences\n\nVinh, Epps, and Bailey, (2009). “Information theoretic measures for clusterings comparison”.\n\nProceedings of the 26th Annual International Conference on Machine Learning - ICML ‘09.\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Confusion-matrix","page":"Evaluation & Validation","title":"Confusion matrix","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"Pair confusion matrix arising from two clusterings is a 2×2 contingency table representation of the partition co-occurrence, see counts.","category":"page"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"confusion","category":"page"},{"location":"validate.html#Clustering.confusion","page":"Evaluation & Validation","title":"Clustering.confusion","text":"confusion([T = Int],\n a::Union{ClusteringResult, AbstractVector},\n b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}\n\nCalculate the confusion matrix of the two clusterings.\n\nReturns the 2×2 confusion matrix C of type T (Int by default) that represents partition co-occurrence or similarity matrix between two clusterings a and b by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters.\n\nConsidering a pair of samples that is in the same group as a positive pair, and a pair is in the different group as a negative pair, then the count of true positives is C₁₁, false negatives is C₁₂, false positives C₂₁, and true negatives is C₂₂:\n\n Positive Negative\nPositive C₁₁ C₁₂\nNegative C₂₁ C₂₂\n\n\n\n\n\n","category":"function"},{"location":"validate.html#Other-packages","page":"Evaluation & Validation","title":"Other packages","text":"","category":"section"},{"location":"validate.html","page":"Evaluation & Validation","title":"Evaluation & Validation","text":"ClusteringBenchmarks.jl provides benchmark datasets and implements additional methods for evaluating clustering performance.","category":"page"},{"location":"mcl.html#MCL-(Markov-Cluster-Algorithm)","page":"MCL (Markov Cluster Algorithm)","title":"MCL (Markov Cluster Algorithm)","text":"","category":"section"},{"location":"mcl.html","page":"MCL (Markov Cluster Algorithm)","title":"MCL (Markov Cluster Algorithm)","text":"Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. 
the clusters).","category":"page"},{"location":"mcl.html","page":"MCL (Markov Cluster Algorithm)","title":"MCL (Markov Cluster Algorithm)","text":"mcl\nMCLResult","category":"page"},{"location":"mcl.html#Clustering.mcl","page":"MCL (Markov Cluster Algorithm)","title":"Clustering.mcl","text":"mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult\n\nPerform MCL (Markov Cluster Algorithm) clustering using nn adjacency (points similarity) matrix adj.\n\nArguments\n\nKeyword arguments to control the MCL algorithm:\n\nadd_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph\nexpansion::Number (defaults to 2): MCL expansion constant\ninflation::Number (defaults to 2): MCL inflation constant\nsave_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge\nprune_tol::Number: pruning threshold\ndisplay, maxiter, tol: see common options\n\nReferences\n\nStijn van Dongen, \"Graph clustering by flow simulation\", 2001\n\nOriginal MCL implementation.\n\n\n\n\n\n","category":"function"},{"location":"mcl.html#Clustering.MCLResult","page":"MCL (Markov Cluster Algorithm)","title":"Clustering.MCLResult","text":"MCLResult <: ClusteringResult\n\nThe output of mcl function.\n\nFields\n\nmcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled\nassignments::Vector{Int}: indices of the points clusters. assignments[i] is the index of the cluster for the i-th point (0 if unassigned)\ncounts::Vector{Int}: the k-length vector of cluster sizes\nnunassigned::Int: the number of standalone points not assigned to any cluster\niterations::Int: the number of elapsed iterations\nrel_Δ::Float64: the final relative Δ\nconverged::Bool: whether the method converged\n\n\n\n\n\n","category":"type"},{"location":"index.html#Clustering.jl-package","page":"Introduction","title":"Clustering.jl package","text":"","category":"section"},{"location":"index.html","page":"Introduction","title":"Introduction","text":"Clustering.jl is a Julia package for data clustering. It covers the two aspects of data clustering:","category":"page"},{"location":"index.html","page":"Introduction","title":"Introduction","text":"Clustering Algorithms, e.g. K-means, K-medoids, Affinity propagation, and DBSCAN, etc.\nClustering Evaluation, e.g. Silhouettes and variational information.","category":"page"},{"location":"fuzzycmeans.html#fuzzy_cmeans_def","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"","category":"section"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"Fuzzy C-means is a clustering method that provides cluster membership weights instead of \"hard\" classification (e.g. 
K-means).","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"From a mathematical standpoint, fuzzy C-means solves the following optimization problem:","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"argmin_mathcalC sum_i=1^n sum_j=1^C w_ij^mu mathbfx_i - mathbfc_j ^2 \ntextwhere w_ij = left(sum_k=1^C left(fracleftmathbfx_i - mathbfc_j rightleftmathbfx_i - mathbfc_k rightright)^frac2mu-1right)^-1","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"Here, mathbfc_j is the center of the j-th cluster, w_ij is the membership weight of the i-th point in the j-th cluster, and mu 1 is a user-defined fuzziness parameter.","category":"page"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"fuzzy_cmeans\nFuzzyCMeansResult\nwcounts","category":"page"},{"location":"fuzzycmeans.html#Clustering.fuzzy_cmeans","page":"Fuzzy C-means","title":"Clustering.fuzzy_cmeans","text":"fuzzy_cmeans(data::AbstractMatrix, C::Integer, fuzziness::Real;\n [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult\n\nPerform Fuzzy C-means clustering over the given data.\n\nArguments\n\ndata::AbstractMatrix: dn data matrix. Each column represents one d-dimensional data point.\nC::Integer: the number of fuzzy clusters, 2 C n.\nfuzziness::Real: clusters fuzziness (μ in the mathematical formulation), μ 1.\n\nOptional keyword arguments:\n\ndist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points\nmaxiter, tol, display, rng: see common options\n\n\n\n\n\n","category":"function"},{"location":"fuzzycmeans.html#Clustering.FuzzyCMeansResult","page":"Fuzzy C-means","title":"Clustering.FuzzyCMeansResult","text":"FuzzyCMeansResult{T<:AbstractFloat}\n\nThe output of fuzzy_cmeans function.\n\nFields\n\ncenters::Matrix{T}: the dC matrix with columns being the centers of resulting fuzzy clusters\nweights::Matrix{Float64}: the nC matrix of assignment weights (mathrmweights_ij is the weight (probability) of assigning i-th point to the j-th cluster)\niterations::Int: the number of executed algorithm iterations\nconverged::Bool: whether the procedure converged\n\n\n\n\n\n","category":"type"},{"location":"fuzzycmeans.html#Clustering.wcounts","page":"Fuzzy C-means","title":"Clustering.wcounts","text":"wcounts(R::ClusteringResult) -> Vector{Float64}\nwcounts(R::FuzzyCMeansResult) -> Vector{Float64}\n\nGet the weighted cluster sizes as the sum of weights of points assigned to each cluster.\n\nFor non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).\n\n\n\n\n\n","category":"function"},{"location":"fuzzycmeans.html#Examples","page":"Fuzzy C-means","title":"Examples","text":"","category":"section"},{"location":"fuzzycmeans.html","page":"Fuzzy C-means","title":"Fuzzy C-means","text":"using Clustering\n\n# make a random dataset with 1000 points\n# each point is a 5-dimensional vector\nX = rand(5, 1000)\n\n# performs Fuzzy C-means over X, trying to group them into 3 clusters\n# with a fuzziness factor of 2. Set maximum number of iterations to 200\n# set display to :iter, so it shows progressive info at each iteration\nR = fuzzy_cmeans(X, 3, 2, maxiter=200, display=:iter)\n\n# get the centers (i.e. 
weighted mean vectors)\n# M is a 5x3 matrix\n# M[:, k] is the center of the k-th cluster\nM = R.centers\n\n# get the point memberships over all the clusters\n# memberships is a 20x3 matrix\nmemberships = R.weights","category":"page"}] } diff --git a/dev/validate.html b/dev/validate.html index bd35fcf8..410fb651 100644 --- a/dev/validate.html +++ b/dev/validate.html @@ -1,8 +1,8 @@ -Evaluation & Validation · Clustering.jl

Evaluation & Validation

The Clustering.jl package provides a number of methods to evaluate the results of a clustering algorithm and/or to validate its correctness.

Cross tabulation

Cross tabulation, or the contingency matrix, is the basis for many clustering quality measures. It shows how similar the two clusterings are at the cluster level.

Clustering.jl extends StatsBase.counts() with methods that accept ClusteringResult arguments:

StatsBase.countsMethod
counts(a::ClusteringResult, b::ClusteringResult) -> Matrix{Int}
+Evaluation & Validation · Clustering.jl

Evaluation & Validation

The Clustering.jl package provides a number of methods to evaluate the results of a clustering algorithm and/or to validate its correctness.

Cross tabulation

Cross tabulation, or the contingency matrix, is the basis for many clustering quality measures. It shows how similar the two clusterings are at the cluster level.

Clustering.jl extends StatsBase.counts() with methods that accept ClusteringResult arguments:

StatsBase.countsMethod
counts(a::ClusteringResult, b::ClusteringResult) -> Matrix{Int}
 counts(a::ClusteringResult, b::AbstractVector{<:Integer}) -> Matrix{Int}
-counts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}

Calculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.

Returns the $n_a × n_b$ matrix C, where $n_a$ and $n_b$ are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.

The clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.

source

Rand index

Rand index is a measure of the similarity between two data clusterings. From a mathematical standpoint, the Rand index is related to prediction accuracy, but it is applicable even when the original class labels are not used.

Clustering.randindexFunction
randindex(a, b) -> NTuple{4, Float64}

Compute the tuple of Rand-related indices between the clusterings a and b.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

Returns a tuple of indices:

  • Hubert & Arabie Adjusted Rand index
  • Rand index (agreement probability)
  • Mirkin's index (disagreement probability)
  • Hubert's index ($P(\mathrm{agree}) - P(\mathrm{disagree})$)

References

Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.

Steinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396

source

Silhouettes

Silhouettes is a method for evaluating the quality of a clustering. In particular, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.

The Silhouette value for the $i$-th data point is:

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}\]

  • $a_i$ is the average distance from the $i$-th point to the other points in the same cluster $z_i$,
  • $b_i ≝ \min_{k \ne z_i} b_{ik}$, where $b_{ik}$ is the average distance from the $i$-th point to the points in the $k$-th cluster.

Note that $s_i \le 1$, and that $s_i$ is close to $1$ when the $i$-th point lies well within its own cluster. This property allows using the average silhouette value, e.g. mean(silhouettes(assignments, point_dists)), as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.

Clustering.silhouettesFunction
silhouettes(assignments::Union{AbstractVector, ClusteringResult}, point_dists::Matrix) -> Vector{Float64}
+counts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}

Calculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.

Returns the $n_a × n_b$ matrix C, where $n_a$ and $n_b$ are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.

The clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.

source
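
As a quick illustration (the data below is random and purely for illustration, so the actual numbers will vary), counts can be applied to two clusterings of the same points:

using Clustering

X = rand(5, 100)      # 100 random 5-dimensional points
a = kmeans(X, 3)      # one clustering
b = kmeans(X, 4)      # another clustering of the same points

C = counts(a, b)      # 3×4 contingency matrix
# C[i, j] is the number of points assigned to cluster i by a
# and to cluster j by b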

Rand index

Rand index is a measure of the similarity between two data clusterings. From a mathematical standpoint, the Rand index is related to prediction accuracy, but it is applicable even when the original class labels are not used.

Clustering.randindexFunction
randindex(a, b) -> NTuple{4, Float64}

Compute the tuple of Rand-related indices between the clusterings a and b.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

Returns a tuple of indices:

  • Hubert & Arabie Adjusted Rand index
  • Rand index (agreement probability)
  • Mirkin's index (disagreement probability)
  • Hubert's index ($P(\mathrm{agree}) - P(\mathrm{disagree})$)

References

Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.

Steinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396

source
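
For example (the assignment vectors below are made up purely for illustration), randindex can be applied directly to two assignment vectors:

using Clustering

truth = [1, 1, 1, 2, 2, 2, 3, 3, 3]   # hypothetical reference labels
clu   = [1, 1, 2, 2, 2, 2, 3, 3, 1]   # clustering to evaluate

ari, ri, mirkin, hubert = randindex(truth, clu)
# ari is the Hubert & Arabie adjusted Rand index,
# ri is the (unadjusted) Rand index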

Silhouettes

Silhouettes is a method for evaluating the quality of a clustering. In particular, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.

The Silhouette value for the $i$-th data point is:

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}\]

  • $a_i$ is the average distance from the $i$-th point to the other points in the same cluster $z_i$,
  • $b_i ≝ \min_{k \ne z_i} b_{ik}$, where $b_{ik}$ is the average distance from the $i$-th point to the points in the $k$-th cluster.

Note that $s_i \le 1$, and that $s_i$ is close to $1$ when the $i$-th point lies well within its own cluster. This property allows using the average silhouette value, e.g. mean(silhouettes(assignments, point_dists)), as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.

Clustering.silhouettesFunction
silhouettes(assignments::Union{AbstractVector, ClusteringResult}, point_dists::Matrix) -> Vector{Float64}
 silhouettes(assignments::Union{AbstractVector, ClusteringResult}, points::Matrix;
-            metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}

Compute silhouette values for individual points w.r.t. given clustering.

Returns the $n$-length vector of silhouette values for each individual point.

Arguments

  • assignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)
  • points::AbstractMatrix: if metric is nothing, an $n×n$ matrix of pairwise distances between the points; otherwise, a $d×n$ matrix of $d$-dimensional clustered data points.
  • metric::Union{SemiMetric, Nothing}: an instance of a Distances.jl metric object, or nothing, indicating the distance metric used for calculating point distances.
  • batch_size::Union{Integer, Nothing}: if an integer is given, calculate silhouettes in batches of batch_size points each; throws DimensionMismatch if batched calculation is not supported by the given metric.

References

Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65.
Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

source

Variation of Information

Variation of information (also known as shared information distance) is a measure of the distance between two clusterings. It is derived from the mutual information, but, unlike mutual information, it is a true metric, i.e. it is symmetric and satisfies the triangle inequality.

Clustering.varinfoFunction
varinfo(a, b) -> Float64

Compute the variation of information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

References

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.

source

V-measure

V-measure can be used to compare the clustering results with the existing class labels of data points or with an alternative clustering. It is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$) of the clustering:

\[V_{\beta} = (1+\beta)\frac{h \cdot c}{\beta \cdot h + c}.\]

Both $h$ and $c$ can be expressed in terms of the mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) aims to put all elements of each class into a single cluster. The $\beta$ parameter ($\beta > 0$) can be used to control the relative weights of $h$ and $c$ in the final measure. If $\beta > 1$, completeness has more weight; if $\beta < 1$, homogeneity has more weight.

Clustering.vmeasureFunction
vmeasure(a, b; [β = 1.0]) -> Float64

V-measure between the two clusterings.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

The β parameter defines trade-off between homogeneity and completeness:

  • if $β > 1$, completeness is weighted more strongly,
  • if $β < 1$, homogeneity is weighted more strongly.

References

Andrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure

source

Mutual information

Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

Clustering.mutualinfoFunction
mutualinfo(a, b; normed=true) -> Float64

Compute the mutual information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

If the normed parameter is true, the return value is the normalized mutual information (symmetric uncertainty); see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank 2005.

References

Vinh, Epps, and Bailey, (2009). “Information theoretic measures for clusterings comparison”.

Proceedings of the 26th Annual International Conference on Machine Learning - ICML ‘09.

source

Confusion matrix

The pair confusion matrix arising from two clusterings is a 2×2 contingency table representation of the partition co-occurrence; see counts.

Clustering.confusionFunction
confusion([T = Int],
+            metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}

Compute silhouette values for individual points w.r.t. given clustering.

Returns the $n$-length vector of silhouette values for each individual point.

Arguments

  • assignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)
  • points::AbstractMatrix: if metric is nothing, an $n×n$ matrix of pairwise distances between the points; otherwise, a $d×n$ matrix of $d$-dimensional clustered data points.
  • metric::Union{SemiMetric, Nothing}: an instance of a Distances.jl metric object, or nothing, indicating the distance metric used for calculating point distances.
  • batch_size::Union{Integer, Nothing}: if an integer is given, calculate silhouettes in batches of batch_size points each; throws DimensionMismatch if batched calculation is not supported by the given metric.

References

Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65.
Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

source
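
For example, a minimal sketch (using random data, purely for illustration) that computes the average silhouette value of a K-means clustering:

using Clustering, Distances, Statistics

X = rand(5, 300)                           # 300 random 5-dimensional points
R = kmeans(X, 3)                           # cluster them into 3 clusters

# silhouette value of each point, based on pairwise Euclidean distances
dists = pairwise(Euclidean(), X, dims=2)
sil = silhouettes(R, dists)

mean(sil)   # average silhouette value as an overall quality measure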

Variation of Information

Variation of information (also known as shared information distance) is a measure of the distance between two clusterings. It is derived from the mutual information, but, unlike mutual information, it is a true metric, i.e. it is symmetric and satisfies the triangle inequality.

Clustering.varinfoFunction
varinfo(a, b) -> Float64

Compute the variation of information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

References

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.

source
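
For example (the assignment vectors below are made up for illustration):

using Clustering

a = [1, 1, 1, 2, 2, 2]
b = [1, 1, 2, 2, 3, 3]

varinfo(a, b)   # 0.0 would indicate identical clusterings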

V-measure

V-measure can be used to compare the clustering results with the existing class labels of data points or with an alternative clustering. It is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$) of the clustering:

\[V_{\beta} = (1+\beta)\frac{h \cdot c}{\beta \cdot h + c}.\]

Both $h$ and $c$ can be expressed in terms of the mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) aims to put all elements of each class into a single cluster. The $\beta$ parameter ($\beta > 0$) can be used to control the relative weights of $h$ and $c$ in the final measure. If $\beta > 1$, completeness has more weight; if $\beta < 1$, homogeneity has more weight.

Clustering.vmeasureFunction
vmeasure(a, b; [β = 1.0]) -> Float64

V-measure between the two clusterings.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

The β parameter defines trade-off between homogeneity and completeness:

  • if $β > 1$, completeness is weighted more strongly,
  • if $β < 1$, homogeneity is weighted more strongly.

References

Andrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure

source
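
For example (the label vectors below are made up for illustration):

using Clustering

labels = [1, 1, 1, 2, 2, 2]      # hypothetical ground-truth classes
clu    = [1, 1, 2, 2, 2, 2]      # clustering to evaluate

vmeasure(labels, clu)            # β = 1: h and c weighted equally
vmeasure(labels, clu, β = 2.0)   # β > 1: completeness weighted more strongly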

Mutual information

Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

Clustering.mutualinfoFunction
mutualinfo(a, b; normed=true) -> Float64

Compute the mutual information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

If the normed parameter is true, the return value is the normalized mutual information (symmetric uncertainty); see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank 2005.

References

Vinh, Epps, and Bailey, (2009). “Information theoretic measures for clusterings comparison”.

Proceedings of the 26th Annual International Conference on Machine Learning - ICML ‘09.

source
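
For example (the assignment vectors below are made up for illustration):

using Clustering

a = [1, 1, 1, 2, 2, 2]
b = [1, 1, 2, 2, 3, 3]

mutualinfo(a, b)                  # normalized mutual information (default)
mutualinfo(a, b, normed = false)  # unnormalized mutual information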

Confusion matrix

The pair confusion matrix arising from two clusterings is a 2×2 contingency table representation of the partition co-occurrence; see counts.

Clustering.confusionFunction
confusion([T = Int],
           a::Union{ClusteringResult, AbstractVector},
-          b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}

Calculate the confusion matrix of the two clusterings.

Returns the 2×2 confusion matrix C of type T (Int by default) that represents the partition co-occurrence (similarity) between the two clusterings a and b, obtained by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters.

Treating a pair of samples that belongs to the same group as a positive pair, and a pair that belongs to different groups as a negative pair, the count of true positives is C₁₁, false negatives is C₁₂, false positives is C₂₁, and true negatives is C₂₂:

           Positive   Negative
Positive   C₁₁        C₁₂
Negative   C₂₁        C₂₂
source
+ b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}

Calculate the confusion matrix of the two clusterings.

Returns the 2×2 confusion matrix C of type T (Int by default) that represents the partition co-occurrence (similarity) between the two clusterings a and b, obtained by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters.

Treating a pair of samples that belongs to the same group as a positive pair, and a pair that belongs to different groups as a negative pair, the count of true positives is C₁₁, false negatives is C₁₂, false positives is C₂₁, and true negatives is C₂₂:

           Positive   Negative
Positive   C₁₁        C₁₂
Negative   C₂₁        C₂₂
source
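
For example (the assignment vectors below are made up for illustration):

using Clustering

a = [1, 1, 1, 2, 2, 2]   # hypothetical reference labels
b = [1, 1, 2, 2, 2, 2]   # clustering to compare against them

C = confusion(a, b)      # 2×2 pair confusion matrix of Int counts
# C[1, 1] counts pairs placed in the same cluster by both a and b,
# C[2, 2] counts pairs placed in different clusters by both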

Other packages

  • ClusteringBenchmarks.jl provides benchmark datasets and implements additional methods for evaluating clustering performance.