Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why is the kmeans algorithm column-oriented instead of row-oriented? #79

Open
paulhendricks opened this issue Jan 3, 2017 · 20 comments
Labels

Comments

@paulhendricks
Copy link

In the docs (below), the kmeans algorithm takes a matrix where each column X[:, i] corresponds to an observed sample. This implementation goes against the idea of tidy data as well as differs from Python's scikit-learn implementation of kmeans and R's base implementation of kmeans.

Is there a good reason for this? Should this algorithm be changed from column-oriented to row-oriented so as to be consistent with R and Python as well as with the concept of tidy data?

URL: http://clusteringjl.readthedocs.io/en/stable/overview.html

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Sample matrix X, where each column X[:,i] corresponds to an observed sample.
  • Distance matrix D, where D[i,j] indicates the distance between samples i and j, or the cost of assigning one to the other.
@ararslan
Copy link
Member

ararslan commented Jan 3, 2017

There were discussions about this a while back (though I'm struggling to locate one now), and ultimately it came down to the original package author's personal preference. Apparently observations-as-columns is the norm in machine learning, at least according to some, and observations-as-rows is the norm in classical statistics. Personally I'm in the camp of observations being rows and have been bitten by this package's use of columns.

The problem with changing it at this point is that existing code will suddenly become very wrong, and it's not something we'd be able to catch and emit a deprecation warning for. We'd have to redesign the API and deprecate the existing API in favor of the new one.

@nalimilan
Copy link
Member

nalimilan commented Jan 3, 2017

Unless we can find another package where clustering works that way, I'd favor moving to rows as observations. A possible deprecation path would be to add a vardim argument, and require passing e.g. vardim=1 (with a default of 2) to get the current behavior without a warning. Then we would switch the default in the next release, and possibly deprecate the option later.

EDIT: see also JuliaStats/Distances.jl#35

@ararslan
Copy link
Member

ararslan commented Jan 3, 2017

@nalimilan That sounds perfect. Simple and largely non-disruptive. Nice idea! 💯

@paulhendricks
Copy link
Author

Sounds like a plan. I can take a first stab at implementing this and submit a pull request for your review.

@ararslan
Copy link
Member

ararslan commented Jan 4, 2017

That sounds great, @paulhendricks. Thanks!

@paulhendricks
Copy link
Author

Awesome. As far as I can tell, kmeans is the only clustering algorithm that needs changed to close this issue.

Below is a first pass at implementing @nalimilan 's suggestion:

const _kmeans_default_init = :kmpp
const _kmeans_default_maxiter = 100
const _kmeans_default_tol = 1.0e-6
const _kmeans_default_display = :none
const _vardim = 2

function kmeans!{T<:AbstractFloat}(X::Matrix{T}, centers::Matrix{T};
                                   weights=nothing,
                                   maxiter::Integer=_kmeans_default_maxiter,
                                   tol::Real=_kmeans_default_tol,
                                   display::Symbol=_kmeans_default_display,
                                   vardim::Integer=_vardim)

    # Here, we use vardim to assess if matrix X is
    # column-oriented (vardim == 1) or row-oriented (vardim == 2).
    # By default, X should be row-oriented. Use vardim=1
    # to recover the old behavior where X is column-oriented.
    #
    # See this issue for more details:
    # https://github.com/JuliaStats/Clustering.jl/issues/79
    info("vardim defaults to 2 in version x.x.x and will default to 1 in x.x.x.")
    info("vardim will then be deprecated in version x.x.x.")
    if vardim == 2:
      # If matrix X is row-oriented (vardim == 2), transpose
      # it to be column-oriented. When vardim is deprecated,
      # this transposition will be the first operation.
      X = transpose(X)

    # Once X is column-oriented, the rest of algorithm
    # can run as normal with no need for modification.
    m, n = size(X)
    m2, k = size(centers)
    m == m2 || throw(DimensionMismatch("Inconsistent array dimensions."))
    (2 <= k < n) || error("k must have 2 <= k < n.")

    assignments = zeros(Int, n)
    costs = zeros(T, n)
    counts = Array(Int, k)
    cweights = Array(Float64, k)

Does this look good 👍 ?

@nalimilan
Copy link
Member

Use Base.depwarn to print warnings; and no need to say in what version it will be removed (it's always in the next major version). Also, you need to check that vardim is equal either to 1 or to 2.

@kmsquire
Copy link
Contributor

Just as a point of reference, one of the original arguments for having observations as columns was that Julia arrays are in column-major order, so this allows data for individual observations to be stored contiguously.

See, e.g., http://julialang.org/blog/2013/09/fast-numeric, especially under "Write cache-friendly codes".

@nalimilan
Copy link
Member

Right, that's probably how Dahua chose the arrangement to use. Benchmarks are presented in Distances.jl README: https://github.com/JuliaStats/Distances.jl#benchmarks

Though I wonder whether it really makes a difference since the text says the computations are done by calling BLAS's gemm. Since the array is contiguous anyway, BLAS should be able to choose the best way to perform the computation efficiently. I can't believe SciKit would use a deliberately slow convention.

Somebody would have to run the benchmarks using the rowwise convention to check this.

@kmsquire
Copy link
Contributor

I can't believe SciKit would use a deliberately slow convention.

SciKit Learn has a great interface, but in my experience, it doesn't always have the fastest implementation.

More importantly, Python/numpy are row-major order by default, so matching that convention already keeps the Dara contiguous/cache friendly.

@kmsquire
Copy link
Contributor

(But running the benchmarks is definitely the way to go!)

Since Julia just obtained a Transpose type, this actually shouldn't be that costly.

@nalimilan
Copy link
Member

More importantly, Python/numpy are row-major order by default, so matching that convention already keeps the Dara contiguous/cache friendly.

I didn't know that. It's indeed a good explanation. OTOH, R is column-major, but it doesn't care so much about speed...

@nalimilan
Copy link
Member

A confirmation that R's dist function is slow due to storing observations in rows despite being column-major: https://stat.ethz.ch/pipermail/r-devel/2017-June/074492.html

@andreasnoack
Copy link
Member

Most relevant comments have already been made here but just wanted to point out that it seems that ML algorithms are often expressed and written by observation, i.e. for OLS that would be (∑x*x')\(∑x*y) instead of the matrix notation (X'X)\(X'y). The former can be convenient if data is just considered a stream. However, it is often wrongly argued that the former is also faster when each x is contiguous but because of (mainly) BLAS3 calls, the matrix formulation, if available, will typically be quite a bit faster than the "streaming"/"vector" formulation.

@nalimilan
Copy link
Member

nalimilan commented Feb 13, 2019

pairwise in Distances has just gained support for a dims keyword argument, plus a deprecation of the default method which assumes observations are stored as columns (JuliaStats/Distances.jl#121). kmeans should probably do the same thing.

@alyst
Copy link
Member

alyst commented Feb 13, 2019

It's fine to allow dims, especially if the implementation can choose the optimal algorithm depending on the data layout. But many methods are only optimized for the "observations as columns" case. Shouldn't we still keep it as the default dims= to guide users to the proper layout?
I'd also try to keep the signatures of all clustering methods in sync. So if we are deprecating the default dims= for kmeans(), we should also do it for the other methods.

I assume that the major motivation of the change is Tables.jl support, which has "observations as rows" layout. We definitely need to support Tables.jl, but maybe the default strategy should be to transpose the table before clustering?

@nalimilan
Copy link
Member

I assume that the major motivation of the change is Tables.jl support, which has "observations as rows" layout. We definitely need to support Tables.jl, but maybe the default strategy should be to transpose the table before clustering?

Tables.jl really isn't my motivation for this change, as it can easily gain an option to transpose the columns, which should be relatively efficient compared with the cost of computing pairwise distances and clustering (JuliaData/Tables.jl#66). The main reason is that I want to ensure consistency of functions that take a dimension argument across the ecosystem, and cov in Statistics assumes dims=1 (observations as rows) by default. This is compounded by the fact that most other implementations of distances/clustering use observations as rows.

Shouldn't we still keep it as the default dims= to guide users to the proper layout?

Given what I said above, I'd rather require users to choose explicitly between consistency with the rest of Julia and the world (dims=1) and performance (dims=2). Hopefully, clearly documenting what's the best solution for performance is OK. Ironically, the fact that the dims argument is so hard to get right without reading the docs ensures people will read that warning. ;-)

We should also provide a Tables.jl-based interface which will automatically transpose the data under the hood, so that the efficient approach is used and people don't have to think about it.

@nalimilan
Copy link
Member

JuliaStats/Distances.jl#123 implements the convenience Tables.jl interface.

I've tried measuring the performance difference between dims=2 and dims=1 on permutedims(a), and I couldn't observe any significant difference for Euclidean. Does anybody have an idea of a benchmark which could show a difference?

@mkborregaard
Copy link

mkborregaard commented Feb 3, 2020

As far as I can see, MultivariateStats is unfortunately also implemented columns-are-observations - see e.g. the PCA example

julia> using MultivariateStats, RDatasets
julia> iris = dataset("datasets", "iris");
julia> # split half to training set
       Xtr = convert(Matrix,iris[:,1:4])'
4×150 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 5.1  4.9  4.7  4.6  5.0  5.4  4.6  5.0    6.8  6.7  6.7  6.3  6.5  6.2  5.9
 3.5  3.0  3.2  3.1  3.6  3.9  3.4  3.4     3.2  3.3  3.0  2.5  3.0  3.4  3.0
 1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5     5.9  5.7  5.2  5.0  5.2  5.4  5.1
 0.2  0.2  0.2  0.2  0.2  0.4  0.3  0.2     2.3  2.5  2.3  1.9  2.0  2.3  1.8

julia> fit(PCA, Xtr)
PCA(indim = 4, outdim = 3, principalratio = 0.9947878161267246)

julia> transform(ans, Xtr)
3×150 Array{Float64,2}:
 2.68413     2.71414    2.88899      -1.76435    -1.90094   -1.39019
 0.319397   -0.177001  -0.144949       0.0788589   0.116628  -0.282661
 0.0279148   0.210464  -0.0179003     -0.130482   -0.723252  -0.36291

In my view, observations-as-columns are practically never found in the wild outside some parts of Julia's ecosystem, and it would seem quite necessary to converge on observations-as-rows throughout, in spite of the issues with deprecations.

@jebej
Copy link

jebej commented May 26, 2020

As an potential additional data point (against the observations-as-row view), observations as columns is nice when converting between a complex number vector and the array used for clustering. Since complex numbers are stored as (real,imag), converting to an observations-as-columns array is as easy as reinterpreting:

reshape(reinterpret(Float64,cvals::Vector{Complex{Float64}}),2,length(cvals))

No transpose is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants