This repository contains naive implementations of the k-NN algorithm in several languages (for k = 1), and an extra one in Golang with some optimizations.
This was inspired by a chain of blog posts:
- the last one implemented it in Factor
- the one before that did it in Rust.
- that one was inspired by a blog post which had it in F# and OCaml, and a follow-up which improves the first OCaml implementation.
This repository adds the naive implementation in Golang and Haskell, an additional implementation in Golang with some optimizations, and thanks to this guy, a couple implementations in Julia and Octave.
Performance comparisons between the naive implementations in each language were performed on a freshly spun up c3.xlarge EC2 instance as follows:
- Install Golang, Haskell, Rust, F#, OCaml and Octave, either with
apt-get
or building from source./configure && make && make install # ...etc
. Install the latest nightly Julia build withdpkg
. Download Factor. - Write the (naive) code for Golang and Haskell. Copy-paste the code for Rust, F#, OCaml, Julia, Octave, and Factor.
- Compile executable binaries for Haskell, Rust, F#, and OCaml. Run the Factor code in the
scratchpad
REPL and the Octave code in the Octave REPL. Run the Golang code withgo run
and the Julia code withjulia
.
- Julia: 1.557s - 3.432s*
- Octave: 2.540s
- Golang: 4.701s
- Factor: 6.358s
- OCaml: 12.757s
- F#: 23.507s
- Rust: 78.138s
- Haskell: 91.581s
Julia has an asterisk because its initial run is slower than all subsequent runs. I'll have to rethink how to make this a fair experiment...
$ time go run golang-k-nn.go
0.944
real 0m4.701s
user 0m4.582s
sys 0m0.136s
$ time ./haskell-k-nn
Percentage correct: 472
real 1m31.581s
user 1m29.191s
sys 0m2.384s
$ time ./rust-k-nn
Percentage correct: 94.4%
real 1m18.138s
user 1m17.980s
sys 0m0.155s
$ time ./fsharp-k-nn.exe
start...
Percentage correct:94.400000
real 0m23.507s
user 0m22.751s
sys 0m0.798s
$ julia julia/julia-k-nn.jl
Percentage correct: 94.4%
elapsed time: 3.438748733 seconds (656566752 bytes allocated, 2.03% gc time)
Percentage correct: 94.4%
elapsed time: 1.560125781 seconds (566037680 bytes allocated, 6.06% gc time)
$ time ./ocaml-k-nn
Percentage correct:94.400000
real 0m12.757s
user 0m12.500s
sys 0m0.257s
$ cp octave/octave-k-nn.m octaveknn.m
$ octave
octave:1> octaveknn;
Percentage correct: 94.400000%
Elapsed time is 2.53968 seconds.
$ rm octaveknn.m
$ mkdir -p $FACTOR_HOME/work/k-nn
$ cp factor/factor-k-nn.factor $FACTOR_HOME/work/k-nn
$ cp *.csv $FACTOR_HOME/work/k-nn
$ $FACTOR_HOME/factor
IN: scratchpad USE: factor-k-nn
Loading resource:work/k-nn/factor-k-nn.factor
Loading resource:basis/formatting/formatting.factor
Loading resource:basis/formatting/formatting-docs.factor
IN: scratchpad gc [ k-nn ] time
Percentage correct: 94.400000
Running time: 6.357621145 seconds
For Golang, an additional implementation is given which is signficantly faster, but suffers no loss in accuracy. It involves two optimizations:
- Short-circuit distance calculations between a test case and a training case that are necessarily suboptimal. In other words, if you know the distance to one potential nearest neighbour is 100, and half-way through calculating the distance to another potential nearest neighbour you already have a distance-so-far of 105, stop calculating and move on to the next candidate for nearest neighbour.
- Use goroutines to parallelize the computations. The way this was done was not ideal, because the parallelism isn't in the classification algorithm itself, instead it parellelizes the classification of the members of the validation sample. However, it's is easy enough to "do it right", and what's currently there is good enough to see how significant the gains are when firing on all your cores.
$ time go run golang-k-nn-speedup.go
0.944
real 0m1.375s
user 0m3.314s
sys 0m0.117s
- Tell me if I should use special compiler flags to improve performance for some of the languages.
- Tell my why this experiment is invalid.
- Improve naive implementations without changing the spirit of the algorithm (e.g. use eager evaluation in Haskell).
- Add optimized implementations of k-NN which improve performance at no cost to accuracy.
- Add implementations for other languages (with compilation and/or run instructions).
- Make the Go stuff a useable package
- Explore Accuracy vs Time tradeoffs of only considering some of the training set or some of the pixels when classifying a test case
- Make it easy to experiment with a matrix of different options
- Do it in C
- Do it in Python with scikit-learn