
Implement K-Nearest Neighbours #87

Closed
ablaom opened this issue Feb 20, 2019 · 12 comments
Labels
enhancement New feature or request

Comments

@ablaom
Member

ablaom commented Feb 20, 2019

The current built-in KNNRegressor is only a toy I wrote for testing purposes. It would be nice to replace it with something more serious.

Possibilities:

(1) The package NearestNeighbors.jl provides a well-featured search engine for finding neighbours but does not actually implement regressors/classifiers, as far as I can tell. I've asked the author if he knows of an implementation.

(2) Expedient option: wrap the sklearn models

(3) Existing package I missed?

@ablaom ablaom added the enhancement New feature or request label Feb 20, 2019
@tlienart
Collaborator

There's also FLANN.jl, a wrapper for FLANN, though it is possibly out of date.
NearestNeighbors.jl looks quite solid.

@KristofferC
Contributor

If it helps, the following packages use NearestNeighbors.jl:

 CausalityTools
 Celeste
 ChaosTools
 Clustering
 ConcaveHull
 ConceptnetNumberbatch
 ConvexSwitch
 CrossMappings
 DelayEmbeddings
 DynamicalSystems
 DynamicalSystemsBase
 GeoStatsDevTools
 InverseDistanceWeighting
 Kriging
 LinearAlgebraicRepresentation
 LocalFunctionApproximation
 LocallyWeightedRegression
 MovingWeightedLeastSquares
 QuickShiftClustering
 RealNeuralNetworks
 ScatteredInterpolation
 StateSpaceReconstruction
 TimeseriesPrediction
 TransferEntropy

@kirtsar
Collaborator

kirtsar commented Feb 25, 2019

This is my toy implementation:

using NearestNeighbors   # KDTree, knn
using Distances          # Euclidean
using StatsBase: mode    # majority vote

struct KNN_classifier{T, S}
    tree::T
    label::S
    k::Int
end

function knn_clf(X, y, k = 3; metric = Euclidean(), leafsize = 10)
    # NearestNeighbors wants observations as columns, hence the transpose
    rawX = Array(convert(Matrix, X)')
    kd = KDTree(rawX, metric; leafsize = leafsize)
    return KNN_classifier(kd, y, k)
end

function predict(clf, X_test)
    Xraw = Array(Matrix(X_test)')
    neighbors = knn(clf.tree, Xraw, clf.k)[1]            # indices of the k nearest neighbours
    y_pred = [mode(clf.label[ns]) for ns in neighbors]   # unweighted majority vote
    return y_pred
end
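For illustration (not part of the original snippet), with the packages loaded as above it can be used on random data like this:

X = rand(100, 3)                        # 100 observations, 3 features
y = rand(["setosa", "versicolor"], 100)

clf = knn_clf(X, y, 5)
predict(clf, rand(10, 3))               # Vector of predicted labels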

@ablaom
Member Author

ablaom commented Feb 26, 2019

So, it seems to me that while we have an excellent package on which to base our search for nearest neighbours - NearestNeighbors.jl - we do not have complete classifier/regressor algorithms implemented anywhere. But I don't think filling in the gaps is too hard, and we'll just do this ourselves.

I suggest we provide the same options as in sk-learn, which include an option for the search algorithm: kd, ball, brute or auto, although I don't know exactly how "auto" decides which to use. Here are the docs:

classifier:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

regressor:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor

In the regression case one can probably just edit my existing “toy” at MLJ/src/builtins/KNN. In the classifier case one additionally needs to remember to ensure the predictions preserve the categorical pool of the target passed to fit, as explained in MLJ/doc/adding_new_models.md. As we are writing the algorithm and not wrapping one, this is straightforward, i.e., we don't need any of the encoder/decoder business.

@kirtsar Are you happy for me to assign you to this issue? I'm happy to answer further questions on Slack. Small note: your toy above is not in the MLJ mould, as you have bundled hyperparameters and learned parameters into one object. In MLJ the "model" only contains hyperparameters, while the learned parameters are part of the output of fit, called the fitresult. For details refer to the aforementioned guide and look at the existing KNN code. Also, it is not enough to return a mode for classifiers, as the points get weighted according to inverse distance (or some other custom function).
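To illustrate the hyperparameter/fitresult split, here is a minimal sketch of the pattern from the guide; FooRegressor (a shrunken-mean predictor) and its single hyperparameter are made up purely for illustration:

import MLJBase

# the model struct holds hyperparameters only; nothing learned from data:
mutable struct FooRegressor <: MLJBase.Deterministic{Any}
    shrinkage::Float64
end
FooRegressor(; shrinkage = 1.0) = FooRegressor(shrinkage)

# learned parameters go into the fitresult returned by fit:
function MLJBase.fit(model::FooRegressor, verbosity::Int, X, y)
    fitresult = model.shrinkage * sum(y) / length(y)   # the "learned" parameter
    return fitresult, nothing, nothing                 # fitresult, cache, report
end

# predict receives the fitresult, never the training data directly:
function MLJBase.predict(model::FooRegressor, fitresult, Xnew)
    n = size(MLJBase.matrix(Xnew), 1)
    return fill(fitresult, n)
end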

Technical notes:

(i) As we already have a KNN in MLJ/src/builtins/KNN.jl, let's put any new code in the same place. I may move it to MLJModels later after sorting out testing dependencies.

(ii) Perhaps we use Distances.jl as our source for distance functions, and to avoid extra type parameters we could skip the custom weight option?
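For what it's worth, Distances.jl metrics plug straight into the NearestNeighbors.jl tree constructors, so no extra machinery should be needed. A quick sketch:

using Distances, NearestNeighbors

x, y = rand(3), rand(3)
evaluate(Euclidean(), x, y)          # ordinary Euclidean distance
evaluate(Cityblock(), x, y)          # L1 distance

data = rand(3, 50)                   # 50 points in 3 dimensions, points as columns
tree = BallTree(data, Cityblock())   # any Distances.Metric works for BallTree/BruteTree
idxs, dists = knn(tree, rand(3), 5)  # indices and distances of the 5 nearest neighbours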

@kirtsar
Collaborator

kirtsar commented Feb 26, 2019

@ablaom
Sorry, that was my first attempt. I've read the docs, and the example now looks like this:

module KNN

export KNNClassifier

import MLJBase

using LinearAlgebra
using Distances          # Euclidean, Chebyshev, Minkowski, Cityblock, Metric
using NearestNeighbors   # KDTree, BallTree, BruteTree

#> needed for all classifiers:
using CategoricalArrays

# to be extended:

const KNNFitResultType{T} = Tuple{Matrix{T},Vector{T}}

# TODO: introduce type parameters for the function fields (metric, kernel)


mutable struct KNNClassifier{F} <: MLJBase.Deterministic{KNNFitResultType{F}}
    target_type::Type{F}
    K::Int                    # number of local target values averaged
    metric::Distances.Metric
    tree_type::Symbol         # :BallTree, :KDTree or :BruteTree
    leafsize::Int
end


function KNNClassifier(;
                       target_type = Int,
                       K = 5,
                       metric = Euclidean(),
                       tree_type = :KDTree,
                       leafsize = 10
                       )
    model = KNNClassifier{target_type}(
        target_type,
        K,
        metric,
        tree_type,
        leafsize
    )
    message = MLJBase.clean!(model)   #> future-proof by including these
    isempty(message) || @warn message #> two lines even if no clean! is defined below

    return model
end


function MLJBase.clean!(model::KNNClassifier)
    message = ""
    if model.K <= 0
        model.K = 1
        message *= "K must be positive. K set to 1. "
    end
    if (model.tree_type == :KDTree) &&
        !(model.metric isa Union{Euclidean, Chebyshev, Minkowski, Cityblock})
        model.tree_type = :BallTree
        message *= "KDTree supports only these metrics: " *
                   "Euclidean, Chebyshev, Minkowski, Cityblock. " *
                   "tree_type set to :BallTree. "
    end
    if model.leafsize <= 0
        model.leafsize = 10
        message *= "Leafsize cannot be less than 1. Leafsize set to 10. "
    end
    return message
end


function MLJBase.fit(model::KNNClassifier{F}
                     , verbosity::Int
                     , X
                     , y::Vector{F2}) where {F,F2}

    F == F2 || error("Model specifies target_type=$F but target type is $F2.")

    Xraw = MLJBase.matrix(X)
    # bad place: right now KDTree and the others accept only a dense matrix, no adjoint allowed
    Xmatrix = Matrix(Xraw')
    # poor choice, because now the function is type-unstable:
    # it can return any tree type
    if model.tree_type == :KDTree
        tree = KDTree(Xmatrix, model.metric; leafsize = model.leafsize)
    elseif model.tree_type == :BallTree
        tree = BallTree(Xmatrix, model.metric; leafsize = model.leafsize)
    else # BruteTree (leafsize is irrelevant for brute-force search)
        tree = BruteTree(Xmatrix, model.metric)
    end
    fitresult = (tree, y)
    cache = nothing
    report = nothing

    return fitresult, cache, report
end


function MLJBase.predict(model::KNNClassifier, fitresult, Xnew)
    # to be done
end

# metadata to be done

end # module

@kirtsar
Collaborator

kirtsar commented Feb 26, 2019

I believe there will be a lot of pain with categorical arrays. As you can see, fit now only accepts Vector{T}, which is not categorical. I don't know how I should declare the types because, for example, for the Iris dataset the type of y is this monster:

CategoricalArray{String,1,UInt8,String,CategoricalString{UInt8},Union{}}

and this is one of the easiest ones, without any "missing".
So now I'm a little puzzled about how we should treat categorical arrays.

@kirtsar
Collaborator

kirtsar commented Feb 26, 2019

By the way, some kernels may be found here:
http://trthatcher.github.io/MLKernels.jl/dev/
and here:
https://github.com/johnmyleswhite/SmoothingKernels.jl
The latter is really out of date, but the implementation is OK, I think.

@ablaom
Member Author

ablaom commented Feb 26, 2019

@kirtsar Code is looking great, thanks!

Re: type instability. I suggest you introduce the tree type as a type parameter of the model.

Re: no adjoint for NearestNeighbors. We probably have to live with this, at least for now, because X is a generic table (it could be row- or column-based). That is, we could supplement the MLJBase.matrix method with an MLJ.adjoint method, but it's not clear this would actually be an improvement for most tables.

Re: categorical typing: I don't understand why you want to annotate the type of y in fit. As I understand it, this makes no difference to compilation. FYI, y will always be a categorical vector, because that is what the machine interface will pass in its call to fit whenever your model is a classifier (detected by the output_kind trait, soon to be renamed target_scitype). No need to do type-checking here, as the machine interface will do that (eventually :-)).

That said, categorical arrays do make it a pain to declare a concrete fitresult type (the FitresultType in the code above needs to be changed). See, for example, MLJModels/src/GaussianProcesses.jl for a worst-case scenario. However, you can relax this to an abstract type (e.g., Any) and the only loss of performance at present is in large ensembles of your model; see the guide and here. I'm not sure this pain can be avoided without performance loss, but am open to suggestions [Edit: but see now #93]. BTW, at present the fitresult field of a machine struct is not typed, but I viewed typing this as "premature optimisation."
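To illustrate the point about not annotating y: a CategoricalVector is still an AbstractVector, so methods need no special signature to accept it. A small example (not from the codebase, just for illustration):

using CategoricalArrays

y = categorical(["setosa", "versicolor", "setosa", "virginica"])
levels(y)                     # ["setosa", "versicolor", "virginica"]
y[2:3]                        # indexing preserves the pool

# an unannotated (or AbstractVector-annotated) method accepts it as-is:
n_classes(v::AbstractVector) = length(unique(v))
n_classes(y)                  # 3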

@tlienart
Collaborator

Re: no adjoint for NearestNeighbors. We probably have to live with this, at least for now, because X is a generic table (it could be row- or column-based). That is, we could supplement the MLJBase.matrix method with an MLJ.adjoint method, but it's not clear this would actually be an improvement for most tables.

This PR, JuliaData/Tables.jl#66, will introduce a keyword in Tables.matrix to allow for transposition, i.e. it takes any table object and materializes a 2D matrix, either with row-based or column-based filling depending on the keyword (likely transpose=true). This should remove the issue we've had of having to transpose, copy the transpose, use permutedims, etc. internally.
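Concretely, that would turn the current pattern into something like this (the keyword name is the one suggested in the PR discussion, so treat it as tentative):

using Tables

X = (x1 = rand(5), x2 = rand(5))     # any Tables.jl-compatible table

# today: materialize column-major, then copy the transpose
Xmatrix = Matrix(Tables.matrix(X)')  # 2×5, observations as columns

# after JuliaData/Tables.jl#66 (assuming the keyword lands as transpose=true):
Xmatrix = Tables.matrix(X; transpose = true)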

@ablaom
Member Author

ablaom commented Feb 28, 2019

Further to our Slack conversation, I suggest you drop the eltype parameter F in your code suggestion above. Just use Any for the fitresult type (see #93).

To keep your fit type-stable, you will need a TreeType parameter, so something like:

mutable struct KNNClassifier{TreeType} <: MLJBase.Deterministic{Any}
    K::Int                      # number of local target values averaged
    metric::Distances.Metric
    tree_type::Symbol           # :BallTree, :KDTree or :BruteTree
end

Your keyword constructor can determine the value of TreeType from
tree_type (see the sketch further below). You write three fit methods,
one for each tree type. So the fit for :KDTree will look something like:

function MLJBase.fit(model::KNNClassifier{KDTree}
                     , verbosity::Int
                     , X
                     , y)
    ...

    fitresult = (tree, y)
    return fitresult, nothing, nothing
end

(You can wrap any duplicate code shared by the three methods in
subroutines if you want, or just do one fit and check the compiler is
now smart enough to infer the correct return type.)
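For concreteness, here is a sketch of the keyword constructor idea, matching the field list in the struct above (defaults borrowed from your earlier code):

function KNNClassifier(; K = 5, metric = Euclidean(), tree_type = :KDTree)
    # map the symbol to the NearestNeighbors tree type used as the parameter:
    TreeType = tree_type == :KDTree   ? KDTree  :
               tree_type == :BallTree ? BallTree :
                                        BruteTree
    model = KNNClassifier{TreeType}(K, metric, tree_type)
    message = MLJBase.clean!(model)
    isempty(message) || @warn message
    return model
end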

Note that there is no need to annotate the type of y here, but the
machine (or task) interface will ensure that this is a
CategoricalVector (because you are going to make the trait
declaration output_kind(KNNClassifier) = [:multiclass, :ordered_factor_infinite] or, after upcoming revisions, something
similar).

Whether the elements of y are CategoricalString or CategoricalValue{Int}
or whatever is immaterial. Your predict method can return a categorical
vector of the same kind without knowing what this type is. So, something
like this:

function MLJBase.predict(model::KNNClassifier, fitresult, Xnew)
    tree, y = fitresult

    Xraw = MLJBase.matrix(Xnew)
    Xmatrix = Matrix(Xraw')  # do this better after JuliaData/Tables.jl#66

    function predict_on_column(i) # `i` will be the index of a column of `Xmatrix`
        < code to get vector `indxs` of indices of K-nearest neighbours, using `tree` >
        ys = y[indxs]
        < code to compute "weighted mode" `ymode` of vector `ys` >
        return ymode
    end

    # predictions as ordinary `Vector` (of CategoricalStrings or CategoricalValues):
    yhat = [predict_on_column(i) for i in 1:size(Xmatrix, 2)]

    # to return a categorical vector with the same pool as `y`:
    null = categorical(levels(y))[1:0]  # empty categorical vector with all levels of `y`
    return vcat(yhat, null)

end

Essentially, the only extra work in handling the categorical target "properly" is the last two lines of code.

The regressor code will look much the same (without the last two lines).
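For the two placeholders in predict_on_column, something along these lines should work (a sketch only, written here as a stand-alone hypothetical helper; the inverse-distance weighting mirrors scikit-learn's weights="distance" option):

using NearestNeighbors

function predict_on_column(tree, y, Xmatrix, K, i)
    indxs, dists = knn(tree, Xmatrix[:, i], K)   # K nearest neighbours of column i
    ys = y[indxs]
    w = 1 ./ (dists .+ eps())                    # inverse-distance weights
    seen = unique(ys)
    totals = [sum(w[ys .== level]) for level in seen]
    return seen[argmax(totals)]                  # the "weighted mode"
end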

Hope this helps.

@tlienart
Collaborator

tlienart commented Oct 7, 2019

Now done

@tlienart tlienart closed this as completed Oct 7, 2019