
Implement K-Nearest Neighbours #87

Closed
ablaom opened this issue Feb 20, 2019 · 12 comments
Labels
enhancement New feature or request

Comments

@ablaom
Member

ablaom commented Feb 20, 2019

The current built-in KNNRegressor is only a toy I wrote for testing purposes. It would be nice to replace it with something more serious.

Possibilities:

(1) The package NearestNeighbors.jl provides a well-featured search engine for finding neighbours but does not actually implement regressors/classifiers, as far as I can tell. I've asked the author if he knows of an implementation.

(2) Expedient option: wrap the sklearn models

(3) Existing package I missed?

@ablaom ablaom added the enhancement New feature or request label Feb 20, 2019
@tlienart
Collaborator

There's also FLANN.jl, a wrapper for FLANN, though it is possibly out of date.
NearestNeighbors.jl looks quite solid.

@KristofferC
Contributor

If it helps, the following packages use NearestNeighbors.jl:

 CausalityTools
 Celeste
 ChaosTools
 Clustering
 ConcaveHull
 ConceptnetNumberbatch
 ConvexSwitch
 CrossMappings
 DelayEmbeddings
 DynamicalSystems
 DynamicalSystemsBase
 GeoStatsDevTools
 InverseDistanceWeighting
 Kriging
 LinearAlgebraicRepresentation
 LocalFunctionApproximation
 LocallyWeightedRegression
 MovingWeightedLeastSquares
 QuickShiftClustering
 RealNeuralNetworks
 ScatteredInterpolation
 StateSpaceReconstruction
 TimeseriesPrediction
 TransferEntropy

@kirtsar
Collaborator

kirtsar commented Feb 25, 2019

This is my toy implementation:

using NearestNeighbors   # KDTree, knn
using Distances          # Euclidean
using StatsBase: mode    # majority vote

struct KNN_classifier{T, S}
    tree::T
    label::S
    k::Int
end

function knn_clf(X, y, k = 3; metric = Euclidean(), leafsize = 10)
    # NearestNeighbors wants observations as columns, hence the transpose
    rawX = Array(convert(Matrix, X)')
    kd = KDTree(rawX, metric; leafsize = leafsize)
    return KNN_classifier(kd, y, k)
end

function predict(clf, X_test)
    Xraw = Array(Matrix(X_test)')
    neighbors = knn(clf.tree, Xraw, clf.k)[1]            # indices of the k nearest neighbours
    y_pred = [mode(clf.label[ns]) for ns in neighbors]   # unweighted majority vote
    return y_pred
end
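For illustration (not part of the original snippet), with the packages loaded as above it can be used on random data like this:

X = rand(100, 3)                        # 100 observations, 3 features
y = rand(["setosa", "versicolor"], 100)

clf = knn_clf(X, y, 5)
predict(clf, rand(10, 3))               # Vector of predicted labels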

@ablaom
Member Author

ablaom commented Feb 26, 2019

So, it seems to me that while we have an excellent package on which to base our search for nearest neighbours - NearestNeighbors.jl - we do not have complete classifier/regressor algorithms implemented anywhere. But I don't think filling in the gaps is too hard, and we'll just do this ourselves.

I suggest we provide the same options as in sk-learn, which include an option for the search algorithm: kd, ball, brute or auto, although I don't know exactly how "auto" decides which to use. Here are the docs:

classifier:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

regressor:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor

In the regression case one can probably just edit my existing “toy” at MLJ/src/builtins/KNN. In the classifier case one additionally needs to remember to ensure the predictions preserve the categorical pool of the target passed to fit, as explained in MLJ/doc/adding_new_models.md. As we are writing the algorithm and not wrapping one, this is straightforward, i.e., we don't need any of the encoder/decoder business.

@kirtsar Are you happy for me to assign you to this issue? I'm happy to answer further questions on Slack. Small note: your toy above is not in the MLJ mould, as you have bundled hyperparameters and learned parameters into one object. In MLJ the "model" only contains hyperparameters, while the learned parameters are part of the output of fit, called the fitresult. For details refer to the aforementioned guide and look at the existing KNN code. Also, it is not enough to return a mode for classifiers, as the points get weighted according to inverse distance (or some other custom function).
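To illustrate the hyperparameter/fitresult split, here is a minimal sketch of the pattern from the guide; FooRegressor (a shrunken-mean predictor) and its single hyperparameter are made up purely for illustration:

import MLJBase

# the model struct holds hyperparameters only; nothing learned from data:
mutable struct FooRegressor <: MLJBase.Deterministic{Any}
    shrinkage::Float64
end
FooRegressor(; shrinkage = 1.0) = FooRegressor(shrinkage)

# learned parameters go into the fitresult returned by fit:
function MLJBase.fit(model::FooRegressor, verbosity::Int, X, y)
    fitresult = model.shrinkage * sum(y) / length(y)   # the "learned" parameter
    return fitresult, nothing, nothing                 # fitresult, cache, report
end

# predict receives the fitresult, never the training data directly:
function MLJBase.predict(model::FooRegressor, fitresult, Xnew)
    n = size(MLJBase.matrix(Xnew), 1)
    return fill(fitresult, n)
end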

Technical notes:

(i) As we already have a KNN in MLJ/src/builtins/KNN.jl, let's put any new code in the same place. I may move it to MLJModels later after sorting out testing dependencies.

(ii) Perhaps we use Distances.jl as our source for distance functions, and to avoid extra type parameters we could skip the custom weight option?
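For what it's worth, Distances.jl metrics plug straight into the NearestNeighbors.jl tree constructors, so no extra machinery should be needed. A quick sketch:

using Distances, NearestNeighbors

x, y = rand(3), rand(3)
evaluate(Euclidean(), x, y)          # ordinary Euclidean distance
evaluate(Cityblock(), x, y)          # L1 distance

data = rand(3, 50)                   # 50 points in 3 dimensions, points as columns
tree = BallTree(data, Cityblock())   # any Distances.Metric works for BallTree/BruteTree
idxs, dists = knn(tree, rand(3), 5)  # indices and distances of the 5 nearest neighbours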

@kirtsar
Collaborator

kirtsar commented Feb 26, 2019

@ablaom
Sorry, that was my first attempt. I've read the docs, and the example now looks like this:

module KNN

export KNNClassifier

import MLJBase

using LinearAlgebra
using Distances          # Euclidean, Chebyshev, Minkowski, Cityblock, Metric
using NearestNeighbors   # KDTree, BallTree, BruteTree

#> needed for all classifiers:
using CategoricalArrays

# to be extended:

const KNNFitResultType{T} = Tuple{Matrix{T},Vector{T}}

# TODO: introduce type parameters for the function fields (metric, kernel)


mutable struct KNNClassifier{F} <: MLJBase.Deterministic{KNNFitResultType{F}}
    target_type::Type{F}
    K::Int                    # number of local target values averaged
    metric::Distances.Metric
    tree_type::Symbol         # :BallTree, :KDTree or :BruteTree
    leafsize::Int
end


function KNNClassifier(;
                       target_type = Int,
                       K = 5,
                       metric = Euclidean(),
                       tree_type = :KDTree,
                       leafsize = 10
                       )
    model = KNNClassifier{target_type}(
        target_type,
        K,
        metric,
        tree_type,
        leafsize
    )
    message = MLJBase.clean!(model)   #> future-proof by including these
    isempty(message) || @warn message #> two lines even if no clean! is defined below

    return model
end


function MLJBase.clean!(model::KNNClassifier)
    message = ""
    if model.K <= 0
        model.K = 1
        message *= "K must be positive. K set to 1. "
    end
    if (model.tree_type == :KDTree) &&
        !(model.metric isa Union{Euclidean, Chebyshev, Minkowski, Cityblock})
        model.tree_type = :BallTree
        message *= "KDTree supports only these metrics: " *
                   "Euclidean, Chebyshev, Minkowski, Cityblock. " *
                   "tree_type set to :BallTree. "
    end
    if model.leafsize <= 0
        model.leafsize = 10
        message *= "Leafsize cannot be less than 1. Leafsize set to 10. "
    end
    return message
end


function MLJBase.fit(model::KNNClassifier{F}
                     , verbosity::Int
                     , X
                     , y::Vector{F2}) where {F,F2}

    F == F2 || error("Model specifies target_type=$F but target type is $F2.")

    Xraw = MLJBase.matrix(X)
    # bad place: right now KDTree and the others accept only a dense matrix, no adjoint allowed
    Xmatrix = Matrix(Xraw')
    # poor choice, because now the function is type-unstable:
    # it can return any tree type
    if model.tree_type == :KDTree
        tree = KDTree(Xmatrix, model.metric; leafsize = model.leafsize)
    elseif model.tree_type == :BallTree
        tree = BallTree(Xmatrix, model.metric; leafsize = model.leafsize)
    else # BruteTree (leafsize is irrelevant for brute-force search)
        tree = BruteTree(Xmatrix, model.metric)
    end
    fitresult = (tree, y)
    cache = nothing
    report = nothing

    return fitresult, cache, report
end


function MLJBase.predict(model::KNNClassifier, fitresult, Xnew)
    # to be done
end

# metadata to be done

end # module

@kirtsar
Collaborator

kirtsar commented Feb 26, 2019

I believe there will be a lot of pain with categorical arrays. As you can see, fit now only accepts Vector{T}, which is not categorical. I don't know how I should declare the types because, for example, for the Iris dataset the type of y is this monster:

CategoricalArray{String,1,UInt8,String,CategoricalString{UInt8},Union{}}

and this is one of the easiest ones, without any "missing".
So now I'm a little puzzled about how we should treat categorical arrays.

@kirtsar
Collaborator

kirtsar commented Feb 26, 2019

By the way, some kernels may be found here:
http://trthatcher.github.io/MLKernels.jl/dev/
and here:
https://github.com/johnmyleswhite/SmoothingKernels.jl
The latter is really out of date, but the implementation is OK, I think.

@ablaom
Member Author

ablaom commented Feb 26, 2019

@kirtsar Code is looking great, thanks!

Re: type instability. I suggest you introduce the tree type as a type parameter of the model.

Re: no adjoint for NearestNeighbors. We probably have to live with this, at least for now, because X is a generic table (it could be row- or column-based). That is, we could supplement the MLJBase.matrix method with an MLJ.adjoint method, but it's not clear this would actually be an improvement for most tables.

Re: categorical typing: I don't understand why you want to annotate the type of y in fit. As I understand it, this makes no difference to compilation. FYI, y will always be a categorical vector, because that is what the machine interface will pass in its call to fit whenever your model is a classifier (detected by the output_kind trait, soon to be renamed target_scitype). No need to do type-checking here, as the machine interface will do that (eventually :-)).

That said, categorical arrays do make it a pain to declare a concrete fitresult type (the FitresultType in the code above needs to be changed). See, for example, MLJModels/src/GaussianProcesses.jl for a worst-case scenario. However, you can relax this to an abstract type (e.g., Any) and the only loss of performance at present is in large ensembles of your model; see the guide and here. I'm not sure this pain can be avoided without performance loss, but am open to suggestions [Edit: but see now #93]. BTW, at present the fitresult field of a machine struct is not typed, but I viewed typing this as "premature optimisation."
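To illustrate the point about not annotating y: a CategoricalVector is still an AbstractVector, so methods need no special signature to accept it. A small example (not from the codebase, just for illustration):

using CategoricalArrays

y = categorical(["setosa", "versicolor", "setosa", "virginica"])
levels(y)                     # ["setosa", "versicolor", "virginica"]
y[2:3]                        # indexing preserves the pool

# an unannotated (or AbstractVector-annotated) method accepts it as-is:
n_classes(v::AbstractVector) = length(unique(v))
n_classes(y)                  # 3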

@tlienart
Collaborator

Re: no adjoint for NearestNeighbors. We probably have to live with this, at least for now, because X is a generic table (it could be row- or column-based). That is, we could supplement the MLJBase.matrix method with an MLJ.adjoint method, but it's not clear this would actually be an improvement for most tables.

This PR, JuliaData/Tables.jl#66, will introduce a keyword in Tables.matrix to allow for transposition, i.e. it takes any table object and materializes a 2D matrix, either with row-based or column-based filling depending on the keyword (likely transpose=true). This should remove the issue we've had of having to transpose, copy the transpose, use permutedims, etc. internally.
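Concretely, that would turn the current pattern into something like this (the keyword name is the one suggested in the PR discussion, so treat it as tentative):

using Tables

X = (x1 = rand(5), x2 = rand(5))     # any Tables.jl-compatible table

# today: materialize column-major, then copy the transpose
Xmatrix = Matrix(Tables.matrix(X)')  # 2×5, observations as columns

# after JuliaData/Tables.jl#66 (assuming the keyword lands as transpose=true):
Xmatrix = Tables.matrix(X; transpose = true)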

@ablaom
Member Author

ablaom commented Feb 28, 2019

Further to our Slack conversation, I suggest you drop the eltype parameter F in your code suggestion above. Just use Any for the fitresult type (see #93).

To keep your fit type-stable, you will need a TreeType parameter, so something like:

mutable struct KNNClassifier{TreeType} <: MLJBase.Deterministic{Any}
    K::Int                      # number of local target values averaged
    metric::Distances.Metric
    tree_type::Symbol           # :BallTree, :KDTree or :BruteTree
end

Your keyword constructor can determine the value of TreeType from
tree_type (see the sketch further below). You write three fit methods,
one for each tree type. So the fit for :KDTree will look something like:

function MLJBase.fit(model::KNNClassifier{KDTree}
                     , verbosity::Int
                     , X
                     , y)
    ...

    fitresult = (tree, y)
    return fitresult, nothing, nothing
end

(You can wrap any duplicate code shared by the three methods in
subroutines if you want, or just do one fit and check the compiler is
now smart enough to infer the correct return type.)
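For concreteness, here is a sketch of the keyword constructor idea, matching the field list in the struct above (defaults borrowed from your earlier code):

function KNNClassifier(; K = 5, metric = Euclidean(), tree_type = :KDTree)
    # map the symbol to the NearestNeighbors tree type used as the parameter:
    TreeType = tree_type == :KDTree   ? KDTree  :
               tree_type == :BallTree ? BallTree :
                                        BruteTree
    model = KNNClassifier{TreeType}(K, metric, tree_type)
    message = MLJBase.clean!(model)
    isempty(message) || @warn message
    return model
end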

Note that there is no need to annotate the type of y here, but the
machine (or task) interface will ensure that this is a
CategoricalVector (because you are going to make the trait
declaration output_kind(KNNClassifier) = [:multiclass, :ordered_factor_infinite] or, after upcoming revisions, something
similar).

Whether the elements of y are CategoricalString or CategoricalValue{Int}
or whatever is immaterial. Your predict method can return a categorical
vector of the same kind without knowing what this type is. So, something
like this:

function MLJBase.predict(model::KNNClassifier, fitresult, Xnew)
    tree, y = fitresult

    Xraw = MLJBase.matrix(Xnew)
    Xmatrix = Matrix(Xraw')  # do this better after JuliaData/Tables.jl#66

    function predict_on_column(i) # `i` will be the index of a column of `Xmatrix`
        < code to get vector `indxs` of indices of K-nearest neighbours, using `tree` >
        ys = y[indxs]
        < code to compute "weighted mode" `ymode` of vector `ys` >
        return ymode
    end

    # predictions as ordinary `Vector` (of CategoricalStrings or CategoricalValues):
    yhat = [predict_on_column(i) for i in 1:size(Xmatrix, 2)]

    # to return a categorical vector with the same pool as `y`:
    null = categorical(levels(y))[1:0]  # empty categorical vector with all levels of `y`
    return vcat(yhat, null)

end

Essentially, the only extra work in handling the categorical target "properly" is the last two lines of code.

The regressor code will look much the same (without the last two lines).
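For the two placeholders in predict_on_column, something along these lines should work (a sketch only, written here as a stand-alone hypothetical helper; the inverse-distance weighting mirrors scikit-learn's weights="distance" option):

using NearestNeighbors

function predict_on_column(tree, y, Xmatrix, K, i)
    indxs, dists = knn(tree, Xmatrix[:, i], K)   # K nearest neighbours of column i
    ys = y[indxs]
    w = 1 ./ (dists .+ eps())                    # inverse-distance weights
    seen = unique(ys)
    totals = [sum(w[ys .== level]) for level in seen]
    return seen[argmax(totals)]                  # the "weighted mode"
end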

Hope this helps.

@tlienart
Collaborator

tlienart commented Oct 7, 2019

Now done

@tlienart tlienart closed this as completed Oct 7, 2019