Implement K-Nearest Neighbours #87
There's also FLANN.jl, a wrapper for FLANN. Possibly out of date.
There's also this: https://github.com/dillondaudert/NearestNeighborDescent.jl (and, as an aside, UMAP based on it: https://github.com/dillondaudert/UMAP.jl).
If it helps, the following packages use NearestNeighbors.jl:

This is my toy implementation:
So, it seems to me that while we have an excellent package on which to base our search for nearest neighbours - NearestNeighbors.jl - we do not have complete classifier/regressor algorithms implemented anywhere. But I don't think filling in the gaps is too hard, and we'll just do this ourselves. I suggest we provide the same options as in sk-learn, which include an option for the search algorithm: kd, ball, brute or auto, although I don't know exactly how "auto" decides which to use. Here are the docs:

classifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
regressor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor

In the regression case one can probably just edit my existing "toy" at MLJ/src/builtins/KNN. In the classifier case one additionally needs to remember to ensure the predictions preserve the categorical pool of the target passed to fit, as explained in MLJ/doc/adding_new_models.md. As we are writing the algorithm and not wrapping one, this is straightforward, i.e., we don't need any of the encoder/decoder business.

@kirtsar Are you happy for me to assign you to this issue? I'm happy to answer further questions on slack.

Small note: your toy above is not in the MLJ mould, as you have bundled hyperparameters and learned parameters into one object. In MLJ the "model" only contains hyperparameters, while the learned parameters are part of the output of `fit`.

Technical notes: (i) As we already have a KNN in MLJ/src/builtins/KNN.jl, let's put any new code in the same place. I may move it to MLJModels later after sorting out testing dependencies. (ii) Perhaps we use Distances.jl as our source for distance functions, and to avoid extra type parameters we could skip the custom weight option?
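To make the hyperparameter/learned-parameter split concrete, here is a minimal sketch (an illustration, not code from this thread; the field names and defaults are assumptions) of a model struct with sk-learn-like options:

```julia
# Sketch only: an MLJ-style "model" holds hyperparameters and nothing learned.
using Distances

mutable struct KNNRegressorSketch
    K::Int                    # number of neighbours whose targets are averaged
    metric::Distances.Metric  # distance function, taken from Distances.jl
    algorithm::Symbol         # :kd, :ball or :brute, mirroring sk-learn's options
end

# keyword constructor supplying defaults (values here are illustrative)
KNNRegressorSketch(; K=5, metric=Euclidean(), algorithm=:kd) =
    KNNRegressorSketch(K, metric, algorithm)

# the learned parameters (search tree, training targets) would be returned by `fit`,
# not stored in this struct
```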
@ablaom
I believe there will be a lot of pain with categorical arrays. As you can see, `fit` now only accepts `Vector{T}`, which is not Categorical. I don't know how I should declare the types because, for example, for the Iris dataset the type of `y` is this monster:

and this is one of the easiest ones, without any `missing`.
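For illustration (an assumed example, not from the thread): a categorical vector is not a `Vector{T}`, so a method restricted to `Vector` won't accept it, while `AbstractVector` will:

```julia
using CategoricalArrays

y = categorical(["setosa", "versicolor", "setosa"])  # a CategoricalVector
y isa Vector          # false: a method fit(model, verbosity, X, y::Vector) would not apply
y isa AbstractVector  # true: relaxing the annotation to AbstractVector (or dropping it) works
```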
By the way, some kernels may be found here:
@kirtsar Code is looking great, thanks!

Re: type instability. Suggest you introduce `tree_type` as a type parameter of the model.

Re: no adjoint for NearestNeighbors. Probably have to live with this, at least for now, because […]

Re: categorical typing: I don't understand why you want to annotate the type of […]. That said, categorical arrays do make it a pain to declare a concrete fitresult type (the `FitresultType` in the code above needs to be changed). See, for example, MLJModels/src/GaussianProcesses.jl for a worst-case scenario. However, you can relax this to an abstract type (eg, […]).
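A rough sketch of the type-parameter suggestion (an illustration with assumed names, not the thread's code): the keyword constructor maps the `tree_type` symbol to a concrete type parameter, so that `fit` can dispatch on it:

```julia
using NearestNeighbors  # provides KDTree, BallTree, BruteTree

# Sketch only: field names and defaults are illustrative assumptions.
mutable struct KNNSketch{TreeType}
    K::Int
    tree_type::Symbol
end

function KNNSketch(; K=5, tree_type=:kd)
    TreeType = tree_type == :kd   ? KDTree :
               tree_type == :ball ? BallTree :
                                    BruteTree
    return KNNSketch{TreeType}(K, tree_type)
end

KNNSketch(tree_type=:ball)  # KNNSketch{BallTree}(5, :ball); fit methods can now dispatch on the parameter
```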
This PR JuliaData/Tables.jl#66 will introduce a keyword in […]
Further to our slack conversation, I suggest you dump the input eltype […] To keep your […]

```julia
mutable struct KNNClassifier{TreeType} <: MLJBase.Deterministic{Any}
    K::Int                    # number of local target values averaged
    metric::Distances.Metric
    tree_type::Symbol         # BallTree, KDTree or BruteTree
end
```

Your keyword constructor can determine the value of `TreeType`, and the three `fit` methods can then dispatch on it:

```julia
function MLJBase.fit(model::KNNClassifier{KDTree},
                     verbosity::Int,
                     X,
                     y)
    ...
    fitresult = (tree, y)               # learned parameters: the search tree and the training target
    return fitresult, nothing, nothing  # fitresult, cache, report
end
```

(You can wrap any duplicate code shared by the three methods in […].)

Note that there is no need to annotate the type of […]

As to whether or not the elements of […]

```julia
function MLJBase.predict(model::KNNClassifier, fitresult, Xnew)
    tree, y = fitresult
    Xraw = MLJBase.matrix(Xnew)
    Xmatrix = Matrix(Xraw')  # do this better after JuliaData/Tables.jl#66

    function predict_on_column(i)  # `i` will be index of a column of `Xmatrix`
        < code to get vector `indxs` of indices of K-nearest neighbors, using `tree` >
        ys = y[indxs]
        < code to compute "weighted mode" `ymode` of vector `ys` >
        return ymode
    end

    # predictions as ordinary `Vector` (of CategoricalStrings or CategoricalValues):
    yhat = [predict_on_column(i) for i in 1:size(Xmatrix, 2)]

    # to return a categorical vector with the same pool as `y`:
    null = categorical(levels(y))[1:0]  # empty categorical vector with all levels of `y`
    return vcat(yhat, null)
end
```

Essentially, the only extra work in handling the categorical target "properly" is the last two lines of code. The regressor code will look much the same (without the last two lines). Hope this helps.
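As a standalone check of what those last two lines buy you (an illustration assuming CategoricalArrays.jl behaviour, not code from the thread), the empty vector built from `levels(y)` carries the full pool of the training target:

```julia
using CategoricalArrays

y = categorical(["setosa", "versicolor", "virginica", "setosa"])  # training target
levels(y)                           # ["setosa", "versicolor", "virginica"]

null = categorical(levels(y))[1:0]  # empty categorical vector that still carries all of y's levels
length(null)                        # 0
levels(null)                        # ["setosa", "versicolor", "virginica"]

# concatenating predictions with `null` tags them with the same pool of levels as `y`
yhat = y[[1, 1]]                    # pretend predictions, drawn from y as categorical values
levels(vcat(yhat, null))            # ["setosa", "versicolor", "virginica"]
```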
Now done.
The current built-in KNNRegressor is only a toy I wrote for testing purposes. Would be nice to replace with something more serious.
Possibilities:
(1) The package NearestNeighbors.jl provides a well-featured search engine for finding neighbours (see the sketch after this list) but does not actually implement regressors/classifiers, as far as I can tell. I've asked the author if he knows of an implementation.
(2) Expedient option: wrap the sklearn models
(3) Existing package I missed?
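For reference, the kind of neighbour search NearestNeighbors.jl already provides looks like this (a minimal sketch with made-up data):

```julia
using NearestNeighbors

X = rand(3, 100)                     # 100 training points in 3 dimensions, stored as columns
tree = KDTree(X)                     # BallTree or BruteTree are built the same way
idxs, dists = knn(tree, rand(3), 5)  # indices and distances of the 5 nearest neighbours
```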