
Scikit learn #4

Merged: 9 commits merged into master from the ScikitLearn branch, Feb 26, 2019

Conversation

@ysimillides (Collaborator)

The build is working (with the correct Python path, after the change in Travis). Six new models are implemented, but their traits still need to be defined correctly.

@ablaom (Member) commented Feb 18, 2019

Great. I have added some tests, which fail when we replace your Xr and yr with real data (the Boston dataset). You probably need to apply MLJBase.matrix to your data in fit and predict. Can you have a look, please?
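Roughly, the pattern I have in mind is something like this (a sketch only, with illustrative names; the struct, its C field, and the keyword defaults are not code from this PR):

    # Sketch: a toy regressor showing where MLJBase.matrix converts tabular
    # data into the plain matrix that scikit-learn expects. Everything named
    # "Dummy" here is illustrative only.
    import MLJBase
    using ScikitLearn
    @sk_import svm: SVR

    mutable struct DummySVMRegressor <: MLJBase.Deterministic{Any}
        C::Float64
    end
    DummySVMRegressor(; C=1.0) = DummySVMRegressor(C)

    function MLJBase.fit(model::DummySVMRegressor, verbosity::Int, X, y)
        Xmatrix   = MLJBase.matrix(X)                 # table -> Matrix
        fitresult = ScikitLearn.fit!(SVR(C=model.C), Xmatrix, y)
        return fitresult, nothing, nothing            # (fitresult, cache, report)
    end

    MLJBase.predict(model::DummySVMRegressor, fitresult, Xnew) =
        ScikitLearn.predict(fitresult, MLJBase.matrix(Xnew))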

Otherwise, looking good. Still to do:

  • add trait functions ✔
  • wrap in an SVM submodule (the load_paths are then "MLJModels.ScikitLearn_.SVM.SVMClassifier" and so forth)
  • add minimal doc strings (these can just link to the sklearn documentation)
  • add clean! methods to check that string hyperparameters make sense (e.g., "auto" not "atou"); see the sketch after this list
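For the last item, something along these lines is what I have in mind. It assumes the usual convention that clean! mutates the model to repair bad values and returns a warning string (empty if nothing needed fixing); the gamma check shown is illustrative only:

    # Sketch of a clean! method: repair nonsensical string hyperparameters
    # and return a warning message. The gamma field and its allowed values
    # are illustrative, not necessarily those of SVMClassifier in this PR.
    function MLJBase.clean!(model::SVMClassifier)
        warning = ""
        if model.gamma isa AbstractString && model.gamma != "auto"
            warning *= "gamma=\"$(model.gamma)\" is not recognised. Resetting to \"auto\". "
            model.gamma = "auto"
        end
        return warning
    end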

@ysimillides (Collaborator, Author)

I'll have a closer look at this tomorrow so I can check the type of each task and make sure it is dispatched correctly. Just a quick question: we seem to be mixing the Iris and Boston datasets; maybe this is introducing the error?

@ablaom (Member) commented Feb 19, 2019

Yiannis,

I've added more rigorous testing (tests on artificial linear data) and resolved your bugs. The main issues were:

  • failure to test regression fit methods on tabular data (you were testing on matrix data)
  • confusion about what the regressor/classifier distinction means

For the second, here is a quote from the guide:

The form of the target data y passed to fit depends on the kind of supervised model. For "regressors", y is a vector; for "classifiers" it must be a categorical vector. More precisely, the form is determined by the value returned by the trait output_kind that each model type should define (see below):

output_kind return value     type of y
:continuous                  Vector{<:AbstractFloat}
:binary                      CategoricalVector
:multiclass                  CategoricalVector
:ordered_factor_finite       CategoricalVector
:ordered_factor_infinite     Vector{<:Integer}

So y for a regressor is a Vector{<:AbstractFloat} (and there is none of that decoding jazz), while for a classifier y is a CategoricalVector and we need to worry about preservation of levels as described later in the guide.
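Concretely, the trait declarations look something like this (the toy structs are placeholders, not the models in this PR):

    # Trait declarations implied by the table above. The empty toy structs
    # stand in for real models, which carry hyperparameter fields.
    import MLJBase

    mutable struct ToyRegressor  <: MLJBase.Deterministic{Any} end
    mutable struct ToyClassifier <: MLJBase.Deterministic{Any} end

    # regressor: y arrives as a Vector{<:AbstractFloat}; no decoding needed
    MLJBase.output_kind(::Type{<:ToyRegressor}) = :continuous

    # classifier: y arrives as a CategoricalVector; levels must be preserved
    MLJBase.output_kind(::Type{<:ToyClassifier}) = :multiclass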

Small note: there is no need to annotate the types of X, y, and Xnew in your methods; you will always know what they are. And there is no point writing methods to deal with alternative forms of data, because they will be invisible to the higher-level API.

You probably want to go over the changes I have made to src/ScikitLearn.jl.

See earlier comments for what is still left for you to do.

Thanks

@ablaom (Member) commented Feb 20, 2019

I'm sorry, but I am rethinking the "wrap in a submodule" plan. We may eventually need to rename the src file to ScikitLearnSVM.jl, but leave it for now, thanks.

@ysimillides (Collaborator, Author)

That's fine about the module. I'll fix the other things and add the DT in due course. I'll also go over the XGBoost code and make similar changes, so that classifiers work on categorical data and regressors on plain vectors there as well.

@ablaom (Member) commented Feb 20, 2019

Okay. I'm ready to merge as soon as the doc string and clean! methods are done.

Review comment on the diff:

    @sk_import svm: NuSVR
    @sk_import svm: LinearSVR

    mutable struct SVMClassifier{Any} <: MLJBase.Deterministic{Any}
Collaborator:

why {Any} ?

Member:

The only strict requirement on the parameter R in Deterministic{R} or Probabilistic{R} is that the fitresult returned by fit is always of type R. (In tests, one should do @test fitresult isa fitresult_type(model).) So Any always works.

To get optimal performance in ensembling, you want R to be concrete (the "ensemble" is a Vector{R}), although it sometimes has to be a "small" union type (see, e.g., DecisionTreeRegressor). If you like, you could take a look at ensembling and see whether the type can be inferred. This code (refactored from Koala) is quite old, and finding out R is a bit of a pain point. To make R concrete in the tree case, you have to introduce target_type as an explicit type parameter of the model struct; see the sketch below.
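To illustrate the target_type idea (a sketch with made-up names, not the actual DecisionTree code):

    # Sketch of making R concrete: carry the target type as an explicit type
    # parameter of the model struct, so the fitresult type, and hence the
    # element type of an ensemble's Vector{R}, is known. All names are made up.
    import MLJBase

    struct ToyFitresult{T<:AbstractFloat}   # placeholder for a real fitresult type
        coefs::Vector{T}
    end

    mutable struct ToyTreeRegressor{T<:AbstractFloat} <:
                   MLJBase.Deterministic{ToyFitresult{T}}
        target_type::Type{T}                # explicit target_type hyperparameter
        max_depth::Int
    end

    ToyTreeRegressor(; target_type=Float64, max_depth=5) =
        ToyTreeRegressor{target_type}(target_type, max_depth)

    # fit for this model must always return a ToyFitresult{T}, so an ensemble
    # can be stored as the concretely typed Vector{ToyFitresult{T}}.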

Collaborator:

ok thanks for the explanation, it makes sense

@ablaom (Member) commented Feb 24, 2019

Please move the additions to DecisionTree.jl into a separate PR; they have nothing to do with the current ScikitLearn PR.

@ablaom (Member) commented Feb 25, 2019

Thanks.

ablaom merged commit 02497bb into master on Feb 26, 2019
ablaom deleted the ScikitLearn branch on June 18, 2019