New BetaML models (to be checked) #749
@sylvaticus Thanks for this extra work. I have not had a chance to review the interfaces, but I will respond now to your nice outline:
This is correct. I'm surmising that you found the relevant docs that exist, but FYI there are some here and here, and existing implementations here and here.
So far we have not implemented any "distribution-fitters", but the design was discussed and the decision made to implement them in a different way, namely as supervised models. There is a sample implementation linked from this section of the docs (recently tweaked). As this will be our first such model, and the API is marked experimental, there could be some teething issues, but I don't anticipate major problems. It will be great to finally have GMM!
I always imagined imputers should learn from the data to prevent data leakage. So, if imputing with the mean value, then you learn the mean from the training data and use that for imputing the validation data. That is how the existing MLJ imputer is designed. This kind of imputer cannot be `Static`. If, however, the BetaML imputer is genuinely static, in the sense that it transforms data without ever seeing training data, then it should indeed be `Static`. Hope this helps. When you're ready, let me know and I will try to find time to review the interfaces more carefully.
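(For concreteness, a minimal sketch of the "learn on the training data, apply to the validation data" pattern described above. This is illustrative only, not the existing MLJ imputer's code; all names are invented.)

```julia
using Statistics

# "fit": learn the column means from the training data only, skipping missing entries
learn_means(Xtrain) = [mean(skipmissing(col)) for col in eachcol(Xtrain)]

# "transform": impute any data set using the means learned on the training data
function impute(X, means)
    Xnew = Array{Float64}(undef, size(X))
    for j in axes(X, 2), i in axes(X, 1)
        Xnew[i, j] = ismissing(X[i, j]) ? means[j] : X[i, j]
    end
    return Xnew
end

Xtrain = [1.0 2.0; missing 4.0; 3.0 missing]
Xvalid = [missing 1.0; 2.0 missing]
m = learn_means(Xtrain)   # means come from the training data only
impute(Xvalid, m)         # validation data imputed with the training means (no leakage)
```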
I am trying to implement GMM as a
Let me comment that I think there are several ways to evaluate the output of the clustering. In particular, if labels are actually available (but not used in the fitting), the way I implemented

Concerning the MissingImputator: it uses GMM in the background, so missing values are computed as the expected values of the mixtures, weighted by the probability of the specific record belonging to each mixture.
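(In symbols, using my own notation rather than anything from the BetaML docs, that imputation rule is roughly the following, where $\mu_{kj}$ is the mean of mixture component $k$ in dimension $j$ and the posterior weight is computed from the observed entries of record $i$:)

$$
\hat{x}_{ij} \;=\; \sum_{k=1}^{K} p\!\left(z_i = k \mid x_i^{\mathrm{obs}}\right)\,\mu_{kj}
$$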
While I guess you are thinking of:

Right?
Great, but please consider that the model is not computationally efficient (though for a few thousand records it works fine). However, it is very flexible, as it can use any mixture (Gaussian ones being the only ones implemented) and, contrary to the scikit-learn GMM model, it accepts X matrices with missing data (hence I can use it for missing imputation ...). EDIT:
You need the latest version of MLJBase where I relaxed this check for just this case. Will look over your other comments Monday.
Great, thanks...
Well, almost. Rather:
I am not suggesting that
This is not compatible with the API. The workflow you are after, I'm guessing, is a "fit and transform in one hit" option, like sk-learn's. That would look like:

```julia
mach = machine(MissingImputator(), Xtrain) |> fit!
Xtrain_full = transform(mach, Xtrain)
mach = machine(MissingImputator(), Xtest) |> fit!
Xtest_full = transform(mach, Xtest)
```

Whereas a non-leakage workflow would replace the second block with

```julia
Xtest_full = transform(mach, Xtest)
```

(Or write the whole thing as

```julia
mach = machine(MissingImputator(), X)
fit!(mach, rows=train)
Xfull = transform(mach, X)
...
```

)

Does that address your question?
Yes, thanks, I think I got it this time. I implemented:

```julia
fit(m::MissingImputator, verbosity, X) --> (fitResults, cache, report)
transform(m::MissingImputator, fitResults, X) --> table(X̂_full)
```

I tested them with:

```julia
julia> import MLJBase
julia> const Mlj = MLJBase
MLJBase
julia> using BetaML
julia> X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4]
9×2 Matrix{Union{Missing, Float64}}:
1.0 10.5
1.5 missing
1.8 8.0
1.7 15.0
3.2 40.0
missing missing
3.3 38.0
missing -2.3
5.2 -2.4
julia> X = Mlj.table(X)
Tables.MatrixTable{Matrix{Union{Missing, Float64}}}: (x1 = Union{Missing, Float64}[1.0, 1.5, 1.8, 1.7, 3.2, missing, 3.3, missing, 5.2], x2 = Union{Missing, Float64}[10.5, missing, 8.0, 15.0, 40.0, missing, 38.0, -2.3, -2.4])
julia> model = MissingImputator()
MissingImputator(
K = 3,
p₀ = nothing,
mixtures = DiagonalGaussian{Float64}[DiagonalGaussian{Float64}(nothing, nothing), DiagonalGaussian{Float64}(nothing, nothing), DiagonalGaussian{Float64}(nothing, nothing)],
tol = 1.0000000000000004e-6,
minVariance = 0.05,
minCovariance = 0.0,
initStrategy = "kmeans",
rng = Random._GLOBAL_RNG()) @761
julia> modelMachine = Mlj.machine(model,X)
Machine{MissingImputator{DiagonalGaussian{Float64}},…} @335 trained 0 times; caches data
args:
1: Source @274 ⏎ `ScientificTypes.Table{AbstractVector{Union{Missing, ScientificTypes.Continuous}}}`
julia> (fitResults, cache, report) = Mlj.fit(model, 0, X)
((pₖ = [0.49997459736828426; 0.25001270131585945; 0.25001270131585634], mixtures = DiagonalGaussian{Float64}[DiagonalGaussian{Float64}([1.5, 11.166666666666666], [0.05, 0.05]), DiagonalGaussian{Float64}([3.2499999999999782, 39.0], [0.05, 0.05]), DiagonalGaussian{Float64}([5.199999999999994, -2.3499999999999996], [0.05, 0.05])]), nothing, ([2.886751345948105, 0.1814436846505966, 0.020160409405627352, 0.0022400454895143513], -275.7794472196741, 582.3200385220554, 579.5588944393483))
julia> XD = Mlj.transform(model,fitResults,X)
Tables.MatrixTable{Matrix{Float64}}: (x1 = [1.0, 1.5, 1.8, 1.7, 3.2, 2.8625692221714156, 3.3, 5.199999999999994, 5.2], x2 = [10.5, 11.166666666667362, 8.0, 15.0, 40.0, 14.746015173838764, 38.0, -2.3, -2.4])
julia> XDM = Mlj.matrix(XD)
9×2 Matrix{Float64}:
1.0 10.5
1.5 11.1667
1.8 8.0
1.7 15.0
3.2 40.0
2.86257 14.746
3.3 38.0
5.2 -2.3
5.2 -2.4
julia> # Use the previously learned structure to impute missings..
Xnew_withMissing = Mlj.table([1.5 missing; missing 38; missing -2.3; 5.1 -2.3])
Tables.MatrixTable{Matrix{Union{Missing, Float64}}}: (x1 = Union{Missing, Float64}[1.5, missing, missing, 5.1], x2 = Union{Missing, Float64}[missing, 38.0, -2.3, -2.3])
julia> XDNew = Mlj.transform(model,fitResults,Xnew_withMissing)
Tables.MatrixTable{Matrix{Float64}}: (x1 = [1.5, 3.2499999999999782, 5.199999999999994, 5.1], x2 = [11.166666666667362, 38.0, -2.3, -2.3])
julia> XDMNew = Mlj.matrix(XDNew)
4×2 Matrix{Float64}:
1.5 11.1667
3.25 38.0
5.2 -2.3
5.1  -2.3
```

Could you please check and validate the clusters' MLJ interface and the Perceptron one?
Thanks for the work on these interfaces. I've just had a look at the KMeans and KMedoids models.

The metadata looks good, except for one detail. However, I notice some strange behaviour in the KMeans model, noted below:

```julia
using Random
Random.seed!(123)
X, _ = MLJBase.make_regression(20, 8);
mach = MLJBase.machine(KMeans(), X) |> MLJBase.fit!;
# It appears we have three clusters:
julia> MLJBase.schema(MLJBase.transform(mach, X))
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 20
# Bit surprised that all training data is assigned a single class:
yhat = MLJBase.predict(mach, X)
20-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
# And I would say the following is incorrect, as the predicted `yhat` should have
# three levels in its pool, not just the ones that are manifest for
# the particular input `X` in definition of `yhat`:
julia> MLJBase.classes(yhat)
1-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
2
```

For comparison, here is the behaviour of the Clustering.jl model:

```julia
using MLJ
KMeans = @load KMeans pkg=Clustering
model = KMeans()
using Random
Random.seed!(123)
X, _ = make_regression(20, 8);
mach = machine(KMeans(), X) |> fit!;
schema(transform(mach, X))
# again, three clusters:
julia> schema(transform(mach, X))
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 20
yhat = predict(mach, X)
20-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
3
1
1
3
1
3
3
1
2
3
3
1
3
3
1
3
1
1
1
# All clusters are tracked, even if not explicitly manifest:
julia> ysmall = predict(mach, selectrows(X, 1:2))
2-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
3
julia> classes(ysmall)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
2
3
```

Finally, can I suggest you overload `fitted_params`? At the moment one gets the raw fit result:

```julia
MLJBase.fitted_params(mach)
(fitresult = ([3, 3, 2, 3, 1, 1, 3, 1, 2, 2 … 2, 3, 3, 3, 3, 3, 2, 3, 1, 1], [-0.09844314369373444 -0.9399263119056788; -0.994364450422251 0.5485269272699657; 1.0271438614771895 0.7414183922931447], BetaML.Clustering.var"#39#41"()),)
```

Maybe something like

```julia
MMI.fitted_params(model::KMeans, fitresult) =
    (centers = fitresult[2], cluster_labels = CategoricalArrays.categorical(fitresult[1]))
```

GMM:

The metadata looks good. However, I get an error when I try to use the fitted machine:

```julia
using MLJBase
import BetaML
GMM = BetaML.Clustering.GMM
y, _ = make_regression(1000, 3, rng=123);
julia> schema(y)
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 1000
mach = machine(GMM(), nothing, y) |> fit!
# expecting a multivariate distribution here:
julia> d = predict(mach, nothing)
ERROR: ArgumentError: Function `matrix` only supports AbstractMatrix or containers implementing the Tables interface.
Stacktrace:
[1] matrix(::MLJModelInterface.FullInterface, ::Val{:other}, ::Nothing; kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
[2] matrix(::MLJModelInterface.FullInterface, ::Val{:other}, ::Nothing) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
[3] matrix(::Nothing; kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
[4] matrix(::Nothing) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
[5] predict(::BetaML.Clustering.GMM{BetaML.Clustering.DiagonalGaussian{Float64}}, ::NamedTuple{(:pₖ, :mixtures),Tuple{Array{Float64,2},Array{BetaML.Clustering.DiagonalGaussian{Float64},1}}}, ::Nothing) at /Users/anthony/.julia/packages/BetaML/1uT9C/src/Clustering_MLJ.jl:149
[6] predict(::Machine{BetaML.Clustering.GMM{BetaML.Clustering.DiagonalGaussian{Float64}},true}, ::Nothing) at /Users/anthony/.julia/packages/MLJBase/pCCd7/src/operations.jl:83
[7] top-level scope at REPL[16]:1
```
Thanks, I'll have a look at this...
It seems there could be a problem in the MLJ design/architecture (still a bit voodoo for me...). This works:

```julia
using MLJBase, BetaML
y, _ = make_regression(1000, 3, rng=123);
ym = MLJBase.matrix(y)
model = GMM(rng=copy(BetaML.FIXEDRNG))
(fitResults, cache, report) = MLJBase.fit(model, 0, nothing, y)
yhat_prob = MLJBase.transform(model,fitResults,y) # ok
yhat_prob = MLJBase.predict(model, fitResults, y) # ok
```

However, if we run instead:

```julia
modelMachine = MLJBase.machine(model, nothing, y)
mach = MLJBase.fit!(modelMachine)
yhat_prob = MLJBase.predict(mach, nothing)
```

we have the error that you reported. How can I construct the interface so that the "version" of
Thanks for these new changes. I'll look over these and Perceptron shortly.
Nah. I think you just need to delete this irrelevant line:
Arghh... yes, sure... sorry... EDIT: Anyhow, I have now changed it,
No, what I had in mind was different. In any case, I can see there is some tension here between two competing conceptualisations of GMM. We want either:
In either case, interface points can be introduced to get the "secondary" objects, but those interface points will be less discoverable to the user. So we should probably choose according to the greater use-case. On these grounds, my push for case 1 may have been mistaken, and I suspect you agree, yes? If we revert to case 2, then GMM becomes

Thanks for your patience. What are your thoughts?
Please also see this side issue: sylvaticus/BetaML.jl#20
A quick reply: building the interface is a few lines of code. We can have two separate MLJ models that wrap BetaML.gmm for the two interpretations. We have to take care with the names, however, so as not to generate confusion.
Yes, that's a very good idea, thanks. Maybe
In the "distribution" fitting case we could make the following pragmatic choice: the training "target" |
Hello, I did implement
Does your object implement any of the Distributions.jl API, such as
They implement only what is needed for the EM algorithm, that is the log-pdf ( {Spherical|Diagonal|Full}Gaussian mixtures for
What about converting this to a
Does
If I remember correctly the "problem" was mostly with missing data.
On second thoughts, I think it's just fine, maybe even better, if you implement the Distributions.jl interface for your object (the complete component distributions + weights object). The minimum would be

```julia
Distributions.logpdf(::YourDist, single_observation)
```

Here I think also

And implementing
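(Purely as an illustration, and not BetaML code: a minimal "components + weights" object supporting the `logpdf` method mentioned above might look as follows; `SimpleGMM` and all parameter values are invented. Note that Distributions.jl's own `MixtureModel` already provides this when the components are standard distributions.)

```julia
using Distributions, LinearAlgebra
import Distributions: logpdf

struct SimpleGMM
    weights::Vector{Float64}       # mixing proportions, summing to 1
    components::Vector{MvNormal}   # one multivariate Gaussian per component
end

# log p(x) = logsumexp_k [ log w_k + logpdf(component_k, x) ]
function logpdf(d::SimpleGMM, x::AbstractVector{<:Real})
    logs = [log(w) + logpdf(c, x) for (w, c) in zip(d.weights, d.components)]
    m = maximum(logs)
    return m + log(sum(exp.(logs .- m)))   # numerically stable log-sum-exp
end

# usage with made-up parameters:
d = SimpleGMM([0.6, 0.4],
              [MvNormal([0.0, 0.0], Diagonal([1.0, 1.0])),
               MvNormal([3.0, 3.0], Diagonal([0.5, 0.5]))])
logpdf(d, [0.1, -0.2])
```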
Sorry for the name confusion, I don't have an object for the whole mixture, only for its components: what I call "mixtures" are actually mixture components. I think for now it is best to stay with the cluster interpretation, that is the one I originally wrote the algorithm for. I hence removed
Okay, that's fine.
There is a subtle issue with the scitypes I have just discovered (it is detected by trying to bind the models to test data in a machine). The declaration `input_scitype = MMI.Table(Union{MMI.Continuous,MMI.Missing})` states that each column can have element scitype
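(A hedged sketch, assuming the standard MLJBase traits, of how a declaration like this can be checked against actual data; this is roughly the test a machine performs when binding a model to data:)

```julia
using MLJBase
import BetaML

X = MLJBase.table([1.0 10.5; 1.5 missing; 1.8 8.0])   # a table with a missing entry

MLJBase.scitype(X)                                        # observed scitype of the data
MLJBase.input_scitype(BetaML.Clustering.MissingImputator())   # the declared input_scitype
MLJBase.scitype(X) <: MLJBase.input_scitype(BetaML.Clustering.MissingImputator())  # compatibility check
```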
```julia
model = BetaML.Clustering.MissingImputator()
X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38;
missing -2.3; 5.2 -2.4] |> MLJBase.table
model = BetaML.Clustering.GMMClusterer()
mach = machine(model, X) |> fit!
julia> transform(mach, X)
ERROR: X must me `nothing` in `transform(m::GMMClusterer,firResults,nothing)`. If you want the cluster predictions of new data using already learned structure use `predict(m::GMMClusterer,firResults,Xnew)`
```

It looks like this error is a left-over from
I've looked over the interfaces for the Perceptron models and they look good, thanks. You still have
I corrected the scitype and removed the

Concerning the

Alternatively, I could forget

In both cases I would like to avoid that, in order to find the cluster probabilities, the user has to pass the X data twice, first in

What do you prefer?
Thanks for looking into this further. I'm afraid neither option looks great to me.
According to the API, the

Also, having a
I'm not fond of this option either, for the same reasons. I understand that you want to avoid the user having to re-enter the data, but note that one can do the following:

```julia
using MLJ
model = (@load KMeans pkg=Clustering add=true)()
X, _ = make_moons();
mach = machine(model, X) |> fit!
predict(mach, rows=:)
```

An alternative solution (that saves one line of code) would be to
Ok, I removed

Please feel free to close this issue now, thank you for your patience.... (yes, I still think that a
Okay. Let me know when you tag a release and I will register all the new models and close this when that's done.
Tagged... thank you again for your support... /Antonello
Will try to get to this sometime this week.
@sylvaticus Could you please confirm this is the full list of BetaML models you intend to add to MLJ?

```julia
julia> models("BetaML")
11-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
(name = DecisionTreeClassifier, package_name = BetaML, ... )
(name = DecisionTreeRegressor, package_name = BetaML, ... )
(name = GMMClusterer, package_name = BetaML, ... )
(name = KMeans, package_name = BetaML, ... )
(name = KMedoids, package_name = BetaML, ... )
(name = KernelPerceptronClassifier, package_name = BetaML, ... )
(name = MissingImputator, package_name = BetaML, ... )
(name = PegasosClassifier, package_name = BetaML, ... )
(name = PerceptronClassifier, package_name = BetaML, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )
(name = RandomForestRegressor, package_name = BetaML, ... )
```
Yes, I confirm. It is a bit late now, but I also have some doubts about the name "MissingImputator".
Yes, this is always tricky. Actually, having two models with different names that happen to do the same thing is not a bad resolution to this problem (so two structs but common methods). For now, I'm going to proceed with the update as is.
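(A rough sketch of the "two structs but common methods" idea, with invented names; each wrapper gets its own name and docstring while both dispatch to one shared routine:)

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# two distinct model types, one per interpretation (names are hypothetical)
mutable struct MyGMMClusterer <: MMI.Unsupervised; K::Int; end
mutable struct MyGMMImputer   <: MMI.Unsupervised; K::Int; end

const AnyGMMWrapper = Union{MyGMMClusterer, MyGMMImputer}

# a single shared fit method wrapping the common EM routine
function MMI.fit(m::AnyGMMWrapper, verbosity, X)
    # ... call the shared EM implementation here, using m.K ...
    fitresult, cache, report = nothing, nothing, nothing
    return fitresult, cache, report
end
```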
Hello, I have interfaced to MLJ the following BetaML models: KMeans, KMedoids, GMM, MissingImputator. The MLJ documentation is a bit sparse for unsupervised models, compared to supervised ones, so let me check I have done it correctly.

KMeans and KMedoids:
- `<: MMI.Unsupervised`
- `fit` trains the model (computes the centroids / assigns the elements) --> fitResults
- `transform` returns a (nObservations, nClasses) matrix of distances from the class centres
- `predict` returns a `CategoricalArray` of the predicted classes given new data (the centroids don't move further)

GMM:
- `<: MMI.Unsupervised` (there is no distinction in unsupervised models between deterministic and probabilistic)
- `fit` trains the model (estimates the mixtures / assigns the probabilities) --> fitResults
- `predict` returns a vector of UnivariateFinite distributions
- `transform` is just an alias of predict (not sure what else it should do)

MissingImputator:
- `<: MMI.Static`
- `transform` returns a copy of X with the missing values imputed

(A generic, illustrative sketch of this fit / transform / predict pattern appears after this list.)

Please let me know if this set-up makes sense for MLJ. The full interface for the new models is here.
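For readers less familiar with the MLJ unsupervised API, here is a self-contained sketch of the three-method pattern outlined above. It is illustrative only, not the BetaML implementation: the clustering step is a deliberately naive Lloyd-style loop, and `ToyKMeans` and all other names are invented.

```julia
import MLJModelInterface
const MMI = MLJModelInterface
using MLJBase                  # supplies the data front end behind MMI.matrix / MMI.table
using CategoricalArrays, LinearAlgebra, Statistics

mutable struct ToyKMeans <: MMI.Unsupervised
    K::Int
end
ToyKMeans(; K=3) = ToyKMeans(K)

function MMI.fit(m::ToyKMeans, verbosity, X)
    Xm = MMI.matrix(X)
    n = size(Xm, 1)
    centers = Xm[rand(1:n, m.K), :]            # crude random initialisation
    for _ in 1:10                              # a few fixed Lloyd iterations
        labels = [argmin([norm(Xm[i, :] - centers[k, :]) for k in 1:m.K]) for i in 1:n]
        for k in 1:m.K
            rows = findall(==(k), labels)
            isempty(rows) || (centers[k, :] .= vec(mean(Xm[rows, :], dims=1)))
        end
    end
    return centers, nothing, nothing           # fitresult, cache, report
end

# transform: (nObservations, nClasses) matrix of distances from the class centres
function MMI.transform(m::ToyKMeans, fitresult, X)
    Xm = MMI.matrix(X)
    D = [norm(Xm[i, :] - fitresult[k, :]) for i in 1:size(Xm, 1), k in 1:size(fitresult, 1)]
    return MMI.table(D)
end

# predict: categorical vector of assigned classes for new data (centres stay fixed)
function MMI.predict(m::ToyKMeans, fitresult, X)
    D = MMI.matrix(MMI.transform(m, fitresult, X))
    return CategoricalArrays.categorical([argmin(D[i, :]) for i in 1:size(D, 1)])
end

# direct use of the interface functions, as in the REPL sessions above:
model = ToyKMeans(K=3)
X = MLJBase.table(randn(50, 2))
fitresult, _, _ = MMI.fit(model, 0, X)
MMI.predict(model, fitresult, X)   # cluster assignments as a CategoricalArray
```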