Use MultinomialNB for iris classification (Int -> Float) #41

Open

ayush1999 opened this issue Feb 21, 2019 · 6 comments
I'm trying to use MultinomialNB for iris classification, but it looks like fit for MultinomialNB expects X to be a Matrix{Int64}, which is not the case for my dataset (and similarly for the predict function). Wouldn't it be better to make the type Float64, since that would be more generic?
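A minimal sketch of what I'm hitting (constructor signature as in the README; stand-in data, not the real iris set):

```julia
using NaiveBayes

# Stand-in for iris: 4 continuous features × 150 samples, Float64 by nature.
X = rand(4, 150)                                   # Matrix{Float64}
y = rand([:setosa, :versicolor, :virginica], 150)

m = MultinomialNB([:setosa, :versicolor, :virginica], 4)
fit(m, X, y)   # MethodError: fit is only defined for X::Matrix{Int64}
```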

dfdx (Owner) commented Feb 22, 2019

As far as I remember, my initial idea was that multinomial NB uses object counts as features, and counts can't be non-integer. But now I realize that might be too limiting, so I'll try to update the code and see whether the tests pass.

ablaom (Collaborator) commented Feb 25, 2019

Perhaps I'm missing something, but I don't believe it makes sense to apply MultinomialNB to the iris data. As @dfdx implies, the input data for a MultinomialNB is categorical, not continuous.

dfdx (Owner) commented Mar 4, 2019

I tried Float64/Real feature "counts" in the multinomial-nb-with-real-counts branch. Technically it works, but I'm still somewhat suspicious of the concept; at the very least I need to read through the code once more to check what real-valued counts may lead to.
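For reference, the scoring rule itself doesn't require integer counts, which is presumably why it works. A minimal sketch (my notation, not the package's internals):

```julia
# Multinomial NB scores class c for input x via
#   log P(c | x) ∝ log P(c) + Σᵢ xᵢ · log θ_{c,i}
# which is well-defined for any nonnegative real xᵢ (e.g. tf-idf weights),
# not just integer counts.
log_score(x, logprior, logtheta) = logprior + sum(x .* logtheta)

log_score([2.0, 0.5, 1.5], log(0.3), log.([0.2, 0.3, 0.5]))
```

scikit-learn's MultinomialNB documents the same allowance: fractional counts such as tf-idf may also work in practice.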

ablaom (Collaborator) commented Mar 4, 2019

A related issue: it would be useful if the "regularisation" parameter alpha in the multinomial case were allowed to be continuous (Lidstone smoothing). Currently it must be an integer, I think. I understand that 0 < alpha < 1 is a common choice.
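For concreteness, a sketch of the smoothing rule (illustrative only, not your internals):

```julia
# Lidstone-smoothed per-class feature probabilities:
#   θ_{c,i} = (N_{c,i} + α) / (N_c + α·V),  with V = number of features.
# α = 1 recovers Laplace smoothing; 0 < α < 1 is Lidstone smoothing proper,
# so alpha needs to be a Real, not an Int.
function smoothed_theta(counts::AbstractVector{<:Real}, alpha::Real)
    V = length(counts)
    return (counts .+ alpha) ./ (sum(counts) + alpha * V)
end

smoothed_theta([3, 0, 7], 0.5)   # non-integer alpha works fine
```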

dfdx (Owner) commented Mar 5, 2019

After several iterations of updates, I now think it might be easier to rewrite the code for Julia 1.x entirely (the current version is essentially written for Julia 0.4 with compatibility adapters). This raises another question: is keeping a separate repo even reasonable? If I were to re-implement the current algorithms in MLJModels, how would you estimate the effort for someone new to the MLJ infrastructure?

ablaom (Collaborator) commented Mar 6, 2019

Well, that would be awesome.

If I were to re-implement the current algorithms in MLJModels, how would you estimate the effort for someone new to the MLJ infrastructure?

Pretty easy, I expect. We've already written the wrapping code for Multinomial and Gauss here. So, if you kept your API the same, this would be a copy-and-paste operation. There are some simplifications that could be made if you were working from scratch, but I'd be happy to forgo those, and in any case I will be happy to provide guidance.

(There is some dancing around in our wrap because the target is categorical and MLJ is fussy about preserving unseen classes in the pool in the (probabilistic) predictions. That is, the training target "election_winner" may never take the value "Green Party" in the training set, but if "Green Party" is in the target pool, then the predicted distribution must have "Green Party" in its support, with probability zero.)
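For illustration, the pool mechanics are just CategoricalArrays behaviour, nothing MLJ-specific:

```julia
using CategoricalArrays

y = categorical(["Labour", "Tory", "Labour"])
levels!(y, ["Green Party", "Labour", "Tory"])   # pool gains an unseen class

levels(y)   # includes "Green Party", although it never occurs in y
# A probabilistic prediction must then keep "Green Party" in its support,
# assigning it probability zero, rather than dropping it.
```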

This raises another question: is keeping a separate repo even reasonable?

From our point of view, native implementations of the MLJ interface are preferred. One reason is to better isolate testing; another is to distribute responsibility for maintenance and documentation. Native implementation is simple: your NaiveBayes.jl module imports MLJBase and implements the API, and then you raise an issue at MLJRegistry requesting that we make your package accessible to all MLJ users (you don't need to do this to use NB with MLJ, just to make your models findable and loadable by MLJ users who don't know about your package). See the end of the guide for details.
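In outline, the glue code would look something like the sketch below. The struct name is hypothetical, it assumes your current MultinomialNB(classes, n_features; alpha=...) constructor and fit(m, X, y), and the MLJBase specifics may differ from the current guide; the main point is the fit contract of returning (fitresult, cache, report):

```julia
import MLJBase
import NaiveBayes

# Hypothetical MLJ-facing model struct wrapping NaiveBayes.MultinomialNB.
mutable struct MultinomialNBClassifier <: MLJBase.Probabilistic
    alpha::Float64
end
MultinomialNBClassifier(; alpha=1.0) = MultinomialNBClassifier(alpha)

# MLJ's fit contract: return (fitresult, cache, report).
function MLJBase.fit(model::MultinomialNBClassifier, verbosity::Int, X, y)
    Xmatrix = permutedims(MLJBase.matrix(X))   # MLJ table -> features × samples
    # A real wrap would first convert the categorical y to raw labels and
    # record the full class pool for use in predict.
    nb = NaiveBayes.MultinomialNB(unique(y), size(Xmatrix, 1);
                                  alpha=model.alpha)
    NaiveBayes.fit(nb, Xmatrix, y)
    return nb, nothing, NamedTuple()
end

# predict must return distributions over the full target pool (the "dancing
# around" described above); those details are elided here.
```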

Have you given any thought to how to handle sparse data? (For the sake of expediency, our wrap just converts the generic tabular input provided by the MLJ user into a matrix and feeds that to NB.) If I remember correctly, you presently allow dictionary input, but I guess there are better alternatives by now?
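For example, a SparseArrays-backed count matrix would seem a natural fit, since NB training mostly needs per-class column sums:

```julia
using SparseArrays

# features × samples count matrix, mostly zeros (typical for text data)
X = sparse([1, 3, 4], [1, 1, 2], [2.0, 1.0, 5.0], 4, 2)

sum(X, dims=2)   # per-feature totals, computed without a dense copy
```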
