Stack overflow for a not so big task #1738
It seems like this is a problem with svm and not mlr. See the stackoverflow question here.
And for problems like this, please always post a complete, reproducible example.
The problem occurs because mlr calls

```r
f = getTaskFormula(.task)
e1071::svm(f, data = getTaskData(.task, .subset),
  probability = .learner$predict.type == "prob", ...)
```

Instead, we could use

```r
td = getTaskData(.task, .subset, target.extra = TRUE)
e1071::svm(td$data, td$target,
  probability = .learner$predict.type == "prob", ...)
```

which appears to work a bit better at first glance (though I only checked the training step, as I have limited memory right now). I reproduced it like this:

```r
xt = cbind(as.data.frame(matrix(rnorm(298 * 22160), ncol = 22160)),
  as.factor(sample(c("a", "b"), 298, TRUE)))
names(xt)[length(xt)] = "x"
av_task = makeClassifTask("xt", xt, target = "x")
r = resample(learner = makeLearner("classif.svm", predict.type = "prob"),
  task = av_task,
  resampling = makeResampleDesc(method = "CV", iters = 3),
  show.info = FALSE, measures = auc)
```
Ah, well spotted! Could you make a PR using the other interface so we can quickly see if that breaks anything else, please?
I can do that tomorrow. I have seen quite a lot of learners that use
Not as far as I know, but @berndbischl and @mllg would know better. |
Not really, just a "habit". Nearly all packages work with a formula interface and only very few provide an alternative. It is usually safe to switch to a character/data.frame interface, but be aware that it is not unlikely that you will discover some bugs, as these are less tested 😞
Is there any simple fix for this? I currently only need glmnet and svm, so I am trying to create two new learners via makeRLearnerClassif(). What do I need to modify in the trainLearner.classif.glmnet function? Remove .formula from the args and add x, y instead?
To answer my own question: the reason is that the formula interface accepts factors (converting them to dummy variables), while the data.frame interface (really a matrix interface) does not. To solve this, mlr would need to do the dummy encoding itself. It would also need to watch out that the "
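For illustration, the dummy-encoding step the comment describes could be sketched in base R with `model.matrix()` (this is an illustration of the concept, not mlr's actual code):

```r
# Illustration only: dummy-encode factor columns so a matrix-based
# learner interface (like the x/y interface of e1071::svm) can
# consume them.
df <- data.frame(
  num = c(1.5, 2.0, 3.2),
  fac = factor(c("a", "b", "a"))
)

# model.matrix() expands factor columns into 0/1 indicator columns;
# "~ . - 1" drops the intercept so every factor level gets a column.
x <- model.matrix(~ . - 1, data = df)
print(colnames(x))  # "num" "faca" "facb"
```

The subtlety hinted at above is that the encoding must be done consistently between training and prediction data, which is part of why pushing this into mlr is non-trivial.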
@ajing if you really need to use it, devtools::install_github("mlr-org/mlr", ref = "fix_1738_svm_no_formula") works; although I can't recommend it: I cannot guarantee correctness, and in particular, the
@ajing the actual solution is to use the command line option --max-ppsize to increase the size of R's pointer protection stack. The default is 50,000 and the maximum is 500,000 (but a larger stack leads to some slowdown due to slower GC), see
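For example, starting R from a terminal with a larger protection stack might look like this (the value 500000 is the documented maximum, used here for illustration):

```shell
# Launch R with an enlarged pointer protection stack
# (default is 50000; values above that trade memory/GC speed
# for the ability to protect more objects at once).
R --max-ppsize=500000
```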
Well, as an alternative, what do you think of having a second learner that doesn't use the formula interface and doesn't support factors? I don't think it's a good idea to do this kind of conversion manually.
I guess it depends on the standards that mlr wants to hold itself to. I agree that "manual" conversion is not the way to go. The best solution I can think of (besides ignoring users who need 16k+ features...) is to make the method conditional: pass a data.frame if everything is numeric, and a formula otherwise. If we opt for that, the next logical step would be to look at the other learners that use formula and do something similar.
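A hedged sketch of what such conditional dispatch could look like. The helper name train_svm_conditional is hypothetical and the code is a simplified stand-in for mlr's trainLearner method, not its actual implementation:

```r
# Illustration: choose the svm interface based on feature types.
# `task_data` stands in for the result of getTaskData(.task, .subset).
train_svm_conditional <- function(task_data, target_col, ...) {
  features <- task_data[, setdiff(names(task_data), target_col), drop = FALSE]
  all_numeric <- all(vapply(features, is.numeric, logical(1)))

  if (all_numeric) {
    # Matrix interface: avoids the huge terms object that the
    # formula interface builds for tens of thousands of columns.
    e1071::svm(as.matrix(features), task_data[[target_col]], ...)
  } else {
    # Formula interface: lets svm handle factor columns itself.
    f <- stats::as.formula(paste(target_col, "~ ."))
    e1071::svm(f, data = task_data, ...)
  }
}
```

The design trade-off discussed below is exactly this: the conditional keeps the fast path for all-numeric data, but it makes the wrapper thicker and means the two code paths can diverge in behavior.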
Right, it's the latter part that I'm worried about. At the moment, the mlr learners are relatively thin wrappers around the actual learners, and making this wrapper thicker would increase maintenance costs. As we have a lot of learners, I think we should avoid this. Thoughts, @berndbischl @mllg @jakob-r?
Thanks for the prompt reply! Another relevant question: is there any simple way to debug customized functions like makeRLearnerRegr(), makePreprocWrapper(), makeFilterWrapper()? Using setBreakpoint? When I use traceback(), it usually returns a lot of output. Thanks!
Well, that depends on your definition of "simple". You will get a lot of information, but usually the last stack frame(s) are the only thing you need to look at.
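A small sketch of that debugging workflow. The function name my_wrapper is hypothetical; the point is that the error message and the innermost frames are usually what matters:

```r
# Illustration: a custom wrapper that fails somewhere inside.
my_wrapper <- function(x) stop("something went wrong inside the wrapper")

# Capture the error message instead of aborting, so we can inspect it.
result <- tryCatch(my_wrapper(1), error = function(e) conditionMessage(e))
print(result)  # "something went wrong inside the wrapper"

# Interactively, setting options(error = recover) drops you into the
# frame where the error occurred, which is often easier than reading
# a full traceback():
# options(error = recover)
```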
Thanks @mb706 !
works well.
the comment that using the formula interface is just a (bad) habit is REALLY not true. but guess what: i discovered exactly the same problems that @mb706 reports here.
so this is FAR from a simple switch. and the MAIN reason we use the formula interface is that this is the only interface we can SAFELY rely on. i see this as an important point. and i would love to have the package here be even more performant. but if we do this, this will be a very big change. which needs to be
if somebody sees a clear and doable solution please post.
what i always do if i have problems with client code in other packages i dont control: so like
b) when you have the source code locally, dont load `library(mlr)`, use `devtools::load_all("mlr")`
there are some important points in this thread we should really put into an FAQ
We could modify … (Ideally we would want to suppress this warning if the user is already using …) (We could make this warning appear only once per R session.)
@mb706 you wrote "R --max-ppsize 100000" to increase the size of the protect call stack. But how does one do that using RStudio on Ubuntu?
Ubuntu? |
I create a task with 298 samples and 22160 features, then do some resampling for CV. However, I get a stack overflow error. Is there any way to avoid this?
To reproduce the error (according to mb706's comment):