
Stack overflow for a not so big task #1738

Closed
ajing opened this issue Mar 27, 2017 · 24 comments · Fixed by #1740

Comments

@ajing

ajing commented Mar 27, 2017

I created a task with 298 samples and 22160 features, then did some resampling for CV. However, I got a stack overflow error. Is there any way to avoid this?

To reproduce the error (according to mb706's comment below):

xt = cbind(as.data.frame(matrix(rnorm(298*22160), ncol=22160)), as.factor(sample(c('a', 'b'), 298, T)))
names(xt)[length(xt)] = 'x'
av_task = makeClassifTask("xt", xt, target='x')
r = resample(learner = makeLearner("classif.svm", predict.type = "prob"), task = av_task, resampling = makeResampleDesc(method = "CV", iters = 3), show.info = FALSE, measures = auc)
Error: protect(): protection stack overflow
@SteveBronder
Contributor

It seems like this is a problem with svm and not mlr. See the Stack Overflow question here.

@larskotthoff
Member

And for problems like this, please always post a complete, reproducible example.

@mb706
Contributor

mb706 commented Mar 27, 2017

The problem occurs because mlr calls svm via the formula interface:

f = getTaskFormula(.task)
e1071::svm(f, data = getTaskData(.task, .subset), probability = .learner$predict.type == "prob", ...)

Instead, we could use

td = getTaskData(.task, .subset, target.extra = TRUE)
e1071::svm(td$data, td$target, probability = .learner$predict.type == "prob", ...)

which appears to work a bit better at first glance (though I only checked the training step, as I have limited memory right now).

I reproduced it like this:

xt = cbind(as.data.frame(matrix(rnorm(298*22160), ncol=22160)), as.factor(sample(c('a', 'b'), 298, T)))
names(xt)[length(xt)] = 'x'
av_task = makeClassifTask("xt", xt, target='x')
r = resample(learner = makeLearner("classif.svm", predict.type = "prob"), task = av_task, resampling = makeResampleDesc(method = "CV", iters = 3), show.info = FALSE, measures = auc)

@larskotthoff
Member

Ah, well spotted! Could you make a PR using the other interface so we can quickly see if that breaks anything else please?

@mb706
Contributor

mb706 commented Mar 27, 2017

I can do that tomorrow. I have seen quite a lot of learners that use the formula interface instead of a data.frame. In many of these learners, the .formula method only extracts the data and then calls the regular (.data.frame) method anyway. Is there a reason mlr prefers formula over data.frame?

@larskotthoff
Member

Not as far as I know, but @berndbischl and @mllg would know better.

@mllg
Member

mllg commented Mar 27, 2017

Is there a reason mlr prefers formula over data.frame?

Not really, just a "habit". Nearly all packages work with a formula interface and only very few provide an alternative. It is usually safe to switch to a character/data.frame interface, but be aware that you may well discover some bugs, as these interfaces are less tested 😞
The speedup can be tremendous, though.

@ajing
Author

ajing commented Mar 27, 2017

Is there any simple fix for this? I currently only need glmnet and svm, so I tried to create two new learners with makeRLearnerClassif. What do I need to modify in the trainLearner.classif.glmnet function? Remove .formula from the arguments and add x and y?

@mllg
Member

mllg commented Mar 28, 2017

glmnet is not using the formula interface. As soon as @mb706 manages to create a PR for e1071's SVM, you should be all set after switching to the devel version.
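
For reference, a rough sketch of what such a non-formula trainLearner method might look like, along the lines of mb706's snippet above (an illustration only, not the actual PR; it covers only the training step and ignores categorical features):

trainLearner.classif.svm = function(.learner, .task, .subset, .weights = NULL, ...) {
  # bypass getTaskFormula(): hand data and target to e1071::svm directly
  td = getTaskData(.task, .subset, target.extra = TRUE)
  e1071::svm(td$data, td$target, probability = .learner$predict.type == "prob", ...)
}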

@mb706
Contributor

mb706 commented Mar 28, 2017

Is there a reason mlr prefers formula over data.frame?

To answer my own question: the reason is that the formula interface accepts factors (converting them to dummy variables) while the data.frame interface (really a matrix interface) does not.

To solve this, mlr would need to do the dummy encoding itself. It would also need to ensure that svm's "scale" parameter is padded to account for the larger number of columns, and is set to FALSE for the dummy columns.
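
For illustration, a rough sketch (made-up data, not mlr code) of what such manual dummy encoding with a padded scale argument could look like:

library(e1071)

df = data.frame(num1 = rnorm(100), num2 = rnorm(100),
                cat = factor(sample(c("a", "b", "c"), 100, TRUE)),
                y = factor(sample(c("pos", "neg"), 100, TRUE)))

X = model.matrix(~ . - 1, data = df[, setdiff(names(df), "y")])  # dummy-encode the factor
is.dummy = !(colnames(X) %in% names(df))                         # columns created by the encoding
fit = svm(X, df$y,
          scale = !is.dummy,  # scale only the original numeric columns, not the dummies
          probability = TRUE)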

@mb706
Contributor

mb706 commented Mar 28, 2017

@ajing if you really need to use svm, you can try to use the fix_1738_svm_no_formula branch (maybe installing it with

devtools::install_github("mlr-org/mlr", ref = "fix_1738_svm_no_formula")

works), although I can't fully recommend it: I cannot guarantee correctness, and in particular the svm learner in that branch does not handle categorical features.

@mb706
Contributor

mb706 commented Mar 28, 2017

@ajing the actual solution is to use the command line option --max-ppsize, e.g.

R --max-ppsize 100000

to increase the size of the pointer protection stack. The default is 50,000 and the maximum is 500,000 (larger values lead to some slowdown due to slower garbage collection); see help(Memory).

@larskotthoff
Member

Well, as an alternative, what do you think of having a second learner that doesn't use the formula interface and doesn't support factors? I don't think it's a good idea to do this kind of conversion manually.

@mb706
Contributor

mb706 commented Mar 28, 2017 via email

@larskotthoff
Member

Right, it's the latter part that I'm worried about. At the moment, the mlr learners are relatively thin wrappers around the actual learners, and making this wrapper thicker would increase maintenance costs. As we have a lot of learners, I think we should avoid this.

Thoughts, @berndbischl @mllg @jakob-r?

@ajing
Author

ajing commented Mar 28, 2017

Thanks for the prompt reply!

Another related question: is there any simple way to debug customized functions like makeRLearnerRegr(), makePreprocWrapper(), or makeFilterWrapper()? Using setBreakpoint? When I use traceback(), it usually returns a lot of information.

Thanks!

@larskotthoff
Member

Well, that depends on your definition of "simple". You will get a lot of information, but usually the last stack frame(s) are the only thing you need to look at.

@ajing
Author

ajing commented Mar 28, 2017

Thanks @mb706 !

R --max-ppsize 500000

works well.

@berndbischl
Member

berndbischl commented Mar 29, 2017

Is there a reason mlr prefers formula over data.frame?
Not really, just a "habit". Nearly all packages work with a formula interface and only very few provide an alternative. It is usually safe to switch to a character/data.frame interface, but be aware that you may well discover some bugs, as these interfaces are less tested.
The speedup can be tremendous, though.

The comment that our use of the formula interface is just a (bad) habit is REALLY not true.
I have thought about changing to a matrix interface, or conditionally supporting one, quite often. And it is also NOT true that switching to a "char/df" interface is "simple and safe". I don't even know what that means; most packages support a NUMERICAL MATRIX interface.

But guess what: I discovered exactly the same problems that @mb706 reports here.

  • You now have to convert factors manually.
  • And, even worse: how will this affect certain operations inside the algorithm, like scale and so on? (EDIT: or think about algorithms that perform an internal forward search?)
  • I also don't like that the learner interfaces become even longer.

So this is FAR from a simple switch, and the MAIN reason we use the formula interface is that it is the only interface we can SAFELY rely on.

I see this as an important point, and I would love to make the package even more performant. But if we do this, it will be a very big change, which needs to be

  • properly discussed,
  • implemented correctly (for more than one learner),
  • tested correctly,
    and so on.

If somebody sees a clear and doable solution, please post it.

@berndbischl
Member

berndbischl commented Mar 29, 2017

Thanks for the prompt reply!

Another related question: is there any simple way to debug customized functions like makeRLearnerRegr(), makePreprocWrapper(), or makeFilterWrapper()? Using setBreakpoint? When I use traceback(), it usually returns a lot of information.

Thanks!

What I always do if I have problems with code in other packages that I don't control:
a) Download the source. You can do that manually from CRAN or GitHub, but it is much easier to install the "rt" tool from @mllg and use the rclone command in the shell:
https://github.com/rdatsci/rt

So, for example:

rclone mlr

b) Once you have the source code locally, don't load library(mlr); use devtools::load_all("mlr") instead.
c) Now you can set breakpoints in the mlr sources, add prints, or change code.
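
For example, a minimal sketch of that workflow (the local path is just a placeholder):

# assuming the mlr sources were cloned to ~/work/mlr (e.g. via rclone mlr)
devtools::load_all("~/work/mlr")       # instead of library(mlr)
debugonce(trainLearner.classif.svm)    # stop inside the svm learner's training code
mod = train(makeLearner("classif.svm"), iris.task)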

@berndbischl
Member

There are some important points in this thread that we should really put into an FAQ.

@mb706
Contributor

mb706 commented Mar 29, 2017

We could modify getTaskFormula to give a warning (instructing the user to use --max-ppsize) if the task has more than 16k features.

(Ideally we would want to suppress this warning if the user is already using --max-ppsize. The "ppsize" is stored in a variable called R_PPStackSize, but I have not found it in any of the R include files on my system, so I'm not sure what the options are in trying to query it with a C function.)

(We could make this warning appear only once per R session.)
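
A rough sketch of what such a once-per-session warning helper could look like (the helper name and 16000-feature threshold are made up for illustration; this is not mlr code):

warnLargeFormulaOnce = local({
  warned = FALSE
  function(n.feats, threshold = 16000) {
    if (!warned && n.feats > threshold) {
      warning("Building a formula with ", n.feats,
              " features may overflow R's protection stack; ",
              "consider starting R with a larger --max-ppsize.", call. = FALSE)
      warned <<- TRUE  # remember that the warning has already been shown this session
    }
    invisible(NULL)
  }
})

getTaskFormula could then call this helper with the task's feature count before building the formula.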

@jitendrashahiitb

@mb706 you wrote:
"the actual solution is to use the command line option --max-ppsize, e.g.

R --max-ppsize 100000

to increase the size of the pointer protection stack."

But how does one do that using RStudio on Ubuntu? I have seen how it is done on Windows.

@Steviey

Steviey commented Oct 31, 2022

Ubuntu?
