
Java based learners fail with parallelMap multicore #1898

Closed
mb706 opened this issue Jul 16, 2017 · 9 comments
Comments

@mb706 (Contributor) commented Jul 16, 2017

This is because fork(), which multicore parallelization is ultimately based on, and the Java VM don't play along well if the JVM is started before the fork happens. Loading Java-based packages, e.g. "RWeka", starts the JVM, so if the package gets loaded outside of the parallelMap call, the computation hangs.

> library("mlr")
Loading required package: ParamHelpers
> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> resample("classif.IBk", pid.task, cv5)  # loads RWeka, then calls parallelMap
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2:
# hang

If, on the other hand, the fork happens before the Java VM is loaded, it works fine:

> library("mlr")
Loading required package: ParamHelpers
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> parallelMap(function(x) resample("classif.IBk", pid.task, cv5), 1:2, simplify=FALSE)
Mapping in parallel: mode = multicore; cpus = 2; elements = 2.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 1: mmce.test.mean=0.275
[Resample] cross-validation iter 2: mmce.test.mean=0.331
[Resample] cross-validation iter 2: mmce.test.mean=0.273
# ...
# no hang

I therefore suggest adding a configureMlr option to defer loading of packages until a learner's train or predict function is called. The user would still need to be careful not to load "RWeka" before using multicore, but this at least gives them the option. When a learner is constructed, instead of loading the learner's package, mlr should simply check that the requested package exists.
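A minimal sketch of what such deferred loading could look like. The helper names are illustrative assumptions, not actual mlr internals:

```r
# Sketch: at learner construction, only verify the package is installed
# instead of attaching it (which would start the JVM for Java packages).
checkLearnerPackage = function(pkg) {
  if (!length(find.package(pkg, quiet = TRUE))) {
    stop(sprintf("Package '%s' required by this learner is not installed.", pkg))
  }
  invisible(TRUE)
}

# Sketch: defer the actual loading to train/predict time, so that with
# multicore the JVM only ever starts inside the forked workers.
loadLearnerPackage = function(pkg) {
  # loadNamespace() does not attach the package but runs its .onLoad hook,
  # which is what starts the JVM for rJava-based packages.
  loadNamespace(pkg)
}
```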

@mb706 (Contributor Author) commented Jul 16, 2017

A current workaround is to load the learner from a save file. E.g. if the learner object is restored from the .RData file at startup, resampling with multicore works.

> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> library("mlr")
Loading required package: ParamHelpers
> lrn = makeLearner("classif.IBk")
> resample(lrn, pid.task, cv5)
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2: ^C^C^C^C^C
> q("yes")
$ R
R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
[...]
> library("mlr")
Loading required package: ParamHelpers
> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> resample(lrn, pid.task, cv5)
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2: mmce.test.mean=0.266
mmce.test.mean=0.318
# no hang

@berndbischl (Member)

  1. We did have that issue before, but not with the insights you presented here.
     It is also more of a parallelMap issue, right?

  2. So the problem is that we load RWeka on the master, at learner construction, and that is what makes the bug appear?

@mb706 (Contributor Author)

mb706 commented Jul 17, 2017

  1. AFAICS parallelMap cannot do much about it; when using "multicore" and the JVM is already loaded, Java cannot be used (link). It also appears impossible to load a new JVM or unload the old one (link).
  2. Basically, loading anything that uses Java in the main process, be it RWeka, extraTrees, or rJava itself, will make it impossible to run a Java-based learner parallelized with parallelMap + multicore afterwards.
     The best we can do is not load rJava on purpose in the main process. If the user loaded it beforehand for some other reason, there is nothing I can see we could do, except maybe check for this in the trainLearner function to prevent hanging.
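A hedged sketch of such a guard (hypothetical helpers, not actual mlr code): before forking, check whether the rJava namespace is already loaded on the master and fail fast instead of hanging.

```r
# Sketch: detect whether rJava (and hence possibly a JVM) is already loaded
# in the current process.
jvmAlreadyLoaded = function() {
  "rJava" %in% loadedNamespaces()
}

# Sketch: if a JVM may already be running, forking with multicore is unsafe
# for Java-based learners, so stop with an informative error.
assertForkSafeForJava = function() {
  if (jvmAlreadyLoaded()) {
    stop("rJava is already loaded in the master process; ",
         "Java-based learners will hang under multicore. ",
         "Use parallelStartSocket() instead, or avoid loading rJava first.")
  }
  invisible(TRUE)
}
```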

@Masutani

I ran my own rJava-based custom learner. It works fine single-threaded; however, with parallelStartSocket() I got a session timeout like this:


Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 20; elements = 1.
Error in stopWithJobErrorMessages(inds, vcapply(result.list[inds], as.character)) :
Errors occurred in 1 slave jobs, displaying at most 10 of them:

00001: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.IllegalStateException: This trial session has expired.
Each trial session is limited to 120 minutes.

Is this caused by the same restriction on mclapply (parallelMap) compatibility with the JVM as you stated here?

@mb706 (Contributor Author)

mb706 commented Sep 14, 2018

parallelStartSocket is not based on and should not call mclapply, so I am pretty sure this is not caused by that issue.

(Note that parallelMap in "socket" mode behaves slightly differently from "multicore" mode in that the worker jobs are executed in a (kind of) vanilla environment with sockets; you might have to call parallelExport and parallelLibrary with "socket" when you wouldn't need to with "multicore".)
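For example, a sketch of the socket setup, assuming the learner needs RWeka on the workers (not runnable without mlr and RWeka installed):

```r
library("parallelMap")
library("mlr")

parallelStartSocket(2)
# In socket mode the workers start as fresh R sessions, so packages and
# objects used inside the parallelized function must be shipped explicitly:
parallelLibrary("RWeka")        # load RWeka on each worker
# parallelExport("myObject")    # likewise for any objects the workers need

resample("classif.IBk", pid.task, cv5)
parallelStop()
```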

@Masutani

Masutani commented Oct 3, 2018

Hi, I confirmed the timeout is caused by something different from this issue, though the single-threaded run didn't take that long.
However, parallelStartSocket is a good alternative to parallelStartMulticore. What are the drawbacks of socket compared to multicore? Only overhead, and the necessity of exporting libraries?

@mb706 (Contributor Author)

mb706 commented Oct 4, 2018

Multicore uses the operating system's fork() to create child processes that have copy-on-write access to the parent process's memory. If you're working with a big dataset, this means you can potentially have many processes operating on the data while only using up memory for the dataset once. (I think sometimes R's garbage collection messes this up and more memory gets used than needed, but usually it works.) When you're using sockets, every individual worker process needs to load the data separately, so you have the overhead of (1) serialising the data in the main process and sending it to the worker processes and (2) keeping a copy of the data in memory for each process.
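The difference can be sketched with base R's parallel package, which parallelMap wraps (note that mclapply falls back to serial execution on Windows):

```r
library(parallel)

big = rnorm(1e6)  # stands in for a large dataset

# Fork-based (multicore): children see `big` via copy-on-write; nothing
# is serialised or copied up front.
res.fork = mclapply(1:2, function(i) mean(big), mc.cores = 2)

# Socket-based: workers are fresh R sessions, so `big` must be serialised
# and sent to each of them explicitly, and each holds its own copy.
cl = makeCluster(2)
clusterExport(cl, "big")
res.sock = parLapply(cl, 1:2, function(i) mean(big))
stopCluster(cl)
```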

(I don't know parallelStartSocket that well however, so don't take my word for it.)

@Masutani

Masutani commented Oct 5, 2018

Thanks for answering such a general question. I understand that parallelStartSocket has significant overhead compared to parallelStartMulticore. In my case, a 40-core CPU cannot be utilized without multiple threads/processes, and the multicore option cannot be used with my Java-based code (because of the original issue in this thread).
The socket backend seems to be the alternative in case of such incompatibility/scalability problems, and the only option on Windows.
By the way, I hope multi-level parallelization (e.g. benchmark * resample) will be supported.


stale bot commented Dec 18, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 18, 2019
@stale stale bot closed this as completed Dec 25, 2019