
Java based learners fail with parallelMap multicore #1898

Closed
mb706 opened this issue Jul 16, 2017 · 9 comments
Comments

@mb706 (Contributor) commented Jul 16, 2017

This is because fork(), which multicore parallelization is ultimately based on, and the Java VM don't play along well if the JVM is started before the fork happens. Loading Java-based packages, e.g. "RWeka", starts the JVM, so if the package gets loaded outside of the parallelMap call, the computation hangs.

> library("mlr")
Loading required package: ParamHelpers
> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> resample("classif.IBk", pid.task, cv5)  # loads RWeka, then calls parallelMap
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2:
# hang

If, on the other hand, the fork happens before the Java VM is loaded, it works fine:

> library("mlr")
Loading required package: ParamHelpers
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> parallelMap(function(x) resample("classif.IBk", pid.task, cv5), 1:2, simplify=FALSE)
Mapping in parallel: mode = multicore; cpus = 2; elements = 2.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 1: mmce.test.mean=0.275
[Resample] cross-validation iter 2: mmce.test.mean=0.331
[Resample] cross-validation iter 2: mmce.test.mean=0.273
# ...
# no hang

I therefore suggest adding a configureMlr option to defer loading of packages until a learner's train or predict function is called. The user would still need to be careful not to load "RWeka" before using multicore, but this at least gives them the option. When a learner is constructed, instead of loading the learner's package, mlr should simply check that the requested package exists.
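A minimal sketch of what such deferred loading could look like. The helper names are illustrative assumptions, not actual mlr internals:

```r
# Sketch: at learner construction, only verify the package is installed
# instead of attaching it (which would start the JVM for Java packages).
checkLearnerPackage = function(pkg) {
  if (!length(find.package(pkg, quiet = TRUE))) {
    stop(sprintf("Package '%s' required by this learner is not installed.", pkg))
  }
  invisible(TRUE)
}

# Sketch: defer the actual loading to train/predict time, so that with
# multicore the JVM only ever starts inside the forked workers.
loadLearnerPackage = function(pkg) {
  # loadNamespace() does not attach the package but runs its .onLoad hook,
  # which is what starts the JVM for rJava-based packages.
  loadNamespace(pkg)
}
```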

@mb706 (Contributor Author) commented Jul 16, 2017

A current workaround is to load the learner from a save file. E.g. if the learner object is restored from the .RData file at startup, resampling with multicore works.

> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> library("mlr")
Loading required package: ParamHelpers
> lrn = makeLearner("classif.IBk")
> resample(lrn, pid.task, cv5)
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2: ^C^C^C^C^C
> q("yes")
$ R
R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
[...]
> library("mlr")
Loading required package: ParamHelpers
> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> resample(lrn, pid.task, cv5)
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2: mmce.test.mean=0.266
mmce.test.mean=0.318
# no hang

@berndbischl (Member)

  1. We did have that issue before, but not with the insights you presented here.
     It is also more of a parallelMap issue, right?

  2. So the problem is that we load RWeka on the master, at learner construction, and that is what makes the bug appear?

@mb706 (Contributor Author)

mb706 commented Jul 17, 2017

  1. AFAICS parallelMap cannot do much about it; when using "multicore" and the JVM is already loaded, Java cannot be used (link). It also appears impossible to load a new JVM or unload the old one (link).
  2. Basically, loading anything that uses Java in the main process, be it RWeka, extraTrees, or rJava itself, will make it impossible to run a Java-based learner parallelized with parallelMap + multicore afterwards.
     The best we can do is not load rJava on purpose in the main process. If the user loaded it beforehand for some other reason, there is nothing I can see we could do, except maybe check for this in the trainLearner function to prevent hanging.
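A hedged sketch of such a guard (hypothetical helpers, not actual mlr code): before forking, check whether the rJava namespace is already loaded on the master and fail fast instead of hanging.

```r
# Sketch: detect whether rJava (and hence possibly a JVM) is already loaded
# in the current process.
jvmAlreadyLoaded = function() {
  "rJava" %in% loadedNamespaces()
}

# Sketch: if a JVM may already be running, forking with multicore is unsafe
# for Java-based learners, so stop with an informative error.
assertForkSafeForJava = function() {
  if (jvmAlreadyLoaded()) {
    stop("rJava is already loaded in the master process; ",
         "Java-based learners will hang under multicore. ",
         "Use parallelStartSocket() instead, or avoid loading rJava first.")
  }
  invisible(TRUE)
}
```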

@Masutani

I ran my own rJava-based custom learner. It works fine single-threaded; however, with parallelStartSocket() I got a session timeout like this:


Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 20; elements = 1.
Error in stopWithJobErrorMessages(inds, vcapply(result.list[inds], as.character)) :
Errors occurred in 1 slave jobs, displaying at most 10 of them:

00001: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.IllegalStateException: This trial session has expired.
Each trial session is limited to 120 minutes.

Is this caused by the same restriction on mclapply (parallelMap) compatibility with the JVM as you stated here?

@mb706 (Contributor Author)

mb706 commented Sep 14, 2018

parallelStartSocket is not based on and should not call mclapply, so I am pretty sure this is not caused by that issue.

(Note that parallelMap in "socket" mode behaves slightly differently from "multicore" mode in that the worker jobs are executed in a (kind of) vanilla environment with sockets; you might have to call parallelExport and parallelLibrary with "socket" when you wouldn't need to with "multicore".)
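For example, a sketch of the socket setup, assuming the learner needs RWeka on the workers (not runnable without mlr and RWeka installed):

```r
library("parallelMap")
library("mlr")

parallelStartSocket(2)
# In socket mode the workers start as fresh R sessions, so packages and
# objects used inside the parallelized function must be shipped explicitly:
parallelLibrary("RWeka")        # load RWeka on each worker
# parallelExport("myObject")    # likewise for any objects the workers need

resample("classif.IBk", pid.task, cv5)
parallelStop()
```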

@Masutani

Masutani commented Oct 3, 2018

Hi, I confirmed the timeout is caused by something different from this issue, though the single-threaded run didn't take that long.
However, parallelStartSocket is a good alternative to parallelStartMulticore. What are the drawbacks of socket compared to multicore? Only overhead, and the necessity of exporting libraries?

@mb706 (Contributor Author)

mb706 commented Oct 4, 2018

Multicore uses the operating system's fork() to create child processes that have copy-on-write access to the parent process's memory. If you're working with a big dataset, this means you can potentially have many processes operating on the data while only using up memory for the dataset once. (I think sometimes R's garbage collection messes this up and more memory gets used than needed, but usually it works.) When you're using sockets, every individual worker process needs to load the data separately, so you have the overhead of (1) serialising the data in the main process and sending it to the worker processes and (2) keeping a copy of the data in memory for each process.
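The difference can be sketched with base R's parallel package, which parallelMap wraps (note that mclapply falls back to serial execution on Windows):

```r
library(parallel)

big = rnorm(1e6)  # stands in for a large dataset

# Fork-based (multicore): children see `big` via copy-on-write; nothing
# is serialised or copied up front.
res.fork = mclapply(1:2, function(i) mean(big), mc.cores = 2)

# Socket-based: workers are fresh R sessions, so `big` must be serialised
# and sent to each of them explicitly, and each holds its own copy.
cl = makeCluster(2)
clusterExport(cl, "big")
res.sock = parLapply(cl, 1:2, function(i) mean(big))
stopCluster(cl)
```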

(I don't know parallelStartSocket that well however, so don't take my word for it.)

@Masutani

Masutani commented Oct 5, 2018

Thanks for answering such a general question. I understand that parallelStartSocket has significant overhead compared to parallelStartMulticore. In my case, a 40-core CPU cannot be utilized without multiple threads/processes, and the multicore option cannot be used with my Java-based code (because of the original issue in this thread).
The socket backend seems to be the alternative in case of such incompatibility/scalability problems, and the only option on Windows.
By the way, I hope multi-level parallelization (e.g. benchmark * resample) will be supported.


stale bot commented Dec 18, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 18, 2019
@stale stale bot closed this as completed Dec 25, 2019