Ubuntu 20.04 mclapply "child process has died" #366
Are there any messages in your |
Sorry no, I have a long list of apparmor warnings as below, but failed OpenCPU requests don't seem to log any new entry into
|
Our models do spawn additional child processes with |
Only logs I see are in
|
Yeah maybe try without mclapply. Also in your |
No change when turning off
|
Hmm so it only happens if you open lots of files? And there is nothing in |
Yes that's correct )-: Only entries in
and my All I can say is that we have updated package versions between our 16.04 and 20.04 configs (data.table 1.12.9 vs. 1.13.1, raster 3.3.7 vs. 3.3.17). These are the main deps. I have whitelisted all required local system paths in Been randomly accessing between 500 and 1,100 files, and so far I consistently get failures above ~1,000. This is on an 8-core / 32 GB VM. |
In addition to |
Not that I am aware of, but ubuntu switched to systemd in 18.04 so perhaps there are other changes somewhere... |
Got a few extra lines in
|
Hmmm interesting. Try allowing that read in the apparmor profile?
|
no go. Trying to downgrade R package versions to our prior config, just in case... Also there's a fair amount of randomness in the error (yesterday could pull ~1000 files, today seems to fail at ~800). Puzzling. I'm excluding hardware issues since everything runs as usual outside of OpenCPU. |
OK finally fixed that by removing all calls to |
That's very strange. You're not using rJava, are you? |
Not using rJava. Here is our local packages' deps:
I have Here is a snippet of that call (calling
|
Can you try raising |
Jeroen, yes indeed, had to raise Hum correction, it worked once, then failed consistently. I rebooted the VM, worked again one time and failed thereafter. So I'm reverting to |
Are you using |
Don't think I can use
mc.cores = if(mc.cores > 1 && length(y) > 3) mc.cores else 1L
tmp = mclapply(y,
  mc.cores=mc.cores, mc.silent=FALSE, mc.allow.recursive=FALSE, mc.cleanup=TRUE,
  function(x) {
    #rasterOptions(timer=FALSE, chunksize=2e+08, datatype=switch(code, rfe="INT1U", "FLT4S"))
    r = stack(as.list(catalog[year(date)==x, file]), quick=TRUE, native=native)
    names(r) = catalog[year(date)==x, layer]
    res = extract(r, data.grp)
    return(res)
  })
I'm getting really inconsistent behavior now, even with Also how can I check ocpu's environment variables with curl, just to be sure? |
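One way to narrow down where the failure happens is to check whether individual `mclapply()` workers come back empty. The sketch below is not from the thread: `check_mclapply` is a hypothetical helper name, and it only flags results that are `NULL` or of class `"try-error"`, which is how `mclapply()` normally reports a worker that died or raised an error. It would not catch the case where the whole OpenCPU request process is killed.

```r
library(parallel)

# Hypothetical helper (not part of parallel or OpenCPU): run mclapply() and
# report which elements came back as NULL or as "try-error" objects.
check_mclapply <- function(X, FUN, ..., mc.cores = 1L) {
  res <- mclapply(X, FUN, ..., mc.cores = mc.cores)
  failed <- vapply(
    res,
    function(r) is.null(r) || inherits(r, "try-error"),
    logical(1)
  )
  list(results = res, failed = which(failed))
}

# Forking is only available on Unix-alikes, so fall back to one core elsewhere.
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L
out <- check_mclapply(1:4, function(i) i^2, mc.cores = n_cores)
out$failed   # indices of elements that did not deliver a usable result
```

Note that a function that legitimately returns `NULL` would also be flagged, and that with the default prescheduling a single error can affect every element handled by the same worker.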
|
OK thx, so yes I can confirm, with the code above and |
There might be recent changes in |
OK, if you have time (not urgent) it would be helpful if you can create a small example for me to reproduce this. Maybe it has nothing to do with raster, and just a problem with files not getting closed in concurrent processes? |
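To test the hypothesis that files are not being closed, one option is to count open file descriptors from inside R before and after a batch of reads. This is a rough sketch, assuming a Linux host with `/proc` mounted; `open_fd_count` is a made-up helper name, and whether a descriptor limit is actually involved in this issue is unconfirmed.

```r
# Count file descriptors currently open in this R process.
# Linux only: relies on /proc/self/fd; returns NA elsewhere.
open_fd_count <- function() {
  fd_dir <- "/proc/self/fd"
  if (!dir.exists(fd_dir)) return(NA_integer_)
  length(list.files(fd_dir))
}

before <- open_fd_count()
con <- file(tempfile(), open = "w")   # deliberately hold one file open
during <- open_fd_count()
close(con)
after <- open_fd_count()

c(before = before, during = during, after = after)

# Soft limit on open files for processes started from this session
# (runs the shell builtin via sh; Unix-alikes only).
if (.Platform$OS.type == "unix") {
  print(system("ulimit -n", intern = TRUE))
}
```

Calling `open_fd_count()` at the start and end of the function passed to `mclapply()`, and comparing the numbers between an interactive session and an OpenCPU request, would show whether descriptors accumulate in one setting but not the other.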
Yeh thought about that as well, slow I/O with attached SSD drives since we moved from Azure to AWS. Will work on a reprex all right. Thx! But that wouldn't explain why it fails through OpenCPU only. |
Jeroen, here is a reprex: https://gist.github.com/mbacou/41c2fa42fa36cef2a19d9291432b8560 That reproduced the error on my Ubuntu 20.04 VM. Use Let me know. Thx! |
I run into the same error message by doing http://104.131.40.122/ocpu/library/V8/R/ . Then fixed by turning off SELinux, https://linuxize.com/post/how-to-disable-selinux-on-centos-8/
|
FYI on our Ubuntu 20.04 system, SELinux is already disabled. Also tried to disable AppArmor with |
Is there any update on this? We are in the middle of migrating from Ubuntu 16.04 / OpenCPU 2.0.8.1 to Ubuntu 20.04 / OpenCPU 2.2.2. During our testing with the new version, we frequently ran into the same problem with the same code. I tried all the suggestions mentioned in this thread, including setting MC_CORES=1, but the problem keeps happening. And setting MC_CORES=1 defeats the purpose of using mclapply to begin with. |
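When experimenting with `MC_CORES`, it helps to confirm what the R process actually sees, since the variable is only read when the parallel package is loaded. A minimal sketch using base R and parallel only:

```r
library(parallel)

# MC_CORES is consulted when parallel is loaded; if it holds a valid
# integer and no "mc.cores" option exists yet, the option is set from it.
Sys.getenv("MC_CORES", unset = NA)   # NA means the variable is not set
getOption("mc.cores")                # NULL means no option is set

# mclapply() defaults to getOption("mc.cores", 2L), so this is the number
# of cores it would use when mc.cores is not passed explicitly.
effective_cores <- getOption("mc.cores", 2L)
effective_cores

detectCores()                        # cores visible to this process
```

Returning these values from a small function called through OpenCPU, rather than from an interactive session, would show whether the server process inherits the setting at all.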
I'm sorry I have never been able to discover the problem. It sounds like an issue with the parallel package in R. Could you try adding this line to your assign("setup_strategy", "sequential", parallel:::defaultClusterOptions) And see if that makes any difference? |
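To confirm that the suggested setting took effect, the value can be read back from the same internal environment. This is a sketch only: `parallel:::defaultClusterOptions` is an unexported object, so its contents may differ between R versions, and as far as I know `setup_strategy` governs how `makeCluster()` starts PSOCK workers rather than how `mclapply()` forks.

```r
library(parallel)

# Internal, unexported environment holding cluster defaults.
opts <- parallel:::defaultClusterOptions

# Previous value; NULL if this R version has no such option.
old <- get0("setup_strategy", envir = opts, inherits = FALSE)

# Same assignment as suggested above, written with an explicit envir argument.
assign("setup_strategy", "sequential", envir = opts)

new <- get("setup_strategy", envir = opts)
new
```

If the code in question only uses `mclapply()`, this setting may simply not be on the code path that fails.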
Unfortunately, it didn't make any difference. I checked these log files: I don't know if this means anything. If I retry the failed call many times (as many as 15 times), the call eventually succeeds. Questions:
Let me know anything I can help to find out more information to get to the bottom of this. |
Chiming in here as we've run into this issue too. With 1 core everything is fine, but when we parallelize across multiple we see Ideally we'd prefer not to limit |
It really shouldn't conflict, but I have not been able to figure out yet where this bug comes from. I suspect it is related to the changes that were made in R 4.0 wrt how R manages the worker subprocesses. There were several bugs in that which would appear under only certain circumstances, some of them were only fixed very recently for R 4.2. I'll try to find time to get this sorted out soon. |
Thanks @jeroen, I can give it a try on devel and report back on if that changes things at all |
Tried using |
|
I still haven't been able to figure out the issue, but I have disabled a feature that tries to kill orphaned processes. Perhaps it is related. I'll push out opencpu 2.2.4.1 with this change, can you try if this changes anything at all? |
@jeroen thank you for that release. I had been struggling with this since version 2.1.5 and could only run it on our Linux boxes; Azure VMs and AKS containers always crashed on one of our NGS pipeline scripts. So I thought it was either a problem with the containerization engine (as I could see a difference in ocpu/info) or that R version 4 was to blame, but this release solved my issues. I had been watching this thread for some time, trying to get to the bottom of exactly which operation triggers the "process has died" problem, so that I could submit a similar issue with a full report. I have one more request: please build and push to Docker Hub an opencpu/rstudio image based on version 2.2.4.1. We use the rstudio images on our development machines for ease of debugging and development of our packages. |
@pilare ok i'll tag a docker release opencpu/rstudio:2.2.4.1. I'll wait for others to see if this really fixes the problem. |
So far in my testing I have not seen any I don't know if this is useful information, but I've noticed that one of the calls that sometimes produced Edited to note that I've tested this with our codebase now using future/future.apply, not mclapply. Before the new opencpu version we were still getting |
I've pushed a new release opencpu 2.2.5 (which includes this fix) to cran and all docker images. Also overhauled the docker build infrastructure so that it will be easier again to publish new releases if needed. Hopefully the error will disappear for everyone. |
@jeroen we've migrated our OpenCPU config from Ubuntu 16.04 to 20.04 (opencpu-server 2.2.0). Been stable in 16.04 over the past 3+ years, but now we have requests that fail intermittently with the error below, and no additional log message that I can find.
I can make the request break by increasing the number of file reads on the server (we run models that require from 1 up to 6,000 file reads per request), so I figured I must be hitting Apache or OpenCPU limits (since everything works as expected in an interactive R session). Copied my ./ocpu/info below as well.
Any pointer as to how I should go about debugging that one?
Thanks!