Job status Z for some parallel jobs #191
Note: I run the master process on the same node as the workers, as my cluster kills anything that runs for long on the login node.
Hi,
I will give it a try.
OK, I have been able to reproduce it with a self-contained example, still using drake (I will try a bit further with only clustermq):
#!/sw/arch/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/bin/Rscript
#SBATCH --job-name=test # job name
#SBATCH --partition=normal # partition
#SBATCH --output=.test.o # you can add .%a for array index
#SBATCH --error=.test.e # log file
#SBATCH --mem=32136
#SBATCH -n 16
#SBATCH --time=00:19:00
options(clustermq.scheduler = "multicore")
require(clustermq)
f<-list.files('data', full.names=T)
require(stars)
fx = function(x) st_as_stars(st_apply(read_stars(x, proxy=T)[2,],2,mean))
require(drake)
p<-drake_plan(
files=list.files('data', full.names=T),
ttt= target(fx(files), dynamic=map(files)))
make(p,
     parallelism='clustermq', jobs=16, caching='worker')

Note: the data directory contains about 80 large netCDF files.
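For comparison, a minimal sketch of the same dynamic-map pattern with a trivial function in place of fx(); this is a hypothetical variant (not from the thread) that can help separate the stars/netCDF workload from the scheduling behaviour:

```r
# Hypothetical minimal variant: same drake + clustermq setup, cheap workload
library(drake)
options(clustermq.scheduler = "multicore")

p2 <- drake_plan(
  idx = seq_len(80),                           # stand-in for the file list
  out = target(sqrt(idx), dynamic = map(idx))  # one sub-target per element
)

make(p2, parallelism = "clustermq", jobs = 16, caching = "worker")
```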
Ok I now also have an example without drake:
The R script:

#!/sw/arch/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/bin/Rscript
#SBATCH --job-name=test # job name
#SBATCH --partition=normal # partition
#SBATCH --output=.test.o # you can add .%a for array index
#SBATCH --error=.test.e # log file
#SBATCH --mem=32136
#SBATCH -n 16
#SBATCH --time=00:19:00
options(clustermq.scheduler = "multicore")
require(clustermq)
f<-list.files('data', full.names=T)
require(stars)
fx = function(x) st_as_stars(st_apply(read_stars(x, proxy=T)[2,],2,mean))
a<-Q(fx, x=f, n_jobs=16)

The job also does not finish properly (all processes are either S (sleeping) or in Z state). I guess this is because some jobs do not finish:
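As an aside, for readers unfamiliar with the API: Q() iterates over the vector passed as a named argument, matching it to the function's formal argument by name, and returns the results as a list in the same order. A hedged, self-contained illustration (names are placeholders, not from the report):

```r
library(clustermq)
options(clustermq.scheduler = "multicore")

fx_demo <- function(x) nchar(x)           # stands in for the stars-based fx()
f_demo  <- c("a.nc", "bb.nc", "ccc.nc")   # stands in for list.files('data')

# Each element of f_demo becomes one call fx_demo(x = element)
res <- Q(fx_demo, x = f_demo, n_jobs = 2)
```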
Ok, great! Now we know this is a potential clustermq issue and not related to drake. Or, it could be that your function call crashes R for some reason. So let's test that. I would now check the following:
(Unfortunately, the logging on multicore is a bit underdeveloped. I still need to fix that.)
I have tested with:

a <- Q(fx, x=f, n_jobs=16, log_worker=TRUE, template=list(log_file='clustermq.log'))

However, no log file is produced. Is there a specific way to enable logging for multicore?
You'd need to run this as jobs (via slurm); multicore workers don't support proper logging (for now).
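For reference, a sketch of what per-worker logging can look like when submitting workers through slurm. This assumes a slurm.tmpl whose #SBATCH --output line uses the {{ log_file }} key (as in the documented template); the file names here are placeholders:

```r
# Sketch: per-worker log files when workers are submitted via slurm
options(clustermq.scheduler = "slurm",
        clustermq.template  = "slurm.tmpl")       # template must contain {{ log_file }}

a <- Q(fx, x = f, n_jobs = 16,
       log_worker = TRUE,                          # ask workers to write logs
       template = list(log_file = "cmq_%a.log"))   # %a = slurm array index
```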
I have been trying a bit further. I ran it in two different ways.

16 workers on a node

First I tried to run 16 workers on a node, using the following script and template (renamed to ensure I can attach them). This job does not finish properly, and the worker node does not shut down before I cancel it (last line in the log). The worker node produces the following log; the master node has the following log:
I'm not sure if this is the same problem or a different one.

Single worker on a node

This works properly but is rather inefficient. Output of the worker node; successful output of the master node:

Loading required package: clustermq
Loading required package: stars
Loading required package: abind
Loading required package: sf
Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 5.0.0
Submitting 16 worker jobs (ID: 7629) ...
Warning in private$fill_options(...) :
Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication
Running 84 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [328.8s 0.1% CPU]; Worker: [avg 99.3% CPU, max 3248.9 Mb]
c(16142.1510370588, 16173.7716961821, 16205.5004053062, 16237.1328383444, 16269.9509591846, 16303.2011487977, 16337.6605700659, 16373.347024536, 16407.6724520324, 16442.6700693559, 16475.7976513871, 16506.0830752003, 16536.2453834126, 16566.3473387541, 16595.3064131182, 16624.8984663854, 16655.0043924283, 16684.3316444557, 16713.5975583934, 16742.8486463249, 16772.5408509392, 16802.3021203059, 16833.0455248498, 16864.0435662742, 16895.6030165495, 16926.9370166867, 16958.0463232466, 16989.6115128455,
17021.1157997998, 17053.6974611012, 17085.5259974742, 17117.5674545038, 17148.8087734615, 17179.8319295221, 17211.0570863137, 17242.0665218813, 17273.3808514385, 17303.5514877842, 17334.3936776904, 17364.7727102406, 17395.3072871116, 17424.8993085654, 17453.9778966651, 17482.6439314819, 17510.340652888, 17537.9195804667, 17564.0312136117, 17590.2533603023, 17614.9966315126, 17639.859933425, 17663.3394946399, 17686.1774068251, 17708.4757907717, 17729.4796733955, 17750.8122056193, 17769.9326833986,
17788.9847515813, 17805.762140778, 17822.4755245362, 17837.6002887535, 17852.4804919686, 17866.7851890507, 17879.6646449455, 17893.8523693818, 17905.2133034273, 17919.0947122158, 17928.7579626699, 17943.2143268231, 17955.9000604121, 17960.520934751, 17970.2885856203, 17969.3942641831, 17971.6104533665, 17974.8370766788, 17975.676868079, 17981.3611348064, 17980.806931675, 17982.1201302886, 17982.1648160881, 17981.0783093266, 17978.9898745255)

Session info

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /sara/eb/AVX2/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/lib64/R/lib/libR.so
LAPACK: /sara/eb/AVX2/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/lib64/R/modules/lapack.so
locale:
[1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
[4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
[7] LC_PAPER=en_US LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] clustermq_0.8.9
loaded via a namespace (and not attached):
[1] compiler_3.5.1
Sorry for the late reply. I'm a bit confused about what might be going on here. Can you split each worker log into a separate file?
I have been working on this a bit further, and the issue does not seem to occur if I allocate a lot more memory (although I do not have a solid reproduction yet). It might thus be that the processes get killed by some controlling software when memory gets tight.
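If memory pressure turns out to be the cause, one hedged thing to try (a sketch, assuming a scheduler template with a {{ memory }} key; on plain multicore the value is not enforced) is fewer workers with an explicit per-worker memory request:

```r
# Sketch: fewer, fatter workers; `memory` fills the template's {{ memory }} key
a <- Q(fx, x = f, n_jobs = 8, memory = 8192)   # 8 workers, ~8 GB requested each
```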
Did you find out any more about the cause of this?
Ok, looking at your submission script again:

#S BATCH --array=1-{{ n_jobs }} # job array
#S BATCH --time=00:29:00 # default lisa is 5 minutes
R -e 'sessionInfo()'
echo 'asdf'
ulimit -v $(( 1024 * {{ memory | 11096 }} ))
echo $SLURM_NTASKS
for i in {1..22}
do
echo $i
(CMQ_AUTH={{ auth }} R --quiet --no-save --no-restore -e 'clustermq:::worker("{{ master }}")' ) &
done

This will not work. I will unfortunately only be able to help you further if this is fixed and we can get the slurm logs, and it would be great if you could provide an example I can reproduce without relying on external data.
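A self-contained reproducer could look roughly like the sketch below (hypothetical, not from this thread): each call allocates a large matrix so the memory footprint loosely mimics the netCDF workload without needing the data files.

```r
# Hypothetical reproducer without external data: memory-heavy but data-free
library(clustermq)
options(clustermq.scheduler = "multicore")

heavy <- function(i) {
  m <- matrix(rnorm(5e6), ncol = 1000)   # ~40 MB of doubles per call
  mean(m)
}

res <- Q(heavy, i = seq_len(80), n_jobs = 16)
```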
According to #179 this is fixed in |
Thank you for resolving this; I will let you know if it occurs again.
I guess that this is more of a question than a real bug; I'm not sure how to proceed to debug or solve it. I'm using drake with clustermq on a cluster. I reserve a node with 16 cores and plenty of memory. Some of my jobs require quite a bit of data to run, and I feel there might be a slight bottleneck there. When I retrieve the htop output from the node I get this result:
I notice that 8 of the 16 parallel processes get status Z (a defunct ("zombie") process: terminated but not reaped by its parent). This number slowly grows over time, which means I am not using the 16 available cores very efficiently.
My submission script looks as follows (slightly simplified):
What are the best ways to debug this, or are there settings that can prevent this from happening?