Possibly running out of memory when using a lot of preprocessors. #1915
A small update: I decided to be creative and shortened my data preprocessor to:
and used the data over the 20-50 year chunks I was actually interested in, instead of 'overflowing' the data to 100+ years just to subtract the anomaly. To subtract the anomaly I just added an extra group. I was thinking: if one subtracts anomalies with the reference period far away from the area of interest, maybe one could re-evaluate the […]
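For context, here is a minimal sketch of what such a shortened preprocessor could look like in an ESMValTool recipe, assuming the standard `anomalies` preprocessor step with an explicit reference period; the preprocessor name and the years are only illustrative, not the actual recipe:

```yaml
preprocessors:
  # Hypothetical preprocessor: work only on the 20-50 year chunk of interest
  # and subtract anomalies relative to an explicit reference period,
  # instead of loading 100+ years of data.
  prep_anomalies_short:
    anomalies:
      period: month
      reference:
        start_year: 1995
        start_month: 1
        start_day: 1
        end_year: 2014
        end_month: 12
        end_day: 31
      standardize: false
```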
@malininae one quick question before I find more time to look deeper into this: why don't you […]
Unfortunately, that wouldn't help, because the target grid constructed by the […]
Plus, since I am doing extremes, the edges are super important for me.
Following the discussion during the November 2023 monthly meeting, here's the monster recipe I was talking about. I'm analyzing high-resolution HighResMIP sub-daily data. I have access to a monster computer with up to 445 GB of memory per process, but only 3 hours per process. What I ended up doing is running each of the variable groups in a separate recipe, since it takes 2:26 hours and an astonishing 248 GB to run a single variable group. You can see an example of a log file for one of the groups here. I tried combining the outputs using the […]. To answer some questions upfront: the wind derivation is done lazily, I double-checked when I created the function. The only option I see for easing the wind derivation is to create a separate variable […]
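As an illustration of the kind of variable group being discussed, here is a minimal hypothetical sketch of a derived wind speed variable in an ESMValTool recipe; the group name, MIP table, preprocessor name, and years are assumptions, not the actual recipe:

```yaml
variables:
  # Hypothetical derived variable group: wind speed derived from its
  # components, so that the derivation itself stays lazy.
  wind_speed:
    short_name: sfcWind
    mip: 3hr                      # assumed sub-daily table
    derive: true
    force_derivation: true
    preprocessor: prep_extremes   # hypothetical preprocessor name
    start_year: 2020
    end_year: 2049
```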
Thanks for sharing, I'll have a look! Could you also upload the shapefile here so I can run the recipe?
Oops, sorry! I couldn't attach it here, so I put it into my Google Drive and set the permissions so that everyone with the link can view it. Let me know if you can't access it.
I would recommend using the Dask Distributed scheduler to run this recipe and set […].

A bit of background to that advice: I ran the variable group with the following SLURM job script:

```bash
#!/bin/bash -l
#SBATCH --partition=interactive
#SBATCH --time=08:00:00
#SBATCH --mem=32G

set -eo pipefail
unset PYTHONPATH

. /work/bd0854/b381141/mambaforge/etc/profile.d/conda.sh
conda activate esmvaltool

esmvaltool run recipe_extremes_wind_3h.yml
```

and the following cluster:

```yaml
type: dask_jobqueue.SLURMCluster
queue: compute
cores: 128
memory: 256GiB
processes: 32
interface: ib0
local_directory: /scratch/b/b381141/dask-tmp
n_workers: 32
walltime: '8:00:00'
```

Considering that the input data is about 700 GB, that means we are processing about 300 MB/second. I profiled the run with py-spy, and it looks like half of the time is spent on data crunching and the other half on loading the iris cubes from file and some other things. Thus, using more workers in the Dask cluster may make it faster, provided that the disk you are loading the data from can go faster. There is probably some room to improve the non-data-crunching parts too, but that will require changes in iris. The profiling result is available here; you can load the file into speedscope.app if you're interested. The 'save' function is where the computation on the Dask cluster happens.
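For anyone reproducing this setup: a minimal sketch of where such a cluster definition would live, assuming a recent ESMValCore that reads its Dask Distributed configuration from `~/.esmvaltool/dask.yml`; the values simply mirror the ones above and may need adjusting for your machine:

```yaml
# ~/.esmvaltool/dask.yml (assumed location; check your ESMValCore version)
cluster:
  type: dask_jobqueue.SLURMCluster
  queue: compute
  cores: 128
  memory: 256GiB
  processes: 32
  interface: ib0
  local_directory: /scratch/b/b381141/dask-tmp
  n_workers: 32
  walltime: '8:00:00'
```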
Hi all,
In the July 2022 meeting, I raised the issue that I can't process data with a lot of preprocessors (see ESMValGroup/Community#33 (comment)). When I tried it back then, it all seemed to work just fine; however, I have finally run into a problem.
I was running this recipe (it looks ugly, since it's still being developed, so please no judgement), and it works OK for the groups 'all', 'obs_abs' and 'obs_ano', but it can't process 'nat' and 'ssp245'. I thought, OK, I'll split the recipe, process 'nat' and 'ssp245' separately in their own recipes, and recombine them for the diagnostic, but no, I don't seem to be able to process them either. The computers I am working on allow only 6-hour jobs, but there are not many limitations on the number of processors that can be checked out. For this job, I checked out 27 CPUs and allocated 180 GB of memory, and I don't allow more than one process per CPU. (I also tried 20, and it didn't work.) My hunch is that the finer-resolution models are not processed and just keep hanging.
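One knob that directly affects this kind of memory pressure is how many tasks ESMValTool runs in parallel. A minimal sketch of the relevant user configuration, assuming the standard `max_parallel_tasks` option; the value shown is only an example of trading speed for memory:

```yaml
# config-user.yml (excerpt): fewer parallel tasks means fewer
# variable groups being processed, and held in memory, at once.
max_parallel_tasks: 4   # illustrative value; lower it if jobs run out of memory
```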
A small note: I think the issue here is my preprocessor `rolling_window_statistics`, because, I think, the `iris` function it uses realizes the data. I processed everything quite well without `rolling_window_statistics` on 20 CPUs. I'm not sure what the best way of handling that is here. For now, I will process the data as I did before, but someone might be interested in it.
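For reference, a minimal sketch of how such a step might appear in a recipe, assuming the standard `rolling_window_statistics` preprocessor options; the preprocessor name, window length and operator are only illustrative:

```yaml
preprocessors:
  # Hypothetical preprocessor chain: 3-step rolling maximum along time.
  prep_rolling_max:
    rolling_window_statistics:
      coordinate: time
      operator: max
      window_length: 3
```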
main_log_debug_tx3x.txt
Here are the computer's specifications, if that matters: