Unified rc12 and rc14 testing: e3sm_diags takes too long on Chrysalis #485
Comments
@chengzhuzhang Is there any reason it would be taking so long to run E3SM Diags?
@forsyth2 When I tested on …
@chengzhuzhang E3SM Diags runs in the expected amount of time on both Perlmutter and Compy.
Could this just be because Chrysalis is being hit pretty hard right now and the disk is slow?
@xylar I suppose that's a possibility. It would be good if it's nothing on our end. Unfortunately though, it is delaying testing since I need to get E3SM Diags results to run the tests on in the first place...
It looks like the E3SM Diags tasks completed successfully when given between 4 and 5 hours to run.
Unfortunately, I can reproduce this problem with the complete zppy test. When setting a 2-hour time limit for e3sm_diags runs, jobs were cancelled due to the time limit.
Do you have individual timings for different steps in …
I did a set of e3sm_diags runs comparing timing between …
I don't see an obvious problem, but there is a … I will try smaller tests for timing.
Even using just a minimal example:
real 0m15.513s
@chengzhuzhang, this suggests a pretty big change in behavior in ESMPy. Do you have a way to time just the ESMF regridding? (Is that what you already did?) I don't think going back to ESMF 8.2.0 is a good option because that just isn't built with the dependencies that a lot of our packages rely on. I don't even think going back to 8.3.1 is an option for the same reason. It's odd that this is only happening on Chrysalis. I suppose it could be an incompatibility of something in ESMPy with Chrysalis specifically, but I can't think of what that would be.
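In case it helps isolate the ESMF/ESMPy step, a minimal standalone benchmark along these lines could be run in both environments. This is only a sketch, not code from e3sm_diags: the grid sizes, regrid method, and grid construction details are illustrative and may need adjusting for the installed ESMPy release (the module is imported as ESMF in releases before 8.4).

# Rough standalone sketch for timing ESMPy weight generation and application.
import time

import numpy as np
import esmpy  # "import ESMF" in releases before 8.4


def make_grid(nlon, nlat):
    # Simple global lat-lon grid with cell-center coordinates.
    grid = esmpy.Grid(
        np.array([nlon, nlat]),
        staggerloc=esmpy.StaggerLoc.CENTER,
        coord_sys=esmpy.CoordSys.SPH_DEG,
    )
    lon = grid.get_coords(0)
    lat = grid.get_coords(1)
    lon[...] = np.linspace(0.0, 360.0, nlon, endpoint=False)[:, np.newaxis]
    lat[...] = np.linspace(-89.0, 89.0, nlat)[np.newaxis, :]
    return grid


src_field = esmpy.Field(make_grid(360, 180), name="src")
dst_field = esmpy.Field(make_grid(180, 90), name="dst")
src_field.data[...] = 1.0

t0 = time.perf_counter()
regrid = esmpy.Regrid(
    src_field,
    dst_field,
    regrid_method=esmpy.RegridMethod.BILINEAR,
    unmapped_action=esmpy.UnmappedAction.IGNORE,
)
t1 = time.perf_counter()
regrid(src_field, dst_field)
t2 = time.perf_counter()

print(f"weight generation: {t1 - t0:.2f} s, application: {t2 - t1:.2f} s")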
@xylar I will try to find a way to time just the ESMF regridding. I was also curious why this only happens on Chrysalis, so I went ahead and tested on Perlmutter: with rc12, the complete e3sm_diags run failed prematurely with: …
But testing the same run in the e3sm_diags rc3 standalone environment was fine. I also need to confirm with @forsyth2 about the zppy test results on …
My guess would be that there's some sort of bad interaction going on between dask and esmpy to do with multiprocessing, but I don't really have the time (at the moment) or the expertise to help debug it. Is there anyone else on the team we can turn to?
@chengzhuzhang Thank you for benchmarking the runtime issue.
This is happening in …
I don't have any issues with E3SM Diags on …
Thanks for confirming. Could you update the permissions for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/? I want to rerun the zppy complete run on pm-cpu.
Yes, this is happening in …
Yes, that should be updated now. See points 3,9 on #484 (comment)
That is very strange. I'm not sure what's happening.
I still ran into a permission error. I think changing the owner to e3sm recursively for the directory should work.
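(For reference, a recursive group/permission change is one common way to do this; <data-dir> below stands for the dataset directory mentioned above, and whether e3sm should be the owner or the group is per this discussion.)

# Give the e3sm group recursive read access; execute (traverse) only on directories
chgrp -R e3sm <data-dir>
chmod -R g+rX <data-dir>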
Ok, I just ran …
Somehow I still can't access it. I think …
@chengzhuzhang Were you able to access this? I ran that command.
I am able to get everything working on Chrysalis given sufficient time, so I don't think this is a …
Yes, I have access to the data and updated my test results on the …
Closing this issue in favor of E3SM-Project/e3sm_diags#720
Re-opening this issue: E3SM-Project/e3sm_diags#720 is resolved with e3sm_diags rc4. However, in my zppy complete test with …
@chengzhuzhang, can you add a bunch of timers to e3sm_diags so we can find out where the slow performance is? I suspect the remapping, but it could be dask. We need to narrow it down.
From Unified RC14 testing of zppy on Chrysalis:
Times range from 47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)
@xylar Yes, I plan to do some timing profiling. Right now I'm running the e3sm_diags bash script taken from zppy, but in the e3sm_diags standalone conda environment. If the timings are similar to running in Unified rc14, then we can use the e3sm_diags env for profiling. Otherwise, we may need a Spack env on Chrysalis like the one you created for Perlmutter to do the profiling. I will keep you updated.
@chengzhuzhang, there is always a Spack environment on each machine to correspond with each conda environment in E3SM-Unified. Just take a look at the contents of the load script and you can find the 2 commands for activating the Spack environment.
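(For anyone else following along: standard Spack activation usually boils down to two commands of roughly the following form. The exact lines, paths, and environment name in the E3SM-Unified load script may differ, so treat these as placeholders.)

source <spack-root>/share/spack/setup-env.sh
spack env activate <e3sm-unified-spack-env>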
@xylar Thanks for the instructions. I parsed the timestamps from the e3sm_diags logs, comparing the time taken for zppy e3sm_diags runs between the e3sm_diags conda env and the Unified rc14 env. It is obvious that with Unified rc14 the processes run much slower. Using the e3sm_diags conda env, the run completed within 60 minutes. I will find the Spack commands for performance profiling.
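(The timestamp parsing was presumably something along these lines; this is only a sketch, and the timestamp format used here is an assumption that may not match the real e3sm_diags logs.)

import re
from datetime import datetime

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
FMT = "%Y-%m-%d %H:%M:%S"

def elapsed_seconds(log_path):
    # Return seconds between the first and last timestamp found in a log file.
    stamps = []
    with open(log_path) as f:
        for line in f:
            match = TS_RE.search(line)
            if match:
                stamps.append(datetime.strptime(match.group(0), FMT))
    return (max(stamps) - min(stamps)).total_seconds() if stamps else None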
I'm using …
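(Presumably the profiling was run with the standard library's cProfile, roughly as below; the script name is a placeholder.)

# Dump binary profiling stats to e3sm_diags.prof for later inspection with pstats
python -m cProfile -o e3sm_diags.prof run_e3sm_diags.py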
I'm uploading the cProfile results, with a test on …
I'm not very experienced in interpreting cProfile results. The full results are uploaded. Maybe @tomvothecoder, @xylar, and @forsyth2 can help take a look as well. I'm also making a larger run to see if we can draw more conclusive results from those.
@chengzhuzhang I think the cProfile results should be sorted by … I don't have experience with cProfile (yet), and it seems a bit hard to parse where the lower-level library calls are being made from, e.g.:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   22.055   11.027   22.055   11.027 {method 'poll' of 'select.poll' objects}
     1292    9.014    0.007    9.014    0.007 {method 'read' of '_io.BufferedReader' objects}
       13    6.772    0.521    6.772    0.521 {method 'acquire' of '_thread.lock' objects}
    86/83    2.592    0.030    2.603    0.031 {built-in method _imp.create_dynamic}

These calls are I/O and thread related. I can see how scaling up the data size might multiply the runtime for these processes. See also: Python Profiling: What does "method 'poll' of 'select.poll' objects"?
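For reading and sorting the dumped stats, the standard pstats module works; a minimal example, sorting by cumulative time (one common choice) and then tracing the callers of a hot spot:

import pstats

# Load the dump produced by `python -m cProfile -o e3sm_diags.prof ...`
stats = pstats.Stats("e3sm_diags.prof")
# Sort by cumulative time and show the 30 most expensive entries
stats.sort_stats("cumtime").print_stats(30)
# Show which functions call a given hot spot, e.g. the thread-lock acquire
stats.print_callers("acquire")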
@tomvothecoder Thank you for inspecting the cProfile results! However, after more investigation, I tend to believe the problem may not be from something internal to e3sm_diags... I did the same profiling for much larger tasks, i.e. the whole lat-lon set and all seasons. The timings are actually very similar, around 2880 seconds, between using …
@xylar I'm totally unfamiliar with the build process for Slurm, MPI... But I did some searching; I'm not sure if the solution brought up in this link makes sense. In the meantime, I have run out of ideas for more testing...
Sorry if there's some confusion, but I'm not using Spack to build MPI itself. I am using the same MPI module that E3SM uses for Gnu compilers on each machine. If there is MPI trouble, I suspect it is something built with conda MPI trying to run using HPC MPI libraries. Please track down what package or script is being run with MPI so we can see if it has been built with system MPI (via Slurm) or if it comes from conda-forge, in which case it can't be run with MPI on our systems. This is the reason that I have to build all our MPI tools and libraries with Spack.
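(One quick way to check where an MPI-dependent package comes from is to look at its conda channel and at which libmpi its compiled extension actually links against; mpi4py below is used purely as an illustration.)

# Which channel did the package come from (conda-forge vs. a system/Spack build)?
conda list mpi4py
# Which MPI library does its compiled extension actually link against?
ldd "$(python -c 'import mpi4py.MPI as m; print(m.__file__)')" | grep -i mpi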
Are you seeing the MPI errors only with cProfile or also with normal runs? I could imagine cProfile itself being hard to use properly in this mixed Spack and conda environment. I had in mind to just add calls within e3sm_diags to get the current time at 2 spots in the code and take the difference. That's the old-fashioned and tedious but also very robust method for timing code.
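A tiny helper for that kind of two-point timing could look like this (a sketch, not existing e3sm_diags code; the wrapped call in the usage comment is hypothetical):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print the wall-clock time spent inside the `with` block.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f} s")

# Hypothetical usage around a suspect section of e3sm_diags:
# with timed("lat_lon regridding"):
#     do_regridding()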
@mahf708 encountered the same issue on Chrysalis; copying his Slack message here: … Actually, I used the same solution to get past the problem; restarting from a fresh node works fine.
In my testing, the MPI error occurred both with and without cProfile.
Copying my Slack discussion with Jill. Adding the following code at the top of …:

import dask
dask.config.set(scheduler='synchronous')  # overwrite default with single-threaded scheduler

We can look at the timestamps in the logs to see the runtime. I'm hoping running Dask sequentially mimics the same slowness if it calls the same multiprocessing modules, just on one thread/process at a time. More info on this setting: …
@chengzhuzhang, in case it's helpful, I opened … up for group write permission.
I tried running via …
Follow-up on #485 (comment) and #485 (comment). With 4 workers: …
So, times range from 49 to 14,303 seconds (= 238.4 minutes = 3.97 hours). Going from "47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)", that's only a saving of ~6.4 minutes on the high end.
@forsyth2 thank you for the data points.
When running in serial with Dask, I did notice that some lat_lon sets take longer; the longest is the comparison with ERA5 OMEGA at 850 mb, which took about 6 minutes for each season, so 30 minutes for 5 seasons to finish. Within the calculation for each season for this variable, the contour map plotting step can take around 5 minutes. I also suspect there is an issue in scheduling. More investigation is needed.
While @tomvothecoder and I were doing more time profiling work and testing, it occurred to me that when running with Unified rc14, each process takes almost proportionally longer. Given that the Slurm/MPI problem also occurs intermittently, I suspect this is still an environment problem: some e3sm_diags processes are trying to call MPI in the Unified env, but the calls are not properly initiated. With …
Yay!!!!!
Even after setting walltime = "4:00:00" for the [[ atm_monthly_180x360_aave ]] subtask in tests/integration/generated/test_complete_run_chrysalis.cfg, the following jobs still run out of time: …

This is a change from walltime = "2:00:00" in https://github.com/E3SM-Project/zppy/blob/main/tests/integration/generated/test_complete_run_chrysalis.cfg#L93. It also took significantly longer to get compute nodes with this extended time limit.
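(For context, the relevant portion of the zppy cfg looks roughly like this; a sketch of the change, not an exact excerpt from the test file.)

[e3sm_diags]
  [[ atm_monthly_180x360_aave ]]
  # previously walltime = "2:00:00"; even "4:00:00" was not always enough on Chrysalis
  walltime = "4:00:00"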