
Unified rc12 and rc14 testing: e3sm_diags takes too long on Chrysalis #485

Closed
forsyth2 opened this issue Aug 18, 2023 · 46 comments · Fixed by #497

forsyth2 commented Aug 18, 2023

Even after setting walltime = "4:00:00" for the [[ atm_monthly_180x360_aave ]] subtask in tests/integration/generated/test_complete_run_chrysalis.cfg, the following jobs still run out of time:

e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853

This is a change from walltime = "2:00:00" in https://github.com/E3SM-Project/zppy/blob/main/tests/integration/generated/test_complete_run_chrysalis.cfg#L93

It also took significantly longer to get compute nodes with this extended time limit.
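
For reference, the override looks roughly like this (a sketch of the relevant cfg lines; I'm assuming the [[ atm_monthly_180x360_aave ]] subtask sits under the [e3sm_diags] section, as in the linked generated cfg):

[e3sm_diags]
  [[ atm_monthly_180x360_aave ]]
  # previously walltime = "2:00:00"
  walltime = "4:00:00"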

@forsyth2

@chengzhuzhang Is there any reason it would be taking so long to run E3SM Diags?

chengzhuzhang commented Aug 18, 2023

@forsyth2 When I tested on Perlmutter and Compy, I didn't see long run times; Compy is about 10% slower than Perlmutter. I didn't test on Chrysalis. How did your tests go on the other two machines? Do they also take significantly longer?

@forsyth2

@chengzhuzhang E3SM Diags runs in the expected amount of time on both Perlmutter and Compy.

xylar commented Aug 18, 2023

Could this just be because Chrysalis is being hit pretty hard right now and the disk is slow?

@forsyth2

@xylar I suppose that's a possibility. It would be good if it's nothing on our end. Unfortunately though, it is delaying testing since I need to get E3SM Diags results to run the tests on in the first place...

@forsyth2

It looks like the E3SM Diags tasks completed successfully when given between 4 and 5 hours to run.

@chengzhuzhang

Unfortunately, I can reproduce this problem with the complete zppy test. With a 2-hour time limit for the e3sm_diags runs, the jobs were cancelled due to the time limit.

xylar commented Aug 20, 2023

Do you have individual timings for different steps in e3sm_diags? If so, can you compare a run with rc9 or rc10 to rc12 to see what changed?

chengzhuzhang commented Aug 20, 2023

I did a set of e3sm_diags runs comparing timing between rc9 and rc12.
rc9:
real 7m33.804s
user 45m46.510s
sys 1m23.407s

rc12:
real 19m48.713s
user 96m8.258s
sys 10m17.319s

I don't see an obvious problem, but there is a "[WARNING] yaksa: 10 leaked handle pool objects" message shown in the rc12 run, which is also reported in xCDAT/xcdat#513 and is thought to come from esmf>=8.4.

I will try smaller tests for timing.

@chengzhuzhang

Even using just a minimal example:

time e3sm_diags lat_lon --no_viewer --reference_data_path '//lcrc/group/e3sm/public_html/diagnostics/observations/Atm/climatology' --test_data_path '/lcrc/group/e3sm/public_html/e3sm_diags_test_data/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/climatology/rgr' --results_dir '/lcrc/group/e3sm/public_html/diagnostic_output/ac.zhang40/tests/eu190rc9' --case_id 'GPCP_v3.2' --run_type 'model_vs_obs' --sets 'lat_lon' --variables 'PRECT' --seasons 'ANN' --ref_name 'GPCP_v3.2'

rc9:
real 0m2.348s
user 0m1.775s
sys 0m0.485s

rc12:
[WARNING] yaksa: 10 leaked handle pool objects

real 0m15.513s
user 0m5.678s
sys 0m14.101s

xylar commented Aug 21, 2023

@chengzhuzhang, this suggests a pretty big change in behavior in ESMPy. Do you have a way to time just the ESMF regridding? (Is that what you already did?)

I don't think going back to ESMF 8.2.0 is a good option because that just isn't built with the dependencies that a lot of our packages rely on. I don't even think going back to 8.3.1 is an option for the same reason.

It's odd that this is only happening on Chrysalis. I suppose it could be an incompatibility of something in ESMPy with Chrysalis specifically but I can't think of what that would be.
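
As a quick way to isolate the regridding cost, one could bracket the regrid call with wall-clock timers; a minimal sketch (timed_section is a helper defined here, and the regrid call shown is a placeholder, not the actual e3sm_diags API):

import time
from contextlib import contextmanager

@contextmanager
def timed_section(label):
    # Report wall-clock time spent inside the "with" block.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f} s")

# Hypothetical usage around the regridding step:
# with timed_section("esmf regrid"):
#     regridded = regrid_to_common_grid(test_var, ref_var)  # placeholder call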

chengzhuzhang commented Aug 21, 2023

@xylar I will try to find a way to time just the ESMF regridding.

I was also curious why this only happens on Chrysalis, so I went ahead and tested on Perlmutter. With rc12, the complete e3sm_diags run failed prematurely with:

2023-08-20 15:37:48,537 [ERROR]: run.py(run_diags:36) >> Error traceback:
Traceback (most recent call last):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/run.py", line 34, in run_diags
    main(final_params)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 419, in main
    parameters_results = _run_with_dask(parameters)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 365, in _run_with_dask
    results = bag.map(run_diag).compute(num_workers=num_workers)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 310, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 595, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/multiprocessing.py", line 233, in get
    result = get_async(
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/local.py", line 500, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2023-08-20 15:37:48,592 [INFO]: logger.py(move_log_to_prov_dir:106) >> Log file saved in /global/cfs/cdirs/e3sm/www/chengzhu//v2_9_0_all_sets/eu190rc12/prov/e3sm_diags_run.log

real	6m39.466s
user	72m18.622s
sys	13m8.988s

But testing the same run in the e3sm_diags rc3 standalone environment was fine.

I also need to confirm with @forsyth2 for the zppy test results on pm-cpu.

xylar commented Aug 21, 2023

My guess would be that there's some sort of bad interaction going on between dask and esmpy to do with multiprocessing, but I don't really have the time (at the moment) or the expertise to help debug it. Is there anyone else on the team we can turn to?

forsyth2 commented Aug 21, 2023

@chengzhuzhang Thank you for benchmarking the runtime issue.

concurrent.futures.process.BrokenProcessPool

This is happening in rc12 on Perlmutter? I thought this was a rc10 issue (#475 (comment))

I also need to confirm with @forsyth2 for the zppy test results on pm-cpu.

I don't have any issues with E3SM Diags on pm-cpu. The only issue I have on Perlmutter at the moment is #484, which concerns the global time series.

@chengzhuzhang

I also need to confirm with @forsyth2 for the zppy test results on pm-cpu.

I don't have any issues with E3SM Diags on pm-cpu. The only issue I have on Perlmutter at the moment is #484, which concerns the global time series.

Thanks for confirming. Could you update permission for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/, I wanted to rerun the zppy complete run on pm-cpu.

@chengzhuzhang

@chengzhuzhang Thank you for benchmarking the runtime issue.

concurrent.futures.process.BrokenProcessPool

This is happening in rc12 on Perlmutter? I thought this was a rc10 issue (#475 (comment))

Yes this is happening in rc12 on Perlmutter with my e3sm_diags testing.

@forsyth2

Could you update permission for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/, I wanted to rerun the zppy complete run on pm-cpu.

Yes, that should be updated now. See points 3,9 on #484 (comment)

Yes this is happening in rc12 on Perlmutter with my e3sm_diags testing.

That is very strange. I'm not sure what's happening.

@chengzhuzhang

Could you update permission for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/, I wanted to rerun the zppy complete run on pm-cpu.

Yes, that should be updated now. See points 3,9 on #484 (comment)

I still ran into a permission error. I think changing the group to e3sm recursively for the directory should work.

@forsyth2

Ok I just ran chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/.

@chengzhuzhang

Ok I just ran chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/.

Somehow I still can't access it. I think chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth/E3SMv2 should do it?

@forsyth2

@chengzhuzhang Were you able to access this? I ran that command.

forsyth2 commented Aug 22, 2023

I am able to get everything working on Chrysalis given sufficient time, so I don't think this is a zppy issue. And E3SM Diags does in fact still appear to work. It just takes longer.

@chengzhuzhang

@chengzhuzhang Were you able to access this? I ran that command.

Yes, I have access to the data and have updated my test results on the pm-cpu issue.

@forsyth2

Closing this issue, in favor of E3SM-Project/e3sm_diags#720

chengzhuzhang reopened this Aug 29, 2023
@chengzhuzhang

Re-opening this issue: E3SM-Project/e3sm_diags#720 is resolved with e3sm_diags rc4. However, in my complete zppy test with Unified rc14, the e3sm_diags runs still exceed the 2-hour limit.

xylar commented Aug 30, 2023

@chengzhuzhang, can you add a bunch of timers to e3sm_diags so we can find out where the slow performance is? I suspect the remapping but it could be dask. We need to narrow it down.

@forsyth2

From Unified RC14 testing of zppy on Chrysalis:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ tail -n 3 e3sm_diags_atm_monthly_180x360_aave_*.o*
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o379989 <==
==============================================
Elapsed time: 47 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.o379991 <==
==============================================
Elapsed time: 48 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.o379990 <==
==============================================
Elapsed time: 47 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o379986 <==
==============================================
Elapsed time: 14658 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o379988 <==
==============================================
Elapsed time: 14688 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o379987 <==
==============================================
Elapsed time: 14662 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o379993 <==
==============================================
Elapsed time: 4130 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o379992 <==
==============================================
Elapsed time: 50 seconds
==============================================

Times range from 47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)

@chengzhuzhang

@xylar Yes, I plan to do some timing profiling. Right now I'm running the e3sm_diags bash script taken from zppy, but in the e3sm_diags standalone conda environment. If the timings are similar to running in Unified rc14, then we can use the e3sm_diags env for profiling. Otherwise, we may need a Spack env on Chrysalis, like the one you created for Perlmutter, to do the profiling. I will keep you updated.

xylar commented Aug 30, 2023

@chengzhuzhang, there is always a spack environment on each machine to correspond with each conda environment in E3SM-Unified. Just take a look at the contents of the load script and you can find the 2 commands for activating the spack environment.

chengzhuzhang commented Aug 30, 2023

@xylar Thanks for the instructions. I parsed the timestamps from the e3sm_diags logs, comparing the time taken for the zppy e3sm_diags runs between the e3sm_diags conda env and the Unified rc14 env. It is obvious that with Unified rc14 the processes run much slower. Using the e3sm_diags conda env, the run completed within 60 minutes.
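
The parsing itself is simple; roughly something like this (a sketch, assuming the log lines start with a timestamp in the format shown earlier, e.g. 2023-08-20 15:37:48,537, and the log path here is illustrative):

import re
from datetime import datetime

TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def parse_times(log_path):
    # Collect the timestamp of every log record, in order.
    times = []
    with open(log_path) as f:
        for line in f:
            m = TS.match(line)
            if m:
                times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return times

times = parse_times("prov/e3sm_diags_run.log")  # illustrative path
if times:
    print("total elapsed:", times[-1] - times[0])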

I will find the spack command for performance profiling.

chengzhuzhang commented Aug 30, 2023

I'm using cProfile to look at smaller runs on a Chrysalis compute node. Running in Unified rc14 does show longer timings for many types of processes (e.g., saving provenance). I also ran into intermittent MPI errors when doing the profiling tests, which I think is worth noting as well.

(e3sm_unified_1.9.0rc14_chrysalis) [ac.zhang40@chr-0485 e3sm_diags_parameter]$ python -m cProfile run_lat_lon.py -d lat_lon_ANN.cfg > timing_eurc14_ANN_2
--------------------------------------------------------------------------
PMI2_Init failed to intialize.  Return code: 14
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[chr-0485:1617174] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

@chengzhuzhang

I'm uploading the cProfile results, from tests with the e3sm_diags conda env and the Unified rc14 env.
e3sm_diags conda env cProfile results sorted by cumulative time:

         11900861 function calls (11594276 primitive calls) in 38.465 seconds
  
   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      153    0.005    0.000   43.071    0.282 __init__.py:1(<module>)
   1366/1    0.014    0.000   38.469   38.469 {built-in method builtins.exec}
        1    0.000    0.000   38.469   38.469 run_lat_lon.py:1(<module>)
        1    0.000    0.000   26.105   26.105 run.py:23(run_diags)
        1    0.000    0.000   26.096   26.096 e3sm_diags_driver.py:400(main)
  1378/14    0.006    0.000   13.753    0.982 <frozen importlib._bootstrap>:1022(_find_and_load)
  1351/14    0.005    0.000   13.752    0.982 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
  1285/15    0.005    0.000   13.742    0.916 <frozen importlib._bootstrap>:664(_load_unlocked)
  1182/15    0.003    0.000   13.741    0.916 <frozen importlib._bootstrap_external>:877(exec_module)
  1696/14    0.001    0.000   13.701    0.979 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
   744/33    0.001    0.000   13.451    0.408 {built-in method builtins.__import__}
 1781/524    0.003    0.000   12.659    0.024 <frozen importlib._bootstrap>:1053(_handle_fromlist)
        1    0.000    0.000   10.094   10.094 e3sm_diags_driver.py:137(save_provenance)
        1    0.000    0.000   10.089   10.089 e3sm_diags_driver.py:44(_save_env_yml)
        1    0.000    0.000   10.039   10.039 subprocess.py:1110(communicate)
        1    0.000    0.000   10.039   10.039 subprocess.py:1968(_communicate)
        2    0.000    0.000   10.038    5.019 selectors.py:403(select)
        2   10.038    5.019   10.038    5.019 {method 'poll' of 'select.poll' objects}
        1    0.000    0.000    9.620    9.620 main.py:114(create_viewer)
        1    0.000    0.000    9.609    9.609 default_viewer.py:44(create_viewer)
        1    0.001    0.001    8.708    8.708 lat_lon_viewer.py:372(generate_lat_lon_cmip6_comparison)

unified rc14 cProfile results sorted by cumulative time:

         12416679 function calls (12058492 primitive calls) in 53.285 seconds
  
   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1339/1    0.013    0.000   53.289   53.289 {built-in method builtins.exec}
        1    0.000    0.000   53.289   53.289 run_lat_lon.py:1(<module>)
      152    0.004    0.000   51.361    0.338 __init__.py:1(<module>)
        1    0.000    0.000   38.861   38.861 run.py:23(run_diags)
        1    0.000    0.000   38.851   38.851 e3sm_diags_driver.py:400(main)
        1    0.000    0.000   22.114   22.114 e3sm_diags_driver.py:137(save_provenance)
        1    0.000    0.000   22.091   22.091 e3sm_diags_driver.py:44(_save_env_yml)
        1    0.000    0.000   22.055   22.055 subprocess.py:1110(communicate)
        1    0.000    0.000   22.055   22.055 subprocess.py:1968(_communicate)
        2    0.000    0.000   22.055   11.027 selectors.py:403(select)
        2   22.055   11.027   22.055   11.027 {method 'poll' of 'select.poll' objects}
  1366/14    0.006    0.000   15.746    1.125 <frozen importlib._bootstrap>:1022(_find_and_load)
  1338/14    0.005    0.000   15.746    1.125 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
  1275/15    0.005    0.000   15.737    1.049 <frozen importlib._bootstrap>:664(_load_unlocked)
  1172/15    0.003    0.000   15.736    1.049 <frozen importlib._bootstrap_external>:877(exec_module)
  1691/14    0.001    0.000   15.690    1.121 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
   747/33    0.001    0.000   15.438    0.468 {built-in method builtins.__import__}
 1730/522    0.003    0.000   14.639    0.028 <frozen importlib._bootstrap>:1053(_handle_fromlist)
        2    0.000    0.000   10.936    5.468 dataset.py:1(<module>)
     1160    0.010    0.000    9.832    0.008 <frozen importlib._bootstrap_external>:950(get_code)
        1    0.000    0.000    9.784    9.784 main.py:114(create_viewer)
        1    0.000    0.000    9.773    9.773 default_viewer.py:44(create_viewer)

I'm not very experienced in interpreting cProfile results. The full results are uploaded. Maybe @tomvothecoder, @xylar, and @forsyth2 can help take a look as well. I'm also making a larger run to see if we can draw more conclusive results from it.
timing_edvrc4_ANN_3.txt
timing_eurc14_ANN_3.txt

tomvothecoder commented Aug 30, 2023

@chengzhuzhang I think the cProfile results should be sorted by tottime (total time) in descending order. UPDATE: Sorting by cumulative time also helps for understanding the order of the call stack.

I don't have experience with cProfile (yet), and it seems a bit hard to parse where the lower-level library calls are being made from (e.g., select.poll). A few lines that stand out in the Unified rc14 log include:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   22.055   11.027   22.055   11.027 {method 'poll' of 'select.poll' objects}
     1292    9.014    0.007    9.014    0.007 {method 'read' of '_io.BufferedReader' objects}
       13    6.772    0.521    6.772    0.521 {method 'acquire' of '_thread.lock' objects}
    86/83    2.592    0.030    2.603    0.031 {built-in method _imp.create_dynamic}

These calls are I/O and thread related. I can see how scaling up the data size might multiply the runtime for these processes.

Python Profiling: What does "method 'poll' of 'select.poll' objects" mean?

  • Methods that execute inside a different process (for example with a ProcessPoolExecutor) are not captured by cProfile. So the select.poll is just showing your main process waiting for results from the other processes.

  • select — Waiting for I/O completion
    • This module provides access to the select() and poll() functions available in most operating systems, devpoll() available on Solaris and derivatives, epoll() available on Linux 2.5+ and kqueue() available on most BSD. Note that on Windows, it only works for sockets; on other operating systems, it also works for other file types (in particular, on Unix, it works on pipes). It cannot be used on regular files to determine whether a file has grown since it was last read.
    • select.poll - (Not supported by all operating systems.) Returns a polling object, which supports registering and unregistering file descriptors, and then polling them for I/O events; see section Polling Objects below for the methods supported by polling objects.
  • lock.acquire(blocking=True, timeout=- 1)
    • Without any optional argument, this method acquires the lock unconditionally, if necessary waiting until it is released by another thread (only one thread at a time can acquire a lock — that’s their reason for existence).
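
As a side note, it may be easier to dump the profile to a file and sort it with pstats than to read the redirected text output; a small sketch (the profile filename is arbitrary):

# Run as: python -m cProfile -o profile.out run_lat_lon.py -d lat_lon_ANN.cfg
import pstats

stats = pstats.Stats("profile.out")
stats.strip_dirs().sort_stats("tottime").print_stats(20)  # top 20 by total time
stats.sort_stats("cumulative").print_callers("poll")      # which callers lead to the select.poll call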

chengzhuzhang commented Aug 31, 2023

@tomvothecoder thank you for inspecting the cProfile results!

However, after more investigation, I tend to believe the problem may not come from something internal to e3sm_diags. I did the same profiling for much larger tasks, i.e., the whole lat_lon set with all seasons. The timings are actually very similar, around 2880 seconds, between the e3sm_diags conda env and Unified rc14. Note that I only tested by running interactively on a compute node. When I later tried timing the zppy e3sm_diags sets one at a time by submitting Slurm jobs, the slowdown did occur, i.e., the lat_lon set wouldn't finish in 2 hours. Together with the MPI error reported here, this leads me to think the slowdown is related to how Slurm and MPI are built...

chengzhuzhang commented Aug 31, 2023

@xylar I'm not at all familiar with the build process for Slurm and MPI... but I did some searching; I'm not sure whether the solution in this link makes sense:
https://lists.schedmd.com/pipermail/slurm-users/2019-March/003018.html

In the meantime, I've run out of ideas for further testing...

xylar commented Aug 31, 2023

Sorry if there's some confusion but I'm not using Spack to build MPI itself. I am using the same MPI module that E3SM uses for Gnu compilers on each machine. If there is MPI trouble, I suspect it is something that is built with conda MPI trying to run using HPC MPI libraries.

Please track down what package or script is being run with MPI so we can see if it has been built with system MPI (via Slurm) or if it comes from conda forge, in which case it can't be run with MPI on our systems. This is the reason that I have to build all our MPI tools and libraries with Spack.

xylar commented Aug 31, 2023

Are you seeing the MPI errors only with cProfile or also with normal runs? I could imagine cProfile itself being hard to use properly in this mixed spack and conda environment.

I had in mind to just add calls within e3sm_diags to get the current time at 2 spots in the code and take the difference. The old-fashioned and tedious, but also very robust, method for timing code.

chengzhuzhang commented Aug 31, 2023

@mahf708 encountered the same issue on Chrysalis; copying his Slack message here:
"I face a specific issue on a chrysalis compute node with e3sm_diags within e3sm-unified (latest RC) above: Triggering some e3sm_diags like below returns an elaborate error. Not sure what it means really. I resolve it with simply quitting the node and restarting over. This suggests that some python processes are holding on to the locks or something?"

Actually, I used the same workaround to get past the problem. Restarting from a fresh node works fine.

@chengzhuzhang

In my testing, the MPI error occurred both with and without cProfile.
I guess I can look into the old-fashioned timing method. Right now the e3sm_diags run log contains timestamps, but with multiprocessing it is not possible to isolate the time spent in each process.

@tomvothecoder

In my testing, the MPI error occurred both with and without cProfile. I guess I can look into the old-fashioned timing method. Right now the e3sm_diags run log contains timestamps, but with multiprocessing it is not possible to isolate the time spent in each process.

Copying my Slack discussion with Jill

Adding the following code at the top of e3sm_diags_driver.py should make Dask run sequentially for debugging and profiling.

import dask
dask.config.set(scheduler='synchronous')  # overwrite default with single-threaded scheduler

We can look at the timestamps in the logs to see the runtime. I'm hoping running Dask sequentially mimics the same slowness if it calls the same multiprocessing modules, just on one thread/process at a time.

More info on this setting:

The single-threaded synchronous scheduler executes all computations in the local thread with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes.

For example, when using IPython or Jupyter notebooks, the %debug, %pdb, or %prun magics will not work well when using the parallel Dask schedulers (they were not designed to be used in a parallel computing context). However, if you run into an exception and want to step into the debugger, you may wish to rerun your computation under the single-threaded scheduler where these tools will function properly.

https://docs.dask.org/en/stable/scheduling.html#single-thread

xylar commented Sep 1, 2023

@chengzhuzhang, in case it's helpful, I opened:

/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc14_chrysalis/lib/python3.10/site-packages/e3sm_diags

up for group write permission.

forsyth2 commented Sep 1, 2023

I tried running via zppy with num_workers=4 (rather than the default of 24), as @golaz suggested (reasoning that the overhead of the workers may be causing an issue). Unfortunately, it looks like it's still going beyond the original 2-hour limit.

forsyth2 commented Sep 5, 2023

Follow-up on #485 (comment) and #485 (comment)

With 4 workers:

tail -n 3 e3sm_diags_atm_monthly_180x360_aave_*.o*
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o381747 <==
==============================================
Elapsed time: 49 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.o381749 <==
==============================================
Elapsed time: 49 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.o381748 <==
==============================================
Elapsed time: 49 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o381744 <==
==============================================
Elapsed time: 14235 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o381746 <==
==============================================
Elapsed time: 14287 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o381745 <==
==============================================
Elapsed time: 14303 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o381751 <==
==============================================
Elapsed time: 4186 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o381750 <==
==============================================
Elapsed time: 49 seconds
==============================================

So, 49 seconds to 14303 seconds (=238.4 minutes = 3.97 hours). So, going from "47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)", that's only a saving of ~6.4 minutes on the high end.

@chengzhuzhang

@forsyth2 Thank you for the data points.
I'm also updating some of my initial investigation here. In my case, I only tested "lat_lon" because it can show the slowdown on its own. I used sbatch to submit the jobs to Slurm.

|  | e3sm_diags rc4 env | e3sm unified rc14 env |
| --- | --- | --- |
| scheduler='synchronous' for Dask (single-threaded scheduler) | 166 mins | 166 mins |
| multiprocessing for Dask with 24 workers | 38 mins | 126 mins |

When running in serial with Dask, I did notice that some lat_lon sets take longer; the longest is the comparison with ERA5 OMEGA at 850 mb, which took about 6 mins per season, so 30 mins to finish 5 seasons. Within the calculation for each season of this variable, the contour-map plotting step can take around 5 mins.

I also suspect there is an issue in scheduling. More investigation is needed.

chengzhuzhang changed the title from "Unified rc12 testing: e3sm_diags takes too long on Chrysalis" to "Unified rc12 and rc14 testing: e3sm_diags takes too long on Chrysalis" on Sep 6, 2023
chengzhuzhang commented Sep 6, 2023

While @tomvothecoder and I were doing more time-profiling work and testing, it became apparent that with Unified rc14 each process takes almost proportionally longer. Given that the Slurm/MPI problem also occurs intermittently, I suspect this is still an environment problem: some e3sm_diags processes are trying to call MPI in the Unified env, but the calls are not properly initialized. With srun -n 1, the parallel program starts and the parallel environment is set up correctly. The fix is added in zppy #497, and users who run e3sm_diags in Unified should be reminded to prefix the command with srun to get rid of the PMI2_Init error reported by @mahf708 and me earlier.
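
For example (illustrative only; the actual template change is in zppy #497), instead of invoking the diagnostics driver directly inside the batch script, it gets launched through srun:

# before (inside the zppy-generated batch script; the driver script name here is illustrative)
python -u e3sm.py
# after: launching through srun initializes the PMI/parallel environment correctly
srun -n 1 python -u e3sm.py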

xylar commented Sep 6, 2023

Yay!!!!!
