Switch Step attributes from cores to ntasks and cpus_per_task #413

Merged: 25 commits from switch_steps_to_slurm_terminology into MPAS-Dev:master on Aug 16, 2022

Conversation

xylar (Collaborator) commented Jun 17, 2022

This merge is a vital step in @altheaden's work on parallel tasks. Up to now, we have been able to get by in compass without differentiating between "tasks" and "CPUs per task". However, there is an important distinction between these concepts in Slurm and other job managers. For task parallelism to function properly, we need to specify both --cpus-per-task (-c for short) and --ntasks (-n for short) for each Slurm job step we launch with srun.

In anticipation of this need, this merge replaces the cores attribute of the Step class with ntasks and cpus_per_task, and the min_cores attribute with min_tasks and min_cpus_per_task.
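
For illustration, the new attributes map onto the srun flags roughly like this (a minimal sketch; the StepResources class, the helper function and the example executable are made up for illustration, and only the ntasks, min_tasks, cpus_per_task and min_cpus_per_task attribute names come from this PR):

# Minimal sketch of how the new Step attributes translate into an srun
# command for a job step.
from dataclasses import dataclass


@dataclass
class StepResources:
    ntasks: int = 1              # number of MPI tasks (srun -n)
    min_tasks: int = 1           # smallest task count the step can run with
    cpus_per_task: int = 1       # CPUs per MPI task (srun -c)
    min_cpus_per_task: int = 1   # smallest per-task CPU count


def srun_command(res, executable):
    # Build the srun command for one job step
    return ['srun', '-n', str(res.ntasks),
            '-c', str(res.cpus_per_task), executable]


# e.g. a 4-task MPI step with 2 OpenMP threads per task:
print(srun_command(StepResources(ntasks=4, cpus_per_task=2), './ocean_model'))
# ['srun', '-n', '4', '-c', '2', './ocean_model']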

Nearly every test case is affected (hence the 96 changed files).

I made a separate commit for each land-ice test group but ran out of steam for the ocean side and, based on the experience gained from landice, switched to a bulk search-and-replace across the ocean test groups.


pep8speaks commented Jun 17, 2022

Hello @xylar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-07-29 09:41:03 UTC

xylar force-pushed the switch_steps_to_slurm_terminology branch from 0ee98a2 to a6fe7b3 on June 17, 2022 18:36
Comment on lines +311 to +313
raise ValueError('Unsupported step configuration in which '
'ntasks > 1 and '
'cpus_per_task != min_cpus_per_task')
xylar (Collaborator, Author):

To preserve general sanity, I'm enforcing a rule that cpus_per_task == min_cpus_per_task if ntasks > 1. If you're doing MPI jobs, you will always know that cpus_per_task = min_cpus_per_task = openmp_thread, since you're using cpus_per_task for OpenMP threading. (This will be enforced in a future update.) If you are running a multithreaded or multiprocessing job on a single node, you will always have ntasks = 1 and the number of threads or processes will be controlled by cpus_per_task and min_cpus_per_task.
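
To illustrate the two supported patterns (a sketch with made-up step names and numbers; only the attribute names come from this PR):

# 1) An MPI step: the task count may vary between min_tasks and ntasks,
#    but the per-task CPU count is pinned to the OpenMP thread count.
forward_step = dict(ntasks=64, min_tasks=16,
                    cpus_per_task=2, min_cpus_per_task=2)   # 2 OpenMP threads

# 2) A single-node threaded or multiprocessing step: ntasks is always 1
#    and the thread/process count is what varies.
remap_step = dict(ntasks=1, min_tasks=1,
                  cpus_per_task=32, min_cpus_per_task=4)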

Member:

Should cpus_per_task = min_cpus_per_task = openmp_thread be cpus_per_task = min_cpus_per_task * openmp_thread?

xylar (Collaborator, Author):


No, if you're running with threads, the number of OpenMP threads is the same thing as the number of CPUs per MPI task. This is just by definition.

In addition to that, what I'm also saying here is that things get too confusing if cpus_per_task is different from min_cpus_per_task. That is, it gets very confusing if a developer allows for some number of MPI tasks in a range between min_tasks and ntasks, and at the same time also allows for a range of threads between min_cpus_per_task and cpus_per_task. I don't want to deal with that complicated (maybe even impossible) optimization problem. So I suggest we also require that the number of OpenMP threads is always fixed for steps with multiple MPI tasks, so that cpus_per_task = openmp_threads and min_cpus_per_task = cpus_per_task.
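
As a rough sketch of the combined rule (a hypothetical function, not the actual compass code):

def check_resources(ntasks, cpus_per_task, min_cpus_per_task,
                    openmp_threads=1):
    # Hypothetical sketch of the constraints proposed above.
    if ntasks > 1:
        # MPI steps: the per-task CPU count is fixed and tied to the
        # OpenMP thread count, so only the task count may vary
        # (between min_tasks and ntasks).
        if cpus_per_task != min_cpus_per_task:
            raise ValueError('Unsupported step configuration in which '
                             'ntasks > 1 and '
                             'cpus_per_task != min_cpus_per_task')
        if cpus_per_task != openmp_threads:
            raise ValueError('cpus_per_task should equal openmp_threads '
                             'when ntasks > 1')
    # Otherwise (ntasks == 1): a threaded or multiprocessing step on a
    # single node, where the thread/process count may vary between
    # min_cpus_per_task and cpus_per_task.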

Does that make things any clearer?

Member:


Yes, that makes sense, thanks.


xylar commented Jun 17, 2022

Testing

I have run the ocean pr and land-ice full_integration test suites on Anvil with Intel/Intel-MPI and Gnu/MVAPICH, respectively. All results are bit-for-bit and runtimes are similar. Results are here:

/lcrc/group/e3sm/ac.xylar/compass_1.1/anvil/test_20220617/pr_slurm_terms4

and here:

/lcrc/group/e3sm/ac.xylar/compass_1.1/anvil/test_20220617/integration_slurm_terms

I also set up (but did not try to run) all the tests in compass, to make sure that, at least through the setup stage, there were no references to the removed cores, min_cores or threads attributes.


xylar commented Jun 17, 2022

I'm still working on updating the documentation.

xylar force-pushed the switch_steps_to_slurm_terminology branch 2 times, most recently from 7c214f1 to a892190 on June 17, 2022 20:45
xylar requested a review from trhille on June 17, 2022 20:50

xylar commented Jun 17, 2022

@altheaden, it would be great if you could run a few tests, maybe even the pr test suite on Chrysalis, since I ran on Anvil.

@matthewhoffman and @trhille, it would be great if you are willing to do some testing on your favorite machines to make sure I haven't broken tests on the landice side. I only tested the test cases that are in the full_integration test suite, so running some tests outside of that suite would be the most helpful.

@mark-petersen, similarly on the ocean side. I ran the pr test suite but I can't feasibly run all of the ocean tests. If you want to test a few that are outside the pr suite, that might be the most helpful.

Obviously, I'm also keen for feedback from all of you.


xylar commented Jun 17, 2022

I reran tests after finding a few more cores --> ntasks replacements that I missed. Tests still pass, and I updated the output directories above with the latest results. I'm signing off. Have a nice weekend!

altheaden (Collaborator) commented:

@xylar In the cosine bell QU90 test, we're seeing that it's using 128 threads when it should just be using one. It looks like the --exclusive flag isn't working as expected.


xylar commented Jun 21, 2022

@mark-petersen and @matthewhoffman, please hold off on reviewing until @altheaden and I figure out the issue above.

xylar added the "in progress" label (This PR is not ready for review or merging) on Jun 21, 2022
xylar marked this pull request as draft on June 21, 2022 17:25
xylar force-pushed the switch_steps_to_slurm_terminology branch from 517d32b to 267e8d5 on June 21, 2022 18:42
xylar removed the "in progress" label on Jun 21, 2022
xylar marked this pull request as ready for review on June 21, 2022 19:14

xylar commented Jun 21, 2022

Okay, issue figured out. I had forgotten to actually use the environment variable (OMP_NUM_THREADS) that I had set. That leads to some mighty strange behavior!
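
For reference, the fix amounts to making sure that value actually reaches the environment of the launched command, along these lines (a sketch, not the exact compass code; the function and executable names are placeholders):

import os
import subprocess


def run_step(ntasks, cpus_per_task, executable):
    # The OMP_NUM_THREADS value computed for the step has to be passed to
    # the environment of the srun call; otherwise the OpenMP runtime falls
    # back to using every CPU it can see (e.g. the 128 threads seen above).
    env = dict(os.environ)
    env['OMP_NUM_THREADS'] = str(cpus_per_task)
    subprocess.run(['srun', '-n', str(ntasks), '-c', str(cpus_per_task),
                    executable], env=env, check=True)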

xylar force-pushed the switch_steps_to_slurm_terminology branch 3 times, most recently from 6a2e411 to 5fa2549 on June 24, 2022 14:31
xylar force-pushed the switch_steps_to_slurm_terminology branch from abaec21 to bf8ca45 on July 29, 2022 08:57

xylar commented Jul 29, 2022

@trhille, I was able to run the full_integration test suite successfully on Anvil with Gnu and MVAPICH, and all tests were bit-for-bit. I wasn't able to reproduce the issues with any of the following on Anvil:

  • landice/eismint2/restart_test
  • landice/eismint2/enthalpy_restart_test
  • landice/dome/2000m/fo_decomposition_test
  • landice/greenland/fo_decomposition_test
  • landice/greenland/fo_restart_test

It could be that a recent rebase took care of these or it could be that they're specific to Cori. I will test full_integration there next.

I haven't investigated any of the calving tests yet.


xylar commented Jul 29, 2022

@trhille,
On Cori, I'm seeing the following failures during test validation (not execution) with the current master:

  • landice_greenland_fo_restart_test
  • landice_thwaites_restart_test
  • landice_humboldt_mesh-3km_restart_test_velo-fo_calving-von_mises_stress_damage-threshold_faceMelting

I see the same failures with this branch but no other failures (i.e. not the others you mentioned in #413 (comment)), and the baseline comparison passes for all tests (even these).

Baseline:

/global/cscratch1/sd/xylar/compass_1.2/test_20220729/landice_full_baseline

This branch:

/global/cscratch1/sd/xylar/compass_1.2/test_20220729/landice_full_slurm_terms

I believe permissions should be such that you can take a look.

I did need to copy the components/mpas-framework/Makefile over from the latest E3SM-Project submodule to get the build to work out.


trhille commented Jul 29, 2022

@xylar, I'm wondering if the issue is with my spack packages or MPAS-Tools, because the test execution failures seem to arise from the MPAS Cell Culler in the eismint2 cases and from an incorrect layerThicknessFractions in the mesh for the dome/2000m/fo_decomposition_test case. I'll test again and update you.


trhille commented Jul 29, 2022

Okay, I am no longer getting the test execution errors in full_integration, and all tests are passing baseline comparison, although the following tests are failing validation, as you also saw:
  • landice_greenland_fo_restart_test
  • landice_thwaites_restart_test
  • landice_humboldt_mesh-3km_restart_test_velo-fo_calving-von_mises_stress_damage-threshold_faceMelting

It's possible I accidentally had an old branch of MPAS-Tools checked out when I ran the test before, but I'm not sure.


xylar commented Jul 30, 2022

@trhille, I think the more likely explanation is just that your conda environment wasn't up-to-date in one or the other branch. Since the 1.1.0 release and the change to 1.2.0-alpha.1 on master, some dependencies have changed that might make a difference.


xylar commented Aug 10, 2022

@matthewhoffman and @mark-petersen, could you make this PR a priority (relative to other reviews -- I know these are very busy times)? This is holding up other work that @altheaden and I would like to do before summer is over and she has to resume classes.

mark-petersen (Collaborator) left a comment:

Thanks @xylar and @altheaden for your work on this. I can see how specifying both ntasks and cpus_per_task would be needed to match the srun format.

I tested this compass PR on Cori with Intel and Badger with Gnu, running today's E3SM master. I ran the nightly test suite on both and everything was successful.

I also tried some extra test cases on both:

cori10:pr$ compass setup -p /global/homes/m/mpeterse/repos/E3SM/master/components/mpas-ocean/ -w $n/ocean_model_220816_c4b7eacb_co_intel-nersc_openmp_debug_master -n 196 200 168 183
Setting up test cases:
  ocean/soma/32km/default
  ocean/soma/32km/three_layer
  ocean/internal_wave/default
  ocean/planar_convergence/horizontal_advection
target cores: 272
minimum cores: 27

These appear to run correctly except that ocean/soma/32km/three_layer fails in forward mode both before and after this PR, so that is unrelated.

I do get a warning on cori that appears to be related to this PR:

 pwd
/global/cscratch1/sd/mpeterse/runs/n/ocean_model_220816_c4b7eacb_co_intel-nersc_openmp_debug_master/ocean/soma/32km/three_layer
(dev_compass_1.2.0-alpha.1) nid00961:three_layer$ compass run
ocean/soma/32km/three_layer
/global/u2/m/mpeterse/repos/compass/pr/compass/parallel.py:93: UserWarning: Slurm found 64 cpus per node but config from mache was 32
  warnings.warn(f'Slurm found {cpus_per_node} cpus per node but '
compass calling: compass.ocean.tests.soma.soma_test_case.SomaTestCase.run()
  inherited from: compass.testcase.TestCase.run()
  in /global/u2/m/mpeterse/repos/compass/pr/compass/testcase.py

Running steps: initial_state, forward
  * step: initial_state

It looks like the warning is just telling us that we are not taking advantage of all the cores available on the node, so it is to be expected.


xylar commented Aug 16, 2022

Thanks a bunch @mark-petersen!

It's worth figuring out what's wrong with 3-layer SOMA, but not here.

The warning:

Slurm found 64 cpus per node but config from mache was 32

should be fixed in mache and isn't related to this PR. This happened because NERSC turned on hyperthreading on Cori-Haswell and it now looks like there are 64 cores per node rather than 32. compass trusts what the SLURM variable says (i.e. 64) rather than what mache thinks the number should be, so it's fine.
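
The check behind that warning is roughly the following (a sketch; the real logic lives in compass/parallel.py and may differ in detail -- SLURM_CPUS_ON_NODE is a standard Slurm environment variable, but whether compass reads exactly that variable is an assumption here):

import os
import warnings


def discover_cpus_per_node(config_cpus_per_node):
    # Prefer what Slurm reports for the allocated node over the value
    # that came from the mache config, warning when they disagree.
    slurm_value = os.environ.get('SLURM_CPUS_ON_NODE')
    if slurm_value is None:
        return config_cpus_per_node
    cpus_per_node = int(slurm_value)
    if cpus_per_node != config_cpus_per_node:
        warnings.warn(f'Slurm found {cpus_per_node} cpus per node but '
                      f'config from mache was {config_cpus_per_node}')
    return cpus_per_node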

matthewhoffman (Member) left a comment:

I tested the landice integration suite on Badger including baseline comparison and all tests pass. I think that is sufficient for MALI interests.


xylar commented Aug 16, 2022

Thank you very much @trhille, @matthewhoffman and @mark-petersen!!! This will be a big deal for allowing @altheaden's work to move forward. Sorry it's so disruptive.

xylar merged commit 9b29ac3 into MPAS-Dev:master on Aug 16, 2022
xylar deleted the switch_steps_to_slurm_terminology branch on August 16, 2022 20:07