-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Number of nodes on Titan #1255
Comments
Can you point me at your case directory? I don't think that I have run into this issue on Titan recently - it was broken this way earlier is I remember correctly. Is this from master? |
Yes, git master v1.0.0-beta-223-g628a26d from 2017-02-02 Case-dir: ~azamat/repos/ACME-homme/cime/scripts/cases/A_WCYCL2000-ne30_oEC-intel-01 |
Looking at your env_mach_pes.xml:
Is the current env_mach_pes.xml the one that you tried? The default for this case has
|
Yes, I'm building with this PE layout now. Not with the default. |
Sorry - you complained about a layout with 675 ATM processes, but the one I see has 2700 ATM processes. What should I be looking at? |
SORRY - 675 NODES (not ATM processes). "never mind" |
2700 ATM processes with 4 threads per process is 10800 tasks that are laid out on 675 nodes. |
I see the same problem - the aprun command is correct however. |
Yes, aprun is good, just getting over-charged. |
And the qstat output after submission:
|
@amametjanov , in the mean time, you can manually edit the case.run file to request the actual number of nodes needed, to get back to what you were trying to do. (Which you probably already know - just wanted to be certain.) |
Yes, everyone should check case.run after case.setup to make sure |
task_maker.py got completely refactored in CIME 5.2, so I'm going to wait until we have that before looking too closely at this. |
@worleyph @amametjanov OK, looking at this now |
@worleyph @amametjanov I have a fix ready, I just need #1321 to get merged to master. |
When can I do the merge ( @rljacob )? Skybridge and melvin tests all failed because of the BFBFLAG change (last I checked). Should I check again tomorrow? What is "good enough" for the merge to master? |
Go ahead. |
Now? |
We have a thunderstorm here right now. Would hate to get in the middle and lost connectivity. If the answer is "now", I'll do the merge as soon as things settle down, or tomorrow morning. Thanks. |
Yes. |
Update queue selection to take walltime into account Adds concept of strict walltime. The idea here is to have better support for machines like blues that have a "debug" queue and a "standard" queue. The "debug" has strict limits on both walltime and num_pes and therefore should not be selected as the user's queue if they asked for a long walltime. For other machines, the maxwalltime setting is being used more like a default walltime than a true max. Test suite: scripts_regression_tests (melvin and skybridge) and some by-hand testing on blues Test baseline: Test namelist changes: Test status: bit for bit Fixes #1255 User interface changes?: Changes in how walltime is handled Code review: @jedwards4b @jayeshkrishna @rljacob
This still appears to be an issue. Nodes calculated in case.run are incorrect: expected 1926 nodes, but got |
Note that the aprun command (1926 node run on Titan - see https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL2000+ne120_oRRS15+Titan+1926) is correct, the PBS args (number of nodes requested) are wrong. |
there's a new feature with the job will allocate more nodes than aprun, so if a node fails (very common on Titan) it will attempt a restart in the same job. But I thought that would only ask for an extra 4 nodes or something like that. Just thought I'd mention it since it might be related. |
Look in your env_mach_pes.xml file ...
The default is an increase of 10% in the requested number of nodes (on Titan). Probably overkill, but I do not know the reasoning that determined this number. |
Ok, now it makes sense. 1926 + 10% increase = 1926 + ceil(1926 * 0.1) = 2119 nodes. |
@jgfouca : 10% seems like a lot of nodes for the default - 192 nodes in this case, meaning we are allowing for 192 failures, each requiring a restart run? Or is it that dozens of nodes fail at a time? If the former, can we change this to max(10%,5)? |
Since failures can happen multiple times during a long run, maybe more than 5 nodes as the max? 10? 20? Perhaps OLCF can advise. Perhaps they already have? |
Take a look at Luke's run log: https://acme-climate.atlassian.net/wiki/display/OCNICE/GMPAS-IAF_oRRS18to6v3 some runs had no failures, a few with 1 failure and a few with 2 failures. One run he listed as "numerous failures". Pinging our new Titan POC, @minxu74 , can you advise? |
10% was a WAG (wild-&$! guess) by me when I initially implemented this capability. I would be happy to change it to something more reasonable if someone can tell me what that is. |
In my Titan experience 5 nodes extra is plenty. I have been running with ten extra nodes but have never come close to needing all ten.
…Sent from my iPhone
On Jun 17, 2017, at 1:50 PM, James Foucar ***@***.***> wrote:
10% was a WAG (wild-&$! guess) by me when I initially implemented this capability. I would be happy to change it to something more reasonable if someone can tell me what that is.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@vanroekel , thanks. The last piece of information is this: is there a 1-1 correlation between the number of times the magic string (error in the log that would indicate a node failure) appears in the log and the number of nodes that failed? |
This has been addressed in CIME: |
New user interface features: option --skip-preview-namelist to case.submit to skip calling preview_namelist during case.run option -M MAIL_USER, --mail-user MAIL_USER to case.submit to set an email for batch system notification option -m {never,all,begin,end,fail}, --mail-type {never,all,begin,end,fail} to case.submit for when the batch system should send email. manage_case renamed to more accurate query_config and usage/arguments made easier and output more informative. Add --pesfile option to create_test New case.qstatus tool Allow user to adjust ulimits Add message at end of successful build Add new user-mods option to create_clone New ways to select queue in config_batch.xml scripts_regression_tests.py now has --compiler and --mpilibs option new case variables controlling spare nodes (for failing nodes workaround) User interface bug fixes: xmlquery options --file and --listall now work together. xmlquery options clarified and help message improved. Improve error message for passing incorrect grid alias to create_newcase. Improve error messages from xmlchange Removed user features: Remove long term archiving support. "case.lt_archive" no longer in caseroot. Other changes to case control system: Improve ERR test. New ERS2 test Add IRT test Implement NCR with SystemTestsCompareTwo Rewrite ERP with SystemTestsCompareTwo Convert the ERR and ERRI test to the compare2 framework Fix SEQ test on skybridge new case.test --reset option to manually reset a test to initial conditions. TOTAL_CORES was inaccurate so remove it. Only create rest directories if they are going to be populated get_timing.py should use largest ncpl value, not assume it's the atm value Improve comment for BASELINE status in TestStatus Better handling of fails in the build (don't print a python stacktrace) Add a new directory testmod that creates cpl history files on a daily basis and thereby permits answer changes to be compared more easily. Add ability for create_test to infer non-default machine from testname Change all python2 style strings to python3 style strings Improve handling of hist comparisons in TestStatus.log version 3.0 schema for entry_id files Downscale build parallelism on small machines (so we don't overload it) Bring bless_test_results behavior into an exact match with create_test -g lcdf4 flag does apply to pnetcdf. Replace cdf64 logical with PIO_NETCDF_FORMAT character variable Refactor how spare nodes get computed Preserve test order in test_scheduler Allow queue selection by node count instead of task count Improve error check for non-python sub processes Data model changes: Refactored data models Use %CPLHIST for DATM coupler hist forcing Add support for data model to read multiple time slices at once New aquaplanet capability: adds aquaplanet capability to DOCN where the input is from a file rather than analytic; adds new A compsets to the driver config_compsets.xml that have different DOCN functionality: ADSOM - DOCN SOM, ADSOMAQP - DOCN aquaplanet SOM, ADAQP3 - DOCN analytic aquaplanet (mode 3), ADAQFILE - DOCN aquaplanet from file DATM updates for clm Change SIAF compset name in dice to IAF to be consistent with other components Coupler/driver changes: Map SMB from lnd to glc using bilinear mapping with a conservation correction Remove more cesm refs in filenames. seq_comm_mct code cleanup Implementation of multiple couplers within multi-instance runs. Other: User's guide rst files now in cime/doc directory Get PFUnit working on melvin Homebrew generic mac port Fixes #1255 Fixes #1424 Fixes #1426 Fixes #1573 Fixes #1576 Fixes #1585 * jgfouca/update_to_cime_5.3.0.34: (932 commits) One more short term archiver fix Fix merge conflicts, should mostly work cetus xml fix Hack to get PET test to work on chama Fix HOMME Cmake pointing to wrong cime area Update run_acme version Fixes for AQUAP Removes long term archiving from run_acme Move xml compiler and mpilib tags from env and command elements This is required for the new xml schema requirements; compiler and mpilib tags should be in the modules and environment_variables elements rather than their members Cherry-pick fixes from older merge branch cime_developer now working on melvin new git pelayout doc Update ChangeLog Change atm_grid "48x96" to T31 to match the model_grid_alias that is actually being used. Minor fix Fix mistake Better encapsulation of common calls (create_test, wait_for_tests) Add a timeout option to wait_for_tests update to sphinx documentation Restore task_count ...
Update CIME to 5.3.0alpha34 New user interface features: option --skip-preview-namelist to case.submit to skip calling preview_namelist during case.run option -M MAIL_USER, --mail-user MAIL_USER to case.submit to set an email for batch system notification option -m {never,all,begin,end,fail}, --mail-type {never,all,begin,end,fail} to case.submit for when the batch system should send email. manage_case renamed to more accurate query_config and usage/arguments made easier and output more informative. Add --pesfile option to create_test New case.qstatus tool Allow user to adjust ulimits Add message at end of successful build Add new user-mods option to create_clone New ways to select queue in config_batch.xml scripts_regression_tests.py now has --compiler and --mpilibs option new case variables controlling spare nodes (for failing nodes workaround) User interface bug fixes: xmlquery options --file and --listall now work together. xmlquery options clarified and help message improved. Improve error message for passing incorrect grid alias to create_newcase. Improve error messages from xmlchange Removed user features: Remove long term archiving support. "case.lt_archive" no longer in caseroot. Other changes to case control system: Improve ERR test. New ERS2 test Add IRT test Implement NCR with SystemTestsCompareTwo Rewrite ERP with SystemTestsCompareTwo Convert the ERR and ERRI test to the compare2 framework Fix SEQ test on skybridge new case.test --reset option to manually reset a test to initial conditions. TOTAL_CORES was inaccurate so remove it. Only create rest directories if they are going to be populated get_timing.py should use largest ncpl value, not assume it's the atm value Improve comment for BASELINE status in TestStatus Better handling of fails in the build (don't print a python stacktrace) Add a new directory testmod that creates cpl history files on a daily basis and thereby permits answer changes to be compared more easily. Add ability for create_test to infer non-default machine from testname Change all python2 style strings to python3 style strings Improve handling of hist comparisons in TestStatus.log version 3.0 schema for entry_id files Downscale build parallelism on small machines (so we don't overload it) Bring bless_test_results behavior into an exact match with create_test -g lcdf4 flag does apply to pnetcdf. Replace cdf64 logical with PIO_NETCDF_FORMAT character variable Refactor how spare nodes get computed Preserve test order in test_scheduler Allow queue selection by node count instead of task count Improve error check for non-python sub processes Data model changes: Refactored data models Use %CPLHIST for DATM coupler hist forcing Add support for data model to read multiple time slices at once New aquaplanet capability: adds aquaplanet capability to DOCN where the input is from a file rather than analytic; adds new A compsets to the driver config_compsets.xml that have different DOCN functionality: ADSOM - DOCN SOM, ADSOMAQP - DOCN aquaplanet SOM, ADAQP3 - DOCN analytic aquaplanet (mode 3), ADAQFILE - DOCN aquaplanet from file DATM updates for clm Change SIAF compset name in dice to IAF to be consistent with other components Coupler/driver changes: Map SMB from lnd to glc using bilinear mapping with a conservation correction Remove more cesm refs in filenames. seq_comm_mct code cleanup Implementation of multiple couplers within multi-instance runs. Other: User's guide rst files now in cime/doc directory Get PFUnit working on melvin Homebrew generic mac port Fixes #1255 Fixes #1424 Fixes #1426 Fixes #1573 Fixes #1576 Fixes #1585 * jgfouca/update_to_cime_5.3.0.34: (942 commits) Apply patch for cpl to avoid hangs Adjust ne4 PE layout on Anvil Reset configpes_all to what's on master Cherry-pick wait_for_tests fix from ESMCI Fix xmllint errors on Titan Convert attribute DEBUG to uppercase Update ne4-wcycl PEs on Anvil Fix the rest of the pylint errors for case_st_archive Fix jenkins script mistake made during conflict resolution. short term archiver get_datenames call for last restarts fixed One more short term archiver fix Fix merge conflicts, should mostly work cetus xml fix Hack to get PET test to work on chama Fix HOMME Cmake pointing to wrong cime area Update run_acme version Fixes for AQUAP Removes long term archiving from run_acme Move xml compiler and mpilib tags from env and command elements This is required for the new xml schema requirements; compiler and mpilib tags should be in the modules and environment_variables elements rather than their members Cherry-pick fixes from older merge branch ...
Update CIME to 5.3.0alpha34 New user interface features: option --skip-preview-namelist to case.submit to skip calling preview_namelist during case.run option -M MAIL_USER, --mail-user MAIL_USER to case.submit to set an email for batch system notification option -m {never,all,begin,end,fail}, --mail-type {never,all,begin,end,fail} to case.submit for when the batch system should send email. manage_case renamed to more accurate query_config and usage/arguments made easier and output more informative. Add --pesfile option to create_test New case.qstatus tool Allow user to adjust ulimits Add message at end of successful build Add new user-mods option to create_clone New ways to select queue in config_batch.xml scripts_regression_tests.py now has --compiler and --mpilibs option new case variables controlling spare nodes (for failing nodes workaround) User interface bug fixes: xmlquery options --file and --listall now work together. xmlquery options clarified and help message improved. Improve error message for passing incorrect grid alias to create_newcase. Improve error messages from xmlchange Removed user features: Remove long term archiving support. "case.lt_archive" no longer in caseroot. Other changes to case control system: Improve ERR test. New ERS2 test Add IRT test Implement NCR with SystemTestsCompareTwo Rewrite ERP with SystemTestsCompareTwo Convert the ERR and ERRI test to the compare2 framework Fix SEQ test on skybridge new case.test --reset option to manually reset a test to initial conditions. TOTAL_CORES was inaccurate so remove it. Only create rest directories if they are going to be populated get_timing.py should use largest ncpl value, not assume it's the atm value Improve comment for BASELINE status in TestStatus Better handling of fails in the build (don't print a python stacktrace) Add a new directory testmod that creates cpl history files on a daily basis and thereby permits answer changes to be compared more easily. Add ability for create_test to infer non-default machine from testname Change all python2 style strings to python3 style strings Improve handling of hist comparisons in TestStatus.log version 3.0 schema for entry_id files Downscale build parallelism on small machines (so we don't overload it) Bring bless_test_results behavior into an exact match with create_test -g lcdf4 flag does apply to pnetcdf. Replace cdf64 logical with PIO_NETCDF_FORMAT character variable Refactor how spare nodes get computed Preserve test order in test_scheduler Allow queue selection by node count instead of task count Improve error check for non-python sub processes Data model changes: Refactored data models Use %CPLHIST for DATM coupler hist forcing Add support for data model to read multiple time slices at once New aquaplanet capability: adds aquaplanet capability to DOCN where the input is from a file rather than analytic; adds new A compsets to the driver config_compsets.xml that have different DOCN functionality: ADSOM - DOCN SOM, ADSOMAQP - DOCN aquaplanet SOM, ADAQP3 - DOCN analytic aquaplanet (mode 3), ADAQFILE - DOCN aquaplanet from file DATM updates for clm Change SIAF compset name in dice to IAF to be consistent with other components Coupler/driver changes: Map SMB from lnd to glc using bilinear mapping with a conservation correction Remove more cesm refs in filenames. seq_comm_mct code cleanup Implementation of multiple couplers within multi-instance runs. Other: User's guide rst files now in cime/doc directory Get PFUnit working on melvin Homebrew generic mac port Fixes #1255 Fixes #1424 Fixes #1426 Fixes #1573 Fixes #1576 Fixes #1585 * jgfouca/update_to_cime_5.3.0.34: (942 commits) Apply patch for cpl to avoid hangs Adjust ne4 PE layout on Anvil Reset configpes_all to what's on master Cherry-pick wait_for_tests fix from ESMCI Fix xmllint errors on Titan Convert attribute DEBUG to uppercase Update ne4-wcycl PEs on Anvil Fix the rest of the pylint errors for case_st_archive Fix jenkins script mistake made during conflict resolution. short term archiver get_datenames call for last restarts fixed One more short term archiver fix Fix merge conflicts, should mostly work cetus xml fix Hack to get PET test to work on chama Fix HOMME Cmake pointing to wrong cime area Update run_acme version Fixes for AQUAP Removes long term archiving from run_acme Move xml compiler and mpilib tags from env and command elements This is required for the new xml schema requirements; compiler and mpilib tags should be in the modules and environment_variables elements rather than their members Cherry-pick fixes from older merge branch ...
@jgfouca and @vanroekel and @mt5555 , the new logic supports both asking CIME to do something reasonable and specifying the extra number of nodes directly:
or something like
(Note that FORCE_SPARE_NODES takes precendence when both are set.) However, the default for Titan is now
where as the default used to be TRUE (adding 10% more nodes). Just wanted to check whether this is what we wanted (default to be FALSE) or not. |
@worleyph I think we want it to be true on titan, I will investigate. |
@vanroekel , yes, that change was made a while ago. I think it maxes-out at 10 nodes. |
@vanroekel , in my experiments it appears that the default for Titan is 10, though you can also modify this on a case-by-case basis. I did not look at the code to verify whoat or how the default was determined. |
Tried to run on Titan with the recommended PE layout from here: https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL2000-ne30_oEC-Titan-931.
I was expecting 675 nodes for ATM and 256 nodes for OCN for a total of 931 nodes. However, case.run shows
#PBS -l nodes=1699
: i.e. total requested nodes is 1699. This can only happen by assigning 4x more nodes to OCN: 675+1024=1699. In other words, scripts are counting that each OCN MPI rank has 4 threads like ATM ranks do, but the PE layout asked for OCN to be single-threaded. So the job is over-charged 1.82x more core-hours. Appears to be a bug.The text was updated successfully, but these errors were encountered: