make sure file exists #1868
Merged
jgfouca approved these changes Sep 6, 2017
jgfouca pushed a commit that referenced this pull request Nov 7, 2017
…#1868) Centralize coll. of perf. data at NERSC and update NERSC syslog scripts

a) Change SAVE_TIMING_DIR default at NERSC to a central location

Currently the default location for SAVE_TIMING_DIR on Edison, Cori-Haswell, and Cori-KNL is /project/projectdirs/$PROJECT. There are a number of ACME-project allocations at NERSC, and it is advantageous for the performance data for all of these to be archived in a single location. Here this default is set to /project/projectdirs/acme. If the ACME model is run by someone not in the acme group and if this default is not changed in env_run.xml, then performance data archiving will be disabled.

b) Change mach_syslog for Cori to start checkpointing earlier

Currently the scripts for Cori-Haswell and Cori-KNL that monitor model progress do not start until the number of lines in acme.log exceeds the number of cores in the allocated nodes. This design was introduced when the process-to-core mapping was output to acme.log. This mapping output has since been disabled for these systems, and the script often waits excessively long for jobs with large node counts. This commit changes these scripts to start after an empirically determined number of lines, attempting to start after the model output starts, thus after the list of MPICH environment variables is output. As this is empirically determined, it may need to be adjusted again in the future.

c) Change mach_syslog for Edison to start checkpointing earlier

Currently the script for Edison that monitors model progress does not start until the number of lines in acme.log exceeds the number of cores in the allocated nodes. This design was introduced when the process-to-core mapping was output to acme.log. As the number of cores can be larger (and potentially much larger) than the number of MPI processes when using OpenMP threading, the script often waits excessively long for jobs with large node counts when OpenMP threading is used. This commit changes this script to start after the length of acme.log exceeds the number of nodes. While not guaranteed to capture all of the process-to-core mapping, this change does guarantee that something is captured before the job ends. Note that this change is needed now because of the successful cleanup of acme.log, significantly shortening its length compared to that generated by earlier versions of the model.

Fixes #1858
[BFB]
P2-117
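The start condition described in (b) and (c) can be illustrated roughly as follows. This is only a sketch in Python, not the actual mach_syslog scripts; the function names `count_lines` and `ready_to_monitor` are hypothetical, and the threshold is left as a parameter because the commit message does not state the empirically determined value used on Cori. On Edison the threshold would be the number of nodes.

```python
def count_lines(log_path):
    """Count the lines currently present in the log file."""
    with open(log_path, "r", errors="replace") as handle:
        return sum(1 for _ in handle)


def ready_to_monitor(log_path, threshold):
    """Return True once the log has more lines than the threshold.

    Hypothetical helper: the threshold is supplied by the caller
    (for example, the node count on Edison).
    """
    try:
        return count_lines(log_path) > threshold
    except FileNotFoundError:
        # The model has not created the log yet, so keep waiting.
        return False
```

A monitoring loop would presumably poll `ready_to_monitor("acme.log", threshold)` at some interval and begin checkpointing once it returns True. A smaller threshold starts monitoring sooner, which is the point of both changes.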
jgfouca pushed a commit that referenced this pull request Feb 23, 2018
jgfouca pushed a commit that referenced this pull request Mar 13, 2018
When testing branches or sandboxes that do not have files currently on master, the is_python_executable subroutine may be called with file paths that do not exist. Instead of failing, just return False so that testing may continue.
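The behavior described above might look something like the following. This is a minimal sketch, not the actual diff from this pull request: only the function name and the "return False when the file is missing" behavior come from the description, while the shebang check used to decide whether a file counts as a Python executable is an assumption made for illustration.

```python
import os


def is_python_executable(filepath):
    """Return True if filepath is an existing file with a Python shebang.

    Sketch only: a missing path yields False instead of raising.
    The shebang test below is an assumed criterion, not the real one.
    """
    if not os.path.isfile(filepath):
        # The file is absent on this branch or sandbox: report False
        # so the caller can continue with the remaining files.
        return False
    with open(filepath, "r", errors="replace") as handle:
        first_line = handle.readline()
    return first_line.startswith("#!") and "python" in first_line
```

Checking for existence first keeps the caller free of try/except handling, so a test that walks a list of paths taken from master does not abort when one of them is missing locally.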
Test suite: hand tested, scripts_regression_tests.py
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes
User interface changes?:
Update gh-pages html (Y/N)?:
Code review: