make sure file exists #1868

Merged
merged 1 commit into from
Sep 6, 2017

Conversation

jedwards4b
Contributor

When testing branches or sandboxes that contain files not currently on master, the
is_python_executable subroutine may be called with file paths that do not exist. Instead of failing, just return False so that testing may continue.
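
A minimal sketch of the guard described above. This is not the CIME implementation: only the function name `is_python_executable` comes from this PR, and the shebang check and the executable-bit check are assumptions made for illustration.

```python
import os


def is_python_executable(filepath):
    """Return True if filepath looks like an executable Python script.

    Hypothetical sketch: the criteria used below (executable bit plus a
    shebang line mentioning "python") are assumptions, not the actual
    CIME logic.
    """
    # The guard this PR describes: a missing path is not an error,
    # it simply is not a Python executable.
    if not os.path.isfile(filepath):
        return False

    # Assumed criterion: the file must be executable by the current user.
    if not os.access(filepath, os.X_OK):
        return False

    # Assumed criterion: the first line is a shebang that names python.
    with open(filepath, "rb") as fd:
        first_line = fd.readline()
    return first_line.startswith(b"#!") and b"python" in first_line
```

With a guard like this, a caller that walks a list of paths taken from master can skip files missing from the sandbox instead of aborting the test run.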

Test suite: hand tested, scripts_regression_tests.py
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes

User interface changes?:

Update gh-pages html (Y/N)?:

Code review:

@jgfouca jgfouca merged commit 4d9a8d7 into ESMCI:master Sep 6, 2017
@jedwards4b jedwards4b deleted the is_python_a_file branch September 6, 2017 17:23
jgfouca pushed a commit that referenced this pull request Nov 7, 2017
…#1868)

Centralize coll. of perf. data at NERSC and update NERSC syslog scripts

a) Change SAVE_TIMING_DIR default at NERSC to a central location

Currently the default location for SAVE_TIMING_DIR on Edison,
Cori-Haswell, and Cori-KNL is /project/projectdirs/$PROJECT .
There are a number of ACME-project allocations at NERSC, and
it is advantageous for the performance data for all of these to
be archived in a single location. Here this default is set to
/project/projectdirs/acme . If the ACME model is run by
someone not in the acme group and if this default is not
changed in env_run.xml, then performance data archiving
will be disabled.

b) Change mach_syslog for Cori to start checkpointing earlier

Currently the scripts for Cori-Haswell and Cori-KNL
that monitor model progress do not start until the number of lines
in acme.log exceeds the number of cores in the allocated nodes.
This design was introduced when the process-to-core mapping
was output to acme.log. This mapping output has since been disabled
for these systems, and the scripts often wait excessively long for jobs
with large node counts. This commit changes these scripts to start after an
empirically determined number of lines, attempting to start after the
model output starts, and thus after the list of MPICH environment variables
is output. As this is empirically determined, it may need to be adjusted
again in the future.

c) Change mach_syslog for Edison to start checkpointing earlier

Currently the script for Edison that monitors model progress does
not start until the number of lines in acme.log exceeds the number of
cores in the allocated nodes. This design was introduced when the
process-to-core mapping was output to acme.log. As the number of cores
can be larger (and potentially much larger) than the number of MPI
processes when using OpenMP threading, the script often waits excessively
long for jobs with large node counts when OpenMP threading is used.
This commit changes this script to start after the length of acme.log
exceeds the number of nodes. While not guaranteed to capture all of
the process-to-core mapping, this change does guarantee that something
is captured before the job ends. Note that this change is needed now
because of the successful cleanup of acme.log, which significantly shortened
its length compared to that generated by earlier versions of the model.
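
The difference between the old and new start conditions for Edison can be sketched as follows. This is an illustration only: the real mach_syslog scripts are not shown in this commit message, and the function names and the node, core, and thread counts below are hypothetical.

```python
def old_start_threshold(nodes, cores_per_node):
    """Old rule: wait until acme.log has more lines than allocated cores."""
    return nodes * cores_per_node


def new_start_threshold(nodes):
    """New rule for Edison: wait until acme.log has more lines than nodes."""
    return nodes


def monitoring_can_start(log_line_count, threshold):
    """The monitor starts once the log length exceeds the threshold."""
    return log_line_count > threshold


# Hypothetical job: 100 nodes, 24 cores per node, 4 OpenMP threads per
# MPI process, so there are far fewer MPI processes than cores.
nodes, cores_per_node, threads = 100, 24, 4
mpi_processes = nodes * cores_per_node // threads

# Suppose the log so far holds one mapping line per MPI process.
log_lines = mpi_processes

started_old = monitoring_can_start(
    log_lines, old_start_threshold(nodes, cores_per_node))
started_new = monitoring_can_start(log_lines, new_start_threshold(nodes))
```

In this hypothetical case the log holds 600 lines, which never reaches the old threshold of 2400 but exceeds the new threshold of 100, so only the new rule lets monitoring begin.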

Fixes #1858

[BFB]

P2-117
jgfouca pushed a commit that referenced this pull request Feb 23, 2018
jgfouca pushed a commit that referenced this pull request Mar 13, 2018