Merge pull request #1816 from ESMCI/jgfouca/refactor_spare_nodes
Refactor how spare nodes get computed.

The prior implementation was too simple: it always gave the user
10 percent of nodes as spare nodes, which was excessive for large jobs.

The new implementation still allocates 10 percent of nodes (rounded up)
when ALLOCATE_SPARE_NODES is TRUE, but never fewer than 1 and never
more than 10.

We also add a new variable, FORCE_SPARE_NODES, that lets the user pick
the exact number of spare nodes they want.
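
To make the difference concrete, here is a small sketch comparing the old percentage rule with the new capped rule. The function names `old_spare_nodes` and `new_spare_nodes` are hypothetical and not part of CIME; the arithmetic mirrors the expressions in the `get_spare_nodes` diff, assuming ALLOCATE_SPARE_NODES is TRUE and FORCE_SPARE_NODES is left at its default.

```python
import math


def old_spare_nodes(num_nodes, pct_spare_nodes=10):
    # Old rule: a fixed percentage of the job's nodes, rounded up.
    return int(math.ceil(float(num_nodes) * (pct_spare_nodes / 100.0)))


def new_spare_nodes(num_nodes):
    # New rule: 10 percent rounded up, clamped to the range 1..10.
    ten_pct = int(math.ceil(float(num_nodes) * 0.1))
    return max(1, min(10, ten_pct))


if __name__ == "__main__":
    for n in (5, 50, 1000):
        print(n, old_spare_nodes(n), new_spare_nodes(n))
```

For a 1000-node job the old rule asks for 100 spare nodes, while the new rule asks for 10; for small jobs the two rules agree.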

Test suite: scripts_regression_tests T_TestRunRestart
Test baseline:
Test namelist changes:
Test status: [bit for bit, roundoff, climate changing]

Fixes [CIME Github issue #]

User interface changes?: Yes, new case variables controlling spare nodes

Update gh-pages html (Y/N)?: N

Code review: @jedwards4b
jgfouca authored Aug 16, 2017
2 parents b5ba1e8 + f993df0 commit aa917ee
Showing 5 changed files with 29 additions and 7 deletions.
2 changes: 1 addition & 1 deletion config/acme/machines/config_machines.xml
@@ -1984,7 +1984,7 @@
 <SUPPORTED_BY>acme</SUPPORTED_BY>
 <GMAKE_J>8</GMAKE_J>
 <MAX_TASKS_PER_NODE>16</MAX_TASKS_PER_NODE>
-<PCT_SPARE_NODES>10</PCT_SPARE_NODES>
+<ALLOCATE_SPARE_NODES>TRUE</ALLOCATE_SPARE_NODES>
 <PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
 <PROJECT>cli115</PROJECT>
 <PIO_CONFIG_OPTS> -D PIO_BUILD_TIMING:BOOL=ON </PIO_CONFIG_OPTS>
2 changes: 1 addition & 1 deletion config/config_tests.xml
@@ -442,7 +442,7 @@ NODEFAIL Tests restart upon detected node failure. Generates fake failu
 <CONTINUE_RUN>FALSE</CONTINUE_RUN>
 <CHECK_TIMING>FALSE</CHECK_TIMING>
 <NODE_FAIL_REGEX>JGF FAKE NODE FAIL</NODE_FAIL_REGEX>
-<PCT_SPARE_NODES>300</PCT_SPARE_NODES>
+<FORCE_SPARE_NODES>3</FORCE_SPARE_NODES>
 </test>
 
 <test NAME="ICP">
2 changes: 1 addition & 1 deletion config/xml_schemas/env_mach_pes.xsd
@@ -2,7 +2,7 @@
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
 <!-- attributes -->
 <xs:attribute name="id" type="xs:NCName"/>
-<xs:attribute name="value" type="xs:integer"/>
+<xs:attribute name="value" type="xs:string"/>
 <xs:attribute name="version" type="xs:decimal"/>
 
 <!-- simple elements -->
15 changes: 14 additions & 1 deletion scripts/lib/CIME/XML/env_mach_pes.py
@@ -80,4 +80,17 @@ def get_total_nodes(self, total_tasks, max_thread_count):
         return num_nodes, self.get_spare_nodes(num_nodes)
 
     def get_spare_nodes(self, num_nodes):
-        return int(math.ceil(float(num_nodes) * (self.get_value("PCT_SPARE_NODES") / 100.0)))
+        force_spare_nodes = self.get_value("FORCE_SPARE_NODES")
+        if force_spare_nodes != -999:
+            return force_spare_nodes
+
+        if self.get_value("ALLOCATE_SPARE_NODES"):
+            ten_pct = int(math.ceil(float(num_nodes) * 0.1))
+            if ten_pct < 1:
+                return 1 # Always provide at least one spare node
+            elif ten_pct > 10:
+                return 10 # Never provide more than 10 spare nodes
+            else:
+                return ten_pct
+        else:
+            return 0
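
Outside the class, the order of precedence in the new `get_spare_nodes` can be sketched as a standalone function. This is an illustration only, not CIME code: `spare_nodes`, its keyword parameters, and the `UNSET` constant are hypothetical names standing in for the `get_value` lookups and the -999 sentinel shown in the diff.

```python
import math

UNSET = -999  # assumed to match the default_value of FORCE_SPARE_NODES


def spare_nodes(num_nodes, allocate_spare_nodes=False, force_spare_nodes=UNSET):
    # 1. An explicit FORCE_SPARE_NODES value wins over everything else.
    if force_spare_nodes != UNSET:
        return force_spare_nodes
    # 2. Without ALLOCATE_SPARE_NODES, no spare nodes are requested.
    if not allocate_spare_nodes:
        return 0
    # 3. Otherwise take 10 percent, rounded up, clamped to 1..10.
    ten_pct = int(math.ceil(float(num_nodes) * 0.1))
    return max(1, min(10, ten_pct))


if __name__ == "__main__":
    print(spare_nodes(8))                              # neither variable set
    print(spare_nodes(8, allocate_spare_nodes=True))   # small job
    print(spare_nodes(500, allocate_spare_nodes=True)) # large job hits the cap
    print(spare_nodes(500, force_spare_nodes=3))       # forced value
```

Note that a forced value of 0 is honored as given, so under this logic FORCE_SPARE_NODES=0 disables spare nodes even when ALLOCATE_SPARE_NODES is TRUE.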
15 changes: 12 additions & 3 deletions src/drivers/mct/cime_config/config_component.xml
@@ -1878,12 +1878,21 @@
 <!-- definitions pelayout -->
 <!-- ===================================================================== -->
 
-<entry id="PCT_SPARE_NODES">
+<entry id="ALLOCATE_SPARE_NODES">
+<type>logical</type>
+<valid_values>TRUE,FALSE</valid_values>
+<default_value>FALSE</default_value>
+<group>mach_pes</group>
+<file>env_mach_pes.xml</file>
+<desc>Allocate some spare nodes to handle node failures. The system will pick a reasonable number</desc>
+</entry>
+
+<entry id="FORCE_SPARE_NODES">
 <type>integer</type>
-<default_value>0</default_value>
+<default_value>-999</default_value>
 <group>mach_pes</group>
 <file>env_mach_pes.xml</file>
-<desc>Percent of extra spare nodes to allocate</desc>
+<desc>Force this exact number of spare nodes to be allocated</desc>
 </entry>
 
 <entry id="NTASKS">
