Allow unique configuration for each compute resource (#284)
Currently, UseOnDemand, UseSpot, and DisableSimultaneousMultithreading are global
parameters that affect all instance types.
Add a new configuration option that uses the existing parameters as defaults for
each instance type, but allows them to be overridden for each included instance
family and each included instance type.

This allows admins to reduce the number of compute resources by, for example,
only configuring spot for small instance types, but not for larger ones.
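
For example, a configuration along these lines (a sketch using the per-entry overrides documented in docs/config.md; the specific instance types are illustrative) enables spot only for the smaller sizes:

```
slurm:
  InstanceConfig:
    UseOnDemand: true
    UseSpot: false                      # default: no spot
    Include:
      InstanceTypes:
        - c7a.medium: {UseSpot: true}   # spot for small sizes only
        - c7a.xlarge: {UseSpot: true}
        - c7a.48xlarge                  # on-demand only
```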

Resolves #277

Add documentation of manual commands for deconfiguring before deleting cluster.

Resolves #282

=========================================================================

Go through everything and change the original term I used, Submitter, to External Login Node.
Just need to make things consistent.
cartalla authored Nov 6, 2024
1 parent 155193c commit 835f62b
Showing 39 changed files with 345 additions and 364 deletions.
86 changes: 64 additions & 22 deletions docs/config.md
@@ -33,7 +33,6 @@ This project creates a ParallelCluster configuration file that is documented in
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/Image-v3.html#yaml-Image-CustomAmi">CustomAmi</a>: str
<a href="#architecture">Architecture</a>: str
<a href="#computenodeami">ComputeNodeAmi</a>: str
<a href="#disablesimultaneousmultithreading">DisableSimultaneousMultithreading</a>: str
<a href="#enableefa">EnableEfa</a>: bool
<a href="#database">Database</a>:
<a href="#databasestackname">DatabaseStackName</a>: str
@@ -95,6 +94,7 @@ This project creates a ParallelCluster configuration file that is documented in
<a href="#instanceconfig">InstanceConfig</a>:
<a href="#useondemand">UseOnDemand</a>: str
<a href="#usespot">UseSpot</a>: str
<a href="#disablesimultaneousmultithreading">DisableSimultaneousMultithreading</a>: str
<a href="#exclude">Exclude</a>:
<a href="#exclude-instancefamilies">InstanceFamilies</a>:
- str
@@ -104,8 +104,16 @@ This project creates a ParallelCluster configuration file that is documented in
<a href="#maxsizeonly">MaxSizeOnly</a>: bool
<a href="#include-instancefamilies">InstanceFamilies</a>:
- str
- str:
UseOnDemand: bool
UseSpot: bool
DisableSimultaneousMultithreading: bool
<a href="#include-instancetypes">InstanceTypes</a>:
- str
- str:
UseOnDemand: bool
UseSpot: bool
DisableSimultaneousMultithreading: bool
<a href="#nodecounts">NodeCounts</a>:
<a href="#defaultmincount">DefaultMinCount</a>: str
<a href="#defaultmaxcount">DefaultMaxCount</a>: str
@@ -359,22 +367,6 @@ All compute nodes will use the same AMI.

The default AMI is selected by the [Image](#image) parameters.

#### DisableSimultaneousMultithreading

type: bool

default=True

Disable SMT on the compute nodes.

If true, multithreading on the compute nodes is disabled.

Not all instance types can disable multithreading. For a list of instance types that support disabling multithreading, see CPU cores and threads for each CPU core per instance type in the Amazon EC2 User Guide for Linux Instances.

Update policy: The compute fleet must be stopped for this setting to be changed for an update.

[ParallelCluster documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-DisableSimultaneousMultithreading)

#### EnableEfa

type: bool
@@ -634,14 +626,14 @@ ParallelCluster is limited to a total of 50 compute resources and
we only put 1 instance type in each compute resource.
This limits you to a total of 50 instance types per cluster.
If you need more instance types than that, then you will need to create multiple clusters.
If you configure both on-demand and spot instances, then the limit is effectively 25 instance types because 2 compute resources will be created for each instance type.
If you configure both on-demand and spot for each instance type, then the limit is effectively 25 instance types because 2 compute resources will be created for each instance type.

If you configure more than 50 instance types then the installer will fail with an error.
You will then need to modify your configuration to either include fewer instance types or
exclude instance types from the configuration.

If no Include and Exclude parameters are specified then default EDA instance types
will be configured.
will be configured with both On-Demand and Spot Instances.
The defaults will include the latest generation instance families in the c, m, r, x, and u families.
Older instance families are excluded.
Metal instance types are also excluded.
@@ -652,7 +644,7 @@ This is because EDA workloads are typically memory limited, not core limited.
If any Include or Exclude parameters are specified, then minimal defaults will be used for the parameters that
aren't specified.
By default, all instance families are included and no specific instance types are included.
By default, all instance types with less than 2 GiB of memory are excluded because they don't have enough memory for a Slurm compute node.
By default, all instance types with less than 4 GiB of memory are excluded because they don't have enough memory for a Slurm compute node.

If no includes or excludes are provided, the defaults are:

@@ -772,6 +764,8 @@ slurm:
#### UseOnDemand

Configure on-demand instances.
This sets the default for all included instance types.
It can be overridden for each included instance family and instance type.

type: bool

@@ -780,16 +774,35 @@ default: True
#### UseSpot

Configure spot instances.
This sets the default for all included instance types.
It can be overridden for each included instance family and instance type.

type: bool

default: True

#### DisableSimultaneousMultithreading

type: bool

default: True

Disable SMT on the compute nodes.
If true, multithreading on the compute nodes is disabled.
This sets the default for all included instance types.
It can be overridden for each included instance family and instance type.

Not all instance types can disable multithreading. For a list of instance types that support disabling multithreading, see CPU cores and threads for each CPU core per instance type in the Amazon EC2 User Guide for Linux Instances.

Update policy: The compute fleet must be stopped for this setting to be changed for an update.

[ParallelCluster documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-DisableSimultaneousMultithreading)

#### Exclude

Instance families and types to exclude.

Exclude patterns are processed first and take precesdence over any includes.
Exclude patterns are processed first and take precedence over any includes.

Instance families and types are regular expressions with implicit '^' and '$' at the beginning and end.

@@ -809,10 +822,39 @@ Default: []

Instance families and types to include.

Exclude patterns are processed first and take precesdence over any includes.
Exclude patterns are processed first and take precedence over any includes.

Instance families and types are regular expressions with implicit '^' and '$' at the beginning and end.

Each element in the array can be either a regular expression string or a dictionary whose only key
is the regular expression string and whose value overrides **UseOnDemand**, **UseSpot**, and **DisableSimultaneousMultithreading** for the matching instance families or instance types.

The settings for an instance family override the defaults, and the settings for an instance type override both.

For example, the following configuration defaults to only On-Demand instances with SMT disabled.
It includes all of the r7a, r7i, and r7iz instance types.
The r7a instances will only have On-Demand instances.
The r7i and r7iz instance types will also have spot instances, except for r7i.48xlarge, which has spot disabled.

This allows you to control these attributes of the compute resources with whatever level of granularity you need.

```
slurm:
InstanceConfig:
UseOnDemand: true
UseSpot: false
DisableSimultaneousMultithreading: true
Exclude:
InstanceTypes:
- .*\.metal
Include:
InstanceFamilies:
- r7a.*
- r7i.*: {UseSpot: true}
InstanceTypes:
- r7i.48xlarge: {UseSpot: false}
```
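
The precedence rules above can be sketched in Python (an illustration of the lookup order, not the plugin's actual code):

```python
def resolve_setting(name, type_overrides, family_overrides, defaults):
    """Resolve one setting: instance-type overrides win over instance-family
    overrides, which win over the InstanceConfig-level defaults."""
    return type_overrides.get(name, family_overrides.get(name, defaults[name]))

# Values taken from the example configuration above.
defaults = {'UseOnDemand': True, 'UseSpot': False,
            'DisableSimultaneousMultithreading': True}
r7i_family = {'UseSpot': True}       # - r7i.*: {UseSpot: true}
r7i_48xlarge = {'UseSpot': False}    # - r7i.48xlarge: {UseSpot: false}

print(resolve_setting('UseSpot', r7i_48xlarge, r7i_family, defaults))  # False
print(resolve_setting('UseSpot', {}, r7i_family, defaults))            # True
print(resolve_setting('UseOnDemand', {}, {}, defaults))                # True
```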

##### MaxSizeOnly

type: bool
24 changes: 22 additions & 2 deletions docs/delete-cluster.md
@@ -1,11 +1,31 @@
# Delete Cluster

To delete the cluster all you need to do is delete the configuration CloudFormation stack.
This will delete the ParallelCluster cluster and all of the configuration resources.
Before deleting the cluster, you should stop the cluster and make sure that no instances are
connected to the cluster's head node.

For example, you should deconfigure external login nodes and instances that are creating and updating the users_groups.json file.

If you specified RESEnvironmentName, then deleting the stack will also deconfigure the creation of `users_groups.json` and the VDI
instances so that they are no longer using the cluster.

If you configured [DomainJoinedInstance](config.md/#domainjoinedinstance) then the creation of `users_groups.json` will be automatically deconfigured.

If you configured [ExternalLoginNodes](config.md/#externalloginnodes) then they will automatically be deconfigured.

If you did this configuration manually, then you should also manually deconfigure it before deleting the cluster.
Otherwise, the NFS mounts of the head node will hang and file-system-related commands on the instances may hang.
The commands to manually deconfigure can be found in the outputs of the configuration stack.

| Output | Description
|--------|-------------
| command10CreateUsersGroupsJsonDeconfigure | Deconfigure the creation of users_groups.json
| command11ExternalLoginNodeDeconfigure | Deconfigure external login node
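
Those outputs can also be retrieved programmatically. A sketch using boto3 (a hypothetical helper; it assumes AWS credentials and relies on the `command...` output-key naming shown in the table above):

```python
def deconfigure_commands(outputs):
    """Filter CloudFormation stack outputs down to the deconfigure commands,
    whose keys start with 'command' as in the table above."""
    return {o['OutputKey']: o['OutputValue']
            for o in outputs
            if o['OutputKey'].startswith('command')}

def get_deconfigure_commands(stack_name, region):
    # Live call; requires AWS credentials and the boto3 package.
    import boto3
    cfn = boto3.client('cloudformation', region_name=region)
    stack = cfn.describe_stacks(StackName=stack_name)['Stacks'][0]
    return deconfigure_commands(stack.get('Outputs', []))

# Example (hypothetical stack name):
# get_deconfigure_commands('my-cluster-config', 'us-east-1')
```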

To delete the cluster, all you need to do is delete the configuration CloudFormation stack.
This will delete the ParallelCluster cluster stack and all of the configuration resources.
You should not manually delete the ParallelCluster stack.
If you do, the deconfiguration of the external login nodes and other resources may fail.

If you deployed the Slurm database stack then you can keep that and use it for other clusters.
If you don't need it anymore, then you can delete the stack.
You will also need to manually delete the RDS database.
96 changes: 68 additions & 28 deletions source/SlurmPlugin.py
@@ -22,6 +22,7 @@
import boto3
from botocore.exceptions import ClientError
from collections import Counter, defaultdict
from copy import deepcopy
from datetime import datetime, timedelta, timezone
from EC2InstanceTypeInfoPkg.EC2InstanceTypeInfo import EC2InstanceTypeInfo
from functools import wraps
@@ -1956,30 +1957,54 @@ def get_instance_types_from_instance_config(self, instance_config: dict, regions
Get instance types selected by the config file.
Returns:
dict: Dictionary of arrays of instance types in each region. instance_types[region][instance_types]
dict: Dictionary of dictionary of instance types in each region. instance_types[region]{instance_types: {UseOnDemand: bool, UseSpot: bool, DisableSimultaneousMultithreading: bool}}
'''
instance_config = deepcopy(instance_config)

default_instance_type_config = {
'UseOnDemand': instance_config['UseOnDemand'],
'UseSpot': instance_config['UseSpot'],
'DisableSimultaneousMultithreading': instance_config['DisableSimultaneousMultithreading']
}

instance_types = {}
for region in regions:
# Compile strings into regular expressions
instance_config_re = {}
for include_exclude in ['Include', 'Exclude']:
instance_config_re[include_exclude] = {}
for filter_type in ['InstanceFamilies', 'InstanceTypes']:
instance_config_re[include_exclude][filter_type] = []
for index, re_string in enumerate(instance_config.get(include_exclude, {}).get(filter_type, {})):
if include_exclude == 'Include':
instance_config_re[include_exclude][filter_type] = {}
else:
instance_config_re[include_exclude][filter_type] = []
for index, re_item in enumerate(instance_config.get(include_exclude, {}).get(filter_type, {})):
if type(re_item) is str:
re_string = re_item
re_config = {}
else:
re_string = list(re_item.keys())[0]
re_config = re_item[re_string]
try:
instance_config_re[include_exclude][filter_type].append(re.compile(f"^{re_string}$"))
compiled_re = re.compile(f"^{re_string}$")
except:
logging.exception(f"Invalid regular expression for instance_config['{include_exclude}']['{filter_type}'] {re_string}")
logger.exception(f"Invalid regular expression for instance_config['{include_exclude}']['{filter_type}'] {re_string}")
exit(1)
if include_exclude == 'Include':
instance_config_re[include_exclude][filter_type][re_string] = {
're': compiled_re,
'config': re_config
}
else:
instance_config_re[include_exclude][filter_type].append(compiled_re)

region_instance_types = []
region_instance_types = {}

for instance_family in sorted(self.instance_type_and_family_info[region]['instance_families'].keys()):
logger.debug(f"Considering {instance_family} family exclusions")
exclude = False
for instance_family_re in instance_config_re.get('Exclude', {}).get('InstanceFamilies', {}):
if instance_family_re.match(instance_family):
for instance_family_exclude_re in instance_config_re.get('Exclude', {}).get('InstanceFamilies', {}):
if instance_family_exclude_re.match(instance_family):
logger.debug(f"Excluding {instance_family} family")
exclude = True
break
@@ -1989,16 +2014,19 @@ def get_instance_types_from_instance_config(self, instance_config: dict, regions
logger.debug(f"{instance_family} family not excluded")

# Check to see if instance family is explicitly included
include_family = False
include_instance_family = False
if instance_config_re['Include']['InstanceFamilies']:
logger.debug(f"Considering {instance_family} family inclusions")
for instance_family_re in instance_config_re['Include']['InstanceFamilies']:
if instance_family_re.match(instance_family):
for instance_family_include_re_string in instance_config_re['Include']['InstanceFamilies']:
instance_family_include_re = instance_config_re['Include']['InstanceFamilies'][instance_family_include_re_string]['re']
if instance_family_include_re.match(instance_family):
logger.debug(f"Including {instance_family} family")
include_family = True
include_instance_family = True
instance_family_config = instance_config_re['Include']['InstanceFamilies'][instance_family_include_re_string]['config']
break
if not include_family:
logger.debug(f"{instance_family} family not included. Will check for instance type inclusions.")
if not include_instance_family:
logger.debug(f"{instance_family} family not included. Will check for instance type inclusions.")
instance_family_config = default_instance_type_config

# Check the family's instance types for exclusion and inclusion. MaxSizeOnly is a type of exclusion.
instance_family_info = self.instance_type_and_family_info[region]['instance_families'][instance_family]
@@ -2008,31 +2036,43 @@ def get_instance_types_from_instance_config(self, instance_config: dict, regions
logger.debug(f"Excluding {instance_type} because not MaxInstanceType.")
continue
exclude = False
for instance_type_re in instance_config_re['Exclude']['InstanceTypes']:
if instance_type_re.match(instance_type):
logger.debug(f"Excluding {instance_type} because excluded")
for instance_type_exclude_re in instance_config_re['Exclude']['InstanceTypes']:
if instance_type_exclude_re.match(instance_type):
logger.debug(f"Excluding {instance_type} because instance type excluded")
exclude = True
break
if exclude:
continue
logger.debug(f"{instance_type} not excluded by instance type exclusions")

# The instance type isn't explicitly excluded so check if it is included
if include_family:
logger.debug(f"Including {instance_type} because {instance_family} family is included.")
region_instance_types.append(instance_type)
continue
include = False
for instance_type_re in instance_config_re['Include']['InstanceTypes']:

# Even if it is included because of the family, check for explicit instance type inclusion because the config may be different than for the family.
include_instance_type = False
instance_type_config = {}
#logger.info(f"instance_config_re:\n{json.dumps(instance_config_re, indent=4, default=lambda o: '<not serializable>')}")
for instance_type_re_string, instance_type_re_dict in instance_config_re['Include']['InstanceTypes'].items():
instance_type_re = instance_type_re_dict['re']
if instance_type_re.match(instance_type):
logger.debug(f"Including {instance_type}")
include = True
region_instance_types.append(instance_type)
logger.debug(f"Including {instance_type} because explicitly included.")
include_instance_type = True
instance_type_config = instance_type_re_dict['config']
break
if not include:

if include_instance_family:
logger.debug(f"Including {instance_type} because {instance_family} family is included.")

if not (include_instance_family or include_instance_type):
logger.debug(f"Excluding {instance_type} because not included")
continue
instance_types[region] = sorted(region_instance_types)

instance_type_config['UseOnDemand'] = instance_type_config.get('UseOnDemand', instance_family_config.get('UseOnDemand', default_instance_type_config['UseOnDemand']))
instance_type_config['UseSpot'] = instance_type_config.get('UseSpot', instance_family_config.get('UseSpot', default_instance_type_config['UseSpot']))
instance_type_config['DisableSimultaneousMultithreading'] = instance_type_config.get('DisableSimultaneousMultithreading', instance_family_config.get('DisableSimultaneousMultithreading', default_instance_type_config['DisableSimultaneousMultithreading']))

region_instance_types[instance_type] = instance_type_config

instance_types[region] = region_instance_types
return instance_types

# Translate region code to region name
