Update config files and fix errors found in testing new configs
Add --RESEnvironmentName to the installer

Ease initial integration with Research and Engineering Studio (RES).

Automatically add the correct submitter security groups and configure
the /home directory.

Resolves #207

============================

Update template config files

Added more comments to clarify that these are examples that should be copied
and customized by users.

Added comments for typical configuration options.

Deleted obsolete configs that were from v1.

Resolves #203

=============================

Set default head node instance type based on architecture.

Resolves #206

==============================

Clean up ansible-lint errors and warnings.
Arm architecture clusters were failing because of an incorrect condition in the Ansible playbook, which lint flags.

==============================

Use vdi controller instead of cluster manager for users and groups info

Cluster manager stopped being domain joined for some reason.

==============================

Paginate describe_instances when creating the head node A record.

Otherwise, may not find the cluster head node instance.
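
A minimal sketch of the pagination idea, assuming a boto3-style EC2 client. describe_instances returns results in pages, so a single call can miss the head node when many instances match. The helper name find_instance_ids and the stub client classes are hypothetical and are not taken from this commit; get_paginator('describe_instances') and the Reservations/Instances response shape follow the standard boto3 EC2 interface.

```python
def find_instance_ids(ec2_client, filters):
    """Collect instance IDs from every page of describe_instances."""
    instance_ids = []
    paginator = ec2_client.get_paginator('describe_instances')
    for page in paginator.paginate(Filters=filters):
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_ids.append(instance['InstanceId'])
    return instance_ids


class _StubPaginator:
    """Hypothetical paginator that yields pre-built pages."""
    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        return iter(self._pages)


class StubEc2Client:
    """Hypothetical stand-in for a boto3 EC2 client, used only for this example."""
    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        assert operation_name == 'describe_instances'
        return _StubPaginator(self._pages)


# The head node only shows up on the second page.
pages = [
    {'Reservations': [{'Instances': [{'InstanceId': 'i-aaa'}]}]},
    {'Reservations': [{'Instances': [{'InstanceId': 'i-headnode'}]}]},
]
ids = find_instance_ids(StubEc2Client(pages), filters=[])
print(ids)
```

With a real client, the stub would be replaced by boto3.client('ec2') and the filters would select the cluster's head node.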

==============================

Add default MungeKeySecret.

This should be the default or you can't access multiple clusters from the same server.

==============================

Increase timeout for ssm command that configures submitters

Need the time to compile slurm.

==============================

Force Slurm to be rebuilt for submitters of all OS distributions, even if they match the OS of the cluster.

Otherwise, errors occur because PluginDir is not found in the same location as when Slurm was compiled.

==============================

Paginate describe_instances in UpdateHeadNode lambda

==============================

Add check for min memory of 4 GB for slurm controller
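
The diff that implements this check is in cdk_slurm_stack.py and is not rendered on this page, so the following is only a sketch of what such a check could look like. The function name check_controller_memory and the INSTANCE_MEMORY_MIB table are hypothetical; real code would more likely read the memory size from EC2 (describe_instance_types reports it as MemoryInfo.SizeInMiB).

```python
MIN_CONTROLLER_MEMORY_GB = 4  # threshold stated in the commit message

# Hypothetical lookup table for illustration only.
INSTANCE_MEMORY_MIB = {
    't3.micro': 1024,
    'c6a.large': 4096,
    'c6g.large': 4096,
}


def check_controller_memory(instance_type):
    """Raise ValueError if the instance type has less than the minimum memory."""
    memory_gb = INSTANCE_MEMORY_MIB[instance_type] / 1024
    if memory_gb < MIN_CONTROLLER_MEMORY_GB:
        raise ValueError(
            f"{instance_type} has {memory_gb:g} GB of memory; "
            f"the Slurm controller needs at least {MIN_CONTROLLER_MEMORY_GB} GB."
        )
    return memory_gb


ok_memory_gb = check_controller_memory('c6a.large')

try:
    check_controller_memory('t3.micro')
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```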
cartalla committed Mar 8, 2024
1 parent a8b6555 commit 0a8a45c
Showing 52 changed files with 1,442 additions and 1,151 deletions.
357 changes: 309 additions & 48 deletions source/cdk/cdk_slurm_stack.py

Large diffs are not rendered by default.

49 changes: 41 additions & 8 deletions source/cdk/config_schema.py
@@ -35,6 +35,7 @@
 logger.setLevel(logging.INFO)

 # MIN_PARALLEL_CLUSTER_VERSION
+# Releases: https://github.com/aws/aws-parallelcluster/releases
 # 3.2.0:
 # * Add support for memory-based job scheduling in Slurm
 # 3.3.0:
@@ -61,7 +62,7 @@
 # * Fix pmix CVE
 # * Use Slurm 23.02.5
 MIN_PARALLEL_CLUSTER_VERSION = parse_version('3.6.0')
-DEFAULT_PARALLEL_CLUSTER_VERSION = parse_version('3.8.0')
+# Update source/resources/default_config.yml with latest version when this is updated.
 PARALLEL_CLUSTER_VERSIONS = [
     '3.6.0',
     '3.6.1',
@@ -124,7 +125,7 @@
 ]

 def get_parallel_cluster_version(config):
-    return config['slurm']['ParallelClusterConfig'].get('Version', str(DEFAULT_PARALLEL_CLUSTER_VERSION))
+    return config['slurm']['ParallelClusterConfig']['Version']

 def get_PARALLEL_CLUSTER_MUNGE_VERSION(config):
     parallel_cluster_version = get_parallel_cluster_version(config)
@@ -185,6 +186,38 @@ def PARALLEL_CLUSTER_SUPPORTS_HOME_MOUNT(parallel_cluster_version):
     logger.error(f"{fg('red')}Unable to list all AWS regions. Make sure you have set your IAM credentials. {err} {attr('reset')}")
     exit(1)

+VALID_ARCHITECTURES = ['arm64', 'x86_64']
+
+DEFAULT_ARCHITECTURE = 'x86_64'
+
+# Controller needs at least 4 GB or will hit OOM
+
+DEFAULT_ARM_CONTROLLER_INSTANCE_TYPE = 'c6g.large'
+
+DEFAULT_X86_CONTROLLER_INSTANCE_TYPE = 'c6a.large'
+
+def default_controller_instance_type(config):
+    architecture = config['slurm']['ParallelClusterConfig'].get('Architecture', DEFAULT_ARCHITECTURE)
+    if architecture == 'x86_64':
+        return DEFAULT_X86_CONTROLLER_INSTANCE_TYPE
+    elif architecture == 'arm64':
+        return DEFAULT_ARM_CONTROLLER_INSTANCE_TYPE
+    else:
+        raise ValueError(f"Invalid architecture: {architecture}")
+
+DEFAULT_ARM_OS = 'rhel8'
+
+DEFAULT_X86_OS = 'rhel8'
+
+def DEFAULT_OS(config):
+    architecture = config['slurm']['ParallelClusterConfig'].get('Architecture', DEFAULT_ARCHITECTURE)
+    if architecture == 'x86_64':
+        return DEFAULT_X86_OS
+    elif architecture == 'arm64':
+        return DEFAULT_ARM_OS
+    else:
+        raise ValueError(f"Invalid architecture: {architecture}")
+
 filesystem_lifecycle_policies = [
     'None',
     'AFTER_14_DAYS',
@@ -350,12 +383,12 @@ def get_config_schema(config):
         'slurm': {
             Optional('ParallelClusterConfig'): {
                 Optional('Enable', default=True): And(bool, lambda s: s == True),
-                Optional('Version', default=str(DEFAULT_PARALLEL_CLUSTER_VERSION)): And(str, lambda version: version in PARALLEL_CLUSTER_VERSIONS, lambda version: parse_version(version) >= MIN_PARALLEL_CLUSTER_VERSION),
-                Optional('Image', default={'Os': 'centos7'}): {
-                    'Os': And(str, lambda s: s in PARALLEL_CLUSTER_ALLOWED_OSES, ),
+                'Version': And(str, lambda version: version in PARALLEL_CLUSTER_VERSIONS, lambda version: parse_version(version) >= MIN_PARALLEL_CLUSTER_VERSION),
+                Optional('Image', default={'Os': DEFAULT_OS(config)}): {
+                    'Os': And(str, lambda s: s in PARALLEL_CLUSTER_ALLOWED_OSES),
                     Optional('CustomAmi'): And(str, lambda s: s.startswith('ami-')),
                 },
-                Optional('Architecture', default='x86_64'): And(str, lambda s: s in ['arm64', 'x86_64']),
+                Optional('Architecture', default=DEFAULT_ARCHITECTURE): And(str, lambda s: s in VALID_ARCHITECTURES),
                 Optional('ComputeNodeAmi'): And(str, lambda s: s.startswith('ami-')),
                 Optional('DisableSimultaneousMultithreading', default=True): bool,
                 # Recommend to not use EFA unless necessary to avoid insufficient capacity errors when starting new instances in group or when multiple instance types in the group
@@ -424,13 +457,13 @@ def get_config_schema(config):
             # If the secret doesn't exist one will be created, but won't be part of the cloudformation stack
             # so that it won't be deleted when the stack is deleted.
             # Required if your submitters need to use more than 1 cluster.
-            Optional('MungeKeySecret'): str,
+            Optional('MungeKeySecret', default='/slurm/munge_key'): str,
             #
             # SlurmCtl:
             # Required, but can be an empty dict to accept all of the defaults
             'SlurmCtl': {
                 Optional('SlurmdPort', default=6818): int,
-                Optional('instance_type', default='c6a.large'): str,
+                Optional('instance_type', default=default_controller_instance_type(config)): str,
                 Optional('volume_size', default=200): int,
                 Optional('CloudWatchPeriod', default=5): int,
                 Optional('PreemptMode', default='REQUEUE'): And(str, lambda s: s in ['OFF', 'CANCEL', 'GANG', 'REQUEUE', 'SUSPEND']),
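
The architecture-based defaults in config_schema.py can be exercised on their own. The sketch below reproduces the behavior of default_controller_instance_type from the diff, but uses a lookup table instead of the if/elif chain. The table name DEFAULT_CONTROLLER_INSTANCE_TYPES and the two sample config dicts are assumptions made for this example and do not appear in the commit.

```python
DEFAULT_ARCHITECTURE = 'x86_64'

# Same defaults as in the diff, expressed as a lookup table (hypothetical name).
DEFAULT_CONTROLLER_INSTANCE_TYPES = {
    'x86_64': 'c6a.large',
    'arm64': 'c6g.large',
}


def default_controller_instance_type(config):
    """Pick the default head node instance type for the configured architecture."""
    architecture = config['slurm']['ParallelClusterConfig'].get('Architecture', DEFAULT_ARCHITECTURE)
    if architecture not in DEFAULT_CONTROLLER_INSTANCE_TYPES:
        raise ValueError(f"Invalid architecture: {architecture}")
    return DEFAULT_CONTROLLER_INSTANCE_TYPES[architecture]


# Sample raw configs, as they might look after loading the YAML file.
arm_config = {'slurm': {'ParallelClusterConfig': {'Version': '3.8.0', 'Architecture': 'arm64'}}}
default_arch_config = {'slurm': {'ParallelClusterConfig': {'Version': '3.8.0'}}}

arm_default = default_controller_instance_type(arm_config)
x86_default = default_controller_instance_type(default_arch_config)
print(arm_default, x86_default)
```

As in the diff, the helper indexes config['slurm']['ParallelClusterConfig'] directly, so it assumes that key is present in the raw config before the schema defaults are applied.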
72 changes: 68 additions & 4 deletions source/resources/config/default_config.yml
@@ -2,28 +2,92 @@
#====================================================================
# Sample configuration that creates a minimal Slurm cluster
#
# NOTE: This is just an example.
# Please create your own revision controlled config file.
#
# No SlurmDbd in this configuration.
# Configure 5 each of t3 instance types.
#
# This config doesn't provide required parameters like VpcId so you must
# use the --prompt option with it.
# To use:
# source setup.sh
# ./install.sh --config-file source/config/default_config.yml --prompt
#
# Defaults and valid configuration options are in source/config_schema.py.
# Command line values override values in the config file.
#====================================================================

StackName: slurmminimal-config

# @TODO: Add Region
# Region: {{Region}}

# @TODO: Add your SshKeyPair
# SshKeyPair: {{SshKeyPair}}

# @TODO: Update with your VPC
# VpcId: vpc-xxxxxxxxxxxxxxxxx

# @TODO: Update with your private subnet in your VPC
# SubnetId: subnet-xxxxxxxxxxxxxxxxx

# @TODO: Update with your SNS Topic. Make sure to subscribe your email address to the topic and confirm the subscription
# ErrorSnsTopicArn: arn:aws:sns:{{Region}}:{{AccountId}}:{{TopicName}}

# @TODO: Add your preferred timezone so times aren't in UTC
# TimeZone: America/Chicago # America/Los_Angeles or America/Denver or America/New_York

# @TODO: If using Research and Engineering Studio, update with environment name
# RESEnvironmentName: {{ResEnvironmentName}}

slurm:
ParallelClusterConfig:
Enable: true
Version: 3.8.0
# @TODO: Choose the CPU architecture: x86_64, arm64. Default: x86_64
# Architecture: x86_64
# @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
# Database:
# DatabaseStackName: {{DatabaseStackName}}

MungeKeySecret: SlurmMungeKey

SlurmCtl: {}

# InstanceConfig:
# Configure the instances used by the cluster
# A partition will be created for each combination of Base OS, Architecture, and Spot
InstanceConfig:
UseSpot: true
Include:
# @TODO: Update InstanceFamilies and InstanceTypes to use in your cluster
InstanceFamilies:
- t3
InstanceTypes: []
NodeCounts:
# @TODO: Update the max number of each instance type to configure
DefaultMaxCount: 5
# @TODO: You can update the max instance count for each compute resource
# ComputeResourceCounts:
# od-1024gb-16-cores: # x2iedn.8xlarge', x2iezn.8xlarge
# MaxCount: 1
# sp-1024gb-16-cores: # x2iedn.8xlarge', x2iezn.8xlarge
# MaxCount: 2

# @TODO: Configure storage mounts
# storage:
# ExtraMounts:
# - dest: /home
# StorageType: Efs
# FileSystemId: 'fs-xxxxxxxxxxxxxxxxx'
# src: fs-xxxxxxxxxxxxxxxxx.efs.{{Region}}.amazonaws.com:/
# type: nfs4
# options: nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport
# ExtraMountSecurityGroups:
# nfs:
# DCV-Host: sg-xxxxxxxxxxxxxxxxx

# @TODO: Configure license counts
Licenses:
vcs:
Count: 10
Server: synopsys_licenses
Port: '24680'
ServerType: flexlm
84 changes: 84 additions & 0 deletions source/resources/config/slurm_all_arm_instance_types.yml
@@ -0,0 +1,84 @@
---
#====================================================================
# Minimal cluster with all arm64 instance types
#
# NOTE: This is just an example.
# Please create your own revision controlled config file.
#
# No SlurmDbd in this configuration.
# Configure 5 each of all arm64 instance types.
#
# Defaults and valid configuration options are in source/config_schema.py.
# Command line values override values in the config file.
#====================================================================

StackName: slurm-all-arm-config

# @TODO: Add Region
# Region: {{Region}}

# @TODO: Add your SshKeyPair
# SshKeyPair: {{SshKeyPair}}

# @TODO: Update with your VPC
# VpcId: vpc-xxxxxxxxxxxxxxxxx

# @TODO: Update with your private subnet in your VPC
# SubnetId: subnet-xxxxxxxxxxxxxxxxx

# @TODO: Update with your SNS Topic. Make sure to subscribe your email address to the topic and confirm the subscription
# ErrorSnsTopicArn: arn:aws:sns:{{Region}}:{{AccountId}}:{{TopicName}}

# @TODO: Add your preferred timezone so times aren't in UTC
# TimeZone: America/Chicago # America/Los_Angeles or America/Denver or America/New_York

# @TODO: If using Research and Engineering Studio, update with environment name
# RESEnvironmentName: {{ResEnvironmentName}}

slurm:
ParallelClusterConfig:
Version: 3.8.0
Architecture: arm64
# @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
# Database:
# DatabaseStackName: {{DatabaseStackName}}

MungeKeySecret: SlurmMungeKey

SlurmCtl: {}

InstanceConfig:
UseSpot: true
Include:
InstanceFamilies: ['.*']
InstanceTypes: []
NodeCounts:
# @TODO: Update the max number of each instance type to configure
DefaultMaxCount: 5
# @TODO: You can update the max instance count for each compute resource
ComputeResourceCounts:
od-1024gb-64-cores: # x2gd.16xlarge
MaxCount: 1
sp-1024gb-64-cores: # x2gd.16xlarge
MaxCount: 2

# @TODO: Configure storage mounts
# storage:
# ExtraMounts:
# - dest: /home
# StorageType: Efs
# FileSystemId: 'fs-xxxxxxxxxxxxxxxxx'
# src: fs-xxxxxxxxxxxxxxxxx.efs.{{Region}}.amazonaws.com:/
# type: nfs4
# options: nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport
# ExtraMountSecurityGroups:
# nfs:
# DCV-Host: sg-xxxxxxxxxxxxxxxxx

# @TODO: Configure license counts
Licenses:
vcs:
Count: 10
Server: synopsys_licenses
Port: '24680'
ServerType: flexlm
26 changes: 0 additions & 26 deletions source/resources/config/slurm_all_instance_types.yml

This file was deleted.

37 changes: 0 additions & 37 deletions source/resources/config/slurm_all_os.yml

This file was deleted.
