Add ParallelCluster 3.10.0 support
Add support for ParallelCluster 3.10.0.

Add alinux2023 support.

Add support for external slurmdbd instance.

Update documentation.

Change the UID of the slurm user to 401 to match what ParallelCluster uses.
Otherwise munge flags security errors because the UID of the submitter doesn't match the head node.

Change the UpdateHeadNode lambda to only do the update via ssm if the cluster isn't already being updated.

Resolves #242

Change the installer so that it checks to make sure that the cluster stack
isn't already being changed or in a bad state.

Resolves #221
cartalla committed Jul 12, 2024
1 parent 8ee5253 commit 12bf043
Showing 9 changed files with 438 additions and 85 deletions.
73 changes: 65 additions & 8 deletions docs/config.md
@@ -27,11 +27,16 @@ This project creates a ParallelCluster configuration file that is documented in
<a href="#database">Database</a>:
<a href="#databasestackname">DatabaseStackName</a>: str
<a href="#fqdn">FQDN</a>: str
<a href="#port">Port</a>: str
<a href="#database-port">Port</a>: str
<a href="#adminusername">AdminUserName</a>: str
<a href="#adminpasswordsecretarn">AdminPasswordSecretArn</a>: str
<a href="#clientsecuritygroup">ClientSecurityGroup</a>:
<a href="#database-clientsecuritygroup">ClientSecurityGroup</a>:
SecurityGroupName: SecurityGroupId
<a href="#slurmdbd">Slurmdbd</a>:
<a href="#slurmdbdstackname">SlurmdbdStackName</a>: str
<a href="#slurmdbd-host">Host</a>: str
<a href="#slurmdbd-port">Port</a>: str
<a href="#slurmdbd-clientsecuritygroup">ClientSecurityGroup</a>: str
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#HeadNode-v3-Dcv">Dcv:</a>
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-Dcv-Enabled">Enabled</a>: bool
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-Dcv-Port">Port</a>: int
@@ -304,13 +309,18 @@ See [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#p

Optional

**Note**: Starting with ParallelCluster 3.10.0, you should use slurm/ParallelClusterConfig/[Slurmdbd](#slurmdbd) instead of slurm/ParallelClusterConfig/Database.
You cannot have both parameters.

Configure the Slurm database to use with the cluster.

This is created independently of the cluster so that the same database can be used with multiple clusters.

The easiest way to do this is to use the [CloudFormation template provided by ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3) and then to just pass
the name of the stack in [DatabaseStackName](#databasestackname).
All of the other parameters will be pulled from the stack.
See [Create ParallelCluster Slurm Database](../deployment-prerequisites#create-parallelcluster-slurm-database) on the deployment prerequisites page.

If you used the [CloudFormation template provided by ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3), then the easiest way to configure it is to pass
the name of the stack in slurm/ParallelClusterConfig/Database/[DatabaseStackName](#databasestackname).
All of the other parameters will be pulled from the outputs of the stack.

See the [ParallelCluster documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#Scheduling-v3-SlurmSettings-Database).

@@ -330,7 +340,7 @@ The following parameters will be set using the outputs of the stack:

Used with the Port to set the [Uri](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmSettings-Database-Uri) of the database.

##### Port
##### Database: Port

type: int

@@ -353,11 +363,56 @@ This password is used together with AdminUserName and Slurm accounting to authen

Sets the [PasswordSecretArn](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmSettings-Database-PasswordSecretArn) parameter in ParallelCluster.

##### ClientSecurityGroup
##### Database: ClientSecurityGroup

Security group that has permissions to connect to the database.

Required to be attached to the head node that is running slurmdbd so that the port connection to the database is allows.
Required to be attached to the head node that is running slurmdbd so that the port connection to the database is allowed.

#### Slurmdbd

**Note**: This is not supported before ParallelCluster 3.10.0. If you specify this parameter then you cannot specify slurm/ParallelClusterConfig/[Database](#database).

Optional

Configure an external Slurmdbd instance to use with the cluster.
The Slurmdbd instance provides access to the shared Slurm database.

This is created independently of the cluster so that the same slurmdbd instance can be used with multiple clusters.

See [Create Slurmdbd instance](../deployment-prerequisites#create-slurmdbd-instance) on the deployment prerequisites page.

If you used the [CloudFormation template provided by ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html#external-slurmdb-accounting-step1), then the easiest way to configure it is to pass
the name of the stack in slurm/ParallelClusterConfig/Slurmdbd/[SlurmdbdStackName](#slurmdbdstackname).
All of the other parameters will be pulled from the parameters and outputs of the stack.

See the [ParallelCluster documentation for ExternalSlurmdbd](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#Scheduling-v3-SlurmSettings-ExternalSlurmdbd).

##### SlurmdbdStackName

Name of the ParallelCluster CloudFormation stack that created the Slurmdbd instance.

The following parameters will be set using the outputs of the stack:

* Host
* Port
* ClientSecurityGroup

##### Slurmdbd: Host

IP address or DNS name of the Slurmdbd instance.

##### Slurmdbd: Port

Default: 6819

Port used by the slurmdbd daemon on the Slurmdbd instance.

##### Slurmdbd: ClientSecurityGroup

Security group that has access to use the Slurmdbd instance.
This will be added as an extra security group to the head node.
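
As a rough illustration of the simplest case described above, the following shell sketch writes a configuration fragment that sets only SlurmdbdStackName. The stack name `my-slurmdbd-stack` and the file name `slurmdbd-fragment.yml` are placeholders invented for this example, and the nesting under `slurm` and `ParallelClusterConfig` is inferred from the parameter paths used on this page, so check it against your own configuration file.

```shell
# Write a hypothetical config fragment; the stack and file names are placeholders.
cat > slurmdbd-fragment.yml <<'EOF'
slurm:
  ParallelClusterConfig:
    # Do not also set Database here; the two parameters are mutually exclusive.
    Slurmdbd:
      # Host, Port, and ClientSecurityGroup are read from this stack.
      SlurmdbdStackName: my-slurmdbd-stack
EOF

# Show the fragment so it can be merged into the real config file by hand.
cat slurmdbd-fragment.yml
```

If you did not use the ParallelCluster template, you would presumably set Host, Port, and ClientSecurityGroup explicitly instead of the stack name.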

### ClusterName

@@ -373,6 +428,8 @@ For an existing secret can be the secret name or the ARN.
If the secret doesn't exist, one will be created, but it won't be part of the CloudFormation stack, so it won't be deleted when the stack is deleted.
Required if your submitters need to use more than 1 cluster.

See [Create Munge Key](../deployment-prerequisites#create-munge-key) for more details.

### SlurmCtl

Configure the Slurm head node or controller.
18 changes: 0 additions & 18 deletions docs/deploy-parallel-cluster.md
@@ -10,24 +10,6 @@ The current latest version is 3.9.1.

See [Deployment Prerequisites](deployment-prerequisites.md) page.

### Create ParallelCluster UI (optional but recommended)

It is highly recommended to create a ParallelCluster UI to manage your ParallelCluster clusters.
A different UI is required for each version of ParallelCluster that you are using.
The versions are listed in the [ParallelCluster Release Notes](https://docs.aws.amazon.com/parallelcluster/latest/ug/document_history.html).
The minimum required version is 3.6.0, which adds support for RHEL 8 and increases the number of allowed queues and compute resources.
The suggested version is at least 3.7.0 because it adds configurable compute node weights which we use to prioritize the selection of
compute nodes by their cost.

The instructions are in the [ParallelCluster User Guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html).

### Create ParallelCluster Slurm Database

The Slurm Database is required for configuring Slurm accounts, users, groups, and fair share scheduling.
If you need these and other features then you will need to create a ParallelCluster Slurm Database.
You do not need to create a new database for each cluster; multiple clusters can share the same database.
Follow the directions in this [ParallelCluster tutorial to configure slurm accounting](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3).

## Create the Cluster

To install the cluster run the install script. You can override some parameters in the config file
96 changes: 86 additions & 10 deletions docs/deployment-prerequisites.md
@@ -99,6 +99,78 @@ The version that has been tested is in the CDK_VERSION variable in the install s

The install script will try to install the prerequisites if they aren't already installed.

## Create ParallelCluster UI (optional but recommended)

It is highly recommended to create a ParallelCluster UI to manage your ParallelCluster clusters.
A different UI is required for each version of ParallelCluster that you are using.
The versions are listed in the [ParallelCluster Release Notes](https://docs.aws.amazon.com/parallelcluster/latest/ug/document_history.html).
The minimum required version is 3.6.0, which adds support for RHEL 8 and increases the number of allowed queues and compute resources.
The suggested version is at least 3.7.0 because it adds configurable compute node weights which we use to prioritize the selection of
compute nodes by their cost.

The instructions are in the [ParallelCluster User Guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html).

## Create Munge Key

Munge is a package that Slurm uses to secure communication between servers.
The munge service uses a preshared key that must be the same on all of the servers in the Slurm cluster.
If you want to be able to use multiple clusters from your submission hosts, such as virtual desktops, then all of the clusters must be using the same munge key.
This is done by creating a munge key and storing it in AWS Secrets Manager.
The secret is then passed as a parameter to ParallelCluster so that it can use it when configuring munge on all of the cluster instances.

To create the munge key and store it in AWS Secrets Manager, run the following command.

```
aws secretsmanager create-secret --name SlurmMungeKey --secret-string "$(dd if=/dev/random bs=1024 count=1 | base64 -w 0)"
```

Save the ARN of the secret for when you create the Slurmdbd instance and for when you create the configuration file.
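
A quick local check, offered as a sketch and not as part of the documented procedure, can confirm what the command above generates: 1024 random bytes encoded as a single base64 line. It reads from /dev/urandom instead of /dev/random, which is an assumption made here so the sketch does not block, and it relies on GNU `base64` for the `-w 0` and `-d` options. The commented `describe-secret` call is one way to print the ARN once the secret exists.

```shell
# Generate 1024 random bytes and base64-encode them on one line.
# Assumption: /dev/urandom is used here so the sketch never blocks.
munge_key_b64="$(dd if=/dev/urandom bs=1024 count=1 2>/dev/null | base64 -w 0)"

# Decode the key again and count the bytes to confirm its size.
decoded_len="$(printf '%s' "$munge_key_b64" | base64 -d | wc -c | tr -d ' ')"
echo "decoded key length: ${decoded_len} bytes"

# After the secret has been created, its ARN can be looked up with
# (requires AWS credentials, so it is left commented out):
# aws secretsmanager describe-secret --secret-id SlurmMungeKey --query ARN --output text
```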

See the [Slurm documentation for authentication](https://slurm.schedmd.com/authentication.html) for more information.

See the [ParallelCluster documentation for MungeKeySecretArn](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmSettings-MungeKeySecretArn).

See the [MungeKeySecret configuration parameter](../config#mungekeysecret).

## Create ParallelCluster Slurm Database

The Slurm Database is required for configuring Slurm accounts, users, groups, and fair share scheduling.
If you need these and other features then you will need to create a ParallelCluster Slurm Database.
You do not need to create a new database for each cluster; multiple clusters can share the same database.
Follow the directions in this [ParallelCluster tutorial to configure slurm accounting](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3).

## Create Slurmdbd Instance

**Note**: Before ParallelCluster 3.10.0, the slurmdbd daemon that connects to the database was created on each cluster's head node.
The recommended Slurm architecture is to have a shared slurmdbd daemon that is used by all of the clusters.
Starting in version 3.10.0, ParallelCluster supports specifying an external slurmdbd instance when you create a cluster and provides a CloudFormation template to create it.

Follow the directions in this [ParallelCluster tutorial to configure slurmdbd](https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html#external-slurmdb-accounting-step1).
This requires that you have already created the slurm database.

Here are some notes on the required parameters and how to fill them out.

| Parameter | Description
|--------------|------------
| AmiId | You can get this using the ParallelCluster UI. Click on Images and sort on Operating system. Confirm that the version is at least 3.10.0. Select the AMI for alinux2023 and the arm64 architecture.
| CustomCookbookUrl | Leave blank
| DBMSClientSG | Get this from the DatabaseClientSecurityGroup output of the database stack.
| DBMSDatabaseName | This is an arbitrary name. It must be alphanumeric. I use slurmaccounting
| DBMSPasswordSecretArn | Get this from the DatabaseSecretArn output of the database stack
| DBMSUri | Get this from the DatabaseHost output of the database stack. Note that if you copy and paste the link you should delete the https:// prefix and the trailing '/'.
| DBMSUsername | Get this from the DatabaseAdminUser output of the database stack.
| EnableSlurmdbdSystemService | Set to true. Note the warning. If the database already exists and was created with an older version of Slurm, then the database will be upgraded. This may break clusters using an older Slurm version that are still using the database. Set to false if you don't want this to happen.
| InstanceType | Choose an instance type that is compatible with the AMI. For example, m7g.large.
| KeyName | Use an existing EC2 key pair.
| MungeKeySecretArn | ARN of an existing munge key secret. See [Create Munge Key](#create-munge-key).
| PrivateIp | Choose an available IP in the subnet.
| PrivatePrefix | CIDR of the instance's subnet.
| SlurmdbdPort | 6819
| SubnetId | Preferably the same subnet where the clusters will be deployed.
| VPCId | The VPC of the subnet.
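
The DBMSUri cleanup described in the table can be done with shell parameter expansion. This is a minimal sketch; the function name `normalize_db_host` and the example hostname are made up for illustration.

```shell
# Strip a leading "https://" and one trailing "/" from a copied link.
normalize_db_host() {
  host="${1#https://}"   # drop the scheme prefix if present
  host="${host%/}"       # drop a single trailing slash if present
  printf '%s\n' "$host"
}

# Hypothetical value as it might look when copied from the console.
normalize_db_host 'https://slurm-db.example.internal/'
```

A value that is already clean passes through unchanged, so it should be safe to apply the function to whatever you copied.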

The stack name will be used in the slurm/ParallelClusterConfig/Slurmdbd/[SlurmdbdStackName](../config#slurmdbdstackname) configuration parameter.

## Security Groups for Login Nodes

If you want to allow instances like remote desktops to use the cluster directly, you must define
@@ -111,25 +183,29 @@ I'll call the three security groups the following names, but they can be whateve
* SlurmHeadNodeSG
* SlurmComputeNodeSG

First create these security groups without any security group rules.
The reason for this is that the security group rules reference the other security groups so the groups must all exist before any of the rules can be created.
After you have created the security groups then create the rules as described below.
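
The two-phase order described above might look like this with the AWS CLI. This is a sketch under stated assumptions: the helper names and the `AWS_CMD` variable are invented here, the VPC ID is a placeholder, and only one inbound rule is shown. The helpers are only defined, not called, so nothing is created until you run the commented usage lines with valid credentials.

```shell
# Command used to call the AWS CLI; override it (for example with "echo")
# to preview the calls without creating anything.
AWS_CMD="${AWS_CMD:-aws}"

# Phase 1: create a security group with no rules and print its ID.
# $1 = group name, $2 = VPC ID
create_empty_sg() {
  $AWS_CMD ec2 create-security-group \
    --group-name "$1" --description "$1" --vpc-id "$2" \
    --query GroupId --output text
}

# Phase 2: allow inbound TCP on a port or port range from another group.
# $1 = group receiving the rule, $2 = port or range, $3 = source group ID
allow_tcp_from_sg() {
  $AWS_CMD ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port "$2" --source-group "$3"
}

# Usage (placeholder VPC ID; requires AWS credentials):
# submitter_sg="$(create_empty_sg SlurmSubmitterSG vpc-0123456789abcdef0)"
# head_sg="$(create_empty_sg SlurmHeadNodeSG vpc-0123456789abcdef0)"
# allow_tcp_from_sg "$submitter_sg" 1024-65535 "$head_sg"
```

Because every group exists before any rule is added, each rule can reference its peer group by ID regardless of the order in which the rules are created.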

### Slurm Submitter Security Group

The SlurmSubmitterSG will be attached to your login nodes, such as your virtual desktops.

It needs at least the following inbound rules:

| Type | Port range | Source | Description
|------|------------|--------|------------
| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral
| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral
| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11
| Type | Port range | Source | Description | Details
|------|------------|--------------------|------------ |--------
| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral | Head node can use ephemeral ports to connect to the submitter
| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral | Compute node will connect to submitter using ephemeral ports to manage interactive shells
| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11 | Compute node can send X11 traffic to submitter for GUI applications

It needs the following outbound rules.

| Type | Port range | Destination | Description
|------|------------|-------------|------------
| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS
| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd
| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd
| Type | Port range | Destination | Description | Details
|------|------------|--------------------|-------------|--------
| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS | Mount the slurm NFS file system with binaries and config
| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd | Connect to compute node for interactive jobs
| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd | Connect to slurmdbd (accounting database) daemon on head node for versions before 3.10.0.
| TCP | 6820-6829 | SlurmHeadNodeSG | SlurmHeadNode slurmctld
| TCP | 6830 | SlurmHeadNodeSG | SlurmHeadNode slurmrestd
