Update config files and fix errors found in testing new configs (#214)
Add --RESEnvironmentName to the installer

Ease initial integration with Research and Engineering Studio (RES).

Automatically add the correct submitter security groups and configure
the /home directory.

Automatically choose the subnets if not specified based on RES subnets.

Resolves #207

============================

Update template config files

Added more comments to clarify that these are examples that should be copied
and customized by users.

Added comments for typical configuration options.

Deleted obsolete configs that were from v1.

Resolves #203

=============================

Set default head node instance type based on architecture.

Resolves #206

==============================

Clean up ansible-lint errors and warnings.
Arm architecture cluster was failing because of an incorrect condition in the ansible playbook that is flagged by lint.

==============================

Use vdi controller instead of cluster manager for users and groups info

Cluster manager stopped being domain joined for some reason.

==============================

Paginate describe_instances when creating the head node A record.

Otherwise, may not find the cluster head node instance.
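
A minimal sketch of the pagination pattern, not the code from this commit. It assumes a boto3 EC2 client is passed in; the function name `find_instances` and the tag filter in the usage comment are hypothetical.

```python
def find_instances(ec2_client, filters):
    """Return every instance matching `filters`, following all result pages.

    A single describe_instances call returns only one page of results, so an
    instance that falls on a later page would be missed without pagination.
    """
    paginator = ec2_client.get_paginator("describe_instances")
    instances = []
    for page in paginator.paginate(Filters=filters):
        for reservation in page["Reservations"]:
            instances.extend(reservation["Instances"])
    return instances


# Usage (requires boto3 and AWS credentials; the tag name is an assumption):
#   import boto3
#   head_nodes = find_instances(
#       boto3.client("ec2"),
#       [{"Name": "tag:Name", "Values": ["HeadNode"]}],
#   )
```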

==============================

Add default MungeKeySecret.

This should be the default or you can't access multiple clusters from the same server.

==============================

Increase timeout for ssm command that configures submitters

Need the time to compile slurm.

==============================

Force slurm to be rebuilt for submitters of all os distributions even if they match the os of the cluster.

Otherwise get errors because can't find PluginDir in the same location as when it was compiled.

==============================

Paginate describe_instances in UpdateHeadNode lambda

==============================

Add check for min memory of 4 GB for slurm controller
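
One way such a check could look, as a sketch only. It assumes the memory size is read from the EC2 `describe_instance_types` API; the constant and function names are hypothetical and the actual check in this commit may be implemented differently.

```python
MIN_CONTROLLER_MEMORY_MIB = 4 * 1024  # 4 GB expressed in MiB


def check_controller_memory(ec2_client, instance_type):
    """Raise ValueError if `instance_type` has less than 4 GB of memory."""
    response = ec2_client.describe_instance_types(InstanceTypes=[instance_type])
    memory_mib = response["InstanceTypes"][0]["MemoryInfo"]["SizeInMiB"]
    if memory_mib < MIN_CONTROLLER_MEMORY_MIB:
        raise ValueError(
            f"{instance_type} has {memory_mib} MiB of memory; "
            f"the Slurm controller needs at least {MIN_CONTROLLER_MEMORY_MIB} MiB"
        )
    return memory_mib
```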

==============================

Update documentation.

Remove Regions from InstanceConfig. This was left over from legacy cluster.
ParallelCluster doesn't support multiple regions.
cartalla authored Mar 22, 2024
1 parent a8b6555 commit 58f70e7
Showing 63 changed files with 1,639 additions and 1,534 deletions.
88 changes: 6 additions & 82 deletions docs/debug.md
@@ -1,53 +1,12 @@
# Debug

## Log Files on File System
For ParallelCluster and Slurm issues, refer to the official [AWS ParallelCluster Troubleshooting documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html).

Most of the key log files are stored on the Slurm file system so that they can be accessed from any instance with the file system mounted.

| Logfile | Description
|---------|------------
| `/opt/slurm/{{ClusterName}}/logs/nodes/{{node-name}}/slurmd.log` | Slurm daemon (slurmd) logfile
| `/opt/slurm/{{ClusterName}}/logs/nodes/{{node-name}}/spot_monitor.log` | Spot monitor logfile
| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/cloudwatch.log` | Cloudwatch cron (slurm_ec2_publish_cw.py) logfile
| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/power_save.log` | Power saving API logfile
| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/slurmctld.log` | Slurm controller daemon (slurmctld) logfile
| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/terminate_old_instances.log` | Terminate old instances cron (terminate_old_instances.py) logfile
| `/opt/slurm/{{ClusterName}}/logs/slurmdbd/slurmdbd.log` | Slurm database daemon (slurmdbd) logfile

## Slurm AMI Nodes

The Slurm AMI nodes build the Slurm binaries for all of the configured operating system (OS) variants.
The Amazon Linux 2 build is a prerequisite for the Slurm controllers and slurmdbd instances.
The other builds are prerequisites for compute nodes and submitters.

First check for errors in the user data script. The following command will show the output:

`grep cloud-init /var/log/messages | less`

The most common problem is that the ansible playbook failed.
Check the ansible log file to see what failed.

`less /var/log/ansible.log`

The following command will rerun the user data.
It will download the playbooks from the S3 deployment bucket and then run it to configure the instance.

`/var/lib/cloud/instance/scripts/part-001`

If the problem is with the ansible playbook, then you can edit it in /root/playbooks and then run
your modified playbook by running the following command.

`/root/slurm_node_ami_config.sh`

## Slurm Controller
## Slurm Head Node

If slurm commands hang, then it's likely a problem with the Slurm controller.

The first thing to check is the controller's logfile which is stored on the Slurm file system.

`/opt/slurm/{{ClusterName}}/logs/nodes/slurmctl[1-2]/slurmctld.log`

If the logfile doesn't exist or is empty then you will need to connect to the slurmctl instance using SSM Manager or ssh and switch to the root user.
Connect to the head node from the EC2 console using SSM Manager or ssh and switch to the root user.

`sudo su`

@@ -59,24 +18,14 @@ If it isn't then first check for errors in the user data script. The following c

`grep cloud-init /var/log/messages | less`

The most common problem is that the ansible playbook failed.
Check the ansible log file to see what failed.
Then check the controller's logfile.

`less /var/log/ansible.log`
`/var/log/slurmctld.log`

The following command will rerun the user data.
It will download the playbooks from the S3 deployment bucket and then run it to configure the instance.

`/var/lib/cloud/instance/scripts/part-001`

If the problem is with the ansible playbook, then you can edit it in /root/playbooks and then run
your modified playbook by running the following command.

`/root/slurmctl_config.sh`

The daemon may also be failing because of some other error.
Check the `slurmctld.log` for errors.

Another way to debug the `slurmctld` daemon is to launch it interactively with debug set high.
The first thing to do is get the path to the slurmctld binary.

@@ -90,31 +39,6 @@ Then you can run slurmctld:
$slurmctld -D -vvvvv
```

### Slurm Controller Log Files

| Logfile | Description
|---------|------------
| `/var/log/ansible.log` | Ansible logfile
| `/var/log/slurm/cloudwatch.log` | Logfile for the script that uploads CloudWatch events.
| `/var/log/slurm/slurmctld.log` | slurmctld logfile
| `/var/log/slurm/power_save.log` | Slurm plugin logfile with power saving scripts that start, stop, and terminated instances.
| `/var/log/slurm/terminate_old_instances.log` | Logfile for the script that terminates stopped instances.

## Slurm Accounting Database (slurmdbd)

If you are having problems with the slurm accounting database connect to the slurmdbd instance using SSM Manager.

Check for cloud-init and ansible errors the same way as for the slurmctl instance.

Also check the `slurmdbd.log` for errors.

### Log Files

| Logfile | Description
|---------|------------
| `/var/log/ansible.log` | Ansible logfile
| `/var/log/slurm/slurmdbd.log` | slurmctld logfile

## Compute Nodes

If there are problems with the compute nodes, connect to them using SSM Manager.
@@ -132,7 +56,7 @@ Check that the slurm daemon is running.

| Logfile | Description
|---------|------------
| `/var/log/slurm/slurmd.log` | slurmd logfile
| `/var/log/slurmd.log` | slurmd logfile

## Job Stuck in Pending State

51 changes: 10 additions & 41 deletions docs/delete-cluster.md
@@ -1,45 +1,14 @@
# Delete Cluster (legacy)
# Delete Cluster

Most of the resources can be deleted by simply deleting the cluster's CloudFormation stack.
However, there are a couple of resources that must be manually deleted:
To delete the cluster all you need to do is delete the configuration CloudFormation stack.
This will delete the ParallelCluster cluster and all of the configuration resources.

* The Slurm RDS database
* The Slurm file system
If you specified RESEnvironmentName then it will also deconfigure the creation of `users_groups.json` and also deconfigure the VDI
instances so they are no longer using the cluster.

The deletion of the CloudFormation stack will fail because of these 2 resources and some resources that are used
by them will also fail to delete.
Manually delete the resources and then retry deleting the CloudFormation stack.
If you deployed the Slurm database stack then you can keep that and use it for other clusters.
If you don't need it anymore, then you can delete the stack.
You will also need to manually delete the RDS database.

## Manually Delete RDS Database

If the database contains production data then it is highly recommended that you back up the data.
You could also keep the database and use it for creating new clusters.


Even after deleting the database CloudFormation may say that it failed to delete.
Confirm in the RDS console that it deleted and then ignore the resource when retrying the stack deletion.

* Go to the RDS console
* Select Databases on the left
* Remove deletion protection
* Select the cluster's database
* Click `Modify`
* Expand `Additional scaling configuration`
* Uncheck `Scale the capacity to 0 ACIs when cluster is idle`
* Uncheck `Enable deletion protection`
* Click `Continue`
* Select `Apply immediately`
* Click `Modify cluster`
* Delete the database
* Select the cluster's database
* Click `Actions` -> `Delete`
* Click `Delete DB cluster`

## Manually delete the Slurm file system

### FSx for OpenZfs

* Go to the FSx console
* Select the cluster's file system
* Click `Actions` -> `Delete file system`
* Click `Delete file system`
If you deployed the ParallelCluster UI then you can keep it and use it with other clusters.
If you don't need it anymore then you can delete the stack.
44 changes: 15 additions & 29 deletions docs/deployment-prerequisites.md
@@ -96,18 +96,18 @@ You should save your selections in the config file.

| Parameter | Description | Valid Values | Default
|------------------------------------|-------------|--------------|--------
| [StackName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L221) | The cloudformation stack that will deploy the cluster. | | None
| [slurm/ClusterName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L318-L320) | Name of the Slurm cluster | For ParallelCluster, shouldn't be the same as StackName | None
| [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L222-L223) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
| [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L226-L227) | The vpc where the cluster will be deployed. | vpc-* | None
| [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L224-L225) | EC2 Keypair to use for instances | | None
| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L435-L439) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
| [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L233-L234) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
| [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L444-L509) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)
| [StackName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L366-L367) | The cloudformation stack that will deploy the cluster. | | None
| [slurm/ClusterName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L447-L452) | Name of the Slurm cluster | For ParallelCluster, shouldn't be the same as StackName | None
| [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L368-L369) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
| [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L372-L373) | The vpc where the cluster will be deployed. | vpc-* | None
| [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L370-L371) | EC2 Keypair to use for instances | | None
| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L480-L485) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
| [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L379-L380) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
| [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)

### Configure the Compute Instances

The [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L444-L509) configuration parameter configures the base operating systems, CPU architectures, instance families,
The [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) configuration parameter configures the base operating systems, CPU architectures, instance families,
and instance types that the Slurm cluster should support.
ParallelCluster currently doesn't support heterogeneous clusters;
all nodes must have the same architecture and Base OS.
@@ -118,6 +118,7 @@ all nodes must have the same architecture and Base OS.
| CentOS 7 | x86_64
| RedHat 7 | x86_64
| RedHat 8 | x86_64, arm64
| Rocky 8 | x86_64, arm64

You can exclude instances types by family or specific instance type.
By default the InstanceConfig excludes older generation instance families.
@@ -134,19 +135,16 @@ The disadvantage is higher cost if the instance is lightly loaded.
The default InstanceConfig includes all supported base OSes and architectures and burstable and general purpose
instance types.

* [default instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L124-L166)
* [default instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L168-L173)
* [default excluded instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L175-L192)
* [default excluded instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L194-L197)
* [default instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L230-L271)
* [default instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L314-L319)
* [default excluded instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L321-L338)
* [default excluded instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L340-L343)

Note that instance types and families are python regular expressions.
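
As a rough illustration of how such patterns behave, the sketch below applies the exclude patterns used elsewhere in these docs with Python's `re` module. Whether the cluster code uses `fullmatch`, `match`, or `search` is an assumption here, and the helper name `matching_instance_types` is hypothetical.

```python
import re


def matching_instance_types(patterns, instance_types):
    """Return the instance types that fully match at least one pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        instance_type
        for instance_type in instance_types
        if any(regex.fullmatch(instance_type) for regex in compiled)
    ]


candidates = ["t3.micro", "t3.large", "c5.metal", "c5.xlarge"]
excluded = matching_instance_types([r".+\.(micro|nano)", r".*\.metal"], candidates)
print(excluded)  # ['t3.micro', 'c5.metal']
```

Because `.` is a regex metacharacter, an unescaped pattern such as `t3.*` also matches names like `t3a.large`; escape the dot if only the exact family is intended.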

```
slurm:
InstanceConfig:
BaseOsArchitecture:
CentOS:
7: [x86_64]
Include:
InstanceFamilies:
- t3.*
@@ -160,9 +158,6 @@ The following InstanceConfig configures instance types recommended for EDA workl
```
slurm:
InstanceConfig:
BaseOsArchitecture:
CentOS:
7: [x86_64]
Include:
InstanceFamilies:
- c5.*
@@ -186,15 +181,6 @@ slurm:
DefaultMinCount: 1
```

The Legacy cluster also allows you to specify the names of specific nodes.

```
slurm:
InstanceConfig:
AlwaysOnNodes:
- nodename-[0-4]
```

### Configure Fair Share Scheduling (Optional)

Slurm supports [fair share scheduling](https://slurm.schedmd.com/fair_tree.html), but it requires the fair share policy to be configured.
@@ -285,7 +271,7 @@ then jobs will stay pending in the queue until a job completes and frees up a li
Combined with the fairshare algorithm, this can prevent users from monopolizing licenses and preventing others from
being able to run their jobs.

Licenses are configured using the [slurm/Licenses](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L621-L629) configuration variable.
Licenses are configured using the [slurm/Licenses](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L569-L577) configuration variable.
If you are using the Slurm database then these will be configured in the database.
Otherwise, they will be configured in **/opt/slurm/{{ClusterName}}/etc/slurm_licenses.conf**.

22 changes: 5 additions & 17 deletions docs/onprem.md
@@ -1,6 +1,6 @@
# On-Premises Integration (legacy)
# On-Premises Integration

The slurm cluster can also be configured to manage on-premises compute nodes.
The Slurm cluster can also be configured to manage on-premises compute nodes.
The user must configure the on-premises compute nodes and then give the configuration information.

## Network Requirements
@@ -20,6 +20,9 @@ All of the compute nodes in the cluster, including the on-prem nodes, must have
This can involve mounting filesystems across VPN or Direct Connect or synchronizing file systems using tools like rsync or NetApp FlexCache or SnapMirror.
Performance will dictate the architecture of the file system.

The onprem compute nodes must mount the Slurm controller's NFS export so that they have access to the Slurm binaries and configuration file.
They must then be configured to run slurmd so that they can be managed by Slurm.

## Slurm Configuration of On-Premises Compute Nodes

The slurm cluster's configuration file allows the configuration of on-premises compute nodes.
@@ -29,21 +32,6 @@ All that needs to be configured are the configuration file for the on-prem nodes

```
InstanceConfig:
UseSpot: true
DefaultPartition: CentOS_7_x86_64_spot
NodesPerInstanceType: 10
BaseOsArchitecture:
CentOS: {7: [x86_64]}
Include:
MaxSizeOnly: false
InstanceFamilies:
- t3
InstanceTypes: []
Exclude:
InstanceFamilies: []
InstanceTypes:
- '.+\.(micro|nano)' # Not enough memory
- '.*\.metal'
OnPremComputeNodes:
ConfigFile: 'slurm_nodes_on_prem.conf'
CIDR: '10.1.0.0/16'