Generalize RES configuration of login nodes and user/group json #255

Merged
@@ -122,6 +122,9 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
for dst_sg_name, dst_sg in lustre_security_groups.items():
src_sg.connections.allow_to(dst_sg, ec2.Port.tcp(988), f"{src_sg_name} to {dst_sg_name} lustre")
src_sg.connections.allow_to(dst_sg, ec2.Port.tcp_range(1018, 1023), f"{src_sg_name} to {dst_sg_name} lustre")
# It shouldn't be necessary to do allow_to and allow_from, but CDK left off the ingress rule from lustre to lustre if I didn't add the allow_from.
dst_sg.connections.allow_from(src_sg, ec2.Port.tcp(988), f"{src_sg_name} to {dst_sg_name} lustre")
dst_sg.connections.allow_from(src_sg, ec2.Port.tcp_range(1018, 1023), f"{src_sg_name} to {dst_sg_name} lustre")

# Rules for FSx Ontap
for fsx_client_sg_name, fsx_client_sg in fsx_client_security_groups.items():
@@ -138,12 +141,21 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
fsx_client_sg.connections.allow_to(fsx_ontap_sg, ec2.Port.udp(4046), f"{fsx_client_sg_name} to {fsx_ontap_sg_name} Network status monitor for NFS")

for fsx_zfs_sg_name, fsx_zfs_sg in zfs_security_groups.items():
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.tcp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.udp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.tcp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.udp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.tcp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.udp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.tcp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.udp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.tcp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.udp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.tcp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.udp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
# There is a bug in PC 3.10.1 that requires outbound traffic to be enabled even though ZFS doesn't require it.
# Remove when bug in PC is fixed.
# Tracked by https://github.com/aws-samples/aws-eda-slurm-cluster/issues/253
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.tcp(111), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.udp(111), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.tcp(2049), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.udp(2049), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.tcp_range(20001, 20003), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.udp_range(20001, 20003), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS mount, status monitor, and lock daemon")

for sg_name, sg in security_groups.items():
CfnOutput(self, f"{sg_name}Id",
69 changes: 68 additions & 1 deletion docs/config.md
@@ -14,6 +14,16 @@ This project creates a ParallelCluster configuration file that is documented in
<a href="#timezone">TimeZone</a>: str
<a href="#additionalsecuritygroupsstackname">AdditionalSecurityGroupsStackName</a>: str
<a href="#resstackname">RESStackName</a>: str
<a href="#externalloginnodes">ExternalLoginNodes</a>:
- <a href="#tags">Tags</a>:
- Key: str
Values: [ str ]
SecurityGroupId: str
<a href="#domainjoinedinstance">DomainJoinedInstance</a>:
- <a href="#tags">Tags</a>:
- Key: str
Values: [ str ]
SecurityGroupId: str
<a href="#slurm">slurm</a>:
<a href="#parallelclusterconfig">ParallelClusterConfig</a>:
<a href="#version">Version</a>: str
@@ -212,6 +222,63 @@ This requires you to [configure security groups for external login nodes](../dep
The Slurm binaries will be compiled for the OS of the desktops and an environment modulefile will be created
so that the users just need to load the cluster modulefile to use the cluster.

### ExternalLoginNodes

An array of specifications for instances that should automatically be configured as Slurm login nodes.
Each array element contains one or more tags that will be used to select login node instances.
It also includes the security group id that must be attached to the login node to give it access to the slurm cluster.
The tags for a group of instances are specified as an array of tag keys, each with an array of values.

A lambda function processes each login node specification.
It uses the tags to select running instances.
If the instances do not have the security group attached, then it will attach the security group.
It will then run a script on each instance to configure it as a login node for the Slurm cluster.
To use the cluster, users simply load the environment modulefile that is created by the script.

For example, to configure RES virtual desktops as Slurm login nodes, add the following configuration.

```
---
ExternalLoginNodes:
  - Tags:
      - Key: 'res:EnvironmentName'
        Values: [ 'res-eda' ]
      - Key: 'res:NodeType'
        Values: ['virtual-desktop-dcv-host']
    SecurityGroupId: <SlurmLoginNodeSGId>
```
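
The selection step can be illustrated with the EC2 API. The following is a minimal sketch, not the project's actual lambda; it shows how the tags above translate into `describe_instances` filters, and the `spec` dict and function name are assumptions for illustration.

```
import boto3

def select_login_nodes(spec):
    """Return the ids of running instances that match every tag filter in the specification."""
    ec2 = boto3.client('ec2')
    filters = [{'Name': 'instance-state-name', 'Values': ['running']}]
    for tag in spec['Tags']:
        filters.append({'Name': f"tag:{tag['Key']}", 'Values': tag['Values']})
    instance_ids = []
    for page in ec2.get_paginator('describe_instances').paginate(Filters=filters):
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_ids.append(instance['InstanceId'])
    return instance_ids

# The RES virtual desktop specification from the example above.
spec = {
    'Tags': [
        {'Key': 'res:EnvironmentName', 'Values': ['res-eda']},
        {'Key': 'res:NodeType', 'Values': ['virtual-desktop-dcv-host']},
    ],
    'SecurityGroupId': '<SlurmLoginNodeSGId>',
}
print(select_login_nodes(spec))
```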

### DomainJoinedInstance

A specification for a domain-joined instance that will be used to create and update users_groups.json.
It also includes the security group id that must be attached to the instance to give it access to the Slurm head node so it can mount the Slurm configuration file system.
The tags for the instance are specified as an array of tag keys, each with an array of values.

A lambda function processes the specification.
It uses the tags to select a running instance.
If the instance does not have the security group attached, then it will attach the security group.
It will then run a script on the instance that saves all of the users and groups into a json file that
is used to create local users and groups on compute nodes when they boot.

For example, to configure the RES cluster manager, the following configuration is added.

```
---
DomainJoinedInstance:
  - Tags:
      - Key: 'Name'
        Values: [ 'res-eda-cluster-manager' ]
      - Key: 'res:EnvironmentName'
        Values: [ 'res-eda' ]
      - Key: 'res:ModuleName'
        Values: [ 'cluster-manager' ]
      - Key: 'res:ModuleId'
        Values: [ 'cluster-manager' ]
      - Key: 'app'
        Values: ['virtual-desktop-dcv-host']
    SecurityGroupId: <SlurmLoginNodeSGId>
```
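
The json file itself is conceptually simple: it captures the users and groups that the domain-joined instance can see so that compute nodes can recreate them locally. The following is a minimal sketch of that idea, assuming only the standard `pwd` and `grp` modules; the output path and field names are illustrative, not the project's actual schema.

```
import grp
import json
import pwd

def dump_users_groups(path):
    """Write all users and groups visible on this host, including domain users, to a json file."""
    users = {
        u.pw_name: {'uid': u.pw_uid, 'gid': u.pw_gid, 'home': u.pw_dir, 'shell': u.pw_shell}
        for u in pwd.getpwall()
    }
    groups = {
        g.gr_name: {'gid': g.gr_gid, 'members': g.gr_mem}
        for g in grp.getgrall()
    }
    with open(path, 'w') as f:
        json.dump({'users': users, 'groups': groups}, f, indent=2)

# Illustrative path only; the real script writes to the Slurm configuration file system.
dump_users_groups('/tmp/users_groups.json')
```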

## slurm

Slurm configuration parameters.
@@ -425,7 +492,7 @@ Otherwise add "-cl" to end of StackName.
AWS secret with a base64 encoded munge key to use for the cluster.
For an existing secret, this can be the secret name or the ARN.
If the secret doesn't exist, one will be created, but it won't be part of the CloudFormation stack so that it won't be deleted when the stack is deleted.
Required if your submitters need to use more than 1 cluster.
Required if your login nodes need to use more than 1 cluster.

See [Create Munge Key](../deployment-prerequisites#create-munge-key) for more details.
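
As an illustration of what the secret holds, the following sketch generates a random key, base64 encodes it, and stores it in Secrets Manager with boto3. The secret name and the 1024-byte key size are assumptions for illustration; follow the Create Munge Key instructions for the supported procedure.

```
import base64
import secrets

import boto3

# Generate a random key and base64 encode it, which is the format the config expects.
munge_key_b64 = base64.b64encode(secrets.token_bytes(1024)).decode()

# Store it in Secrets Manager. The secret name is a placeholder.
secretsmanager = boto3.client('secretsmanager')
response = secretsmanager.create_secret(Name='SlurmMungeKey', SecretString=munge_key_b64)
print(response['ARN'])
```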

55 changes: 55 additions & 0 deletions docs/deployment-prerequisites.md
@@ -441,3 +441,58 @@ slurm:
ansys:
Count: 1
```

### Configure File Systems

The Storage/ExtraMounts parameter allows you to configure additional file systems to mount on compute nodes.
Note that the security groups for the file systems must allow connections from the compute nodes.
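
The rules involved mirror the CDK code earlier in this PR. The following is a minimal sketch of a representative subset, assuming you already hold references to the compute node and file system security groups; the function and parameter names are illustrative, and the full rule set is in the stack code above.

```
from aws_cdk import aws_ec2 as ec2

def allow_compute_nodes_to_file_systems(compute_node_sg, fsx_lustre_sg, fsx_nfs_sg):
    """Open the ports that FSx for Lustre and the NFS-based FSx file systems require from compute nodes."""
    # FSx for Lustre: LNET traffic from clients to the file system.
    compute_node_sg.connections.allow_to(fsx_lustre_sg, ec2.Port.tcp(988), "compute nodes to lustre")
    compute_node_sg.connections.allow_to(fsx_lustre_sg, ec2.Port.tcp_range(1018, 1023), "compute nodes to lustre")
    # FSx for ONTAP and OpenZFS: rpc and NFS traffic from clients to the file system.
    compute_node_sg.connections.allow_to(fsx_nfs_sg, ec2.Port.tcp(111), "compute nodes to rpc for NFS")
    compute_node_sg.connections.allow_to(fsx_nfs_sg, ec2.Port.udp(111), "compute nodes to rpc for NFS")
    compute_node_sg.connections.allow_to(fsx_nfs_sg, ec2.Port.tcp(2049), "compute nodes to NFS server daemon")
    compute_node_sg.connections.allow_to(fsx_nfs_sg, ec2.Port.udp(2049), "compute nodes to NFS server daemon")
```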

#### Lustre

The following example shows how to add an FSx for Lustre file system.
The mount information can be found from the FSx console.

```
storage:
  ExtraMounts:
    - dest: /lustre
      src: <FileSystemId>.fsx.<Region>.amazonaws.com@tcp:/<MountName>
      StorageType: FsxLustre
      FileSystemId: <FileSystemId>
      type: lustre
      options: relatime,flock
```

#### ONTAP

The following example shows how to add an FSx for NetApp ONTAP file system.
The mount information can be found from the FSx console.

```
storage:
  ExtraMounts:
    - dest: /ontap
      src: <SvmId>.<FileSystemId>.fsx.<Region>.amazonaws.com:/vol1
      StorageType: FsxOntap
      FileSystemId: <FileSystemId>
      VolumeId: <VolumeId>
      type: nfs
      options: default
```

#### ZFS

The following example shows how to add an FSx for OpenZFS file system.
The mount information can be found from the FSx console.

```
storage:
  ExtraMounts:
    - dest: /zfs
      src: <FileSystemId>.fsx.<Region>.amazonaws.com:/fsx
      StorageType: FsxOpenZfs
      FileSystemId: <FileSystemId>
      VolumeId: <VolumeId>
      type: nfs
      options: noatime,nfsvers=3,sync,nconnect=16,rsize=1048576,wsize=1048576
```