Fix xio resume script (#287)
Miscellaneous Exostellar Infrastructure Optimizer integration fixes.

Updated documentation.

Add DefaultImageName to example config.

Rename some of the XIO config parameters.

* Replace ControllerSecurityGroupIds with Controllers/SecurityGroupIds
* Replace WorkerSecurityGroupIds with Workers/SecurityGroupIds

Fix a bug where an unset variable was referenced if AdditionalSecurityGroupsStackName is not set.

Change error to warning if an instance type doesn't exist in the current region.

Fix configure_xio.py script to create new resources if they don't already exist.

Fix hard-coded SLURM_CONF_PATH in resume_xspot.sh script.
Check that XIO profile name is alphanumeric.

If an XIO pool's MinMemory is not set, set it to the same value as MaxMemory.
cartalla authored Dec 16, 2024
1 parent 1ea25f7 commit 20a62f2
Showing 8 changed files with 814 additions and 179 deletions.
@@ -105,6 +105,10 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
)
security_groups['SlurmdbdSG'] = slurmdbd_sg

# Rules for compute nodes
# Allow mounting of /opt/slurm and from head node
slurm_compute_node_sg.connections.allow_to(slurm_head_node_sg, ec2.Port.tcp(2049), f"SlurmComputeNodeSG to SlurmHeadNodeSG NFS")

# Rules for login nodes
slurm_login_node_sg.connections.allow_from(slurm_head_node_sg, ec2.Port.tcp_range(1024, 65535), f"SlurmHeadNodeSG to SlurmLoginNodeSG ephemeral")
slurm_login_node_sg.connections.allow_from(slurm_compute_node_sg, ec2.Port.tcp_range(1024, 65535), f"SlurmComputeNodeSG to SlurmLoginNodeSG ephemeral")
111 changes: 64 additions & 47 deletions docs/exostellar-infrastructure-optimizer.md
@@ -49,11 +49,16 @@ Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/H
First deploy your cluster without configuring XIO.
The cluster deploys ansible playbooks that will be used to create the XIO ParallelCluster AMI.

### Install the Exostellar Management Server (EMS)
### Deploy the Exostellar Management Server (EMS)

The next step is to [install the Exostellar management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).
Exostellar will provide a link to a CloudFormation template that
will deploy the server in your account and will share 3 AMIs that are used by the template to create the EMS, controllers, and workers.
You must first subscribe to the three Exostellar Infrastructure AMIs in the AWS Marketplace.

* [Exostellar Management Server](https://aws.amazon.com/marketplace/server/procurement?productId=prod-crdnafbqnbnm2)
* [Exostellar Controller](https://aws.amazon.com/marketplace/server/procurement?productId=prod-d4lifqwlw4kja)
* [Exostellar Worker](https://aws.amazon.com/marketplace/server/procurement?productId=prod-2smeyk5fuxt7q)

Then follow the [directions to deploy the CloudFormation template](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server#v2.4.0.0InstallingwithCloudFormationTemplate(AWS)-Step3:CreateaNewStack).
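
Before continuing, you can verify that the EMS stack finished deploying. A quick check with the AWS CLI (the stack name `exostellar-management-server` matches the example configuration below; substitute the name you chose for your stack):

```
aws cloudformation describe-stacks \
    --stack-name exostellar-management-server \
    --query 'Stacks[0].StackStatus' \
    --output text
```

It should report `CREATE_COMPLETE` (or `UPDATE_COMPLETE` after an update) before you move on to the XIO configuration.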

### Create XIO Configuration

@@ -80,12 +85,15 @@ available capacity pools and increase the likelihood of running on spot.

**Note**: The Intel instance families offer more configurations and higher-memory instances, include high-frequency instance types such as m5zn, r7iz, and z1d, and tend to have more capacity. The AMD instance families include HPC instance types; however, those do not support spot pricing and can only be used on-demand.

**Note**: This is only an example configuration. You should customize it for your requirements.

```
slurm:
Xio:
ManagementServerStackName: exostellar-management-server
PartitionName: xio
AvailabilityZone: us-east-2b
DefaultImageName: <your-xio-vm-image-name>
Profiles:
- ProfileName: amd
NodeGroupName: amd
@@ -191,38 +199,6 @@ slurm:
- xiezn
- z1d
EnableHyperthreading: false
- ProfileName: intel24core350g
NodeGroupName: intel24core350g
MaxControllers: 10
InstanceTypes:
- r5.12xlarge:1
- r5d.12xlarge:2
- r6i.12xlarge:3
- r6id.12xlarge:4
- r7i.12xlarge:5
- r7iz.12xlarge:6
SpotFleetTypes:
- r5.12xlarge:1
- r5d.12xlarge:2
- r6i.12xlarge:3
- r6id.12xlarge:4
- r7i.12xlarge:5
- r7iz.12xlarge:6
EnableHyperthreading: false
- ProfileName: amd24core350g
NodeGroupName: amd24core350g
MaxControllers: 10
InstanceTypes:
- r5a.12xlarge:1
- r5ad.12xlarge:2
- r6a.12xlarge:3
- r7a.12xlarge:5
SpotFleetTypes:
- r5a.12xlarge:1
- r5ad.12xlarge:2
- r6a.12xlarge:3
- r7a.12xlarge:5
EnableHyperthreading: false
Pools:
- PoolName: amd-8-gb-1-cores
ProfileName: amd
@@ -261,18 +237,12 @@
MaxMemory: 350000
```

### Create XIO Profiles

In the EMS GUI copy the existing az1 profile to the profiles that you configured.
The name is all that matters.
The deployment will update the profile automatically from your configuration.

### Verify that the "az1" profile exists

### Create the Application Environment
In the EMS GUI go to Profiles and make sure that the "az1" profile exists.
It is used as a template to create your new profiles.

In the EMS GUI copy the **slurm** Application Environment to a new environment that is the same
name as your ParallelCluster cluster.
The deployment will update the application environment from your configuration.
If it doesn't exist, there was a problem with the EMS deployment and you should contact Exostellar support.

### Create an XIO ParallelCluster AMI

@@ -292,13 +262,18 @@ packages.

Create an AMI from the instance and wait for it to become available.

### Update the cluster with the XIO Iconfiguration
After the AMI has been successfully created, you can either stop or terminate the instance to save costs.
If you may need to do additional customization, stop it; otherwise, terminate it.
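
If you prefer to script the AMI creation and cleanup instead of using the console, a rough sketch with the AWS CLI looks like this (the instance ID and AMI name are placeholders):

```
# Create the AMI from the customized instance and capture its ID.
AMI_ID=$(aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name xio-parallelcluster-ami \
    --query ImageId --output text)

# Wait for the AMI to become available.
aws ec2 wait image-available --image-ids "$AMI_ID"

# Stop the instance if you expect to customize it further,
# or terminate it if you are done with it.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```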

### Update the cluster with the XIO configuration

Update the cluster with the XIO configuration.

This will update the profiles and environment on the EMS server and configure the cluster for XIO.
The only remaining step before you can submit jobs is to create the XIO VM image.

This is done before creating an image because the XIO scripts get deployed by this step.

### Create an XIO Image from the XIO ParallelCluster AMI

Connect to the head node and create the XIO Image from the AMI you created.
@@ -315,11 +290,53 @@ The pool, profile, and image_name should be from your configuration.
The host name doesn't matter.

```
/opt/slurm/etc/exostellar/teste_creasteVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
/opt/slurm/etc/exostellar/test_createVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
```

When this is done, the VM, worker, and controller should all terminate on their own.
If they do not, then connect to the EMS and cancel the job that started the controller.

Use `squeue` to list the controller jobs. Use `scancel` to terminate them.
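
For example, on the EMS (the job ID shown is hypothetical):

```
# List the controller jobs.
squeue

# Cancel a leftover controller job by its job ID.
scancel 12345
```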

### Run a test job using Slurm

```
srun --pty -p xio-
```

## Debug

### UpdateHeadNode resource failed

If the UpdateHeadNode resource fails, it is usually because a task in the ansible script failed.
Connect to the head node and look for errors in:

```/var/log/ansible.log```

Usually it will be a problem with the `/opt/slurm/etc/exostellar/configure_xio.py` script.
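
A quick way to find the failing task is a simple grep (adjust the pattern as needed):

```
grep -iE 'fatal|failed|error' /var/log/ansible.log | tail -n 20
```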

When this happens, the CloudFormation stack will usually be in UPDATE_ROLLBACK_FAILED status.
Before you can update it again, you will need to complete the rollback.
Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.
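
The rollback can also be continued from the AWS CLI if you prefer (the stack name is a placeholder; `UpdateHeadNode` is the logical ID of the resource to skip, as in the console steps above):

```
aws cloudformation continue-update-rollback \
    --stack-name <cluster-stack-name> \
    --resources-to-skip UpdateHeadNode
```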

### XIO Controller not starting

On EMS, check that a job is running to create the controller.

`squeue`

On EMS, check the autoscaling log to see if there are errors starting the instance.

`less /var/log/slurm/autoscaling.log`

EMS Slurm partitions are at:

`/xcompute/slurm/bin/partitions.json`

They are derived from the partition and pool names.
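
To inspect how your partition and pool names were rendered, pretty-print the file (the JSON structure itself is Exostellar's and is not documented here):

```
python3 -m json.tool /xcompute/slurm/bin/partitions.json | less
```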

### Worker instance not starting

### VM not starting on worker

### VM not starting Slurm job
61 changes: 42 additions & 19 deletions source/cdk/cdk_slurm_stack.py
@@ -892,21 +892,26 @@ def update_config_for_exostellar(self):
if not exostellar_security_group:
logger.error(f"ExostellarSecurityGroup resource not found in {ems_stack_name} EMS stack")
exit(1)
if 'ControllerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'] = []
if 'WorkerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'] = []
if exostellar_security_group not in self.config['slurm']['Xio']['ControllerSecurityGroupIds']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'].append(exostellar_security_group)
if exostellar_security_group not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(exostellar_security_group)
if self.slurm_compute_node_sg_id:
if self.slurm_compute_node_sg_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.slurm_compute_node_sg_id)
if 'Controllers' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['Controllers'] = {}
if 'SecurityGroupIds' not in self.config['slurm']['Xio']['Controllers']:
self.config['slurm']['Xio']['Controllers']['SecurityGroupIds'] = []
if 'Workers' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['Workers'] = {}
if 'SecurityGroupIds' not in self.config['slurm']['Xio']['Workers']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'] = []
if exostellar_security_group not in self.config['slurm']['Xio']['Controllers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Controllers']['SecurityGroupIds'].append(exostellar_security_group)
if exostellar_security_group not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(exostellar_security_group)
if 'AdditionalSecurityGroupsStackName' in self.config:
if self.slurm_compute_node_sg_id:
if self.slurm_compute_node_sg_id not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(self.slurm_compute_node_sg_id)
if 'RESStackName' in self.config:
if self.res_dcv_security_group_id:
if self.res_dcv_security_group_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.res_dcv_security_group_id)
if self.res_dcv_security_group_id not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(self.res_dcv_security_group_id)

# Get values from stack outputs
ems_ip_address = None
@@ -920,6 +925,7 @@ def update_config_for_exostellar(self):
self.config['slurm']['Xio']['ManagementServerIp'] = ems_ip_address

# Check that all of the profiles used by the pools are defined
logger.debug(f"Xio config:\n{json.dumps(self.config['slurm']['Xio'], indent=4)}")
WEIGHT_PER_CORE = {
'amd': 45,
'intel': 78
@@ -928,35 +934,47 @@
'amd': 3,
'intel': 3
}
number_of_warnings = 0
number_of_errors = 0
xio_profile_configs = {}
self.instance_type_info = self.plugin.get_instance_types_info(self.cluster_region)
self.instance_family_info = self.plugin.get_instance_families_info(self.cluster_region)
for profile_config in self.config['slurm']['Xio']['Profiles']:
profile_name = profile_config['ProfileName']
# Check that profile name is alphanumeric
if not re.compile('^[a-zA-Z0-9]+$').fullmatch(profile_name):
logger.error(f"Invalid XIO profile name: {profile_name}. Name must be alphanumeric.")
number_of_errors += 1
continue
if profile_name in xio_profile_configs:
logger.error(f"{profile_config['ProfileNmae']} XIO profile already defined")
number_of_errors += 1
continue
xio_profile_configs[profile_name] = profile_config
# Check that all instance types and families are from the correct CPU vendor
profile_cpu_vendor = profile_config['CpuVendor']
invalid_instance_types = []
for instance_type_or_family_with_weight in profile_config['InstanceTypes']:
(instance_type, instance_family) = self.get_instance_type_and_family_from_xio_config(instance_type_or_family_with_weight)
if not instance_type or not instance_family:
logger.error(f"XIO InstanceType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_errors += 1
logger.warning(f"XIO InstanceType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_warnings += 1
invalid_instance_types.append(instance_type_or_family_with_weight)
continue
instance_type_cpu_vendor = self.plugin.get_cpu_vendor(self.cluster_region, instance_type)
if instance_type_cpu_vendor != profile_cpu_vendor:
logger.error(f"Xio InstanceType {instance_type_or_family_with_weight} is from {instance_type_cpu_vendor} and must be from {profile_cpu_vendor}")
number_of_errors += 1
for invalid_instance_type in invalid_instance_types:
profile_config['InstanceTypes'].remove(invalid_instance_type)

invalid_instance_types = []
for instance_type_or_family_with_weight in profile_config['SpotFleetTypes']:
(instance_type, instance_family) = self.get_instance_type_and_family_from_xio_config(instance_type_or_family_with_weight)
if not instance_type or not instance_family:
logger.error(f"Xio SpotFleetType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_errors += 1
logger.warning(f"Xio SpotFleetType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_warnings += 1
invalid_instance_types.append(instance_type_or_family_with_weight)
continue
# Check that spot pricing is available for spot pools.
price = self.plugin.instance_type_and_family_info[self.cluster_region]['instance_types'][instance_type]['pricing']['spot'].get('max', None)
@@ -967,6 +985,9 @@ def update_config_for_exostellar(self):
if instance_type_cpu_vendor != profile_cpu_vendor:
logger.error(f"Xio InstanceType {instance_type_or_family_with_weight} is from {instance_type_cpu_vendor} and must be from {profile_cpu_vendor}")
number_of_errors += 1
for invalid_instance_type in invalid_instance_types:
profile_config['SpotFleetTypes'].remove(invalid_instance_type)

xio_pool_names = {}
for pool_config in self.config['slurm']['Xio']['Pools']:
pool_name = pool_config['PoolName']
@@ -985,6 +1006,8 @@ def update_config_for_exostellar(self):
number_of_errors += 1
else:
pool_config['ImageName'] = self.config['slurm']['Xio']['DefaultImageName']
if 'MinMemory' not in pool_config:
pool_config['MinMemory'] = pool_config['MaxMemory']
if 'Weight' not in pool_config:
profile_config = xio_profile_configs[profile_name]
cpu_vendor = profile_config['CpuVendor']
@@ -2226,9 +2249,9 @@ def get_instance_template_vars(self, instance_role):
if 'Xio' in self.config['slurm']:
instance_template_vars['xio_mgt_ip'] = self.config['slurm']['Xio']['ManagementServerIp']
instance_template_vars['xio_availability_zone'] = self.config['slurm']['Xio']['AvailabilityZone']
instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['ControllerSecurityGroupIds']
instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['Controllers']['SecurityGroupIds']
instance_template_vars['subnet_id'] = self.config['SubnetId']
instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['WorkerSecurityGroupIds']
instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['Workers']['SecurityGroupIds']
instance_template_vars['xio_config'] = self.config['slurm']['Xio']
elif instance_role == 'ParallelClusterExternalLoginNode':
instance_template_vars['slurm_version'] = get_SLURM_VERSION(self.config)
14 changes: 10 additions & 4 deletions source/cdk/config_schema.py
@@ -1408,11 +1408,17 @@ def get_config_schema(config):
Optional('Weight'): int
}
],
Optional('ManagementServerImageId'): str,
Optional('AvailabilityZone'): str,
Optional('ControllerSecurityGroupIds'): [ str ],
Optional('ControllerImageId'): str,
Optional('WorkerSecurityGroupIds'): [ str ],
Optional('Controllers'): {
Optional('ImageId'): str,
Optional('SecurityGroupIds'): [str],
Optional('IdentityRole'): str,
},
Optional('Workers'): {
Optional('ImageId'): str,
Optional('SecurityGroupIds'): [ str ],
Optional('IdentityRole'): str
},
Optional('WorkerImageId'): str,
},
Optional('SlurmUid', default=401): int,