Fix xio resume script #287

Merged 1 commit on Dec 16, 2024
@@ -105,6 +105,10 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
)
security_groups['SlurmdbdSG'] = slurmdbd_sg

# Rules for compute nodes
# Allow compute nodes to mount /opt/slurm from the head node via NFS
slurm_compute_node_sg.connections.allow_to(slurm_head_node_sg, ec2.Port.tcp(2049), f"SlurmComputeNodeSG to SlurmHeadNodeSG NFS")

# Rules for login nodes
slurm_login_node_sg.connections.allow_from(slurm_head_node_sg, ec2.Port.tcp_range(1024, 65535), f"SlurmHeadNodeSG to SlurmLoginNodeSG ephemeral")
slurm_login_node_sg.connections.allow_from(slurm_compute_node_sg, ec2.Port.tcp_range(1024, 65535), f"SlurmComputeNodeSG to SlurmLoginNodeSG ephemeral")
111 changes: 64 additions & 47 deletions docs/exostellar-infrastructure-optimizer.md
@@ -49,11 +49,16 @@ Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/H
First deploy your cluster without configuring XIO.
The cluster deploys ansible playbooks that will be used to create the XIO ParallelCluster AMI.

### Install the Exostellar Management Server (EMS)
### Deploy the Exostellar Management Server (EMS)

The next step is to [install the Exostellar management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).
Exostellar will provide a link to a CloudFormation template that
will deploy the server in your account and will share 3 AMIs that are used by the template to create the EMS, controllers, and workers.
You must first subscribe to the three Exostellar Infrastructure AMIs in the AWS Marketplace.

* [Exostellar Management Server](https://aws.amazon.com/marketplace/server/procurement?productId=prod-crdnafbqnbnm2)
* [Exostellar Controller](https://aws.amazon.com/marketplace/server/procurement?productId=prod-d4lifqwlw4kja)
* [Exostellar Worker](https://aws.amazon.com/marketplace/server/procurement?productId=prod-2smeyk5fuxt7q)

Then follow the [directions to deploy the CloudFormation template](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server#v2.4.0.0InstallingwithCloudFormationTemplate(AWS)-Step3:CreateaNewStack).
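
If you prefer the AWS CLI, a minimal sketch of creating the stack is shown below; the template URL and any parameters are placeholders that come from Exostellar.

```
# Launch the EMS stack from the template link provided by Exostellar.
# The template URL and any parameters are placeholders; use the values
# from Exostellar's documentation for your account.
aws cloudformation create-stack \
    --stack-name exostellar-management-server \
    --template-url <template-url-from-exostellar> \
    --capabilities CAPABILITY_NAMED_IAM
```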

### Create XIO Configuration

@@ -80,12 +80,15 @@ available capacity pools and increase the likelihood of running on spot.

**Note**: The Intel instance families contain more configurations and higher-memory instance types, including high-frequency types such as m5zn, r7iz, and z1d, and they tend to have more capacity. The AMD instance families include HPC instance types; however, HPC types do not support spot pricing and can only be used on demand.

**Note**: This is only an example configuration. You should customize it for your requirements.

```
slurm:
Xio:
ManagementServerStackName: exostellar-management-server
PartitionName: xio
AvailabilityZone: us-east-2b
DefaultImageName: <your-xio-vm-image-name>
Profiles:
- ProfileName: amd
NodeGroupName: amd
@@ -191,38 +199,6 @@ slurm:
- xiezn
- z1d
EnableHyperthreading: false
- ProfileName: intel24core350g
NodeGroupName: intel24core350g
MaxControllers: 10
InstanceTypes:
- r5.12xlarge:1
- r5d.12xlarge:2
- r6i.12xlarge:3
- r6id.12xlarge:4
- r7i.12xlarge:5
- r7iz.12xlarge:6
SpotFleetTypes:
- r5.12xlarge:1
- r5d.12xlarge:2
- r6i.12xlarge:3
- r6id.12xlarge:4
- r7i.12xlarge:5
- r7iz.12xlarge:6
EnableHyperthreading: false
- ProfileName: amd24core350g
NodeGroupName: amd24core350g
MaxControllers: 10
InstanceTypes:
- r5a.12xlarge:1
- r5ad.12xlarge:2
- r6a.12xlarge:3
- r7a.12xlarge:5
SpotFleetTypes:
- r5a.12xlarge:1
- r5ad.12xlarge:2
- r6a.12xlarge:3
- r7a.12xlarge:5
EnableHyperthreading: false
Pools:
- PoolName: amd-8-gb-1-cores
ProfileName: amd
Expand Down Expand Up @@ -261,18 +237,12 @@ slurm:
MaxMemory: 350000
```

### Create XIO Profiles

In the EMS GUI copy the existing az1 profile to the profiles that you configured.
The name is all that matters.
The deployment will update the profile automatically from your configuration.

### Verify that the "az1" profile exists

In the EMS GUI go to Profiles and make sure that the "az1" profile exists.
It is used as a template to create your new profiles.
If it doesn't exist, there was a problem with the EMS deployment and you should contact Exostellar support.

### Create the Application Environment

In the EMS GUI copy the **slurm** Application Environment to a new environment with the same
name as your ParallelCluster cluster.
The deployment will update the application environment from your configuration.

### Create an XIO ParallelCluster AMI

@@ -292,13 +262,18 @@ packages.

Create an AMI from the instance and wait for it to become available.
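
For example, from the AWS CLI (the instance ID, AMI name, and AMI ID are placeholders):

```
# Create an AMI from the customized instance.
aws ec2 create-image --instance-id <instance-id> --name <xio-parallelcluster-ami-name>

# Wait for the AMI to become available before continuing.
aws ec2 wait image-available --image-ids <ami-id>
```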

### Update the cluster with the XIO Iconfiguration
After the AMI has been successfully created you can either stop or terminate the instance to save costs.
If you might need to do additional customization, stop it; otherwise, terminate it.

### Update the cluster with the XIO configuration

Update the cluster with the XIO configuration.

This will update the profiles and environment on the EMS server and configure the cluster for XIO.
The only remaining step before you can submit jobs is to create the XIO VM image.

The cluster update is done before creating the XIO image because this step deploys the XIO scripts used in the following steps.
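
A minimal sketch of the update, assuming you deploy with this repository's install script and the same configuration file, now with the `Xio` section added (the script name and options are assumptions; use the command you originally deployed with):

```
# Re-run the deployment so the EMS profiles, application environment,
# and cluster are updated from the Xio configuration.
./install.sh --config-file <your-config.yml> --cdk-cmd update
```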

### Create an XIO Image from the XIO ParallelCluster AMI

Connect to the head node and create the XIO Image from the AMI you created.
@@ -315,11 +290,53 @@ The pool, profile, and image_name should be from your configuration.
The host name doesn't matter.

```
/opt/slurm/etc/exostellar/teste_creasteVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
/opt/slurm/etc/exostellar/test_createVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
```
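
For example, using the pool and profile names from the example configuration above (the image name is a placeholder):

```
/opt/slurm/etc/exostellar/test_createVm.sh --pool amd-8-gb-1-cores --profile amd -i <your-xio-vm-image-name> -h xio-test
```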

When this is done, the VM, worker, and controller should all terminate on their own.
If they do not, then connect to the EMS and cancel the job that started the controller.

Use `squeue` to list the controller jobs. Use `scancel` to terminate them.
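
For example, on the EMS (the job ID is a placeholder):

```
# List the jobs that are running the controllers.
squeue

# Cancel a stuck controller job by its job ID.
scancel <job-id>
```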

### Run a test job using Slurm

Submit an interactive test job to one of the XIO partitions.
The partition names are derived from the partition and pool names in your configuration.

```
srun --pty -p <xio-partition-name> /bin/bash
```
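
To list the XIO partitions and their nodes (assuming they are prefixed with the configured `PartitionName`, `xio` in the example above):

```
sinfo | grep xio
```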

## Debug

### UpdateHeadNode resource failed

If the UpdateHeadNode resource fails, it is usually because a task in the ansible playbook failed.
Connect to the head node and look for errors in:

```/var/log/ansible.log```

Usually it will be a problem with the `/opt/slurm/etc/exostellar/configure_xio.py` script.
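
For example, a simple search for failed tasks (the exact error text varies):

```
grep -i -B 2 -A 10 fatal /var/log/ansible.log
```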

When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FAILED status.
Before you can update it again you will need to complete the rollback.
Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.
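
The same can be done from the AWS CLI (the stack name is a placeholder; skipping the failed resource is only needed if the rollback cannot complete on its own):

```
aws cloudformation continue-update-rollback \
    --stack-name <cluster-stack-name> \
    --resources-to-skip UpdateHeadNode
```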

### XIO Controller not starting

On the EMS, check that a job is running to create the controller.

`squeue`

On EMS, check the autoscaling log to see if there are errors starting the instance.

`less /var/log/slurm/autoscaling.log`

EMS Slurm partitions are at:

`/xcompute/slurm/bin/partitions.json`

They are derived from the partition and pool names.
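
To inspect them (assuming `jq` is available on the EMS; plain `cat` also works):

```
jq . /xcompute/slurm/bin/partitions.json
```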

### Worker instance not starting

### VM not starting on worker

### VM not starting Slurm job
61 changes: 42 additions & 19 deletions source/cdk/cdk_slurm_stack.py
@@ -892,21 +892,26 @@ def update_config_for_exostellar(self):
if not exostellar_security_group:
logger.error(f"ExostellarSecurityGroup resource not found in {ems_stack_name} EMS stack")
exit(1)
if 'ControllerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'] = []
if 'WorkerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'] = []
if exostellar_security_group not in self.config['slurm']['Xio']['ControllerSecurityGroupIds']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'].append(exostellar_security_group)
if exostellar_security_group not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(exostellar_security_group)
if self.slurm_compute_node_sg_id:
if self.slurm_compute_node_sg_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.slurm_compute_node_sg_id)
if 'Controllers' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['Controllers'] = {}
if 'SecurityGroupIds' not in self.config['slurm']['Xio']['Controllers']:
self.config['slurm']['Xio']['Controllers']['SecurityGroupIds'] = []
if 'Workers' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['Workers'] = {}
if 'SecurityGroupIds' not in self.config['slurm']['Xio']['Workers']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'] = []
if exostellar_security_group not in self.config['slurm']['Xio']['Controllers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Controllers']['SecurityGroupIds'].append(exostellar_security_group)
if exostellar_security_group not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(exostellar_security_group)
if 'AdditionalSecurityGroupsStackName' in self.config:
if self.slurm_compute_node_sg_id:
if self.slurm_compute_node_sg_id not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(self.slurm_compute_node_sg_id)
if 'RESStackName' in self.config:
if self.res_dcv_security_group_id:
if self.res_dcv_security_group_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.res_dcv_security_group_id)
if self.res_dcv_security_group_id not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(self.res_dcv_security_group_id)

# Get values from stack outputs
ems_ip_address = None
@@ -920,6 +925,7 @@ def update_config_for_exostellar(self):
self.config['slurm']['Xio']['ManagementServerIp'] = ems_ip_address

# Check that all of the profiles used by the pools are defined
logger.debug(f"Xio config:\n{json.dumps(self.config['slurm']['Xio'], indent=4)}")
WEIGHT_PER_CORE = {
'amd': 45,
'intel': 78
@@ -928,35 +934,47 @@ def update_config_for_exostellar(self):
'amd': 3,
'intel': 3
}
number_of_warnings = 0
number_of_errors = 0
xio_profile_configs = {}
self.instance_type_info = self.plugin.get_instance_types_info(self.cluster_region)
self.instance_family_info = self.plugin.get_instance_families_info(self.cluster_region)
for profile_config in self.config['slurm']['Xio']['Profiles']:
profile_name = profile_config['ProfileName']
# Check that profile name is alphanumeric
if not re.compile('^[a-zA-Z0-9]+$').fullmatch(profile_name):
logger.error(f"Invalid XIO profile name: {profile_name}. Name must be alphanumeric.")
number_of_errors += 1
continue
if profile_name in xio_profile_configs:
logger.error(f"{profile_config['ProfileNmae']} XIO profile already defined")
number_of_errors += 1
continue
xio_profile_configs[profile_name] = profile_config
# Check that all instance types and families are from the correct CPU vendor
profile_cpu_vendor = profile_config['CpuVendor']
invalid_instance_types = []
for instance_type_or_family_with_weight in profile_config['InstanceTypes']:
(instance_type, instance_family) = self.get_instance_type_and_family_from_xio_config(instance_type_or_family_with_weight)
if not instance_type or not instance_family:
logger.error(f"XIO InstanceType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_errors += 1
logger.warning(f"XIO InstanceType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_warnings += 1
invalid_instance_types.append(instance_type_or_family_with_weight)
continue
instance_type_cpu_vendor = self.plugin.get_cpu_vendor(self.cluster_region, instance_type)
if instance_type_cpu_vendor != profile_cpu_vendor:
logger.error(f"Xio InstanceType {instance_type_or_family_with_weight} is from {instance_type_cpu_vendor} and must be from {profile_cpu_vendor}")
number_of_errors += 1
for invalid_instance_type in invalid_instance_types:
profile_config['InstanceTypes'].remove(invalid_instance_type)

invalid_instance_types = []
for instance_type_or_family_with_weight in profile_config['SpotFleetTypes']:
(instance_type, instance_family) = self.get_instance_type_and_family_from_xio_config(instance_type_or_family_with_weight)
if not instance_type or not instance_family:
logger.error(f"Xio SpotFleetType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_errors += 1
logger.warning(f"Xio SpotFleetType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_warnings += 1
invalid_instance_types.append(instance_type_or_family_with_weight)
continue
# Check that spot pricing is available for spot pools.
price = self.plugin.instance_type_and_family_info[self.cluster_region]['instance_types'][instance_type]['pricing']['spot'].get('max', None)
Expand All @@ -967,6 +985,9 @@ def update_config_for_exostellar(self):
if instance_type_cpu_vendor != profile_cpu_vendor:
logger.error(f"Xio InstanceType {instance_type_or_family_with_weight} is from {instance_type_cpu_vendor} and must be from {profile_cpu_vendor}")
number_of_errors += 1
for invalid_instance_type in invalid_instance_types:
profile_config['SpotFleetTypes'].remove(invalid_instance_type)

xio_pool_names = {}
for pool_config in self.config['slurm']['Xio']['Pools']:
pool_name = pool_config['PoolName']
@@ -985,6 +1006,8 @@ def update_config_for_exostellar(self):
number_of_errors += 1
else:
pool_config['ImageName'] = self.config['slurm']['Xio']['DefaultImageName']
if 'MinMemory' not in pool_config:
pool_config['MinMemory'] = pool_config['MaxMemory']
if 'Weight' not in pool_config:
profile_config = xio_profile_configs[profile_name]
cpu_vendor = profile_config['CpuVendor']
@@ -2226,9 +2249,9 @@ def get_instance_template_vars(self, instance_role):
if 'Xio' in self.config['slurm']:
instance_template_vars['xio_mgt_ip'] = self.config['slurm']['Xio']['ManagementServerIp']
instance_template_vars['xio_availability_zone'] = self.config['slurm']['Xio']['AvailabilityZone']
instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['ControllerSecurityGroupIds']
instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['Controllers']['SecurityGroupIds']
instance_template_vars['subnet_id'] = self.config['SubnetId']
instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['WorkerSecurityGroupIds']
instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['Workers']['SecurityGroupIds']
instance_template_vars['xio_config'] = self.config['slurm']['Xio']
elif instance_role == 'ParallelClusterExternalLoginNode':
instance_template_vars['slurm_version'] = get_SLURM_VERSION(self.config)
14 changes: 10 additions & 4 deletions source/cdk/config_schema.py
@@ -1408,11 +1408,17 @@ def get_config_schema(config):
Optional('Weight'): int
}
],
Optional('ManagementServerImageId'): str,
Optional('AvailabilityZone'): str,
Optional('ControllerSecurityGroupIds'): [ str ],
Optional('ControllerImageId'): str,
Optional('WorkerSecurityGroupIds'): [ str ],
Optional('Controllers'): {
Optional('ImageId'): str,
Optional('SecurityGroupIds'): [str],
Optional('IdentityRole'): str,
},
Optional('Workers'): {
Optional('ImageId'): str,
Optional('SecurityGroupIds'): [ str ],
Optional('IdentityRole'): str
},
Optional('WorkerImageId'): str,
},
Optional('SlurmUid', default=401): int,