Fix xio resume script (#287)
Miscellaneous Exostellar Infrastructure Optimizer integration fixes.

Updated documentation.

Add DefaultImageName to example config.

Rename some of the XIO config parameters.

* Replace ControllerSecurityGroupIds with Controllers/SecurityGroupIds
* Replace WorkerSecurityGroupIds with Workers/SecurityGroupIds

Fix a bug where an unset variable was referenced if AdditionalSecurityGroupsStackName is not set.

Change error to warning if an instance type doesn't exist in the current region.

Fix configure_xio.py script to create new resources if they don't already exist.

Fix hard-coded SLURM_CONF_PATH in resume_xspot.sh script.
Check that XIO profile name is alphanumeric.

If an XIO pool's MinMemory is not set, set it to the same value as MaxMemory.
cartalla authored Dec 16, 2024
1 parent 1ea25f7 commit 20a62f2
Showing 8 changed files with 814 additions and 179 deletions.
@@ -105,6 +105,10 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
)
security_groups['SlurmdbdSG'] = slurmdbd_sg

# Rules for compute nodes
# Allow mounting of /opt/slurm and from head node
slurm_compute_node_sg.connections.allow_to(slurm_head_node_sg, ec2.Port.tcp(2049), f"SlurmComputeNodeSG to SlurmHeadNodeSG NFS")

# Rules for login nodes
slurm_login_node_sg.connections.allow_from(slurm_head_node_sg, ec2.Port.tcp_range(1024, 65535), f"SlurmHeadNodeSG to SlurmLoginNodeSG ephemeral")
slurm_login_node_sg.connections.allow_from(slurm_compute_node_sg, ec2.Port.tcp_range(1024, 65535), f"SlurmComputeNodeSG to SlurmLoginNodeSG ephemeral")
111 changes: 64 additions & 47 deletions docs/exostellar-infrastructure-optimizer.md
@@ -49,11 +49,16 @@ Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/H
First deploy your cluster without configuring XIO.
The cluster deploys ansible playbooks that will be used to create the XIO ParallelCluster AMI.

### Install the Exostellar Management Server (EMS)
### Deploy the Exostellar Management Server (EMS)

The next step is to [install the Exostellar management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).
Exostellar will provide a link to a CloudFormation template that
will deploy the server in your account and will share 3 AMIs that are used by the template to create the EMS, controllers, and workers.
You must first subscribe to the three Exostellar Infrastructure AMIs in the AWS Marketplace.

* [Exostellar Management Server](https://aws.amazon.com/marketplace/server/procurement?productId=prod-crdnafbqnbnm2)
* [Exostellar Controller](https://aws.amazon.com/marketplace/server/procurement?productId=prod-d4lifqwlw4kja)
* [Exostellar Worker](https://aws.amazon.com/marketplace/server/procurement?productId=prod-2smeyk5fuxt7q)

Then follow the [directions to deploy the CloudFormation template](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server#v2.4.0.0InstallingwithCloudFormationTemplate(AWS)-Step3:CreateaNewStack).
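
Before continuing, you can verify that the EMS stack finished deploying. A quick check with the AWS CLI (the stack name `exostellar-management-server` matches the example configuration below; substitute the name you chose for your stack):

```
aws cloudformation describe-stacks \
    --stack-name exostellar-management-server \
    --query 'Stacks[0].StackStatus' \
    --output text
```

It should report `CREATE_COMPLETE` (or `UPDATE_COMPLETE` after an update) before you move on to the XIO configuration.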

### Create XIO Configuration

@@ -80,12 +85,15 @@ available capacity pools and increase the likelihood of running on spot.

**Note**: The Intel instance families offer more configurations and higher-memory instances, include high-frequency instance types such as m5zn, r7iz, and z1d, and tend to have more capacity. The AMD instance families include HPC instance types; however, those do not support spot pricing and can only be used on-demand.

**Note**: This is only an example configuration. You should customize it for your requirements.

```
slurm:
Xio:
ManagementServerStackName: exostellar-management-server
PartitionName: xio
AvailabilityZone: us-east-2b
DefaultImageName: <your-xio-vm-image-name>
Profiles:
- ProfileName: amd
NodeGroupName: amd
@@ -191,38 +199,6 @@ slurm:
- xiezn
- z1d
EnableHyperthreading: false
- ProfileName: intel24core350g
NodeGroupName: intel24core350g
MaxControllers: 10
InstanceTypes:
- r5.12xlarge:1
- r5d.12xlarge:2
- r6i.12xlarge:3
- r6id.12xlarge:4
- r7i.12xlarge:5
- r7iz.12xlarge:6
SpotFleetTypes:
- r5.12xlarge:1
- r5d.12xlarge:2
- r6i.12xlarge:3
- r6id.12xlarge:4
- r7i.12xlarge:5
- r7iz.12xlarge:6
EnableHyperthreading: false
- ProfileName: amd24core350g
NodeGroupName: amd24core350g
MaxControllers: 10
InstanceTypes:
- r5a.12xlarge:1
- r5ad.12xlarge:2
- r6a.12xlarge:3
- r7a.12xlarge:5
SpotFleetTypes:
- r5a.12xlarge:1
- r5ad.12xlarge:2
- r6a.12xlarge:3
- r7a.12xlarge:5
EnableHyperthreading: false
Pools:
- PoolName: amd-8-gb-1-cores
ProfileName: amd
@@ -261,18 +237,12 @@
MaxMemory: 350000
```

### Create XIO Profiles

In the EMS GUI copy the existing az1 profile to the profiles that you configured.
The name is all that matters.
The deployment will update the profile automatically from your configuration.

### Verify that the "az1" profile exists

### Create the Application Environment
In the EMS GUI go to Profiles and make sure that the "az1" profile exists.
It is used as a template to create your new profiles.

In the EMS GUI copy the **slurm** Application Environment to a new environment that is the same
name as your ParallelCluster cluster.
The deployment will update the application environment from your configuration.
If it doesn't exist, there was a problem with the EMS deployment and you should contact Exostellar support.

### Create an XIO ParallelCluster AMI

@@ -292,13 +262,18 @@ packages.

Create an AMI from the instance and wait for it to become available.

### Update the cluster with the XIO Iconfiguration
After the AMI has been successfully created, you can either stop or terminate the instance to save costs.
If you may need to do additional customization, stop it; otherwise, terminate it.
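
If you prefer to script the AMI creation and cleanup instead of using the console, a rough sketch with the AWS CLI looks like this (the instance ID and AMI name are placeholders):

```
# Create the AMI from the customized instance and capture its ID.
AMI_ID=$(aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name xio-parallelcluster-ami \
    --query ImageId --output text)

# Wait for the AMI to become available.
aws ec2 wait image-available --image-ids "$AMI_ID"

# Stop the instance if you expect to customize it further,
# or terminate it if you are done with it.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```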

### Update the cluster with the XIO configuration

Update the cluster with the XIO configuration.

This will update the profiles and environment on the EMS server and configure the cluster for XIO.
The only remaining step before you can submit jobs is to create the XIO VM image.

This is done before creating an image because the XIO scripts get deployed by this step.

### Create an XIO Image from the XIO ParallelCluster AMI

Connect to the head node and create the XIO Image from the AMI you created.
@@ -315,11 +290,53 @@ The pool, profile, and image_name should be from your configuration.
The host name doesn't matter.

```
/opt/slurm/etc/exostellar/teste_creasteVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
/opt/slurm/etc/exostellar/test_createVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
```

When this is done, the VM, worker, and controller should all terminate on their own.
If they do not, then connect to the EMS and cancel the job that started the controller.

Use `squeue` to list the controller jobs. Use `scancel` to terminate them.
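
For example, on the EMS (the job ID shown is hypothetical):

```
# List the controller jobs.
squeue

# Cancel a leftover controller job by its job ID.
scancel 12345
```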

### Run a test job using Slurm

```
srun --pty -p xio-
```

## Debug

### UpdateHeadNode resource failed

If the UpdateHeadNode resource fails, it is usually because a task in the ansible script failed.
Connect to the head node and look for errors in:

```/var/log/ansible.log```

Usually it will be a problem with the `/opt/slurm/etc/exostellar/configure_xio.py` script.
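
A quick way to find the failing task is a simple grep (adjust the pattern as needed):

```
grep -iE 'fatal|failed|error' /var/log/ansible.log | tail -n 20
```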

When this happens, the CloudFormation stack will usually be in UPDATE_ROLLBACK_FAILED status.
Before you can update it again, you will need to complete the rollback.
Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.
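
The rollback can also be continued from the AWS CLI if you prefer (the stack name is a placeholder; `UpdateHeadNode` is the logical ID of the resource to skip, as in the console steps above):

```
aws cloudformation continue-update-rollback \
    --stack-name <cluster-stack-name> \
    --resources-to-skip UpdateHeadNode
```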

### XIO Controller not starting

On EMS, check that a job is running to create the controller.

`squeue`

On EMS, check the autoscaling log to see if there are errors starting the instance.

`less /var/log/slurm/autoscaling.log`

EMS Slurm partitions are at:

`/xcompute/slurm/bin/partitions.json`

They are derived from the partition and pool names.
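
To inspect how your partition and pool names were rendered, pretty-print the file (the JSON structure itself is Exostellar's and is not documented here):

```
python3 -m json.tool /xcompute/slurm/bin/partitions.json | less
```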

### Worker instance not starting

### VM not starting on worker

### VM not starting Slurm job
61 changes: 42 additions & 19 deletions source/cdk/cdk_slurm_stack.py
@@ -892,21 +892,26 @@ def update_config_for_exostellar(self):
if not exostellar_security_group:
logger.error(f"ExostellarSecurityGroup resource not found in {ems_stack_name} EMS stack")
exit(1)
if 'ControllerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'] = []
if 'WorkerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'] = []
if exostellar_security_group not in self.config['slurm']['Xio']['ControllerSecurityGroupIds']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'].append(exostellar_security_group)
if exostellar_security_group not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(exostellar_security_group)
if self.slurm_compute_node_sg_id:
if self.slurm_compute_node_sg_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.slurm_compute_node_sg_id)
if 'Controllers' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['Controllers'] = {}
if 'SecurityGroupIds' not in self.config['slurm']['Xio']['Controllers']:
self.config['slurm']['Xio']['Controllers']['SecurityGroupIds'] = []
if 'Workers' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['Workers'] = {}
if 'SecurityGroupIds' not in self.config['slurm']['Xio']['Workers']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'] = []
if exostellar_security_group not in self.config['slurm']['Xio']['Controllers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Controllers']['SecurityGroupIds'].append(exostellar_security_group)
if exostellar_security_group not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(exostellar_security_group)
if 'AdditionalSecurityGroupsStackName' in self.config:
if self.slurm_compute_node_sg_id:
if self.slurm_compute_node_sg_id not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(self.slurm_compute_node_sg_id)
if 'RESStackName' in self.config:
if self.res_dcv_security_group_id:
if self.res_dcv_security_group_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.res_dcv_security_group_id)
if self.res_dcv_security_group_id not in self.config['slurm']['Xio']['Workers']['SecurityGroupIds']:
self.config['slurm']['Xio']['Workers']['SecurityGroupIds'].append(self.res_dcv_security_group_id)

# Get values from stack outputs
ems_ip_address = None
@@ -920,6 +925,7 @@ def update_config_for_exostellar(self):
self.config['slurm']['Xio']['ManagementServerIp'] = ems_ip_address

# Check that all of the profiles used by the pools are defined
logger.debug(f"Xio config:\n{json.dumps(self.config['slurm']['Xio'], indent=4)}")
WEIGHT_PER_CORE = {
'amd': 45,
'intel': 78
@@ -928,35 +934,47 @@
'amd': 3,
'intel': 3
}
number_of_warnings = 0
number_of_errors = 0
xio_profile_configs = {}
self.instance_type_info = self.plugin.get_instance_types_info(self.cluster_region)
self.instance_family_info = self.plugin.get_instance_families_info(self.cluster_region)
for profile_config in self.config['slurm']['Xio']['Profiles']:
profile_name = profile_config['ProfileName']
# Check that profile name is alphanumeric
if not re.compile('^[a-zA-Z0-9]+$').fullmatch(profile_name):
logger.error(f"Invalid XIO profile name: {profile_name}. Name must be alphanumeric.")
number_of_errors += 1
continue
if profile_name in xio_profile_configs:
logger.error(f"{profile_config['ProfileNmae']} XIO profile already defined")
number_of_errors += 1
continue
xio_profile_configs[profile_name] = profile_config
# Check that all instance types and families are from the correct CPU vendor
profile_cpu_vendor = profile_config['CpuVendor']
invalid_instance_types = []
for instance_type_or_family_with_weight in profile_config['InstanceTypes']:
(instance_type, instance_family) = self.get_instance_type_and_family_from_xio_config(instance_type_or_family_with_weight)
if not instance_type or not instance_family:
logger.error(f"XIO InstanceType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_errors += 1
logger.warning(f"XIO InstanceType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_warnings += 1
invalid_instance_types.append(instance_type_or_family_with_weight)
continue
instance_type_cpu_vendor = self.plugin.get_cpu_vendor(self.cluster_region, instance_type)
if instance_type_cpu_vendor != profile_cpu_vendor:
logger.error(f"Xio InstanceType {instance_type_or_family_with_weight} is from {instance_type_cpu_vendor} and must be from {profile_cpu_vendor}")
number_of_errors += 1
for invalid_instance_type in invalid_instance_types:
profile_config['InstanceTypes'].remove(invalid_instance_type)

invalid_instance_types = []
for instance_type_or_family_with_weight in profile_config['SpotFleetTypes']:
(instance_type, instance_family) = self.get_instance_type_and_family_from_xio_config(instance_type_or_family_with_weight)
if not instance_type or not instance_family:
logger.error(f"Xio SpotFleetType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_errors += 1
logger.warning(f"Xio SpotFleetType {instance_type_or_family_with_weight} is not a valid instance type or family in the {self.cluster_region} region")
number_of_warnings += 1
invalid_instance_types.append(instance_type_or_family_with_weight)
continue
# Check that spot pricing is available for spot pools.
price = self.plugin.instance_type_and_family_info[self.cluster_region]['instance_types'][instance_type]['pricing']['spot'].get('max', None)
@@ -967,6 +985,9 @@ def update_config_for_exostellar(self):
if instance_type_cpu_vendor != profile_cpu_vendor:
logger.error(f"Xio InstanceType {instance_type_or_family_with_weight} is from {instance_type_cpu_vendor} and must be from {profile_cpu_vendor}")
number_of_errors += 1
for invalid_instance_type in invalid_instance_types:
profile_config['SpotFleetTypes'].remove(invalid_instance_type)

xio_pool_names = {}
for pool_config in self.config['slurm']['Xio']['Pools']:
pool_name = pool_config['PoolName']
@@ -985,6 +1006,8 @@ def update_config_for_exostellar(self):
number_of_errors += 1
else:
pool_config['ImageName'] = self.config['slurm']['Xio']['DefaultImageName']
if 'MinMemory' not in pool_config:
pool_config['MinMemory'] = pool_config['MaxMemory']
if 'Weight' not in pool_config:
profile_config = xio_profile_configs[profile_name]
cpu_vendor = profile_config['CpuVendor']
@@ -2226,9 +2249,9 @@ def get_instance_template_vars(self, instance_role):
if 'Xio' in self.config['slurm']:
instance_template_vars['xio_mgt_ip'] = self.config['slurm']['Xio']['ManagementServerIp']
instance_template_vars['xio_availability_zone'] = self.config['slurm']['Xio']['AvailabilityZone']
instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['ControllerSecurityGroupIds']
instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['Controllers']['SecurityGroupIds']
instance_template_vars['subnet_id'] = self.config['SubnetId']
instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['WorkerSecurityGroupIds']
instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['Workers']['SecurityGroupIds']
instance_template_vars['xio_config'] = self.config['slurm']['Xio']
elif instance_role == 'ParallelClusterExternalLoginNode':
instance_template_vars['slurm_version'] = get_SLURM_VERSION(self.config)
14 changes: 10 additions & 4 deletions source/cdk/config_schema.py
@@ -1408,11 +1408,17 @@ def get_config_schema(config):
Optional('Weight'): int
}
],
Optional('ManagementServerImageId'): str,
Optional('AvailabilityZone'): str,
Optional('ControllerSecurityGroupIds'): [ str ],
Optional('ControllerImageId'): str,
Optional('WorkerSecurityGroupIds'): [ str ],
Optional('Controllers'): {
Optional('ImageId'): str,
Optional('SecurityGroupIds'): [str],
Optional('IdentityRole'): str,
},
Optional('Workers'): {
Optional('ImageId'): str,
Optional('SecurityGroupIds'): [ str ],
Optional('IdentityRole'): str
},
Optional('WorkerImageId'): str,
},
Optional('SlurmUid', default=401): int,