
Enable spot for only certain InstanceTypes #277

Closed
gwolski opened this issue Oct 28, 2024 · 4 comments · Fixed by #284
gwolski commented Oct 28, 2024

I'm trying to set up one cluster for all my EDA tool and SW team needs. This might be too grand a goal, and I might ultimately need multiple clusters for different "tool" types, i.e. simulation, physical design, software team. But let me document this request and see if you find value in it or have other suggestions.

I like using spot for my simulation jobs, which require smaller machines with less memory. ParallelCluster limits us to 50 compute resources, so if we generate a spot instance for every on-demand instance, we can only have 25 instance types. Here is what I presently have enabled:

    - r7a.medium   # cpu=1 mem=8
    - m7i.large    # cpu=1 mem=8 hyper-threading turned off
    - r7i.large    # cpu=1 mem=16 hyper-threading turned off
    - m7a.large    # cpu=2 mem=8
    - c7i.xlarge   # cpu=2 mem=8 hyper-threading turned off
    - r7a.large    # cpu=2 mem=16
    - m7a.xlarge   # cpu=4 mem=16
    - r7i.xlarge   # cpu=2 mem=32 hyper-threading turned off
    - r7a.xlarge   # cpu=4 mem=32
    - r7i.2xlarge  # cpu=4 mem=64 hyper-threading turned off
    - c7a.2xlarge  # cpu=8 mem=16
    - m7a.2xlarge  # cpu=8 mem=32
    - c7i.4xlarge  # cpu=8 mem=32 hyper-threading turned off
    #- r7a.2xlarge  # cpu=8 mem=64
    - m7i.4xlarge  # cpu=8 mem=64 hyper-threading turned off
    - r7i.4xlarge  # cpu=8 mem=128 hyper-threading turned off
    - c7a.4xlarge  # cpu=16 mem=32
    #- c7i.8xlarge  # cpu=16 mem=64 hyper-threading turned off
    - m7a.4xlarge  # cpu=16 mem=64
    - r7a.4xlarge  # cpu=16 mem=128
    - r7i.8xlarge  # cpu=16 mem=256 hyper-threading turned off
    - c7a.8xlarge  # cpu=32 mem=64
    - m7a.8xlarge  # cpu=32 mem=128
    - r7a.8xlarge  # cpu=32 mem=256
    - r7a.12xlarge # cpu=48 mem=384
    - c7a.16xlarge # cpu=64 mem=128
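
The arithmetic behind the 50-compute-resource cap can be sketched as follows. This is a hypothetical helper, not part of ParallelCluster or any tool; it just checks the budget described above (one compute resource per instance type per enabled purchase option):

```python
# Sanity-check the ParallelCluster compute-resource (CR) budget.
# Each enabled instance type consumes one CR per purchase option
# (on-demand, spot); ParallelCluster caps CRs at 50.
MAX_COMPUTE_RESOURCES = 50

def compute_resource_count(num_instance_types: int,
                           use_on_demand: bool = True,
                           use_spot: bool = True) -> int:
    """CRs consumed when every type gets every enabled purchase option."""
    options = int(use_on_demand) + int(use_spot)
    return num_instance_types * options

# 25 types with both options hits the cap exactly:
assert compute_resource_count(25) == MAX_COMPUTE_RESOURCES
# Spot on only the smaller types (say 13 of 25) leaves headroom:
assert compute_resource_count(25, use_spot=False) + 13 <= MAX_COMPUTE_RESOURCES
```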

Is it possible to extend the configuration language to allow me to specify which machines are enabled for spot? Or maybe another section? I don't want to have another cluster for spot, as my understanding is the users would have to bounce between clusters with different module loads.

Maybe I just build a "small" machine cluster and a "big" machine cluster and not worry about it, as most PD engineers who use the bigger machines never use the smaller ones. This is what I've done in the past.

Your thoughts are appreciated as you have seen more clusters in action than I have...
Arguably this should also be a ticket with the ParallelCluster team to raise or remove the CR limit - I don't see why there is a hard limit.

@cartalla
Contributor
What would the config look like for that? I think that the request makes sense. Right now spot and on-demand are all or nothing. I'm thinking that each entry in the Include section could be extended to be either an instance type/family, or a dictionary with additional configuration that is specific to the type/family.

For example:

slurm:
  InstanceConfig:
    UseOnDemand: true
    UseSpot: false
    Include:
      InstanceTypes:
        - r7a.medium: {UseSpot: true}
        - m7i.large: {UseSpot: true}
        - r7a.12xlarge
        - c7a.16xlarge

gwolski commented Nov 3, 2024

Your proposal is elegant. I think I would use it.
I ultimately might break out my clusters into cluster-sim, cluster-pd, cluster-sw though. The biggest reason being the Idletimeout value. For the smaller machines that are used by many verification people, I would set the idletimeout to be 60 minutes - that way once the day gets primed and machines start up, there tend to be machines up as people start their day. However PD machines are expensive, and if we have many of them up, and we keep them up for an hour, that's a lot of wasted $$ and resources - I have kept them up for only 15 minutes in the past - enough time to fix typos in your submission and restart and not wait again.

And none of this would be necessary if the ParallelCluster team didn't put a limit on the number of compute resources. If you know of changes coming there, maybe this can be deferred/will-not-implement?

That said, if you supported it now, I would use it today.

cartalla self-assigned this Nov 4, 2024
cartalla commented Nov 5, 2024

I don't know of any plans to increase the number of compute resources. I think it is related to the use of EC2 fleets to manage CRs and limits on the number of fleets, but I'm not sure. All I know is that the current limit is 50, and if you need more than that, your best option is multiple clusters. This was part of my rationale for supporting the use of your virtual desktop as a login node for multiple clusters.

cartalla added a commit that referenced this issue Nov 5, 2024
Right now, UseOnDemand, UseSpot, and DisableSimultaneousMultithreading are global
parameters that affect all instance types.
Add a new configuration option that uses the existing parameters as defaults for
each instance type, but allows them to be configured for each included instance
family and each included instance type.

This allows admins to reduce the number of compute resources by, for example,
only configuring spot for small instance types, but not for larger ones.

Resolves #277
cartalla added a commit that referenced this issue Nov 5, 2024

Add documentation of manual commands for deconfiguring before deleting a cluster.

Resolves #282

=========================================================================

Go through everything and change the original term I used, Submitter, to External Login Node.
Just need to make things consistent.
gwolski commented Nov 6, 2024

I am happy with the way you support multiple clusters. Most of my users will use a specific cluster, as I mentioned above, and I can have different windows open with access to different clusters, so I'm happy.

cartalla added a commit that referenced this issue Nov 6, 2024
cartalla linked a pull request Nov 6, 2024 that will close this issue