
Enable spot for only certain InstanceTypes #277

Closed
gwolski opened this issue Oct 28, 2024 · 4 comments · Fixed by #284
gwolski commented Oct 28, 2024

I'm trying to set up one cluster for all my EDA tool and SW team needs. This might be too grand a goal, and I might ultimately need multiple clusters for different "tool" types, i.e. simulation, physical design, software team. But let me document this request and see if you find value in it or have other suggestions.

I like using spot for my simulation jobs, which require smaller machines with less memory. ParallelCluster limits us to 50 compute resources, so if we generate a spot instance for every on-demand instance, we can only have 25 instance types. Here is what I presently have enabled:

    - r7a.medium   # cpu=1 mem=8
    - m7i.large    # cpu=1 mem=8 hyper-threading turned off
    - r7i.large    # cpu=1 mem=16 hyper-threading turned off
    - m7a.large    # cpu=2 mem=8
    - c7i.xlarge   # cpu=2 mem=8 hyper-threading turned off
    - r7a.large    # cpu=2 mem=16
    - m7a.xlarge   # cpu=4 mem=16
    - r7i.xlarge   # cpu=2 mem=32 hyper-threading turned off
    - r7a.xlarge   # cpu=4 mem=32
    - r7i.2xlarge  # cpu=4 mem=64 hyper-threading turned off
    - c7a.2xlarge  # cpu=8 mem=16
    - m7a.2xlarge  # cpu=8 mem=32
    - c7i.4xlarge  # cpu=8 mem=32 hyper-threading turned off
    #- r7a.2xlarge  # cpu=8 mem=64
    - m7i.4xlarge  # cpu=8 mem=64 hyper-threading turned off
    - r7i.4xlarge  # cpu=8 mem=128 hyper-threading turned off
    - c7a.4xlarge  # cpu=16 mem=32
    #- c7i.8xlarge  # cpu=16 mem=64 hyper-threading turned off
    - m7a.4xlarge  # cpu=16 mem=64
    - r7a.4xlarge  # cpu=16 mem=128
    - r7i.8xlarge  # cpu=16 mem=256 hyper-threading turned off
    - c7a.8xlarge  # cpu=32 mem=64
    - m7a.8xlarge  # cpu=32 mem=128
    - r7a.8xlarge  # cpu=32 mem=256
    - r7a.12xlarge # cpu=48 mem=384
    - c7a.16xlarge # cpu=64 mem=128
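
The arithmetic behind the 50-compute-resource cap can be sketched as follows. This is a hypothetical helper, not part of ParallelCluster or any tool; it just checks the budget described above (one compute resource per instance type per enabled purchase option):

```python
# Sanity-check the ParallelCluster compute-resource (CR) budget.
# Each enabled instance type consumes one CR per purchase option
# (on-demand, spot); ParallelCluster caps CRs at 50.
MAX_COMPUTE_RESOURCES = 50

def compute_resource_count(num_instance_types: int,
                           use_on_demand: bool = True,
                           use_spot: bool = True) -> int:
    """CRs consumed when every type gets every enabled purchase option."""
    options = int(use_on_demand) + int(use_spot)
    return num_instance_types * options

# 25 types with both options hits the cap exactly:
assert compute_resource_count(25) == MAX_COMPUTE_RESOURCES
# Spot on only the smaller types (say 13 of 25) leaves headroom:
assert compute_resource_count(25, use_spot=False) + 13 <= MAX_COMPUTE_RESOURCES
```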

Is it possible to extend the configuration language to allow me to specify which machines are enabled for spot? Or maybe another section? I don't want to have another cluster for spot, as my understanding is the users would have to bounce between clusters with different module loads.

Maybe I just build a "small" machine cluster and a "big" machine cluster and not worry about it, as most PD engineers who use the bigger machines never use the smaller ones. This is what I've done in the past.

Your thoughts are appreciated as you have seen more clusters in action than I have...
Arguably this should also be a ticket with the ParallelCluster team to raise or remove the CR limit - I don't see why there is a hard limit.

@cartalla
Contributor
What would the config look like for that? I think that the request makes sense. Right now spot and on-demand are all or nothing. I'm thinking that each entry in the Include section could be extended to be either an instance type/family, or a dictionary with additional configuration that is specific to the type/family.

For example:

slurm:
  InstanceConfig:
    UseOnDemand: true
    UseSpot: false
    Include:
      InstanceTypes:
        - r7a.medium: {UseSpot: true}
        - m7i.large: {UseSpot: true}
        - r7a.12xlarge
        - c7a.16xlarge

gwolski commented Nov 3, 2024

Your proposal is elegant. I think I would use it.
I ultimately might break out my clusters into cluster-sim, cluster-pd, cluster-sw though. The biggest reason being the Idletimeout value. For the smaller machines that are used by many verification people, I would set the idletimeout to be 60 minutes - that way once the day gets primed and machines start up, there tend to be machines up as people start their day. However PD machines are expensive, and if we have many of them up, and we keep them up for an hour, that's a lot of wasted $$ and resources - I have kept them up for only 15 minutes in the past - enough time to fix typos in your submission and restart and not wait again.

And none of this would be necessary if the ParallelCluster team didn't put a limit on the number of compute resources. If you know of changes coming there, maybe this can be deferred/will-not-implement?

That said, if you supported it now, I would use it today.

cartalla self-assigned this Nov 4, 2024
cartalla commented Nov 5, 2024

I don't know of any plans to increase the number of compute resources. I think it is related to the use of EC2 fleets to manage CRs and limits on the number of fleets, but I'm not sure. All I know is that the current limit is 50, and if you need more than that, your best option is multiple clusters. This was part of my rationale for supporting the use of your virtual desktop as a login node for multiple clusters.

cartalla added a commit that referenced this issue Nov 5, 2024
Right now, UseOnDemand, UseSpot, and DisableSimultaneousMultithreading are global
parameters that affect all instance types.
Add a new configuration option that uses the existing parameters as defaults for
each instance type, but allows them to be configured for each included instance
family and each included instance type.

This allows admins to reduce the number of compute resources by, for example,
only configuring spot for small instance types, but not for larger ones.

Resolves #277
cartalla added a commit that referenced this issue Nov 5, 2024

Add documentation of manual commands for deconfiguring before deleting a cluster.

Resolves #282

=========================================================================

Go through everything and change the original term I used, Submitter, to External Login Node.
Just need to make things consistent.
gwolski commented Nov 6, 2024

I am happy with the way you support multiple clusters. Most of my users will use a specific cluster, as I mentioned above, and I can have different windows open with access to different clusters, so I'm happy.

cartalla added a commit that referenced this issue Nov 6, 2024
cartalla linked a pull request Nov 6, 2024 that will close this issue