Improve documentation for ClusterConfig section #272

Closed
gwolski opened this issue Oct 27, 2024 · 3 comments · Fixed by #274

@gwolski

gwolski commented Oct 27, 2024

I'm trying to increase the timeout of the ScaledownIdletime.

I added the following ClusterConfig/SlurmSettings/ScaledownIdletime to my cluster config file:

ParallelClusterConfig:
  Version: 3.11.1
  Architecture: x86_64
  Image:
    Os: rocky8
    CustomAmi: ami-0d68c6538XXXXXXX  # pcluster-3-11-1-Rocky-8-x86-64-ami-0d002XXXXXXXXXX 2024-10-24T03-12-55.412Z
  DisableSimultaneousMultithreading: true
  ClusterConfig:
    SlurmSettings:
      ScaledownIdletime: 20

I've discovered this is the wrong syntax. Your documentation only says that ClusterConfig is a dict. I looked at config_schema.py and there is not much to go on there either. I've tried multiple variations, including:

ClusterConfig:
  Scheduling: 
    SlurmSettings:
      ScaledownIdletime: 20

I just can't figure it out. This latter example at least makes the Python code throw an error.

I have been able to get the simple case of tags to work:

ClusterConfig:
  Tags:
    - Key: Project
      Value: amazing

Can you please add some examples (specifically my need) to your documentation and this issue?

How do I add a section in the config file to change the ScaledownIdletime?

cartalla self-assigned this Oct 28, 2024
@cartalla
Contributor

Your code looks correct. I'm testing it right now.

ClusterConfig:
  Scheduling: 
    SlurmSettings:
      ScaledownIdletime: 20

@cartalla
Contributor

cartalla commented Oct 28, 2024

I made the change in my configuration.
I downloaded the config file that gets generated and confirmed that the setting shows up in the ParallelCluster config file.
I updated my cluster and it successfully updated the config and ParallelCluster.
I checked slurm_parallelcluster.conf and confirmed that SuspendTime is set to 1200 seconds, which is 20 minutes.
So I think that it is working.
I was initially a little confused because there is no ScaledownIdletime parameter in slurm.conf.
The slurm parameter is SuspendTime.
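
For anyone checking their own value: the numbers above suggest ScaledownIdletime is given in minutes and shows up as SuspendTime in seconds. A minimal sketch of that arithmetic follows. The function name is hypothetical and this is not ParallelCluster's own code, just the conversion implied by 20 becoming 1200.

```python
def scaledown_idletime_to_suspend_time(minutes: int) -> int:
    """Convert a ScaledownIdletime value (minutes) to the SuspendTime value (seconds).

    Hypothetical helper for illustration only; it assumes a plain
    minutes-to-seconds conversion, as observed in this thread.
    """
    return minutes * 60


suspend_time = scaledown_idletime_to_suspend_time(20)
print(f"SuspendTime={suspend_time}")  # prints "SuspendTime=1200"
```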

cartalla added a commit that referenced this issue Oct 28, 2024
Add pointer to ParallelCluster config file doc.

Add an example of how to set a parameter.

Resolves #272
cartalla linked a pull request Oct 28, 2024 that will close this issue
@gwolski
Author

gwolski commented Oct 28, 2024

PLBKAC. Solved.

Here is the error I was getting. It happens right after the AMI builds section is output (I've copied and pasted a bit of that here to give you context):

"Rocky": {
    "8": {
        "arm64": {},
        "x86_64": {}
    },
    "9": {
        "arm64": {},
        "x86_64": {}
    }
}

}
Traceback (most recent call last):
  File "/proj/work/gwolski/aws-eda-slurm-cluster-3.11.1/source/app.py", line 31, in <module>
    CdkSlurmStack(app, app.node.try_get_context('stack_name'), env=cdk_env,
  File "/users/gwolski/.local/lib/python3.11/site-packages/jsii/_runtime.py", line 118, in __call__
    inst = super(JSIIMeta, cast(JSIIMeta, cls)).__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/proj/work/gwolski/aws-eda-slurm-cluster-3.11.1/source/cdk/cdk_slurm_stack.py", line 143, in __init__
    self.create_parallel_cluster_config()
  File "/proj/work/gwolski/aws-eda-slurm-cluster-3.11.1/source/cdk/cdk_slurm_stack.py", line 2557, in create_parallel_cluster_config
    self.parallel_cluster_config['Scheduling']['Scheduler'] = 'slurm'
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
TypeError: list indices must be integers or slices, not str

Subprocess exited with error 1

I tried so many variants that I must have copied and pasted the wrong code into my issue here. Here is the offending code that caused the above error, which I should have realized is wrong.

ClusterConfig:
  Scheduling: 
    - SlurmSettings:
        ScaledownIdletime: 20

Note the '-' in front of the SlurmSettings. Argh. Damn (tired) user. Never file a ticket when you are tired. Thank you.
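
For anyone who hits the same TypeError, here is a minimal sketch of why the '-' matters. The two dicts are hand-written stand-ins for what a YAML loader would be expected to return for each variant, and set_scheduler is a hypothetical helper that only mimics the assignment shown in the traceback; it is not the project's code.

```python
# With the leading '-', Scheduling becomes a list holding one mapping.
with_dash = {"Scheduling": [{"SlurmSettings": {"ScaledownIdletime": 20}}]}

# Without it, Scheduling is a mapping that can be indexed by key.
without_dash = {"Scheduling": {"SlurmSettings": {"ScaledownIdletime": 20}}}


def set_scheduler(cluster_config):
    """Try the assignment from the traceback; return the error text or None."""
    try:
        cluster_config["Scheduling"]["Scheduler"] = "slurm"
    except TypeError as exc:
        return str(exc)
    return None


dash_error = set_scheduler(with_dash)      # indexing a list with a str fails
ok_error = set_scheduler(without_dash)     # indexing a dict with a str works

print("with '-':   ", dash_error)
print("without '-':", ok_error)
```

The first call reports the same "list indices must be integers or slices, not str" message as above, while the second succeeds and leaves SlurmSettings untouched next to the new Scheduler key.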

I have now used the appropriate code, as you have shown, and I see the correct entry in the YAML file when downloaded with PCUI and also the value SuspendTime=1200 in the slurm_parallelcluster.conf file. All good.
