
[FEATURE] Change RealMemory of compute nodes to match total instance type memory #283

Closed
cartalla opened this issue Nov 5, 2024 · 2 comments · Fixed by #285
Contributor

cartalla commented Nov 5, 2024

Is your feature request related to a problem? Please describe.

By default, PC sets the RealMemory of a compute node to 95% of the total instance type memory.
This is because not 100% of the memory is available to jobs.
When slurmd starts, it reports the amount of free memory, and if that is less than the configured RealMemory it flags an error and marks the node as Drain.
This behavior is modified by the following Slurm config parameter:

SlurmctldParameters=node_reg_mem_percent=75

This allows the node to register with the controller as long as the available memory is at least 75% of the configured RealMemory.
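
As a point of reference, you can compare what slurmd detects on an instance against what the controller has configured for it; the node name below is just a placeholder:

slurmd -C                                                        # on the compute instance: print detected hardware, including RealMemory
scontrol show node queue1-dy-compute-1 | grep -E 'RealMemory'    # on the head node: configured RealMemory and current FreeMem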

This works because slurmd will not configure a job's cgroup to exceed the amount of available real memory.
So even if the scheduler allocates a job that requests more than the available memory, the job will not be allowed to cause an OOM situation and crash the instance.

The current configuration results in unintuitive behavior and wasted memory.
Let's consider an instance with 8 GB of real memory.
PC configures the RealMemory as 8 * 1024 * 0.95 = 7782 MiB = 7.6 GiB.
However, you'd expect to be able to fit two 4 GB jobs on that machine.
But to do that today you'd have to request 3.8 GB for each job or else they won't fit.
If you specify 4 GB then the first job will start an 8 GB instance, not a 4 GB instance, and reserve 4 GB for the job.
The scheduler will see the instance as having only 3.6 GB of memory free for additional jobs.
So the next 4 GB job will start another 8 GB instance.
This uses double the compute resources that you would expect.
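
To make the arithmetic concrete (job.sh is just a placeholder script):

sbatch --mem=4G job.sh     # job 1: allocates 4096 MiB on a new 8 GiB node, leaving 7782 - 4096 = 3686 MiB
sbatch --mem=4G job.sh     # job 2: needs 4096 MiB > 3686 MiB free, so Slurm powers up a second 8 GiB node
sbatch --mem=3891M job.sh  # to pack two jobs per node today, each job can request at most 7782 / 2 = 3891 MiB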

What I propose is to configure each compute resource as having 100% RealMemory and set node_reg_mem_percent to a value that allows the compute nodes to register successfully.
This should allow jobs to use round numbers for memory requests and allow more efficient instance utilization.
Job memory requests already have to exceed a job's actual memory use to prevent it from running out of memory,
so the fact that slightly less than 100% of the memory is actually available shouldn't be an issue.
Note that this is already the case anyway.
If anything, this should make more memory available to jobs.
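
A minimal sketch of what this could look like in the generated slurm.conf, assuming an 8 GiB (8192 MiB) instance type; the node name and the 75% threshold are illustrative, not necessarily the values the change will use:

# Advertise the full instance memory to the scheduler
NodeName=queue1-dy-compute-[1-10] RealMemory=8192 State=CLOUD
# Allow nodes to register even though the OS and services consume some of that memory
SlurmctldParameters=node_reg_mem_percent=75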


gwolski commented Nov 6, 2024

I fully endorse this proposal. I already do this with my submission scripts or in the way I specify the submission.
I presently have users spin up a machine by specifying the od-XX-gb partition and the core count needed, along with --mem=0 and --exclusive; that way they get the machine they want and all of its memory.
This proposal makes it a bit easier for me: now I can ask users to request the real memory sizes of the machine types, i.e. 8g, 16g, etc.

This proposal requires that a spec of --mem=8g will get an 8 GB machine, i.e. PC (or Slurm) sees all 8 GB.
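
For example (the partition name and job script are placeholders matching the setup described above):

sbatch --partition=od-8-gb --exclusive --mem=0 job.sh   # today: take the whole node and all of its memory
sbatch --partition=od-8-gb --mem=8G job.sh              # with this change: just request the machine's real memory size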

I'm all for this.

cartalla added a commit that referenced this issue Nov 6, 2024
The default is to set it to 95%, which makes it difficult to correctly request
memory because you could submit a job requesting 4 GB and wind up running on an
8 GB machine.

This just makes it more intuitive.

Resolves #283
cartalla linked a pull request Nov 6, 2024 that will close this issue
Contributor Author

cartalla commented Nov 6, 2024

Turned out to be a super easy change and now things work the way that I expect.

You still might want to use the memory-based partitions to make sure that a job doesn't land on a larger machine.
Slurm will not consider powered-down cloud nodes if a running node has the required resources, which may or may not be what you want.
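
For example (the partition name follows the od-XX-gb scheme mentioned above and is just illustrative):

sbatch --mem=8G job.sh                      # may land on a larger node that is already running, since Slurm prefers powered-up nodes
sbatch --partition=od-8-gb --mem=8G job.sh  # pins the job to the partition for the intended instance size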
