Memory oversubscription #10247
Conversation
@@ -988,7 +993,7 @@ func (tr *TaskRunner) buildTaskConfig() *drivers.TaskConfig {
 		Resources: &drivers.Resources{
 			NomadResources: taskResources,
 			LinuxResources: &drivers.LinuxResources{
-				MemoryLimitBytes: taskResources.Memory.MemoryMB * 1024 * 1024,
+				MemoryLimitBytes: memoryLimit * 1024 * 1024,
Updating the Linux resources is intended to ease driver implementation and adoption of the feature: drivers that use resources.LinuxResources.MemoryLimitBytes don't need to be updated.
Drivers that use NomadResources will need to be updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.
I don't know the breakdown of how external 3rd-party drivers check the memory limit, but I'm happy to change the default.
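For context, a minimal sketch of the fallback this implies (the helper name and signature are illustrative assumptions, not the PR's actual code); the diff above then multiplies the chosen value by 1024*1024 to get MemoryLimitBytes:

```go
// memoryLimitMB picks the limit a driver should enforce: the task's
// memory_max when one is set, otherwise the reserved memory, which
// matches today's behavior for jobs that don't opt in.
func memoryLimitMB(memoryMB, memoryMaxMB int64) int64 {
	if memoryMaxMB > 0 {
		return memoryMaxMB
	}
	return memoryMB
}
```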
> Drivers that use NomadResources will need to be updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.

So if they don't get updated, they'll just end up setting their limit equal to the memory field value, just as they do today? They just end up ignoring memory_max?
From a customer/community impact standpoint, the two I'd worry the most about are containerd and podman. Also, do we want to update qemu to take whichever is greater?
Yes, the failure mode is ignoring memory_max and behaving like today. I'm researching soft vs hard limits a bit now, and will ensure containerd and podman are updated to the recommended pattern.
	}
}

task "cgroup-fetcher" {
Exec doesn't mount the cgroup filesystem into the exec container - so I needed to have this raw_exec "sidecar" to look up the relevant cgroup and memory limit instead.
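A rough sketch of what such a sidecar boils down to, assuming cgroup v1 and an illustrative cgroup path (this is not the PR's actual test code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cgroupMemoryLimitBytes reads the hard memory limit the kernel enforces
// for the given cgroup under the v1 memory controller.
func cgroupMemoryLimitBytes(cgroup string) (int64, error) {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory/" + cgroup + "/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	// "nomad/example-task" is a placeholder; the real sidecar discovers the
	// sibling task's cgroup at runtime.
	limit, err := cgroupMemoryLimitBytes("nomad/example-task")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("memory limit: %d bytes\n", limit)
}
```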
Clever. Out of scope for this PR, but should we be mounting that filesystem in the exec container?
Yes, not sure why we didn't. Exec is hopelessly behind other drivers in goodness - it may make sense to combine all of that in an exec-v2 refresh.
nomad/plan_normalization_test.go
@@ -62,5 +62,5 @@ func TestPlanNormalize(t *testing.T) {
 	}

 	optimizedLogSize := buf.Len()
-	assert.True(t, float64(optimizedLogSize)/float64(unoptimizedLogSize) < 0.62)
+	assert.Less(t, float64(optimizedLogSize)/float64(unoptimizedLogSize), 0.63)
The compression value needed to be raised to account for the new MemoryMaxMB field. It seems like a pretty odd test that will effectively fail any time we add fields to allocs. It's nice to keep track of the value over time, but I don't know of a better way to track it, so I just changed the value here.
if r.MemoryMaxMB != 0 && r.MemoryMaxMB < r.MemoryMB {
	mErr.Errors = append(mErr.Errors, fmt.Errorf("MemoryMaxMB value (%d) should be larger than MemoryMB value (%d)", r.MemoryMaxMB, r.MemoryMB))
}
The only validation we do for MemoryMaxMB is that it needs to be equal to or higher than MemoryMB. The scheduler may place the alloc on a client with less memory than MemoryMaxMB, and the client may run it.
It's unclear what the behavior should be: ideally, we prioritize placing the job's allocations on clients that exceed MemoryMaxMB, but IMO they should still run even if the only nodes available meet the basic memory requirement but not the max one. Also, I suspect some operators will set high values as an optimistic "just be lenient and give me some excess memory" rather than setting max values through rigorous analysis and experimentation.
This is an interesting question - when clusters start to fill up, there's an incentive to set Memory absurdly low to increase your chance of getting scheduled, then lean on MemoryMax for resources to actually run. Should MemoryMax at least feed into Quota?
if delta.MemoryMaxMB != 0 {
	a.MemoryMaxMB += delta.MemoryMaxMB
} else {
	a.MemoryMaxMB += delta.MemoryMB
}
The updates to AllocatedMemoryResource tracking aren't strictly needed. I'm adding them for consistency and to ease having the scheduler consider MemoryMaxMB in the future.
Devices     []*RequestedDevice `hcl:"device,block"`
CPU         *int               `hcl:"cpu,optional"`
MemoryMB    *int               `mapstructure:"memory" hcl:"memory,optional"`
MemoryMaxMB *int               `mapstructure:"memory_max" hcl:"memory_max,optional"`
In this iteration, I've opted to simply add a memory_max field in the job spec, with memory remaining as the "reserve"/base memory requirement. Happy to reconsider this and use an alternative name for the "base", e.g. memory_reserve or memory_required?
I considered memory_min - but I find it ambiguous. min indicates the minimum memory a task uses rather than how much memory we should reserve/allocate for the task.
Looks like that never really got resolved on the RFC, but I'm totally 👍 for this. It avoids any migration issues later, too.
waiting for this since #2771 ;)
Solid work here. I want to take a second pass through it but I want to press Submit on this review so you can answer any questions async. Also, do we need to do anything here in OSS to pass the max memory to quota stack checking in ENT?
(It's kind of a shocking amount of plumbing code required!)
Force-pushed from 1b34849 to 34ea86e
Start tracking a new MemoryMaxMB field that represents the maximum memory a task may use in the client. This allows tasks to specify a memory reservation (to be used by the scheduler when placing the task) but use excess memory on the client if the client has any. This commit adds the server tracking for the value, and ensures that allocations' AllocatedResources fields include the value.
This commit updates the API to pass the MemoryMaxMB field, and the CLI to show the max set for the task. Also, start parsing MemoryMaxMB in HCL2, as it's set by tags. A sample CLI output; note the additional `Max:` for "task":

```
$ nomad alloc status 96fbeb0b
ID = 96fbeb0b-a0b3-aa95-62bf-b8a39492fd5c
[...]

Task "cgroup-fetcher" is "running"
Task Resources
CPU        Memory         Disk     Addresses
0/500 MHz  32 MiB/20 MiB  300 MiB

Task Events:
[...]

Task "task" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  176 KiB/20 MiB  300 MiB
Max: 30 MiB

Task Events:
[...]
```
Allow specifying the `memory_max` field in HCL under the resources block. Though HCLv1 is deprecated, I've updated it to ease our testing.
Use the MemoryMaxMB as the LinuxResources limit. This is intended to ease driver implementation and adoption of the feature: drivers that use `resources.LinuxResources.MemoryLimitBytes` don't need to be updated. Drivers that use NomadResources will need to be updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.
Linux offers soft memory limits: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/memory.html#soft-limits and https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html?highlight=memory.low . We can set soft memory limits through libcontainer `Resources.MemoryReservation`: https://pkg.go.dev/github.com/opencontainers/[email protected]/libcontainer/configs#Resources
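A hedged sketch of how the reserved/max split could map onto that libcontainer struct (package and function names are illustrative, not the PR's actual code):

```go
package driver

import "github.com/opencontainers/runc/libcontainer/configs"

// memoryResources maps Nomad's two memory values onto cgroup limits:
// memory_max becomes the hard limit and memory becomes the soft limit
// (memory.soft_limit_in_bytes on cgroup v1 / memory.low on v2). Without
// a memory_max, only the hard limit is set, matching today's behavior.
func memoryResources(memoryMB, memoryMaxMB int64) *configs.Resources {
	const mb = 1024 * 1024
	res := &configs.Resources{}
	if memoryMaxMB > 0 {
		res.Memory = memoryMaxMB * mb         // hard limit
		res.MemoryReservation = memoryMB * mb // soft limit
	} else {
		res.Memory = memoryMB * mb
	}
	return res
}
```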
Force-pushed from 34ea86e to 43549b4
I've updated the PR by rebasing to address merge conflicts with the core pinning changes. Also, I added a change so that we set a soft memory limit for the exec/java task cgroups.
LGTM!
There's a raft leadership test failure currently, but it may be an unrelated flake... not sure I've seen that one in particular but the servers don't look like they can reach each other, so that doesn't seem relevant to this PR.
{
	Type: DiffTypeEdited,
	Name: "Resources",
	Fields: []*FieldDiff{
Looks like we need to add the Cores field here:
=== FAIL: nomad/structs TestTaskDiff/Resources_edited_memory_max_with_context (0.00s)
diff_test.go:7038: case 16: got:
Task "" (Edited):
"Resources" (Edited) {
"CPU" (None): "100" => "100"
"Cores" (None): "0" => "0"
"DiskMB" (None): "100" => "100"
"IOPS" (None): "0" => "0"
"MemoryMB" (None): "100" => "100"
"MemoryMaxMB" (Edited): "200" => "300"
}
want:
Task "" (Edited):
"Resources" (Edited) {
"CPU" (None): "100" => "100"
"DiskMB" (None): "100" => "100"
"IOPS" (None): "0" => "0"
"MemoryMB" (None): "100" => "100"
"MemoryMaxMB" (Edited): "200" => "300"
}
--- FAIL: TestTaskDiff/Resources_edited_memory_max_with_context (0.00s)
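Presumably the fix is to add an entry like the following to the expected `Fields: []*FieldDiff{...}` slice in the test case (values inferred from the failure output above, not copied from the test file):

```go
{
	Type: DiffTypeNone,
	Name: "Cores",
	Old:  "0",
	New:  "0",
},
```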
Should this close #606?
Will there be a way to disable oversubscription? 🤔 Personally, I would like to see this both as an ACL policy option and as a knob under
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR adds support for memory oversubscription to enable better packing and resource utilization. A task may specify two memory limits: a reserved memory limit to be used by the scheduler when placing the allocation, and another, higher limit for actual runtime. This allows job submitters to set lower, less conservative memory resources, while still being able to use excess memory on the client if there is any.
This PR lays the foundation for oversubscription - namely the internal tracking and driver plumbing, but the UX is still in flux and I'll add additional notes.
Proposed UX
A job submitter can configure a task to use excess memory capacity by setting memory_max under the task's resources block; nomad alloc status will report the memory limit.
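For illustration only, a resources block might look like this (values mirror the sample alloc output in the commit notes above; this exact snippet is not taken from the PR):

```hcl
resources {
  cpu        = 500
  memory     = 20 # reserved memory (MB) the scheduler uses for placement
  memory_max = 30 # upper bound (MB) the task may use if the client has excess memory
}
```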
Notes for reviewers and technical implementation
The PR is relatively large (~1k LOC). I've attempted to organize it so that related logical changes are grouped together in separate commits. I'd recommend reviewing the PR by examining the individual commits.
Also, I've added inline comments with technical discussions and design choices to ease discussing them in threads.
Follow-up work post PR
Will create new GitHub issues to track these, but adding them here to set the vision for users as well as to set context for the PR reviewer.
Short term:
Set oom_score_adj so that aggressive tasks may not end up OOM killing other jobs on the nodes.
Longer term: