scheduler: fix a bug where we subtract reserved node resources twice #23386

Merged
pkazmierczak merged 4 commits into main from b-unallocated-cpu-resources
Jun 21, 2024

pkazmierczak
Contributor

@pkazmierczak pkazmierczak commented Jun 19, 2024

Fixes a bug in the NodeResources.Comparable method: CpuShares was computed from usable compute (total minus reserved), whereas the functions that use this field expect total CPU resources and subtract reserved resources themselves. As a result, the reservation was subtracted twice.

Fixes #23371 and #23314

@pkazmierczak pkazmierczak self-assigned this Jun 19, 2024
@pkazmierczak pkazmierczak added this to the 1.8.2 milestone Jun 19, 2024
@pkazmierczak pkazmierczak added the backport/ent/1.7.x+ent (Changes are backported to 1.7.x+ent), backport/ent/1.8.x+ent (Changes are backported to 1.8.x+ent), and backport/1.8.x (backport to 1.8.x release line) labels Jun 19, 2024
@@ -3191,7 +3191,7 @@ func (n *NodeResources) Comparable() *ComparableResources {
 	c := &ComparableResources{
 		Flattened: AllocatedTaskResources{
 			Cpu: AllocatedCpuResources{
-				CpuShares: int64(n.Processors.Topology.UsableCompute()),
+				CpuShares: int64(n.Processors.Topology.TotalCompute()),
Member

The comment in the caller in AllocsFit seems to agree that this should be TotalCompute:

	// Check that the node resources (after subtracting reserved) are a
	// super set of those that are being allocated
	available := node.NodeResources.Comparable()
	available.Subtract(node.ReservedResources.Comparable())
	if superset, dimension := available.Superset(used); !superset {
		return false, dimension, used, nil
	}

But it bothers me that changing this wouldn't also cause some tests to fail. Are we missing tests on AllocsFit that exercise this code path?
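To make the double subtraction concrete, here is a minimal sketch of the arithmetic. The `topology` type, its fields, and `availableCPU` are hypothetical stand-ins rather than Nomad's actual types; only the method names `TotalCompute` and `UsableCompute` come from the diff, and the assumption that usable compute equals total minus reserved is inferred from the PR description.

```go
package main

import "fmt"

// topology is a hypothetical, simplified stand-in for a node's CPU topology.
type topology struct {
	total    int64 // all compute on the node, in MHz
	reserved int64 // compute withheld via client config, in MHz
}

// TotalCompute returns all compute on the node.
func (t topology) TotalCompute() int64 { return t.total }

// UsableCompute returns total compute minus the reserved amount (assumed).
func (t topology) UsableCompute() int64 { return t.total - t.reserved }

// availableCPU mirrors the shape of the caller quoted above: it takes the
// comparable CPU value and subtracts the reserved resources from it.
func availableCPU(comparable, reserved int64) int64 {
	return comparable - reserved
}

func main() {
	topo := topology{total: 4000, reserved: 1000}

	// Before the change: reserved is already removed by UsableCompute,
	// then removed again by the caller.
	fmt.Println(availableCPU(topo.UsableCompute(), topo.reserved)) // prints 2000

	// After the change: reserved is removed exactly once.
	fmt.Println(availableCPU(topo.TotalCompute(), topo.reserved)) // prints 3000
}
```

With these illustrative numbers, a job asking for 3000 MHz would be rejected before the change even though the node has exactly that much unreserved compute.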

Contributor Author

Right, so this is very tricky to test.

I found this bug while chasing AllocsFit and initially just removed the available.Subtract(node.ReservedResources.Comparable()) line, thinking it only subtracts what has already been subtracted. That solves the issue, of course, but it also breaks a lot of tests in the plan applier and scheduler, because we mock nodes and their resources. The only way to test this properly is an e2e test, and that is tricky too: we'd have to create a client with reserved resources, run a job that requests all the remaining available resources, and make sure it succeeds. To do that, we'd have to manipulate the client config in the e2e test (which is tricky) and dynamically generate the jobspec from the available resources of the e2e node. It can be done, but I think it's a lot of work. Unless I'm missing something?

Member

I suspect we could set reserved resources on the mocked nodes on some of those existing scheduler and plan applier tests such that they pass with this change and fail without it.

Contributor Author

OK, finally found the culprit. Every node that has reserved resources set by client config has an additional field, OverrideWitholdCompute, set in its Topology on "legacy" systems:

OverrideWitholdCompute: withheld,

This field was missing from the mocked node in our unit tests. After adding it, tests fail without this PR's change.
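The sketch below shows why the missing field hid the bug in unit tests. It is an illustration under stated assumptions, not the actual Nomad mock or test: `mockTopology`, `withheldMHz` (which plays the role of OverrideWitholdCompute), and `fits` are hypothetical names, and the numbers are made up.

```go
package main

import "fmt"

// mockTopology is a hypothetical, simplified stand-in for a mocked node topology.
type mockTopology struct {
	totalMHz    int64
	withheldMHz int64 // plays the role of OverrideWitholdCompute; 0 when the mock omits it
}

func (t mockTopology) totalCompute() int64  { return t.totalMHz }
func (t mockTopology) usableCompute() int64 { return t.totalMHz - t.withheldMHz }

// fits reports whether askMHz still fits once the caller has subtracted
// reservedMHz from the comparable CPU value.
func fits(comparableMHz, reservedMHz, askMHz int64) bool {
	return comparableMHz-reservedMHz >= askMHz
}

func main() {
	const reserved, ask = 1000, 3000

	// Mock without the withheld field: usable == total, so the old and new
	// code paths agree and no test can tell them apart.
	incomplete := mockTopology{totalMHz: 4000}
	fmt.Println(fits(incomplete.usableCompute(), reserved, ask)) // prints true
	fmt.Println(fits(incomplete.totalCompute(), reserved, ask))  // prints true

	// Mock with the withheld field set: the old path subtracts twice and
	// rejects the ask, while the new path accepts it.
	complete := mockTopology{totalMHz: 4000, withheldMHz: reserved}
	fmt.Println(fits(complete.usableCompute(), reserved, ask)) // prints false
	fmt.Println(fits(complete.totalCompute(), reserved, ask))  // prints true
}
```

Only the second mock distinguishes the two code paths, which matches the observation that the tests fail without the change once the field is added.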

Member

@tgross tgross left a comment

LGTM!

@pkazmierczak pkazmierczak merged commit 8f80bd5 into main Jun 21, 2024
19 checks passed
@pkazmierczak pkazmierczak deleted the b-unallocated-cpu-resources branch June 21, 2024 13:23

github-actions bot commented Jan 2, 2025

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 2, 2025
Development

Successfully merging this pull request may close these issues.

Unallocated CPU resources are not being calculated correctly
2 participants