csi: plugins track jobs in addition to allocations, and use job information to set expected counts #8699

Merged: langmartin merged 20 commits into master from f-csi-expected on Aug 27, 2020

@langmartin langmartin commented Aug 19, 2020

Expected counts are derived from jobs, so we may expect plugins for
which we have no valid fingerprints and therefore no allocations.

  • On job update, update the plugin's job collection and re-count the
    expected instances.

  • The expected count for a system job is the sum of its currently
    running allocations and blocked evals. The blocked evals number
    will improve in accuracy with blocking that accounts for driver
    start time. The re-count is called from
    updateJobSummaryByAllocation.

    This does require keeping the expected count indexed by jobID so
    that we can update on each allocation change.

  • Plugin emptiness accounts for jobs
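
As a rough illustration of the bullets above, here is a minimal sketch of the bookkeeping. It is not the code in this PR: the `Plugin` and `JobDescription` types, their fields, and the `UpsertJob`, `DeleteJob`, `SystemJobExpected`, and `IsEmpty` names are all hypothetical stand-ins chosen for the example. It assumes expected counts are kept per job ID, summed on every change, and that a plugin counts as empty only when it has neither fingerprinted allocations nor jobs.

```go
package main

// JobDescription is a hypothetical record of one job that provides a plugin.
type JobDescription struct {
	Expected int // instances this job is expected to run
}

// Plugin is a simplified, illustrative stand-in; it is not Nomad's
// structs.CSIPlugin and only models the counts discussed above.
type Plugin struct {
	ID                  string
	ControllerJobs      map[string]JobDescription // keyed by job ID
	NodeJobs            map[string]JobDescription // keyed by job ID
	Controllers         map[string]bool           // fingerprinted allocation IDs
	Nodes               map[string]bool           // fingerprinted allocation IDs
	ControllersExpected int
	NodesExpected       int
}

// NewPlugin returns a plugin with all of its maps initialized.
func NewPlugin(id string) *Plugin {
	return &Plugin{
		ID:             id,
		ControllerJobs: map[string]JobDescription{},
		NodeJobs:       map[string]JobDescription{},
		Controllers:    map[string]bool{},
		Nodes:          map[string]bool{},
	}
}

// SystemJobExpected models the rule for system jobs: currently running
// allocations plus blocked evals.
func SystemJobExpected(running, blocked int) int {
	return running + blocked
}

// UpsertJob records (or replaces) a job's expected count, then re-counts.
func (p *Plugin) UpsertJob(jobID string, controller bool, expected int) {
	if controller {
		p.ControllerJobs[jobID] = JobDescription{Expected: expected}
	} else {
		p.NodeJobs[jobID] = JobDescription{Expected: expected}
	}
	p.recount()
}

// DeleteJob removes a job from both collections, then re-counts.
func (p *Plugin) DeleteJob(jobID string) {
	delete(p.ControllerJobs, jobID)
	delete(p.NodeJobs, jobID)
	p.recount()
}

// recount sums the per-job expected counts into the plugin totals.
func (p *Plugin) recount() {
	p.ControllersExpected = 0
	for _, j := range p.ControllerJobs {
		p.ControllersExpected += j.Expected
	}
	p.NodesExpected = 0
	for _, j := range p.NodeJobs {
		p.NodesExpected += j.Expected
	}
}

// IsEmpty reports whether the plugin has no fingerprinted allocations
// and no jobs, so that a plugin with jobs but no fingerprints is kept.
func (p *Plugin) IsEmpty() bool {
	return len(p.Controllers) == 0 && len(p.Nodes) == 0 &&
		len(p.ControllerJobs) == 0 && len(p.NodeJobs) == 0
}
```

In this sketch, calling `UpsertJob` again for the same job ID replaces that job's contribution rather than adding to it, which is why the count is indexed by job ID.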

Closes #8503. See also #7974.

@langmartin langmartin force-pushed the f-csi-expected branch 3 times, most recently from 509e152 to 933524f Compare August 21, 2020 21:57
@langmartin langmartin added this to the 0.12.4 milestone Aug 22, 2020
@langmartin langmartin marked this pull request as ready for review August 22, 2020 01:54
@langmartin langmartin requested a review from tgross August 22, 2020 01:54

@tgross tgross left a comment

Overall this looks like a good change. I've left a question about how the scheduler considers the plugin updates from job summaries.

The test code we've added here covers the low-level bits pretty well, but I feel like with the size of this changeset we'd benefit from having test coverage at the "boundary" of the RPC endpoints in the nomad package (maybe at nomad/csi_endpoint.go or nomad/job_endpoint.go?)

comment in english

Co-authored-by: Tim Gross <[email protected]>
@langmartin

Well, we'd definitely benefit from the bigger test! This change introduces an edge case where a plugin has been created by a job and that job is deleted before any allocation fingerprints make it back to the state store. The plugin will exist, but not have any allocations to use for GC. Working on a fix now.

@langmartin langmartin requested a review from tgross August 25, 2020 01:14

@tgross tgross left a comment

Assuming we've done some end-to-end testing of this, it looks ok to me. I've left a question about the expected behavior around updates.

@langmartin

> Assuming we've done some end-to-end testing of this, it looks ok to me. I've left a question about the expected behavior around updates.

End to end is looking good.

@langmartin langmartin merged commit dd7016b into master Aug 27, 2020
@langmartin langmartin deleted the f-csi-expected branch August 27, 2020 21:20
tgross added a commit that referenced this pull request Apr 15, 2022
The CSI HTTP API has to transform the CSI volume to redact secrets,
remove the claims fields, and to consolidate the allocation stubs into
a single slice of alloc stubs. This was done manually in #8590, but
that is a large amount of code that has proven very bug prone
(see #8659, #8666, #8699, #8735, and #12150) and requires updating
lots of code every time we add a field to volumes or plugins.

In #10202 we introduce encoding improvements for the `Node` struct
that allow a more minimal transformation. Apply this same approach to
serializing `structs.CSIVolume` to API responses.

Also, the original reasoning behind #8590 for plugins no longer holds
because the counts are now denormalized within the state store, so we
can simply remove this transformation entirely.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022
Development

Successfully merging this pull request may close these issues.

CSI plugin expected counts reflect expected instance counts more accurately