
volumewatcher: set maximum batch size for raft update #7907

Merged: 3 commits into master from volumewatcher_batch_size, May 8, 2020

Conversation

@tgross (Member) commented May 8, 2020

For #7838

The volumewatcher has a 250ms batch window so claim updates will not typically be large enough to risk exceeding the maximum raft message size. But large jobs might have enough volume claims that this could be a danger. Set a maximum batch size of 100 messages per batch (roughly 31K), as a very conservative safety/robustness guard.

In the future I'd like to factor this same logic out of the deploymentwatcher batcher so we could share implementations.
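The batch-size guard described above can be sketched roughly as follows. This is a simplified, hypothetical illustration, not Nomad's actual code: plain strings stand in for `structs.CSIVolumeClaimRequest` entries, and the names `maxBatchSize` and `batchClaims` are invented here.

```go
package main

import "fmt"

// maxBatchSize caps how many claim updates go into a single raft
// message (the PR uses 100; this constant name is hypothetical).
const maxBatchSize = 100

// batchClaims splits pending claim updates into slices of at most
// maxBatchSize entries, so no single raft apply can exceed the limit
// no matter how many claims accumulate in one batch window.
func batchClaims(claims []string) [][]string {
	var batches [][]string
	for len(claims) > maxBatchSize {
		batches = append(batches, claims[:maxBatchSize])
		claims = claims[maxBatchSize:]
	}
	if len(claims) > 0 {
		batches = append(batches, claims)
	}
	return batches
}

func main() {
	claims := make([]string, 250)
	for i := range claims {
		claims[i] = fmt.Sprintf("claim-%d", i)
	}
	batches := batchClaims(claims)
	fmt.Println(len(batches))    // 3
	fmt.Println(len(batches[0])) // 100
	fmt.Println(len(batches[2])) // 50
}
```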

@tgross tgross requested review from langmartin and cgbaker May 8, 2020 18:51
@tgross tgross added this to the 0.11.2 milestone May 8, 2020
```go
// Reset the claims list and timer
claims = make(map[string]structs.CSIVolumeClaimRequest)
// Reset the batches list and timer
batches = batches[1:]
```
Contributor commented:

If I'm reading this correctly, the loop relies on work coming in on `b.workCh` to reset the timer, same as before. However, with batching, enough work could arrive during a single `batchDuration` to create multiple batches. If that happens, `timerCh` will signal a claim request for only the first batch of claims; the rest will not be dispatched until more work comes in to reset the timer (which could hypothetically be never).

So either this loop needs to send off all batches, or it needs to reset `timerCh` (to `batchDuration`, or perhaps some appropriate fraction thereof) whenever `len(batches) > 0`.

Contributor commented:

Aside: while I appreciate the elegance of this approach, why not just use a `Ticker` for `timerCh`?

@tgross (Member, Author) replied:

> so either this loop needs to send off all batches or it needs to reset timerCh (to batchDuration or perhaps some appropriate fraction thereof) if len(batches) > 0.

Good catch, that wasn't a problem when we were sending off the whole batch at once!

> aside: while i appreciate the elegance of this approach, why just not use a Ticker for timerCh?

Hm, that's a good point. In the original design (copied right out of deploymentwatcher), a no-op pass would potentially modify some state by swapping out the future. A ticker would make it a little easier to understand at the negligible cost of ticking over and checking the length each pass.
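The ticker-driven design discussed here could look roughly like the following sketch. It is a hypothetical, simplified model (the `batcher` type, `maxBatch` constant, and method names are all invented, and the select loop is modeled as explicit `add`/`tick` calls), showing the key property: each tick flushes one pending batch, so a backlog drains even when no new work ever arrives.

```go
package main

import "fmt"

// maxBatch caps items per batch; kept tiny here for illustration.
const maxBatch = 2

type batcher struct {
	batches [][]string
}

// add appends work to the newest batch, starting a new one when the
// current batch is full -- the workCh case of the select loop.
func (b *batcher) add(w string) {
	if n := len(b.batches); n > 0 && len(b.batches[n-1]) < maxBatch {
		b.batches[n-1] = append(b.batches[n-1], w)
		return
	}
	b.batches = append(b.batches, []string{w})
}

// tick flushes the oldest pending batch, as the ticker case of the
// select loop would; it returns nil on a no-op pass with nothing
// pending, which costs only a length check.
func (b *batcher) tick() []string {
	if len(b.batches) == 0 {
		return nil
	}
	out := b.batches[0]
	b.batches = b.batches[1:]
	return out
}

func main() {
	b := &batcher{}
	for i := 0; i < 5; i++ {
		b.add(fmt.Sprintf("claim-%d", i))
	}
	// Three ticks drain the whole backlog with no further work.
	fmt.Println(len(b.tick()), len(b.tick()), len(b.tick())) // 2 2 1
	fmt.Println(b.tick() == nil)                             // true
}
```

In a real loop the `tick` case would be driven by `time.Ticker.C` (with `Stop` deferred, per the review note below about stopping the ticker), trading a cheap periodic wakeup for the trickier resettable-timer bookkeeping.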

@tgross (Member, Author) replied:

I've updated with these changes. How's that look?

@cgbaker (Contributor) left a comment:

A concern that all batches may be left unsent (or not sent in a timely manner).
(Will mark approve and "take my answer off the air".)

@cgbaker (Contributor) left a comment:

👍 (minor change to stop the ticker)

@tgross tgross merged commit a334ba3 into master May 8, 2020
@tgross tgross deleted the volumewatcher_batch_size branch May 8, 2020 20:54
github-actions bot commented Jan 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 6, 2023