stats: Improve performance of clearing scopes and histograms by batching them #15876
Conversation
…ing them. Previously a post was required per histogram or scope, per thread. Batching greatly reduces the overhead of large config updates, where tens of thousands of histograms and scopes can be queued for release in short order. Co-authored-by: Joshua Marantz <[email protected]> Signed-off-by: Josh Tway <[email protected]>
Hi @jtway, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
Looks like you have an issue with the test for the main file you are modifying. You might want to iterate locally with:
prior to pushing to CI.
@jmarantz it didn't fail locally, but I did see this. A colleague was able to recreate it, so I will be looking in the morning.
It makes sense that this speeds things up significantly, but not by an order of magnitude, by saving a lot of indirect function calls. I don't think it reduces the number of times we need to take a lock, but it does reduce the indirect function calls through post().
The fact that control-plane operations take tens of seconds still seems bad to me, and makes me wonder if we need more systems-level thinking about this problem of scale.
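To make the cost difference concrete, here is a minimal, self-contained sketch (Dispatcher and clearScopeFromCaches are stand-ins, not Envoy's real types): deleting N scopes previously meant one post() per scope per thread, while the batched version issues a single post() carrying all the ids.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct Dispatcher {
  int posts = 0;
  void post(std::function<void()> cb) {
    ++posts; // in Envoy this would queue `cb` onto another thread's event loop
    cb();    // run inline so the sketch stays self-contained
  }
};

void clearScopeFromCaches(uint64_t id) {
  (void)id; // would evict this scope's entries from a thread-local cache
}

int main() {
  const std::vector<uint64_t> deleted_scope_ids = {1, 2, 3, 4, 5};
  Dispatcher per_item, batched;

  // Before: one post per scope id, so N deletions cost N cross-thread hops.
  for (uint64_t id : deleted_scope_ids) {
    per_item.post([id] { clearScopeFromCaches(id); });
  }

  // After: a single post carries the whole batch.
  auto batch = std::make_shared<std::vector<uint64_t>>(deleted_scope_ids);
  batched.post([batch] {
    for (uint64_t id : *batch) {
      clearScopeFromCaches(id);
    }
  });

  std::cout << "posts per-item: " << per_item.posts // 5
            << ", batched: " << batched.posts       // 1
            << "\n";
  return 0;
}
```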
```cpp
// This can happen from any thread. We post() back to the main thread which will initiate the
// cache flush operation.
if (!shutting_down_ && main_thread_dispatcher_) {
  const uint64_t scope_id = scope->scope_id_;
  // Switch to batching of clearing scopes as mentioned in:
  // https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a
```
I think we don't need to call out the gist anymore since we have incorporated the code.
Will remove as I find and fix why the integration test failed in the pipeline.
```cpp
auto central_caches = std::make_shared<std::vector<CentralCacheEntrySharedPtr>>();
{
  Thread::LockGuard lock(lock_);
  scope_ids->swap(scopes_to_cleanup_);
```
I always liked this swap() pattern, but @antoniovicente has convinced me it's faster to do:
```cpp
*scope_ids = std::move(scopes_to_cleanup_);
scopes_to_cleanup_.clear();
```
as there is no chance in this context that *scope_ids had anything in it. You can use that in place of all the swap()s added in this PR.
Note you do need that clear(), because std::move does not strictly guarantee what state it leaves the source in, but it should be a fast no-op.
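As a self-contained illustration of the suggested pattern (plain std types stand in for the PR's members, so this is a sketch rather than the actual change):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

int main() {
  std::vector<uint64_t> scopes_to_cleanup = {10, 11, 12};
  auto scope_ids = std::make_shared<std::vector<uint64_t>>();

  // Move-assign steals the buffer outright; a swap() would also hand the
  // (empty) destination contents back to the source for no benefit here.
  *scope_ids = std::move(scopes_to_cleanup);
  // A moved-from vector is valid but in an unspecified state, so reset it
  // explicitly. On common implementations this clear() is a no-op.
  scopes_to_cleanup.clear();

  assert(scope_ids->size() == 3);
  assert(scopes_to_cleanup.empty());
  return 0;
}
```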
Interesting, never thought about using something like clear() to return something like this to a "known" state after std::move.
Yeah I believe that is the only legal operation you can do on an object that has just had its contents moved...other than destruct it.
The pattern used elsewhere is:
```cpp
*scope_ids = std::move(scopes_to_cleanup_);
ASSERT(scopes_to_cleanup_.empty());
```
See envoy/source/common/event/dispatcher_impl.cc, line 370 in c6c2a87:
```cpp
// post_callbacks_ should be empty after the move.
```
@jmarantz I agree the time control-plane operations take still leaves room for much improvement. I am viewing this as two problems: 1. Address how long large config updates take. 2. Address why a single VirtualHost being updated, when there are > 100K VirtualHosts present, ends up being a large config update. We are still looking into the latter, but perhaps I should open an issue for that as well. In our particular case, it appears the problem comes from every VirtualHost being recreated even when a single VirtualHost is present in a VHDS update. The existing config is then destroyed after the new config is propagated, which causes the VirtualHostImpl destructor to run, and in turn the ScopeImpl destructor, for every VirtualHost in the previous config. I could be wrong there, but that appears to be the case.
@jmarantz getting a few compounding problems, even when trying the fix. How I couldn't recreate it yesterday, I am unsure. I am in the process of moving, so it may take me a little time to make sure I get the right fix in here. I know the goal is to try to make sure these get wrapped up in < 7 days. Hopefully I will be able to wrap it up by then, but if not, I hope that is okay here.
/wait
Back from my PTO now, but a couple unrelated issues have needed my attention. I should be looking to wrap this up in the next couple days. I'm currently having a couple issues with the integration test due to the addition of runOnAllThreads in the shutdownThreading logic.
… scopes and histograms by batching them Signed-off-by: Josh Tway <[email protected]>
Looks like this broke other tests; guess that's what I get for focusing on this. The problem is that usually shutdownGlobalThreading has already been called, and that means we can't use runOnAllThreads. I'm thinking this through a little.
ping again when ready. /wait
…en shutting threading down Signed-off-by: Josh Tway <[email protected]>
Signed-off-by: Josh Tway <[email protected]>
Signed-off-by: Josh Tway <[email protected]>
…ading calls. This way we cover both when it has been called, and when it has not Signed-off-by: Josh Tway <[email protected]>
Nice! Just a few nits.
```cpp
const uint64_t scope_id = scope->scope_id_;
// Switch to batching of clearing scopes. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts
```
nit: end sentences with period.
Will fix
```cpp
// This can happen from any thread. We post() back to the main thread which will initiate the
// cache flush operation.
if (!shutting_down_ && main_thread_dispatcher_) {
  const uint64_t scope_id = scope->scope_id_;
  // Switch to batching of clearing scopes. This greatly reduces the overhead when there are
```
nit: s/Switch to batching of clearing scopes/Clear scopes in a batch/ as the reader of this won't be anchoring on the previous implementation.
will change
Signed-off-by: Josh Tway <[email protected]>
…ived from jmarantz Signed-off-by: Josh Tway <[email protected]>
I think this looks good now, but I'm biased since I helped write some of this. The big question here is whether the improved performance justifies the increased complexity, but the benefits are outlined in the description, so that's a good start. We have also seen similar issues internally, so we are hoping this will be submittable.
/assign-from @envoyproxy/senior-maintainers
@envoyproxy/senior-maintainers assignee is @snowp
Thanks @jmarantz
Thanks, I think this batching logic makes sense. Can you explain why we have all these changes to shutdownThreading et al.? It's not clear from my first pass.
```cpp
if (tls_.has_value()) {
  ASSERT(tls_->isShutdown());
}
```
Maybe do ASSERT(!tls_.has_value() || tls_->isShutdown()) instead? Right now this turns into an empty if block in NDEBUG builds (which would get optimized out, so it doesn't matter that much).
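A sketch of the suggested simplification, with std::optional and a plain assert standing in for the absl::optional member and Envoy's ASSERT macro:

```cpp
#include <cassert>
#include <optional>

#define ASSERT(x) assert(x) // stand-in; compiles away under NDEBUG like Envoy's

struct Tls {
  bool isShutdown() const { return true; }
};

int main() {
  std::optional<Tls> tls_ = Tls{};

  // Before: under NDEBUG the ASSERT vanishes, leaving an empty if block
  // (harmless, since the optimizer removes it, but noisy to read).
  if (tls_.has_value()) {
    ASSERT(tls_->isShutdown());
  }

  // After: one expression; short-circuiting keeps the dereference safe.
  ASSERT(!tls_.has_value() || tls_->isShutdown());
  return 0;
}
```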
will change
```cpp
}

// We can't call runOnAllThreads here as global threading has already been shutdown.
// It is okay to simply clear the scopes and central cache entries to cleanup.
```
Can you include an explanation for why this is okay to do?
Will update. How is:
```cpp
// We can't call runOnAllThreads here as global threading has already been shutdown. It is okay
// to simply clear the scopes and central cache entries here as they will be cleaned up during
// thread local data cleanup in InstanceImpl::shutdownThread().
```
```cpp
// Capture all the pending histograms in a local, clearing the list held in
// this. Once this occurs, if a new histogram is deleted, a new post will be
```
This comment is a bit hard to read because talking about `this` makes the sentence seem incomplete. Maybe rephrase? Something like "Move the histograms pending cleanup into a local variable. Future histogram deletions will be batched until the next time this function is called."?
Will reword
```diff
@@ -37,9 +37,11 @@ class ThreadLocalStorePerf {
  for (auto& stat_name_storage : stat_names_) {
    stat_name_storage->free(symbol_table_);
  }
  store_.shutdownThreading();
```
What's the purpose of this change?
In production code we were shutting down threading this way, where ThreadLocal::shutdownGlobalThreading() was called first. That means in shutdownThreading() we can't then go through the runOnAllThreads sequence to clean up; we just have to remove the elements directly.
This change, and others in some tests, are to make tests shut down threading in the same order as production code. The ASSERT(tls_->isShutdown()) validates the claim made below it:
```cpp
// We can't call runOnAllThreads here as global threading has already been shutdown.
```
and a lot of the trivial test changes are to avoid having that assert crash in unit tests.
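A self-contained sketch of that ordering (FakeTls and FakeStore are stand-ins for ThreadLocal::InstanceImpl and ThreadLocalStoreImpl; only the call order matters):

```cpp
#include <cassert>

// Minimal stand-in for the thread-local system: once global threading is
// shut down, cross-thread posts (runOnAllThreads) are no longer allowed.
struct FakeTls {
  bool shutdown_ = false;
  void shutdownGlobalThreading() { shutdown_ = true; } // bars cross-thread posts
  void shutdownThread() {}                             // per-thread teardown
  bool isShutdown() const { return shutdown_; }
};

struct FakeStore {
  explicit FakeStore(FakeTls& tls) : tls_(tls) {}
  void shutdownThreading() {
    // Mirrors the PR's ASSERT: global threading must already be shut down,
    // so pending scopes/histograms are cleared directly instead of via
    // runOnAllThreads().
    assert(tls_.isShutdown());
  }
  FakeTls& tls_;
};

int main() {
  FakeTls tls;
  FakeStore store(tls);
  tls.shutdownGlobalThreading(); // 1. matches production shutdown order
  store.shutdownThreading();     // 2. takes the direct-cleanup path
  tls.shutdownThread();          // 3. thread-local data cleanup last
  return 0;
}
```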
```diff
@@ -189,7 +195,8 @@ TEST_P(AdminStatsTest, StatsAsJson) {
  })EOF";

  EXPECT_THAT(expected_json, JsonStringEq(actual_json));
  store_->shutdownThreading();
  shutdownThreading();
  ENVOY_LOG_MISC(error, "end of StatsAsJson");
```
I don't think this log is necessary
will remove
```cpp
// Clear scopes in a batch. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts.
bool need_post = scopes_to_cleanup_.empty();
```
Mind including an explanation of why we post when this is empty? I assume this is the batching mechanism, so that we don't post while there is already a post in progress, but some detail would be nice to have here.
will do.
Updating releaseScopesCrossThread to have some of the same info as releaseHistogramCrossThread. This should cover this and the additional comment below. :)
```cpp
// clearHistogramsFromCaches. If a new histogram is deleted before that
// post runs, we add it to our list of histograms to clear, and there's no
// need to issue another post.
```
I see this is explained here, would be good to have this explanation for the above scope release :)
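A minimal sketch of that mechanism, using hypothetical names that mirror the discussion (this is not the PR's actual class): only the deletion that takes the pending list from empty to non-empty issues a post, and any ids queued before that post drains ride along in the same batch.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

class HistogramBatcher {
public:
  // Any-thread entry point: queue the id, posting at most once per batch.
  void releaseHistogramCrossThread(uint64_t histogram_id) {
    bool need_post;
    {
      std::lock_guard<std::mutex> guard(lock_);
      // A non-empty list means a post is already in flight; this id will be
      // picked up when that post drains the list, so no second post is needed.
      need_post = histograms_to_cleanup_.empty();
      histograms_to_cleanup_.push_back(histogram_id);
    }
    if (need_post) {
      clearHistogramsFromCaches(); // stand-in for main_thread_dispatcher_->post(...)
    }
  }

private:
  // Would run on the main thread after the post; called inline in this sketch.
  void clearHistogramsFromCaches() {
    std::vector<uint64_t> batch;
    {
      std::lock_guard<std::mutex> guard(lock_);
      batch = std::move(histograms_to_cleanup_);
      histograms_to_cleanup_.clear(); // the next deletion will post again
    }
    // ... clear every id in `batch` from the per-thread caches in one pass ...
  }

  std::mutex lock_;
  std::vector<uint64_t> histograms_to_cleanup_;
};
```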
I'm clearly missing something with this latest failure. I'm not even seeing a multiplexed_integration_test, only a multiplexed_upstream_integration_test.
Can you try to merge main?
Will do, been waiting for the bulk of the review to finish before doing so. Pretty sure that's in the contribution guidelines. If not, I will keep things up to date in the future.
The reason I suggested that is that I think some of the tests merge main automatically first, so they can report an error in a test that you cannot even see in your client.
Signed-off-by: Josh Tway <[email protected]>
Got an error because some commits don't have DCO sign-off, but they are part of what was merged from upstream. In this case it is okay to use --no-verify, right?
Looks like that was all successful
Looks good from my end, thanks for pushing this all the way through!
@snowp for final approval.
Just one small comment, otherwise this LGTM!
```cpp
}

void ThreadLocalStoreImpl::shutdownThreading() {
  // This will block both future cache fills as well as cache flushes.
  shutting_down_ = true;

  if (tls_.has_value()) {
```
Don't need this outer if check anymore
Signed-off-by: Josh Tway <[email protected]>
Looks like it failed in a flaky way. I'm going to kick it.
Signed-off-by: Josh Tway <[email protected]>
Signed-off-by: Josh Tway <[email protected]>
I did a kick-ci, but then a colleague informed me that there were some recent merges in main that might be needed, so I merged and pushed that. I hope the next go will be clean (after the current).
You might need another main merge, I think main was broken earlier
Signed-off-by: Josh Tway <[email protected]>
Looks like I don't have the ability to rerun jobs, and I'm not sure why the Windows CI job is all of a sudden failing. Can someone help me out here?
Windows failed 3/7 times in a protocol integration test. Re-running. I think you can also do this via: /retest
Retrying Azure Pipelines:
Good to know! For some reason I read that part of the documentation as a comment in the code.
Looks like the retest got kicked off before the previous run finished, or there are results that I cannot explain (as I have not seen them previously). /retest
Retrying Azure Pipelines:
all checks passed :)
@snowp it is still showing that there is a change being requested, but I'm pretty sure I addressed everything you wanted.
Previously a post was required per histogram or scope, per thread. Batching greatly reduces the overhead of large config updates, where tens of thousands of histograms and scopes can be queued for release in short order. This work is based off of https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a
Co-authored-by: Joshua Marantz [email protected]
Signed-off-by: Josh Tway [email protected]
Additional Description: Came across the original patch by @jmarantz while investigating why updating a single VirtualHost via VHDS would frequently take 50 seconds or more to propagate. It turns out this was caused by the large number of Scopes being cleared on each update. Batching the scope and histogram ids queued for removal improves this to under 20 seconds, often below 10.
Risk Level: Low
Testing: Before-and-after measurements of how long a VHDS update took when > 100K VirtualHosts were present.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A