
stats: Improve performance of clearing scopes and histograms by batching them #15876

Merged: 14 commits merged into envoyproxy:main on May 21, 2021

Conversation

@jtway (Contributor) commented Apr 7, 2021:

Previously a post was required per histogram or scope, per thread. Batching them greatly reduces the overhead of large config updates, when tens of thousands of histograms and scopes are queued for release in short order. This work is based on https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a

Co-authored-by: Joshua Marantz [email protected]
Signed-off-by: Josh Tway [email protected]

Additional Description: Came across the original patch by @jmarantz while investigating why updating a single VirtualHost, via VHDS, would frequently take 50 seconds, or more, to propagate. Turns out it was caused by the large number of Scopes being cleared for each update. Batching scopes and histogram ids queued for removal together improves this to under 20 seconds, often below 10 seconds.
Risk Level: Low
Testing: Before-and-after measurement of how long a VHDS update took when > 100K VirtualHosts were present.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
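
To illustrate the batching idea, here is a minimal, self-contained sketch (the class and member names are illustrative stand-ins, not Envoy's actual ThreadLocalStoreImpl): instead of posting one cross-thread callback per released scope, the scope id is queued and a single callback is posted only when the queue goes from empty to non-empty; that callback then drains the whole batch.

#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the store; post_ models dispatcher post() to the
// main thread by simply queueing the callback for later execution.
class BatchingCleaner {
public:
  explicit BatchingCleaner(std::function<void(std::function<void()>)> post)
      : post_(std::move(post)) {}

  // Called whenever a scope is released, possibly from any thread.
  void releaseScope(uint64_t scope_id) {
    bool need_post = false;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      // Only the release that makes the queue non-empty triggers a post; any
      // ids queued before that post runs ride along in the same batch.
      need_post = scopes_to_cleanup_.empty();
      scopes_to_cleanup_.push_back(scope_id);
    }
    if (need_post) {
      post_([this] { clearScopesBatch(); });
    }
  }

private:
  // Runs "on the main thread": grabs the whole batch under the lock, then
  // clears all of the scopes in one pass.
  void clearScopesBatch() {
    std::vector<uint64_t> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch = std::move(scopes_to_cleanup_);
      scopes_to_cleanup_.clear(); // return the moved-from vector to a known state
    }
    std::cout << "clearing " << batch.size() << " scopes in one pass\n";
  }

  std::function<void(std::function<void()>)> post_;
  std::mutex mutex_;
  std::vector<uint64_t> scopes_to_cleanup_;
};

int main() {
  std::vector<std::function<void()>> dispatcher_queue; // deferred main-thread work
  BatchingCleaner cleaner(
      [&](std::function<void()> cb) { dispatcher_queue.push_back(std::move(cb)); });
  for (uint64_t id = 0; id < 5; ++id) {
    cleaner.releaseScope(id); // five releases, but only one post is queued
  }
  for (auto& cb : dispatcher_queue) {
    cb(); // prints: clearing 5 scopes in one pass
  }
}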

…ing them. Previously a post was required per histogram or scope, per thread. This greatly reduces the overhead of large config updates and when tens of thousands of histograms and scopes are queued for release in short order.

Co-authored-by: Joshua Marantz <[email protected]>

Signed-off-by: Josh Tway <[email protected]>
@repokitteh-read-only

Hi @jtway, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

Caused by: #15876 was opened by jtway.

@jmarantz (Contributor) commented Apr 8, 2021:

Looks like you have an issue with the test for the main file you are modifying. You might want to iterate locally with:

bazel test test/common/stats:thread_local_store_test

prior to pushing to CI.

@jtway (Contributor, Author) commented Apr 8, 2021:

@jmarantz it didn't fail locally, but I did see this failure. A colleague was able to reproduce it, so I will look into it in the morning.

@jmarantz (Contributor) left a comment:

It makes sense that this speeds things up significantly by saving a lot of indirect function calls, though not by an order of magnitude. I don't think it reduces the number of times we need to take a lock, but it does reduce the indirect function calls through post().

The fact that control-plane operations take tens of seconds still seems bad to me, though, and makes me wonder if we need more systems-level thinking about this problem of scale.


// This can happen from any thread. We post() back to the main thread which will initiate the
// cache flush operation.
if (!shutting_down_ && main_thread_dispatcher_) {
const uint64_t scope_id = scope->scope_id_;
// Switch to batching of clearing scopes as mentioned in:
// https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a
Contributor:

I think we don't need to call out the gist anymore since we have incorporated the code.

Contributor Author:

Will remove it while I find and fix why the integration test failed in the pipeline.

auto central_caches = std::make_shared<std::vector<CentralCacheEntrySharedPtr>>();
{
Thread::LockGuard lock(lock_);
scope_ids->swap(scopes_to_cleanup_);
Contributor:

I always liked this swap() pattern, but @antoniovicente has convinced me it's faster to do:

   *scope_ids = std::move(scopes_to_cleanup_);
   scopes_to_cleanup_.clear();

as there is no chance in this context that *scope_ids had anything in it. You can use that in place of all the swap()s added in this PR.

Note you do need that clear() because std::move does not strictly guarantee what state it leaves the source in, but it should be a fast no-op.
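
For reference, a tiny self-contained sketch of the two patterns being compared; the names pending and local are illustrative stand-ins for scopes_to_cleanup_ and *scope_ids.

#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

int main() {
  std::vector<uint64_t> pending = {1, 2, 3}; // plays the role of scopes_to_cleanup_
  std::vector<uint64_t> local;               // plays the role of *scope_ids

  // Pattern 1: swap() exchanges contents symmetrically; it is correct even if
  // local already held something.
  local.swap(pending);

  // Put the data back so we can show the second pattern on the same state.
  pending.swap(local);

  // Pattern 2: move + clear. The move avoids the exchange since local is known
  // to be empty; clear() returns the moved-from vector to a known (empty)
  // state, because std::move only guarantees a valid-but-unspecified state.
  local = std::move(pending);
  pending.clear();

  assert(local.size() == 3);
  assert(pending.empty());
}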

@jtway (Contributor, Author) Apr 8, 2021:

Interesting, I never thought about using clear() to return a container to a "known" state after std::move.

@jmarantz (Contributor) Apr 8, 2021:

Yeah, I believe that is the only legal operation you can do on an object that has just had its contents moved, other than destructing it.

Contributor:

The pattern used elsewhere is:
*scope_ids = std::move(scopes_to_cleanup_);
ASSERT(scopes_to_cleanup_.empty());

// post_callbacks_ should be empty after the move.

source/common/stats/thread_local_store.cc (resolved review thread)
@jtway (Contributor, Author) commented Apr 8, 2021:

The fact that control-plane operations take tens of seconds still seems bad to me, though, and makes me wonder if we need more systems-level thinking about this problem of scale.

@jmarantz I agree the time control-plane things take still leaves room for much improvement. I am viewing this as two problems: 1. Address how long large config updates take. 2. Address why a single VirtualHost being updated, when there are > 100K VirtualHosts present, ends up being a large config update.

We are still looking into the latter, but perhaps I should open an issue for that as well. In our particular case, it appears the problem comes from every VirtualHost being recreated even when only a single VirtualHost is present in a VHDS update. The existing config is then destroyed after the new config is propagated, which causes the VirtualHostImpl destructor to run, and in turn the ScopeImpl destructor, for every VirtualHost in the previous config. I could be wrong there, but that appears to be the case.

@jtway (Contributor, Author) commented Apr 8, 2021:

@jmarantz I'm hitting a few compounding problems, even when trying the fix. I'm not sure how I failed to reproduce this yesterday. I am in the process of moving, so it may take me a little time to make sure I get the right fix in here. I know the goal is to wrap these up in < 7 days. Hopefully I will be able to finish by then, but if not, I hope that is okay here.

@jmarantz (Contributor):

/wait

@jtway (Contributor, Author) commented Apr 22, 2021:

Back from my PTO now, but a couple unrelated issues have needed my attention. I should be looking to wrap this up in the next couple days. I'm currently having a couple issues with the integration test due to the addition of runOnAllThreads in the shutdownThreading logic.

… scopes and histograms by batching them

Signed-off-by: Josh Tway <[email protected]>
@jtway (Contributor, Author) commented Apr 27, 2021:

Looks like this broke other tests; guess that's what I get for focusing on just this one. The problem is that shutdownGlobalThreading has usually already been called, which means we can't use runOnAllThreads. I'm thinking this through a little.

@jmarantz (Contributor):

ping again when ready

/wait

jtway added 3 commits April 29, 2021 13:45
Signed-off-by: Josh Tway <[email protected]>
Signed-off-by: Josh Tway <[email protected]>
…ading calls. This way we cover both when it has been called, and when it has not

Signed-off-by: Josh Tway <[email protected]>
@jmarantz (Contributor) left a comment:

Nice! Just a few nits.

source/common/stats/thread_local_store.cc (outdated; resolved review thread)
const uint64_t scope_id = scope->scope_id_;
// Switch to batching of clearing scopes. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts
Contributor:

nit: end sentences with period.

Contributor Author:

Will fix


// This can happen from any thread. We post() back to the main thread which will initiate the
// cache flush operation.
if (!shutting_down_ && main_thread_dispatcher_) {
const uint64_t scope_id = scope->scope_id_;
// Switch to batching of clearing scopes. This greatly reduces the overhead when there are
Contributor:

nit: s/Switch to batching of clearing scopes/Clear scopes in a batch/ as the reader of this won't be anchoring on the previous implementation.

Contributor Author:

will change

test/common/stats/thread_local_store_test.cc (resolved review thread)
jmarantz previously approved these changes May 7, 2021

@jmarantz (Contributor) left a comment:

I think this looks good now, but I'm biased since I helped write some of this. I think the big question here is whether the improved performance is justified by the increased complexity, but the benefits are outlined in the description, so that's a good start. We also have seen similar issues internally, so we are hoping this will be submittable.

/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only

@envoyproxy/senior-maintainers assignee is @snowp

Caused by: a #15876 (review) was submitted by @jmarantz.

@jtway (Contributor, Author) commented May 7, 2021:

Thanks @jmarantz

@snowp (Contributor) left a comment:

Thanks, I think this batching logic makes sense. Can you explain why we have all these changes to shutdownThreading et al.? It's not clear from my first pass.

Comment on lines 207 to 209
if (tls_.has_value()) {
ASSERT(tls_->isShutdown());
}
Contributor:

maybe do ASSERT(!tls_.has_value() || tls_->isShutdown())? Right now this turns into an empty if block in NDEBUG builds (which would get optimized out, so it doesn't matter that much)
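
A small illustration of this nit, using plain assert() as a stand-in for Envoy's ASSERT macro (the surrounding types are illustrative): with assertions compiled out, the first form leaves an empty if statement behind, while the combined form reduces to nothing.

#include <cassert>
#include <optional>

struct Tls {
  bool isShutdown() const { return true; }
};

int main() {
  std::optional<Tls> tls_ = Tls{};

  // Form in the diff: when assert() compiles out (NDEBUG), this becomes an
  // if statement with an empty body, which the optimizer then removes.
  if (tls_.has_value()) {
    assert(tls_->isShutdown());
  }

  // Suggested form: one expression; nothing is left over in NDEBUG builds.
  assert(!tls_.has_value() || tls_->isShutdown());
  return 0;
}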

Contributor Author:

will change

}

// We can't call runOnAllThreads here as global threading has already been shutdown.
// It is okay to simply clear the scopes and central cache entries to cleanup.
Contributor:

Can you include an explanation for why this is okay to do?

Contributor Author:

Will update. How is:

// We can't call runOnAllThreads here as global threading has already been shutdown. It is okay
// to simply clear the scopes and central cache entries here as they will be cleaned up during
// thread local data cleanup in InstanceImpl::shutdownThread().

Comment on lines 376 to 377
// Capture all the pending histograms in a local, clearing the list held in
// this. Once this occurs, if a new histogram is deleted, a new post will be
Contributor:

This comment is a bit hard to read because talking about "this" makes the sentence seem incomplete. Maybe rephrase? Something like: "Move the histograms pending cleanup into a local variable. Future histogram deletions will be batched until the next time this function is called."?

Contributor Author:

Will reword

@@ -37,9 +37,11 @@ class ThreadLocalStorePerf {
for (auto& stat_name_storage : stat_names_) {
stat_name_storage->free(symbol_table_);
}
store_.shutdownThreading();
Contributor:

What's the purpose of this change?

Contributor:

In production code we were shutting down threading this way, where ThreadLocal::shutdownGlobalThreading() was called first. That means in shutdownThreading() we can't then go through the runOnAllThreads sequence to clean up; we just have to remove the elements directly.

This change, and others in some tests, are to make tests shut down threading in the same order as production code. The ASSERT(tls_->isShutdown()) validates the claim made below it:

// We can't call runOnAllThreads here as global threading has already been shutdown.

and a lot of the trivial test changes are to avoid having that assert crash in unit tests.
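
A toy model of the ordering constraint described above (not Envoy's actual classes; the names are illustrative): once global threading has been shut down, cross-thread posts are no longer allowed, so the store's shutdown path has to clear its pending entries directly.

#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// Stand-in for the thread-local system.
struct TlsSystem {
  bool shutdown = false;
  void shutdownGlobalThreading() { shutdown = true; }
  bool isShutdown() const { return shutdown; }
  // Stand-in for runOnAllThreads(); not usable after global shutdown.
  void runOnAllThreads() { assert(!shutdown && "no cross-thread work after shutdown"); }
};

// Stand-in for the stats store.
struct Store {
  TlsSystem* tls;
  std::vector<uint64_t> scopes_to_cleanup{1, 2, 3};

  void shutdownThreading() {
    // Mirrors the claim in the quoted comment: global threading is already
    // shut down, so drop the pending entries here rather than posting the
    // cleanup work to worker threads.
    assert(tls->isShutdown());
    scopes_to_cleanup.clear();
    std::cout << "cleared pending scopes directly\n";
  }
};

int main() {
  TlsSystem tls;
  Store store{&tls};
  tls.shutdownGlobalThreading(); // production order: global threading first ...
  store.shutdownThreading();     // ... then the store cleans up in place
}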

@@ -189,7 +195,8 @@ TEST_P(AdminStatsTest, StatsAsJson) {
})EOF";

EXPECT_THAT(expected_json, JsonStringEq(actual_json));
store_->shutdownThreading();
shutdownThreading();
ENVOY_LOG_MISC(error, "end of StatsAsJson");
Contributor:

I don't think this log is necessary

Contributor Author:

will remove

// Clear scopes in a batch. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts.
bool need_post = scopes_to_cleanup_.empty();
Contributor:

Mind including an explanation of why we post when this is empty? I assume this is the batching mechanism, so that we don't post while there is a post in progress, but some detail would be nice to have here.

Contributor Author:

will do.

Contributor Author:

updating releaseScopesCrossThread to have some of the same info as releaseHistogramCrossThread. This should cover this and the additional comment below. :)

Comment on lines +310 to +312
// clearHistogramsFromCaches. If a new histogram is deleted before that
// post runs, we add it to our list of histograms to clear, and there's no
// need to issue another post.
Contributor:

I see this is explained here; it would be good to have this explanation for the scope release above too. :)

@jtway (Contributor, Author) commented May 17, 2021:

I'm clearly missing something with this latest failure. I'm not even seeing a multiplexed_integration_test, only a multiplexed_upstream_integration_test.

@jmarantz (Contributor):

can you try to merge main?

@jtway (Contributor, Author) commented May 17, 2021:

Will do; I've been waiting for the bulk of the review to finish before merging. Pretty sure that's in the contribution guidelines. If not, I will keep things up to date in the future.

@jmarantz (Contributor):

The reason I suggested that is that I think some of the tests merge main automatically first, so they can report an error in a test that you cannot even see in your client.

@jtway (Contributor, Author) commented May 17, 2021:

Got an error because some commits don't have DCO sign-off, but they are part of what was merged from upstream. In this case it is okay to use --no-verify, right?

@jtway (Contributor, Author) commented May 17, 2021:

Looks like that was all successful

jmarantz previously approved these changes May 17, 2021

@jmarantz (Contributor) left a comment:

Looks good from my end, thanks for pushing this all the way through!

@snowp for final approval.

@jtway (Contributor, Author) commented May 19, 2021:

@jmarantz Thanks, and no worries. Hoping to get the final 👍 from @snowp :)

snowp previously requested changes May 20, 2021

@snowp (Contributor) left a comment:

Just one small comment, otherwise this LGTM!

}

void ThreadLocalStoreImpl::shutdownThreading() {
// This will block both future cache fills as well as cache flushes.
shutting_down_ = true;

if (tls_.has_value()) {
Contributor:

Don't need this outer if check anymore

@jtway (Contributor, Author) commented May 20, 2021:

Looks like it failed in a flaky way. I'm going to kick it.

jtway added 2 commits May 20, 2021 10:34
@jtway (Contributor, Author) commented May 20, 2021:

I did a kick-ci, but then a colleague informed me that there were some recent merges in main that might be needed, so I merged and pushed that. I hope the next go will be clean (after the current).

@snowp (Contributor) commented May 20, 2021:

You might need another main merge; I think main was broken earlier.

@jtway (Contributor, Author) commented May 21, 2021:

Looks like I don't have the ability to rerun jobs, and I'm not sure why the Windows CI job is suddenly failing. Can someone help me out here?

@jmarantz (Contributor):

Windows failed 3/7 times in a protocol integration test. Re-running. I think you can also do this via:

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

Caused by: a #15876 (comment) was created by @jmarantz.

@jtway (Contributor, Author) commented May 21, 2021:

Good to know! For some reason I read that part of the documentation as a comment in the code.

@jtway (Contributor, Author) commented May 21, 2021:

Looks like the retest got kicked off before the previous run finished, or there are more results that I cannot explain (as I have not seen them previously).

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

Caused by: a #15876 (comment) was created by @jtway.

@jtway (Contributor, Author) commented May 21, 2021:

all checks passed :)

@jtway (Contributor, Author) commented May 21, 2021:

@snowp it is still showing that changes are being requested, but I'm pretty sure I addressed everything you wanted.

@jmarantz jmarantz dismissed snowp’s stale review May 21, 2021 19:53

all concerns addressed

@jmarantz jmarantz merged commit 7a639c4 into envoyproxy:main May 21, 2021
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
…ing them (envoyproxy#15876)
