Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats: keep a set of ref-counted parent histograms in ThreadLocalStore so that two with the same name map to the same histogram object. #12275

Merged
merged 27 commits into from
Aug 11, 2020

Conversation

jmarantz
Copy link
Contributor

@jmarantz jmarantz commented Jul 24, 2020

Commit Message: Creates a storage model for thread-local histograms enabling continuity of data across scope reloads. Previously, whenever a Scope was re-created, the counters, gauges, and text-readouts of the same names would retain their previous values. However, fresh histograms were created on every scope reload, and stats dumps would include duplicate histograms with the same name.

This change adds an analogous name-based set of histograms, held in ThreadLocalStore, so that we have a single histogram representing all its generations of data. This is somewhat more complex than for the other stats, since there were thread-local buffers, which previously were owned by TlsScope and needed to be made independent. So this introduces a new tls histogram map in the tls-cache to maintain this.

This should help unblock #12241.

Additional Description:
Risk Level: high (not clear whether this is enough testing of histogram usage)
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Fixes: #3098

…at two with the same name map to the same histogram object.

Signed-off-by: Joshua Marantz <[email protected]>
@jmarantz jmarantz requested a review from ggreenway July 24, 2020 13:06
@jmarantz
Copy link
Contributor Author

This may help unblock #12241

@jmarantz jmarantz marked this pull request as ready for review July 31, 2020 13:17
@jmarantz
Copy link
Contributor Author

We'll see if this flies through CI but I think this is ready for review

jmarantz added 2 commits July 31, 2020 11:02
…uction the cleanup of that set is disabled.

Signed-off-by: Joshua Marantz <[email protected]>
Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this code well enough to effectively review most of it. @mattklein123 can you take a look?

.bazelversion Outdated Show resolved Hide resolved
source/common/stats/thread_local_store.cc Outdated Show resolved Hide resolved
jmarantz added 2 commits July 31, 2020 16:46
…y deleting it on cleanup.

Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have definitely not tackled this at the end of the day on a Friday. ;)

Awesome to see this long standing issue fixed. At a high level this makes sense to me though I had to spend quite a bit of time with it to page things in. Beyond my comments below a few high level comments:

  1. Can we update stats.md as part of this PR?
  2. I think it would be good to have more tests. There are lots of very specific threading behaviors here that I think we can test with the synchronizer?

Anyway thanks for working on this!

/wait

source/common/stats/thread_local_store.h Outdated Show resolved Hide resolved
source/common/stats/thread_local_store.cc Outdated Show resolved Hide resolved
test/common/stats/thread_local_store_test.cc Show resolved Hide resolved
source/common/stats/thread_local_store.cc Outdated Show resolved Hide resolved
source/common/stats/thread_local_store.cc Outdated Show resolved Hide resolved
source/common/stats/thread_local_store.cc Outdated Show resolved Hide resolved
source/common/stats/thread_local_store.cc Show resolved Hide resolved
source/common/stats/thread_local_store.h Outdated Show resolved Hide resolved
source/common/stats/thread_local_store.cc Show resolved Hide resolved
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Copy link
Contributor Author

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look at indirect testing of the cleaning of TLS histograms, as well as testing with real threads.

There are a bunch of other tests in the same file with real threads, so it shouldn't be too hard. I don't know if I need to use the synchronizer unless I specifically want to test a race I'm suspicious of. Is there a particular race you think I should focus on?

source/common/stats/thread_local_store.cc Show resolved Hide resolved
@jmarantz
Copy link
Contributor Author

jmarantz commented Aug 3, 2020

OK I've got a framework for a real-threads histogram test, which I refactored from an existing test that used real threads.

I think there's probably more you had in mind and I should take another crack at some tests myself tomorrow, and I still need to update stats.md.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks in general this LGTM. I don't specifically have ideas for tests, but I think the more you can add the better as this code is so complicated and hard to reason about. Also, can you update stats.md?

/wait

source/common/stats/thread_local_store.cc Show resolved Hide resolved
Copy link
Contributor Author

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready for review; I basically duplicated the histogram "overlap" test but with real threads, and added more comments around the threading model, and updated stats.md. Note that nothing that was in stats.md was actually wrong, but I added detail about the tls structures.

source/common/stats/thread_local_store.cc Show resolved Hide resolved
source/common/stats/thread_local_store.cc Show resolved Hide resolved
@@ -304,7 +305,8 @@ TEST_P(ClusterMemoryTestRunner, MemoryLargeClusterSizeWithFakeSymbolTable) {
// We only run the exact test for ipv6 because ipv4 in some cases may allocate a
// different number of bytes. We still run the approximate test.
if (ip_version_ != Network::Address::IpVersion::v6) {
EXPECT_MEMORY_EQ(m_per_cluster, 45002);
// https://github.com/envoyproxy/envoy/issues/12209
Copy link
Contributor Author

@jmarantz jmarantz Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: #12035 removed this commented-out section, probably due to a git merge conflict error.

This PR actually does reduce memory usage due to reducing the number of map instances, and I put the updated values into the comment log, but I re-tested the flakiness and still measured 3 flakes out of 100 runs ,so I commented out the tests again.

@jmarantz
Copy link
Contributor Author

I didn't add more 'synchronizer' tests. The locking structure is identical to that of counters/gauges for which there are synchronizer tests. Happy to add that if you like.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jmarantz jmarantz dismissed ggreenway’s stale review August 11, 2020 00:00

approved by mklein

@jmarantz jmarantz merged commit 04de1cf into envoyproxy:master Aug 11, 2020
@jmarantz jmarantz deleted the hist-map branch August 11, 2020 00:01
@jmarantz jmarantz mentioned this pull request Feb 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle duplicate histogram names
4 participants