-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: debug page for RocksDB tickers/histograms #36196
Conversation
Hmm, 170 lines is hefty. You mention that we're collecting them already anyways (see snippets below), so what do we win by printing them in the logs? Just easier debuggability in situations in which we can't look at the time series? It seems hard to make good use of them since they're all counters and so you end up manually diffing them. For perf work I doubt we're ever going to want to just look at the time series, which also give you 10s granularity while here we have 60s only. cockroach/pkg/storage/metrics.go Lines 1418 to 1432 in d8eb94c
What are examples of some situations in which this dump would be useful? Can we make the timeseries useful instead? For example, which of the stats should we add to the snippet above? How should we expose the timeseries? For example, we have https://gist.github.com/tbg/006a2cfde07ddbfc9c96f10e5b3d4f33 (I think this means that the snippet is but it's already more pleasant than trying to piece together multiple stats dumps from the logs. Do you think that can work instead? We need to include that info in the debug zip (not all of the ts, that's too much, but some recent interval). |
Sorry "we're collecting" was the wrong word choice. What I meant is RocksDB is collecting them internally. Besides the handful of stats in the snippet you showed, the stats RocksDB is collecting don't appear to be exposed in Cockroach in any form.
I don't know. I find non-granular stats useful sometimes. For example you can compare
It is useful for cases where we didn't predict a stat would be useful so don't have it in the timeseries. Or maybe we upgrade RocksDB and don't notice a new stat. If we can think of a way to include everything in timeseries, that'd be preferable to me.
This is fine for me. I think the only question I have is whether you agree with preemptively exposing all stats as timeseries, or prefer to cherry-pick the ones we think are relevant? |
That's a good point. I talked to @bdarnell and we agreed that we can't add them wholesale to either timeseries or the logs, it's just too much (we have <600 time series in total right now, so we don't want to add 170 at once). Regarding missing a newly added metric, if you worry about that the thing to do - I think - is a unit test that basically extracts all the stats names from your dump string and fails whenever the list changes. I'll let you decide whether that is worth doing, I suspect not. |
This is actually the main thing I'm interested in. Adding a few stats to timeseries is nice too but only works if we know in advance what we're looking for. For example, when I found the data synthesized by our workload tool is uncompressible that was due to noticing the stat "rocksdb.bytes.compressed" is always zero. It's not a very good example, but maybe we'd notice interesting engine-related issues in production too if we examine stats that we don't decide months in advance that we need. |
The *.pb.{h,cc,go} and some other auto-generated-looking files show huge changes on GitHub but do not show any changes on my local machine. I am not sure what causes this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for banging this out so quickly! Only a few minor comments and you should write a test that calls the stats endpoint. I think an analogue of
cockroach/pkg/server/status_test.go
Lines 133 to 156 in 0750a43
// TestStatusGossipJson ensures that the output response for the full gossip | |
// info contains the required fields. | |
func TestStatusGossipJson(t *testing.T) { | |
defer leaktest.AfterTest(t)() | |
s, _, _ := serverutils.StartServer(t, base.TestServerArgs{}) | |
defer s.Stopper().Stop(context.TODO()) | |
var data gossip.InfoStatus | |
if err := getStatusJSONProto(s, "gossip/local", &data); err != nil { | |
t.Fatal(err) | |
} | |
if _, ok := data.Infos["first-range"]; !ok { | |
t.Errorf("no first-range info returned: %v", data) | |
} | |
if _, ok := data.Infos["cluster-id"]; !ok { | |
t.Errorf("no clusterID info returned: %v", data) | |
} | |
if _, ok := data.Infos["node:1"]; !ok { | |
t.Errorf("no node 1 info returned: %v", data) | |
} | |
if _, ok := data.Infos["system-db"]; !ok { | |
t.Errorf("no system config info returned: %v", data) | |
} | |
} |
For what's exposed as time series, do you know of any that should be added? It is helpful to have historical information for some key metrics and I'm curious what you think about the ones we have so far.
Reviewed 19 of 19 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr)
c-deps/libroach/engine.cc, line 224 at r1 (raw file):
// `GetTickersAndHistograms` retrieves maps of all RocksDB tickers and histograms. // It differs from `GetStats` by getting _every_ ticker and histogram, and by not // getting anything else (DB properties, for example).
Mention that the caller needs to free stats.{tickers,histograms}
.
pkg/server/serverpb/status.proto, line 231 at r1 (raw file):
} message EngineStatsResponse {
This should also contain the StoreID
, otherwise with multiple stores you won't know which stats belong to which store.
pkg/storage/engine/rocksdb.go, line 1195 at r1 (raw file):
} tickers := (*[1 << 20]C.TickerInfo)(
For consistency, use the same constant value instead of 1 << 20
from here
cockroach/pkg/storage/engine/slice.go
Line 34 in 2558dcc
src := (*[maxArrayLen]byte)(ptr)[:len:len] |
(I know we don't ever expect to see 1 << 20 < tickers_len
, but still.
pkg/storage/engine/rocksdb.go, line 1206 at r1 (raw file):
res.Histograms = make(map[string]*enginepb.HistogramData) histograms := (*[1 << 20]C.HistogramInfo)(
ditto
pkg/storage/engine/rocksdb.go, line 1219 at r1 (raw file):
Sum: uint64(histogram.sum), } res.Histograms[name] = value
My preference would be to have Histograms a map[string]HistogramData
and not map[string]*HistogramData
. You should be able to make this happen in the .proto
file by using the option as here:
map<int64, config.ZoneConfig> zone_configs = 8 [(gogoproto.nullable) = false]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS I don't know about the diffs. You've definitely done the right thing (i.e. the files needed to change) and they're also supposed to be checked in. Locally git
suppresses these diffs (treats it as binary files, we've set things up that way), so maybe you just didn't notice them.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr)
Hm the tests use |
I think you just need to feed it a config that has a cockroach/pkg/base/test_server_args.go Line 66 in 352bdf4
You can get a dir with |
Here's an example cockroach/pkg/server/server_test.go Lines 540 to 548 in 8938dcb
|
That was fast, thanks! |
8016ae1
to
fd7fdf1
Compare
Thanks a lot for the fast and detailed review. Addressed all your comments, I believe.
OK, I went through them and picked a few that might be worth adding. Will do that in a separate PR, somewhat later.
|
446f105
to
d266d6f
Compare
Set up a debug page at "/_status/enginestats/<node_id>" that contains values of all RocksDB ticker stats and histograms. It is also included in debug zip. A portion of the debug page (including both tickers and histograms) looks like this: ![tickershistograms](https://user-images.githubusercontent.com/4780362/55138005-aa88ab00-50ef-11e9-83d6-cc335cc10e6e.png) The debug zip file contains a JSON file named "debug/nodes/<node_id>/enginestats" with all the tickers/histograms, and looks like this: ``` { "stats": [ { "storeId": 1, "tickersAndHistograms": { "tickers": { "rocksdb.blobdb.blob.file.bytes.read": "0", ... "rocksdb.write.wal": "296" }, "histograms": { "rocksdb.blobdb.blob.file.read.micros": { "mean": 0, "p50": 0, "p95": 0, "p99": 0, "max": 0, "count": "0", "sum": "0" }, ... "rocksdb.write.raw.block.micros": { "mean": 1.8235294117647058, "p50": 0.6071428571428571, "p95": 10.749999999999993, "p99": 12, "max": 12, "count": "17", "sum": "31" } } } } ] } ``` Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 14 of 14 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajkr)
bors r+ |
Build failed |
CI failed on #36369 bors r+ |
36196: storage: debug page for RocksDB tickers/histograms r=tbg a=ajkr Set up a debug page at "/_status/enginestats/<node_id>" that contains values of all RocksDB ticker stats and histograms. It is also included in debug zip. A portion of the debug page (including both tickers and histograms) looks like this: ![tickershistograms](https://user-images.githubusercontent.com/4780362/55138005-aa88ab00-50ef-11e9-83d6-cc335cc10e6e.png) The debug zip file contains a JSON file named "debug/nodes/<node_id>/enginestats" with all the tickers/histograms, and looks like this: ``` { "stats": [ { "storeId": 1, "tickersAndHistograms": { "tickers": { "rocksdb.blobdb.blob.file.bytes.read": "0", ... "rocksdb.write.wal": "296" }, "histograms": { "rocksdb.blobdb.blob.file.read.micros": { "mean": 0, "p50": 0, "p95": 0, "p99": 0, "max": 0, "count": "0", "sum": "0" }, ... "rocksdb.write.raw.block.micros": { "mean": 1.8235294117647058, "p50": 0.6071428571428571, "p95": 10.749999999999993, "p99": 12, "max": 12, "count": "17", "sum": "31" } } } } ] } ``` Release Note: None Co-authored-by: Andrew Kryczka <[email protected]>
Build succeeded |
@tbg What do you think about backporting this to 19.1? Having the RocksDB stats in debug zip might be useful. |
Absolutely, please backport this. |
Set up a debug page at "/_status/enginestats/<node_id>" that contains
values of all RocksDB ticker stats and histograms. It is also included
in debug zip.
A portion of the debug page (including both tickers and histograms)
looks like this:
The debug zip file contains a JSON file named "debug/nodes/<node_id>/enginestats" with all the tickers/histograms, and looks like this:
Release Note: None