-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eventpb: add storage event types #86277
Conversation
cc: @jbowens - sneak peak at the new event type. I've cherry-picked a bunch of metrics from the Pebble metrics, erring on the side of "too many". We can probably prune this list a bit, but wanted to get a sense check on what we'd be capturing (second commit). |
94689f4
to
5b9ed06
Compare
5b9ed06
to
34a3bb9
Compare
34a3bb9
to
6e5e401
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 11 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nicktrav)
pkg/kv/kvserver/store.go
line 130 at r1 (raw file):
var logStoreTelemetryTicks = envutil.EnvOrDefaultInt( "COCKROACH_LOG_STORE_TELEMETRY_TICKS_INTERVAL", 6*60, // Once per hour. (tick interval = 10s) * 6 * 60 = 3600s = 1h.
what's the justification for 1h intervals? Can we miss transient interesting events because of this coarse granularity? What are other layers using as their telemetry logging interval?
pkg/util/log/eventpb/storage_events.proto
line 33 at r1 (raw file):
repeated LevelStats levels = 4 [(gogoproto.nullable) = false, (gogoproto.jsontag) = ""]; // Cache metrics.
Could you add in parentheses whether something is cumulative? I'm assuming everything here is cumulative since node restart. Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)? If yes, would it be simpler to export the deltas? Otherwise I can imagine a more complex calculation where we also export the restart time and then do a delta calculation from 0 if the restart time changed. Metrics systems can sometimes do that under the covers for you (Monarch did), but it is a complicated sql query to write in the absence of built-in support.
6e5e401
to
341e8c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking with some folks internally, I'm going to open this back up for review.
Some points that I've discussed with folks:
PCI compliance / sensitive data. There is no PII emitted in these per-store stats. The data in each even is numerical, and pertains the state of a single LSM (i.e. store).
Data volume. We've landed on a starting rate of one event per store per hour. There were some concerns that logging too much could overload the fluentbit logging daemon. I expect the size of the events to be on the order of low single digit KB per event (uncompressed). The set of stats present in this PR can also be scoped down, if necessary, to bring this number down.
More context in this thread (internal).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens, @pransudash, and @sumeerbhola)
pkg/kv/kvserver/store.go
line 130 at r1 (raw file):
what's the justification for 1h intervals?
My intention was to strike a balance between frequent (basically degenerates into timeseries metrics the lower the interval, and is more expensive, puts more stress on the logging infra) and useful (when emitting infrequently, derived metrics like rates of change become harder to compute, etc.). I believe the lower bound is more important from an operational / cost perspective. I suspect on the upper bound, logging as infrequently as once every 24 hours could still be useful. I'd be hesitant to go higher than that.
Can we miss transient interesting events because of this coarse granularity?
Certainly - however, the goal of this telemetry logging isn't to capture such events. These new telemetry events are intended to roll up to provide an aggregate view across all clusters (cloud, for now). We're more interested in the trends / broad themes. We have the ability to dig into the details if necessary, as we have the cluster, node and store information in the events.
What are other layers using as their telemetry logging interval?
One recent example of telemetry logging is #84761. That emits schema information once per week.
I have a thread going here (internal) that discusses what is reasonable. tldr: it was indicated that an event per store per hour seems like a reasonable starting point, and should not be an issue.
pkg/kv/kvserver/store.go
line 3357 at r2 (raw file):
// if reporting is enabled. These events are intended to be emitted at low // frequency. Trigger on tick 1 for the same reasons as above. if logcrash.DiagnosticsReportingEnabled.Get(&s.ClusterSettings().SV) &&
While the logcrash
package name doesn't seem all that relevant here, the comment on DiagnosticsReportingEnabled
makes me believe it is suitable for use as a gate on whether we should emit the event or not:
// "diagnostics.reporting.enabled" enables reporting of metrics related to a
// node's storage (number, size and health of ranges) back to CockroachDB.
// Collecting this data from production clusters helps us understand and improve
// how our storage systems behave in real-world use cases.
If there's a better cluster setting / env var to use, please let me know. Alternatively, if it's fine to unconditionally log this to the telemetry channel, that could also be an option (less inclined to do this for non-CC clusters, as that data is effectively useless and will eat up disk space).
pkg/util/log/eventpb/storage_events.proto
line 33 at r1 (raw file):
Could you add in parentheses whether something is cumulative?
Done. Clarified whether a metric is a counter or gauge.
I'm assuming everything here is cumulative since node restart.
Correct. Added that clarification at the top.
Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)?
cc: @pransudash - do you know the answer to this? Will it be an issue if we emit counters that contain discontinuities? I assume that's only an issue to consider when we get to performing the analytics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modulo the question about deltas.
Reviewed 1 of 11 files at r1, 2 of 4 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz, @nicktrav, @pransudash, and @sumeerbhola)
pkg/kv/kvserver/store.go
line 3358 at r2 (raw file):
// frequency. Trigger on tick 1 for the same reasons as above. if logcrash.DiagnosticsReportingEnabled.Get(&s.ClusterSettings().SV) && tick%logStoreTelemetryTicks == 1 {
To avoid spamming the telemetry if we're crash looping, I think we could report on tick%logStoreTelemetryTicks == logStoreTelemetryTicks-1
.
pkg/storage/engine.go
line 1062 at r2 (raw file):
TableZombieCount: m.Table.ZombieCount, TableZombieSize: m.Table.ZombieSize, }
maybe I'm greedy, but I'm not sure if there's anything I'd want to drop. If anything, maybe WAL.ObsoleteFiles
and WAL.ObsoletePhysicalSize
.
I think eventually we could replace CompactionNumInProgress
with a value that shows the 'effective compaction concurrency' over time, rather than just a single moment. I think including the CompactionNumInProgress
until we have that metric is fine.
Can we add the new Keys.RangeKeySetsCount
metric too?
75c21f3
to
29426b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens, @knz, @pransudash, and @sumeerbhola)
pkg/kv/kvserver/store.go
line 3358 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
To avoid spamming the telemetry if we're crash looping, I think we could report on
tick%logStoreTelemetryTicks == logStoreTelemetryTicks-1
.
Good call. Done.
pkg/storage/engine.go
line 1062 at r2 (raw file):
Can we add the new Keys.RangeKeySetsCount metric too?
Done!
not sure if there's anything I'd want to drop.
+1 - Given the limited concern around the amount of data being emitted and the relative infrequency (even at one event per hour per store), I'm inclined to keep this set of metrics as they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, and @sumeerbhola)
pkg/util/log/eventpb/storage_events.proto
line 33 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Could you add in parentheses whether something is cumulative?
Done. Clarified whether a metric is a counter or gauge.
I'm assuming everything here is cumulative since node restart.
Correct. Added that clarification at the top.
Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)?
cc: @pransudash - do you know the answer to this? Will it be an issue if we emit counters that contain discontinuities? I assume that's only an issue to consider when we get to performing the analytics?
If needed, I can instruct downstream pipelines to reconstruct deltas from the cumulative values and this something we do frequently for other datasets. As long as your counters are always increasing, that shouldn't be an issue. If it ever decreases or goes back to 0 does that mean the node restarted? In any case, I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, and @sumeerbhola)
pkg/util/log/eventpb/storage_events.proto
line 33 at r1 (raw file):
If it ever decreases or goes back to 0 does that mean the node restarted?
Correct.
I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.
Great. In that case I think we should be good here. We have either counters or gauges. No deltas.
29426b4
to
c74c215
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, @pransudash, and @sumeerbhola)
pkg/util/log/eventpb/storage_events.proto
line 33 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
If it ever decreases or goes back to 0 does that mean the node restarted?
Correct.
I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.
Great. In that case I think we should be good here. We have either counters or gauges. No deltas.
Ignore me since I'm being nitpicky, but I do want to stress that we can't really reliably calculate from resetting counters since we may observe monotonically non-decreasing values despite the counter having dropped to 0.
Add the `StoreStats` event type, a per-store event emitted to the `TELEMETRY` logging channel. This event type will be computed from the Pebble metrics for each store. Emit a `StoreStats` event periodically, by default, once per hour, per store. Touches cockroachdb#85589. Release note: None. Release justification: low risk, high benefit changes to existing functionality.
c74c215
to
1fc9149
Compare
Circling back on this one finally. I chatted with @sumeerbhola offline about the delta issue. We'll see how much we can work around this on the backend for now (i.e. in how we do the reporting). This will avoid complicating things too much for the metrics reporting pipeline, which is still being fleshed out (the JSON schema can be considered "internal" and is subject to change). TFTRs! bors r=jbowens |
👎 Rejected by code reviews |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, @pransudash, and @sumeerbhola)
@nicktrav Sounds good! Let me know if there's any need for us to chat about reporting from these log events. I'm assuming the logs are still scheduled to send hourly? |
bors r=jbowens,sumeerbhola |
Build succeeded: |
Add the
StoreStats
event type, a per-store event emitted to theTELEMETRY
logging channel. This event type will be computed from thePebble metrics for each store.
Emit a
StoreStats
event periodically, by default, once per hour, perstore.
Touches #85589.
Release note: None.
Release justification: low risk, high benefit changes to existing
functionality.