-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: add admission.io.overload
metric
#87656
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @KaiSun314 and @tbg)
pkg/kv/kvserver/metrics.go
line 1026 at r1 (raw file):
} metaIOOverload = metric.Metadata{
We should also add it to the chart catalog - which should clear up some of the failing tests in CI.
Here:
pkg/kv/kvserver/metrics.go
line 1028 at r1 (raw file):
metaIOOverload = metric.Metadata{ Name: "admission.io.overload", Help: `1-normalized float to pause replication to raft group followers if its value is at least 1.
nit: this feature isn't enabled by default - so to avoid any confusion we should qualify the statement.
See the cluster setting here:
cockroach/pkg/kv/kvserver/replica_raft_overload.go
Lines 27 to 31 in 6309178
var pauseReplicationIOThreshold = settings.RegisterFloatSetting( | |
settings.SystemOnly, | |
"admission.kv.pause_replication_io_threshold", | |
"pause replication to non-essential followers when their I/O admission control score exceeds the given threshold (zero to disable)", | |
0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @KaiSun314 and @tbg)
pkg/kv/kvserver/store.go
line 3155 at r1 (raw file):
s.mu.RUnlock() s.ioThreshold.Lock()
nit: let's add a
// TODO(kaisun314,kvoli): move this to a per-store admission control metrics struct when available. See pkg/util/admission/granter.go.
I don't think we should block the change on this but it would be good to unify them eventually.
See https://github.com/cockroachdb/cockroach/blob/6309178fe9c85b57a27ff0c69d5adaeca3fafccb/pkg/util/admission/granter.go#L710-L703 for more infor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR! A few comments.
- We should go over the failing essential CI and figure out what we are missing.
- We should resolve the merge conflicts in
kv/kvserver/store.go
.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @KaiSun314 and @tbg)
b3b9ed4
to
de06fd8
Compare
b6d1e0a
to
4e194de
Compare
3014bb0
to
25b9300
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reviewing this PR!
- The failing essential CIs now pass locally, will monitor to ensure they pass the required checks in Github
- Merge conflicts should be resolved
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/metrics.go
line 1026 at r1 (raw file):
Previously, kvoli (Austen) wrote…
We should also add it to the chart catalog - which should clear up some of the failing tests in CI.
Here:
Done.
25b9300
to
c845785
Compare
Resolves cockroachdb#87424. Previously, only the unnormalized values of the LSM L0 sub-level and file counts is exposed externally, not the store's IOThreshold. This was inadequate because it is tedious to normalize and compare the LSM L0 sub-level and file counts (as they require dividing by different numbers). To address this, this patch adds a metric `admission.io.overload` tracking the store's IOThreshold. Release note (ops change): Added a metric `admission.io.overload` which tracks the store's IOThreshold.
c845785
to
d5571a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @KaiSun314)
We should also backport this to 22.2. let's chat on Monday about it. You can add the label on the rhs, backport-22.2 |
Sounds great! Added the backport-22.2.x label to this PR. |
bors r=kvoli |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from d5571a2 to blathers/backport-release-22.2-87656: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! This will come in handy in understanding AC behaviour (x-linking #82743). Just had a minor drive-by comment.
Reviewed all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained
pkg/kv/kvserver/metrics.go
line 1033 at r2 (raw file):
(pause replication feature is disabled if this setting is 0, feature is disabled by default); see pkg/kv/kvserver/replica_raft_overload.go for more details. Composed of LSM L0 sub-level and file counts.`,
I found this help text a bit difficult to follow. Also since this is text that's available in any running binary (SHOW ALL CLUSTER SETTINGS
), I wouldn't expect all readers to have the source code checked out, which would make the reference to this file slightly less than useful. It's also worth mentioning that even though we use this IO threshold for replication pausing, the metric is really it's own thing, so doesn't need to refer to it's one current use (there could be others in the future for ex., or that use could be removed but this metric kept around). I tried a following rephrasing:
1-normalized float indicating whether IO admission control considers the store as overloaded with respect to compaction out of L0.
In a code comment near pauseReplicationIOThreshold (admission.kv.pause_replication_io_threshold) we can mention that the metric it looks at underneath just so happens to be exported through admission.io.overload. If you end up trying this phrasing instead, do update the backport PR as well.
pkg/kv/kvserver/metrics.go
line 1035 at r2 (raw file):
sub-level and file counts.`, Measurement: "Threshold", Unit: metric.Unit_COUNT,
Should this be metric.Unit_PERCENT
instead? That usually covers floats within [0.0, 1.0].
90780: rttanalysis: add benchmark for hasura's introspection query r=rafiss a=rafiss informs #88885 This query exposes an issue with how we construct filters for certain types of joins. It can be used to verify future enhancements. Release note: None 90788: kvserver: improve help text for `admission.io.overload` r=irfansharif a=irfansharif Since this is text that's available in any running binary (`SHOW ALL CLUSTER SETTINGS`), we shouldn't expect readers to have the source code checked out, which makes our referencing of specific file names less than useful. See #87656 (review). Release note: None Release justification: No behavioural change, only changing user-visible help text. 90792: ui: address an old TODO r=yuzefovich a=yuzefovich This commit addresses an old TODO to simplify the check for when we warn about full scans. Epic: None Release note: None 90802: logictest: check for nil SystemConfig before purging zone config cache r=yuzefovich a=msirek Fixes #88398 This fixes a panic which can occur duing cluster startup in a logic test or during retry of a test case in a logic test when an attempt is made to purge the zone config cache and no SystemConfig is available. Release note: None 90804: release: S3 redirects should use abs path r=celiala a=rail Previously, S3 "latest" redirects used relative path, while the AWS API expects the redirects to start with "/", "http://", or "https://". This patch adds a slash to the key name for all redirect calls. The GCS implementation already strips the leading slash and should handle it properly. Release note: None Epic: None Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Mark Sirek <[email protected]> Co-authored-by: Rail Aliiev <[email protected]>
Resolves #87424.
Previously, only the unnormalized values of the LSM L0 sub-level and file counts is exposed externally, not the store's IOThreshold.
This was inadequate because it is tedious to normalize and compare the LSM L0 sub-level and file counts (as they require dividing by different numbers).
To address this, this patch adds a metric
admission.io.overload
tracking the store's IOThreshold.Release note (ops change): Added a metric
admission.io.overload
which tracks the store's IOThreshold.Note: I am not certain on the placement of the
admission.io.overload
metric, but it seems reasonable.Justification: