Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: add admission.io.overload metric #87656

Merged
merged 1 commit into from
Sep 19, 2022

Conversation

KaiSun314
Copy link
Contributor

Resolves #87424.

Previously, only the unnormalized values of the LSM L0 sub-level and file counts is exposed externally, not the store's IOThreshold.

This was inadequate because it is tedious to normalize and compare the LSM L0 sub-level and file counts (as they require dividing by different numbers).

To address this, this patch adds a metric admission.io.overload tracking the store's IOThreshold.

Release note (ops change): Added a metric admission.io.overload which tracks the store's IOThreshold.

Note: I am not certain on the placement of the admission.io.overload metric, but it seems reasonable.
Justification:

  1. Pretty much all updates to store metrics outside of ComputeMetrics in pkg/kv/kvstore/store.go seem to be increments (e.g. Inc(1) )
  2. ComputeMetrics (there is a comment in the code: ComputeMetrics immediately computes the current value of store metrics which cannot be computed incrementally) updates the store metrics via updateCapacityGauges , updateReplicationGauges , updateEngineMetrics , and updateEnvStats. Looking at all these functions, updateReplicationGauges (where we are currently updating the IOOverload metric) seems to make the most sense to update the IOOverload metric
  3. There seem to be two other admission metrics:
  • RaftPausedFollowerCount - updated in updateReplicationGauges
  • RaftPausedFollowerDroppedMsgs.Inc - updated in pkg/kv/kvserver/replica_raft.go via r.store.Metrics().RaftPausedFollowerDroppedMsgs.Inc(1)

@KaiSun314 KaiSun314 requested a review from kvoli September 8, 2022 22:28
@KaiSun314 KaiSun314 requested a review from a team as a code owner September 8, 2022 22:28
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli requested a review from tbg September 9, 2022 18:49
Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @KaiSun314 and @tbg)


pkg/kv/kvserver/metrics.go line 1026 at r1 (raw file):

	}

	metaIOOverload = metric.Metadata{

We should also add it to the chart catalog - which should clear up some of the failing tests in CI.

Here:

https://github.com/cockroachdb/cockroach/blob/6309178fe9c85b57a27ff0c69d5adaeca3fafccb/pkg/ts/catalog/chart_catalog.go#L3505-L3500


pkg/kv/kvserver/metrics.go line 1028 at r1 (raw file):

	metaIOOverload = metric.Metadata{
		Name: "admission.io.overload",
		Help: `1-normalized float to pause replication to raft group followers if its value is at least 1.

nit: this feature isn't enabled by default - so to avoid any confusion we should qualify the statement.

See the cluster setting here:

var pauseReplicationIOThreshold = settings.RegisterFloatSetting(
settings.SystemOnly,
"admission.kv.pause_replication_io_threshold",
"pause replication to non-essential followers when their I/O admission control score exceeds the given threshold (zero to disable)",
0,

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @KaiSun314 and @tbg)


pkg/kv/kvserver/store.go line 3155 at r1 (raw file):

	s.mu.RUnlock()

	s.ioThreshold.Lock()

nit: let's add a

// TODO(kaisun314,kvoli): move this to a per-store admission control metrics struct when available. See pkg/util/admission/granter.go.

I don't think we should block the change on this but it would be good to unify them eventually.

See https://github.com/cockroachdb/cockroach/blob/6309178fe9c85b57a27ff0c69d5adaeca3fafccb/pkg/util/admission/granter.go#L710-L703 for more infor

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! A few comments.

  1. We should go over the failing essential CI and figure out what we are missing.
  2. We should resolve the merge conflicts in kv/kvserver/store.go.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @KaiSun314 and @tbg)

@KaiSun314 KaiSun314 force-pushed the store_io_overload_metric branch from b3b9ed4 to de06fd8 Compare September 12, 2022 15:27
@KaiSun314 KaiSun314 requested a review from a team September 12, 2022 15:27
@KaiSun314 KaiSun314 force-pushed the store_io_overload_metric branch 3 times, most recently from b6d1e0a to 4e194de Compare September 12, 2022 15:39
@tbg tbg removed their request for review September 12, 2022 17:58
@KaiSun314 KaiSun314 force-pushed the store_io_overload_metric branch 2 times, most recently from 3014bb0 to 25b9300 Compare September 12, 2022 19:52
Copy link
Contributor Author

@KaiSun314 KaiSun314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing this PR!

  1. The failing essential CIs now pass locally, will monitor to ensure they pass the required checks in Github
  2. Merge conflicts should be resolved

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/metrics.go line 1026 at r1 (raw file):

Previously, kvoli (Austen) wrote…

We should also add it to the chart catalog - which should clear up some of the failing tests in CI.

Here:

https://github.com/cockroachdb/cockroach/blob/6309178fe9c85b57a27ff0c69d5adaeca3fafccb/pkg/ts/catalog/chart_catalog.go#L3505-L3500

Done.

@KaiSun314 KaiSun314 force-pushed the store_io_overload_metric branch from 25b9300 to c845785 Compare September 12, 2022 21:34
Resolves cockroachdb#87424.

Previously, only the unnormalized values of the LSM L0 sub-level and
file counts is exposed externally, not the store's IOThreshold.

This was inadequate because it is tedious to normalize and compare the
LSM L0 sub-level and file counts (as they require dividing by different
numbers).

To address this, this patch adds a metric `admission.io.overload`
tracking the store's IOThreshold.

Release note (ops change): Added a metric `admission.io.overload` which
tracks the store's IOThreshold.
@KaiSun314 KaiSun314 force-pushed the store_io_overload_metric branch from c845785 to d5571a2 Compare September 12, 2022 21:35
Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @KaiSun314)

@kvoli
Copy link
Collaborator

kvoli commented Sep 16, 2022

We should also backport this to 22.2.

let's chat on Monday about it. You can add the label on the rhs, backport-22.2

@KaiSun314
Copy link
Contributor Author

We should also backport this to 22.2.

let's chat on Monday about it. You can add the label on the rhs, backport-22.2

Sounds great! Added the backport-22.2.x label to this PR.

@KaiSun314
Copy link
Contributor Author

bors r=kvoli

@craig
Copy link
Contributor

craig bot commented Sep 19, 2022

Build succeeded:

@craig craig bot merged commit 58589fd into cockroachdb:master Sep 19, 2022
@blathers-crl
Copy link

blathers-crl bot commented Sep 19, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d5571a2 to blathers/backport-release-22.2-87656: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! This will come in handy in understanding AC behaviour (x-linking #82743). Just had a minor drive-by comment.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/kv/kvserver/metrics.go line 1033 at r2 (raw file):

(pause replication feature is disabled if this setting is 0, feature is disabled by default);
see pkg/kv/kvserver/replica_raft_overload.go for more details. Composed of LSM L0
sub-level and file counts.`,

I found this help text a bit difficult to follow. Also since this is text that's available in any running binary (SHOW ALL CLUSTER SETTINGS), I wouldn't expect all readers to have the source code checked out, which would make the reference to this file slightly less than useful. It's also worth mentioning that even though we use this IO threshold for replication pausing, the metric is really it's own thing, so doesn't need to refer to it's one current use (there could be others in the future for ex., or that use could be removed but this metric kept around). I tried a following rephrasing:

1-normalized float indicating whether IO admission control considers the store as overloaded with respect to compaction out of L0.

In a code comment near pauseReplicationIOThreshold (admission.kv.pause_replication_io_threshold) we can mention that the metric it looks at underneath just so happens to be exported through admission.io.overload. If you end up trying this phrasing instead, do update the backport PR as well.


pkg/kv/kvserver/metrics.go line 1035 at r2 (raw file):

sub-level and file counts.`,
		Measurement: "Threshold",
		Unit:        metric.Unit_COUNT,

Should this be metric.Unit_PERCENT instead? That usually covers floats within [0.0, 1.0].

craig bot pushed a commit that referenced this pull request Oct 27, 2022
90780: rttanalysis: add benchmark for hasura's introspection query r=rafiss a=rafiss

informs #88885

This query exposes an issue with how we construct filters for certain types of joins. It can be used to verify future enhancements.

Release note: None

90788: kvserver: improve help text for `admission.io.overload` r=irfansharif a=irfansharif

Since this is text that's available in any running binary (`SHOW ALL CLUSTER SETTINGS`), we shouldn't expect readers to have the source code checked out, which makes our referencing of specific file names less than useful. See #87656 (review).

Release note: None
Release justification: No behavioural change, only changing user-visible help text.

90792: ui: address an old TODO r=yuzefovich a=yuzefovich

This commit addresses an old TODO to simplify the check for when we warn about full scans.

Epic: None

Release note: None

90802: logictest: check for nil SystemConfig before purging zone config cache r=yuzefovich a=msirek

Fixes #88398

This fixes a panic which can occur duing cluster startup in a logic test or during retry of a test case in a logic test when an attempt is made to purge the zone config cache and no SystemConfig is available.

Release note: None

90804: release: S3 redirects should use abs path r=celiala a=rail

Previously, S3 "latest" redirects used relative path, while the AWS API expects the redirects to start with "/", "http://", or "https://".

This patch adds a slash to the key name for all redirect calls. The GCS implementation already strips the leading slash and should handle it properly.

Release note: None
Epic: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: add IOThreshold to metrics
4 participants