
kvserver: log a trace for slow splits #120943

Merged: 1 commit into cockroachdb:master on Mar 26, 2024

Conversation

miraradeva (Contributor)

This change enables logging a trace of a split request if it exceeds a certain duration. The duration is controlled by a new cluster setting: kv.split.slow_split_tracing_threshold, which defaults to 2s.

Fixes: #81152

Release note: None
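
As a rough illustration of the pattern (not the actual implementation), here is a minimal standalone Go sketch of threshold-gated trace logging; `slowSplitThreshold`, `runSplit`, and `processSplit` are hypothetical stand-ins for the cluster setting and the split path:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Stand-in for the kv.split.slow_split_tracing_threshold cluster setting.
var slowSplitThreshold = 2 * time.Second

// runSplit simulates a split; in the real change a verbose tracing span is
// opened around the split and its recording is collected afterwards.
func runSplit(ctx context.Context) (recording string, err error) {
	time.Sleep(3 * time.Second) // simulate a slow split
	return "<trace recording>", nil
}

func processSplit(ctx context.Context) error {
	start := time.Now()
	recording, err := runSplit(ctx)
	// Only splits that exceed the threshold pay the cost of logging the trace.
	if dur := time.Since(start); dur > slowSplitThreshold {
		fmt.Printf("split took %s (threshold %s); trace:\n%s\n",
			dur, slowSplitThreshold, recording)
	}
	return err
}

func main() {
	_ = processSplit(context.Background())
}
```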

@cockroach-teamcity (Member)

This change is Reviewable

@miraradeva (Contributor, Author)

Looking for feedback on:

  1. What should the default value for the tracing threshold duration be? I have it at 2s now. Should it be a property of the range that's being split, similar to the replicate queue's threshold?
  2. Do we want to refactor processOneChangeWithTracing as a function of the base queue, e.g. as a wrapper around process? That way, any specific queue that wants tracing for slow operations can just set its own tracing threshold. I didn't do this right away because the different queues implement process differently: some have retries built into process, and others don't. I'm assuming we don't want a trace across multiple retries, but I'm happy to discuss.

@miraradeva miraradeva marked this pull request as ready for review March 25, 2024 12:57
@miraradeva miraradeva requested a review from a team as a code owner March 25, 2024 12:57
@miraradeva miraradeva requested a review from kvoli March 25, 2024 12:57
@kvoli (Collaborator) commented Mar 25, 2024

What should the default value for the tracing threshold duration be? I have it at 2s now. Should it be a property of the range that's being split, similarly to the replicate queue's threshold.

2s sounds reasonable. Re-using something like the replicate queue's approach would be rough here: it's hard to define an expected processing rate based on range size, given differing hardware and conditions. By comparison, the replicate queue's rate is mostly bounded by the configured snapshot rate (roughly, expected duration ≈ range size / snapshot rate).

Do we want to refactor processOneChangeWithTracing as a function of the base queue? E.g. as a wrapper around process. That way any specific queue that wants tracing for slow operations can just set its own tracing threshold. I didn't do this right away because the different queues implement process differently: some have retries built into process, and others don't. I'm assuming we don't want a trace across multiple retries, but happy to discuss.

I didn't take a look at the other queues, but if this is doable without too much added effort, that sounds good. We wouldn't want to trace across multiple retries, since a trace of just the last retry would be sufficient.

@miraradeva (Contributor, Author) commented Mar 25, 2024

We wouldn't want to trace across multiple retries, since a trace on the last retry would be sufficient.

I looked at the various queues again. If we want to add optional tracing for slow queue processing in the base queue, we'd wrap the timing and tracing logic around this call to process(...); that placement makes sense since the timeout logic is there as well. Almost all queues implement the processing logic as a single attempt, except the replicate queue, which retries within the process(...) function. So I see two options:

  1. Do the above refactor, as long as we're ok with tracing the replicate queue's process with all of its retries.
  2. Do this kind of change one queue at a time, as needed. I can try to factor out a helper function to do the actual tracing, to avoid code duplication.

Edited to add: the refactor in option 2 doesn't seem particularly easy because the internal processing functions (e.g. processAttempt for the split queue and processOneChange for the replicate queue) don't have the same signatures. So we should probably either do option 1 or leave it as is, unless you have another suggestion.
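
To make the trade-off concrete, here is a hypothetical sketch of option 1, where the base queue wraps each queue's process(...) in timing/tracing logic gated by a per-queue threshold; `queueImpl`, `baseQueue`, and `slowProcessThreshold` are made-up names for illustration, not the real interfaces:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type queueImpl interface {
	process(ctx context.Context) error
	// slowProcessThreshold returns 0 to opt out of slow-processing traces.
	slowProcessThreshold() time.Duration
}

type baseQueue struct {
	impl queueImpl
}

func (bq *baseQueue) processWithTracing(ctx context.Context) error {
	threshold := bq.impl.slowProcessThreshold()
	if threshold == 0 {
		return bq.impl.process(ctx)
	}
	start := time.Now()
	// A recording span would be opened here; for the replicate queue this
	// single trace would cover all retries inside process(...).
	err := bq.impl.process(ctx)
	if dur := time.Since(start); dur > threshold {
		fmt.Printf("queue processing took %s, exceeding threshold %s\n", dur, threshold)
	}
	return err
}

// Toy queue implementation, for illustration only.
type splitQueue struct{}

func (splitQueue) process(ctx context.Context) error {
	time.Sleep(10 * time.Millisecond)
	return nil
}

func (splitQueue) slowProcessThreshold() time.Duration { return 5 * time.Millisecond }

func main() {
	bq := &baseQueue{impl: splitQueue{}}
	_ = bq.processWithTracing(context.Background())
}
```

With this shape, each queue opts in by returning a non-zero threshold, and the replicate queue's trace would necessarily span all of its internal retries.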

@kvoli (Collaborator) commented Mar 26, 2024

I'd prefer to leave it as is for now, unless we won't be backporting this to the 24.1 release branch.

@kvoli (Collaborator) left a comment


:lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miraradeva)


pkg/kv/kvserver/client_split_test.go line 1070 at r1 (raw file):

// TestStoreRangeSplitWithTracing tests that the split queue logs traces for
// slow splits.
func TestStoreRangeSplitWithTracing(t *testing.T) {

Nice test!


pkg/kv/kvserver/client_split_test.go line 1074 at r1 (raw file):

	defer log.Scope(t).Close(t)
	l := log.ScopeWithoutShowLogs(t)
	_ = log.SetVModule("split_queue=2")

nit: change this to split_queue=1 to match the expensive log check in processAttemptWithTracing exactly.
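
For context on the nit, a toy illustration of vmodule gating: a log site guarded at level 1 fires as long as the file's verbosity is at least 1, so the test's vmodule setting only needs to match the level checked at the log site. `logVerbosity` and `vEnabled` below are stand-ins, not the actual cockroachdb log API:

```go
package main

import "fmt"

// Per-file verbosity, e.g. as set via SetVModule("split_queue=1").
var logVerbosity = map[string]int{"split_queue": 1}

func vEnabled(file string, level int) bool {
	return logVerbosity[file] >= level
}

func main() {
	// The check in processAttemptWithTracing uses level 1, so
	// "split_queue=1" is sufficient to enable it.
	if vEnabled("split_queue", 1) {
		fmt.Println("slow split: logging trace")
	}
}
```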

This change enables logging a trace of a split request if it exceeds a
certain duration. The duration is controlled by a new cluster setting:
`kv.split.slow_split_tracing_threshold`, which defaults to 2s.

Fixes: cockroachdb#81152

Release note: None
@miraradeva miraradeva force-pushed the mira-81152-split-tracing branch from b69bfdf to 6f9b7ec on March 26, 2024 at 17:52
@miraradeva
Copy link
Contributor Author

bors r=kvoli

@craig craig bot merged commit 2cf36f7 into cockroachdb:master Mar 26, 2024
22 checks passed
Linked issue that merging this pull request may close:

kvserver: add cluster settings to log traces for slow lease transfers and splits