Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server, storage: Throw warning on slow log write, fatal on engine disk stall #55186

Merged
merged 1 commit into from
Oct 10, 2020

Conversation

itsbilal
Copy link
Member

@itsbilal itsbilal commented Oct 2, 2020

A previous change disabled the fataling of the Cockroach process
when a disk stall was detected. This change reverts that, as Pebble's
disk stall detection is closer to actual disk operations and should be
less likely to find spurious disk slowdowns.

To only conservatively respond to disk stalls, this change also bumps
up the stall fatal threshold to 60s, up from 30s.

It also makes the same change to the log flusher; the threshold has
gone up to 60s and a new warning threshold of 10s has been added
to match the semantics from Pebble's disk stall detection.

Finally, the node_engine_health goroutine is disabled on Pebble,
as it's redundant there.

Release note (general change): Detect stalled disk operations better
and crash the process if a disk operation is taking longer than
a minute.

@itsbilal itsbilal self-assigned this Oct 2, 2020
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal itsbilal changed the title server, storage: Throw warning on slow log write, fatal on engine dis… server, storage: Throw warning on slow log write, fatal on engine disk stall Oct 2, 2020
@itsbilal
Copy link
Member Author

itsbilal commented Oct 2, 2020

I thought about using a cluster setting for the thresholds instead of env variables as before, but changing that would involve a disk write which seems like the wrong place to put it? Thoughts welcome. It's also much easier to read a global variable set by an env variable in multiple places, but that one isn't too hard to overcome.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a cluster setting makes more sense. What is the worry about the disk write? Certainly, you're not going to be able to change the cluster setting if the cluster is horked, but my expectation is that you set the cluster setting when the cluster is fine.

This otherwise looks good.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @petermattis)

@itsbilal itsbilal force-pushed the disk-stall-fatal-true branch 2 times, most recently from 0f11e2d to 69fb2ca Compare October 6, 2020 15:42
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I've moved the knob to a cluster setting. PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens)

@itsbilal itsbilal force-pushed the disk-stall-fatal-true branch 3 times, most recently from d97847e to 8915e14 Compare October 6, 2020 19:32
@itsbilal itsbilal requested a review from petermattis October 8, 2020 22:01
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @petermattis)


pkg/storage/in_mem.go, line 34 at r2 (raw file):

		fallthrough
	case enginepb.EngineTypePebble:
		return newPebbleInMem(ctx, attrs, cacheSize, nil)

Add a /* settings */ inline comment please.


pkg/storage/mvcc_test.go, line 90 at r2 (raw file):

// createTestPebbleEngine returns a new in-memory Pebble storage engine.
func createTestPebbleEngine() Engine {
	return newPebbleInMem(context.Background(), roachpb.Attributes{}, 1<<20, nil)

Ditto about an inline comment for the nil parameter.


pkg/storage/pebble.go, line 549 at r2 (raw file):

	// rocksdb. This is to make sure the file paths match up, and that we're
	// able to write to both and ingest from both memory filesystems.
	pebbleInMem := newPebbleInMem(ctx, attrs, cacheSize, nil)

Ditto about an inline comment for the nil parameter.


pkg/storage/pebble.go, line 587 at r2 (raw file):

	if p.settings != nil {
		maxSyncDuration = MaxSyncDuration.Get(&p.settings.SV)
		fatalOnExceeded = MaxSyncDurationFatalOnExceeded.Get(&p.settings.SV)

Don't you have to get these values inside the event listener? Otherwise we won't respond to changes in these values after the server has started.

…k stall

A previous change disabled the fataling of the Cockroach process
when a disk stall was detected. This change reverts that, as Pebble's
disk stall detection is closer to actual disk operations and should be
less likely to find spurious disk slowdowns.

To only conservatively respond to disk stalls, this change also bumps
up the stall fatal threshold to 60s, up from 30s.

It also makes the same change to the log flusher; the threshold has
gone up to 60s and a new warning threshold of 10s has been added
to match the semantics from Pebble's disk stall detection.

The knobs for controlling MaxSyncDuration and
MaxSyncDurationFatalOnExceeded have also been moved from env variables
to cluster settings. For instance, the env variable
COCKROACH_ENGINE_MAX_SYNC_DURATION is now replaced with the cluster setting
`storage.max_sync_duration`.

Finally, the node_engine_health goroutine is disabled on Pebble,
as it's redundant there.

Release note (general change): Detect stalled disk operations better
and crash the process if a disk operation is taking longer than
a minute. Add cluster settings to allow for tuning of this behaviour.
@itsbilal itsbilal force-pushed the disk-stall-fatal-true branch from 8915e14 to 1c596ad Compare October 9, 2020 13:30
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @petermattis)


pkg/storage/in_mem.go, line 34 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Add a /* settings */ inline comment please.

Done.


pkg/storage/mvcc_test.go, line 90 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto about an inline comment for the nil parameter.

Done.


pkg/storage/pebble.go, line 549 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto about an inline comment for the nil parameter.

Done.


pkg/storage/pebble.go, line 587 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Don't you have to get these values inside the event listener? Otherwise we won't respond to changes in these values after the server has started.

Oh right, my bad. Fixed.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens)

@itsbilal
Copy link
Member Author

itsbilal commented Oct 9, 2020

TFTR!

bors r=petermattis

@craig
Copy link
Contributor

craig bot commented Oct 9, 2020

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 9, 2020

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 10, 2020

Build succeeded:

@craig craig bot merged commit d816ce9 into cockroachdb:master Oct 10, 2020
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Oct 28, 2020
The disk-stalled roachtest relies on the ability to control disk stall
detection / fatal intervals, as charybdefs only injects 50ms of
delay per syscall. This change adds an env variable, similar to the
one removed in cockroachdb#55186 to set max sync duration, except now it governs
the default of the cluster setting. The roachtest now modifies that
env variable to let disk stall detection trip on short syscall
delays.

Fixes cockroachdb#54332.

Release note: None.
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Oct 28, 2020
The disk-stalled roachtest relies on the ability to control disk stall
detection / fatal intervals, as charybdefs only injects 50ms of
delay per syscall. This change adds an env variable, similar to the
one removed in cockroachdb#55186 to set max sync duration, except now it governs
the default of the cluster setting. The roachtest now modifies that
env variable to let disk stall detection trip on short syscall
delays.

Fixes cockroachdb#54332.

Release note: None.
craig bot pushed a commit that referenced this pull request Oct 28, 2020
56064: storage: Bring back MaxSyncDuration env var, fix disk-stalled roachtest r=itsbilal a=itsbilal

The disk-stalled roachtest relies on the ability to control disk stall
detection / fatal intervals, as charybdefs only injects 50ms of
delay per syscall. This change adds an env variable, similar to the
one removed in #55186 to set max sync duration, except now it governs
the default of the cluster setting. The roachtest now modifies that
env variable to let disk stall detection trip on short syscall
delays.

Fixes #54332.

Release note: None.

56092: parser: pin pkg/security as a bazel dep for pkg/parser r=irfansharif a=irfansharif

We picked up a dependency on pkg/security within the parser package
after #55398.

We have to pin go dependencies by hand if they're only present in
auto-generated code. It's because it's not otherwise visible to
bazel/gazelle when generating the BUILD files (during the analysis
phase).

Release note: None

56095: partialidx: add benchmarks for two-variable comparisons r=RaduBerinde a=mgartner

Two-variable comparison implication performs similarly to other types of
implications.

    BenchmarkImplicator/single-exact-match-16                         76.5 ns/op
    BenchmarkImplicator/single-inexact-match-16                      342 ns/op
    BenchmarkImplicator/range-inexact-match-16                       782 ns/op
    BenchmarkImplicator/two-var-comparison-16                        302 ns/op
    BenchmarkImplicator/single-exact-match-extra-filters-16          310 ns/op
    BenchmarkImplicator/single-inexact-match-extra-filters-16        609 ns/op
    BenchmarkImplicator/multi-column-and-exact-match-16               82.4 ns/op
    BenchmarkImplicator/multi-column-and-inexact-match-16            722 ns/op
    BenchmarkImplicator/multi-column-and-two-var-comparisons-16      611 ns/op
    BenchmarkImplicator/multi-column-or-exact-match-16                76.1 ns/op
    BenchmarkImplicator/multi-column-or-exact-match-reverse-16       595 ns/op
    BenchmarkImplicator/multi-column-or-inexact-match-16            1081 ns/op
    BenchmarkImplicator/in-implies-or-16                             976 ns/op
    BenchmarkImplicator/and-filters-do-not-imply-pred-16            3710 ns/op
    BenchmarkImplicator/or-filters-do-not-imply-pred-16              917 ns/op
    BenchmarkImplicator/many-columns-exact-match10-16                296 ns/op
    BenchmarkImplicator/many-columns-inexact-match10-16             6853 ns/op
    BenchmarkImplicator/many-columns-exact-match100-16             19817 ns/op
    BenchmarkImplicator/many-columns-inexact-match100-16          447894 ns/op

Release note: None

Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants