server, storage: Throw warning on slow log write, fatal on engine disk stall #55186

itsbilal · 2020-10-02T19:30:12Z

A previous change disabled the fataling of the Cockroach process
when a disk stall was detected. This change reverts that, as Pebble's
disk stall detection is closer to actual disk operations and should be
less likely to find spurious disk slowdowns.

To only conservatively respond to disk stalls, this change also bumps
up the stall fatal threshold to 60s, up from 30s.

It also makes the same change to the log flusher; the threshold has
gone up to 60s and a new warning threshold of 10s has been added
to match the semantics from Pebble's disk stall detection.

Finally, the node_engine_health goroutine is disabled on Pebble,
as it's redundant there.

Release note (general change): Detect stalled disk operations better
and crash the process if a disk operation is taking longer than
a minute.

cockroach-teamcity · 2020-10-02T19:30:19Z

This change is

itsbilal · 2020-10-02T19:33:39Z

I thought about using a cluster setting for the thresholds instead of env variables as before, but changing that would involve a disk write which seems like the wrong place to put it? Thoughts welcome. It's also much easier to read a global variable set by an env variable in multiple places, but that one isn't too hard to overcome.

petermattis

I think a cluster setting makes more sense. What is the worry about the disk write? Certainly, you're not going to be able to change the cluster setting if the cluster is horked, but my expectation is that you set the cluster setting when the cluster is fine.

This otherwise looks good.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @petermattis)

itsbilal

Makes sense. I've moved the knob to a cluster setting. PTAL!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens)

petermattis

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @petermattis)

pkg/storage/in_mem.go, line 34 at r2 (raw file):

		fallthrough
	case enginepb.EngineTypePebble:
		return newPebbleInMem(ctx, attrs, cacheSize, nil)

Add a /* settings */ inline comment please.

pkg/storage/mvcc_test.go, line 90 at r2 (raw file):

// createTestPebbleEngine returns a new in-memory Pebble storage engine.
func createTestPebbleEngine() Engine {
	return newPebbleInMem(context.Background(), roachpb.Attributes{}, 1<<20, nil)

Ditto about an inline comment for the nil parameter.

pkg/storage/pebble.go, line 549 at r2 (raw file):

	// rocksdb. This is to make sure the file paths match up, and that we're
	// able to write to both and ingest from both memory filesystems.
	pebbleInMem := newPebbleInMem(ctx, attrs, cacheSize, nil)

Ditto about an inline comment for the nil parameter.

pkg/storage/pebble.go, line 587 at r2 (raw file):

	if p.settings != nil {
		maxSyncDuration = MaxSyncDuration.Get(&p.settings.SV)
		fatalOnExceeded = MaxSyncDurationFatalOnExceeded.Get(&p.settings.SV)

Don't you have to get these values inside the event listener? Otherwise we won't respond to changes in these values after the server has started.

…k stall A previous change disabled the fataling of the Cockroach process when a disk stall was detected. This change reverts that, as Pebble's disk stall detection is closer to actual disk operations and should be less likely to find spurious disk slowdowns. To only conservatively respond to disk stalls, this change also bumps up the stall fatal threshold to 60s, up from 30s. It also makes the same change to the log flusher; the threshold has gone up to 60s and a new warning threshold of 10s has been added to match the semantics from Pebble's disk stall detection. The knobs for controlling MaxSyncDuration and MaxSyncDurationFatalOnExceeded have also been moved from env variables to cluster settings. For instance, the env variable COCKROACH_ENGINE_MAX_SYNC_DURATION is now replaced with the cluster setting `storage.max_sync_duration`. Finally, the node_engine_health goroutine is disabled on Pebble, as it's redundant there. Release note (general change): Detect stalled disk operations better and crash the process if a disk operation is taking longer than a minute. Add cluster settings to allow for tuning of this behaviour.

itsbilal

TFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @petermattis)

pkg/storage/in_mem.go, line 34 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Add a /* settings */ inline comment please.

Done.

pkg/storage/mvcc_test.go, line 90 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto about an inline comment for the nil parameter.

Done.

pkg/storage/pebble.go, line 549 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto about an inline comment for the nil parameter.

Done.

pkg/storage/pebble.go, line 587 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Don't you have to get these values inside the event listener? Otherwise we won't respond to changes in these values after the server has started.

Oh right, my bad. Fixed.

petermattis

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jbowens)

itsbilal · 2020-10-09T15:53:18Z

TFTR!

bors r=petermattis

craig · 2020-10-09T18:00:44Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2020-10-09T21:21:49Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2020-10-10T00:47:45Z

Build succeeded:

GitHub CI (Cockroach)

The disk-stalled roachtest relies on the ability to control disk stall detection / fatal intervals, as charybdefs only injects 50ms of delay per syscall. This change adds an env variable, similar to the one removed in cockroachdb#55186 to set max sync duration, except now it governs the default of the cluster setting. The roachtest now modifies that env variable to let disk stall detection trip on short syscall delays. Fixes cockroachdb#54332. Release note: None.

56064: storage: Bring back MaxSyncDuration env var, fix disk-stalled roachtest r=itsbilal a=itsbilal The disk-stalled roachtest relies on the ability to control disk stall detection / fatal intervals, as charybdefs only injects 50ms of delay per syscall. This change adds an env variable, similar to the one removed in #55186 to set max sync duration, except now it governs the default of the cluster setting. The roachtest now modifies that env variable to let disk stall detection trip on short syscall delays. Fixes #54332. Release note: None. 56092: parser: pin pkg/security as a bazel dep for pkg/parser r=irfansharif a=irfansharif We picked up a dependency on pkg/security within the parser package after #55398. We have to pin go dependencies by hand if they're only present in auto-generated code. It's because it's not otherwise visible to bazel/gazelle when generating the BUILD files (during the analysis phase). Release note: None 56095: partialidx: add benchmarks for two-variable comparisons r=RaduBerinde a=mgartner Two-variable comparison implication performs similarly to other types of implications. BenchmarkImplicator/single-exact-match-16 76.5 ns/op BenchmarkImplicator/single-inexact-match-16 342 ns/op BenchmarkImplicator/range-inexact-match-16 782 ns/op BenchmarkImplicator/two-var-comparison-16 302 ns/op BenchmarkImplicator/single-exact-match-extra-filters-16 310 ns/op BenchmarkImplicator/single-inexact-match-extra-filters-16 609 ns/op BenchmarkImplicator/multi-column-and-exact-match-16 82.4 ns/op BenchmarkImplicator/multi-column-and-inexact-match-16 722 ns/op BenchmarkImplicator/multi-column-and-two-var-comparisons-16 611 ns/op BenchmarkImplicator/multi-column-or-exact-match-16 76.1 ns/op BenchmarkImplicator/multi-column-or-exact-match-reverse-16 595 ns/op BenchmarkImplicator/multi-column-or-inexact-match-16 1081 ns/op BenchmarkImplicator/in-implies-or-16 976 ns/op BenchmarkImplicator/and-filters-do-not-imply-pred-16 3710 ns/op BenchmarkImplicator/or-filters-do-not-imply-pred-16 917 ns/op BenchmarkImplicator/many-columns-exact-match10-16 296 ns/op BenchmarkImplicator/many-columns-inexact-match10-16 6853 ns/op BenchmarkImplicator/many-columns-exact-match100-16 19817 ns/op BenchmarkImplicator/many-columns-inexact-match100-16 447894 ns/op Release note: None Co-authored-by: Bilal Akhtar <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Marcus Gartner <[email protected]>

itsbilal requested review from jbowens and petermattis October 2, 2020 19:30

itsbilal self-assigned this Oct 2, 2020

itsbilal changed the title ~~server, storage: Throw warning on slow log write, fatal on engine dis…~~ server, storage: Throw warning on slow log write, fatal on engine disk stall Oct 2, 2020

petermattis reviewed Oct 2, 2020

View reviewed changes

itsbilal force-pushed the disk-stall-fatal-true branch 2 times, most recently from 0f11e2d to 69fb2ca Compare October 6, 2020 15:42

itsbilal commented Oct 6, 2020

View reviewed changes

itsbilal force-pushed the disk-stall-fatal-true branch 3 times, most recently from d97847e to 8915e14 Compare October 6, 2020 19:32

itsbilal requested a review from petermattis October 8, 2020 22:01

petermattis reviewed Oct 8, 2020

View reviewed changes

itsbilal force-pushed the disk-stall-fatal-true branch from 8915e14 to 1c596ad Compare October 9, 2020 13:30

itsbilal commented Oct 9, 2020

View reviewed changes

petermattis approved these changes Oct 9, 2020

View reviewed changes

craig bot merged commit d816ce9 into cockroachdb:master Oct 10, 2020

itsbilal mentioned this pull request Oct 28, 2020

storage: Bring back MaxSyncDuration env var, fix disk-stalled roachtest #56064

Merged

jseldess mentioned this pull request Dec 9, 2020

server, storage: Throw warning on slow log write, fatal on engine disk stall cockroachdb/docs#9078

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server, storage: Throw warning on slow log write, fatal on engine disk stall #55186

server, storage: Throw warning on slow log write, fatal on engine disk stall #55186

itsbilal commented Oct 2, 2020 •

edited

Loading

cockroach-teamcity commented Oct 2, 2020

itsbilal commented Oct 2, 2020

petermattis left a comment

itsbilal left a comment

petermattis left a comment

itsbilal left a comment

petermattis left a comment

itsbilal commented Oct 9, 2020

craig bot commented Oct 9, 2020

craig bot commented Oct 9, 2020

craig bot commented Oct 10, 2020

server, storage: Throw warning on slow log write, fatal on engine disk stall #55186

server, storage: Throw warning on slow log write, fatal on engine disk stall #55186

Conversation

itsbilal commented Oct 2, 2020 • edited Loading

cockroach-teamcity commented Oct 2, 2020

itsbilal commented Oct 2, 2020

petermattis left a comment

Choose a reason for hiding this comment

itsbilal left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

itsbilal left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

itsbilal commented Oct 9, 2020

craig bot commented Oct 9, 2020

craig bot commented Oct 9, 2020

craig bot commented Oct 10, 2020

itsbilal commented Oct 2, 2020 •

edited

Loading