c2c: deflake c2c/shutdown roachtests #102382
Conversation
Force-pushed from fc91a0c to 81a07f9.
Force-pushed from 08dd02c to ad9c15d.
Looks good, a couple of small things.
// Every C2C phase should last at least 30 seconds, so introduce a little
// bit of random waiting before node shutdown to ensure the shutdown occurs
// once we're settled into the target phase.
- randomSleep := time.Duration(rand.Intn(6))
+ //
+ // Sleep for a minimum of 5 seconds to ensure the stream processors have
+ // observed the cutover signal.
+ randomSleep := time.Duration(5 + rand.Intn(6))
rrd.t.L().Printf("In target phase! Take a %d second power nap", randomSleep)
time.Sleep(randomSleep * time.Second)
I know my comment isn't really about this PR, but should waitForTargetPhase do what it says and just wait for the target phase without any sleep, and then have the caller (node shutdown or something else) do the sleep instead? When reading this code it's not clear why node shutdown is mentioned in the comment. WDYT?
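A minimal sketch of the separation suggested here, using a much-simplified stand-in for the driver state in cluster_to_cluster.go (the type, field, and helper names below are illustrative, not the real code):

package c2csketch

import (
	"fmt"
	"math/rand"
	"time"
)

// phaseWaiter is a hypothetical stand-in for the roachtest driver state; the
// real driver in cluster_to_cluster.go carries far more than this.
type phaseWaiter struct {
	rng          *rand.Rand
	currentPhase func() (int, error)
	targetPhase  int
}

// waitForTargetPhase does only what its name says: block until the
// replication stream reaches the target phase, with no extra sleep.
func (w *phaseWaiter) waitForTargetPhase() error {
	for {
		p, err := w.currentPhase()
		if err != nil {
			return err
		}
		if p >= w.targetPhase {
			return nil
		}
		time.Sleep(time.Second)
	}
}

// The node-shutdown caller decides how long to linger once the phase is
// reached, e.g. so that the stream processors can observe the cutover signal.
func (w *phaseWaiter) napBeforeShutdown() {
	dur := time.Duration(5+w.rng.Intn(6)) * time.Second
	fmt.Printf("taking a %s power nap before node shutdown\n", dur)
	time.Sleep(dur)
}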
Makes sense, see the change I made.
removeTenantRateLimiters(rd.t, rd.setup.dst.sysSQL, rd.setup.dst.name)

- lv := makeLatencyVerifier("stream-ingestion", 0, 2*time.Minute, rd.t.L(), getStreamIngestionJobInfo, rd.t.Status, false)
+ lv := makeLatencyVerifier("stream-ingestion", 0, 2*time.Minute, rd.t.L(), getStreamIngestionJobInfo, rd.t.Status, true)
Why do we want to ignore errors here? Don't we want the test to fail if we can't read from crdb_internal.system_jobs?
Good question: any query can fail if a node shuts down around the time of query execution. To ensure the latencyVerifier stays alive during a node shutdown event, we tolerate any of its failed queries. I made the code clearer here.
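To make the tolerate-errors behavior concrete, here is a minimal sketch of what it looks like in a polling loop; the type, fields, and placeholder query are assumptions for illustration, not the actual latencyVerifier implementation:

package c2csketch

import (
	gosql "database/sql"
	"log"
	"time"
)

// latencyPoller is a hypothetical, stripped-down analogue of the verifier:
// when tolerateErrors is set, a failed poll is logged and skipped rather than
// failing the test, since queries can fail transiently during a node shutdown.
type latencyPoller struct {
	db             *gosql.DB
	tolerateErrors bool
}

func (p *latencyPoller) pollOnce() (time.Duration, error) {
	var highWater time.Time
	// Placeholder query; the real verifier reads job progress (e.g. from
	// crdb_internal.system_jobs) to compute ingestion latency.
	if err := p.db.QueryRow(`SELECT now()`).Scan(&highWater); err != nil {
		if p.tolerateErrors {
			log.Printf("ignoring transient query error: %v", err)
			return 0, nil
		}
		return 0, err
	}
	return time.Since(highWater), nil
}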
Force-pushed from 9f5357e to 60c7fe4.
I wonder if it would also help to explicitly scatter the ranges.
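For reference, a minimal sketch of what an explicit scatter issued from the test driver could look like, assuming an open SQL connection to the source tenant and the kv workload's kv.kv table (the helper name is hypothetical; the PR ultimately uses the workload's --scatter flag instead):

import gosql "database/sql"

// scatterKVTable is a hypothetical helper: it asks CockroachDB to rebalance
// the kv table's ranges across the cluster via ALTER TABLE ... SCATTER.
func scatterKVTable(srcSQL *gosql.DB) error {
	_, err := srcSQL.Exec(`ALTER TABLE kv.kv SCATTER`)
	return err
}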
Force-pushed from 62d5a16 to 40e4a0e.
Force-pushed from 40e4a0e to 949e12e.
The user can now run `./workload init kv --scatter ...`, which scatters the kv table across the cluster after the initial data load. This flag is best used with `--splits`, `--max-block-bytes`, and `--insert-count`.

Epic: none

Release note: none
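Based on the flags above and the chained roachtestutil command builder discussed in the review comments below, the source-init command might be assembled roughly like this (a sketch; the exact replicateKV fields are assumptions):

func (kv replicateKV) sourceInitCmd(tenantName string, nodes option.NodeListOption) string {
	return roachtestutil.NewCommand(`./workload init kv`).
		MaybeFlag(kv.initRows > 0, "insert-count", kv.initRows).
		MaybeFlag(kv.maxBlockBytes > 0, "max-block-bytes", kv.maxBlockBytes).
		Flag("splits", 100).
		Option("scatter").
		Arg("{pgurl%s:%s}", nodes, tenantName).
		String()
}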
Force-pushed from 949e12e to bedd7fc.
All ignorable nits, don't mind me.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel and @msbutler)
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 243 at r5 (raw file):
func (kv replicateKV) sourceInitCmd(tenantName string, nodes option.NodeListOption) string {
	cmd := roachtestutil.NewCommand(`./workload init kv`)
Nit: you can chain all of these calls, which arguably makes it nicer to read:
return roachtestutil.NewCommand("./workload init kv").
Flag("foo", 1).
Flag("bar", 2).
String()
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 245 at r5 (raw file):
cmd := roachtestutil.NewCommand(`./workload init kv`)
cmd.MaybeFlag(kv.initRows > 0, "insert-count", kv.initRows)
cmd.MaybeFlag(kv.initRows > 0, "max-block-bytes", kv.maxBlockBytes)
Did you want to check for maxBlockBytes here?
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 248 at r5 (raw file):
cmd.Flag("splits", 100)
cmd.Option("scatter")
cmd.Arg(fmt.Sprintf("{pgurl%s:%s}", nodes, tenantName))
Nit: Arg is for a single argument, and it already understands Printf-style formatting, so you can just do `Arg("{pgurl%s:%s}", nodes, tenantName)`.
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 1006 at r5 (raw file):
// once we're fully settled into the target phase (e.g. the stream ingestion
// processors have observed the cutover signal).
randomSleep := time.Duration(5 + rrd.rng.Intn(6))
Nit: making randomSleep a time.Duration (i.e., a number of nanoseconds), even though the actual intended meaning is a number of seconds, is slightly confusing. Suggestion:
dur := time.Duration(5+rrd.rng.Intn(6)) * time.Second
t.L().Printf("taking a %s power nap", dur)
time.Sleep(dur)
This patch addresses two roachtest failure modes:

- Prevents roachtest failure if a query fails during a node shutdown.
- Prevents the src cluster from returning a single-node topology, which could cause the stream ingestion job to hang if the participating src node gets shut down. Longer term, automatic replanning will prevent this. Specifically, this patch changes the kv workload to split and scatter the kv table across the cluster before the c2c job begins.

Fixes cockroachdb#101898
Fixes cockroachdb#102111

This patch also makes it easier to reproduce c2c roachtest failures by plumbing a random seed to several components of the roachtest driver.

Release note: None
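A minimal sketch of the seed plumbing described in the commit message; the helper name and logging are illustrative, not the actual driver code:

package c2csketch

import (
	"log"
	"math/rand"
	"time"
)

// newDriverRNG creates the single RNG that gets plumbed into the roachtest
// driver's random components, and logs its seed so that a failing run can be
// reproduced by re-using the same seed.
func newDriverRNG() (*rand.Rand, int64) {
	seed := time.Now().UnixNano()
	log.Printf("c2c roachtest driver seed: %d", seed)
	return rand.New(rand.NewSource(seed)), seed
}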
Force-pushed from bedd7fc to 093e2dd.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel and @renatolabs)
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 243 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: you can chain all of these calls which arguably makes it nicer to read:
return roachtestutil.NewCommand("./workload init kv").
	Flag("foo", 1).
	Flag("bar", 2).
	String()
TIL. Done.
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 245 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Did you want to check for maxBlockBytes here?
Yeah, I did this because the parameter only seems relevant if I actually insert data, and because I distrust the workload package's parameter-handling logic :D
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 248 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: Arg is for a single argument, and it already understands Printf-style formatting, so you can just do `Arg("{pgurl%s:%s}", nodes, tenantName)`.
Done
pkg/cmd/roachtest/tests/cluster_to_cluster.go
line 1006 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: making randomSleep a time.Duration (i.e., a number of nanoseconds), even though the actual intended meaning is a number of seconds, is slightly confusing. Suggestion:
dur := time.Duration(5+rrd.rng.Intn(6)) * time.Second
t.L().Printf("taking a %s power nap", dur)
time.Sleep(dur)
Done
Thanks for the nits @renatolabs!
TFTRs!! bors r=stevendanna
Build failed (retrying...)
Build succeeded
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from d16c011 to blathers/backport-release-23.1-102382: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.