
c2c: deflake c2c/shutdown roachtests #102382

Merged
merged 3 commits into from
May 11, 2023

Conversation

msbutler
Collaborator

@msbutler msbutler commented Apr 26, 2023

c2c: deflake c2c/shutdown roachtests

This patch addresses two roachtest failure modes:
- Prevents roachtest failure if a query fails during a node shutdown.

- Prevents the src cluster from returning a single-node topology, which could
  cause the stream ingestion job to hang if the participating src node gets
  shut down. Longer term, automatic replanning will prevent this. Specifically,
  this patch changes the kv workload to split and scatter the kv table across
  the cluster before the c2c job begins.

Fixes #101898
Fixes #102111

This patch also makes it easier to reproduce c2c roachtest failures by plumbing
a random seed to several components of the roachtest driver.

Release note: None
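
The seed plumbing itself isn't shown in this conversation; as a rough sketch of the idea (names and structure here are illustrative, not the actual roachtest driver API), the point is to create and log a single seed so every random choice in the driver can be replayed:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Log one seed up front so a failed run can be reproduced by pinning it.
	seed := time.Now().UnixNano()
	rng := rand.New(rand.NewSource(seed))
	fmt.Printf("roachtest driver seed: %d\n", seed)

	// Hand rng (rather than the global rand) to each component that makes
	// a random decision, e.g. the random pre-shutdown sleep discussed below.
	dur := time.Duration(5+rng.Intn(6)) * time.Second
	fmt.Printf("would nap for %s\n", dur)
}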


c2c: rename completeStreamIngestion to applyCutoverTime

Release note: none


workload: add --scatter flag to kv workload

The user can now run `./workload init kv --scatter ....` which scatters the kv
table across the cluster after the initial data load. This flag is best used
with `--splits`, `--max-block-bytes`, and `--insert-count`.

Epic: none

Release note: none
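
Mechanically, the post-load scatter boils down to a single statement against the kv table. A minimal, self-contained sketch (the connection string and driver choice are assumptions for illustration, not part of this PR):

package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver works against CockroachDB
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/kv?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SCATTER randomly rebalances the table's range replicas and leases
	// across the cluster, so the c2c producer job plans on several source
	// nodes instead of just one.
	if _, err := db.ExecContext(context.Background(), `ALTER TABLE kv.kv SCATTER`); err != nil {
		log.Fatal(err)
	}
}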

@msbutler msbutler added T-disaster-recovery backport-23.1.x Flags PRs that need to be backported to 23.1 labels Apr 26, 2023
@msbutler msbutler self-assigned this Apr 26, 2023
@msbutler msbutler requested a review from a team as a code owner April 26, 2023 22:03
@msbutler msbutler requested review from herkolategan and renatolabs and removed request for a team April 26, 2023 22:03
@cockroach-teamcity
Member

This change is Reviewable

@msbutler msbutler force-pushed the butler-deflake-shutdown branch 4 times, most recently from fc91a0c to 81a07f9 Compare April 27, 2023 21:52
@msbutler msbutler changed the title c2c: prevent single node producer job topology in roachtests c2c: deflake c2c/shutdown roachtests Apr 27, 2023
@msbutler msbutler force-pushed the butler-deflake-shutdown branch 2 times, most recently from 08dd02c to ad9c15d Compare April 28, 2023 18:47
@msbutler msbutler requested a review from a team as a code owner May 1, 2023 18:06
Contributor

@lidorcarmel lidorcarmel left a comment

Looks good, a couple of small things...

Comment on lines 982 to 990
 	// Every C2C phase should last at least 30 seconds, so introduce a little
 	// bit of random waiting before node shutdown to ensure the shutdown occurs
 	// once we're settled into the target phase.
-	randomSleep := time.Duration(rand.Intn(6))
+	//
+	// Sleep for a minimum of 5 seconds to ensure the stream processors have
+	// observed the cutover signal.
+	randomSleep := time.Duration(5 + rand.Intn(6))
 	rrd.t.L().Printf("In target phase! Take a %d second power nap", randomSleep)
 	time.Sleep(randomSleep * time.Second)
Contributor

I know my comment isn't really about this PR, but should waitForTargetPhase do what it says and just wait for the target phase, without any sleep? Then the caller (node shutdown or something else) could do the sleep instead.
When reading this code, it's not clear why node shutdown is mentioned in the comment.
WDYT?

Collaborator Author

Makes sense, see the change I made.
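
A minimal sketch of that separation (hypothetical names, not the actual diff): the waiter only polls, and the caller owns the settling sleep.

package shutdown // illustrative package; all names here are assumptions

import (
	"math/rand"
	"time"
)

type phase int

// waitForTargetPhase now does exactly what its name says: block until the
// driver reports the target phase, with no extra sleeping.
func waitForTargetPhase(current func() phase, target phase) {
	for current() != target {
		time.Sleep(time.Second)
	}
}

// The shutdown path layers its own settling sleep on top, so the reason for
// the sleep lives next to the code that needs it.
func settleIntoTargetPhase(rng *rand.Rand, current func() phase, target phase) {
	waitForTargetPhase(current, target)
	// Sleep 5-10 seconds so the stream ingestion processors observe the
	// cutover signal before any node is shut down.
	time.Sleep(time.Duration(5+rng.Intn(6)) * time.Second)
}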

 	removeTenantRateLimiters(rd.t, rd.setup.dst.sysSQL, rd.setup.dst.name)

-	lv := makeLatencyVerifier("stream-ingestion", 0, 2*time.Minute, rd.t.L(), getStreamIngestionJobInfo, rd.t.Status, false)
+	lv := makeLatencyVerifier("stream-ingestion", 0, 2*time.Minute, rd.t.L(), getStreamIngestionJobInfo, rd.t.Status, true)
Contributor

Why do we want to ignore errors here?
Don't we want the test to fail if we can't read from crdb_internal.system_jobs?

Collaborator Author

Good question: any query can fail if a node shuts down around the time of query execution. To ensure the latencyVerifier stays alive during a node shutdown event, we tolerate any of its failed queries. I made the code clearer here.
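
In sketch form, the tolerate-errors behavior amounts to the following (hypothetical names; the real latencyVerifier lives in the roachtest code):

package verifier // illustrative only

import (
	"database/sql"
	"log"
)

// pollJobInfo reads from crdb_internal.system_jobs. With tolerateErrors set,
// a failed poll (say, because the gateway node is being shut down) is logged
// and skipped rather than failing the whole test.
func pollJobInfo(db *sql.DB, tolerateErrors bool) error {
	rows, err := db.Query(`SELECT status FROM crdb_internal.system_jobs LIMIT 1`)
	if err != nil {
		if tolerateErrors {
			log.Printf("ignoring transient query error: %v", err)
			return nil
		}
		return err
	}
	return rows.Close()
}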

Collaborator

@stevendanna stevendanna left a comment

I wonder if it would also help to explicitly scatter the ranges.

@msbutler msbutler force-pushed the butler-deflake-shutdown branch 3 times, most recently from 62d5a16 to 40e4a0e Compare May 10, 2023 18:53
@msbutler msbutler force-pushed the butler-deflake-shutdown branch from 40e4a0e to 949e12e Compare May 10, 2023 19:07
msbutler added 2 commits May 10, 2023 15:43
@msbutler msbutler force-pushed the butler-deflake-shutdown branch from 949e12e to bedd7fc Compare May 10, 2023 19:47
@msbutler
Collaborator Author

msbutler commented May 10, 2023

Good idea. I added a patch to the workload cmd here. My ad hoc tests suggest that c2c is much more likely to be distributed!

Contributor

@renatolabs renatolabs left a comment

All ignorable nits, don't mind me.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel and @msbutler)


pkg/cmd/roachtest/tests/cluster_to_cluster.go line 243 at r5 (raw file):

func (kv replicateKV) sourceInitCmd(tenantName string, nodes option.NodeListOption) string {
	cmd := roachtestutil.NewCommand(`./workload init kv`)

Nit: you can chain all of these calls, which arguably makes it nicer to read:

return roachtestutil.NewCommand("./workload init kv").
    Flag("foo", 1).
    Flag("bar", 2).
    String()

pkg/cmd/roachtest/tests/cluster_to_cluster.go line 245 at r5 (raw file):

	cmd := roachtestutil.NewCommand(`./workload init kv`)
	cmd.MaybeFlag(kv.initRows > 0, "insert-count", kv.initRows)
	cmd.MaybeFlag(kv.initRows > 0, "max-block-bytes", kv.maxBlockBytes)

Did you want to check for maxBlockBytes here?


pkg/cmd/roachtest/tests/cluster_to_cluster.go line 248 at r5 (raw file):

	cmd.Flag("splits", 100)
	cmd.Option("scatter")
	cmd.Arg(fmt.Sprintf("{pgurl%s:%s}", nodes, tenantName))

Nit: Arg is for a single argument, and it already understands Printf-style formatting, so you can just do Arg("{pgurl%s:%s}", nodes, tenantName)


pkg/cmd/roachtest/tests/cluster_to_cluster.go line 1006 at r5 (raw file):

	// once we're fully settled into the target phase (e.g. the stream ingestion
	// processors have observed the cutover signal).
	randomSleep := time.Duration(5 + rrd.rng.Intn(6))

Nit: making randomSleep a time.Duration (i.e., a number of nanoseconds), even though the actual intended meaning is for it to be a number of seconds, is slightly confusing. Suggestion:

dur := time.Duration(5+rrd.rng.Intn(6)) * time.Second
t.L().Printf("taking a %s power nap", dur)
time.Sleep(dur)

@msbutler msbutler force-pushed the butler-deflake-shutdown branch from bedd7fc to 093e2dd Compare May 11, 2023 16:48
Collaborator Author

@msbutler msbutler left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel and @renatolabs)


pkg/cmd/roachtest/tests/cluster_to_cluster.go line 243 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: you can chain all of these calls, which arguably makes it nicer to read:

return roachtestutil.NewCommand("./workload init kv").
    Flag("foo", 1).
    Flag("bar", 2).
    String()

TIL. Done.
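
Putting the threads above together, the post-review shape of sourceInitCmd is presumably something like the following (a sketch assembled from the review snippets, not the merged code):

func (kv replicateKV) sourceInitCmd(tenantName string, nodes option.NodeListOption) string {
	return roachtestutil.NewCommand(`./workload init kv`).
		MaybeFlag(kv.initRows > 0, "insert-count", kv.initRows).
		MaybeFlag(kv.initRows > 0, "max-block-bytes", kv.maxBlockBytes).
		Flag("splits", 100).
		Option("scatter").
		Arg("{pgurl%s:%s}", nodes, tenantName).
		String()
}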


pkg/cmd/roachtest/tests/cluster_to_cluster.go line 245 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Did you want to check for maxBlockBytes here?

Yeah, I did this because the parameter only seems relevant if I actually insert data, and because I distrust the workload package's parameter-handling logic :D


pkg/cmd/roachtest/tests/cluster_to_cluster.go line 248 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: Arg is for single argument, and it already understands Printf-style formatting, so you can just do Arg("{pgurl%s:%s}", nodes, tenantName)

Done


pkg/cmd/roachtest/tests/cluster_to_cluster.go line 1006 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: making randomSleep a time.Duration (i.e., a number of nanoseconds), even though the actual intended meaning is for it to be a number of seconds, is slightly confusing. Suggestion:

dur := time.Duration(5+rrd.rng.Intn(6)) * time.Second
t.L().Printf("taking a %s power nap", dur)
time.Sleep(dur)

Done

@msbutler
Collaborator Author

Thanks for the nits, @renatolabs!

@msbutler
Collaborator Author

TFTRs!!

bors r=stevendanna

@craig
Contributor

craig bot commented May 11, 2023

Build failed (retrying...):

@craig
Contributor

craig bot commented May 11, 2023

Build succeeded:

@craig craig bot merged commit aa2c52b into cockroachdb:master May 11, 2023
@blathers-crl

blathers-crl bot commented May 11, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d16c011 to blathers/backport-release-23.1-102382: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
