Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: ignore duplicated events in fingerprint validator #89332

Merged
merged 2 commits into from
Oct 6, 2022

Conversation

renatolabs
Copy link
Contributor

This commit updates the fingerprint validator (and its use in the
cdc/mixed-versions test) to ignore duplicated events received by the
validator.

A previously implicit assumption of the validator is that any events
that it receives are either not duplicated, or -- if they are
duplicated -- they are within the previous resolved timestamp and the
current resolved timestamp. However, that assumption is not justified
by the changefeed guarantees and depends on how frequently resolved
events are emitted and how often the changefeed checkpoints.

In the specific case of the cdc/mixed-versions roachtest, it was
possible for the changefeed to start from an old checkpoint (older
than the last received resolved timestamp), causing it to re-emit
old events that happened way before the previously observed resolved
event. As a consequence, when the validator applies the update
associated with that event, there is a mismatch with state of the
original table as of the update's timestamp, as the fingerprint
validator relies on the fact that updates are applied in order.

To fix the issue, we now skip events that happen before the timestamp
of the previous resolved event received. In addition, the caller can
also tell the validator to verify that such out-of-order messages
received by the validator have indeed been previously seen; if not,
that would represent a violation of the changefeed's guarantees.

Fixes: #87251.

Release note: None

@renatolabs renatolabs requested a review from a team October 4, 2022 20:53
@renatolabs renatolabs requested a review from a team as a code owner October 4, 2022 20:53
@renatolabs renatolabs requested review from herkolategan and removed request for a team October 4, 2022 20:53
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@renatolabs renatolabs requested a review from samiskin October 4, 2022 20:53
@renatolabs
Copy link
Contributor Author

First commit is a slight variation of what I had done in #88961 to remove retries from the validator.

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 228 at r1 (raw file):

				if m == nil {
					t.L().Printf("end of changefeed")
					return nil

This control escape bypasses close(cmvt.validatorDone). Methinks waitForValidator would block indefinitely in this case.

To handle mixed-version CDC tests, the fingerprint validator had
previously been updated to handle retries internally; this complicates
the validator as the retry logic is only ever used in the
mixed-version roachtests.

This commit removes the retry logic from the validator and instead
allows the caller to pass a `DBFunc` to be called whenever a database
connection is needed after initialization. By passing a custom DBFunc,
tests that need it (such as the mixed-versions roachtest) can pass a
function that accounts for temporary unavailability of nodes.
Specifically, we pass a function that blocks while nodes are being
upgraded, to simplify reasoning of this test's behavior.

In order to support that, we also change the signature of
`NewFingerprintValidator` to return the actual concrete
validator (which is now made public) instead of the `Validator`
interface, following the generally-good approach of "accept
interfaces, return structs" in Go.

Release note: None
@renatolabs renatolabs force-pushed the cdc-mixed-versions-fix branch from f4e74b8 to 27b214f Compare October 6, 2022 15:33
Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @samiskin, and @srosenberg)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 228 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

This control escape bypasses close(cmvt.validatorDone). Methinks waitForValidator would block indefinitely in this case.

validatorDone is removed on the second commit. If that error condition did happen, the validator would hang waiting for resolved timestamps and the test would eventually timeout.

Changed the handling of this situation to call t.Fatal to fail early and more clearly.

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 6 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)


pkg/ccl/changefeedccl/cdctest/validator.go line 636 at r4 (raw file):

	}

	v.previouslySeen[row.eventKey()] = struct{}{}

What about duplicates with the same row key? Are those allowed by CDC semantics?


pkg/ccl/changefeedccl/cdctest/validator.go line 649 at r4 (raw file):

	}

	// row's timestamp is after the last resolved timestam; no problem

Nit: spelling timestamp.


pkg/ccl/changefeedccl/cdctest/validator.go line 657 at r4 (raw file):

	// hasn't been seen before; that shouldn't happen
	if _, seen := v.previouslySeen[row.eventKey()]; !seen {
		return fmt.Errorf("unexpected new event at timestamp %s after resolved timestamp %s",

"unexpected out-of-order event at timestamp..."?


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 54 at r4 (raw file):

	// enforcing a maximum number of operations sets a boundary on the
	// validator's memory usage
	maxOps = 12000

Do you have a sense of the memory footprint per maxOps? I am mostly just curious why 12000 is a safe value.

This commit updates the fingerprint validator (and its use in the
`cdc/mixed-versions` test) to ignore duplicated events received by the
validator.

A previously implicit assumption of the validator is that any events
that it receives are either not duplicated, or -- if they are
duplicated -- they are within the previous resolved timestamp and the
current resolved timestamp. However, that assumption is not justified
by the changefeed guarantees and depends on how frequently `resolved`
events are emitted and how often the changefeed checkpoints.

In the specific case of the `cdc/mixed-versions` roachtest, it was
possible for the changefeed to start from an old checkpoint (older
than the last received `resolved` timestamp), causing it to re-emit
old events that happened way before the previously observed resolved
event. As a consequence, when the validator applies the update
associated with that event, there is a mismatch with state of the
original table as of the update's timestamp, as the fingerprint
validator relies on the fact that updates are applied in order.

To fix the issue, we now skip events that happen before the timestamp
of the previous `resolved` event received. In addition, the caller can
also tell the validator to verify that such out-of-order messages
received by the validator have indeed been previously seen; if not,
that would represent a violation of the changefeed's guarantees.

Fixes: cockroachdb#87251.

Release note: None
@renatolabs renatolabs force-pushed the cdc-mixed-versions-fix branch from 27b214f to f350ab7 Compare October 6, 2022 19:17
Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @samiskin, and @srosenberg)


pkg/ccl/changefeedccl/cdctest/validator.go line 636 at r4 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

What about duplicates with the same row key? Are those allowed by CDC semantics?

In any run there will be multiple updates with the same row key, but with different timestamps. Which is why this event key is a combination of row key and timestamp.


pkg/ccl/changefeedccl/cdctest/validator.go line 649 at r4 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: spelling timestamp.

Fixed.


pkg/ccl/changefeedccl/cdctest/validator.go line 657 at r4 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

"unexpected out-of-order event at timestamp..."?

Updated the error message.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 54 at r4 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Do you have a sense of the memory footprint per maxOps? I am mostly just curious why 12000 is a safe value.

12000 was picked for a combination of: 1) workload finishes within the existing 30-minute timeout for the test; and 2) provides enough activity for the validator to see resolved timestamps during the different parts of the upgrade process validated in this test. I updated the comment associated with this constant with this information.

In terms of memory usage: since the workload is performing 12k operations, and each operation should translate to two keys being updated (account being transferred from, and account being transferred to), we should have ~24k entries in the previouslySeen map (I validated this in a few runs of the test). I also wrote this benchmark [1] to approximate memory usage of a map of the same size and type of data. TLDR: the map should peak at around 4MB of memory used, which is pretty acceptable IMO.

[1] https://gist.github.com/renatolabs/d6fbe68dd4c367bac1cffa692482c4ff

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)


pkg/ccl/changefeedccl/cdctest/validator.go line 636 at r4 (raw file):

Previously, renatolabs (Renato Costa) wrote…

In any run there will be multiple updates with the same row key, but with different timestamps. Which is why this event key is a combination of row key and timestamp.

Right, I meant the same (row key, timestamp) tuple seen multiple times; is this valid?


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 54 at r4 (raw file):

Previously, renatolabs (Renato Costa) wrote…

12000 was picked for a combination of: 1) workload finishes within the existing 30-minute timeout for the test; and 2) provides enough activity for the validator to see resolved timestamps during the different parts of the upgrade process validated in this test. I updated the comment associated with this constant with this information.

In terms of memory usage: since the workload is performing 12k operations, and each operation should translate to two keys being updated (account being transferred from, and account being transferred to), we should have ~24k entries in the previouslySeen map (I validated this in a few runs of the test). I also wrote this benchmark [1] to approximate memory usage of a map of the same size and type of data. TLDR: the map should peak at around 4MB of memory used, which is pretty acceptable IMO.

[1] https://gist.github.com/renatolabs/d6fbe68dd4c367bac1cffa692482c4ff

Thanks for the clarification and a microbenchmark to estimate the memory footprint!

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)


pkg/ccl/changefeedccl/cdctest/validator.go line 636 at r4 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Right, I meant the same (row key, timestamp) tuple seen multiple times; is this valid?

Yes, it's valid, changefeed's delivery is at-least-once. An "old" event can be received by the client even after multiple later resolved timestamps are received, which is exactly what the validator couldn't handle before this PR.

@renatolabs
Copy link
Contributor Author

bors r=srosenberg

TFTR!

@craig
Copy link
Contributor

craig bot commented Oct 6, 2022

Build succeeded:

@craig craig bot merged commit bfbd07b into cockroachdb:master Oct 6, 2022
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Oct 12, 2022
In cockroachdb#89332, we started optionally validating changefeed semantics in
the `FingerprintValidator` by making sure that unseen updates are never
observed if a `resolved` message is received. In order to keep bounds on
the validator's memory usage, the `cdc/mixed-versions` test was
setting a maximum number of operations at the workload level. However,
a consequence of that change is that it creates the possibility for the
workload to finish before the roachtest has received all the
`resolved` events it expects (for instance, if draining takes longer
than usual for some reason; a variety of other non-determinism is also
at play).

To deal with this possibility, we instead enforce a maximum number of
previously seen events at the validator's level. For the
`cdc/mixed-versions` test, we set a maximum of 100,000 previously seen
events being stored (20MB memory footprint). This should be equivalent
to roughly 50,000 bank transfers, and give enough time for the test to
finish successfully.

Epic: None.

Release note: None.
craig bot pushed a commit that referenced this pull request Oct 13, 2022
89782: roachtest: limit FingerprintValidator's memory usage internally r=miretskiy,srosenberg a=renatolabs

In #89332, we started optionally validating changefeed semantics in the `FingerprintValidator` by making sure that unseen updates are never observed if a `resolved` message is received. In order to keep bounds on the validator's memory usage, the `cdc/mixed-versions` test was setting a maximum number of operations at the workload level. However, a consequence of that change is that it creates the possibility for the workload to finish before the roachtest has received all the `resolved` events it expects (for instance, if draining takes longer than usual for some reason; a variety of other non-determinism is also at play).

To deal with this possibility, we instead enforce a maximum number of previously seen events at the validator's level. For the `cdc/mixed-versions` test, we set a maximum of 100,000 previously seen events being stored (20MB memory footprint). This should be equivalent to roughly 50,000 bank transfers, and give enough time for the test to finish successfully.

Release note: None

Epic: None.

89847: sql: turn plan sampling back off by default r=j82w a=j82w

The value was originally disabled by default in: #88343. The value was enabled by default: #89020 because of a bug that was fixed in #89418. This PR is disabling it by default again now that the bug has been fixed.

Part Of #89847

Release note (sql change): Turn the default value of sql.metrics.statement_details.plan_collection.enabled to false.

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: j82w <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: cdc/mixed-versions failed
3 participants