Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: fix bug in poller that allows it to emit a resolved ts before emitting row updates at the ts #41665

Merged

Conversation

aayushshah15
Copy link
Contributor

Fixes: #41415

Currently, the poller keeps track of the most recent table descriptor
version for every table being watched by it. It polls to get the current
table descriptor periodically, and triggers a backfill if it detects
that the previous version of the table descriptor has a mutation but
the current one doesn't.

It triggers the aforementioned backfill scan at the ModificationTime
of the new table descriptor (meaning that the row updates resulting from
this backfill scan will have the ModificationTime of the of the table
descriptor as their updated timestamps). Now, it is possible that the
poller forwarded a resolved timestamp up to the changeAggregator that
is equal to or greater than this ModificationTime before the
backfill scan was actually triggered.

This PR fixes this problem by first making sure that the concerned
goroutine waits until the backfill scan boundary has been detected and
then skips forwarding all resolved timestamps until the scan boundary is
executed upon.

Release note (changefeedccl): Fix bug in poller in case of schema change.

Release justification: Fix bug in poller that causes it to emit row
updates at a timestamp less than or equal to an already forwarded
resolved timestamp.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the changefeed_resolved_ts_when_backfill branch from 7fac9e0 to d8029c2 Compare October 17, 2019 01:06
@danhhz danhhz requested review from danhhz and ajwerner October 17, 2019 15:03
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

I look forward to the nemesis test which exercises this code path from existing. Can we at least add a small unit test that asserts that we do not emit a resolved timestamp at that scan boundary (we always do before this change). Perhaps it just requires adding an assertion to an existing test.

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz)

@aayushshah15 aayushshah15 force-pushed the changefeed_resolved_ts_when_backfill branch from d8029c2 to ea0ebbf Compare October 17, 2019 15:31
@danhhz
Copy link
Contributor

danhhz commented Oct 17, 2019

FYI I'm looking at this, but it's taking a sec to get the scanBoundaries stuff loaded up again

@aayushshah15
Copy link
Contributor Author

Can we at least add a small unit test that asserts that we do not emit a resolved timestamp at that scan boundary (we always do before this change). Perhaps it just requires adding an assertion to an existing test.

Wouldn't the assertion in emitEntries in changefeed.go supersede anything I add on the testing side?

Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @ajwerner)


pkg/ccl/changefeedccl/changefeed.go, line 173 at r1 (raw file):

		// that have timestamps less than or equal to any resolved timestamp its forwarded before.
		if !sf.Frontier().Less(row.updated) {
			return errors.AssertionFailedf("error: detected timestamp %s that is less than "+

I'm 100% in favor of making this error instead of skip in 20.1 early, but I'm against backporting it at this point. Too risky. I'm not convinced that every edge case is handled for us to get this strong of a guarantee and if this error fires, we'll stop the changefeed. This means it should get its own PR (or at least its own commit)


pkg/ccl/changefeedccl/poller.go, line 269 at r1 (raw file):

					// Make sure scan boundaries less than or equal to `resolvedTS` were
					// added to the `scanBoundaries` list before proceeding.
					if err := p.tableHist.WaitForTS(ctx, resolvedTS); err != nil {

good catch


pkg/ccl/changefeedccl/poller.go, line 283 at r1 (raw file):

						// to trigger the scan first before resolving its scan boundary
						// timestamp.
						frontier.Forward(e.resolved.Span, resolvedTS)

it's strange to me that the frontier doesn't match what we've sent to AddResolved. instead of moving the AddResolved call to an else, can we replace the resolvedTS = p.mu.scanBoundaries[0] line with resolvedTS = p.mu.scanBoundaries[0].Prev()?

@danhhz
Copy link
Contributor

danhhz commented Oct 17, 2019

Not for this PR of course, but this whole poller/tableHistory thing is in serious need of some better abstractions. I can see where I was going with it, but the they really didn't hold up as we found things I didn't initially think of

Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming my suggestion below works

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @ajwerner)


pkg/ccl/changefeedccl/poller.go, line 283 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

it's strange to me that the frontier doesn't match what we've sent to AddResolved. instead of moving the AddResolved call to an else, can we replace the resolvedTS = p.mu.scanBoundaries[0] line with resolvedTS = p.mu.scanBoundaries[0].Prev()?

Hmm, on second thought I'm coming around to what you have here. My suggestion would basically guarantee that we get strange looking resolved timestamps, which seems likely to cause some UX confusion

Can we do both? how about we both skip sending the resolved timestamp as you have here, and also stick a resolvedTs = resolvedTs.Prev() at the beginning of the if boundaryBreak { branch with a comment. Something like "A boundary means we're about to do a backfill at this timestamp, so at the changefeed level the boundary itself is not resolved."

ts before emitting row updates at said ts.

Fixes: cockroachdb#41415

Currently, the poller keeps track of the most recent table descriptor
version for every table being watched by it. It polls to get the current
table descriptor periodically, and triggers a backfill if it detects
that the previous version of the table descriptor has a mutation but
the current one doesn't.

It triggers the aforementioned backfill scan at the `ModificationTime`
of the new table descriptor (meaning that the row updates resulting from
this backfill scan will have the `ModificationTime` of the of the table
descriptor as their `updated` timestamps). Now, it is possible that the
poller forwarded a resolved timestamp up to the `changeAggregator` that
is equal to or greater than this `ModificationTime` *before* the
backfill scan was actually triggered.

This PR fixes this problem by first making sure that the concerned
goroutine waits until the backfill scan boundary has been detected and
then skips forwarding all resolved timestamps until the scan boundary is
executed upon.

Release note (changefeedccl): Fix bug in poller in case of schema change.

Release justification: Fix bug in poller that causes it to emit row
updates at a timestamp less than or equal to an already forwarded
resolved timestamp.
@aayushshah15 aayushshah15 force-pushed the changefeed_resolved_ts_when_backfill branch from ea0ebbf to cd19443 Compare October 17, 2019 17:26
@aayushshah15
Copy link
Contributor Author


pkg/ccl/changefeedccl/changefeed.go, line 173 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I'm 100% in favor of making this error instead of skip in 20.1 early, but I'm against backporting it at this point. Too risky. I'm not convinced that every edge case is handled for us to get this strong of a guarantee and if this error fires, we'll stop the changefeed. This means it should get its own PR (or at least its own commit)

Done. It now just logs an error message and the tests simply ensure that this message wasn't logged.

Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/ccl/changefeedccl/poller.go, line 283 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Hmm, on second thought I'm coming around to what you have here. My suggestion would basically guarantee that we get strange looking resolved timestamps, which seems likely to cause some UX confusion

Can we do both? how about we both skip sending the resolved timestamp as you have here, and also stick a resolvedTs = resolvedTs.Prev() at the beginning of the if boundaryBreak { branch with a comment. Something like "A boundary means we're about to do a backfill at this timestamp, so at the changefeed level the boundary itself is not resolved."

Done.

@aayushshah15
Copy link
Contributor Author

This is ready for a final look.

@aayushshah15
Copy link
Contributor Author

Thanks for the quick reviews!

bors r+

@andy-kimball andy-kimball mentioned this pull request Oct 17, 2019
53 tasks
@craig
Copy link
Contributor

craig bot commented Oct 17, 2019

Build failed

@aayushshah15
Copy link
Contributor Author

This seems like a flake, trying again.

bors r+

@ajwerner
Copy link
Contributor

bors r+

craig bot pushed a commit that referenced this pull request Oct 17, 2019
41665: changefeedccl: fix bug in poller that allows it to emit a resolved ts before emitting row updates at the ts r=ajwerner a=aayushshah15

Fixes: #41415

Currently, the poller keeps track of the most recent table descriptor
version for every table being watched by it. It polls to get the current
table descriptor periodically, and triggers a backfill if it detects
that the previous version of the table descriptor has a mutation but
the current one doesn't.

It triggers the aforementioned backfill scan at the `ModificationTime`
of the new table descriptor (meaning that the row updates resulting from
this backfill scan will have the `ModificationTime` of the of the table
descriptor as their `updated` timestamps). Now, it is possible that the
poller forwarded a resolved timestamp up to the `changeAggregator` that
is equal to or greater than this `ModificationTime` *before* the
backfill scan was actually triggered.

This PR fixes this problem by first making sure that the concerned
goroutine waits until the backfill scan boundary has been detected and
then skips forwarding all resolved timestamps until the scan boundary is
executed upon.

Release note (changefeedccl): Fix bug in poller in case of schema change.

Release justification: Fix bug in poller that causes it to emit row
updates at a timestamp less than or equal to an already forwarded
resolved timestamp.

Co-authored-by: Aayush Shah <[email protected]>
@craig
Copy link
Contributor

craig bot commented Oct 17, 2019

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants