Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix bug whereby backfiller would drop spans on txn restart #54755

Conversation

ajwerner
Copy link
Contributor

This bug was caught by testing with #54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.

This bug was caught by testing with cockroachdb#54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.
@ajwerner ajwerner requested a review from thoszhang September 24, 2020 16:17
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I think someone on Bulk IO should take a look too.

What are the errors you can get from this? Does this ever fail at index validation?

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@thoszhang thoszhang requested review from a team and miretskiy and removed request for a team September 24, 2020 16:28
@thoszhang
Copy link
Contributor

(@miretskiy I just added cockroachdb/bulk-prs but feel free to reassign to someone else)

@ajwerner
Copy link
Contributor Author

The main error I was seeing was:

panic: no spans [recovered]
	panic: no spans

goroutine 1955 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc000277cb0, 0x5bdabe0, 0xc001572d80)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:207 +0x11f
panic(0x44d5140, 0x5ae4be0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PartitionSpans(0xc0011c8580, 0xc000aae0e0, 0xc001e46210, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x1d00000034)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_physical_planner.go:787 +0x130a
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).createBackfiller(0xc0011c8580, 0xc000aae0e0, 0x2, 0xc00162a0fc, 0x4, 0x300000035, 0x1637bf5bb6646088, 0x0, 0x0, 0x0, ...)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_plan_backfill.go:65 +0x16b
github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func4(0x5bdaca0, 0xc00215b8f0, 0xc0015f0f30, 0x0, 0x0)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:1013 +0x65d
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1(0x5bdaca0, 0xc00215b8f0, 0xc0015f0f30, 0x5b45420, 0xc001f063c0)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707 +0x43
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec(0xc0015f0f30, 0x5bdaca0, 0xc00215b8f0, 0xc001b1df40, 0xc0015f0f30, 0x0)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:826 +0xd9
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn(0xc0008fac80, 0x5bdaca0, 0xc00215b8f0, 0xc001b1e0a8, 0x2, 0x4beed0f)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0xfa
github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill(0xc00219a5b0, 0x5bdaca0, 0xc00215b8f0, 0x3, 0x2, 0xc350, 0x5177828, 0xc00123ef90, 0x1, 0x1, ...)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:959 +0x407
github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).backfillIndexes(0xc00219a5b0, 0x5bdaca0, 0xc00215b8f0, 0x3, 0xc00123ef90, 0x1, 0x1, 0x1, 0xc00162b548)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:1498 +0x193
github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).runBackfill(0xc00219a5b0, 0x5bdaca0, 0xc00215b8f0, 0x0, 0x0)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:305 +0x717
github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).runStateMachineAndBackfill(0xc00219a5b0, 0x5bdaca0, 0xc00215b8f0, 0x0, 0x0)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/schema_changer.go:1406 +0x89
github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).exec(0xc00219a5b0, 0x5bdaca0, 0xc00215b8c0, 0x0, 0x0)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/schema_changer.go:670 +0x4b8
github.com/cockroachdb/cockroach/pkg/sql.schemaChangeResumer.Resume.func1(0x100000035, 0xc000000000, 0xc00219a540, 0x5d4d900)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/schema_changer.go:2045 +0x52e
github.com/cockroachdb/cockroach/pkg/sql.schemaChangeResumer.Resume(0xc0021b7320, 0x5bdaca0, 0xc00215b8c0, 0x4ba27e0, 0xc000dcc900, 0xc0010613e0, 0x461c600, 0x841ac18)
	/home/ajwerner/src/github.com/cockroachdb/cockroach/pkg/sql/schema_changer.go:2156 +0x7c3
github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine(0xc00017e780, 0x5bdaca0, 0xc00218a6c0, 0x4ba27e0, 0xc000dcc900, 0x5b8bae0, 0xc000459ad8, 0xc0010613e0, 0xc0021b7320, 0xc001edc828, ...)

@thoszhang
Copy link
Contributor

Thanks. I wonder if the panic specifically is due to an inconsistency in in-memory state that goes away if the job restarts, and if there are more subtle possible corruption bugs that don't manifest an immediate crash in the backfiller.

I think if that happens, we'd catch it in the validation step, but it would be ideal to confirm that.

@ajwerner
Copy link
Contributor Author

I wonder if the panic specifically is due to an inconsistency in in-memory state that goes away if the job restarts, and if there are more subtle possible corruption bugs that don't manifest an immediate crash in the backfiller.

I'm pretty sure the operation of subtracting spans ends up being idempotent so we shouldn't have persistent correctness problems due to this one. The issue here is just that we might create an empty set of spans and then upon retry hit the panic. Given I don't see any sentry reports on this one, we might just be in luck

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy)

@miretskiy miretskiy self-requested a review September 24, 2020 17:28
@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 24, 2020

Build succeeded:

@craig craig bot merged commit 3c81612 into cockroachdb:master Sep 24, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 28, 2020
Fixes a bug introduced in cockroachdb#54755 whereby we'd always subtract from the original
set of spans rather than from the updated set of spans meaning that we could
backfill the same span multiple times.

Fixes cockroachdb#54775.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 29, 2020
54900: sql: fix a bug tracking spans in a backfill r=lucy-zhang a=ajwerner

Fixes a bug introduced in #54755 whereby we'd always subtract from the original
set of spans rather than from the updated set of spans meaning that we could
backfill the same span multiple times.

Fixes #54775.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
jayshrivastava pushed a commit that referenced this pull request Oct 8, 2020
Fixes a bug introduced in #54755 whereby we'd always subtract from the original
set of spans rather than from the updated set of spans meaning that we could
backfill the same span multiple times.

Fixes #54775.

Release note: None
@rafiss rafiss added this to the 20.2 milestone Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants