Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: range does not match splits #32784

Closed
tbg opened this issue Dec 3, 2018 · 2 comments
Closed

storage: range does not match splits #32784

tbg opened this issue Dec 3, 2018 · 2 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@tbg
Copy link
Member

tbg commented Dec 3, 2018

--- FAIL: TestSplitTriggerRaftSnapshotRace (10.79s)
	client_split_test.go:562: "key-53": split unexpected error: split at key "key-53" failed: replica corruption (processed=true): range does not match splits: ("key-45"-"key-53") + ("key-53"-/Table/SystemConfigSpan/Start) != [n1,s1,r10/1:/Table/1{3-4}]


11946 runs completed, 1 failures, over 57m7s

via

git fetch tbg && git checkout fix/split-trigger-wait && git reset --hard @{upstream} && make stress PKG=./pkg/storage/ TESTS=SplitTriggerRaftSnap | tee log

(i.e. just use master once #32782 is in).

I think this is a merge bug, based on this not failing before I changed the test to allow merges.

Two repros attached.
crash1.txt
crash2.txt

@tbg tbg added the S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. label Dec 3, 2018
@tbg
Copy link
Member Author

tbg commented Dec 4, 2018

Got another one after 6s:
crash3.txt

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 4, 2018
tbg added a commit to tbg/cockroach that referenced this issue Dec 7, 2018
It exposes a bug (likely in range merges), cockroachdb#32784.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 14, 2018
It exposes a bug (likely in range merges), cockroachdb#32784.

Release note: None
@benesch
Copy link
Contributor

benesch commented Dec 20, 2018

Ok, I've tracked this down to a problem in the internal executor:

I181220 06:11:25.172707 3094 storage/replica_command.go:350  [n2,s2,r197/2:key-{38-50}] initiating a split of this range at key "key-45" [r322] with desc r197:key-{38-45} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=2] and txn b1780318
I181220 06:11:25.276770 3094 storage/replica_command.go:350  [n2,s2,r197/2:key-{38-50}] initiating a split of this range at key "key-45" [r322] with desc r197:key-{38-45} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=2] and txn 651f0584
I181220 06:11:25.289284 2753 storage/replica_command.go:394  [n2,s2,r197/2:key-{38-50}] split internal executor changed id from cf654e35-8c9e-4909-b864-4c302c4d5277 to 4dac1d01-c88c-498e-9749-b0a592054dfb
I181220 06:11:25.300231 2655 storage/replica_command.go:497  [n2,merge,s2,r151/2:key-3{7-8}] initiating a merge of r197:key-{38-50} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=1] into this range r151:key-3{7-8} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=1] txn=70b5ffaa
I181220 06:11:25.312588 6567 storage/batcheval/cmd_end_transaction.go:331  [n1,s1,r9/1:/Table/1{3-4}] commit split txn: baHeader={1545286285.280656231,0 (n1,s1):1 9 normal "split" id=4dac1d01 key=/Table/13/1/2018-12-20T06:11:22.278246Z/410307412906573826/0 rw=true pri=0.06488501 stat=PENDING epo=0 ts=1545286285.280656231,0 orig=1545286285.280656231,0 max=1545286285.280882424,0 wto=false seq=16 CONSISTENT %!s(int64=0) %!s(bool=false) %!s(bool=false) 2 <nil> %!s(bool=false)} txn=4dac1d01 args=&roachpb.SplitTrigger{LeftDesc:roachpb.RangeDescriptor{RangeID:197, StartKey:roachpb.RKey{0x6b, 0x65, 0x79, 0x2d, 0x33, 0x38}, EndKey:roachpb.RKey{0x6b, 0x65, 0x79, 0x2d, 0x34, 0x38}, Replicas:[]roachpb.ReplicaDescriptor{roachpb.ReplicaDescriptor{NodeID:1, StoreID:1, ReplicaID:1}, roachpb.ReplicaDescriptor{NodeID:2, StoreID:2, ReplicaID:2}, roachpb.ReplicaDescriptor{NodeID:3, StoreID:3, ReplicaID:3}}, NextReplicaID:4, Generation:(*int64)(0xc4203f8dc8)}, RightDesc:roachpb.RangeDescriptor{RangeID:259, StartKey:roachpb.RKey{0x6b, 0x65, 0x79, 0x2d, 0x34, 0x38}, EndKey:roachpb.RKey{0x6b, 0x65, 0x79, 0x2d, 0x35, 0x30}, Replicas:[]roachpb.ReplicaDescriptor{roachpb.ReplicaDescriptor{NodeID:1, StoreID:1, ReplicaID:1}, roachpb.ReplicaDescriptor{NodeID:2, StoreID:2, ReplicaID:2}, roachpb.ReplicaDescriptor{NodeID:3, StoreID:3, ReplicaID:3}}, NextReplicaID:4, Generation:(*int64)(nil)}}
F181220 06:11:25.312648 6567 storage/batcheval/cmd_end_transaction.go:778  [n1,s1,r9/1:/Table/1{3-4}] range does not match splits: ("key-38"-"key-48") + ("key-48"-"key-50") != [n1,s1,r9/1:/Table/1{3-4}]

The salient line is

I181220 06:11:25.289284 2753 storage/replica_command.go:394  [n2,s2,r197/2:key-{38-50}] split internal executor changed id from cf654e35-8c9e-4909-b864-4c302c4d5277 to 4dac1d01-c88c-498e-9749-b0a592054dfb

which indicates that the internal executor used to add an entry to the rangelog about the split decides to restart the txn with a new ID (!!!!!). This is kind of insane. I can only assume that turning on merges somehow causes the split txns to hit some sort of non-retryable conflicts that cause the internal executor to decide to restart with a new txn.

I'll dig in more tomorrow.

benesch added a commit to benesch/cockroach that referenced this issue Dec 21, 2018
An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (cockroachdb#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.

Fixes cockroachdb#32784.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 28, 2018
This path is seldom taken, but when it does and the node keeps running,
it won't be performing as expected. Besides, individual replicas can
become "corrupt" due to problems that affect a larger scope, as seen
in cockroachdb#33375 (which is caused by a txn anomaly, cockroachdb#32784).

Release note: None
benesch added a commit to benesch/cockroach that referenced this issue Jan 2, 2019
An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (cockroachdb#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.

Fixes cockroachdb#32784.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 2, 2019
This path is seldom taken, but when it does and the node keeps running,
it won't be performing as expected. Besides, individual replicas can
become "corrupt" due to problems that affect a larger scope, as seen
in cockroachdb#33375 (which is caused by a txn anomaly, cockroachdb#32784).

Release note: None
craig bot pushed a commit that referenced this issue Jan 2, 2019
33391: storage: fatal on replica corruption r=petermattis a=tbg

This path is seldom taken, but when it does and the node keeps running,
it won't be performing as expected. Besides, individual replicas can
become "corrupt" due to problems that affect a larger scope, as seen
in #33375 (which is caused by a txn anomaly, #32784).

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
benesch added a commit to benesch/cockroach that referenced this issue Jan 3, 2019
An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (cockroachdb#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.

Fixes cockroachdb#32784.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 3, 2019
This path is seldom taken, but when it does and the node keeps running,
it won't be performing as expected. Besides, individual replicas can
become "corrupt" due to problems that affect a larger scope, as seen
in cockroachdb#33375 (which is caused by a txn anomaly, cockroachdb#32784).

Release note: None
craig bot pushed a commit that referenced this issue Jan 3, 2019
33312: sql/opt: avoid swallowing TransactionAbortedErrors r=benesch a=benesch

An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.

Fixes #32784.

Release note: None

/cc @tbg

33417: opt: Inline constant values r=andy-kimball a=andy-kimball

Inline constants in expressions like:

  SELECT x, x+1 FROM (VALUES (1)) AS t(x) ;

with the new inlining rules, this becomes:

  VALUES (1, 2)

The new inlining rules are useful for mutation expressions (e.g. UPDATE),
which can nest multiple Project and Values expressions that often use
constant values. For example:

  CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a + 1) STORED);
  UPDATE ab SET a=1

This now gets mapped by the optimizer to this internal equivalent:

  UPDATE ab SET a=1, b=2

Release note: None

33421: opt: Tighten up InlineProjectInProject rule r=andy-kimball a=andy-kimball

Allow inlining nested Project in case where there are duplicate refs
to same inner passthrough column. Previously, this case prevented
inlining. However, only duplicate references to inner *synthesized*
columns need to be detected.

Release note: None

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
@craig craig bot closed this as completed in #33312 Jan 3, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 23, 2020
The merge queue was disabled in this test by df26cf6 due to the bug
found in cockroachdb#32784. That bug was fixed by cockroachdb#33312, so we can address the
TODO and re-enable merges in the test.

Release note: None
Release justification: test only
craig bot pushed a commit that referenced this issue Mar 23, 2020
46408: opt: fix buggy EliminateUpsertDistinctOnNoColumns rule r=andy-kimball a=andy-kimball

The fix for #37880 added a new ErrorOnDup field to the UpsertDistinctOn
operator. However, the EliminateErrorDistinctNoColumns rule wasn't changed
to respect this flag, and so there's still a case where ON CONFLICT DO
NOTHING returns an error.

This commit eliminates the error by separating out the ErrorOnDup=True case
from the EliminateErrorDistinctNoColumns rule (which raises an error) and
adding it instead to the EliminateDistinctNoColumns rule (which does not
raise an error).

Fixes #46395

Release justification: Bug fixes and low-risk updates to new functionality.
This was always an error, so it's low-risk to make it a non-error.

Release note: None

46431: kv: don't disable the merge queue needlessly in more tests r=nvanbenschoten a=nvanbenschoten

Follow up to #46383.

These tests were disabling the queue to not interfere with its
AdminSplits, but since the tests were written, AdminSplit got
a TTL.

Release note: None
Release justification: test only

46433: kv: re-enable merge queue in TestSplitTriggerRaftSnapshotRace r=nvanbenschoten a=nvanbenschoten

The merge queue was disabled in this test by df26cf6 due to the bug
found in #32784. That bug was fixed by #33312, so we can address the
TODO and re-enable merges in the test.

Release note: None
Release justification: test only

46444: kv: deflake TestRangeInfo by using a manual clock r=nvanbenschoten a=nvanbenschoten

Fixes #46215.

Fallout from #45984.

Release justification: testing only
Release note: None.

Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Projects
None yet
Development

No branches or pull requests

2 participants