storage: range does not match splits #32784
Got another one after 6s:
It exposes a bug (likely in range merges), cockroachdb#32784. Release note: None
Ok, I've tracked this down to a problem in the internal executor:
The salient line is
which indicates that the internal executor used to add an entry to the rangelog about the split decides to restart the txn with a new ID (!!!!!). This is kind of insane. I can only assume that turning on merges somehow causes the split txns to hit some sort of non-retryable conflicts that cause the internal executor to decide to restart with a new txn. I'll dig in more tomorrow.
An optimizer code path could, in rare cases, fail to propagate a transaction aborted error. This proved disastrous as, due to a footgun in our transaction API (cockroachdb#22615), swallowing a transaction aborted error results in proceeding with a brand new transaction that has no knowledge of the earlier operations performed on the original transaction. This presented as a rare and confusing bug in splits, as the split transaction uses an internal executor. The internal executor would occasionally silently return a new transaction that only had half of the necessary operations performed on it, and committing that partial transaction would result in a "range does not match splits" error. Fixes cockroachdb#32784. Release note: None
This path is seldom taken, but when it is and the node keeps running, the node won't perform as expected. Besides, individual replicas can become "corrupt" due to problems that affect a larger scope, as seen in cockroachdb#33375 (which is caused by a txn anomaly, cockroachdb#32784). Release note: None
33391: storage: fatal on replica corruption r=petermattis a=tbg This path is seldom taken, but when it is and the node keeps running, the node won't perform as expected. Besides, individual replicas can become "corrupt" due to problems that affect a larger scope, as seen in #33375 (which is caused by a txn anomaly, #32784). Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
33312: sql/opt: avoid swallowing TransactionAbortedErrors r=benesch a=benesch
An optimizer code path could, in rare cases, fail to propagate a transaction aborted error. This proved disastrous as, due to a footgun in our transaction API (#22615), swallowing a transaction aborted error results in proceeding with a brand new transaction that has no knowledge of the earlier operations performed on the original transaction.
This presented as a rare and confusing bug in splits, as the split transaction uses an internal executor. The internal executor would occasionally silently return a new transaction that only had half of the necessary operations performed on it, and committing that partial transaction would result in a "range does not match splits" error.
Fixes #32784.
Release note: None
/cc @tbg

33417: opt: Inline constant values r=andy-kimball a=andy-kimball
Inline constants in expressions like:
  SELECT x, x+1 FROM (VALUES (1)) AS t(x);
with the new inlining rules, this becomes:
  VALUES (1, 2)
The new inlining rules are useful for mutation expressions (e.g. UPDATE), which can nest multiple Project and Values expressions that often use constant values. For example:
  CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a + 1) STORED);
  UPDATE ab SET a=1
This now gets mapped by the optimizer to this internal equivalent:
  UPDATE ab SET a=1, b=2
Release note: None

33421: opt: Tighten up InlineProjectInProject rule r=andy-kimball a=andy-kimball
Allow inlining nested Project in case where there are duplicate refs to same inner passthrough column. Previously, this case prevented inlining. However, only duplicate references to inner *synthesized* columns need to be detected.
Release note: None

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
The merge queue was disabled in this test by df26cf6 due to the bug found in cockroachdb#32784. That bug was fixed by cockroachdb#33312, so we can address the TODO and re-enable merges in the test. Release note: None Release justification: test only
46408: opt: fix buggy EliminateUpsertDistinctOnNoColumns rule r=andy-kimball a=andy-kimball
The fix for #37880 added a new ErrorOnDup field to the UpsertDistinctOn operator. However, the EliminateErrorDistinctNoColumns rule wasn't changed to respect this flag, and so there's still a case where ON CONFLICT DO NOTHING returns an error.
This commit eliminates the error by separating out the ErrorOnDup=True case from the EliminateErrorDistinctNoColumns rule (which raises an error) and adding it instead to the EliminateDistinctNoColumns rule (which does not raise an error).
Fixes #46395
Release justification: Bug fixes and low-risk updates to new functionality. This was always an error, so it's low-risk to make it a non-error.
Release note: None

46431: kv: don't disable the merge queue needlessly in more tests r=nvanbenschoten a=nvanbenschoten
Follow up to #46383.
These tests were disabling the queue to not interfere with its AdminSplits, but since the tests were written, AdminSplit got a TTL.
Release note: None
Release justification: test only

46433: kv: re-enable merge queue in TestSplitTriggerRaftSnapshotRace r=nvanbenschoten a=nvanbenschoten
The merge queue was disabled in this test by df26cf6 due to the bug found in #32784. That bug was fixed by #33312, so we can address the TODO and re-enable merges in the test.
Release note: None
Release justification: test only

46444: kv: deflake TestRangeInfo by using a manual clock r=nvanbenschoten a=nvanbenschoten
Fixes #46215.
Fallout from #45984.
Release justification: testing only
Release note: None.

Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
via
(i.e. just use master once #32782 is in).
I think this is a merge bug, based on this not failing before I changed the test to allow merges.
Two repros attached.
crash1.txt
crash2.txt