
kvserver: properly handle DeleteRange in batches that hit WriteTooOld errors #71236

Open

AlexTalks opened this issue Oct 6, 2021 · 4 comments

Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-kv (KV Team)

AlexTalks (Contributor) commented Oct 6, 2021

Currently, a batch containing a DeleteRange may read a different set of keys to delete if a subsequent blind write in the batch hits a WriteTooOld error (and leaves an intent behind). When that happens and one of the keys that needs to be deleted at the refreshed timestamp already has an intent on it, the batch cannot be replayed, because the intent that the DeleteRange should have generated is not present in the intent history.

This manifests in errors such as:

transaction b81df504-9bf5-4e2b-9a94-f9b19cee01ab with sequence 2 missing an intent with lower sequence 1

An example batch that could hit this error is below, assuming the Put operation hits a WriteTooOld error:

b := &Batch{}
b.DelRange(ctx, /Table/50/"006f9bc1", /Table/50/"32de5dba", true)
b.Put(ctx, /Table/50/"130f043b", v-1)
txn.Run(ctx, b)
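
Below is a minimal, self-contained sketch (plain Go, not CockroachDB code) of the bookkeeping that goes wrong, assuming the Put key falls inside the DelRange span (as in the example above, where /Table/50/"130f043b" lies between the two DelRange bounds). The key names and the intent-history model are illustrative simplifications of the real MVCC machinery, not its implementation.

package main

import "fmt"

// hasIntentAtOrBelow reports whether the intent history contains a write at
// or below the given sequence number, which is what a replayed write needs
// to find.
func hasIntentAtOrBelow(seqs []int, seq int) bool {
	for _, s := range seqs {
		if s <= seq {
			return true
		}
	}
	return false
}

func main() {
	const delRangeSeq, putSeq = 1, 2

	// First evaluation of the batch:
	//   - DelRange (seq 1) reads keys "a" and "b" and writes intents on them.
	//   - The blind Put (seq 2) hits WriteTooOld but still leaves an intent on "p".
	intentHistory := map[string][]int{
		"a": {delRangeSeq},
		"b": {delRangeSeq},
		"p": {putSeq},
	}

	// Replay at the refreshed timestamp: the DelRange now also reads "p"
	// (e.g. a value committed by another transaction has become visible
	// there), so replaying the seq-1 write on "p" must find an intent with
	// sequence <= 1 in its history, and there isn't one.
	for _, k := range []string{"a", "b", "p"} {
		if !hasIntentAtOrBelow(intentHistory[k], delRangeSeq) {
			fmt.Printf("key %q: transaction with sequence %d missing an intent with lower sequence %d\n",
				k, putSeq, delRangeSeq)
			continue
		}
		fmt.Printf("key %q: replay of DelRange matches intent history %v\n", k, intentHistory[k])
	}
}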

This was discovered upon adding DeleteRange to kvnemesis in #68003.

Jira issue: CRDB-10452

AlexTalks added the C-bug and T-kv labels on Oct 6, 2021
nvanbenschoten (Member) commented

@AlexTalks and I discussed what to do about this. KV batches are not currently used in a way that hits this case, and because the operation fails with an error, it cannot introduce silent corruption, so there is not much urgency to support it. Instead, we decided to update the error message to point at this issue as a possible suspect and to put the actual fix on hold.

We also decided that when updating the error message, we'll update the errors in replayTransactionalWrite from errors.Errorf to errors.AssertionFailedf.
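
For illustration, here is a hedged sketch of what that swap might look like. The helper name, its arguments, and the message wording below are hypothetical stand-ins rather than the actual replayTransactionalWrite code; only the errors.Errorf to errors.AssertionFailedf substitution is taken from the comment above.

package storage

import "github.com/cockroachdb/errors"

// missingIntentErr is a hypothetical helper; the real check lives inside
// replayTransactionalWrite. AssertionFailedf (from github.com/cockroachdb/errors)
// marks the failure as an internal invariant violation rather than an
// ordinary error, and the message can point at this issue as a likely cause.
func missingIntentErr(txnID string, curSeq, replaySeq int) error {
	// Previously this path would have used errors.Errorf with the same message.
	return errors.AssertionFailedf(
		"transaction %s with sequence %d missing an intent with lower sequence %d; "+
			"see https://github.com/cockroachdb/cockroach/issues/71236",
		txnID, curSeq, replaySeq)
}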

mwang1026 commented

If the plan is to update an error message, can we pick that up quickly and close this out?

AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Dec 6, 2021
Enhance the error message on sequence number errors when replaying a
transactional batch with a link to the possible cause, cockroachdb#71236, stemming
from an issue where a `DelRange` operation finds new keys to delete upon
replay.  This also changes the error from a generic error to an
`AssertionFailed` error.

Release note: None
AlexTalks (Contributor, Author) commented

> If the plan is to update an error message, can we pick that up quickly and close this out?

The update to the error message is in #73496; however, I believe that @nvanbenschoten and I discussed leaving this issue open as a reference to the underlying problem (which is that DelRange operations in co-mingled batches are not supported) and linking to it from the error message. If it's okay to close this out, though, I'll remove the issue link and we can do that.

AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Dec 9, 2021
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Dec 10, 2021
craig bot pushed a commit that referenced this issue Dec 10, 2021
73496: storage: add issue for sequence number errors on replaying `DelRange` r=AlexTalks a=AlexTalks


73578: storage: remove leftover logic related to interleaved intents r=sumeerbhola a=nvanbenschoten

This commit is a follow-up to #72536.

It addresses a few of the remaining items left over from removing the bulk of the interleaved intent logic. Specifically, it removes:
- the `PrecedingIntentState` type
- the `PrecedingIntentState` parameter in `Writer.ClearIntent`
- the `Writer.OverrideTxnDidNotUpdateMetaToFalse` method
- the `txnDidNotUpdateMetaHelper` type

The commit does not include any behavioral changes.

73591: ui: show per-node series for "Read Amplification" and "SSTables" graphs r=dhartunian a=nvanbenschoten

This commit addresses a longstanding usability issue with the Storage dashboard. Previously, the dashboard would show the average read amplification and the average sstable count across the cluster. When looking at these metrics, we are specifically interested in the outliers, so this made little sense. As a result, a few of our runbooks (e.g. [RocksDB inverted LSM](https://cockroachlabs.atlassian.net/wiki/spaces/TS/pages/1157890147/RocksDB+inverted+LSM)) require operators to grab custom graphs with the "Per Node" option.

This commit fixes this by splitting these graphs out to show per-node series.

_Example:_ (screenshot of the per-node graphs: https://user-images.githubusercontent.com/5438456/145142909-0babdd04-54a6-46d3-9d4e-002a2d375811.png)


Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Sep 20, 2022
This shouldn't have an effect because we still zero out
`config.Ops.Batch` owing to cockroachdb#46081.

See also cockroachdb#71236.

Release note: None
github-actions (bot) commented

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
