Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: record non-MVCC operations to support ExportRequests and RangeFeed #62585

Closed
ajwerner opened this issue Mar 25, 2021 · 10 comments
Closed

kv: record non-MVCC operations to support ExportRequests and RangeFeed #62585

ajwerner opened this issue Mar 25, 2021 · 10 comments
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Mar 25, 2021

Is your feature request related to a problem? Please describe.

The KV layer has two non-MVCC operations which can affect range data: AddSSTable and ClearRange. The SQL layer, in general, goes through great pains to ensure that readers do not read under these operations. These operations, however, have a bad interaction with physical incremental backups and newer streaming asynchronous replication.

The problem for incremental backups, in particular, is AddSSTable. The problem with AddSSTable is that it can rewrite history. In particular, it is possible for an AddSSTable to write keys at timestamps which fall before the end time of the previous incremental backup. These keys will not be picked up by the next incremental. This turns out to be a big correctness problems for indexes which are backed up while they are being built (#62564). To mitigate this correctness problem in the short term, we are going to stop including indexes which are non-public in incremental backups. This will be problematic for use cases where the incremental window in use is short relative to the duration of index backfills.

Streaming replication, based on RangeFeed, has problems with both of these requests. In particular, when we do an AddSSTable, we end up restarting the RangeFeed but the catch-up scan will be from the last resolved timestamp. Any writes in the SST from before that timestamp will not be seen. ClearRange is effectively invisible to the RangeFeed.

Given the importance both of streaming replication and of incremental backups, we need to do something here.

Describe the solution you'd like

The thrust of this proposal is that we should do some bookkeeping in the replicated range state to note when these non-MVCC operations have occurred. We'll use this information to drive correct behavior during ExportRequest with time bounds and during RangeFeed catch up scans.

This issue proposes that we add a new range-local, replicated prefix under which we record, keyed on the timestamp at which non-MVCC operations occurred, the relevant spans they covered, the operation, and other metadata.

The sketch of the keyspace layout will be something like:

// LocalRangeNonMVCCEvents is a prefix in the replicated, range-local keyspace 
// for a timestamp-oriented store of the history of non-MVCC operations
LocalRangeNonMVCCEvents = roachpb.Key(makeKey(LocalPrefix, roachpb.RKey("nmvcc")))

The keys which will be used in this keyspace will have the above prefix and a timestamp. This timestamp represents the timestamp at which these operations occurred. I believe this is safe because the server assigns the timestamp and these operations require a single request in the batch [1].

Inside the values we'll store metadata. For a ClearRange we just need the span. For an AddSSTable, we'll want both the span and, at least, the minimum timestamp of any key in the SST. We'll use this information to inform what information to make visible to a time bound iteration. Namely, in addition to the normal keys we'd expose to a time-bound iteration, we should expose all keys in time bounds and spans that may have been covered by AddSSTable events which occurred in that timebound. I'm not totally clear on the interaction of ClearRange on incremental backups but I do think that we'll need to at least know about them for streaming replication and they naturally fit in the same model. To account for the streaming replication, we will probably want to add some new RangeFeed message.

We can GC these records by their timestamp. Today we require for both catch-up scans and export requests that the entire time interval fall into the GC threshold. A related note is that when reading the values in some span and time interval due to an AddSSTable, if the oldest key in the SST is older than the GC Threshold, it should be safe to perform the read at the GC threshold. In fact, limiting the lower bound on the timestamp is, in a sense, an optimization.

All of this, however, does sort of redefine what these lower bounds mean on these various requests. In order to clarify the behavior, all of this interaction with non-MVCC operations should be opt-in. In fact, normal RangeFeeds which do not opt into this behavior should instead fail with an error if such a non-MVCC operation is encountered.

Describe alternatives you've considered

There was a proposal to try to force the SST timestamps to fall before the closed timestamp but that's a tough sell for a variety of reasons.

Additional context

This tracking may also enable mitigation for #31563.

Update: [1] We used to never assign a timestamp to the batch header for AddSSTable but we've very recently (#64023) started to set that to a timestamp at which the relevant read occurred. This is important to ensure that a delete tombstone is not lost to GC, erroneously revealing deleted data.

Epic: CRDB-2624

@ajwerner ajwerner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. labels Mar 25, 2021
@ajwerner
Copy link
Contributor Author

cc @dt

@ajwerner
Copy link
Contributor Author

The logic about what data to return in the export request. I'm curious if it would suffice to just do a normal (non-timebound) scan of the data at the StartTime over the span of the SST. If there were deletes in the SST that then were not covered by writes, that would be a problem, but I'm not sure that that ever happens.

@dt
Copy link
Member

dt commented Mar 25, 2021

I think we want to do a time-bound scan from the SST's min time since we ingest wide, covering SSTs all the time in index backfills. Doing a full non-TBI scan could then emit lots of extra keys vs just those in the SST's time span? As long as we're persisting the SST's key bounds, persisting it's time-bounds should be easy -- we already have them in the table props where we can read them during application in o(1) without iterating.

Then we need to pass those as a start_time skyline to inc iterator.

@ajwerner
Copy link
Contributor Author

@dt now this has a fun interaction with #64023. Namely, we want the timestamp of the AddSSTable request to be above the closed timestamp but the read timestamp we're applying in that change is definitely not going to have that property. I think all of this points to wanting another timestamp in these requests such that the AddSSTable is said to be proposed at "now" but somewhere else we stash a timestamp which we ensure is above the GC threshold (or TTL).

stevendanna added a commit to stevendanna/cockroach that referenced this issue Apr 28, 2021
In 20.2.4, a changefeed would fail if IMPORT INTO was run against one
of its target tables. The failure would look like:

```
I210428 10:45:57.982012 2015 jobs/registry.go:1131 ⋮ [n1] CHANGEFEED job 653840730248282113: stepping through state failed with error: ‹relation› ‹"test"› is offline: ‹importing›
(1)
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/catalog.FilterDescriptorState
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descriptor.go:387
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:219
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:758
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:808
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:757
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:859
  | github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128
  | runtime.goexit
  |     /usr/local/Cellar/go/1.16.3/libexec/src/runtime/asm_amd64.s:1371
Wraps: (3) ‹relation› ‹"test"› is offline: ‹importing›
Error
types: (1) *catalog.inactiveDescriptorError (2) *withstack.withStack (3) *errutil.leafError
```

We want changefeed to fail when IMPORT INTO is run because changes via
the AddSSTable mechanism is not currently reflected in the changefeed,
meaning we would fail to emit imported data.

The previous path that raised this failure depended on:

1) The descriptor being offline at the point we attempted to acquire a
   lease on it:

   https://github.com/cockroachdb/cockroach/blob/d1962910b58005096ce411bccbaddcd0c1d30cbd/pkg/ccl/changefeedccl/schemafeed/schema_feed.go#L514

2) The lease acquisition filtering out offline descriptors with an
error:

   https://github.com/cockroachdb/cockroach/blob/eda2309728392593162e962a61182eab6ab003ff/pkg/sql/catalog/descriptor.go#L209

3) The failure from the lease acquisition in the schemafeed being
   treated as a fatal error.

I believe our behaviour here has changed a few times on both the 20.2
branch and master because of changes in each of these 3 behaviours.

In this change, rather than relying on the lease acquisition, we
specifically check for offline tables in our ValidateTable
function. This function is called for every descriptor version we get
from the ExportRequest on the Descriptor table.

Currently, I believe that checking for the offline descriptors is
correct since it appears that only restore and import put tables into
an offline state.

Release note (enterprise change): CHANGEFEEDs more reliably fail when
IMPORT INTO is run against a targeted table.

Fixes cockroachdb#64276

See also cockroachdb#62585, cockroachdb#43784
stevendanna added a commit to stevendanna/cockroach that referenced this issue Apr 28, 2021
In 20.2.4, a changefeed would fail if IMPORT INTO was run against one
of its target tables. The failure would look like:

```
I210428 10:45:57.982012 2015 jobs/registry.go:1131 ⋮ [n1] CHANGEFEED job 653840730248282113: stepping through state failed with error: ‹relation› ‹"test"› is offline: ‹importing›
(1)
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/catalog.FilterDescriptorState
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descriptor.go:387
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:219
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:758
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:808
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:757
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:859
  | github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128
  | runtime.goexit
  |     /usr/local/Cellar/go/1.16.3/libexec/src/runtime/asm_amd64.s:1371
Wraps: (3) ‹relation› ‹"test"› is offline: ‹importing›
Error
types: (1) *catalog.inactiveDescriptorError (2) *withstack.withStack (3) *errutil.leafError
```

We want changefeed to fail when IMPORT INTO is run because changes via
the AddSSTable mechanism is not currently reflected in the changefeed,
meaning we would fail to emit imported data.

The previous path that raised this failure depended on:

1) The descriptor being offline at the point we attempted to acquire a
   lease on it:

   https://github.com/cockroachdb/cockroach/blob/d1962910b58005096ce411bccbaddcd0c1d30cbd/pkg/ccl/changefeedccl/schemafeed/schema_feed.go#L514

2) The lease acquisition filtering out offline descriptors with an
error:

   https://github.com/cockroachdb/cockroach/blob/eda2309728392593162e962a61182eab6ab003ff/pkg/sql/catalog/descriptor.go#L209

3) The failure from the lease acquisition in the schemafeed being
   treated as a fatal error.

I believe our behaviour here has changed a few times on both the 20.2
branch and master because of changes in each of these 3 behaviours.

In this change, rather than relying on the lease acquisition, we
specifically check for offline tables in our ValidateTable
function. This function is called for every descriptor version we get
from the ExportRequest on the Descriptor table.

Currently, I believe that checking for the offline descriptors is
correct since it appears that only restore and import put tables into
an offline state.

Release note (enterprise change): CHANGEFEEDs more reliably fail when
IMPORT INTO is run against a targeted table.

Fixes cockroachdb#64276

See also cockroachdb#62585, cockroachdb#43784
craig bot pushed a commit that referenced this issue Apr 28, 2021
64323: changefeedccl: fail changefeeds when tables go offline r=miretskiy a=stevendanna

In 20.2.4, a changefeed would fail if IMPORT INTO was run against one
of its target tables. The failure would look like:

```
I210428 10:45:57.982012 2015 jobs/registry.go:1131 ⋮ [n1] CHANGEFEED job 653840730248282113: stepping through state failed with error: ‹relation› ‹"test"› is offline: ‹importing›
(1)
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/catalog.FilterDescriptorState
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descriptor.go:387
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:219
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:758
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:808
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:757
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:859
  | github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128
  | runtime.goexit
  |     /usr/local/Cellar/go/1.16.3/libexec/src/runtime/asm_amd64.s:1371
Wraps: (3) ‹relation› ‹"test"› is offline: ‹importing›
Error
types: (1) *catalog.inactiveDescriptorError (2) *withstack.withStack (3) *errutil.leafError
```

We want changefeed to fail when IMPORT INTO is run because changes via
the AddSSTable mechanism is not currently reflected in the changefeed,
meaning we would fail to emit imported data.

The previous path that raised this failure depended on:

1) The descriptor being offline at the point we attempted to acquire a
   lease on it:

   https://github.com/cockroachdb/cockroach/blob/d1962910b58005096ce411bccbaddcd0c1d30cbd/pkg/ccl/changefeedccl/schemafeed/schema_feed.go#L514

2) The lease acquisition filtering out offline descriptors with an
error:

   https://github.com/cockroachdb/cockroach/blob/eda2309728392593162e962a61182eab6ab003ff/pkg/sql/catalog/descriptor.go#L209

3) The failure from the lease acquisition in the schemafeed being
   treated as a fatal error.

I believe our behaviour here has changed a few times on both the 20.2
branch and master because of changes in each of these 3 behaviours.

In this change, rather than relying on the lease acquisition, we
specifically check for offline tables in our ValidateTable
function. This function is called for every descriptor version we get
from the ExportRequest on the Descriptor table.

Currently, I believe that checking for the offline descriptors is
correct since it appears that only restore and import put tables into
an offline state.

Release note (enterprise change): CHANGEFEEDs more reliably fail when
IMPORT INTO is run against a targeted table.

Fixes #64276

See also #62585, #43784

Co-authored-by: Steven Danna <[email protected]>
stevendanna added a commit to stevendanna/cockroach that referenced this issue Apr 29, 2021
In 20.2.4, a changefeed would fail if IMPORT INTO was run against one
of its target tables. The failure would look like:

```
I210428 10:45:57.982012 2015 jobs/registry.go:1131 ⋮ [n1] CHANGEFEED job 653840730248282113: stepping through state failed with error: ‹relation› ‹"test"› is offline: ‹importing›
(1)
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/catalog.FilterDescriptorState
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descriptor.go:387
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:219
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:758
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:808
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:757
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:859
  | github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128
  | runtime.goexit
  |     /usr/local/Cellar/go/1.16.3/libexec/src/runtime/asm_amd64.s:1371
Wraps: (3) ‹relation› ‹"test"› is offline: ‹importing›
Error
types: (1) *catalog.inactiveDescriptorError (2) *withstack.withStack (3) *errutil.leafError
```

We want changefeed to fail when IMPORT INTO is run because changes via
the AddSSTable mechanism is not currently reflected in the changefeed,
meaning we would fail to emit imported data.

The previous path that raised this failure depended on:

1) The descriptor being offline at the point we attempted to acquire a
   lease on it:

   https://github.com/cockroachdb/cockroach/blob/d1962910b58005096ce411bccbaddcd0c1d30cbd/pkg/ccl/changefeedccl/schemafeed/schema_feed.go#L514

2) The lease acquisition filtering out offline descriptors with an
error:

   https://github.com/cockroachdb/cockroach/blob/eda2309728392593162e962a61182eab6ab003ff/pkg/sql/catalog/descriptor.go#L209

3) The failure from the lease acquisition in the schemafeed being
   treated as a fatal error.

I believe our behaviour here has changed a few times on both the 20.2
branch and master because of changes in each of these 3 behaviours.

In this change, rather than relying on the lease acquisition, we
specifically check for offline tables in our ValidateTable
function. This function is called for every descriptor version we get
from the ExportRequest on the Descriptor table.

Currently, I believe that checking for the offline descriptors is
correct since it appears that only restore and import put tables into
an offline state.

Release note (enterprise change): CHANGEFEEDs more reliably fail when
IMPORT INTO is run against a targeted table.

Fixes cockroachdb#64276

See also cockroachdb#62585, cockroachdb#43784
@erikgrinaker erikgrinaker self-assigned this May 6, 2021
@nvanbenschoten
Copy link
Member

Thanks for writing this up, @ajwerner. There's a lot to unpack in here, but I think this is moving in the right direction.

The general idea to store a versioned sequence of bulk operations on a Range allows us to start rationalizing interactions between these bulk operations and other components (rangefeed, closed timestamps, export, etc.) in the KV layer.

Interactions with closed timestamps, in particular, have been concerning me recently, as we have no way to express that a table/index should not be brought online until the closed timestamps in its key range can be trusted and won't be violated by AddSSTable operations (whose writes do not respect the closed timestamp) that are still replicating to some follower replicas. But with this scheme, we could ensure that the timestamp of the bulk operation itself respects the closed timestamp. So we could say that a table/index can only be brought online after the timestamp of its bulk operations. This would prevent follower reads served shortly after a schema change from missing some of the effects of the schema change.

I'm also happy to see that we filed #65128 after our discussion last week. In my mind, these are very related issues. A versioned sequence of bulk operations allows us to start talking about idempotency and about replay protection. For instance, if we assign a timestamp to each bulk operation then a replay can be easily detected and rejected. Similarly, if we assign a timestamp to each bulk operation then we can box out delayed application on a rollback by writing no-op guards at the maximum timestamp that the bulk operation could have been performed at. An approach like this seems safer than relying on clock synchronization (#65128 (comment)), which we are ok with for very specific cases with known failure modes (i.e. stale reads) but probably shouldn't be ok with for situations that could result in more severe failure modes like data corruption.

This issue proposes that we add a new range-local, replicated prefix under which we record, keyed on the timestamp at which non-MVCC operations occurred, the relevant spans they covered, the operation, and other metadata.

I assume these records will be stored in the range-ID local keyspace. In other words, since they aren't bound to a specific key, they need to be associated with the entire range. This also means that they need to be copied onto both sides of a Range split.

Inside the values we'll store metadata. For a ClearRange we just need the span. For an AddSSTable, we'll want both the span and, at least, the minimum timestamp of any key in the SST. We'll use this information to inform what information to make visible to a time bound iteration.

The use of range-level metadata to dictate the behavior of an MVCC scan reminds me of #16294.

@ajwerner
Copy link
Contributor Author

I assume these records will be stored in the range-ID local keyspace. In other words, since they aren't bound to a specific key, they need to be associated with the entire range.

I don't know about that. While they aren't associated with a key, they are associated with a span. Consider a ClearRange over some table span. We generally expect that range to get merged away. We definitely want to retain the history of that operation post-merge.

My sense is that thinking through split and merge behavior will be handy. In my straw-man vision for this, I'd suspect that on scan we'd scan the entire key span recording these events and copy out metadata that covers the right hand side to the right hand side. I don't think I feel the same way about merging state back together (which smells).

@nvanbenschoten
Copy link
Member

I don't know about that. While they aren't associated with a key, they are associated with a span. Consider a ClearRange over some table span. We generally expect that range to get merged away. We definitely want to retain the history of that operation post-merge.

Range-ID local keys don't necessarily disappear during a range merge. For instance, we copy over the abort span from the RHS of a merge to the LHS. So we could encode the desired splitting/merging logic regardless of whether we associate these operations with keys (range local) or not (range-ID local). If we did associate these operations with keys, I guess we'd store them at their start key and encode their end key in their value.

The other consideration is whether these operations should be indexed by their timestamp. This would be important if we want some kind of monotonicity property for writes to this log. Indexing by timestamp could be done by either including a timestamp in the keys manually or by using MVCC-keys. I'm not sure whether there's precedent for MVCC range-ID local keys, but I don't think there's anything fundamentally wrong with them.

@erikgrinaker
Copy link
Contributor

As a first stab, I'm thinking we'll use range local keys keyed by timestamp and start key (in that order), with the end key in the metadata value.

Range local seems to make sense since they're associated with a key (span). Keying primarily by timestamp should be more efficient because MVCC reads have to check for non-MVCC operations in their future and most reads happen in the recent past -- this way, we can skip more keys than if we'd index by start key first and scan all historical non-MVCC ops that intersect the read span.

Indexing by timestamp could be done by either including a timestamp in the keys manually or by using MVCC-keys. I'm not sure whether there's precedent for MVCC range-ID local keys, but I don't think there's anything fundamentally wrong with them.

While MVCC keys have the same structure (key and timestamp), we'd want to encode them differently (timestamp first). I also think using MVCC keys to represent non-MVCC operations would just get confusing since their semantics are so different.

@dt
Copy link
Member

dt commented Jun 2, 2021

Another point in favor of timestamp-first indexing: many non-mvcc ops have very wide spans (in many ClearRange cases it is the range’s entire span), so span intersection might not be very selective, if at all, compared to time-bounds.

@erikgrinaker
Copy link
Contributor

Replaced by #70429.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
None yet
Development

No branches or pull requests

5 participants