Skip to content

Commit

Permalink
storage,kv: fixes for enabling separated intents
Browse files Browse the repository at this point in the history
The main part of this PR is a reworking of the bounds behavior
of intentInterleavingIter and a clarification of that behavior
in engine.go. The existing behavior was subtly broken in two
ways:
- iteration up from local keys without an upper bound or an
  upper bound that was a global key could cause the iterator
  to enter an error state when it found a separated intent
  without the corresponding provisional value. Similarly, the
  behavior when iterating backwards from global keys without
  a lower bound, or a lower bound that was a local key would
  encounter an error. There are additional tests that exercise
  this (the earlier tests did not construct an engine with
  both local and global keys and therefore missed this
  problem).
- even if we could fix the inconsistency from the previous
  bullet, there is a semantic concern that a caller that
  sets a global upper bound (local lower bound) and seeks
  to a local key (global key) and then iterates forward
  (backward) is expecting this iteration to seamlessly go
  from local (global) to global (local) keys, which is
  not something we can do with intentInterleavingIter since
  it would require it to transparently jump over the
  intermediate lock table.
  I didn't expect there were callers trying to do this, but
  there are: we have multiple places that have both
  local and global spans, and create an iterator with an
  upperbound that is the end of the highest global span and
  then SeekGE to the start of the smallest local span.
  They are not actually trying to go from local to global
  without an intermediate seek, but it is impossible to
  know their intention from the iterator interface. Such
  behavior is no longer allowed, since intentInterleavingIter
  is very strict about ensuring that if it is constrained to
  local (global) keys, all seek and SetUpperBound calls
  are using the same kind of key. Without such strictness
  we are at risk of introducing hard-to-find correctness
  bugs.

Additionally, we now place the lock table keys at the
end of the local key space, after the local store keys.
This is convenient for cleanly separating the two parts
of the key space that a caller iterates over using an
MVCCIterator (even though local store keys don't have
versions).

intentInterleavingIter is simpler with the above fixes:
- The compex upper-bound behavior for prefix iteration is
  ripped out. It was never needed since pebble.Iterator
  is very strict about not exposing keys that don't match
  the prefix.
- The constraint on whether the iterator is over local or
  global keys or unconstrained (only for prefix iteration
  because of the previous bullet), is computed at iterator
  construction time.

The callers mentioned above are now fixed. They now create
new iterators for each span, or lazily create a new one if
the existing iterator was constrained differently. Creating
a new iterator is usually cheap because of iterator reuse.
One exception is that pebbleSnapshot does not reuse
iterators -- I tried to add reuse there but I can't add
it directly to pebbleSnapshot itself since callers may
be expecting to the use the same snapshot in multiple
goroutines. And wrapping pebbleSnapshot for iterator reuse
in a single goroutine is complicated because the snapshot
may have been wrapped in other Reader implementations.
I suspect the iterator construction is not going to
performance critical for these snapshot use cases, but
could use a more informed opinion.

Related to the above, SetUpperBound is now deprecated for
MVCCIterator and EngineIterator.

Informs #41720

Release note: None
  • Loading branch information
sumeerbhola committed Feb 2, 2021
1 parent 995d716 commit 83fbe72
Show file tree
Hide file tree
Showing 36 changed files with 1,243 additions and 333 deletions.
1 change: 1 addition & 0 deletions pkg/ccl/storageccl/engineccl/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ go_test(
"//pkg/base",
"//pkg/ccl/baseccl",
"//pkg/ccl/storageccl/engineccl/enginepbccl:enginepbccl_go_proto",
"//pkg/keys",
"//pkg/roachpb",
"//pkg/settings/cluster",
"//pkg/storage",
Expand Down
3 changes: 2 additions & 1 deletion pkg/ccl/storageccl/engineccl/bench_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ import (
"testing"

"github.com/cockroachdb/cockroach/pkg/base"
"github.com/cockroachdb/cockroach/pkg/keys"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/settings/cluster"
"github.com/cockroachdb/cockroach/pkg/storage"
Expand Down Expand Up @@ -153,7 +154,7 @@ func runIterate(
}
it := makeIterator(eng, startTime, endTime)
defer it.Close()
for it.SeekGE(storage.MVCCKey{}); ; it.Next() {
for it.SeekGE(storage.MVCCKey{Key: keys.LocalMax}); ; it.Next() {
if ok, err := it.Valid(); !ok {
if err != nil {
b.Fatal(err)
Expand Down
3 changes: 2 additions & 1 deletion pkg/ccl/storageccl/export_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -729,14 +729,15 @@ func TestRandomKeyAndTimestampExport(t *testing.T) {
return keys, timestamps
}

localMax := keys.LocalMax
testWithTargetSize := func(t *testing.T, targetSize uint64) {
e, cleanup := mkEngine(t)
defer cleanup()
rnd, _ := randutil.NewPseudoRand()
numKeys := getNumKeys(t, rnd, targetSize)
keys, timestamps := mkData(t, e, rnd, numKeys)
var (
keyMin = roachpb.KeyMin
keyMin = localMax
keyMax = roachpb.KeyMax

tsMin = hlc.Timestamp{WallTime: 0, Logical: 0}
Expand Down
8 changes: 4 additions & 4 deletions pkg/ccl/storageccl/import_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,7 @@ func TestMaxImportBatchSize(t *testing.T) {
func slurpSSTablesLatestKey(
t *testing.T, dir string, paths []string, kr prefixRewriter,
) []storage.MVCCKeyValue {
start, end := storage.MVCCKey{Key: keys.MinKey}, storage.MVCCKey{Key: keys.MaxKey}
start, end := storage.MVCCKey{Key: keys.LocalMax}, storage.MVCCKey{Key: keys.MaxKey}

e := storage.NewDefaultInMem()
defer e.Close()
Expand Down Expand Up @@ -111,9 +111,9 @@ func slurpSSTablesLatestKey(
v := roachpb.Value{RawBytes: newKv.Value}
v.ClearChecksum()
v.InitChecksum(newKv.Key.Key)
// TODO(sumeer): this will not be correct with the separated
// lock table. We should iterate using EngineKey on the sst,
// and expose a PutEngine method to write directly.
// NB: import data does not contain intents, so data with no timestamps
// is inline meta and not intents. Therefore this is not affected by the
// choice of interleaved or separated intents.
if newKv.Key.Timestamp.IsEmpty() {
if err := batch.PutUnversioned(newKv.Key.Key, v.RawBytes); err != nil {
t.Fatal(err)
Expand Down
62 changes: 31 additions & 31 deletions pkg/keys/constants.go
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,8 @@ var (
localSuffixLength = 4

// There are five types of local key data enumerated below: replicated
// range-ID, unreplicated range-ID, range local, range lock, and
// store-local keys.
// range-ID, unreplicated range-ID, range local, store-local, and range lock
// keys.

// 1. Replicated Range-ID keys
//
Expand Down Expand Up @@ -145,36 +145,10 @@ var (
// transaction records. The additional detail is the transaction id.
LocalTransactionSuffix = roachpb.RKey("txn-")

// 4. Lock table keys
// 4. Store local keys
//
// LocalRangeLockTablePrefix specifies the key prefix for the lock
// table. It is immediately followed by the LockTableSingleKeyInfix,
// and then the key being locked.
//
// The lock strength and txn UUID are not in the part of the key that
// the keys package deals with. They are in the versioned part of the
// key (see EngineKey.Version). This permits the storage engine to use
// bloom filters when searching for all locks for a lockable key.
//
// Different lock strengths may use different value types. The exclusive
// lock strength uses MVCCMetadata as the value type, since it does
// double duty as a reference to a provisional MVCC value.
// TODO(sumeer): remember to adjust this comment when adding locks of
// other strengths, or range locks.
LocalRangeLockTablePrefix = roachpb.Key(makeKey(localPrefix, roachpb.RKey("l")))
LockTableSingleKeyInfix = []byte("k")
// LockTableSingleKeyStart is the inclusive start key of the key range
// containing single key locks.
LockTableSingleKeyStart = roachpb.Key(makeKey(LocalRangeLockTablePrefix, LockTableSingleKeyInfix))
// LockTableSingleKeyEnd is the exclusive end key of the key range
// containing single key locks.
LockTableSingleKeyEnd = roachpb.Key(
makeKey(LocalRangeLockTablePrefix, roachpb.Key(LockTableSingleKeyInfix).PrefixEnd()))

// 5. Store local keys
//
// localStorePrefix is the prefix identifying per-store data.
localStorePrefix = makeKey(localPrefix, roachpb.Key("s"))
// LocalStorePrefix is the prefix identifying per-store data.
LocalStorePrefix = makeKey(localPrefix, roachpb.Key("s"))
// localStoreSuggestedCompactionSuffix stores suggested compactions to
// be aggregated and processed on the store.
localStoreSuggestedCompactionSuffix = []byte("comp")
Expand Down Expand Up @@ -210,6 +184,32 @@ var (
// LocalStoreCachedSettingsKeyMax is the end of span of possible cached settings keys.
LocalStoreCachedSettingsKeyMax = LocalStoreCachedSettingsKeyMin.PrefixEnd()

// 5. Lock table keys
//
// LocalRangeLockTablePrefix specifies the key prefix for the lock
// table. It is immediately followed by the LockTableSingleKeyInfix,
// and then the key being locked.
//
// The lock strength and txn UUID are not in the part of the key that
// the keys package deals with. They are in the versioned part of the
// key (see EngineKey.Version). This permits the storage engine to use
// bloom filters when searching for all locks for a lockable key.
//
// Different lock strengths may use different value types. The exclusive
// lock strength uses MVCCMetadata as the value type, since it does
// double duty as a reference to a provisional MVCC value.
// TODO(sumeer): remember to adjust this comment when adding locks of
// other strengths, or range locks.
LocalRangeLockTablePrefix = roachpb.Key(makeKey(localPrefix, roachpb.RKey("z")))
LockTableSingleKeyInfix = []byte("k")
// LockTableSingleKeyStart is the inclusive start key of the key range
// containing single key locks.
LockTableSingleKeyStart = roachpb.Key(makeKey(LocalRangeLockTablePrefix, LockTableSingleKeyInfix))
// LockTableSingleKeyEnd is the exclusive end key of the key range
// containing single key locks.
LockTableSingleKeyEnd = roachpb.Key(
makeKey(LocalRangeLockTablePrefix, roachpb.Key(LockTableSingleKeyInfix).PrefixEnd()))

// The global keyspace includes the meta{1,2}, system, system tenant SQL
// keys, and non-system tenant SQL keys.

Expand Down
41 changes: 22 additions & 19 deletions pkg/keys/doc.go
Original file line number Diff line number Diff line change
Expand Up @@ -137,7 +137,9 @@
// be range local keys. If not, they're meant to be range-ID local keys. Any key
// we need to re-write during splits/merges will needs to go through Raft. We
// have limits set on the size of Raft proposals so we generally don’t want to
// be re-writing lots of data.
// be re-writing lots of data. Range lock keys (see below) are separate from
// range local keys, but behave similarly in that they split and merge along
// range boundaries.
//
// This naturally leads to range-id local keys being used to store metadata
// about a specific Range and range local keys being used to store metadata
Expand All @@ -157,8 +159,9 @@ var _ = [...]interface{}{
MinKey,

// There are five types of local key data enumerated below: replicated
// range-ID, unreplicated range-ID, range local, range lock, and
// store-local keys.
// range-ID, unreplicated range-ID, range local, store-local, and range lock
// keys. Range lock keys are required to be last category of keys in the
// lock key space.
// Local keys are constructed using a prefix, an optional infix, and a
// suffix. The prefix and infix are used to disambiguate between the four
// types of local keys listed above, and determines inter-group ordering.
Expand All @@ -169,14 +172,14 @@ var _ = [...]interface{}{
// - RangeID unreplicated keys all share `LocalRangeIDPrefix` and
// `localRangeIDUnreplicatedInfix`.
// - Range local keys all share `LocalRangePrefix`.
// - Store keys all share `localStorePrefix`.
// - Range lock (which are also local keys) all share
// `LocalRangeLockTablePrefix`.
// - Store keys all share `localStorePrefix`.
//
// `LocalRangeIDPrefix`, `localRangePrefix`, `LocalRangeLockTablePrefix`,
// and `localStorePrefix` all in turn share `localPrefix`. `localPrefix` was
// chosen arbitrarily. Local keys would work just as well with a different
// prefix, like 0xff, or even with a suffix.
// `LocalRangeIDPrefix`, `localRangePrefix`, `localStorePrefix`, and
// `LocalRangeLockTablePrefix` all in turn share `localPrefix`.
// `localPrefix` was chosen arbitrarily. Local keys would work just as well
// with a different prefix, like 0xff, or even with a suffix.

// 1. Replicated range-ID local keys: These store metadata pertaining to a
// range as a whole. Though they are replicated, they are unaddressable.
Expand Down Expand Up @@ -210,17 +213,7 @@ var _ = [...]interface{}{
RangeDescriptorKey, // "rdsc"
TransactionKey, // "txn-"

// 4. Range lock keys for all replicated locks. All range locks share
// LocalRangeLockTablePrefix. Locks can be acquired on global keys and on
// range local keys. Currently, locks are only on single keys, i.e., not
// on a range of keys. Only exclusive locks are currently supported, and
// these additionally function as pointers to the provisional MVCC values.
// Single key locks use a byte, LockTableSingleKeyInfix, that follows
// the LocalRangeLockTablePrefix. This is to keep the single-key locks
// separate from (future) range locks.
LockTableSingleKey,

// 5. Store local keys: These contain metadata about an individual store.
// 4. Store local keys: These contain metadata about an individual store.
// They are unreplicated and unaddressable. The typical example is the
// store 'ident' record. They all share `localStorePrefix`.
StoreClusterVersionKey, // "cver"
Expand All @@ -231,6 +224,16 @@ var _ = [...]interface{}{
StoreLastUpKey, // "uptm"
StoreCachedSettingsKey, // "stng"

// 5. Range lock keys for all replicated locks. All range locks share
// LocalRangeLockTablePrefix. Locks can be acquired on global keys and on
// range local keys. Currently, locks are only on single keys, i.e., not
// on a range of keys. Only exclusive locks are currently supported, and
// these additionally function as pointers to the provisional MVCC values.
// Single key locks use a byte, LockTableSingleKeyInfix, that follows
// the LocalRangeLockTablePrefix. This is to keep the single-key locks
// separate from (future) range locks.
LockTableSingleKey,

// The global keyspace includes the meta{1,2}, system, system tenant SQL
// keys, and non-system tenant SQL keys.
//
Expand Down
18 changes: 6 additions & 12 deletions pkg/keys/keys.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,8 +27,8 @@ func makeKey(keys ...[]byte) []byte {
// MakeStoreKey creates a store-local key based on the metadata key
// suffix, and optional detail.
func MakeStoreKey(suffix, detail roachpb.RKey) roachpb.Key {
key := make(roachpb.Key, 0, len(localStorePrefix)+len(suffix)+len(detail))
key = append(key, localStorePrefix...)
key := make(roachpb.Key, 0, len(LocalStorePrefix)+len(suffix)+len(detail))
key = append(key, LocalStorePrefix...)
key = append(key, suffix...)
key = append(key, detail...)
return key
Expand All @@ -37,11 +37,11 @@ func MakeStoreKey(suffix, detail roachpb.RKey) roachpb.Key {
// DecodeStoreKey returns the suffix and detail portions of a local
// store key.
func DecodeStoreKey(key roachpb.Key) (suffix, detail roachpb.RKey, err error) {
if !bytes.HasPrefix(key, localStorePrefix) {
return nil, nil, errors.Errorf("key %s does not have %s prefix", key, localStorePrefix)
if !bytes.HasPrefix(key, LocalStorePrefix) {
return nil, nil, errors.Errorf("key %s does not have %s prefix", key, LocalStorePrefix)
}
// Cut the prefix, the Range ID, and the infix specifier.
key = key[len(localStorePrefix):]
key = key[len(LocalStorePrefix):]
if len(key) < localSuffixLength {
return nil, nil, errors.Errorf("malformed key does not contain local store suffix")
}
Expand Down Expand Up @@ -481,12 +481,6 @@ func IsLocal(k roachpb.Key) bool {
return bytes.HasPrefix(k, localPrefix)
}

// IsLocalStoreKey performs a cheap check that returns true iff the parameter
// is a local store key.
func IsLocalStoreKey(k roachpb.Key) bool {
return bytes.HasPrefix(k, localStorePrefix)
}

// Addr returns the address for the key, used to lookup the range containing the
// key. In the normal case, this is simply the key's value. However, for local
// keys, such as transaction records, the address is the inner encoded key, with
Expand All @@ -512,7 +506,7 @@ func Addr(k roachpb.Key) (roachpb.RKey, error) {
}

for {
if bytes.HasPrefix(k, localStorePrefix) {
if bytes.HasPrefix(k, LocalStorePrefix) {
return nil, errors.Errorf("store-local key %q is not addressable", k)
}
if bytes.HasPrefix(k, LocalRangeIDPrefix) {
Expand Down
6 changes: 3 additions & 3 deletions pkg/keys/printer.go
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,7 @@ var (
// KeyDict drives the pretty-printing and pretty-scanning of the key space.
KeyDict = KeyComprehensionTable{
{Name: "/Local", start: localPrefix, end: LocalMax, Entries: []DictEntry{
{Name: "/Store", prefix: roachpb.Key(localStorePrefix),
{Name: "/Store", prefix: roachpb.Key(LocalStorePrefix),
ppFunc: localStoreKeyPrint, PSFunc: localStoreKeyParse},
{Name: "/RangeID", prefix: roachpb.Key(LocalRangeIDPrefix),
ppFunc: localRangeIDKeyPrint, PSFunc: localRangeIDKeyParse},
Expand Down Expand Up @@ -218,11 +218,11 @@ func localStoreKeyPrint(_ []encoding.Direction, key roachpb.Key) string {
if bytes.HasPrefix(key, v.key) {
if v.key.Equal(localStoreNodeTombstoneSuffix) {
return v.name + "/" + nodeTombstoneKeyPrint(
append(roachpb.Key(nil), append(localStorePrefix, key...)...),
append(roachpb.Key(nil), append(LocalStorePrefix, key...)...),
)
} else if v.key.Equal(localStoreCachedSettingsSuffix) {
return v.name + "/" + cachedSettingsKeyPrint(
append(roachpb.Key(nil), append(localStorePrefix, key...)...),
append(roachpb.Key(nil), append(LocalStorePrefix, key...)...),
)
}
return v.name
Expand Down
67 changes: 60 additions & 7 deletions pkg/kv/kvserver/batcheval/cmd_end_transaction.go
Original file line number Diff line number Diff line change
Expand Up @@ -416,6 +416,56 @@ func IsEndTxnTriggeringRetryError(

const lockResolutionBatchSize = 500

// iterManager provides a storage.IterAndBuf appropriate for working with a
// span of keys that are either all local or all global keys, identified by
// the start key of the span, that is passed to getIterAndBuf. This is to deal
// with the constraint that a single MVCCIterator using
// MVCCKeyAndIntentsIterKind can either iterate over local keys or global
// keys, but not both. We don't wish to create a new iterator for each span,
// so iterManager lazily creates a new one when needed.
type iterManager struct {
reader storage.Reader
globalKeyUpperBound roachpb.Key
iterAndBuf storage.IterAndBuf

iter storage.MVCCIterator
isLocalIter bool
}

func (im *iterManager) getIterAndBuf(key roachpb.Key) storage.IterAndBuf {
isLocal := keys.IsLocal(key)
if im.iter != nil {
if im.isLocalIter == isLocal {
return im.iterAndBuf
}
im.iterAndBuf.SwitchIter(nil /* iter */)
im.iter.Close()
im.iter = nil
}
if isLocal {
im.iter = im.reader.NewMVCCIterator(
storage.MVCCKeyAndIntentsIterKind, storage.IterOptions{
UpperBound: keys.LocalMax,
})
im.isLocalIter = true
im.iterAndBuf.SwitchIter(im.iter)
} else {
im.iter = im.reader.NewMVCCIterator(
storage.MVCCKeyAndIntentsIterKind, storage.IterOptions{
UpperBound: im.globalKeyUpperBound,
})
im.isLocalIter = false
im.iterAndBuf.SwitchIter(im.iter)
}
return im.iterAndBuf
}

func (im *iterManager) Close() {
im.iterAndBuf.Cleanup()
im.iterAndBuf = storage.IterAndBuf{}
im.iter = nil
}

// resolveLocalLocks synchronously resolves any locks that are local to this
// range in the same batch and returns those lock spans. The remainder are
// collected and returned so that they can be handed off to asynchronous
Expand All @@ -439,11 +489,12 @@ func resolveLocalLocks(
desc = &mergeTrigger.LeftDesc
}

iter := readWriter.NewMVCCIterator(storage.MVCCKeyAndIntentsIterKind, storage.IterOptions{
UpperBound: desc.EndKey.AsRawKey(),
})
iterAndBuf := storage.GetBufUsingIter(iter)
defer iterAndBuf.Cleanup()
iterManager := &iterManager{
reader: readWriter,
globalKeyUpperBound: desc.EndKey.AsRawKey(),
iterAndBuf: storage.GetBufUsingIter(nil),
}
defer iterManager.Close()

var resolveAllowance int64 = lockResolutionBatchSize
if args.InternalCommitTrigger != nil {
Expand All @@ -466,7 +517,8 @@ func resolveLocalLocks(
return nil
}
resolveMS := ms
ok, err := storage.MVCCResolveWriteIntentUsingIter(ctx, readWriter, iterAndBuf, resolveMS, update)
ok, err := storage.MVCCResolveWriteIntentUsingIter(
ctx, readWriter, iterManager.getIterAndBuf(span.Key), resolveMS, update)
if err != nil {
return err
}
Expand All @@ -483,7 +535,8 @@ func resolveLocalLocks(
externalLocks = append(externalLocks, outSpans...)
if inSpan != nil {
update.Span = *inSpan
num, resumeSpan, err := storage.MVCCResolveWriteIntentRangeUsingIter(ctx, readWriter, iterAndBuf, ms, update, resolveAllowance)
num, resumeSpan, err := storage.MVCCResolveWriteIntentRangeUsingIter(
ctx, readWriter, iterManager.getIterAndBuf(update.Span.Key), ms, update, resolveAllowance)
if err != nil {
return err
}
Expand Down
Loading

0 comments on commit 83fbe72

Please sign in to comment.