engine: find split keys in the first range of a partition
MVCCFindSplitKey would previously fail to find any split keys in the
first range of a partition. As a result, partitioned tables have been
observed with multi-gigabyte ranges. This commit fixes the bug.

Specifically, MVCCFindSplitKey was assuming that the start key of a
range within a table was also the row prefix for the first row of data
in the range. This does not hold true for the first range of a table or
a partition of a table--that range begins at, for example, /Table/51,
while the row begins at /Table/51/1/aardvark. The old code had a special
case for the first range in a table, but not for the first range in a
partition. (It predates partitioning.)

Remove the need for special casing by actually looking in RocksDB to
determine the row prefix for the first row of data rather than
attempting to derive it from the range start key. This properly handles
partitioning and is robust against future changes to range split
boundaries.

See the comment within for more details on the approach.

Release note (bug fix): Ranges in partitioned tables now properly split
to respect their configured maximum size.
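
The difference between the old and new derivation can be sketched with a toy model. This is a minimal illustration, not the code in mvcc.go: keys are plain strings such as /Table/51/1/a/a/0 (table, index, row values, column family), and oldMinSplitKey, newMinSplitKey, rowPrefix, and prefixEnd are hypothetical helpers that only approximate what keys.DecodeTablePrefix, keys.EnsureSafeSplitKey, and PrefixEnd do on real encoded keys.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixEnd returns a toy key that sorts after every key starting with
// prefix, by incrementing the final byte (assumed to be below 0xff).
func prefixEnd(prefix string) string {
	b := []byte(prefix)
	b[len(b)-1]++
	return string(b)
}

// rowPrefix drops the trailing "/<family>" segment from a toy row key.
func rowPrefix(key string) string {
	return key[:strings.LastIndex(key, "/")]
}

// oldMinSplitKey models the previous behavior: the minimum split key is
// derived from the range start key alone, with a special case for a start
// key that holds only a table ID.
func oldMinSplitKey(rangeStart string) string {
	rest := strings.TrimPrefix(rangeStart, "/Table/")
	if !strings.Contains(rest, "/") {
		return rangeStart // first range of a table: no restriction
	}
	return prefixEnd(rangeStart)
}

// newMinSplitKey models the fix: seek to the first key actually stored in
// the range and advance past the row that key belongs to.
func newMinSplitKey(sortedKeys []string, rangeStart string) (string, bool) {
	i := sort.SearchStrings(sortedKeys, rangeStart)
	if i == len(sortedKeys) {
		return "", false // empty range: nothing to split
	}
	return prefixEnd(rowPrefix(sortedKeys[i])), true
}

func main() {
	keys := []string{
		"/Table/51/1/a/a/0",
		"/Table/51/1/a/b/0",
		"/Table/51/1/a/c/0",
	}
	// First range of a partition on the first column with value "a".
	start := "/Table/51/1/a"

	fmt.Println("old:", oldMinSplitKey(start))
	minKey, ok := newMinSplitKey(keys, start)
	fmt.Println("new:", minKey, ok)
}
```

In this model the old derivation yields /Table/51/1/b, which sorts after every key in the partition, so no split key can ever qualify. The new derivation yields /Table/51/1/a/b, which only excludes the first row.
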
benesch committed Apr 18, 2018
1 parent 5967a0e commit cb1b6d7
Showing 2 changed files with 122 additions and 50 deletions.
63 changes: 54 additions & 9 deletions pkg/storage/engine/mvcc.go
@@ -2531,17 +2531,61 @@ func MVCCFindSplitKey(
it := engine.NewIterator(IterOptions{})
defer it.Close()

// We must never return a split key that falls within a table row. (Rows in
// tables with multiple column families are comprised of multiple keys, one
// key per column family.)
//
// Managing this is complicated: the logic for picking a split key that
// creates ranges of the right size lives in C++, while the logic for
// determining whether a key falls within a table row lives in Go.
//
// Most of the time, we can let C++ pick whatever key it wants. If it picks a
// key in the middle of a row, we simply rewind the key to the start of the
// row. This is handled by keys.EnsureSafeSplitKey.
//
// If, however, that first row in the range is so large that it exceeds the
// range size threshold on its own, and that row is comprised of multiple
// column families, we have a problem. C++ will hand us a key in the middle of
// that row, keys.EnsureSafeSplitKey will rewind the key to the beginning of
// the row, and... we'll end up with what's likely to be the start key of the
// range. The higher layers of the stack will take this to mean that no splits
// are required, when in fact the range is desperately in need of a split.
//
// Note that the first range of a table or a partition of a table does not
// start on a row boundary and so we have a slightly different problem.
// Instead of not splitting the range at all, we'll create a split at the
// start of the first row, resulting in an unnecessary empty range from the
// beginning of the table to the first row in the table (e.g., from /Table/51
// to /Table/51/1/aardvark...). The right-hand side of the split will then be
// susceptible to never being split as outlined above.
//
// To solve both of these problems, we find the end of the first row in Go,
// then plumb that to C++ as a "minimum split key." We're then guaranteed that
// the key C++ returns will not rewind to the start key of the range.
//
// On a related note, we find the first row by actually looking at the first
// key in the range. A previous version of this code attempted to derive
// the first row only by looking at `key`, the start key of the range; this
// was dangerous because partitioning can split off ranges that do not start
// at valid row keys. The keys that are present in the range, by contrast, are
// necessarily valid row keys.
it.Seek(MakeMVCCMetadataKey(key.AsRawKey()))
if ok, err := it.Valid(); err != nil {
return nil, err
} else if !ok {
return nil, nil
}
minSplitKey := key
// If this is a table, we can only split at row boundaries.
if remainder, _, err := keys.DecodeTablePrefix(roachpb.Key(key)); err == nil {
// If this is the first range containing a table, its start key won't
// actually contain any row information, just the table ID. We don't want
// to restrict splits on such tables, since key.PrefixEnd will just be the
// end of the table span.
if len(remainder) > 0 {
minSplitKey = roachpb.RKey(roachpb.Key(key).PrefixEnd())
if _, _, err := keys.DecodeTablePrefix(it.UnsafeKey().Key); err == nil {
// The first key in this range represents a row in a SQL table. Advance the
// minSplitKey past this row to avoid the problems described above.
firstRowKey, err := keys.EnsureSafeSplitKey(it.Key().Key)
if err != nil {
return nil, err
}
minSplitKey = roachpb.RKey(firstRowKey.PrefixEnd())
}

splitKey, err := it.FindSplitKey(
MakeMVCCMetadataKey(key.AsRawKey()),
MakeMVCCMetadataKey(endKey.AsRawKey()),
@@ -2551,7 +2595,8 @@ func MVCCFindSplitKey(
if err != nil {
return nil, err
}
// The family ID has been removed from this key, making it a valid split point.
// Ensure the key is a valid split point that does not fall in the middle of a
// SQL row by removing the column family ID, if any, from the end of the key.
return keys.EnsureSafeSplitKey(splitKey.Key)
}
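
The large-first-row problem described in the comment above can be sketched as follows. This is a toy model under stated assumptions, not the real search: every key is treated as the same size, pickSplitKey is a hypothetical stand-in for the C++ size-based search followed by keys.EnsureSafeSplitKey, and keys are plain strings rather than encoded table keys.

```go
package main

import (
	"fmt"
	"strings"
)

// rowPrefix drops the trailing "/<family>" segment from a toy row key,
// rewinding the key to the start of its row.
func rowPrefix(key string) string {
	return key[:strings.LastIndex(key, "/")]
}

// prefixEnd returns a toy key that sorts after every key starting with
// prefix, by incrementing the final byte (assumed to be below 0xff).
func prefixEnd(prefix string) string {
	b := []byte(prefix)
	b[len(b)-1]++
	return string(b)
}

// pickSplitKey returns the first key at or past the midpoint of the range
// that is not below minKey, rewound to the start of its row. It returns ""
// if no key qualifies.
func pickSplitKey(sortedKeys []string, minKey string) string {
	for i, k := range sortedKeys {
		if i >= len(sortedKeys)/2 && k >= minKey {
			return rowPrefix(k)
		}
	}
	return ""
}

func main() {
	// Row "a" has five column families and dominates the range.
	keys := []string{
		"/Table/51/1/a/1", "/Table/51/1/a/2", "/Table/51/1/a/3",
		"/Table/51/1/a/4", "/Table/51/1/a/5",
		"/Table/51/1/b/1", "/Table/51/1/c/1",
	}
	start := "/Table/51/1/a"

	// No minimum: the midpoint lands inside row "a" and rewinds to start.
	fmt.Println(pickSplitKey(keys, start))

	// Minimum set to the end of the first row: the split lands on row "b".
	fmt.Println(pickSplitKey(keys, prefixEnd(rowPrefix(keys[0]))))
}
```

Without the minimum, the chosen key rewinds to the range start key, which callers read as "no split needed". With the minimum set past the first row, the same data splits at the start of row "b".
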

109 changes: 68 additions & 41 deletions pkg/storage/engine/mvcc_test.go
@@ -3438,21 +3438,19 @@ func TestFindSplitKey(t *testing.T) {
func TestFindValidSplitKeys(t *testing.T) {
defer leaktest.AfterTest(t)()

const userID = keys.MaxReservedDescID + 1
// Manually creates rows corresponding to the schema:
// CREATE TABLE t (id STRING PRIMARY KEY, col INT)
encodeTableKey := func(rowVal string, colFam uint32) roachpb.Key {
tableKey := keys.MakeTablePrefix(keys.MaxReservedDescID + 1)
// CREATE TABLE t (id1 STRING, id2 STRING, ... PRIMARY KEY (id1, id2, ...))
tablePrefix := func(id uint32, rowVals ...string) roachpb.Key {
tableKey := keys.MakeTablePrefix(id)
rowKey := roachpb.Key(encoding.EncodeVarintAscending(append([]byte(nil), tableKey...), 1))
rowKey = encoding.EncodeStringAscending(encoding.EncodeVarintAscending(rowKey, 1), rowVal)
colKey := keys.MakeFamilyKey(append([]byte(nil), rowKey...), colFam)
return colKey
}
splitKeyFromTableKey := func(tableKey roachpb.Key) roachpb.Key {
splitKey, err := keys.EnsureSafeSplitKey(tableKey)
if err != nil {
t.Fatal(err)
for _, rowVal := range rowVals {
rowKey = encoding.EncodeStringAscending(rowKey, rowVal)
}
return splitKey
return rowKey
}
addColFam := func(rowKey roachpb.Key, colFam uint32) roachpb.Key {
return keys.MakeFamilyKey(append([]byte(nil), rowKey...), colFam)
}

testCases := []struct {
@@ -3474,11 +3472,12 @@ func TestFindValidSplitKeys(t *testing.T) {
// No part of the system span can be split.
{
keys: []roachpb.Key{
roachpb.Key(keys.MakeTablePrefix(1)),
roachpb.Key(keys.MakeTablePrefix(keys.MaxSystemConfigDescID)),
addColFam(tablePrefix(1, "some", "data"), 1),
addColFam(tablePrefix(keys.MaxSystemConfigDescID, "blah"), 1),
},
expSplit: nil,
expError: false,
rangeStart: keys.MakeTablePrefix(1),
expSplit: nil,
expError: false,
},
// Between meta1 and meta2, splits at meta2.
{
@@ -3565,47 +3564,75 @@ func TestFindValidSplitKeys(t *testing.T) {
// or return the start key of the range.
{
keys: []roachpb.Key{
encodeTableKey("a", 1),
encodeTableKey("a", 2),
encodeTableKey("a", 3),
encodeTableKey("a", 4),
encodeTableKey("a", 5),
encodeTableKey("b", 1),
encodeTableKey("c", 1),
addColFam(tablePrefix(userID, "a"), 1),
addColFam(tablePrefix(userID, "a"), 2),
addColFam(tablePrefix(userID, "a"), 3),
addColFam(tablePrefix(userID, "a"), 4),
addColFam(tablePrefix(userID, "a"), 5),
addColFam(tablePrefix(userID, "b"), 1),
addColFam(tablePrefix(userID, "c"), 1),
},
rangeStart: splitKeyFromTableKey(encodeTableKey("a", 1)),
expSplit: splitKeyFromTableKey(encodeTableKey("b", 1)),
rangeStart: tablePrefix(userID, "a"),
expSplit: tablePrefix(userID, "b"),
expError: false,
},
// More example table data. Make sure ranges at the start of a table can
// be split properly - this checks that the minSplitKey logic doesn't
// break for such ranges.
{
keys: []roachpb.Key{
encodeTableKey("a", 1),
encodeTableKey("b", 1),
encodeTableKey("c", 1),
encodeTableKey("d", 1),
addColFam(tablePrefix(userID, "a"), 1),
addColFam(tablePrefix(userID, "b"), 1),
addColFam(tablePrefix(userID, "c"), 1),
addColFam(tablePrefix(userID, "d"), 1),
},
rangeStart: keys.MakeTablePrefix(keys.MaxReservedDescID + 1),
expSplit: splitKeyFromTableKey(encodeTableKey("c", 1)),
rangeStart: keys.MakeTablePrefix(userID),
expSplit: tablePrefix(userID, "c"),
expError: false,
},
// More example table data. Make sure ranges at the start of a table can
// be split properly (even if "properly" means creating an empty LHS,
// splitting here will at least allow the resulting RHS to split again).
// be split properly even in the presence of a large first row.
{
keys: []roachpb.Key{
encodeTableKey("a", 1),
encodeTableKey("a", 2),
encodeTableKey("a", 3),
encodeTableKey("a", 4),
encodeTableKey("a", 5),
encodeTableKey("b", 1),
encodeTableKey("c", 1),
addColFam(tablePrefix(userID, "a"), 1),
addColFam(tablePrefix(userID, "a"), 2),
addColFam(tablePrefix(userID, "a"), 3),
addColFam(tablePrefix(userID, "a"), 4),
addColFam(tablePrefix(userID, "a"), 5),
addColFam(tablePrefix(userID, "b"), 1),
addColFam(tablePrefix(userID, "c"), 1),
},
rangeStart: keys.MakeTablePrefix(keys.MaxReservedDescID + 1),
expSplit: splitKeyFromTableKey(encodeTableKey("a", 1)),
expSplit: tablePrefix(userID, "b"),
expError: false,
},
// One partition where partition key is the first column. Checks that
// split logic is not confused by the special partition start key.
{
keys: []roachpb.Key{
addColFam(tablePrefix(userID, "a", "a"), 1),
addColFam(tablePrefix(userID, "a", "b"), 1),
addColFam(tablePrefix(userID, "a", "c"), 1),
addColFam(tablePrefix(userID, "a", "d"), 1),
},
rangeStart: tablePrefix(userID, "a"),
expSplit: tablePrefix(userID, "a", "c"),
expError: false,
},
// One partition with a large first row. Checks that our logic to avoid
// splitting in the middle of a row still applies.
{
keys: []roachpb.Key{
addColFam(tablePrefix(userID, "a", "a"), 1),
addColFam(tablePrefix(userID, "a", "a"), 2),
addColFam(tablePrefix(userID, "a", "a"), 3),
addColFam(tablePrefix(userID, "a", "a"), 4),
addColFam(tablePrefix(userID, "a", "a"), 5),
addColFam(tablePrefix(userID, "a", "b"), 1),
addColFam(tablePrefix(userID, "a", "c"), 1),
},
rangeStart: tablePrefix(userID, "a"),
expSplit: tablePrefix(userID, "a", "b"),
expError: false,
},
}