kv/bulk: chunk SSTs to row boundaries #79020

dt · 2022-03-30T04:08:23Z

Previously an sstable might end due to size at /table/i/rowX/col/Y if
some, but not all, column families for rowX fit in that file. This is
OK as far as KV and SQL are concerned: once we add the next file, which
starts at rowX/colZ, the row is complete from the point of view of any
scan. However, if after adding this file we determine that we need to
split before adding the next file, that split must be at a row
boundary, so it will be at rowX, not rowX/colZ. This too is OK, but it
has the slight downside that when we scatter the new RHS, which starts
at rowX, we have to move the colY family KV we just added in the prior
file. While that is typically a trivial amount of data, it does make
the RHS non-empty and thus incurs some cost to move.
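
To make the downside concrete, here is a minimal sketch using plain
strings in place of real encoded keys. The key layout
"/table/<id>/<row>/col/<family>" and the helpers rowBoundary and
keysOnRHS are hypothetical and exist only for this illustration; they
are not CockroachDB APIs.

```go
package main

import (
	"fmt"
	"strings"
)

// rowBoundary truncates a key to its row prefix, assuming the illustrative
// layout "/table/<id>/<row>/col/<family>" (real keys are encoded bytes).
func rowBoundary(key string) string {
	if i := strings.Index(key, "/col/"); i >= 0 {
		return key[:i]
	}
	return key
}

// keysOnRHS returns the already-ingested keys that sort at or after splitKey
// and would therefore sit on the right-hand side of a split at that key.
func keysOnRHS(ingested []string, splitKey string) []string {
	var rhs []string
	for _, k := range ingested {
		if k >= splitKey {
			rhs = append(rhs, k)
		}
	}
	return rhs
}

func main() {
	// The previous file ended mid-row: only one family of row2 fit in it.
	ingested := []string{
		"/table/1/row1/col/0",
		"/table/1/row1/col/1",
		"/table/1/row2/col/0",
	}
	nextFileStart := "/table/1/row2/col/1"

	// The split has to land on a row boundary, not on the next file's start key.
	split := rowBoundary(nextFileStart)
	fmt.Println("split at:", split)
	fmt.Println("already on RHS:", keysOnRHS(ingested, split))
}
```

In this toy model the row2 family written by the earlier file sorts
after the split key, so the RHS is not empty when it is scattered.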

This changes the size-based limit that triggers a file flush to wait for
the next row boundary after the size is exceeded, so that SST bounds now
also fall on row bounds, and thus on the bounds of any future range
split.
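
The flush rule described above can be sketched roughly as follows. This
is a simplified model, not the code in pkg/kv/bulk/sst_batcher.go: keys
are plain strings in an assumed "/table/<id>/<row>/col/<family>" layout,
and the names kv, rowPrefix and chunkAtRowBoundaries are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// kv is a simplified key/value pair; real keys are encoded bytes, not strings.
type kv struct {
	key   string
	value string
}

// rowPrefix returns the part of the key that identifies the row, assuming
// the illustrative layout "/table/<id>/<row>/col/<family>".
func rowPrefix(key string) string {
	if i := strings.Index(key, "/col/"); i >= 0 {
		return key[:i]
	}
	return key
}

// chunkAtRowBoundaries groups sorted KVs into files. Once a file has reached
// limit bytes, it is flushed only when the next key starts a new row, so
// every file ends on a row boundary.
func chunkAtRowBoundaries(kvs []kv, limit int) [][]kv {
	var files [][]kv
	var cur []kv
	size := 0
	for _, e := range kvs {
		overLimit := size >= limit && len(cur) > 0
		if overLimit && rowPrefix(e.key) != rowPrefix(cur[len(cur)-1].key) {
			files = append(files, cur)
			cur, size = nil, 0
		}
		cur = append(cur, e)
		size += len(e.key) + len(e.value)
	}
	if len(cur) > 0 {
		files = append(files, cur)
	}
	return files
}

func main() {
	kvs := []kv{
		{"/table/1/row1/col/0", "aaaa"},
		{"/table/1/row1/col/1", "bbbb"},
		{"/table/1/row2/col/0", "cccc"},
		{"/table/1/row2/col/1", "dddd"},
		{"/table/1/row3/col/0", "eeee"},
	}
	// With a 60-byte limit the size is exceeded partway through row2, but the
	// flush is deferred until row3 begins.
	for i, f := range chunkAtRowBoundaries(kvs, 60) {
		fmt.Printf("file %d: %s .. %s\n", i, f[0].key, f[len(f)-1].key)
	}
}
```

A file may therefore overshoot the size limit by up to the remainder of
one row, in exchange for every file boundary being usable as a split
key.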

This is particularly relevant in conjunction with #78218.

Release note: none.

cockroach-teamcity · 2022-03-30T04:08:34Z

This change is

stevendanna

LGTM.

Perhaps an interesting project at some point would be tooling that makes it easier to write unit tests for the nuances of the splitting and chunking logic, if for no other reason than to document the behaviour we currently expect.

pkg/kv/bulk/sst_batcher.go

dt · 2022-03-30T17:00:09Z

TFTR!

bors r+

craig · 2022-03-30T18:45:36Z

Build succeeded:

GitHub CI (Cockroach)

dt requested review from nvanbenschoten and adityamaru March 30, 2022 04:08

dt requested a review from a team as a code owner March 30, 2022 04:08

dt mentioned this pull request Mar 30, 2022

kv: scan empty right-hand side of split for stats #78218

Merged

stevendanna approved these changes Mar 30, 2022

dt force-pushed the split-row branch from 2598ea7 to f2abeda Compare March 30, 2022 12:32

dt force-pushed the split-row branch from f2abeda to 060c0ef Compare March 30, 2022 15:02

craig bot merged commit 4bce84e into cockroachdb:master Mar 30, 2022

dt deleted the split-row branch March 30, 2022 18:57
