libroach: Migrate IncrementalIterator logic to C++ #37496
Conversation
Also tagging @ajkr on this as he's familiar with both C++ and RocksDB.
Force-pushed d2ee230 to e7e6203
Force-pushed 997083d to f508ca2
var rows bulk.RowCounter
// TODO(dan): Move all this iteration into cpp to avoid the cgo calls.
// TODO(dan): Consider checking ctx periodically during the MVCCIterate call.
iter := engineccl.NewMVCCIncrementalIterator(batch, engineccl.IterOptions{
Suggestion: you could pull this code into a Go function that takes the same args as your new C++ function (and keep mvcc.go around for now). Then you could write a unit test that calls both functions on the same engine and compares the results?
Done, but still investigating why it is taking the test so long to run.
Done.
Reviewed 3 of 3 files at r1, 12 of 14 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27 and @ajkr)
c-deps/libroach/db.h, line 104 at r2 (raw file):
// preferred function for Go callers.
DBStatus DBSstFileWriterAddRaw(DBSstFileWriter* fw, const ::rocksdb::Slice key, const ::rocksdb::Slice val);
nit: newline at end of file
c-deps/libroach/db.cc, line 950 at r2 (raw file):
// Skip tombstone (len=0) records when start time is zero (non-incremental)
// and we are not exporting all versions.
if (!export_all_revisions && iter.value().size() == 0 && start.wall_time == 0 &&
2¢: I might pull the three conditions that go into "are we skipping deletes?" into a bool before the loop, so we only check "are we skipping deletes && is this a delete"
c-deps/libroach/db.cc, line 974 at r2 (raw file):
  return res;
}
nit: newline at end of file
c-deps/libroach/incremental_iterator.h, line 1 at r2 (raw file):
// Copyright 2018 The Cockroach Authors.
nit: 2019
c-deps/libroach/incremental_iterator.h, line 52 at r2 (raw file):
  cockroach::util::hlc::LegacyTimestamp start_time;
  cockroach::util::hlc::LegacyTimestamp end_time;
};
nit: newline at end of file
c-deps/libroach/incremental_iterator.cc, line 1 at r2 (raw file):
// Copyright 2018 The Cockroach Authors.
nit: 2019
c-deps/libroach/incremental_iterator.cc, line 223 at r2 (raw file):
const rocksdb::Slice DBIncrementalIterator::key() { return iter.get()->rep->key(); }
const rocksdb::Slice DBIncrementalIterator::value() { return iter.get()->rep->value(); }
nit: newline at end of file
pkg/storage/engine/rocksdb.go, line 3201 at r1 (raw file):
func ExportToSst(
	ctx context.Context, e Reader, start, end MVCCKey, exportAllRevisions bool, io IterOptions,
) ([]byte, C.int64_t, error) {
I'd return an int64 instead of the C type.
pkg/storage/engine/rocksdb.go, line 3213 at r1 (raw file):
var intentErr C.DBString
err := statusToError(C.DBExportToSst(goToCKey(start), goToCKey(end), C.bool(exportAllRevisions),
I think you might have inverted the order of the commits?
pkg/storage/engine/rocksdb.go, line 3223 at r1 (raw file):
}
return nil, dataSize, &e
I'd return 0 instead of the undefined value (right?) in dataSize.
Force-pushed 1d49170 to d8df6cd
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @dt)
c-deps/libroach/db.cc, line 950 at r2 (raw file):
Previously, dt (David Taylor) wrote…
2¢: I might pull the three conditions that go into "are we skipping deletes?" into a bool before the loop, so we only check "are we skipping deletes && is this a delete"
Done.
c-deps/libroach/db.cc, line 974 at r2 (raw file):
Previously, dt (David Taylor) wrote…
nit: newline at end of file
Done.
c-deps/libroach/incremental_iterator.h, line 1 at r2 (raw file):
Previously, dt (David Taylor) wrote…
nit: 2019
Done.
c-deps/libroach/incremental_iterator.h, line 52 at r2 (raw file):
Previously, dt (David Taylor) wrote…
nit: newline at end of file
Done.
c-deps/libroach/incremental_iterator.cc, line 1 at r2 (raw file):
Previously, dt (David Taylor) wrote…
nit: 2019
Done.
c-deps/libroach/incremental_iterator.cc, line 223 at r2 (raw file):
Previously, dt (David Taylor) wrote…
nit: newline at end of file
Done.
pkg/storage/engine/rocksdb.go, line 3201 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I'd return an int64 instead of the C type.
Done.
pkg/storage/engine/rocksdb.go, line 3213 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I think you might have inverted the order of the commits?
Done.
pkg/storage/engine/rocksdb.go, line 3223 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I'd return 0 instead of the undefined value (right?) in dataSize.
Done.
Force-pushed 964369c to d6628ec
@ajkr thanks for waiting -- this is now ready for review!
Thanks, taking a look. Are there any benchmark results on export time/CPU with vs without this PR?
Hey @ajkr! Not yet, but I am working on compiling some results from roachprod and existing export/backup benchmarks. I shall update the PR as soon as I have compiled them.
LGTM, all my comments are minor.
A couple high-level questions.
- why does ExportToSst return a byte array rather than take a filename as an argument and export to that file?
- what limits the size of the SST created by ExportToSst?
Reviewed 1 of 18 files at r3, 14 of 14 files at r4, 4 of 4 files at r5, 1 of 1 files at r7, 1 of 1 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27 and @dt)
c-deps/libroach/db.h, line 103 at r8 (raw file):
// DBSstFileWriterAddRaw is used internally -- DBSstFileWriterAdd is the
// preferred function for Go callers.
DBStatus DBSstFileWriterAddRaw(DBSstFileWriter* fw, const ::rocksdb::Slice key,
Do we need this declaration in db.h? I think it's only used in db.cc, so we should be able to declare and define it in an anonymous namespace there. (Similar to DBIterGetState(), for example.)
c-deps/libroach/db.cc, line 939 at r8 (raw file):
  DBSstFileWriterClose(writer);
  return state.status;
} else if (!state.valid || kComparator.Compare(iter.key(), EncodeKey(end)) >= 0) {
We could avoid an allocation/memcpy on each iteration by getting the result of EncodeKey(end) before the for-loop. I don't know whether it'll make a noticeable difference, but typically I have seen memcpy/malloc/free near the top of CPU profiles for other range scan workloads.
c-deps/libroach/incremental_iterator.h, line 28 at r8 (raw file):
DBString* write_intent);
~DBIncrementalIterator();
void advanceKey();
can advanceKey() be private?
c-deps/libroach/incremental_iterator.cc, line 52 at r8 (raw file):
assert(!EmptyTimestamp(opts.max_timestamp_hint));
DBIterOptions nontimebound_opts = DBIterOptions();
nontimebound_opts.upper_bound = opts.upper_bound;
I'm kind of curious about the relationship between opts.upper_bound and this->end. Will they be the same?
pkg/ccl/storageccl/export.go, line 164 at r8 (raw file):
}
rows.BulkOpSummary.DataSize += int64(len(it.UnsafeKey().Key)) + int64(len(it.UnsafeValue()))
Is the result being computed here different from what's stored in the variable dataSize?
In some cases we return the content in the response rather than write a file at all, and even when we're writing a "file" it is usually to remote, cloud storage, not the local file system. The cloud storage accessors are in Go, so passing the content to write back up and letting Go decide what to do with it seems to make sense. FWIW, prior to this change, we were also passing the same slice back from C++ to Go when we called
Range size.
Force-pushed d6628ec to 24765a1
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @dt)
c-deps/libroach/db.h, line 103 at r8 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Do we need this declaration in db.h? I think it's only used in db.cc, so we should be able to declare and define it in an anonymous namespace there. (Similar to DBIterGetState(), for example.)
Done.
c-deps/libroach/db.cc, line 939 at r8 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
We could avoid an allocation/memcpy on each iteration by getting the result of EncodeKey(end) before the for-loop. I don't know whether it'll make a noticeable difference, but typically I have seen memcpy/malloc/free near the top of CPU profiles for other range scan workloads.
makes sense, done.
c-deps/libroach/incremental_iterator.h, line 28 at r8 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
can advanceKey() be private?
yes it can, done.
c-deps/libroach/incremental_iterator.cc, line 52 at r8 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
I'm kind of curious about the relationship between opts.upper_bound and this->end. Will they be the same?
opts.upper_bound is a DBKey with empty wall_time and logical fields (i.e. it only enforces a key upper bound). this->end has those fields populated, and is used in advanceKey() for the purpose of checking that we are within the required time bounds.
pkg/ccl/storageccl/export.go, line 164 at r8 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Is the result being computed here different from what's stored in the variable dataSize?
So using dataSize instead of recomputing it here was causing existing backup tests to fail, as the computed bytes were more than the expected bytes. I initially thought this could be because we are iterating more keys than we should on the C++ side, but the number of rows and the SSTable data itself matched up to what was expected.
I use dataSize to ensure that we are in fact exporting some data (otherwise the levelDB iterator collecting stats after the cgo call panics) and then recompute the actual rows.BulkOpSummary.DataSize as was the case in the previous implementation.
Eventually, we would like to move the stats collection (including RowCounter) to C++, but decided it was not in the scope of this change.
Reviewed 1 of 20 files at r9, 18 of 19 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27 and @dt)
c-deps/libroach/db.cc, line 937 at r10 (raw file):
bool skip_current_key_versions = !export_all_revisions;
DBIterState state;
const rocksdb::Slice end_key = EncodeKey(end);
I think we need to store the string result, otherwise the memory pointed to by the Slice can be freed.
pkg/ccl/storageccl/export.go, line 164 at r8 (raw file):
Previously, adityamaru27 (Aditya Maru) wrote…
So using dataSize instead of recomputing it here was causing existing backup tests to fail, as the computed bytes were more than the expected bytes. I initially thought this could be because we are iterating more keys than we should on the C++ side, but the number of rows and the SSTable data itself matched up to what was expected.
I use dataSize to ensure that we are in fact exporting some data (otherwise the levelDB iterator collecting stats after the cgo call panics) and then recompute the actual rows.BulkOpSummary.DataSize as was the case in the previous implementation.
Eventually, we would like to move the stats collection (including RowCounter) to C++, but decided it was not in the scope of this change.
Got it, thanks for the explanation.
Force-pushed 24765a1 to 6418148
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27, @ajkr, and @dt)
c-deps/libroach/db.cc, line 937 at r10 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
I think we need to store the string result, otherwise the memory pointed to by the Slice can be freed.
Thanks for this catch! I was scratching my head as to why the tests were failing with these changes.
Force-pushed 6418148 to 84b0d2b
@dt changed the headers of the newly added files to BSL. Will merge once you have had a quick look!
Force-pushed 84b0d2b to 0bc1776
The export logic requires access to the C.DBEngine of an engine.Reader. In the case of a spanSetReader this requires us to recursively access the engine.Reader member until we reach either a RocksDB or rocksDBReadOnly engine. A type switch construct is deemed necessary in storageccl/export.go, requiring the SpanSetReadWriter type to be exported.

Note: lint fails if we keep the name SpanSetReadWriter after exporting, as other packages would have to access it as spanset.SpanSetReadWriter, which introduces a "stutter". It has been renamed to ReadWriter.

Release note: None
All of the IncrementalIterator logic was previously written in Go. This change migrates all the required implementations to C++ and allows the DBExportToSst method to use this iterator to export the required keys to an SSTable.

Release note: Performance improvement to the BACKUP process.
Export was previously executed using an MVCCIncrementalIterator where the logic handling iteration over the diff of the key range [startKey, endKey) and time range (startTime, endTime] was written in Go. This iteration involved making a cgo call to find every key, along with another cgo call for writing each key to the sstable.

This migration resolves the aforementioned performance bottleneck (cockroachdb#18884), by pushing all the required logic to C++ and exposing a single export method.

Release note: Performance improvement to the BACKUP process.
Force-pushed 0bc1776 to a82cb8c
Added a test which generates random KVPs and timestamps, subsequently attempting to export the data between provided key and time ranges. Correctness of the new export logic is checked against the old implementation which used the now non-existent MVCCIncrementalIterator.

Release note: None
Force-pushed a82cb8c to 959f577
bors r=dt
🔒 Permission denied. Existing reviewers: click here to make adityamaru27 a reviewer
bors r=dt
37496: libroach: Migrate IncrementalIterator logic to C++ r=dt a=adityamaru27

Export was previously executed using an `MVCCIncrementalIterator` where the logic handling iteration over the diff of the key range `[startKey, endKey)` and time range `(startTime, endTime]` was written in Go. This iteration involved making a cgo call to find every key, along with another cgo call for writing each key to the sstable.

This migration resolves the aforementioned performance bottleneck (#18884), by pushing all the required logic to C++ and exposing a single export method.

Based on a performance benchmark by running BACKUP on a tpcc database with 1000 warehouses we observe the following:
- Over an average of 3 runs we see a 1.1x improvement in time performance. While the original binary took ~32m04s the changed implementation took ~28m55s. This is due to the elimination of a cgo call per key.

Co-authored-by: Aditya Maru <[email protected]>
Build succeeded
Export was previously executed using an MVCCIncrementalIterator where the logic handling iteration over the diff of the key range [startKey, endKey) and time range (startTime, endTime] was written in Go. This iteration involved making a cgo call to find every key, along with another cgo call for writing each key to the sstable.

This migration resolves the aforementioned performance bottleneck (#18884), by pushing all the required logic to C++ and exposing a single export method.

Based on a performance benchmark by running BACKUP on a tpcc database with 1000 warehouses we observe the following: