IngestExternalFile with either DeleteRange support or an "ingest_in_front" option #3391
Instead of adding DeleteRange to SstFileWriter, why don't you just issue DeleteRange to the DB? @ajkr any comment on adding DeleteRange support to SstFileWriter?
The key is that we want it all to be atomic. If we just perform a DeleteRange on the DB first, then we run into the issues of needing extra locking and needing a recovery mechanism in case of crashes between the two operations.
The solution we have elsewhere is to set the file's end key to the range deletion's user key with max seqnum. Max seqnum ensures it falls before the next file's first internal key (see rocksdb/db/range_del_aggregator.cc, lines 498 to 514 at 1bdb44d).
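For readers less familiar with internal key ordering, here is a rough sketch of that trick, assuming RocksDB's internal helpers from db/dbformat.h (an illustration, not the code at the cited lines): internal keys sort by user key ascending and then by sequence number descending, so pairing the tombstone's end user key with kMaxSequenceNumber yields the smallest possible internal key for that user key, which is guaranteed to sort before whatever real entry the next file starts with.

```cpp
// Illustrative only: extend a file's largest key to a range tombstone's end
// user key, paired with kMaxSequenceNumber so it still sorts before the next
// file's first internal key (internal keys order by user key ascending, then
// sequence number descending). Uses RocksDB's internal db/dbformat.h helpers.
#include "db/dbformat.h"

rocksdb::InternalKey ExtendedLargestKey(const rocksdb::Slice& tombstone_end_user_key) {
  return rocksdb::InternalKey(tombstone_end_user_key, rocksdb::kMaxSequenceNumber,
                              rocksdb::kTypeRangeDeletion);
}
```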
For precedence, good find that point writes override range deletions with the same seqnum. It wasn't written like that intentionally, as previously there was no way for the two to have the same seqnum. I think you can rely on this - maybe make sure to have a good test case though :)
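One possible shape for that test case (a sketch only, and it assumes the SstFileWriter::DeleteRange method that #3778 eventually added): put a range tombstone and a point write for the same user key into one ingested file, so both end up with the same sequence number, and check that the point write remains visible.

```cpp
// Sketch of the suggested test: one ingested file carries both a range
// tombstone over ["a", "z") and a Put at "k". After ingestion they share a
// sequence number; the behavior being relied on is that the point write wins.
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

void PointWriteBeatsSameSeqnoTombstone(rocksdb::DB* db, const rocksdb::Options& options) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  assert(writer.Open("/tmp/same_seqno.sst").ok());
  assert(writer.DeleteRange("a", "z").ok());   // tombstone covering "k"
  assert(writer.Put("k", "survives").ok());    // point write inside the range
  assert(writer.Finish().ok());
  assert(db->IngestExternalFile({"/tmp/same_seqno.sst"},
                                rocksdb::IngestExternalFileOptions())
             .ok());

  std::string value;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "k", &value);
  assert(s.ok() && value == "survives");  // expected: the point write is visible
}
```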
Summary: This is a small amount of general cleanup I made while experimenting with facebook#3391. Closes facebook#3392 Differential Revision: D6788365 Pulled By: yiwu-arbug fbshipit-source-id: 2716e5aabd5424a4dfdaa954361a62c8eb721ae2
@ajkr I'm beginning to revisit this issue. I've run into a few questions that I'm hoping you can answer.
I also have one general concern. I had most of this change working, but I began noticing ingested SSTs with multiple range deletions having weird issues in tests. I tracked the issue to … This all pointed me to what looks like a memory management issue. When iterating over the …, I confirmed that the following diff (a poor man's memory arena bound to the desired object lifetime) fixes the issue:
Of course, this isn't what we'll actually want to do, but I'm wondering if you have any advice on how I can fix this issue. I'm also not convinced the problem is local to just my change, although it doesn't sound like you've hit the issue I'm seeing before. It's also possible this isn't an issue at all, as changes like #1739 indicate that there may be a lot more going on here than I'm aware of. Thanks in advance for the help!
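As a rough illustration only (not the diff referenced in the comment above), a "poor man's memory arena bound to the desired object lifetime" can look like the following: copy each key into storage owned by the long-lived object so that Slices handed out earlier can't dangle once the producing iterator goes away.

```cpp
// Rough illustration only (not the diff referenced above): a "poor man's
// memory arena" that copies keys into storage owned by the long-lived object,
// so Slices returned from it stay valid even after the source iterator is
// destroyed. std::deque is used because growing it never invalidates
// references to existing elements.
#include <deque>
#include <string>

#include "rocksdb/slice.h"

class PinnedKeys {
 public:
  // Copy `key` into storage tied to this object's lifetime and return a Slice
  // that remains valid until this object is destroyed.
  rocksdb::Slice Pin(const rocksdb::Slice& key) {
    pinned_.emplace_back(key.data(), key.size());
    return rocksdb::Slice(pinned_.back());
  }

 private:
  std::deque<std::string> pinned_;
};
```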
Friendly ping @ajkr.
Hi @nvanbenschoten, sorry for the long delay. Let me answer your questions first.
Regarding the corruption you saw, we do intend to pin the iterators' underlying key-values for the RangeDelAggregator's lifetime (see rocksdb/db/range_del_aggregator.cc, line 215 at bc0da4b).
There could be a bug though. If you provide an OPTIONS file I can help debug.
Hi @ajkr, thanks for the response! I've gone ahead and opened #3778, which incorporates your answers. For question 4, I do believe that the assigned seqno of an ingested SST is supposed to be unique. This is what … Regarding the corruption, I've included an extra commit in my PR. Without this commit, …
Fixes facebook#3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range.
…k#3778) Summary: Fixes facebook#3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: facebook#3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
I began implementing this myself and got far enough to realize that it should be discussed in an issue beforehand.
I'll start with the motivation. As described in this comment, there are instances in CockroachDB where we'd like to be able to ingest a series of SST files atomically and have the ingestion clear out all overlapping data. Currently, this is not possible because `SstFileWriter` only supports `Put`, `Merge`, and `Delete` operations. Given a range of keys that we'd like to completely replace using `IngestExternalFile`, our only real option at the moment is to lock the range of keys using some external locking mechanism, iterate over the key range and clear out each key, call `IngestExternalFile`, then unlock the range. The need for external locking while we iterate over the entire range so that new keys aren't written underneath us isn't ideal. Even worse, doing this two-step process means that we're susceptible to state corruption in the presence of untimely crashes. We could avoid this second issue by creating new SST files with deletion tombstones for all existing keys and copies of all operations from the original set of SST files (keeping everything ordered, of course), but this still requires the external locking and a scan over the entire range, and now means that we're doubling the number of SST files.
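To make the problem concrete, here is a rough sketch of that non-atomic two-step dance (an assumed shape, not actual CockroachDB code), with `DB::DeleteRange` standing in for the "iterate the range and clear each key" step and the external lock left as a placeholder:

```cpp
// Rough sketch of the non-atomic workaround described above (assumed shape,
// not actual CockroachDB code). DB::DeleteRange stands in for the "iterate the
// range and clear each key" step, and the external lock is a placeholder for
// whatever locking the application provides. A crash between steps 2 and 3
// loses the old data without installing the new data.
#include <string>
#include <vector>

#include "rocksdb/db.h"

rocksdb::Status ReplaceRangeNonAtomically(rocksdb::DB* db,
                                          const std::string& begin,
                                          const std::string& end,
                                          const std::vector<std::string>& sst_files) {
  // 1. Acquire an application-level lock on [begin, end) so no new writes land
  //    underneath us (placeholder; RocksDB provides no such lock here).
  // range_lock.Lock(begin, end);

  // 2. Clear the existing keys in the range.
  rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(),
                                      db->DefaultColumnFamily(), begin, end);
  if (!s.ok()) return s;

  // 3. Ingest the replacement files. Steps 2 and 3 are not atomic.
  s = db->IngestExternalFile(sst_files, rocksdb::IngestExternalFileOptions());

  // 4. Release the application-level lock.
  // range_lock.Unlock(begin, end);
  return s;
}
```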
Ideally, we'd be able to specify the range of keys that should be subsumed by each new SST file, so that a single call to `IngestExternalFile` would atomically ingest all keys in the file and delete any overlapping keys. There are two ways I have thought about allowing this.
First, I looked into adding `DeleteRange` support to `SstFileWriter`. While doing so, I came to the conclusion that this would only be useful if the `DeleteRange` was able to overlap other keys but be given a "lower" precedence. If this was not the case, then any user trying to do what we're doing would need to add a `DeleteRange` between every pair of subsequent keys, which I expect would be bad for a number of reasons. Luckily, it looks like `BlockBasedTableBuilder` already stores `DeleteRange` operations in their own meta block, so the requirement to add all keys to the `SstFileWriter` in order should not be an issue. It also looks like `RangeDelAggregator` already gives priority to point operations with the same sequence number as a `DeleteRange` operation, so we will get the behavior we want without any extra changes (please note, I'm new to the RocksDB codebase, so it's likely some of these assumptions are misguided). With these issues out of the way, most of the work here has to do with correctly setting the `ExternalSstFileInfo` and handling it correctly in `ExternalSstFileIngestionJob`. The biggest problem that sticks out to me is that the key range spanned by `ExternalSstFileInfo` is `[smallest_key, largest_key]` (inclusive upper bound). This doesn't work well with `DeleteRange`'s `[start, end)` bounds. I'm guessing this has already been solved somewhere else, but solving it here will require some work.
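As a sketch of how this first approach would look from the application side (assuming the `SstFileWriter::DeleteRange` method that #3778 eventually added; paths and keys are placeholders), the file carries both the tombstone and the replacement keys, so one `IngestExternalFile` call swaps them in together:

```cpp
// Sketch of the first approach in use (assumes the SstFileWriter::DeleteRange
// method later added in facebook#3778; paths and keys are placeholders). The
// file carries a range tombstone plus the replacement keys, so a single
// IngestExternalFile call replaces the range atomically.
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

rocksdb::Status BuildAndIngestReplacement(rocksdb::DB* db, const rocksdb::Options& options) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open("/tmp/replace_range.sst");
  if (!s.ok()) return s;

  // One tombstone covering the whole range being replaced...
  s = writer.DeleteRange("key000", "key999");
  if (!s.ok()) return s;
  // ...plus the new point writes, added in key order as usual.
  s = writer.Put("key100", "value100");
  if (!s.ok()) return s;
  s = writer.Finish();
  if (!s.ok()) return s;

  // Old keys in ["key000", "key999") and the new keys are swapped in together.
  return db->IngestExternalFile({"/tmp/replace_range.sst"},
                                rocksdb::IngestExternalFileOptions());
}
```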
This prototype got me thinking about alternative approaches. The general idea here seems useful enough that it might justify first-class support. For instance, I can imagine an `ingest_in_front` `IngestExternalFileOptions` option that parallels the current `ingest_behind` option. This could perform the task of ensuring that all keys that overlap an ingested SST's bounds are deleted. By handling this in `ExternalSstFileIngestionJob`, I think this could be done a lot more efficiently than by relying on `DeleteRange` alone. For instance, with this option, we could avoid flushing any parts of the memtable that overlap the SST. `ExternalSstFileIngestionJob` could also employ `DeleteFilesInRange` to make most of the deletes more efficient.
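For illustration only, the proposed option would sit next to the existing flag roughly like this (`ingest_in_front` is hypothetical and does not exist in `IngestExternalFileOptions`; only `ingest_behind` is real):

```cpp
// Hypothetical sketch of the proposal. Only `ingest_behind` is a real
// IngestExternalFileOptions field; `ingest_in_front` is the option imagined
// above and is shown commented out.
#include "rocksdb/options.h"

rocksdb::IngestExternalFileOptions MakeReplaceRangeOptions() {
  rocksdb::IngestExternalFileOptions ifo;
  // Existing flag: ingest the file beneath all current data.
  // ifo.ingest_behind = true;
  // Proposed flag: delete every existing key the file's range overlaps, then
  // ingest the file "in front" of the remaining data.
  // ifo.ingest_in_front = true;  // hypothetical; not a real RocksDB option
  return ifo;
}
```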
I'd appreciate any input or advice people more familiar with this can give. @ajkr and @IslamAbdelRahman, it looks like you two are the experts here :)