
IngestExternalFile with either DeleteRange support or an "ingest_in_front" option #3391

Closed
nvanbenschoten opened this issue Jan 20, 2018 · 8 comments

@nvanbenschoten
Contributor

I began implementing this myself and got far enough to realize that it should be discussed in an issue beforehand.

I'll start with the motivation. As described in this comment, there are instances in CockroachDB where we'd like to be able to ingest a series of SST files atomically and have the ingestion clear out all overlapping data. Currently, this is not possible because SstFileWriter only supports Put, Merge, and Delete operations. Given a range of keys that we'd like to completely replace using IngestExternalFile, our only real option at the moment is to lock the range of keys using some external locking mechanism, iterate over the key range and clear out each key, call IngestExternalFile, then unlock the range. The need for external locking while we iterate over the entire range so that new keys aren't written underneath us isn't ideal. Even worse, doing this two-step process means that we're susceptible to state corruption in the presence of untimely crashes. We could avoid this second issue by creating new SST files with deletion tombstones for all existing keys and copies of all operations from the original set of SST files (keeping everything ordered, of course), but this still requires the external locking and a scan over the entire range, and now means that we're doubling the number of SST files.
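For concreteness, a minimal sketch of that workaround in terms of the existing RocksDB API (the ReplaceRange helper, paths, and bounds are placeholders; the external locking and crash-recovery gap are exactly the problems described above):

#include <memory>
#include <string>
#include <vector>

#include <rocksdb/db.h>

// Non-atomic workaround: clear the key range, then ingest the new SSTs.
// The caller must hold an external lock on [begin, end) for the whole
// sequence, and a crash between the two steps leaves the range half-cleared.
rocksdb::Status ReplaceRange(rocksdb::DB* db, const rocksdb::Slice& begin,
                             const rocksdb::Slice& end,
                             const std::vector<std::string>& sst_files) {
  // Step 1: delete every existing key in [begin, end).
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(begin); it->Valid() && it->key().compare(end) < 0;
       it->Next()) {
    rocksdb::Status s = db->Delete(rocksdb::WriteOptions(), it->key());
    if (!s.ok()) return s;
  }
  // Step 2: ingest the prepared SST files.
  return db->IngestExternalFile(sst_files,
                                rocksdb::IngestExternalFileOptions());
}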

Ideally, we'd be able to specify the range of keys that should be subsumed by each new SST file, so that a single call to IngestExternalFile would atomically ingest all keys in the file and delete any overlapping keys. There are two ways I have thought about allowing this.

First, I looked into adding DeleteRange support to SstFileWriter. While doing so, I came to the conclusion that this would only be useful if the DeleteRange was able to overlap other keys but be given a "lower" precedence. If that were not the case then any user trying to do what we're doing would need to add a DeleteRange between every pair of subsequent keys, which I expect would be bad for a number of reasons. Luckily, it looks like BlockBasedTableBuilder already stores DeleteRange operations in their own meta block, so the requirement to add all keys to the SstFileWriter in order should not be an issue. It also looks like RangeDelAggregator already gives priority to point operations over a DeleteRange operation with the same sequence number, so we will get the behavior we want without any extra changes (please note, I'm new to the RocksDB codebase, so it's likely some of these assumptions are misguided).

With these issues out of the way, most of the work here has to do with correctly setting the ExternalSstFileInfo and handling it correctly in ExternalSstFileIngestionJob. The biggest problem that sticks out to me is that the key range spanned by ExternalSstFileInfo is [smallest_key, largest_key] (inclusive upper bound). This doesn't work well with DeleteRange's [start, end) bounds. I'm guessing this has already been solved somewhere else, but solving it here will require some work.
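As a rough sketch of how the SstFileWriter side might look from the caller's perspective (the DeleteRange method on SstFileWriter is the addition being proposed here, not something that exists yet; the path and keys are placeholders):

#include <string>

#include <rocksdb/env.h>
#include <rocksdb/options.h>
#include <rocksdb/sst_file_writer.h>

// Build an SST that replaces everything in ["k000", "k100"): one range
// tombstone covering the span plus the replacement point keys.
rocksdb::Status BuildReplacementSst(const std::string& path) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), rocksdb::Options());
  rocksdb::Status s = writer.Open(path);
  if (!s.ok()) return s;
  // Proposed method: because the range deletion would be stored in its own
  // meta block, it would not need to interleave with the point keys in
  // sorted order.
  s = writer.DeleteRange("k000", "k100");
  if (!s.ok()) return s;
  s = writer.Put("k010", "new-value-10");
  if (!s.ok()) return s;
  s = writer.Put("k020", "new-value-20");
  if (!s.ok()) return s;
  return writer.Finish();
}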

This prototype got me thinking about alternative approaches. The general idea here seems like a useful enough operation that it might justify first-class support. For instance, I can imagine an ingest_in_front IngestExternalFileOptions option that parallels the current ingest_behind option. This could perform the task of ensuring that all keys that overlap an ingested SST's bounds are deleted. By handling this in ExternalSstFileIngestionJob, I think this could be done a lot more efficiently than by relying on DeleteRange alone. For instance, with this option, we could avoid flushing any parts of the memtable that overlap the SST. ExternalSstFileIngestionJob could also employ DeleteFilesInRange to make most of the deletes more efficient.
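For illustration, usage of such an option might look like the following, mirroring how ingest_behind is set today (the ingest_in_front field is purely hypothetical and does not exist in IngestExternalFileOptions):

#include <string>

#include <rocksdb/db.h>

rocksdb::Status IngestInFront(rocksdb::DB* db, const std::string& sst_path) {
  rocksdb::IngestExternalFileOptions ifo;
  ifo.ingest_in_front = true;  // hypothetical flag, parallels ifo.ingest_behind
  // Would ingest the file and atomically delete every existing key falling
  // within the file's [smallest_key, largest_key] bounds.
  return db->IngestExternalFile({sst_path}, ifo);
}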

I'd appreciate any input or advice people more familiar with this can give. @ajkr and @IslamAbdelRahman, it looks like you two are the experts here :)

@yiwu-arbug
Contributor

Instead of adding DeleteRange to SstFileWriter, why don't you just issue DeleteRange to the DB?

@ajkr any comment on adding DeleteRange support to SstFileWriter?

@nvanbenschoten
Contributor Author

Instead of adding DeleteRange to SstFileWriter, why don't you just issue DeleteRange to the DB?

The key is that we want it all to be atomic. If we just perform a DeleteRange on the DB first then we run into the issues of needing extra locking and needing a recovery mechanism in case of crashes between the DeleteRange and the IngestExternalFile.

facebook-github-bot pushed a commit that referenced this issue Jan 23, 2018
Summary:
This is a small amount of general cleanup I made while experimenting with #3391.
Closes #3392

Differential Revision: D6788365

Pulled By: yiwu-arbug

fbshipit-source-id: 2716e5aabd5424a4dfdaa954361a62c8eb721ae2
@ajkr
Contributor

ajkr commented Jan 23, 2018

The solution we have elsewhere is to set the file's end key to the range deletion's user key with max seqnum. Max seqnum ensures it falls before the next file's first internal key:

// Pretend the largest key has the same user key as upper_bound (the
// min key in the following table or subcompaction) in order for files
// to appear key-space partitioned.
//
// Choose highest seqnum so this file's largest internal key comes
// before the next file's/subcompaction's smallest. The fake seqnum is
// OK because the read path's file-picking code only considers the user
// key portion.
//
// Note Seek() also creates InternalKey with (user_key,
// kMaxSequenceNumber), but with kTypeDeletion (0x7) instead of
// kTypeRangeDeletion (0xF), so the range tombstone comes before the
// Seek() key in InternalKey's ordering. So Seek() will look in the
// next file for the user key.
largest_candidate = InternalKey(*upper_bound, kMaxSequenceNumber,
                                kTypeRangeDeletion);

For precedence, good find that point writes override range deletions with the same seqnum. It wasn't written like that intentionally, as previously there was no way for the two to have the same seqnum. I think you can rely on this - maybe make sure to have a good test case though :)
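A minimal sketch of such a test case, assuming the SstFileWriter::DeleteRange method proposed above and an already-open db (status checks elided): ingest a file whose range tombstone covers one of its own point keys and verify the point key survives, since both entries share the file's global seqnum.

#include <cassert>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/sst_file_writer.h>

void TestPointKeyBeatsSameSeqnumRangeTombstone(rocksdb::DB* db) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), rocksdb::Options());
  writer.Open("/tmp/same_seqnum.sst");
  writer.DeleteRange("a", "z");  // covers "m" below
  writer.Put("m", "kept");
  writer.Finish();

  // All entries in the ingested file receive the same global seqnum.
  db->IngestExternalFile({"/tmp/same_seqnum.sst"},
                         rocksdb::IngestExternalFileOptions());

  std::string value;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "m", &value);
  assert(s.ok() && value == "kept");  // point write wins over the tombstone
}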

For ingest_in_front, I think it will be hard. When the ingested file is compacted it'll be rewritten with different bounds. Then we'll still need something like range deletion to track what range it deleted.

amytai pushed a commit to amytai/rocksdb that referenced this issue Mar 9, 2018
Summary:
This is a small amount of general cleanup I made while experimenting with facebook#3391.
Closes facebook#3392

Differential Revision: D6788365

Pulled By: yiwu-arbug

fbshipit-source-id: 2716e5aabd5424a4dfdaa954361a62c8eb721ae2
@nvanbenschoten
Contributor Author

@ajkr I'm beginning to revisit this issue. I've run into a few questions that I'm hoping you can answer.

  1. Is it valid for an sstable to only have range deletion tombstones and no normal keys?
  2. If an sstable's largest key is from a range deletion tombstone, should it indicate that its largest key is exclusive? This is related to what you posted above, but that looks like it only takes effect when upper_bound != nullptr and I think it's accomplishing a different goal.
  3. On that note, if we don't indicate that the largest key in the sst is exclusive, do we just accept that the sst may look like it overlaps with files that it doesn't actually overlap with? For instance, an sst with a range deletion from [1,10) should not overlap with one with a merge record at 10, but if we don't do anything special then it will look like it does. This affects the assigned global sequence number in ingested ssts and the level at which they can be ingested.
  4. Also kind of related - ExternalSstFileIngestionJob doesn't take any care to distinguish the value type of its smallest and largest keys here. Is this important?

I also have one general concern. I had most of this change working but I began noticing ingested ssts with multiple range deletions having weird issues in tests. I tracked the issue to RangeDelAggregator. The symptoms were that certain keys weren't being deleted and that I was occasionally hitting this assertion. I also noticed that tombstone_map wasn't always sorted, which was very odd because it's an std::multimap. I then determined that this only happened when an iterator with multiple tombstones was passed to AddTombstones.

This all pointed me to what looks like a memory management issue. When iterating over the InternalIterator in AddTombstones, we parse the iterator's key() and construct a RangeTombstone from it. This doesn't actually perform a deep copy of the underlying storage. The RangeTombstone is added to the multimap (with a lot of intermediate logic) and we then iterate to the Next() value. I don't think this is always safe, because InternalIterator::key documents that "the underlying storage for the returned slice is valid only until the next modification of the iterator", and we're not requiring that the iterator has IsKeyPinned() == true. So I think the underlying storage of the Slice added to tombstone_map was being overwritten, causing the corruption.

I confirmed that the following diff (a poor man's memory arena bound to the desired object lifetime) fixes the issue:

diff --git a/db/range_del_aggregator.cc b/db/range_del_aggregator.cc
index fdd847a7..9fc62f53 100644
--- a/db/range_del_aggregator.cc
+++ b/db/range_del_aggregator.cc
@@ -204,7 +204,8 @@ Status RangeDelAggregator::AddTombstones(
       first_iter = false;
     }
     ParsedInternalKey parsed_key;
-    if (!ParseInternalKey(input->key(), &parsed_key)) {
+    rep_->pinned_slices_.push_back(input->key().ToString());
+    if (!ParseInternalKey(rep_->pinned_slices_.back(), &parsed_key)) {
       return Status::Corruption("Unable to parse range tombstone InternalKey");
     }
     RangeTombstone tombstone(parsed_key, input->value());

Of course, this isn't what we'll actually want to do, but I'm wondering if you have any advice on how I can fix this issue. I'm also not convinced the problem is local to just my change, although it doesn't sound like you've hit the issue I'm seeing before. It's also possible this isn't an issue at all, as changes like #1739 indicate that there may be a lot more going on here than I'm aware of. Thanks in advance for the help!

@nvanbenschoten
Contributor Author

Friendly ping @ajkr.

@ajkr
Contributor

ajkr commented Apr 24, 2018

Hi @nvanbenschoten, sorry for the long delay. Let me answer your questions first.

  1. Yes, it is valid and supported. For example, during memtable flush, range deletions could cause all the keys to be dropped, in which case the output SST file only contains range deletions.
  2. Thanks for pushing for clarity; I didn't properly explain it earlier. upper_bound holds the start key of the next file in the same level if one exists (or if one will exist due to the running compaction job). It is used only if a range deletion contained in the compaction output file extends into the key-range of the next file in the same level (such overlap is not allowed). In that case, the largest range deletion endpoint is not used as the file's max key; instead, upper_bound is used with kMaxSeqnum. If upper_bound is nullptr or the file doesn't have range deletions extending into the next file, that's fine, we'll just use the largest range deletion endpoint.
  3. Yeah, I understand the problem you're referring to now. It actually causes other problems, too. For example, in compaction picking we need to pull in files as long as their endpoints overlap (see GetCleanInputsWithinInterval). The reason is, if we compacted only the file containing the newer version of a key, then newer data would be at a lower level than older data. So if you have a range deletion that spans a whole level, compaction will always pull in every file at that level since file endpoints will always overlap.
  4. I'm not that familiar with file ingestion actually. Is the assigned seqno supposed to be unique? In that case the ValueType would never be needed as tiebreaker in comparisons.

@ajkr
Contributor

ajkr commented Apr 24, 2018

Regarding the corruption you saw, we do intend to pin the iterators' underlying key-values for the RangeDelAggregator's lifetime:

rep_->pinned_iters_mgr_.PinIterator(input.release(), false /* arena */);

There could be a bug though. If you provide an OPTIONS file I can help debug.

@nvanbenschoten
Contributor Author

Hi @ajkr, thanks for the response! I've gone ahead and opened #3778, which incorporates your answers. For question 4, I do believe that the assigned seqno of an ingested sst is supposed to be unique. This is what ExternalSstFileIngestionJob::AssignLevelAndSeqnoForIngestedFile is meant to ensure.

Regarding the corruption, I've included an extra commit in my PR. Without this commit, ExternalSSTFileBasicTest.IngestFileWithMixedValueType hits the assertion described above. I did notice that we attempted to pin the iterator's underlying key-values for the RangeDelAggregator's lifetime, but my debugging showed that doing so didn't fully work. Specifically, I was able to observe the keys in a PositionalTombstoneMap's multimap get out of order because of this corruption before we even reached that line. Let me know how you'd like to proceed in debugging this. It might be easiest for you to just play with the commit I pushed locally.

nvanbenschoten added a commit to nvanbenschoten/rocksdb that referenced this issue Jul 14, 2018
Fixes facebook#3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
rcane pushed a commit to rcane/rocksdb that referenced this issue Sep 13, 2018
…k#3778)

Summary:
Fixes facebook#3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: facebook#3778

Differential Revision: D8821836

Pulled By: anand1976

fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844