storage: excessive RocksDB compaction from reasonable suggestion #26693
Caught another instance of this with additional debug logs:
This shows that the compaction queue is processing a compaction which it thinks covers 262 MiB of data. The subsequent lines show the sstables which overlap this range. RocksDB in turn thinks the compaction encompasses 24 GiB of data:
The problem is with … @benesch This lends additional weight to not going with the hack and actually fixing the selection of sstable boundaries in …
Some other logging from the above event revealed the inputs RocksDB was using for the compaction:
If we decode the start and end keys for these sstables we see:
Notice that the 2nd and 3rd sstables overlap in their end/start keys. That isn't supposed to happen. RocksDB has consistency checks to make sure sstables do not overlap, but they are disabled by default in release mode. We need to be setting …
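The option name is cut off above; as a minimal sketch, assuming it refers to RocksDB's force_consistency_checks option, which keeps these overlap checks enabled even in release (NDEBUG) builds:

```cpp
#include <rocksdb/options.h>

// Sketch only: the thread truncates the setting name; force_consistency_checks
// is assumed here. It runs the version consistency checks (including the
// sstable-overlap check) even in release builds.
rocksdb::Options MakeCheckedOptions() {
  rocksdb::Options opts;
  opts.force_consistency_checks = true;
  return opts;
}
```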
Heh, running with a RocksDB debug build, this assertion fires almost immediately:
void VersionStorageInfo::AddFile(int level, FileMetaData* f, Logger* info_log) {
auto* level_files = &files_[level];
// Must not overlap
#ifndef NDEBUG
if (level > 0 && !level_files->empty() &&
internal_comparator_->Compare(
(*level_files)[level_files->size() - 1]->largest, f->smallest) >= 0) {
auto* f2 = (*level_files)[level_files->size() - 1];
if (info_log != nullptr) {
Error(info_log, "Adding new file %" PRIu64
" range (%s, %s) to level %d but overlapping "
"with existing file %" PRIu64 " %s %s",
f->fd.GetNumber(), f->smallest.DebugString(true).c_str(),
f->largest.DebugString(true).c_str(), level, f2->fd.GetNumber(),
f2->smallest.DebugString(true).c_str(),
f2->largest.DebugString(true).c_str());
LogFlush(info_log);
}
assert(false);
}
#endif
f->refs++;
level_files->push_back(f);
}
Wait, what? That's a whole new world of badness. Is this caused by anything we've done or is this just a way of saying "RocksDB is completely broken"?
I have a diagram in my head I should add to a comment explaining why this is better. However, I just had another thought: we’re effectively forcing a compaction of the bounds of every DeleteRange. I’d like to try resurrecting Nikhil’s patch to mark sstables containing range tombstones as needing compaction. His patch only did that if the sstable contained 5 or more range tombstones. Seems like marking every sstable containing at least 1 range tombstone would completely remove the need for the compaction queue. Internally RocksDB can be smarter, only compacting sstables that contain range tombstones. The compaction queue is only making a guess.
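A rough sketch of what that marking could look like via RocksDB's table-properties-collector hook. This is illustrative only, not Nikhil's actual patch; the class name is made up, and it assumes a RocksDB version that reports range tombstones to collectors as kEntryRangeDeletion:

```cpp
#include <rocksdb/table_properties.h>
#include <rocksdb/types.h>

// Illustrative sketch: mark any sstable containing at least one range
// tombstone as needing compaction.
class RangeTombstoneMarker : public rocksdb::TablePropertiesCollector {
 public:
  rocksdb::Status AddUserKey(const rocksdb::Slice& /*key*/,
                             const rocksdb::Slice& /*value*/,
                             rocksdb::EntryType type,
                             rocksdb::SequenceNumber /*seq*/,
                             uint64_t /*file_size*/) override {
    if (type == rocksdb::kEntryRangeDeletion) {
      ++num_range_tombstones_;
    }
    return rocksdb::Status::OK();
  }
  rocksdb::Status Finish(rocksdb::UserCollectedProperties* /*props*/) override {
    return rocksdb::Status::OK();
  }
  rocksdb::UserCollectedProperties GetReadableProperties() const override {
    return {};
  }
  const char* Name() const override { return "RangeTombstoneMarker"; }
  // RocksDB marks the resulting sstable for compaction when this returns true.
  bool NeedCompact() const override { return num_range_tombstones_ >= 1; }

 private:
  uint64_t num_range_tombstones_ = 0;
};
```

Such a collector would be registered through a matching TablePropertiesCollectorFactory in ColumnFamilyOptions::table_properties_collector_factories.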
Yes and no, though I'd be tempted to try it. Yes because for the range tombstones, it mostly does. There is an argument that the "tombstone = compaction" heuristic is too aggressive for tombstones that come from replica GC (though maybe it's ok). Then there's the bigger problem of running compactions for non-range tombstones. The other day a user managed to get to 60 GB of on-disk usage when their dataset was really more like 3 GB -- a queue-type workload. I think the original idea of the compaction queue also included alleviating that kind of thing. However, I'm not too hopeful that it will, and RocksDB's built-in tombstone-sensitive filters might address this problem better.
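For the non-range (point) tombstone case, a sketch of the kind of built-in filter being alluded to, assuming it is RocksDB's compact-on-deletion collector; the window size and trigger below are arbitrary:

```cpp
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

// Sketch: ask RocksDB to mark an sstable for compaction when a sliding window
// of 16K entries contains at least 4K point deletions (thresholds arbitrary).
rocksdb::Options MakeDeletionAwareOptions() {
  rocksdb::Options opts;
  opts.table_properties_collector_factories.emplace_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/16 * 1024,
          /*deletion_trigger=*/4 * 1024));
  return opts;
}
```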
@tschottdorf Oh, and to answer your earlier question: yes, this assertion firing is a whole new world of brokenness. I'm trying to make sure it isn't due to one of the RocksDB patches that Nikhil and I have been using. I don't think it is, but need to make sure.
Possibly relevant: facebook/rocksdb#3926
That does look relevant.
I've reproduced the assertion mentioned above using a cockroach built from master but with rocksdb assertions enabled. I'm going to try the patch @benesch pointed to above to see if that fixes the issue.
facebook/rocksdb#3926 appears to fix the problem. This seems somewhat serious as the bug causes RocksDB to violate a basic invariant that 2 sstables in the same level do not overlap. I think the result could be missing keys during iteration. I need to verify whether the bug that RocksDB PR fixes is also present in 2.0. Hopefully I'll get to that tomorrow. If I don't, @benesch will likely need to pick this up when I'm on vacation. Cc @bdarnell in case you haven't been following along.
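As an illustration of why overlapping files in a level could translate into missing keys (a simplified sketch, not RocksDB's actual lookup code): reads in a non-zero level binary-search for the single file whose range covers the key, so an overlap means the key can live in a file the search never consults:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative only. Point lookups in a sorted, non-overlapping level consult
// exactly one file: the first whose largest key is >= the search key. If two
// files overlap, the key may live in a file this search never looks at.
struct FileRange {
  std::string smallest;
  std::string largest;
};

const FileRange* FindFileForKey(const std::vector<FileRange>& level,
                                const std::string& key) {
  auto it = std::lower_bound(
      level.begin(), level.end(), key,
      [](const FileRange& f, const std::string& k) { return f.largest < k; });
  if (it == level.end() || key < it->smallest) return nullptr;
  return &*it;
}
```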
We enable rocksdb assertions in race-enabled builds (grep …)
Well, the good news is that so far I'm unable to reproduce the assertion failure on 2.0. It was happening right away when doing a …
Well, I looked through the changes in our RocksDB upgrade and nothing jumped out at me, but there were a lot of changes so perhaps I missed something.
One other thing we could try: backporting the RocksDB unit test that exposes the bug to 5.9.0+patches and seeing whether it fails.
Good point. I’ll do that this afternoon.
https://github.com/cockroachdb/rocksdb/tree/0edac964ca804c97c69ef18123b5e82d1b57e19e is the SHA used by the 2.0 branch. I'll work on patching master and backporting to 2.0.
RocksDB was violating an invariant that no 2 sstables in a level overlap. It isn't quite clear what the upshot of this violation is. At the very least it would cause the overlapping tables to be compacted together. It seems possible that it could lead to missing writes, but I haven't been able to verify that. Fixes cockroachdb#26693 Release note: None
26755: release-2.0: Bump RocksDB pointer to grab facebook/rocksdb#3926 r=benesch,a-robinson a=petermattis RocksDB was violating an invariant that no 2 sstables in a level overlap. It isn't quite clear what the upshot of this violation is. At the very least it would cause the overlapping tables to be compacted together. It seems possible that it could lead to missing writes, but I haven't been able to verify that. Fixes #26693 Release note: None Co-authored-by: Peter Mattis <[email protected]>
26753: storage: Add extra event to allocator rebalancing r=a-robinson a=a-robinson Helps make the output of simulated allocator runs less confusing, since otherwise it's not clear why we're considering removal from the range and why the replicas being considered for removal includes one that isn't even a real member of the range. Release note: None Would have made looking at the simulated allocator output from https://forum.cockroachlabs.com/t/how-to-enable-leaseholder-load-balancing/1732/3 a little more pleasant. 26754: Bump RocksDB pointer to grab facebook/rocksdb#3926 r=benesch,a-robinson a=petermattis RocksDB was violating an invariant that no 2 sstables in a level overlap. It isn't quite clear what the upshot of this violation is. At the very least it would cause the overlapping tables to be compacted together. It seems possible that it could lead to missing writes, but I haven't been able to verify that. Fixes #26693 Release note: None Co-authored-by: Alex Robinson <[email protected]> Co-authored-by: Peter Mattis <[email protected]>
I'm going to leave this open as a reminder that I should try and reproduce the original badness in this issue (excessive compactions) now that the RocksDB bug with overlapping sstables has been fixed.
The current implementation of range deletion tombstones in RocksDB suffers from a performance bug that causes excessive CPU usage on every read operation in a database with many range tombstones. Dropping a large table can easily result in several thousand range deletion tombstones in one store, resulting in an unusable cluster as documented in cockroachdb#24029. Backport a refactoring of range deletion tombstone that fixes the performance problem. This refactoring has also been proposed upstream as facebook/rocksdb#4014. A more minimal change was also proposed in facebook/rocksdb#3992--and that patch better highlights the exact nature of the bug than the patch backported here, for those looking to understand the problem. But this refactoring, though more invasive, gets us one step closer to solving a related problem where range deletions can cause excessively large compactions (cockroachdb#26693). These large compactions do not appear to brick the cluster but undoubtedly have some impact on performance. Fix cockroachdb#24029. Release note: None
I believe the remaining work here is upstream in RocksDB. See facebook/rocksdb#3977. Moving to Later as it isn't clear if that work is necessary.
We have no plans to fix this issue in RocksDB. Pebble is somewhat better in this area as it can cut sstables at grandparent boundaries even if that means cutting a range tombstone. There are a number of open Pebble issues for further improvements for compactions, but this issue has done its service and can be closed.
This is forked from #24029. One of the recurring symptoms we've seen in that issue is that RocksDB compactions go crazy, consuming excessive CPU and compacting way more data than we expect. We've debugged a few problems in how RocksDB handles range tombstones. This issue occurred with mitigations in place for those earlier issues.
I180613 14:43:08.584878 478 storage/compactor/compactor.go:367 [n8,s8,compactor] processing compaction #1-10/38 (/Table/53/1/56300577-/Table/53/1/56738067) for 288 MiB (reasons: size=true used=false avail=false)
...
I180613 14:43:08.586579 478 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_impl_compaction_flush.cc:971] [default] Manual compaction starting L5 -> L6
I180613 14:43:08.586651 478 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:432] [default] L5 -> L6: compaction 970 overlapping inputs: 3215B + 21GB: 'BD89F9035B2DB188001536E587A4E0547209' seq:4235335, type:0 - 'BD89F903D8A50700' seq:72057594037927935, type:15
Here we can see that the compaction queue executed a compaction for what it believes is 288 MiB of data. RocksDB translated this into a compaction for 21 GiB of data!
Those hex decoded keys indicate that RocksDB expanded our suggested compaction range to /Table/53/1/56307121/0/1528661494.288372850,0-/Table/53/1/64529671. The start key seems pretty close to the suggested compaction, but that end key is much larger. The RocksDB log line was added to CompactionPicker::SetupOtherInputs and the range of keys was determined by CompactionPicker::GetRange. I'm going to add some additional debugging info to try and track down what is going on.
Cc @benesch, @tschottdorf
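As a simplified illustration of that mechanism (not the actual RocksDB code): a GetRange-style union of the input bounds, followed by a SetupOtherInputs-style sweep that adds every overlapping next-level file, shows how one input whose end key sits far away (e.g. because of a wide range tombstone) balloons a 288 MiB suggestion into a 21 GiB compaction:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Simplified sketch of the expansion, not RocksDB's implementation.
struct FileRange {
  std::string smallest;
  std::string largest;
};

// GetRange-style: union of the chosen input files' key bounds.
std::pair<std::string, std::string> GetRange(const std::vector<FileRange>& inputs) {
  std::string lo = inputs.front().smallest;
  std::string hi = inputs.front().largest;
  for (const auto& f : inputs) {
    lo = std::min(lo, f.smallest);
    hi = std::max(hi, f.largest);  // one wide file stretches hi far beyond the suggestion
  }
  return {lo, hi};
}

// SetupOtherInputs-style: every output-level file overlapping [lo, hi] joins
// the compaction, so all of these files get rewritten too.
std::vector<FileRange> OverlappingOutputs(const std::vector<FileRange>& level_files,
                                          const std::string& lo,
                                          const std::string& hi) {
  std::vector<FileRange> out;
  for (const auto& f : level_files) {
    if (!(f.largest < lo || hi < f.smallest)) out.push_back(f);
  }
  return out;
}
```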