Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single node has a very high CPU utilization when attempting a split of a range, causing range to be inaccessible. #43106

Closed
ricardocrdb opened this issue Dec 10, 2019 · 26 comments
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. O-community Originated from the community

Comments

@ricardocrdb
Copy link

A user is currently using 2.1.6 and is running into an issue where a single node in the 5 node cluster is experiencing nearly 100% CPU usage on 12 cores. Logs show the following error on repeat for nearly 6 hours:

[n1,split,s1,r14507/42:/Table/123/1/"RandomStuff..] split at key /Table/123/1/"RandomStuff here" failed: context deadline exceeded

The range reports for r14507 shows everything expected with the node in question also as the leaseholder. An output of all goroutines from the Debugging page of the AdminUI shows that a high number of routines are in rocksDBIterator.Seek A core dump has been created of the cockroach process and is available here for further analysis.

Of note, when creating the core dump, the cockroach process was paused long enough for the leaseholder to be moved to a new node, which caused the range to be available. The user has confirmed that previously inaccessible rows are now available. The cockroach process on the node is still running at near max CPU usage as of the creating of this issue.

@ricardocrdb ricardocrdb added O-community Originated from the community A-kv-distribution Relating to rebalancing and leasing. labels Dec 10, 2019
@ricardocrdb
Copy link
Author

CCing @ajwerner @petermattis

@RoachietheSupportRoach
Copy link
Collaborator

Zendesk ticket #4246 has been linked to this issue.

@ajkr
Copy link
Contributor

ajkr commented Dec 11, 2019

2.1.9 has multiple infinite loop fixes: https://www.cockroachlabs.com/docs/releases/v2.1.9.html. One for compaction and one for reverse scan. Note there is yet another infinite loop reverse scan bug that is not present in any 2.1 patch release: #35505.

@petermattis
Copy link
Collaborator

petermattis commented Dec 11, 2019 via email

@ajkr
Copy link
Contributor

ajkr commented Dec 11, 2019

Got it. These are the stack traces I'm seeing.

Thread 29 (LWP 315581):
#0  0x00007fbeb90c1600 in ?? ()
#1  0x000000000209058d in rocksdb::Slice::compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/../rocksdb/include/rocksdb/slice.h:223
#2  cockroach::DBComparator::Compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/comparator.cc:29
#3  0x00000000021759b2 in rocksdb::InternalKeyComparator::Compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/dbformat.cc:156
#4  0x00000000021aea42 in operator() () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:33
#5  _M_upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_tree.h:1686
#6  upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_tree.h:1111
--Type <RET> for more, q to quit, c to continue without paging--
#7  upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_map.h:1195
#8  rocksdb::CollapsedRangeDelMap::GetTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:339
#9  0x00000000021ac45a in rocksdb::RangeDelAggregator::GetTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:602
#10 0x000000000222ceec in rocksdb::BlockBasedTableIterator::InitRangeTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:2133
#11 0x000000000222f900 in rocksdb::BlockBasedTableIterator::FindKeyForward () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:2039
#12 0x000000000222fc51 in rocksdb::BlockBasedTableIterator::SeekToFirst () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:1877
#13 0x00000000021c16a6 in rocksdb::IteratorWrapper::SeekToFirst () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:69
#14 SkipEmptyFileForward () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:685
#15 0x000000000224dd5d in rocksdb::IteratorWrapper::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:63
#16 rocksdb::MergingIterator::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/merging_iterator.cc:112
#17 0x000000000216d886 in rocksdb::DBIter::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_iter.cc:1165
#18 0x000000000208e038 in Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/batch.cc:155
#19 0x0000000002092d5a in DBIterSeek () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/db.cc:518
#20 0x0000000001ff1733 in _cgo_77181d5e6ea5_Cfunc_DBIterSeek (v=0xc467b10468) at cgo-gcc-prolog:646

I am not sure it rules out an MVCC scan loop as presumably that is a caller of DBIterSeek. But I am still looking and don't have evidence to rule it in either.

@ajkr
Copy link
Contributor

ajkr commented Dec 11, 2019

Actually it does rule it out as mvcc scan would show up in the C++ stack trace, sorry.

@petermattis
Copy link
Collaborator

petermattis commented Dec 11, 2019 via email

@petermattis
Copy link
Collaborator

Here is a list of the threads from the core file:
threads.txt. I stripped out all of the sleeping threads. The count of threads in DBIterSeek (33) matches up precisely with the count from the goroutine dump. Note that the goroutine dump does indicate one goroutine in _Cfunc_DBIterSeek, but that goroutine has yet to make the transition to cgo.

It looks like there is a pattern of DBIterSeek -> ... -> rocksdb::BlockBasedTableIterator::SeekToFirst. I'm going to see if I can determine the sstable file numbers involved.

@ajkr
Copy link
Contributor

ajkr commented Dec 11, 2019

Based on some potentially error-prone steps I suspect the offending file is an L5 file, "2326537.sst".

Another notable thing is the cached range tombstone (BlockBasedTableIterator::range_tombstone_) appears to have the same start and end key. Maybe the skipping optimization gets stuck in this case.

@petermattis
Copy link
Collaborator

petermattis commented Dec 11, 2019 via email

@ajkr
Copy link
Contributor

ajkr commented Dec 11, 2019

It isn't recommended but I rebuilt v2.1.6 with -g instead of -g1 (which does not include local variables) and the debug info looks good to me. Picked a few of the stacks stuck in SeekToFirst(), changed frame to one in BlockBasedTableIterator, then print the file number. It was the same for the few cases I checked:

(gdb) p this->file_meta_->fd.packed_number_and_path_id & kFileNumberMask
$20 = 2326537

@petermattis
Copy link
Collaborator

petermattis commented Dec 11, 2019 via email

@ajkr
Copy link
Contributor

ajkr commented Dec 11, 2019

Yes it was done using builder image. Here's something else interesting - the empty range tombstone originates from two points in the collapsed map with the same user key, where one has kMaxSequenceNumber and the other has seqnum zero. Also the two points are at the same seqnum (elem[i]->right) which suggests they came from the same logical tombstone.

Used std map pretty printers from https://sourceware.org/ml/gdb/2008-02/msg00064/stl-views.gdb. Ran in a stack frame of rocksdb::CollapsedRangeDelMap:

(gdb) pmap &rep_ ParsedInternalKey SequenceNumber
...
elem[9]->left: $51 = {
  user_key = {
    data_ = 0x7fbdce6e5f20 "X",
    size_ = 76
  },
  sequence = 72057594037927935,
  type = rocksdb::kMaxValue
}
elem[9]->right: $52 = 326417547197
elem[10]->left: $53 = {
  user_key = {
    data_ = 0x7fbdce6e48a0 "X",
    size_ = 76
  },
  sequence = 0,
  type = rocksdb::kMaxValue
}
elem[10]->right: $54 = 326417547197

User keys replaced with "X" for privacy. Verified they are the same before replacement.

@ajkr
Copy link
Contributor

ajkr commented Dec 11, 2019

I think the PartialRangeTombstone that is returned according to the above two points is non-empty, actually. It covers the single point key "X". My feeling is the fix should be to handle seeking to the end key of a range tombstone that has the same user key as the current key properly in BlockBasedTableIterator::FindKeyForward().

@ajkr
Copy link
Contributor

ajkr commented Dec 12, 2019

I looked into how we got in a situation where the collapsed range tombstone map can return a PartialRangeTombstone with an end_key_ whose user key equals a point key at a lower level, and whose seqnum is smaller than that point key. That is a necessary condition to enter the infinite loop I'm thinking of, and is satisfied by the evidence in the core dump.

The point key we noted earlier that scan gets stuck at is "X" in "2326537.sst" at L5. Further investigation shows the covering range tombstone is three levels earlier (i.e., L2) and split across files "2632281.sst" and "2327340.sst". Note the split point is the same "X" as in L5 where scan is stuck.

(gdb) p flevel_[-3].files[3]
$158 = {
  fd = {
    table_reader = 0x0,
    packed_number_and_path_id = 2327355,
    file_size = 2632281
  },
  file_metadata = 0x7fbe06f324c0,
  smallest_key = {
    data_ = 0x7fbea983e4c0 "W",
    size_ = 84
  },
  largest_key = {
    data_ = 0x7fbea983e514 "X",
    size_ = 84
  }
}
(gdb) p flevel_[-3].files[4]
$159 = {
  fd = {
    table_reader = 0x0,
    packed_number_and_path_id = 2327340,
    file_size = 8444279
  },
  file_metadata = 0x7fbdff3a2600,
  smallest_key = {
    data_ = 0x7fbea9874000 "X",
    size_ = 84
  },
  largest_key = {
    data_ = 0x7fbea9874054 "Y",
    size_ = 84
  }
}

Now, we need to figure out why the collapsed map contains an internal key with seqnum smaller than a point key at a lower level. Knowing the latter of the above two files has a range tombstone spanning beyond its earliest key, we can look at that table's smallest key to see how its tombstones would be truncated.

(gdb) p (uint64_t)flevel_[-3].files[4].smallest_key.data_[76]
$162 = 15

The above output indicates the table's smallest key has type 15 (range delete) and seqnum zero. This is highly unusual. Typically a file's smallest key corresponds to its first point key, which should have a non-zero seqnum at an upper level. We would not expect the file to be extended left by range tombstones. It means truncation of our range tombstone produces "X@0" to be inserted into the collapsed map. And the "X" point key at L5 certainly has a higher seqnum than zero. So the necessary condition for the infinite loop is satisfied.

Another unusual thing about the above two files is, although they have overlapping endpoints, they do not have consecutive file numbers. In fact the latter's file number is smaller than the former's file number. This points to subcompaction, which indeed is enabled in v2.1.6. Additionally, L2 appears to be the base level, and L0->Lbase is the case where subcompaction would happen.

So, without a complete understanding/repro, my feeling so far is (1) the unusual database state is related to subcompactions; and (2) although the state is unusual, it doesn't look like any corruption happened, so our seek-to-end-key optimization should handle this case.

@petermattis
Copy link
Collaborator

Excellent sleuthing, @ajkr. Definitely looks like a partial tombstone with the same start/end user-key would cause a problem. Something that was confusing when I was verifying this is that we seek to tombstone_internal_end_key(). Naively I was assuming that would seek past the tombstone's end key, but tombstone_internal_end_key() actually returns a key with kMaxSequenceNumber.

Seems like we could just skip this optimization if parsed.user_key == range_tombstone_.end_key().user_key. Does that sound right?

Does the same problem exist for FindKeyBackward? I think no because of the construction of tombstone_internal_start_key().

@petermattis
Copy link
Collaborator

Looks like this bug is present in 19.1.x as well. The code in question is very different in 19.2, and this particular optimization is not present.

@ajkr
Copy link
Contributor

ajkr commented Dec 12, 2019

Seems like we could just skip this optimization if parsed.user_key == range_tombstone_.end_key().user_key. Does that sound right?

Yes it sounds right. Or tombstone_internal_end_key() could return the actual end_key_ and we could seek to there - I am not sure why we seek elsewhere. But skipping the optimization sounds safer.

By the way, the condition for infinite loop turned out to be slightly different from a PartialRangeTombstone with same start/end user keys. I spent a while trying to repro it in a test case before realizing it. The condition is like this:

  • Collapsed range-del map has points A#i and B#j mapping to seqnum n (n is the tombstone seqnum whereas i and j are for limiting its span). A and B are variables representing user keys, and A <= B.
  • An SST file has point key B#k, where k > j. Additionally that file's max seqnum is less than n, which is necessary to enable the seek optimization for that file.
  • When iterator is positioned at B#k, it calls GetTombstone() and receives the partial tombstone [A#i, B#j)#n. (Note without k > j it would return a later tombstone and we might be safe.)
  • Now the iterator seeks to a forged end key which is B#kMaxSeqnum. It lands back on B#k and repeats.

The second bullet point makes it hard to repro. Requiring the file's max seqnum less than n means that file must be in a lower level than our range tombstone. Meanwhile, k > j requires having a point key in that file that is newer than a point in the collapsed map corresponding to our range tombstone. That is very difficult as the points in the collapsed map either have max seqnum or are file endpoints where the tombstone was split. The only case I can find a file endpoint with lower seqnum than a point key in a lower level is subcompactions, where a range tombstone extending a file to the left causes it to be assigned a smallest key with seqnum zero.

@petermattis
Copy link
Collaborator

Yes it sounds right. Or tombstone_internal_end_key() could return the actual end_key_ and we could seek to there - I am not sure why we seek elsewhere. But skipping the optimization sounds safer.

The comment in tombstone_internal_end_key() scares me from touching that seqnum magic.

The second bullet point makes it hard to repro. Requiring the file's max seqnum less than n means that file must be in a lower level than our range tombstone. Meanwhile, k > j requires having a point key in that file that is newer than a point in the collapsed map corresponding to our range tombstone. That is very difficult as the points in the collapsed map either have max seqnum or are file endpoints where the tombstone was split. The only case I can find a file endpoint with lower seqnum than a point key in a lower level is subcompactions, where a range tombstone extending a file to the left causes it to be assigned a smallest key with seqnum zero.

Can you repro with a test directly on BlockBasedTableIterator? That is, construct an sstable with the desired point keys, then construct a BlockBasedTableIterator and pass it a RangeDelAggregator populated with the necessary range tombstones.

@ajkr
Copy link
Contributor

ajkr commented Dec 12, 2019

Well, I was able to write the test of the condition using subcompaction: ajkr/rocksdb@6b55f57. Sadly the result is reappearing keys under the subcompaction split point rather than infinite loop.

@petermattis
Copy link
Collaborator

Sadly the result is reappearing keys under the subcompaction split point rather than infinite loop.

Can you elaborate on this? Did the test find another bug?

@ajkr
Copy link
Contributor

ajkr commented Dec 13, 2019

Yes. Well we already knew there are bugs with DeleteRange+subcompactions -- it just materialized one.

Let me explain by printing the PartialRangeTombstones that are seen during the DB scan at the end of that test. Here, inf means infinity, i.e., unbounded.

[inf, key000100#72057594037927935)#900
[key000100#72057594037927935, key000100#0)#0
[key000100#0, key000300#72057594037927935)#900
[key000300#72057594037927935, key000300#0)#0
[key000300#0, inf)#900

The subcompaction split points were key000100 and key000300. Notice how the PartialRangeTombstones returned exactly at split point X follow the form [X#kMaxSeqnum, X#0)#0. This means X is not covered in lower levels, while it should be as the range tombstone spanned over it before it was cut by subcompactions.

This happens because the last file in the subcompaction left of the split point has largest key X#kMaxSeqnum, while the first file in the subcompaction right of the split point has smallest key X#0. It is a loss of power that is harmful to readers, and such tombstones are not restored to their full power during compaction.

@ajkr
Copy link
Contributor

ajkr commented Dec 13, 2019

Yes. Well we already knew there are bugs with DeleteRange+subcompactions -- it just materialized one.

It's fixed by facebook/rocksdb@64aabc9. We don't need to backport as we already disabled subcompactions in v2.1.

For the original bug, I am still trying to understand what conditions caused the collapsed range-del map to contain a [X#kMaxSeqnum, X#0)#n where n > 0 and X is a subcompaction split point.

@petermattis
Copy link
Collaborator

It's fixed by facebook/rocksdb@64aabc9. We don't need to backport as we already disabled subcompactions in v2.1.

Ack. Glad you tracked that down.

For the original bug, I am still trying to understand what conditions caused the collapsed range-del map to contain a [X#kMaxSeqnum, X#0)#n where n > 0 and X is a subcompaction split point.

In parallel, it would be good to fix the BlockBasedTableIterator bug. I think that infinite loop condition can be tested with a targeted test of BlockBasedTableIterator and a manually constructed RangeDelAggregator.

@ajkr
Copy link
Contributor

ajkr commented Dec 16, 2019

For the original bug, I am still trying to understand what conditions caused the collapsed range-del map to contain a [X#kMaxSeqnum, X#0)#n where n > 0 and X is a subcompaction split point.

In parallel, it would be good to fix the BlockBasedTableIterator bug. I think that infinite loop condition can be tested with a targeted test of BlockBasedTableIterator and a manually constructed RangeDelAggregator.

Will do. Here is the test: ajkr/rocksdb@c46e860. While writing it I became even more surprised by the presence of X#kMaxSeqnum: n and X#0: n consecutively in the collapsed map. The testing showed consecutive points at the same sequence number should be collapsed into a single point, i.e., only X#kMaxSeqnum: n should be present.

@ajkr
Copy link
Contributor

ajkr commented Dec 17, 2019

fixed by #43253 and #43255.

We still don't fully understand how a collapsed range-del map contains two adjacent points with identical seqnum. It might be worth looking into later.

@ajkr ajkr closed this as completed Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. O-community Originated from the community
Projects
None yet
Development

No branches or pull requests

4 participants