stability: unexpected panic: llrb: inverted range #6027

Closed · mberhault opened this issue Apr 13, 2016 · 12 comments

@mberhault (Contributor)

Build sha: 08d6640

The beta cluster has been having issues, so one of those (e.g. #5998, #6000, #6020) may be the direct cause of this.
I just restarted node 4 (ec2-52-91-234-29.compute-1.amazonaws.com) with a recent sha (08d6640). About 15s later, node 3 (ec2-54-209-150-36.compute-1.amazonaws.com) crashed.

Node 4 shows the following:

I160413 10:46:45.889824 http2_client.go:881  transport: http2Client.notifyError got notified that the client transport was broken EOF.

Node 3 died with:

F160413 10:46:45.783176 util/log/log.go:42  unexpected panic: llrb: inverted range
1: running [Created by http2.(*serverConn).processHeaders @ server.go:1444]
    log        clog.go:858             getStacks(0xecea01a01, 0, 0, 0)
    log        clog.go:724             (*loggingT).outputLogEntry(#4, 0x3, 0x1f705bf, 0xf, 0x2a, 0xc8230fa630, 0x26)
    log        structured.go:57        addStructured(0, 0, 0x3, 0x2, #3, 0x14, #24, 0x1, 0x1)
    log        log.go:73               logDepth(0, 0, 0x1, 0x3, #3, 0x14, #24, 0x1, 0x1)
    log        log.go:179              Fatalf(#3, 0x14, #24, 0x1, 0x1)
    log        log.go:42               FatalOnPanic()
               panic.go:426            panic(0x140ef80, 0xc82bc25900)
    llrb       llrb.go:505             (*Tree).DoRange(0xc820405478, 0xc826b5d220, #33, 0xc8230fa570, #33, 0xc8230fa5a0, 0x18)
    cache      cache.go:370            (*OrderedCache).DoRange(0xc820405420, 0xc826b5d2d0, #1, 0xc822113080, #1, 0xc8221130a0)
    kv         range_cache.go:617      (*rangeDescriptorCache).clearOverlappingCachedRangeDescriptors(#15, #20, 0, 0)
    kv         range_cache.go:561      (*rangeDescriptorCache).insertRangeDescriptorsLocked(#15, #20, 0x1, 0x1, 0, 0)
    kv         range_cache.go:217      insertRangeDescriptorsLocked)-fm(#20, 0x1, 0x1, 0, 0)
    kv         range_cache.go:244      (*evictionToken).EvictAndReplace(0xc826b5dcd8, #20, 0x1, 0x1, 0, 0)
    kv         dist_sender.go:761      (*DistSender).sendChunk(#12, #32, #23, #35, 0, 0, 0, 0, 0, 0, ...)
    kv         dist_sender.go:546      (*DistSender).Send(#12, #32, #23, #35, 0, 0, 0, 0, 0, 0, ...)
    kv         txn_coord_sender.go:409 (*TxnCoordSender).Send(#13, #32, #23, 0, 0, 0, 0, 0, 0, 0, ...)
    client     db.go:474               (*DB).send(#16, 0, 0x2, #25, 0x1, 0x1, 0x25, 0x40)
    client     db.go:421               send)-fm(0, 0x2, #25, 0x1, 0x1, 0xc82ca11fb0, 0x25)
    client     db.go:386               sendAndFill(0xc826b5efc8, #19, 0x25, 0x25)
    client     db.go:421               (*DB).RunWithResponse(#16, #19, #2, #26)
    client     db.go:412               (*DB).Run(#16, #19, #19)
    client     db.go:490               runOneResult(#34, #16, #19, 0, 0, 0, 0, 0, 0, 0, ...)
    client     db.go:293               (*DB).scan(#16, #2, #26, #2, #27, 0, #6, 0, 0, 0, ...)
    client     db.go:309               (*DB).ScanInconsistent(#16, #2, #26, #2, #27, 0, 0, 0, 0, 0)
    ts         query.go:435            (*DB).Query(#7, 0xc826235920, 0x19, 0xc82a826a68, 0xc82a826a74, 0xc82a826a80, #5, 0, 0, 0x1, ...)
    ts         server.go:84            (*Server).handleQuery(#11, #30, #29, #28, 0, 0, 0)
    ts         server.go:49            handleQuery)-fm(#30, #29, #28, 0, 0, 0)
    httprouter router.go:344           (*Router).ServeHTTP(#8, #30, #29, #28)
    ts         server.go:55            (*Server).ServeHTTP(#11, #30, #29, #28)
    http       server.go:1910          (*ServeMux).ServeHTTP(#14, #30, #29, #28)
    server     server.go:440           (*Server).ServeHTTP(#9, #30, #29, #28)
    http       server.go:2081          serverHandler.ServeHTTP(#10, #31, #22, #28)
    http       server.go:2489          initNPNRequest.ServeHTTP(#17, #10, #31, #22, #28)
    http       <autogenerated>:253     (*initNPNRequest).ServeHTTP(#21, #31, #22, #28)
    http       h2_bundle.go:3847       (Handler).ServeHTTP-fm(#31, #22, #28)
    http2      server.go:1679          (*serverConn).runHandler(#18, #22, #28, 0xc8262352e0)

Node 3 log:
node3.log.parse.txt.gz

Node 4 log:
node4.log.parse.txt

I'll see if I can add debugging here and try again.

@mberhault (Contributor, Author)

This just happened to three nodes on the gamma cluster:

2: 104.196.124.159
3: 104.196.98.24
4: 104.196.39.184

@tbg (Member) commented Apr 14, 2016

@nvanbenschoten, would you mind taking a peek?

@cuongdo (Contributor) commented Apr 14, 2016

This has hit me several times, so let me know if you need help reproducing. The lower parts of the stack involve time-series queries, which are hit very frequently by the admin UI.


@tbg (Member) commented Apr 14, 2016

I think Nathan should have an idea pretty quickly, or at least be able to add assertions that make it obvious. This error is happening on RangeDescriptor updates.


@mberhault (Contributor, Author)

Trivially reproducible on the gamma cluster.
The compared ranges in question are:

F160415 16:01:35.553269 util/log/log.go:42  unexpected panic: llrb: inverted range: from="\x00" to=/Min

Or with %q printout:

F160415 16:10:41.185122 util/log/log.go:42  unexpected panic: llrb: inverted range: from="\"\\x00\"" to="/Min"

From a modified DoRange:

// DoRange as modified for debugging: the inverted-range panic now includes
// the offending endpoints (this requires importing "fmt" in the llrb package).
func (t *Tree) DoRange(fn Operation, from, to Comparable) bool {
	if t.Root == nil {
		return false
	}
	if from.Compare(to) > 0 {
		panic(fmt.Sprintf("llrb: inverted range: from=%+v to=%+v", from, to))
	}
	return t.Root.doRange(fn, from, to)
}

Could it just be a matter of using the "pretty" value?

@tbg (Member) commented Apr 15, 2016

No, the keys are actually inverted (nil vs \x00). The question is what happens to get to this call.
I think I would recover() the panic in clearOverlappingCachedRangeDescriptors and print out all of the descriptors I have (or maybe one level higher, depending on what's available/most interesting).
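
For concreteness, here is a minimal standalone sketch (plain byte slices and the standard library only, not the CockroachDB key types) of why the two endpoints really are inverted regardless of how they are printed: KeyMin is the empty key, and "\x00" sorts strictly after it, so a DoRange from "\x00" to KeyMin trips the llrb check:

package main

import (
	"bytes"
	"fmt"
)

func main() {
	keyMin := []byte{}         // the empty key, rendered as /Min in the logs
	keyMinNext := []byte{0x00} // KeyMin.Next(), rendered as "\x00"

	// Compare returns 1: the "from" endpoint sorts after the "to" endpoint,
	// which is exactly the inverted range the llrb panic reports.
	fmt.Println(bytes.Compare(keyMinNext, keyMin))
}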

@mberhault (Contributor, Author)

Interesting. I added the following recover and logging:

diff --git a/kv/range_cache.go b/kv/range_cache.go
index dd45199..bbb58eb 100644
--- a/kv/range_cache.go
+++ b/kv/range_cache.go
@@ -609,6 +609,16 @@ func (rdc *rangeDescriptorCache) clearOverlappingCachedRangeDescriptors(desc *ro
  // going to cache. This could happen on a merge (and also happens
  // when there's a lot of concurrency). Iterate from the range meta key
  // after RangeMetaKey(desc.StartKey) to the range meta key for desc.EndKey.
+ defer func() {
+   if r := recover(); r != nil {
+     log.Infof("Recovered: %v", r)
+     log.Infof("from: %+v (start: %+v) to: %+v (%+v)", startMeta, startMeta.Next(), endMeta, rangeCacheKey(endMeta))
+     rdc.rangeCache.Do(func(k, v interface{}) {
+       log.Infof("key=%s desc=%+v\n", roachpb.Key(k.(rangeCacheKey)), v)
+     })
+     panic("Dying anyway!!!")
+   }
+ }()
  rdc.rangeCache.DoRange(func(k, v interface{}) {
    if log.V(1) {
      log.Infof("clearing subsumed descriptor: key=%s desc=%s", k, v.(*roachpb.RangeDescriptor))

Which dumps:

I160415 16:35:27.938240 kv/range_cache.go:614  Recovered: llrb: inverted range: from="\"\\x00\"" to="/Min"
I160415 16:35:27.938265 kv/range_cache.go:615  from: /Min (start: "\x00") to: /Min (/Min)

And the full log, including a dump of all keys/descs:
node4.log.txt

@bdarnell (Contributor)

I'm pretty sure the offending descriptor is simply an empty zero value; that would make clearOverlappingCachedRangeDescriptors compute a range of (KeyMin.Next(), KeyMin).
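
To make that arithmetic concrete, here is a minimal standalone sketch (simplified stand-ins for the descriptor and meta-key helpers, not the real kv/range_cache.go or keys code) of how a zero-value descriptor yields the inverted range (KeyMin.Next(), KeyMin):

package main

import (
	"bytes"
	"fmt"
)

// Simplified stand-ins for illustration only.
type rangeDescriptor struct {
	StartKey, EndKey []byte // both empty in a zero-value descriptor
}

// next mirrors Key.Next(): append a zero byte.
func next(k []byte) []byte { return append(append([]byte(nil), k...), 0) }

// rangeMetaKey is reduced to the identity here; for KeyMin (the only case a
// zero-value descriptor exercises) the real RangeMetaKey also returns KeyMin.
func rangeMetaKey(k []byte) []byte { return k }

func main() {
	var desc rangeDescriptor // empty zero value

	// clearOverlappingCachedRangeDescriptors iterates from the meta key
	// *after* RangeMetaKey(desc.StartKey) to RangeMetaKey(desc.EndKey).
	from := next(rangeMetaKey(desc.StartKey)) // "\x00"
	to := rangeMetaKey(desc.EndKey)           // "" (KeyMin)

	fmt.Printf("from=%q to=%q inverted=%v\n", from, to, bytes.Compare(from, to) > 0)
	// Output: from="\x00" to="" inverted=true
}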

According to the stack trace, this is happening as a result of a RangeKeyMismatchError. I think that a request is being directed to an uninitialized replica, and it makes it as far as Replica.checkCmdHeader. Here, ContainsKeyRange returns false so it returns an error with a non-nil but incomplete range descriptor.

If this theory is correct then this patch should make the error go away:

diff --git a/storage/replica.go b/storage/replica.go
index a423f27..6c38917 100644
--- a/storage/replica.go
+++ b/storage/replica.go
@@ -787,6 +787,9 @@ func (r *Replica) Send(ctx context.Context, ba roachpb.BatchRequest) (*roachpb.B

 // TODO(tschottdorf): almost obsolete.
 func (r *Replica) checkCmdHeader(header roachpb.Span) error {
+   if !r.IsInitialized() {
+       return roachpb.NewRangeKeyMismatchError(header.Key, header.EndKey, nil)
+   }
    if !r.ContainsKeyRange(header.Key, header.EndKey) {
        mismatchErr := roachpb.NewRangeKeyMismatchError(header.Key, header.EndKey, r.Desc())
        // Try to suggest the correct range on a key mismatch error where

@mberhault (Contributor, Author)

Sure, I'll give it a shot.

@mberhault (Contributor, Author)

So far so good; it usually takes less time than this to crash. Thanks, Ben.

@bdarnell (Contributor)

Cool. I'll work on a test and figure out the right place for this check (I think it should have been caught earlier than checkCmdHeader, and RangeKeyMismatch is probably not the right error and will cause unnecessary cache invalidations).

@tbg (Member) commented Apr 15, 2016

👍


mberhault modified the milestone: Q2 on Apr 17, 2016
mberhault assigned bdarnell and unassigned nvanbenschoten on Apr 17, 2016
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 19, 2016
The store now returns NotLeaderError instead of a malformed
RangeKeyMismatchError when a request is addressed to an uninitialized
replica. The replica that caused the creation of the uninitialized
replica is used as our best guess for the current leader.

Guard against regressions by checking the validity of the descriptor
passed to NewRangeKeyMismatchError. Also add an admittedly complex
regression test.

Fixes cockroachdb#6027
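
As a rough illustration of the regression guard mentioned above (stand-in types only; the real constructor is roachpb.NewRangeKeyMismatchError and its details may differ), the idea is to reject a partially-initialized descriptor at the point where the mismatch error is built, instead of letting it reach the range descriptor cache:

package main

import "fmt"

// Stand-in types for illustration; the real ones are roachpb.Key,
// roachpb.RangeDescriptor, and roachpb.RangeKeyMismatchError.
type key []byte

type rangeDescriptor struct {
	StartKey, EndKey key
}

// isInitialized captures the idea that a descriptor whose EndKey was never
// populated belongs to an uninitialized replica.
func (d *rangeDescriptor) isInitialized() bool {
	return d != nil && len(d.EndKey) != 0
}

type rangeKeyMismatchError struct {
	requestStart, requestEnd key
	mismatchedRange          *rangeDescriptor // nil if unknown
}

// newRangeKeyMismatchError refuses to embed a descriptor that is not fully
// initialized; callers holding only an uninitialized replica should pass nil
// (or, as the actual fix does, return a different error such as
// NotLeaderError) rather than a zero-value descriptor.
func newRangeKeyMismatchError(start, end key, desc *rangeDescriptor) *rangeKeyMismatchError {
	if desc != nil && !desc.isInitialized() {
		panic(fmt.Sprintf("newRangeKeyMismatchError called with uninitialized descriptor %+v", desc))
	}
	return &rangeKeyMismatchError{requestStart: start, requestEnd: end, mismatchedRange: desc}
}

func main() {
	// A well-formed call embeds a complete descriptor and succeeds.
	err := newRangeKeyMismatchError(key("a"), key("b"),
		&rangeDescriptor{StartKey: key(""), EndKey: key("z")})
	fmt.Printf("built %T with range [%q, %q)\n",
		err, err.mismatchedRange.StartKey, err.mismatchedRange.EndKey)
}

With a guard like this in place, the bad path in checkCmdHeader fails loudly at the source instead of producing the (KeyMin.Next(), KeyMin) lookup much later in DoRange.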