Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stability: nil pointer panic in timestamp cache #6190

Closed
bdarnell opened this issue Apr 21, 2016 · 7 comments
Closed

stability: nil pointer panic in timestamp cache #6190

bdarnell opened this issue Apr 21, 2016 · 7 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@bdarnell
Copy link
Contributor

Seen on cockroach-gamma-5 running e6880ae

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0xe765a7]

goroutine 201640 [running]:
panic(0x16667e0, 0xc820012080)
        /usr/local/go/src/runtime/panic.go:464 +0x3e6
github.com/cockroachdb/cockroach/storage.(*Store).Send.func1(0xc82eadf268, 0xc82eadf2d8, 0x14472af1bbd1f04a, 0x0, 0xc82eadf2d0, 0xc82031c900)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:1558 +0x6e
panic(0x16667e0, 0xc820012080)
        /usr/local/go/src/runtime/panic.go:426 +0x4e9
github.com/cockroachdb/cockroach/util/cache.(*entryList).remove(0xc82037b128, 0xc83a500dd0, 0xc887212b80)
        /go/src/github.com/cockroachdb/cockroach/util/cache/cache.go:132 +0x107
github.com/cockroachdb/cockroach/util/cache.(*baseCache).removeElement(0xc82037b100, 0xc83a500dd0)
        /go/src/github.com/cockroachdb/cockroach/util/cache/cache.go:277 +0x3f
github.com/cockroachdb/cockroach/util/cache.(*baseCache).DelEntry(0xc82037b100, 0xc83a500dd0)
        /go/src/github.com/cockroachdb/cockroach/util/cache/cache.go:250 +0x32
github.com/cockroachdb/cockroach/storage.(*TimestampCache).Add(0xc82012db40, 0xc887212b18, 0x5, 0x8, 0xc887212b80, 0x7, 0x8, 0x14472ae78a58f663, 0x1, 0xc887212b70, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/timestamp_cache.go:186 +0x67d
github.com/cockroachdb/cockroach/storage.(*Replica).endCmds(0xc820322700, 0xc86d734040, 0x1, 0x1, 0xc828815180, 0xc82c46ff80, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:936 +0x40e
github.com/cockroachdb/cockroach/storage.(*Replica).beginCmds.func1(0xc82c46ff80, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:903 +0x5c
github.com/cockroachdb/cockroach/storage.(*Replica).addReadOnlyCmd.func1(0xc82d117b00, 0xc82eade9e8, 0xc82eade9f0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1091 +0x36
github.com/cockroachdb/cockroach/storage.(*Replica).addReadOnlyCmd(0xc820322700, 0x7fe348dd2e58, 0xc8209293e0, 0x14472ae78a58f663, 0x1, 0x500000005, 0x5, 0x9, 0x0, 0xc821f
54000, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1108 +0x423
github.com/cockroachdb/cockroach/storage.(*Replica).Send(0xc820322700, 0x7fe348dd2e58, 0xc8209293e0, 0x14472ae78a58f663, 0x1, 0x500000005, 0x5, 0x9, 0x0, 0xc821f54000, ...
)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:793 +0x500
github.com/cockroachdb/cockroach/storage.(*Store).Send(0xc82031c900, 0x7fe348dd2e58, 0xc8209293e0, 0x14472ae78a58f663, 0x1, 0x500000005, 0x5, 0x9, 0x0, 0xc821f54000, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:1653 +0xe8e
github.com/cockroachdb/cockroach/storage.(*Stores).Send(0xc8203f8280, 0x7fe348dd2e58, 0xc820929380, 0x0, 0x0, 0x500000005, 0x5, 0x9, 0x0, 0xc821f54000, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/stores.go:178 +0x501
github.com/cockroachdb/cockroach/server.(*Node).Batch.func2()
        /go/src/github.com/cockroachdb/cockroach/server/node.go:730 +0x55b
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunTask(0xc8203b17a0, 0xc82eadf880, 0x0)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:166 +0xd9
github.com/cockroachdb/cockroach/server.(*Node).Batch(0xc820346600, 0x7fe348dd2e58, 0xc820929110, 0xc828814e70, 0xb03ecef60a8a6a87, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/server/node.go:742 +0x2c0
github.com/cockroachdb/cockroach/roachpb._Internal_Batch_Handler(0x17a2ae0, 0xc820346600, 0x7fe348dd2e58, 0xc820929110, 0xc821fa9480, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1336 +0x132
google.golang.org/grpc.(*Server).processUnaryRPC(0xc82000c580, 0x7fe348dd4418, 0xc82000abd0, 0xc82e7321c0, 0xc8204280a0, 0x22abba0, 0x0, 0x0, 0x0)
        /go/src/google.golang.org/grpc/server.go:497 +0xe13
google.golang.org/grpc.(*Server).handleStream(0xc82000c580, 0x7fe348dd4418, 0xc82000abd0, 0xc82e7321c0, 0x0)
        /go/src/google.golang.org/grpc/server.go:646 +0x109d
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc822a08980, 0xc82000c580, 0x7fe348dd4418, 0xc82000abd0, 0xc82e7321c0)
        /go/src/google.golang.org/grpc/server.go:323 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /go/src/google.golang.org/grpc/server.go:324 +0x9a
@seiflotfy
Copy link
Contributor

seiflotfy commented Apr 21, 2016

So this can be replicated using 5 nodes and the photo example?

@seiflotfy seiflotfy added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 21, 2016
@tbg
Copy link
Member

tbg commented Apr 21, 2016

You're probably going to have a hard time reproducing this just like that - it's the first time I've seen it, so it's probably rare. The best starting point for analyzing that issue is this code from timestamp_cache.go:

                case sCmp <= 0 && eCmp >= 0:
                    // New contains or is equal to old; delete old.
                    //
                    // New: ------------      ------------      ------------
                    // Old:   --------    or    ----------  or  ----------
                    //
                    // Old:
                    cache.DelEntry(o.Entry) // this panicked

Naively, this could be a locking issue (i.e. some concurrent actor removes o.Entry and then we try it again, which would presumably (would have to check) trigger this particular stack trace. That's one thing to verify - if that seems unlikely (i.e. if we never mutate the timestamp cache except under (*Replica).mu) maybe something we do causes corruption.

In the absence of good ideas on the cause of this issue, one measure would be getting more information the next time it happens, for instance by adding a defer in (*TimestampCache).Add which, on panic, prints out the timestamp cache's contents, the key range to be added, and the overlap we're considering while it breaks (o in the above code).

@tbg
Copy link
Member

tbg commented Apr 21, 2016

Another thought: Maybe we're modifying overlaps while we're still iterating and cause corruption that way.
For instance, picture a previous iteration of the same GetOverlaps loop deleting/mutating an object that pops up later in the iteration (or mutating an object evicts an object that we later get returned anyway but which at that point is not contained in the cache any more).

@petermattis
Copy link
Collaborator

I recently mucked with the code for baseCache.list. I think the old code was safe to remove a list element multiple times while the new code isn't. Removing a list element multiple times sounds like a bug, but we could also re-add the behavior that doing so is a no-op instead of a panic.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 22, 2016
This brings back the behavior that was present prior to cockroachdb#6140.

See cockroachdb#6190.
petermattis added a commit to petermattis/cockroach that referenced this issue Apr 22, 2016
This brings back the behavior that was present prior to cockroachdb#6140.

See cockroachdb#6190.
@tamird tamird added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label May 5, 2016
@petermattis
Copy link
Collaborator

Looks like this was fixed by the code to avoid panicing when removing a list element twice.

@tbg
Copy link
Member

tbg commented Aug 4, 2017

@petermattis is your last message about this code from cache.go:

	// TODO(peter): Revert this protection against removing a non-existent entry
	// from the list when the cause of
	// https://github.com/cockroachdb/cockroach/issues/6190 is determined. Should
	// be replaced with an explicit panic instead of the implicit one of a
	// nil-pointer dereference.
	if e.next != nil {
		e.prev.next = e.next
		e.next.prev = e.prev
		e.next = nil // avoid memory leaks
		e.prev = nil // avoid memory leaks
	}

or can that code be removed?

@tbg tbg reopened this Aug 4, 2017
@petermattis
Copy link
Collaborator

@tschottdorf I don't recall ever tracking down the root cause. It is possible the problem was fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Projects
None yet
Development

No branches or pull requests

5 participants