Invalid memory address or nil pointer dereference (Kubernetes) #4053

Closed
igormiletic opened this issue Sep 24, 2019 · 4 comments · Fixed by #4084
Labels
area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention. status/confirmed The issue has been triaged but still not reproduced.

Comments

igormiletic commented Sep 24, 2019

Version: v1.1.0

It keeps happening without any clear reason.

Kubernetes 1.13.
The setup follows Dgraph's HA documentation: 3 Alpha nodes and 3 Zero nodes.

As soon as there is serious traffic (about 200 upsert commands per second), after some time the nodes get corrupted and simply stop working.

The error is:

I0923 11:36:27.372346       1 node.go:143] Setting raft.Config to: &{ID:8 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc01110baa0 Applied:25420349 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x22d6ac8 DisableProposalForwarding:false}
W0923 11:36:27.386112       1 pool.go:237] Connection lost with dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local:7080. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local: no such host"
I0923 11:36:27.773401       1 node.go:301] Found Snapshot.Metadata: {ConfState:{Nodes:[6 7 8] Learners:[] XXX_unrecognized:[]} Index:25420349 Term:221 XXX_unrecognized:[]}
I0923 11:36:27.773476       1 node.go:312] Found hardstate: {Term:223 Vote:8 Commit:25428540 XXX_unrecognized:[]}
I0923 11:36:29.541079       1 snapshot.go:185] Got StreamSnapshot request: context:<id:9 group:1 addr:"dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local:7080" > index:25420349 read_ts:29013139 
I0923 11:36:29.541280       1 snapshot.go:124] Waiting to reach timestamp: 29013139
I0923 11:36:29.810351       1 node.go:321] Group 1 found 8192 entries
I0923 11:36:29.810374       1 draft.go:1369] Restarting node for group: 1
I0923 11:36:29.810410       1 node.go:180] Setting conf state to nodes:6 nodes:7 nodes:8 
E0923 11:36:29.815924       1 snapshot.go:187] While streaming snapshot: context canceled. Reporting failure.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x1351fb5]
goroutine 223 [running]:
github.com/dgraph-io/dgraph/worker.(*grpcWorker).StreamSnapshot(0xc0110c7958, 0x1952560, 0xc0111ae660, 0x0, 0x0)
    /tmp/go/src/github.com/dgraph-io/dgraph/worker/snapshot.go:188 +0x1f5
github.com/dgraph-io/dgraph/protos/pb._Worker_StreamSnapshot_Handler(0x16e2b20, 0xc0110c7958, 0x194d8e0, 0xc0118b4000, 0x22d6ac8, 0xc011206200)
    /tmp/go/src/github.com/dgraph-io/dgraph/protos/pb/pb.pb.go:5339 +0xad
github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc000432d80, 0x19539a0, 0xc0112c1800, 0xc011206200, 0xc0110c1ef0, 0x220aac0, 0x0, 0x0, 0x0)
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:1176 +0xacd
github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).handleStream(0xc000432d80, 0x19539a0, 0xc0112c1800, 0xc011206200, 0x0)
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:1256 +0xd3f
github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0112a2490, 0xc000432d80, 0x19539a0, 0xc0112c1800, 0xc011206200)
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:691 +0x9f
created by github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:689 +0xa1
danielmai (Contributor) commented

> Version: v1.0.1

@igormiletic can you confirm the Dgraph version you're running? You can paste the output of dgraph version. The stack trace does not match the code from v1.0.1.


igormiletic commented Sep 25, 2019

Ah, sorry, I was wrong:

Dgraph version : v1.1.0
Dgraph SHA-256 : 7d4294a80f74692695467e2cf17f74648c18087ed7057d798f40e1d3a31d2095
Commit SHA-1 : ef7cdb2
Commit timestamp : 2019-09-04 00:12:51 -0700
Branch : HEAD
Go version : go1.12.7

I have updated the initial post to v1.1.0 as well.

@danielmai danielmai added area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. status/confirmed The issue has been triaged but still not reproduced. labels Sep 25, 2019
@mangalaman93 mangalaman93 added the priority/P0 Critical issue that requires immediate attention. label Sep 25, 2019
mangalaman93 (Contributor) commented

Can't see what could possibly be nil there, because:

  1. n is not nil; we check for it.
  2. Good chance n.Raft() doesn't return nil, but this needs more checking (a sketch of this failure mode follows below).
  3. snap.Context.GetId() is not nil, because we printed it in the logs.
  4. raft.SnapshotFailure is a constant.

cc: @pawanrawal @manishrjain
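
For illustration, here is a minimal, self-contained Go sketch of the failure mode suspected in point 2. This is not Dgraph's actual code: the types node and raftNode and the method ReportSnapshot are stand-ins for whatever snapshot.go:188 invokes on n.Raft(). The point is only that an accessor returning a nil pointer before the Raft node is created produces exactly this kind of panic, even though n itself and the arguments are fine.

package main

// Stand-in for the Raft node type; any field access on a nil *raftNode
// is an invalid memory access at a small offset, as in the trace above.
type raftNode struct {
	id uint64
}

// ReportSnapshot stands in for whatever method snapshot.go:188 calls on n.Raft().
func (r *raftNode) ReportSnapshot(from uint64, failed bool) {
	_ = r.id // nil receiver: this read panics with SIGSEGV
}

type node struct {
	raft *raftNode
}

// Raft mirrors the accessor pattern: nil until the Raft node has been set up.
func (n *node) Raft() *raftNode { return n.raft }

func main() {
	n := &node{} // Raft node not yet created (e.g. still restarting)
	// n is not nil and the arguments are valid, yet this panics with
	// "invalid memory address or nil pointer dereference":
	n.Raft().ReportSnapshot(9, true)
}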


manishrjain commented Sep 27, 2019

I0923 11:36:27.372346 1 node.go:143] Setting raft.Config to: &{ID:8 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc01110baa0 Applied:25420349 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x22d6ac8 DisableProposalForwarding:false}

So this node is number 8, and it already sees 3 peers: Nodes:[6 7 8], i.e. 3 nodes including itself. However, another node, number 9, is trying to join the cluster and is asking this server for a snapshot as it starts up. That call arrives before n.Raft has had time to be properly created, hence the segmentation fault.

I0923 11:36:29.541079 1 snapshot.go:185] Got StreamSnapshot request: context:<id:9 group:1 addr:"dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local:7080" > index:25420349 read_ts:29013139

This node 9 should not be part of the cluster, so you should remove it by calling /removeNode?id=9&group=1 on the Zero leader's HTTP port (6080).

If you can also paste the /state output as seen by Zero, that would be helpful (a sketch of both calls follows below). We can also get on a call to debug this further. CC: @danielmai
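
A minimal sketch of the two HTTP calls suggested above. Only the endpoints /removeNode?id=9&group=1 and /state come from this comment; the Zero hostname below is a placeholder and must be replaced with this cluster's actual Zero leader address.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

// get issues a GET request and prints the status and body.
func get(url string) {
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Printf("GET %s -> %s\n%s\n", url, resp.Status, body)
}

func main() {
	// Placeholder address: substitute the Zero leader's HTTP endpoint (port 6080).
	zero := "http://dgraph-zero-0.dgraph-zero.dgraph-b.svc.cluster.local:6080"

	// Remove the stray node 9 from group 1, as suggested above.
	get(zero + "/removeNode?id=9&group=1")

	// Dump cluster membership as seen by Zero, to attach to this issue.
	get(zero + "/state")
}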
