Invalid memory address or nil pointer dereference (Kubernetes) #4053

Closed
igormiletic opened this issue Sep 24, 2019 · 4 comments · Fixed by #4084
Labels
area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention. status/confirmed The issue has been triaged but still not reproduced.

Comments

igormiletic commented Sep 24, 2019

Version: v1.1.0

It keeps happening without any clear reason.

Kubernetes 1.13.
The setup follows Dgraph's HA documentation: 3 Alpha nodes and 3 Zero nodes.

As soon as there is serious traffic (about 200 upsert commands per second), after some time the nodes get corrupted and simply stop working.

The error is:

I0923 11:36:27.372346       1 node.go:143] Setting raft.Config to: &{ID:8 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc01110baa0 Applied:25420349 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x22d6ac8 DisableProposalForwarding:false}
W0923 11:36:27.386112       1 pool.go:237] Connection lost with dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local:7080. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local: no such host"
I0923 11:36:27.773401       1 node.go:301] Found Snapshot.Metadata: {ConfState:{Nodes:[6 7 8] Learners:[] XXX_unrecognized:[]} Index:25420349 Term:221 XXX_unrecognized:[]}
I0923 11:36:27.773476       1 node.go:312] Found hardstate: {Term:223 Vote:8 Commit:25428540 XXX_unrecognized:[]}
I0923 11:36:29.541079       1 snapshot.go:185] Got StreamSnapshot request: context:<id:9 group:1 addr:"dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local:7080" > index:25420349 read_ts:29013139 
I0923 11:36:29.541280       1 snapshot.go:124] Waiting to reach timestamp: 29013139
I0923 11:36:29.810351       1 node.go:321] Group 1 found 8192 entries
I0923 11:36:29.810374       1 draft.go:1369] Restarting node for group: 1
I0923 11:36:29.810410       1 node.go:180] Setting conf state to nodes:6 nodes:7 nodes:8 
E0923 11:36:29.815924       1 snapshot.go:187] While streaming snapshot: context canceled. Reporting failure.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x1351fb5]
goroutine 223 [running]:
github.com/dgraph-io/dgraph/worker.(*grpcWorker).StreamSnapshot(0xc0110c7958, 0x1952560, 0xc0111ae660, 0x0, 0x0)
    /tmp/go/src/github.com/dgraph-io/dgraph/worker/snapshot.go:188 +0x1f5
github.com/dgraph-io/dgraph/protos/pb._Worker_StreamSnapshot_Handler(0x16e2b20, 0xc0110c7958, 0x194d8e0, 0xc0118b4000, 0x22d6ac8, 0xc011206200)
    /tmp/go/src/github.com/dgraph-io/dgraph/protos/pb/pb.pb.go:5339 +0xad
github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc000432d80, 0x19539a0, 0xc0112c1800, 0xc011206200, 0xc0110c1ef0, 0x220aac0, 0x0, 0x0, 0x0)
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:1176 +0xacd
github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).handleStream(0xc000432d80, 0x19539a0, 0xc0112c1800, 0xc011206200, 0x0)
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:1256 +0xd3f
github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0112a2490, 0xc000432d80, 0x19539a0, 0xc0112c1800, 0xc011206200)
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:691 +0x9f
created by github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
    /tmp/go/src/github.com/dgraph-io/dgraph/vendor/google.golang.org/grpc/server.go:689 +0xa1
danielmai (Contributor) commented

> Version: v1.0.1

@igormiletic can you confirm the Dgraph version you're running? You can paste the output of dgraph version. The stack trace does not match the code from v1.0.1.


igormiletic commented Sep 25, 2019

Ah, sorry, I was wrong:

Dgraph version : v1.1.0
Dgraph SHA-256 : 7d4294a80f74692695467e2cf17f74648c18087ed7057d798f40e1d3a31d2095
Commit SHA-1 : ef7cdb2
Commit timestamp : 2019-09-04 00:12:51 -0700
Branch : HEAD
Go version : go1.12.7

I have updated the initial post to v1.1.0 as well.

@danielmai danielmai added area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. status/confirmed The issue has been triaged but still not reproduced. labels Sep 25, 2019
@mangalaman93 mangalaman93 added the priority/P0 Critical issue that requires immediate attention. label Sep 25, 2019
mangalaman93 (Contributor) commented

Can't see what could possibly be nil there, because:

  1. n is not nil; we check for it.
  2. Good chance n.Raft() doesn't return nil, but this needs more checking (a sketch of this failure mode follows below).
  3. snap.Context.GetId() is not nil, because we printed it in the logs.
  4. raft.SnapshotFailure is a constant.

cc: @pawanrawal @manishrjain
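
For illustration, here is a minimal, self-contained Go sketch of the failure mode suspected in point 2. This is not Dgraph's actual code: the types node and raftNode and the method ReportSnapshot are stand-ins for whatever snapshot.go:188 invokes on n.Raft(). The point is only that an accessor returning a nil pointer before the Raft node is created produces exactly this kind of panic, even though n itself and the arguments are fine.

package main

// Stand-in for the Raft node type; any field access on a nil *raftNode
// is an invalid memory access at a small offset, as in the trace above.
type raftNode struct {
	id uint64
}

// ReportSnapshot stands in for whatever method snapshot.go:188 calls on n.Raft().
func (r *raftNode) ReportSnapshot(from uint64, failed bool) {
	_ = r.id // nil receiver: this read panics with SIGSEGV
}

type node struct {
	raft *raftNode
}

// Raft mirrors the accessor pattern: nil until the Raft node has been set up.
func (n *node) Raft() *raftNode { return n.raft }

func main() {
	n := &node{} // Raft node not yet created (e.g. still restarting)
	// n is not nil and the arguments are valid, yet this panics with
	// "invalid memory address or nil pointer dereference":
	n.Raft().ReportSnapshot(9, true)
}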


manishrjain commented Sep 27, 2019

I0923 11:36:27.372346 1 node.go:143] Setting raft.Config to: &{ID:8 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc01110baa0 Applied:25420349 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x22d6ac8 DisableProposalForwarding:false}

So this node is number 8, and it already sees 3 peers: Nodes:[6 7 8], i.e. 3 nodes including itself. However, another node, number 9, is trying to join the cluster and is asking this server for a snapshot as it starts up. That call arrives before n.Raft has had time to be properly created, hence the segmentation fault.

I0923 11:36:29.541079 1 snapshot.go:185] Got StreamSnapshot request: context:<id:9 group:1 addr:"dgraph-alpha-0.dgraph-alpha.dgraph-b.svc.cluster.local:7080" > index:25420349 read_ts:29013139

This node 9 should not be part of the cluster, so you should remove it by calling /removeNode?id=9&group=1 on the Zero leader's HTTP port (6080).

If you can also paste the /state output as seen by Zero, that would be helpful (a sketch of both calls follows below). We can also get on a call to debug this further. CC: @danielmai
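
A minimal sketch of the two HTTP calls suggested above. Only the endpoints /removeNode?id=9&group=1 and /state come from this comment; the Zero hostname below is a placeholder and must be replaced with this cluster's actual Zero leader address.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

// get issues a GET request and prints the status and body.
func get(url string) {
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Printf("GET %s -> %s\n%s\n", url, resp.Status, body)
}

func main() {
	// Placeholder address: substitute the Zero leader's HTTP endpoint (port 6080).
	zero := "http://dgraph-zero-0.dgraph-zero.dgraph-b.svc.cluster.local:6080"

	// Remove the stray node 9 from group 1, as suggested above.
	get(zero + "/removeNode?id=9&group=1")

	// Dump cluster membership as seen by Zero, to attach to this issue.
	get(zero + "/state")
}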
