Request sent to wrong server #2227
Does it enter a crash loop where it keeps on failing with this assert and hence isn't able to become the leader?
Ok, this happens during a query. From the logs, the request was sent to a node serving group 2, and it thought that the predicate was not served by it. Do you know if there could have been a drop of the data or of that predicate around that time?
We do not drop any items from the graph whilst the pipeline is operational; we just drop everything when redeploying to run a fresh test. In a recent crash around 2018/03/14 20:30:34, we see that the affected server recovers as leader of its group, but all further predicate moves are met by "DeadlineExceeded" errors: (server-0: logs_server-0.txt)
(zero: logs_zero.txt)
Other restarts before this seem to recover fine. Logs from other servers:
We were wondering also whether the following is an issue?
The assert here happens after a predicate move in all cases.
What could happen is that the predicate move is still in progress for that predicate when the query arrives.
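To make the race concrete, here is a schematic sketch in Go of the pattern being described, not Dgraph's actual code: if a query arrives for a predicate the node no longer serves (for example because a tablet move is in progress), the node can log a warning and return a retriable error instead of hitting an assert that crashes the process. The predicate names and the `served` map are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// served is a toy stand-in for this node's view of which predicates
// (tablets) it currently serves; the entries are made up for the example.
var served = map[string]bool{"name": true}

// errWrongServer signals a retriable routing problem rather than a fatal one.
var errWrongServer = errors.New("request sent to wrong server; retry the query")

func handleQuery(pred string) error {
	if !served[pred] {
		// Instead of asserting (and crashing), surface a warning and let
		// the caller retry once the predicate move has settled.
		log.Printf("warning: predicate %q is not served by this node (move in progress?)", pred)
		return errWrongServer
	}
	// ... run the query against locally served data ...
	return nil
}

func main() {
	if err := handleQuery("age"); err != nil {
		fmt.Println("caller should retry:", err)
	}
}
```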
The message you were wondering about is not the reason for the crash; it should typically be a temporary issue. I will modify the logs to print it as a warning to make that clear.
Interestingly, not all further predicate moves were met by the error; after that, predicate moves were successful for a while, until it seems the connection somehow got broken and wasn't reconnected. I am checking this.
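On the observation that the connection got broken and was not reconnected: below is a general-purpose gRPC client sketch in Go, not Dgraph's internal code, showing how keepalive pings detect dead connections so the client can transparently re-dial, and how WaitForReady makes calls block until the connection is usable again instead of failing immediately. The address and timing values are illustrative assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func dial(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // plaintext, for illustration only
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping when idle this long
			Timeout:             10 * time.Second, // wait this long for the ack
			PermitWithoutStream: true,
		}),
		// Queue RPCs until the connection is READY rather than failing fast.
		grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
	)
}

func main() {
	conn, err := dial("localhost:9080") // hypothetical server address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Issue RPCs on conn with a bounded context so a broken-but-undetected
	// connection surfaces as DeadlineExceeded rather than hanging forever.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = ctx
}
```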
I have pushed a docker image at dgraph/dgraph:dev. It would be great if you could use it to load some data and share the logs. Otherwise, if you can share some steps to reproduce, like some sample data/scripts, then I am happy to bring up a Kubernetes cluster and debug this at my end. Keen to resolve these issues as soon as possible.
I see, it would be good if we could just retry the query and avoid having the server crash (a client-side retry sketch follows after this comment).
Ah yes, I was not initially clear: there were multiple server crashes beforehand, but that particular crash at 20:30:34 seemed to be what broke things.
We have pulled this image for testing; here are some logs surrounding a crash:
What new log types should there be? We could not spot anything new, but it could be subtle.
We are interested in creating a new sample rig; the current issue is that our pipeline now spans 3 pod types separated by message queues. We will try to come up with a single-pod approximation when possible so no additional requirements are needed.
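Regarding the point above about retrying the query rather than letting the server crash: here is a hypothetical client-side retry wrapper in Go; `queryFn` and the backoff values are assumptions for the sketch, not part of any Dgraph client API. Transient errors (such as a predicate move still being in progress) are retried with exponential backoff.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryFn stands in for whatever function actually runs the query;
// it is an assumption for this sketch, not a real Dgraph client method.
type queryFn func(ctx context.Context, q string) ([]byte, error)

func queryWithRetry(ctx context.Context, run queryFn, q string, attempts int) ([]byte, error) {
	backoff := 100 * time.Millisecond
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := run(ctx, q)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between attempts
		}
	}
	return nil, fmt.Errorf("query failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Fake query function that fails twice before succeeding, to show usage.
	calls := 0
	fake := func(ctx context.Context, q string) ([]byte, error) {
		calls++
		if calls < 3 {
			return nil, errors.New("predicate is being moved, retry")
		}
		return []byte(`{"data":{}}`), nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := queryWithRetry(ctx, fake, "{ q(func: has(name)) { name } }", 5)
	fmt.Println(string(resp), err)
}
```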
Zero: dgraph-zero-0_logs.txt
Could you share full logs from all the servers, please, like before? Specifically, I am looking for a log saying this in the server logs.
Since your crash is for that predicate, I want to see which server got to serve it.
I have also updated the image (dgraph/dgraph:dev) to print more logs for Dgraph Zero (i.e. it logs all tablet proposals that it gets). If you can force-pull it again before doing another run, that would be helpful. I think I have a good idea of what is happening; the logs will help confirm.
Hi, we ran the updated image today. We see server crashes, but we no longer see the cause of the crash in the logs. Here is everything we have:
The only thing I find odd is that the memory usage of dgraph-server-2 spikes (29GB out of 32GB for that node) at around 17:56, and it crashes soon after.
Since there isn't anything in the logs, I suspect these crashes are because of OOM, though the exact reason can only be found out by inspecting the pods. The crashes that you were seeing earlier were because of a race condition; maybe you don't see them now because that condition isn't getting triggered yet. If you run it again, I suspect it should trigger. Also, is it possible to take a heap profile when the memory usage is high and share it? Thanks for the logs; I will see what I can find and update here.
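For the heap profile request: Go servers that mount the standard net/http/pprof handlers expose a `/debug/pprof/heap` endpoint; the sketch below fetches it over HTTP and writes it to a file for `go tool pprof`. The host and port are assumptions; adjust them to wherever the server's HTTP endpoint is reachable.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Binary heap profile, consumable by `go tool pprof heap.pprof`.
	url := "http://localhost:8080/debug/pprof/heap" // hypothetical address
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote heap.pprof; inspect with: go tool pprof heap.pprof")
}
```

Taking the profile while memory usage is high (e.g. when dgraph-server-2 approaches its limit) gives the most useful snapshot.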
A fix for this (Request sent to the wrong server) has been merged to master and is also part of the updated dgraph/dgraph:dev image.
As suggested, creating a new issue to track overflow from: #2172
Summary: