Dgraph cluster stuck #2054

Closed
Levatius opened this issue Jan 26, 2018 · 15 comments
Labels: investigate (Requires further investigation)

Levatius commented Jan 26, 2018

Hi there,

I have a question about my Dgraph cluster (v1.0.2), which seems to freeze: it becomes unable to process any further operations (queries, mutations, commits).

In my cluster, I have three server instances running off a single Zero. This is the log from one of the servers; similar logs can be found on the other two as well:

2018/01/26 15:39:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:111629 Term:2 XXX_unrecognized:[]}, len(data): 73
2018/01/26 15:40:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:139214 Term:2 XXX_unrecognized:[]}, len(data): 73
2018/01/26 15:41:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:153415 Term:2 XXX_unrecognized:[]}, len(data): 73

The zero logs:

2018/01/26 15:41:22 tablet.go:170: size_diff 73270054
2018/01/26 15:41:22 tablet.go:87: Going to move predicate VISITED from 2 to 1
2018/01/26 15:41:22 tablet.go:91: Error while trying to move predicate VISITED from 2 to 1: rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.
2018/01/26 15:49:22 tablet.go:165:

Groups sorted by size: [{gid:1 size:9900094} {gid:3 size:26051113} {gid:2 size:116278626}]

2018/01/26 15:49:22 tablet.go:170: size_diff 106378532
2018/01/26 15:49:22 tablet.go:87: Going to move predicate VISITED from 2 to 1
2018/01/26 16:09:22 tablet.go:91: Error while trying to move predicate VISITED from 2 to 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/26 16:09:22 tablet.go:165:

Groups sorted by size: [{gid:1 size:9900094} {gid:3 size:26051113} {gid:2 size:116278626}]

2018/01/26 16:09:22 tablet.go:170: size_diff 106378532
2018/01/26 16:09:22 tablet.go:87: Going to move predicate VISITED from 2 to 1

It seems that Zero has trouble performing operations between servers while there are transactions in progress on the servers; is that right?

Issues:

  • The cluster remains stuck even after removing the pods that were giving the servers work.
  • We then need to hard-reset the Dgraph database (delete p/, w/, zw/, and the pods) before we can begin submitting work again.
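
For context, the work those pods submit is just small transactional mutations against the servers. The sketch below is a minimal, hypothetical example of that kind of client, not our actual worker code: it uses the dgo Go client with an illustrative localhost endpoint and triple, and it discards (aborts) and retries a transaction when the commit fails, so failed attempts should not stay pending.

package main

import (
    "context"
    "log"
    "time"

    "github.com/dgraph-io/dgo"
    "github.com/dgraph-io/dgo/protos/api"
    "google.golang.org/grpc"
)

func main() {
    // Illustrative endpoint; in the cluster the pods talk to the dgraph-server service instead.
    conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

    // Retry the whole transaction a few times if it fails (e.g. the commit is aborted),
    // discarding the failed attempt so it does not linger as a pending transaction.
    for attempt := 1; attempt <= 3; attempt++ {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        txn := dg.NewTxn()
        _, err = txn.Mutate(ctx, &api.Mutation{
            SetNquads: []byte(`_:visitor <VISITED> _:site .`),
        })
        if err == nil {
            err = txn.Commit(ctx)
        }
        _ = txn.Discard(ctx) // no-op if the commit succeeded
        cancel()
        if err == nil {
            return
        }
        log.Printf("attempt %d failed, retrying: %v", attempt, err)
    }
    log.Fatal("giving up after 3 attempts")
}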
@Levatius (Author) commented:

@jimanvlad

Levatius changed the title from "Dgraph Cluster Stuck" to "Dgraph cluster stuck" on Jan 26, 2018
pawanrawal added the investigate (Requires further investigation) label on Jan 28, 2018
@pawanrawal (Contributor) commented:

Hey @Levatius

I am investigating this. Do you get any errors from your queries or mutations while this is happening, or do they just not return any results?


jimanvlad commented Jan 30, 2018 via email


jimanvlad commented Feb 13, 2018

Hi @pawanrawal - we still have this issue: Zero works fine and then, all of a sudden, it attempts the predicate move shown below and gets completely stuck. We can't read from or write to the servers once it reaches this stage.

2018/02/13 12:40:22 oracle.go:84: purging below ts:397126, len(o.commits):245, len(o.aborts):0
2018/02/13 12:40:32 oracle.go:84: purging below ts:404382, len(o.commits):215, len(o.aborts):4
2018/02/13 12:40:39 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:33631 Term:2 XXX_unrecognized:[]}, len(data): 2018
2018/02/13 12:40:42 oracle.go:84: purging below ts:413017, len(o.commits):221, len(o.aborts):0
2018/02/13 12:40:52 oracle.go:84: purging below ts:420951, len(o.commits):168, len(o.aborts):1
2018/02/13 12:41:02 oracle.go:84: purging below ts:432956, len(o.commits):199, len(o.aborts):1
2018/02/13 12:41:12 oracle.go:84: purging below ts:435447, len(o.commits):228, len(o.aborts):3
2018/02/13 12:41:22 oracle.go:84: purging below ts:436173, len(o.commits):433, len(o.aborts):15
2018/02/13 12:41:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:34655 Term:2 XXX_unrecognized:[]}, len(data): 3124
2018/02/13 12:41:32 oracle.go:84: purging below ts:436519, len(o.commits):337, len(o.aborts):19
2018/02/13 12:41:42 oracle.go:84: purging below ts:437272, len(o.commits):490, len(o.aborts):38
2018/02/13 12:41:52 oracle.go:84: purging below ts:437422, len(o.commits):414, len(o.aborts):86
2018/02/13 12:43:32 tablet.go:173:

=====FINE UNTIL HERE=====

Groups sorted by size: [{gid:3 size:0} {gid:1 size:29813} {gid:2 size:35048857}]

2018/02/13 12:43:32 tablet.go:178: size_diff 35048857
2018/02/13 12:43:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 12:43:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.
2018/02/13 12:43:42 oracle.go:84: purging below ts:444421, len(o.commits):3793, len(o.aborts):279
2018/02/13 12:43:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:38927 Term:2 XXX_unrecognized:[]}, len(data): 3144
2018/02/13 12:51:32 tablet.go:173:

Groups sorted by size: [{gid:3 size:299} {gid:1 size:47775} {gid:2 size:43694896}]

2018/02/13 12:51:32 tablet.go:178: size_diff 43694597
2018/02/13 12:51:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 13:11:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/02/13 13:11:32 tablet.go:173:

Groups sorted by size: [{gid:3 size:299} {gid:1 size:47775} {gid:2 size:43694896}]

2018/02/13 13:11:32 tablet.go:178: size_diff 43694597
2018/02/13 13:11:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 13:31:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/02/13 13:31:32 tablet.go:173:

Groups sorted by size: [{gid:3 size:299} {gid:1 size:47775} {gid:2 size:43694896}]

2018/02/13 13:31:32 tablet.go:178: size_diff 43694597
2018/02/13 13:31:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 13:51:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/02/13 13:51:32 tablet.go:173:

@pawanrawal (Contributor) commented:

I am going to work on this today. Is it possible for you to move to v1.0.3, if you haven't already?


jimanvlad commented Feb 13, 2018 via email

@pawanrawal (Contributor) commented:

Did you start Zero with the --replicas flag, or are all the Dgraph servers serving different groups? And from what I understand, once a predicate move fails you are not able to run any queries/mutations on any of the Dgraph servers, is that right? Is there anything unusual in the Dgraph server logs?


jimanvlad commented Feb 13, 2018 via email

@pawanrawal (Contributor) commented:

This is a bit weird. I simulated the conflict error, but the subsequent predicate move went through smoothly. Do you see any logs like the ones below on the Dgraph servers while the predicate move is happening?

2018/02/14 17:56:53 predicate_move.go:165: Sent 499417 number of keys for predicate _predicate_
2018/02/14 17:57:16 predicate_move.go:165: Sent 320500 number of keys for predicate directed_by
2018/02/14 17:56:41 index.go:942: Dropping predicate: [_predicate_]
2018/02/14 17:56:41 schema.go:73: Deleting schema for predicate: [_predicate_]
2018/02/14 17:56:45 predicate_move.go:52: Writing 205774 keys
2018/02/14 17:56:50 predicate_move.go:52: Writing 204157 keys
2018/02/14 17:56:53 predicate_move.go:52: Writing 89486 keys
2018/02/14 17:56:54 schema.go:244: Setting schema for attr _predicate_: [string], tokenizer: [], directive: NONE, count: false


pawanrawal commented Feb 15, 2018

One more thing: can you please share the output of the /state endpoint on the Zero server when the cluster is stuck?
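
If it helps, something like the rough Go sketch below will fetch and pretty-print it (this assumes Zero's HTTP endpoint is reachable on the default port 6080; a plain HTTP GET of /state is all that is needed):

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    // Assumes Zero's HTTP endpoint is reachable on the default port 6080.
    resp, err := http.Get("http://localhost:6080/state")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Pretty-print the JSON so group membership and tablet sizes are easy to read.
    var out bytes.Buffer
    if err := json.Indent(&out, body, "", "  "); err != nil {
        log.Fatal(err)
    }
    fmt.Println(out.String())
}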


jimanvlad commented Mar 1, 2018

Hi, sorry for the late reply. The server is stuck again now (in the 3-node Kubernetes setup - see #2172). Output from /state below:

{"counter":"6502","groups":{"1":{"members":{"1":{"id":"1","groupId":1,"addr":"dgraph-server-0.dgraph-server.default.svc.cluster.local:7080","leader":true,"lastUpdate":"1519925832"}},"tablets":{"CONTAINS":{"groupId":1,"predicate":"CONTAINS","space":"3220"},"LINKED_WITH":{"groupId":1,"predicate":"LINKED_WITH","space":"1958"},"SEEN_ON":{"groupId":1,"predicate":"SEEN_ON","space":"6132"},"SEEN_WITH":{"groupId":1,"predicate":"SEEN_WITH","space":"3094"},"STARTED":{"groupId":1,"predicate":"STARTED","space":"2346"},"SUBSITE_OF":{"groupId":1,"predicate":"SUBSITE_OF","space":"1430"},"VISITED":{"groupId":1,"predicate":"VISITED","space":"2189"},"_basket_created":{"groupId":1,"predicate":"_basket_created","space":"498"},"_basket_last_checked":{"groupId":1,"predicate":"_basket_last_checked","space":"983"},"_basket_last_updated":{"groupId":1,"predicate":"_basket_last_updated","space":"1046"},"_basket_xid":{"groupId":1,"predicate":"_basket_xid","space":"274693"},"_checkout_last_checked":{"groupId":1,"predicate":"_checkout_last_checked","space":"884"},"_checkout_xid":{"groupId":1,"predicate":"_checkout_xid","space":"145389"},"_device_last_checked":{"groupId":1,"predicate":"_device_last_checked","space":"293"},"_device_last_updated":{"groupId":1,"predicate":"_device_last_updated","space":"342"},"_device_xid":{"groupId":1,"predicate":"_device_xid","space":"458"},"_elysium_account_created":{"groupId":1,"predicate":"_elysium_account_created","space":"543"},"_elysium_account_last_checked":{"groupId":1,"predicate":"_elysium_account_last_checked","space":"940"},"_elysium_account_last_updated":{"groupId":1,"predicate":"_elysium_account_last_updated","space":"1012"},"_elysium_account_xid":{"groupId":1,"predicate":"_elysium_account_xid","space":"87032"},"_elysium_cookie_created":{"groupId":1,"predicate":"_elysium_cookie_created","space":"538"},"_elysium_cookie_last_checked":{"groupId":1,"predicate":"_elysium_cookie_last_checked","space":"1023"},"_elysium_cookie_last_updated":{"groupId":1,"predicate":"_elysium_cookie_last_updated","space":"1094"},"_elysium_cookie_xid":{"groupId":1,"predicate":"_elysium_cookie_xid","space":"4929179"},"_ga_account_created":{"groupId":1,"predicate":"_ga_account_created","space":"538"},"_ga_account_last_checked":{"groupId":1,"predicate":"_ga_account_last_checked","space":"1177"},"_ga_account_last_updated":{"groupId":1,"predicate":"_ga_account_last_updated","space":"1198"},"_ga_account_xid":{"groupId":1,"predicate":"_ga_account_xid","space":"11964"},"_ga_session_created":{"groupId":1,"predicate":"_ga_session_created","space":"1088"},"_ga_session_last_checked":{"groupId":1,"predicate":"_ga_session_last_checked","space":"1756"},"_ga_session_last_updated":{"groupId":1,"predicate":"_ga_session_last_updated","space":"1890"},"_ga_session_xid":{"groupId":1,"predicate":"_ga_session_xid","space":"4604053"},"_ip_created":{"groupId":1,"predicate":"_ip_created","space":"478"},"_ip_last_checked":{"groupId":1,"predicate":"_ip_last_checked","space":"915"},"_ip_last_updated":{"groupId":1,"predicate":"_ip_last_updated","space":"1006"},"_ip_xid":{"groupId":1,"predicate":"_ip_xid","space":"1292660"},"_locale_last_checked":{"groupId":1,"predicate":"_locale_last_checked","space":"906"},"_locale_last_updated":{"groupId":1,"predicate":"_locale_last_updated","space":"942"},"_locale_xid":{"groupId":1,"predicate":"_locale_xid","space":"4739"},"_master_product_created":{"groupId":1,"predicate":"_master_product_created","space":"1128"},"_master_product_last_checked":{"groupId":1,"predicate":"_master_product_last_check
ed","space":"2253"},"_master_product_last_updated":{"groupId":1,"predicate":"_master_product_last_updated","space":"2379"},"_master_product_xid":{"groupId":1,"predicate":"_master_product_xid","space":"2611880"},"_product_created":{"groupId":1,"predicate":"_product_created","space":"947"},"_product_last_checked":{"groupId":1,"predicate":"_product_last_checked","space":"1804"},"_product_last_updated":{"groupId":1,"predicate":"_product_last_updated","space":"1868"},"_product_xid":{"groupId":1,"predicate":"_product_xid","space":"265901"},"_site_created":{"groupId":1,"predicate":"_site_created","space":"488"},"_site_last_checked":{"groupId":1,"predicate":"_site_last_checked","space":"997"},"_site_last_updated":{"groupId":1,"predicate":"_site_last_updated","space":"1091"},"_site_xid":{"groupId":1,"predicate":"_site_xid","space":"3491"},"dummy":{"groupId":1,"predicate":"dummy","space":"17"},"ga_visit_id":{"groupId":1,"predicate":"ga_visit_id","space":"1897"},"timestamp":{"groupId":1,"predicate":"timestamp","space":"1114"}}},"2":{"members":{"2":{"id":"2","groupId":2,"addr":"dgraph-server-1.dgraph-server.default.svc.cluster.local:7080","leader":true,"lastUpdate":"1519925825"}},"tablets":{"_predicate_":{"groupId":2,"predicate":"_predicate_","space":"15361725"}}},"3":{"members":{"3":{"id":"3","groupId":3,"addr":"dgraph-server-2.dgraph-server.default.svc.cluster.local:7080","leader":true,"lastUpdate":"1519925827"}},"tablets":{"BELONGS_TO":{"groupId":3,"predicate":"BELONGS_TO","space":"5774"},"CONTRIBUTED_TO":{"groupId":3,"predicate":"CONTRIBUTED_TO","space":"3794"},"_affiliate_created":{"groupId":3,"predicate":"_affiliate_created","space":"1272"},"_affiliate_last_checked":{"groupId":3,"predicate":"_affiliate_last_checked","space":"2416"},"_affiliate_last_updated":{"groupId":3,"predicate":"_affiliate_last_updated","space":"2881"},"_affiliate_xid":{"groupId":3,"predicate":"_affiliate_xid","space":"555176"},"_checkout_created":{"groupId":3,"predicate":"_checkout_created","space":"420"},"_checkout_last_updated":{"groupId":3,"predicate":"_checkout_last_updated","space":"791"},"_locale_created":{"groupId":3,"predicate":"_locale_created","space":"309"},"_network_last_checked":{"groupId":3,"predicate":"_network_last_checked","space":"250"},"_network_last_updated":{"groupId":3,"predicate":"_network_last_updated","space":"450"},"_network_xid":{"groupId":3,"predicate":"_network_xid","space":"220"},"_order_created":{"groupId":3,"predicate":"_order_created","space":"913"},"_order_last_checked":{"groupId":3,"predicate":"_order_last_checked","space":"1692"},"_order_last_updated":{"groupId":3,"predicate":"_order_last_updated","space":"2106"},"_order_xid":{"groupId":3,"predicate":"_order_xid","space":"14909253"},"_programme_last_checked":{"groupId":3,"predicate":"_programme_last_checked","space":"726"},"_programme_last_updated":{"groupId":3,"predicate":"_programme_last_updated","space":"976"},"_programme_xid":{"groupId":3,"predicate":"_programme_xid","space":"2092"},"affiliate_name":{"groupId":3,"predicate":"affiliate_name","space":"1411"},"clicks":{"groupId":3,"predicate":"clicks","space":"552"},"impressions":{"groupId":3,"predicate":"impressions","space":"592"},"network_name":{"groupId":3,"predicate":"network_name","space":"175"},"programme_name":{"groupId":3,"predicate":"programme_name","space":"513"},"promotion_method":{"groupId":3,"predicate":"promotion_method","space":"1233"}}}},"zeros":{"1":{"id":"1","addr":"dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080","leader":true}},"maxLeaseId":"250000","maxTxnTs
":"20000","maxRaftId":"3"}

@pawanrawal (Contributor) commented:

Cross-posting from issue #2172:

Since you have three servers, each serving a different group, if one of them goes down then all mutations which touch the predicates on that server will get stuck. What we have to figure out is why the server could not become a leader after a restart. You could also mitigate this problem by having replicas.

Did the cluster get stuck when one of the servers was down?
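
For reference, your /state dump shows exactly one member per group, which is consistent with running without replicas. With the --replicas flag set, each group should list more than one member, so counting members per group is a quick way to confirm whether replication is in effect. A rough Go sketch of that check against a live Zero (again assuming its HTTP endpoint on the default port 6080, and decoding only the fields needed here):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// Only the parts of Zero's /state output needed for this check.
type state struct {
    Groups map[string]struct {
        Members map[string]struct {
            Addr   string `json:"addr"`
            Leader bool   `json:"leader"`
        } `json:"members"`
    } `json:"groups"`
}

func main() {
    // Assumes Zero's HTTP endpoint is reachable on the default port 6080.
    resp, err := http.Get("http://localhost:6080/state")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var s state
    if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
        log.Fatal(err)
    }

    // With replication enabled, each group should list more than one member.
    for gid, g := range s.Groups {
        fmt.Printf("group %s: %d member(s)\n", gid, len(g.Members))
        for _, m := range g.Members {
            fmt.Printf("  %s (leader=%v)\n", m.Addr, m.Leader)
        }
    }
}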


jimanvlad commented Mar 14, 2018 via email

@pawanrawal (Contributor) commented:

I am keen to solve this issue; it would be great if you could share logs from a recent run.

manishrjain added the investigate (Requires further investigation) label on Mar 22, 2018
@Levatius (Author) commented:

I've raised a follow-up issue (#2254); we can probably close this one and move the discussion there.
