Dgraph cluster stuck #2054

Closed
Levatius opened this issue Jan 26, 2018 · 15 comments
Labels: investigate (Requires further investigation)

Levatius commented Jan 26, 2018

Hi there,

I have a question about my Dgraph cluster (v1.0.2), which seems to freeze: it becomes unable to process any further operations (queries, mutations, commits).

In my cluster, I have three server instances running off a single Zero. This is the log from one of the servers; similar logs can be found on the other two as well:

2018/01/26 15:39:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:111629 Term:2 XXX_unrecognized:[]}, len(data): 73
2018/01/26 15:40:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:139214 Term:2 XXX_unrecognized:[]}, len(data): 73
2018/01/26 15:41:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:153415 Term:2 XXX_unrecognized:[]}, len(data): 73

The zero logs:

2018/01/26 15:41:22 tablet.go:170: size_diff 73270054
2018/01/26 15:41:22 tablet.go:87: Going to move predicate VISITED from 2 to 1
2018/01/26 15:41:22 tablet.go:91: Error while trying to move predicate VISITED from 2 to 1: rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.
2018/01/26 15:49:22 tablet.go:165:

Groups sorted by size: [{gid:1 size:9900094} {gid:3 size:26051113} {gid:2 size:116278626}]

2018/01/26 15:49:22 tablet.go:170: size_diff 106378532
2018/01/26 15:49:22 tablet.go:87: Going to move predicate VISITED from 2 to 1
2018/01/26 16:09:22 tablet.go:91: Error while trying to move predicate VISITED from 2 to 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/26 16:09:22 tablet.go:165:

Groups sorted by size: [{gid:1 size:9900094} {gid:3 size:26051113} {gid:2 size:116278626}]

2018/01/26 16:09:22 tablet.go:170: size_diff 106378532
2018/01/26 16:09:22 tablet.go:87: Going to move predicate VISITED from 2 to 1

It seems that Zero has trouble performing operations between servers while there are transactions in progress on the servers; is that right?

Issues:

  • The cluster remains stuck even after removing the pods that were giving the servers work.
  • We then need to hard-reset the Dgraph database (delete p/, w/, zw/, and the pods) before we can begin submitting work again.
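
For context, the work those pods submit is just small transactional mutations against the servers. The sketch below is a minimal, hypothetical example of that kind of client, not our actual worker code: it uses the dgo Go client with an illustrative localhost endpoint and triple, and it discards (aborts) and retries a transaction when the commit fails, so failed attempts should not stay pending.

package main

import (
    "context"
    "log"
    "time"

    "github.com/dgraph-io/dgo"
    "github.com/dgraph-io/dgo/protos/api"
    "google.golang.org/grpc"
)

func main() {
    // Illustrative endpoint; in the cluster the pods talk to the dgraph-server service instead.
    conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

    // Retry the whole transaction a few times if it fails (e.g. the commit is aborted),
    // discarding the failed attempt so it does not linger as a pending transaction.
    for attempt := 1; attempt <= 3; attempt++ {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        txn := dg.NewTxn()
        _, err = txn.Mutate(ctx, &api.Mutation{
            SetNquads: []byte(`_:visitor <VISITED> _:site .`),
        })
        if err == nil {
            err = txn.Commit(ctx)
        }
        _ = txn.Discard(ctx) // no-op if the commit succeeded
        cancel()
        if err == nil {
            return
        }
        log.Printf("attempt %d failed, retrying: %v", attempt, err)
    }
    log.Fatal("giving up after 3 attempts")
}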
@Levatius (Author) commented:

@jimanvlad

Levatius changed the title from "Dgraph Cluster Stuck" to "Dgraph cluster stuck" on Jan 26, 2018
pawanrawal added the investigate (Requires further investigation) label on Jan 28, 2018
@pawanrawal (Contributor) commented:

Hey @Levatius

I am investigating this. Do you get any errors from your queries or mutations while this is happening, or do they just not return any results?


jimanvlad commented Jan 30, 2018 via email


jimanvlad commented Feb 13, 2018

Hi @pawanrawal - we still have this issue: Zero works fine and then, all of a sudden, it attempts the predicate move shown below and gets completely stuck. We can't read from or write to the servers once it reaches this stage.

2018/02/13 12:40:22 oracle.go:84: purging below ts:397126, len(o.commits):245, len(o.aborts):0
2018/02/13 12:40:32 oracle.go:84: purging below ts:404382, len(o.commits):215, len(o.aborts):4
2018/02/13 12:40:39 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:33631 Term:2 XXX_unrecognized:[]}, len(data): 2018
2018/02/13 12:40:42 oracle.go:84: purging below ts:413017, len(o.commits):221, len(o.aborts):0
2018/02/13 12:40:52 oracle.go:84: purging below ts:420951, len(o.commits):168, len(o.aborts):1
2018/02/13 12:41:02 oracle.go:84: purging below ts:432956, len(o.commits):199, len(o.aborts):1
2018/02/13 12:41:12 oracle.go:84: purging below ts:435447, len(o.commits):228, len(o.aborts):3
2018/02/13 12:41:22 oracle.go:84: purging below ts:436173, len(o.commits):433, len(o.aborts):15
2018/02/13 12:41:25 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:34655 Term:2 XXX_unrecognized:[]}, len(data): 3124
2018/02/13 12:41:32 oracle.go:84: purging below ts:436519, len(o.commits):337, len(o.aborts):19
2018/02/13 12:41:42 oracle.go:84: purging below ts:437272, len(o.commits):490, len(o.aborts):38
2018/02/13 12:41:52 oracle.go:84: purging below ts:437422, len(o.commits):414, len(o.aborts):86
2018/02/13 12:43:32 tablet.go:173:

=====FINE UNTIL HERE=====

Groups sorted by size: [{gid:3 size:0} {gid:1 size:29813} {gid:2 size:35048857}]

2018/02/13 12:43:32 tablet.go:178: size_diff 35048857
2018/02/13 12:43:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 12:43:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.
2018/02/13 12:43:42 oracle.go:84: purging below ts:444421, len(o.commits):3793, len(o.aborts):279
2018/02/13 12:43:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:38927 Term:2 XXX_unrecognized:[]}, len(data): 3144
2018/02/13 12:51:32 tablet.go:173:

Groups sorted by size: [{gid:3 size:299} {gid:1 size:47775} {gid:2 size:43694896}]

2018/02/13 12:51:32 tablet.go:178: size_diff 43694597
2018/02/13 12:51:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 13:11:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/02/13 13:11:32 tablet.go:173:

Groups sorted by size: [{gid:3 size:299} {gid:1 size:47775} {gid:2 size:43694896}]

2018/02/13 13:11:32 tablet.go:178: size_diff 43694597
2018/02/13 13:11:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 13:31:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/02/13 13:31:32 tablet.go:173:

Groups sorted by size: [{gid:3 size:299} {gid:1 size:47775} {gid:2 size:43694896}]

2018/02/13 13:31:32 tablet.go:178: size_diff 43694597
2018/02/13 13:31:32 tablet.go:71: Going to move predicate _predicate_ from 2 to 3
2018/02/13 13:51:32 tablet.go:64: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/02/13 13:51:32 tablet.go:173:

@pawanrawal (Contributor) commented:

I am going to work on this today. Is it possible for you to move to v1.0.3, if you haven't already?


jimanvlad commented Feb 13, 2018 via email

@pawanrawal (Contributor) commented:

Did you start Zero with the --replicas flag, or are all the Dgraph servers serving different groups? And from what I understand, once a predicate move fails you are not able to run any queries/mutations on any of the Dgraph servers, is that right? Is there anything unusual in the Dgraph server logs?


jimanvlad commented Feb 13, 2018 via email

@pawanrawal (Contributor) commented:

This is a bit weird. I simulated the conflict error, but the subsequent predicate move went through smoothly. Do you see any logs like the ones below on the Dgraph servers while the predicate move is happening?

2018/02/14 17:56:53 predicate_move.go:165: Sent 499417 number of keys for predicate _predicate_
2018/02/14 17:57:16 predicate_move.go:165: Sent 320500 number of keys for predicate directed_by
2018/02/14 17:56:41 index.go:942: Dropping predicate: [_predicate_]
2018/02/14 17:56:41 schema.go:73: Deleting schema for predicate: [_predicate_]
2018/02/14 17:56:45 predicate_move.go:52: Writing 205774 keys
2018/02/14 17:56:50 predicate_move.go:52: Writing 204157 keys
2018/02/14 17:56:53 predicate_move.go:52: Writing 89486 keys
2018/02/14 17:56:54 schema.go:244: Setting schema for attr _predicate_: [string], tokenizer: [], directive: NONE, count: false


pawanrawal commented Feb 15, 2018

One more thing: can you please share the output of the /state endpoint on the Zero server when the cluster is stuck?
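
If it helps, something like the rough Go sketch below will fetch and pretty-print it (this assumes Zero's HTTP endpoint is reachable on the default port 6080; a plain HTTP GET of /state is all that is needed):

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    // Assumes Zero's HTTP endpoint is reachable on the default port 6080.
    resp, err := http.Get("http://localhost:6080/state")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Pretty-print the JSON so group membership and tablet sizes are easy to read.
    var out bytes.Buffer
    if err := json.Indent(&out, body, "", "  "); err != nil {
        log.Fatal(err)
    }
    fmt.Println(out.String())
}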


jimanvlad commented Mar 1, 2018

Hi, sorry for the late reply. The server is stuck again now (in the 3-node Kubernetes setup - see #2172). Output from /state below:

{"counter":"6502","groups":{"1":{"members":{"1":{"id":"1","groupId":1,"addr":"dgraph-server-0.dgraph-server.default.svc.cluster.local:7080","leader":true,"lastUpdate":"1519925832"}},"tablets":{"CONTAINS":{"groupId":1,"predicate":"CONTAINS","space":"3220"},"LINKED_WITH":{"groupId":1,"predicate":"LINKED_WITH","space":"1958"},"SEEN_ON":{"groupId":1,"predicate":"SEEN_ON","space":"6132"},"SEEN_WITH":{"groupId":1,"predicate":"SEEN_WITH","space":"3094"},"STARTED":{"groupId":1,"predicate":"STARTED","space":"2346"},"SUBSITE_OF":{"groupId":1,"predicate":"SUBSITE_OF","space":"1430"},"VISITED":{"groupId":1,"predicate":"VISITED","space":"2189"},"_basket_created":{"groupId":1,"predicate":"_basket_created","space":"498"},"_basket_last_checked":{"groupId":1,"predicate":"_basket_last_checked","space":"983"},"_basket_last_updated":{"groupId":1,"predicate":"_basket_last_updated","space":"1046"},"_basket_xid":{"groupId":1,"predicate":"_basket_xid","space":"274693"},"_checkout_last_checked":{"groupId":1,"predicate":"_checkout_last_checked","space":"884"},"_checkout_xid":{"groupId":1,"predicate":"_checkout_xid","space":"145389"},"_device_last_checked":{"groupId":1,"predicate":"_device_last_checked","space":"293"},"_device_last_updated":{"groupId":1,"predicate":"_device_last_updated","space":"342"},"_device_xid":{"groupId":1,"predicate":"_device_xid","space":"458"},"_elysium_account_created":{"groupId":1,"predicate":"_elysium_account_created","space":"543"},"_elysium_account_last_checked":{"groupId":1,"predicate":"_elysium_account_last_checked","space":"940"},"_elysium_account_last_updated":{"groupId":1,"predicate":"_elysium_account_last_updated","space":"1012"},"_elysium_account_xid":{"groupId":1,"predicate":"_elysium_account_xid","space":"87032"},"_elysium_cookie_created":{"groupId":1,"predicate":"_elysium_cookie_created","space":"538"},"_elysium_cookie_last_checked":{"groupId":1,"predicate":"_elysium_cookie_last_checked","space":"1023"},"_elysium_cookie_last_updated":{"groupId":1,"predicate":"_elysium_cookie_last_updated","space":"1094"},"_elysium_cookie_xid":{"groupId":1,"predicate":"_elysium_cookie_xid","space":"4929179"},"_ga_account_created":{"groupId":1,"predicate":"_ga_account_created","space":"538"},"_ga_account_last_checked":{"groupId":1,"predicate":"_ga_account_last_checked","space":"1177"},"_ga_account_last_updated":{"groupId":1,"predicate":"_ga_account_last_updated","space":"1198"},"_ga_account_xid":{"groupId":1,"predicate":"_ga_account_xid","space":"11964"},"_ga_session_created":{"groupId":1,"predicate":"_ga_session_created","space":"1088"},"_ga_session_last_checked":{"groupId":1,"predicate":"_ga_session_last_checked","space":"1756"},"_ga_session_last_updated":{"groupId":1,"predicate":"_ga_session_last_updated","space":"1890"},"_ga_session_xid":{"groupId":1,"predicate":"_ga_session_xid","space":"4604053"},"_ip_created":{"groupId":1,"predicate":"_ip_created","space":"478"},"_ip_last_checked":{"groupId":1,"predicate":"_ip_last_checked","space":"915"},"_ip_last_updated":{"groupId":1,"predicate":"_ip_last_updated","space":"1006"},"_ip_xid":{"groupId":1,"predicate":"_ip_xid","space":"1292660"},"_locale_last_checked":{"groupId":1,"predicate":"_locale_last_checked","space":"906"},"_locale_last_updated":{"groupId":1,"predicate":"_locale_last_updated","space":"942"},"_locale_xid":{"groupId":1,"predicate":"_locale_xid","space":"4739"},"_master_product_created":{"groupId":1,"predicate":"_master_product_created","space":"1128"},"_master_product_last_checked":{"groupId":1,"predicate":"_master_product_last_check
ed","space":"2253"},"_master_product_last_updated":{"groupId":1,"predicate":"_master_product_last_updated","space":"2379"},"_master_product_xid":{"groupId":1,"predicate":"_master_product_xid","space":"2611880"},"_product_created":{"groupId":1,"predicate":"_product_created","space":"947"},"_product_last_checked":{"groupId":1,"predicate":"_product_last_checked","space":"1804"},"_product_last_updated":{"groupId":1,"predicate":"_product_last_updated","space":"1868"},"_product_xid":{"groupId":1,"predicate":"_product_xid","space":"265901"},"_site_created":{"groupId":1,"predicate":"_site_created","space":"488"},"_site_last_checked":{"groupId":1,"predicate":"_site_last_checked","space":"997"},"_site_last_updated":{"groupId":1,"predicate":"_site_last_updated","space":"1091"},"_site_xid":{"groupId":1,"predicate":"_site_xid","space":"3491"},"dummy":{"groupId":1,"predicate":"dummy","space":"17"},"ga_visit_id":{"groupId":1,"predicate":"ga_visit_id","space":"1897"},"timestamp":{"groupId":1,"predicate":"timestamp","space":"1114"}}},"2":{"members":{"2":{"id":"2","groupId":2,"addr":"dgraph-server-1.dgraph-server.default.svc.cluster.local:7080","leader":true,"lastUpdate":"1519925825"}},"tablets":{"_predicate_":{"groupId":2,"predicate":"_predicate_","space":"15361725"}}},"3":{"members":{"3":{"id":"3","groupId":3,"addr":"dgraph-server-2.dgraph-server.default.svc.cluster.local:7080","leader":true,"lastUpdate":"1519925827"}},"tablets":{"BELONGS_TO":{"groupId":3,"predicate":"BELONGS_TO","space":"5774"},"CONTRIBUTED_TO":{"groupId":3,"predicate":"CONTRIBUTED_TO","space":"3794"},"_affiliate_created":{"groupId":3,"predicate":"_affiliate_created","space":"1272"},"_affiliate_last_checked":{"groupId":3,"predicate":"_affiliate_last_checked","space":"2416"},"_affiliate_last_updated":{"groupId":3,"predicate":"_affiliate_last_updated","space":"2881"},"_affiliate_xid":{"groupId":3,"predicate":"_affiliate_xid","space":"555176"},"_checkout_created":{"groupId":3,"predicate":"_checkout_created","space":"420"},"_checkout_last_updated":{"groupId":3,"predicate":"_checkout_last_updated","space":"791"},"_locale_created":{"groupId":3,"predicate":"_locale_created","space":"309"},"_network_last_checked":{"groupId":3,"predicate":"_network_last_checked","space":"250"},"_network_last_updated":{"groupId":3,"predicate":"_network_last_updated","space":"450"},"_network_xid":{"groupId":3,"predicate":"_network_xid","space":"220"},"_order_created":{"groupId":3,"predicate":"_order_created","space":"913"},"_order_last_checked":{"groupId":3,"predicate":"_order_last_checked","space":"1692"},"_order_last_updated":{"groupId":3,"predicate":"_order_last_updated","space":"2106"},"_order_xid":{"groupId":3,"predicate":"_order_xid","space":"14909253"},"_programme_last_checked":{"groupId":3,"predicate":"_programme_last_checked","space":"726"},"_programme_last_updated":{"groupId":3,"predicate":"_programme_last_updated","space":"976"},"_programme_xid":{"groupId":3,"predicate":"_programme_xid","space":"2092"},"affiliate_name":{"groupId":3,"predicate":"affiliate_name","space":"1411"},"clicks":{"groupId":3,"predicate":"clicks","space":"552"},"impressions":{"groupId":3,"predicate":"impressions","space":"592"},"network_name":{"groupId":3,"predicate":"network_name","space":"175"},"programme_name":{"groupId":3,"predicate":"programme_name","space":"513"},"promotion_method":{"groupId":3,"predicate":"promotion_method","space":"1233"}}}},"zeros":{"1":{"id":"1","addr":"dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080","leader":true}},"maxLeaseId":"250000","maxTxnTs
":"20000","maxRaftId":"3"}

@pawanrawal (Contributor) commented:

Cross-posting from issue #2172:

Since you have three servers, each serving a different group, if one of them goes down then all mutations which touch the predicates on that server will get stuck. What we have to figure out is why the server could not become a leader after a restart. You could also mitigate this problem by having replicas.

Did the cluster get stuck when one of the servers was down?
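
For reference, your /state dump shows exactly one member per group, which is consistent with running without replicas. With the --replicas flag set, each group should list more than one member, so counting members per group is a quick way to confirm whether replication is in effect. A rough Go sketch of that check against a live Zero (again assuming its HTTP endpoint on the default port 6080, and decoding only the fields needed here):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// Only the parts of Zero's /state output needed for this check.
type state struct {
    Groups map[string]struct {
        Members map[string]struct {
            Addr   string `json:"addr"`
            Leader bool   `json:"leader"`
        } `json:"members"`
    } `json:"groups"`
}

func main() {
    // Assumes Zero's HTTP endpoint is reachable on the default port 6080.
    resp, err := http.Get("http://localhost:6080/state")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var s state
    if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
        log.Fatal(err)
    }

    // With replication enabled, each group should list more than one member.
    for gid, g := range s.Groups {
        fmt.Printf("group %s: %d member(s)\n", gid, len(g.Members))
        for _, m := range g.Members {
            fmt.Printf("  %s (leader=%v)\n", m.Addr, m.Leader)
        }
    }
}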


jimanvlad commented Mar 14, 2018 via email

@pawanrawal (Contributor) commented:

I am keen to solve this issue; it would be great if you could share logs from a recent run.

manishrjain added the investigate (Requires further investigation) label on Mar 22, 2018
@Levatius (Author) commented:

I've raised a follow-up issue (#2254); we can probably close this one and move the discussion there.
