DGraph server killed due to OOM #2301

Closed
awmanoj opened this issue Apr 3, 2018 · 4 comments


awmanoj commented Apr 3, 2018

  • What version of Dgraph are you using?
    1.0.4

  • Have you tried reproducing the issue with latest release?
    Yes

  • What is the hardware spec (RAM, OS)?
    Ubuntu 14.04, 8 cores, 32 GB RAM

  • Steps to reproduce the issue (command/config used to run Dgraph).
    An out-of-memory condition was triggered and the Dgraph process was killed. Details below.

  • Expected behaviour and actual result.
    Expected no out-of-memory kills; the actual result is described below.

We have a 3-node setup for Dgraph Zero and a 3-node setup for Dgraph Server.

We got an OOM on one of the Dgraph servers on Apr 1 at 21:58 WIB:

[Sun Apr  1 21:58:32 2018] Out of memory: Kill process 21070 (dgraph) score 957 or sacrifice child
[Sun Apr  1 21:58:32 2018] Killed process 21070 (dgraph) total-vm:44275648kB, anon-rss:32441532kB, file-rss:0kB

As you can see, anon-rss was ~32 GB at that point.

We got another OOM on another dgraph-server on Apr 2, 01:23 WIB:

Apr  2 01:23:44 localhost kernel: [5997037.198669] Killed process 3327 (dgraph) total-vm:44678136kB, anon-rss:32430992kB, file-rss:0kB

Here too, anon-rss was ~32 GB.

The p/ folder on each node holds only ~3.5 GB of data, and during normal hours RSS usage reported by top stays under 6 GB:

 2010 root      20   0 27.284g 5.622g 1.520g S   6.5 17.9 437:39.24 dgraph                                                       
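For reference, the numbers above can be reproduced with standard tools (a sketch; /data/dgraph/p is a hypothetical path standing in for the server's posting directory):

# Size of the posting (p/) directory on a node
du -sh /data/dgraph/p

# Resident memory (RES) of the running dgraph process
top -p $(pgrep -x dgraph)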

From the logs around that time, I see the following:

2018/04/02 01:23:39 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 301
2018/04/02 01:23:39 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 1 at term 301
2018/04/02 01:23:39 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 3 at term 301
2018/04/02 01:23:41 draft.go:653: Couldn't take snapshot, txn watermark: [1529797], applied watermark: [1578263]
2018/04/02 01:23:42 raft.go:749: INFO: 2 is starting a new election at term 301
2018/04/02 01:23:42 raft.go:580: INFO: 2 became candidate at term 302
2018/04/02 01:23:42 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 302
2018/04/02 01:23:42 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 1 at term 302
2018/04/02 01:23:42 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 3 at term 302
2018/04/02 01:23:45 raft.go:749: INFO: 2 is starting a new election at term 302
2018/04/02 01:23:45 raft.go:580: INFO: 2 became candidate at term 303

One of the three servers is now constantly emitting the following logs:

2018/04/03 11:45:49 raft.go:749: INFO: 2 is starting a new election at term 41644
2018/04/03 11:45:49 raft.go:580: INFO: 2 became candidate at term 41645
2018/04/03 11:45:49 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 41645
2018/04/03 11:45:49 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 1 at term 41645
2018/04/03 11:45:49 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 3 at term 41645
2018/04/03 11:45:52 raft.go:749: INFO: 2 is starting a new election at term 41645
2018/04/03 11:45:52 raft.go:580: INFO: 2 became candidate at term 41646
2018/04/03 11:45:52 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 41646
2018/04/03 11:45:52 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 1 at term 41646
2018/04/03 11:45:52 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 3 at term 41646
2018/04/03 11:45:55 raft.go:749: INFO: 2 is starting a new election at term 41646
2018/04/03 11:45:55 raft.go:580: INFO: 2 became candidate at term 41647
2018/04/03 11:45:55 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 41647
2018/04/03 11:45:55 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 3 at term 41647
2018/04/03 11:45:55 raft.go:651: INFO: 2 [logterm: 59, index: 1578263] sent MsgVote request to 1 at term 41647

The other servers' logs at this time are:

2018/04/03 11:47:55 raft.go:692: INFO: 1 [logterm: 59, index: 1578309, vote: 1] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 52)
2018/04/03 11:47:57 raft.go:692: INFO: 1 [logterm: 59, index: 1578309, vote: 1] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 46)
2018/04/03 11:47:58 draft.go:653: Couldn't take snapshot, txn watermark: [1529797], applied watermark: [1578309]
2018/04/03 11:48:01 raft.go:692: INFO: 1 [logterm: 59, index: 1578309, vote: 1] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 50)
2018/04/03 11:48:04 raft.go:692: INFO: 1 [logterm: 59, index: 1578309, vote: 1] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 38)
2018/04/03 11:48:06 raft.go:692: INFO: 1 [logterm: 59, index: 1578309, vote: 1] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 9)
2018/04/03 11:48:08 raft.go:692: INFO: 1 [logterm: 59, index: 1578309, vote: 1] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 8)
2018/04/03 11:48:10 raft.go:692: INFO: 1 [logterm: 59, index: 1578309, vote: 1] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 96)
2018/04/03 11:45:22 raft.go:692: INFO: 3 [logterm: 59, index: 1578309, vote: 0] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 100)
2018/04/03 11:45:25 raft.go:692: INFO: 3 [logterm: 59, index: 1578309, vote: 0] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 100)
2018/04/03 11:45:27 raft.go:692: INFO: 3 [logterm: 59, index: 1578309, vote: 0] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 100)
2018/04/03 11:45:31 raft.go:692: INFO: 3 [logterm: 59, index: 1578309, vote: 0] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 100)
2018/04/03 11:45:33 raft.go:692: INFO: 3 [logterm: 59, index: 1578309, vote: 0] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 100)
2018/04/03 11:45:37 raft.go:692: INFO: 3 [logterm: 59, index: 1578309, vote: 0] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 100)
2018/04/03 11:45:40 raft.go:692: INFO: 3 [logterm: 59, index: 1578309, vote: 0] ignored MsgVote from 2 [logterm: 59, index: 1578263] at term 59: lease is not expired (remaining ticks: 100)
pawanrawal self-assigned this Apr 3, 2018
pawanrawal added the kind/bug (Something is broken.) label Apr 3, 2018
pawanrawal (Contributor) commented:

Are the three Dgraph servers replicas, i.e. was Zero started with replicas set to 3?

First, I'd recommend upgrading to the nightly release, as it contains a fix for an issue where snapshots weren't being taken when a group had replicas (see #2266). That could be one cause of the OOM. If you share some logs from before the OOM crash, I can confirm whether that's what happened.

If switching to the nightly release doesn't solve the issue, could you take a heap profile while memory usage is high and share it? Instructions are at https://docs.dgraph.io/howto/#retrieving-debug-information.
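For example, one way to capture and inspect the profile (a sketch; it assumes the server's HTTP port is the default 8080 and is reachable from where you run this, and that a Go toolchain is installed; replace <server-ip> with the server's address):

# Save the raw heap profile from the pprof endpoint
curl -o dgraph_heap.pprof http://<server-ip>:8080/debug/pprof/heap

# Or inspect it interactively; `top` inside pprof lists the largest allocators
go tool pprof http://<server-ip>:8080/debug/pprof/heap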


awmanoj commented Apr 3, 2018

Yes, Zero was started with replicas set to 3:

replicas: 3
wal: /data/dgraph/wz
bindall: true
my: 172.21.252.4:5080
peer: 172.21.253.84:5080
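For reference, that config corresponds to roughly the following dgraph zero invocation (a sketch only; it assumes the config keys map one-to-one to the Dgraph 1.0.x command-line flags):

dgraph zero \
  --replicas 3 \
  --wal /data/dgraph/wz \
  --bindall \
  --my 172.21.252.4:5080 \
  --peer 172.21.253.84:5080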

I'll try running the nightly build.

manishrjain added this to the Sprint-001 milestone Apr 3, 2018
pawanrawal (Contributor) commented:

@awmanoj could you confirm whether using the nightly solved the problem for you, so that I can close this issue?

manishrjain added the kind/question (Something requiring a response.) label Apr 9, 2018

awmanoj commented Apr 9, 2018

Hi @pawanrawal, moving to the nightly build has resolved the issue. Thanks.

awmanoj closed this as completed Apr 9, 2018