Unexpected list of size zero while creating reverse index #2752

Closed
kosiakk opened this issue Nov 15, 2018 · 13 comments
@kosiakk

kosiakk commented Nov 15, 2018

I'm using the latest Dgraph v1.0.10 (commit 8b801bd) on a powerful Windows laptop, with all default settings (e.g. replication factor 1).

  • Steps to reproduce the issue (command/config used to run Dgraph):
  1. Import the 1million example from the tutorial into a fresh installation (3 alpha servers on the same host, 1 zero).
  2. Open the Schema page in Ratel.
  3. Add a reverse index on actor.film (an equivalent alter call is sketched at the end of this comment).
  • Expected behaviour and actual result:
    One server runs at 100% CPU on one core for 10 minutes and then crashes:

2018/11/15 13:00:06 Unexpected list of size zero: "\x00\x00\nactor.film\f\x00\x00\x00\x00"
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.AssertTruef
/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/error.go:62
github.com/dgraph-io/dgraph/posting.(*rebuild).Run
/ext-go/1/src/github.com/dgraph-io/dgraph/posting/index.go:587
github.com/dgraph-io/dgraph/posting.RebuildReverseEdges
/ext-go/1/src/github.com/dgraph-io/dgraph/posting/index.go:710
github.com/dgraph-io/dgraph/worker.(*node).rebuildOrDelRevEdge
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/index.go:55
github.com/dgraph-io/dgraph/worker.runSchemaMutationHelper
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/mutation.go:169
github.com/dgraph-io/dgraph/worker.runSchemaMutation
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/mutation.go:99
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:198
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:280
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:401
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:429
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1333

Upon restart, it replays the WAL of the schema migration, restarts the index rebuild, and crashes again.
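
For reference, the schema change in step 3 is equivalent to the alter call below. This is a minimal sketch using the dgo client, assuming a local alpha on the default gRPC port 9080 (the actual steps used Ratel's UI):

package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to a local alpha on the default gRPC port.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Adding @reverse to actor.film kicks off the reverse-index
	// rebuild that produces the crash above.
	op := &api.Operation{Schema: "actor.film: uid @reverse ."}
	if err := dg.Alter(context.Background(), op); err != nil {
		log.Fatal(err)
	}
}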

@kosiakk
Author

kosiakk commented Nov 15, 2018

Complete log:

I1115 12:50:46.178210   23844 init.go:80] 

Dgraph version   : v1.0.10
Commit SHA-1     : 8b801bd7
Commit timestamp : 2018-11-05 17:52:33 -0800
Branch           : HEAD

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit https://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed under Apache 2.0. Copyright 2015-2018 Dgraph Labs, Inc.


I1115 12:50:46.180210   23844 server.go:115] Setting Badger table load option: mmap
I1115 12:50:46.180210   23844 server.go:127] Setting Badger value log load option: mmap
I1115 12:50:46.180210   23844 server.go:155] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:true TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:65500 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:3 managedTxns:false DoNotCompact:false maxBatchCount:0 maxBatchSize:0 ReadOnly:false Truncate:true}
badger2018/11/15 12:50:46 INFO: Replaying file id: 0 at offset: 0
badger2018/11/15 12:50:46 INFO: Replay took: 148ms
badger2018/11/15 12:50:46 INFO: Replaying file id: 1 at offset: 0
badger2018/11/15 12:50:46 INFO: Replay took: 7ms
I1115 12:50:46.347210   23844 server.go:115] Setting Badger table load option: mmap
I1115 12:50:46.347210   23844 server.go:127] Setting Badger value log load option: mmap
I1115 12:50:46.347210   23844 server.go:169] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:true TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:3 managedTxns:false DoNotCompact:false maxBatchCount:0 maxBatchSize:0 ReadOnly:false Truncate:true}
badger2018/11/15 12:50:46 INFO: Replaying file id: 6 at offset: 37183251
badger2018/11/15 12:50:46 INFO: Replay took: 463ms
badger2018/11/15 12:50:46 INFO: Replaying file id: 7 at offset: 0
badger2018/11/15 12:50:50 INFO: Replay took: 3.214s
I1115 12:50:50.134210   23844 groups.go:92] Current Raft Id: 21
I1115 12:50:50.136210   23844 worker.go:80] Worker listening at address: [::]:7080
I1115 12:50:50.137210   23844 run.go:338] gRPC server started.  Listening on port 9080
I1115 12:50:50.137210   23844 run.go:339] HTTP server started.  Listening on port 8080
I1115 12:50:50.141210   23844 pool.go:118] CONNECTED to localhost:5080
I1115 12:50:50.144210   23844 groups.go:119] Connected to group zero. Assigned group: 0
I1115 12:50:50.146210   23844 pool.go:118] CONNECTED to localhost:7082
I1115 12:50:50.148210   23844 pool.go:118] CONNECTED to localhost:7081
I1115 12:50:50.148210   23844 draft.go:74] Node ID: 21 with GroupID: 3
I1115 12:50:50.148210   23844 node.go:152] Setting raft.Config to: &{ID:21 peers:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc0129520d0 Applied:2905 MaxSizePerMsg:1048576 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1718900}
I1115 12:50:50.149210   23844 node.go:271] Found Snapshot.Metadata: {ConfState:{Nodes:[21] XXX_unrecognized:[]} Index:2905 Term:2 XXX_unrecognized:[]}
I1115 12:50:50.149210   23844 node.go:282] Found hardstate: {Term:6 Vote:21 Commit:3118 XXX_unrecognized:[]}
I1115 12:50:50.152210   23844 node.go:291] Group 3 found 214 entries
I1115 12:50:50.152210   23844 draft.go:1047] Restarting node for group: 3
I1115 12:50:50.152210   23844 node.go:174] Setting conf state to nodes:21 
I1115 12:50:50.153210   23844 node.go:84] 15 became follower at term 6
I1115 12:50:50.154210   23844 node.go:84] newRaft 15 [peers: [15], term: 6, commit: 3118, applied: 2905, lastindex: 3118, lastterm: 6]
I1115 12:50:50.155210   23844 draft.go:313] Skipping snapshot at 2905, because found one at 2905
I1115 12:50:50.156210   23844 index.go:48] Deleting reverse index for actor.film
I1115 12:50:50.242210   23844 index.go:54] Rebuilding reverse index for actor.film
I1115 12:50:51.084210   23844 groups.go:519] Got address of a Zero server: localhost:5080
I1115 12:50:53.374210   23844 node.go:84] 15 is starting a new election at term 6
I1115 12:50:53.374210   23844 node.go:84] 15 became pre-candidate at term 6
I1115 12:50:53.374210   23844 node.go:84] 15 received MsgPreVoteResp from 15 at term 6
I1115 12:50:53.374210   23844 node.go:84] 15 became candidate at term 7
I1115 12:50:53.374210   23844 node.go:84] 15 received MsgVoteResp from 15 at term 7
I1115 12:50:53.374210   23844 node.go:84] 15 became leader at term 7
I1115 12:50:53.374210   23844 node.go:84] raft.node: 15 elected leader 15 at term 7
I1115 12:50:54.154210   23844 groups.go:718] Leader idx=21 of group=3 is connecting to Zero for txn updates
E1115 12:51:22.157210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:51:50.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:52:18.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:52:46.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:53:14.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:53:42.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:54:10.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:54:38.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:55:06.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:55:34.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:56:02.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:56:30.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:56:58.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:57:26.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:57:54.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:58:22.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:58:50.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:59:18.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
E1115 12:59:46.161210   23844 groups.go:826] While proposing delta: max_assigned:11757 . Error=Server unavailable. Please retry later. Retrying...
2018/11/15 13:00:06 Unexpected list of size zero: "\x00\x00\nactor.film\f\x00\x00\x00\x00"
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.AssertTruef
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/error.go:62
github.com/dgraph-io/dgraph/posting.(*rebuild).Run
	/ext-go/1/src/github.com/dgraph-io/dgraph/posting/index.go:587
github.com/dgraph-io/dgraph/posting.RebuildReverseEdges
	/ext-go/1/src/github.com/dgraph-io/dgraph/posting/index.go:710
github.com/dgraph-io/dgraph/worker.(*node).rebuildOrDelRevEdge
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/index.go:55
github.com/dgraph-io/dgraph/worker.runSchemaMutationHelper
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/mutation.go:169
github.com/dgraph-io/dgraph/worker.runSchemaMutation
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/mutation.go:99
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:198
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:280
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:401
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:429
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1333

@manishrjain
Contributor

Fixed in v1.0.11-rc2. Can you try with that? The on-disk representation has not changed between v1.0.10 and the RC.

@manishrjain
Contributor

Feel free to reopen if it still persists in rc2.

@kosiakk
Author

kosiakk commented Nov 16, 2018

Unfortunately, it still fails:

Dgraph version : v1.0.11-rc2
Commit SHA-1 : bec18bc
Commit timestamp : 2018-11-14 15:03:36 -0800
Branch : HEAD

2018/11/16 20:20:48 Unexpected list of size zero: "\x00\x00\nactor.film\f\x00\x00\x00\x00"
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.AssertTruef
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/error.go:62
github.com/dgraph-io/dgraph/posting.(*rebuild).Run
	/ext-go/1/src/github.com/dgraph-io/dgraph/posting/index.go:587
github.com/dgraph-io/dgraph/posting.RebuildReverseEdges
	/ext-go/1/src/github.com/dgraph-io/dgraph/posting/index.go:710
github.com/dgraph-io/dgraph/worker.(*node).rebuildOrDelRevEdge
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/index.go:55
github.com/dgraph-io/dgraph/worker.runSchemaMutationHelper
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/mutation.go:179
github.com/dgraph-io/dgraph/worker.runSchemaMutation
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/mutation.go:100
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:198
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:289
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:402
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:429
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1333

@manishrjain manishrjain reopened this Nov 16, 2018
@manishrjain
Contributor

Yeah, I see the assert is still there. It's been on my TODO list to refactor this a bit and speed it up. Will take care of this.

If you want to get out of this issue right away, and are not afraid to compile Dgraph, just replace the assert with a continue here:
https://github.com/dgraph-io/dgraph/blob/a1475713bc7f5ce666b5740349283d1c22f46adc/posting/index.go#L587
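
For anyone following along, here is a minimal, self-contained sketch of that pattern. It is not the verbatim Dgraph source; the key names and list type are stand-ins. The point is that badger's y.AssertTruef ends in a log fatal, which exits the whole process, while a continue merely skips the offending key:

package main

import (
	"fmt"
	"log"
)

// assertTruef mimics badger's y.AssertTruef: on failure it calls
// log.Fatalf, which exits the entire process, not just one goroutine.
func assertTruef(b bool, format string, args ...interface{}) {
	if !b {
		log.Fatalf("Assert failed: "+format, args...)
	}
}

func main() {
	postings := map[string][]uint64{
		"actor.film": {},         // an empty list like the one in this issue
		"film.genre": {101, 102},
	}
	for key, plist := range postings {
		if len(plist) == 0 {
			continue // the suggested workaround: skip instead of asserting
		}
		assertTruef(len(plist) > 0, "Unexpected list of size zero: %q", key)
		fmt.Printf("rebuilding reverse edges for %q (%d postings)\n", key, len(plist))
	}
}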

@kosiakk
Author

kosiakk commented Nov 16, 2018

Thank you!
As a workaround I've tried deleting the w folder of the crashing group. It does not try to redo the schema migration, but it cannot restart either:

E1116 23:22:35.102485 11928 groups.go:109] Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address

On zero:

I1116 23:23:07.194485   13700 zero.go:477] Connected: id:89 group_id:4 addr:"localhost:7080"
I1116 23:23:10.412485   13700 zero.go:375] Got connection request: addr:"localhost:7080"
E1116 23:23:10.413485   13700 raft.go:516] While applying proposal: Invalid address
I1116 23:23:10.415485   13700 zero.go:477] Connected: id:90 group_id:4 addr:"localhost:7080"

@kosiakk
Author

kosiakk commented Nov 16, 2018

Well, the bug itself is not that important, but to me it points to a much deeper problem:
Why does a normal bug in TX processing crash the entire node?

I'm coming from the Java universe, where bugs of this kind rarely crash even a thread.
In this case, I'd expect some kind of transaction abort, or some other way of preserving the status quo.

For my application I can accept a TX rollback, or even a read-only state for some predicates. Even a slow, read-only cluster with warnings, alarms and blinking lights is still a possible solution: it still works.

But losing the ability to read committed data doesn't seem like a viable option.

@manishrjain
Contributor

It is actually not a crash, but an assert failure, part of a defensive coding technique we use at Dgraph (we put asserts at places where we never expect an error). In this case, that assert is actually not that useful, and as you're witnessing, it does trigger.

Java is a different universe, where a few unhandled exceptions here and there are just part and parcel of life. I'm personally not a fan of that.

@kosiakk
Author

kosiakk commented Nov 17, 2018

Manish, please understand me right: I really don't want to start a programming holy war here. I fully understand the value of asserts and use them as well. That's okay.
Bugs happen, asserts misfire, people fool around. That is inevitable, no matter which language or tech is used. Go is quite obviously a popular and reasonable choice.

But please try to imagine my situation right now:

Dgraph is at the core of my data product. Somebody clicked something, and this or some other assert misfired. Now my database is down and I have absolutely no way to recover it. What are my options before customers start calling?

Before trying that, I read the entire Dgraph documentation and found a suggestion to delete the WAL. Sadly, that does not help in my case.

What I expect from a production ready product is quite modest:

  1. Badger/Dgraph discovers an unexpected situation; an assert fires.
  2. I get all the info and the stack trace in the log.
  3. The operation/request is aborted.
  4. The database continues to work as if nothing happened.

Instead, at step 3, the alpha goes down, killing 1/3 of the database. I believe that if it were part of a replication group, all related alphas would hit the same assert and behave identically.

@manishrjain
Contributor

I think you're conflating an assert failure with a stack trace (in Java) or error handling (in Go). Both of the latter do what you expect, i.e. keep the process running and just error the request out. Assert failures are different: the idea behind them is not that the request has an error. The idea is that there are assumptions a developer makes about how the code should work, and the asserts (the equivalent of a log fatal) enforce those assumptions, so if one ever triggers, the developer knows that their assumptions are flawed. This is part of a defensive coding technique, and has nothing to do with the language the code is written in.
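
A minimal sketch of that distinction in Go: a panic can be recovered and turned into a failed request, while an assert-style log fatal calls os.Exit and skips deferred handlers entirely, so the process dies. (This illustrates the general mechanism, not Dgraph's actual request-handling code.)

package main

import (
	"fmt"
	"log"
)

// handle runs one "request" and converts a panic into an error,
// the way a server keeps serving after a bad request.
func handle(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("request aborted: %v", r)
		}
	}()
	f()
	return nil
}

func main() {
	// A panic is recoverable: the process survives this request.
	fmt.Println(handle(func() { panic("bad input") }))

	// log.Fatalf is not: it calls os.Exit(1), so the deferred recover
	// never runs and the process dies right here.
	handle(func() { log.Fatalf("Assert failed: unexpected list of size zero") })
	fmt.Println("never reached")
}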

This discussion, however, ventures into programming philosophy, something that I don't intend to go into. @danielmai is putting together the fix that I'd mentioned earlier, and we'll cut a release candidate today, so you could use that.

I'd also recommend not to base your entire perception of Dgraph on this bug. As you mentioned, bugs do and will happen. If you're concerned about Dgraph's stability in production, I'm happy to have a chat.

@danielmai
Contributor

@kosiakk v1.0.11-rc3 has been released containing a fix for this issue.

Link to binaries: Linux, Darwin, Windows.
Available on Docker with docker pull dgraph/dgraph:v1.0.11-rc3.

@kosiakk
Author

kosiakk commented Nov 20, 2018

Fix confirmed! Thank you.
I'll gain some more experience and will certainly get back to you with questions.

@u007

u007 commented Dec 24, 2018

Thank god it works :)

And thank you for the fast fix.

I think it would be good practice to take regular backups of the database.
