deadlock when writing into Badger using multiple routines #1032
Just tested it with 398445a and it still hangs here:
Still waiting for test results with the latest master, but I guess it'll still fail.
This commit fixes the issue on latest master:
@connorgorman latest master also deadlocks in the same place:
Hi @tony2001, thanks for checking it out. Do you have any snippet that could reproduce this issue? It would be very helpful for debugging :)
The patch above still makes sense to me (release the lock before calling Wait()), but it didn't fix the issue either.
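For reference, a minimal self-contained sketch of that idea (accumulate under the lock, enqueue the request, unlock, and only then wait); the names here (`statsFlusher`, `request`, `writeCh`) are made up for illustration and are not Badger's internals:

```go
package main

import "sync"

// request stands in for a queued write request that the writer completes.
type request struct{ wg sync.WaitGroup }

type statsFlusher struct {
	mu      sync.Mutex
	stats   map[uint32]int64 // protected by mu
	writeCh chan *request
}

// add is called by many goroutines to record stats under the lock.
func (f *statsFlusher) add(fid uint32, n int64) {
	f.mu.Lock()
	f.stats[fid] += n
	f.mu.Unlock()
}

// flush hands the accumulated stats to the writer, releasing the lock
// before blocking on the request (the point of the patch above).
func (f *statsFlusher) flush() {
	f.mu.Lock()
	req := &request{}
	req.wg.Add(1)
	f.writeCh <- req             // enqueue the write while holding the lock
	f.stats = map[uint32]int64{} // reset the accumulated stats
	f.mu.Unlock()                // unlock BEFORE waiting ...

	req.wg.Wait() // ... so blocking here cannot stall other lock holders
}

func main() {
	f := &statsFlusher{stats: map[uint32]int64{}, writeCh: make(chan *request, 1)}
	go func() {
		for req := range f.writeCh {
			req.wg.Done() // pretend the write request completed
		}
	}()
	f.add(1, 10)
	f.flush()
}
```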
Unfortunately, no, I don't have a reproduction case, and it takes more than 2 hours to reproduce it.
Yeah, please share it.
https://gist.github.com/tony2001/0e97a7bb36b97970cf401394e5362b93 (backtraces of all routines for badger/master with the patch above ^^)
So it looks like this (correct me if I'm wrong):
Hey @tony2001, are you running Badger with DefaultOptions? If not, can you share the options struct with us?
These are the options I use. I also use our own Log interface, but that shouldn't matter at all.
Any news on this? A deadlock (or endless loop) looks fairly critical to me.
@tony2001 we found the issue and we're working on the fix. Will update you after it gets merged. :)
Fixes #1032. Currently the discardStats flow is as follows:
* Discard stats are generated during compaction. At the end, the compaction routine updates these stats in the vlog (the vlog maintains all discard stats). If the number of updates exceeds a threshold, a new request is generated and sent to the write channel, and the routine waits for the request to complete (request.Wait()).
* Requests are consumed from the write channel and written to the vlog first and then to the memtable.
* If the memtable is full, it is flushed to the flush channel.
* From the flush channel, memtables are written to L0 only if there are fewer than NumLevelZeroTablesStall tables there already.

Events that can lead to deadlock: compaction is running on L0, which currently has NumLevelZeroTablesStall tables, and tries to flush discard stats to the write channel. After pushing the stats to the write channel, it waits for the write request to complete, which cannot happen because of the cyclic dependency.

Fix: this PR introduces a buffered flush channel for discardStats. The compaction routine pushes generated discard stats to the flush channel; if the channel is full, it just returns. This decouples compaction from writes. A separate routine consumes stats from the flush channel.
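For illustration, here is a minimal sketch of the decoupling described in the fix, assuming a buffered channel and a non-blocking send; the names (`discardFlusher`, `flushCh`) are made up and this is not the actual PR code:

```go
package main

import (
	"fmt"
	"time"
)

// discardStats maps a value-log file id to the number of discardable bytes.
type discardStats map[uint32]int64

// discardFlusher sketches the fix: discard stats go into a buffered channel
// with a non-blocking send, and a dedicated goroutine persists them, so the
// compaction routine never waits on the write path.
type discardFlusher struct {
	flushCh chan discardStats
}

func newDiscardFlusher() *discardFlusher {
	f := &discardFlusher{flushCh: make(chan discardStats, 16)}
	go f.consume()
	return f
}

// push is called from the compaction routine. It never blocks: if the
// channel is full it just returns, mirroring the behaviour described above.
func (f *discardFlusher) push(stats discardStats) {
	select {
	case f.flushCh <- stats:
	default:
	}
}

// consume is the separate routine that drains the flush channel and writes
// the stats out (stubbed here with a print).
func (f *discardFlusher) consume() {
	for stats := range f.flushCh {
		fmt.Println("persisting discard stats:", stats)
	}
}

func main() {
	f := newDiscardFlusher()
	f.push(discardStats{1: 4096})
	time.Sleep(10 * time.Millisecond) // give the consumer a moment in this toy example
}
```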
@tony2001 I have merged the fix.
Hey @tony2001, did you get a chance to try the fix?
Unfortunately, no. We've solved the task with a different storage engine, so I'll have to implement a synthetic test to try it. |
Fixes #1032 (cherry picked from commit c1cf0d7).
go version go1.13 linux/amd64
github.com/dgraph-io/badger v2.0.0-rc3+incompatible
I have to admit I didn't try latest master, but I've seen this issue since early versions of Badger 2.0 and it's still there with rc3.
The server is running on a 56-core Intel Xeon E5-2660; no idea about the disk.
I'm reading several multi-gigabyte files in their own binary format and "converting" them into Badger, writing the data encoded with Protobuf. The initial data is organized in batches (of arbitrary size, from 1 to 1000+ records per batch), so I'm using WriteBatch to write the actual data and then badger.Update() to update the counters (stored in the same Badger DB). To speed up the process, I'm using 64 goroutines that listen on a channel with the data and do the actual encoding/writing.
The problem is that after some time Badger deadlocks and stops writing anything; the process is stuck with the backtraces provided below.
As you can see from backtrace 3, I tried using Txn.Set/Commit directly instead of badger.Update, but it didn't help.
Unfortunately, I'm unable to provide reproduction code so far, as the issue seems to be reproducible only with really large amounts of data, after a couple of hours of running, and with no error messages whatsoever. To make it more complicated, I can't reproduce it in 100% of cases even using the same initial data and the same Go code.
My code looks very much like this (in simplified Go "pseudocode"):
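The original snippet did not survive extraction, so here is a rough sketch of the flow described above (64 worker goroutines reading batches from a channel, WriteBatch for the data, db.Update for the counters), written against the Badger v2 API; the `record` type, key names, and helper functions are illustrative, not the original code:

```go
package main

import (
	"encoding/binary"
	"log"
	"sync"

	"github.com/dgraph-io/badger"
)

// record stands in for one protobuf-encoded entry of a batch.
type record struct {
	Key, Value []byte
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-repro"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	batches := make(chan []record, 64)

	var wg sync.WaitGroup
	for i := 0; i < 64; i++ { // 64 writer goroutines, as in the description above
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				writeBatch(db, batch)
				incrementCounter(db, []byte("!counters!records"), uint64(len(batch)))
			}
		}()
	}

	// ... feed batches into the channel from the input files, then:
	close(batches)
	wg.Wait()
}

// writeBatch writes one input batch via WriteBatch.
func writeBatch(db *badger.DB, batch []record) {
	wb := db.NewWriteBatch()
	defer wb.Cancel()
	for _, r := range batch {
		if err := wb.Set(r.Key, r.Value); err != nil {
			log.Fatal(err)
		}
	}
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}
}

// incrementCounter bumps a counter stored in the same DB via db.Update.
func incrementCounter(db *badger.DB, key []byte, delta uint64) {
	err := db.Update(func(txn *badger.Txn) error {
		var cur uint64
		item, err := txn.Get(key)
		switch err {
		case nil:
			val, err := item.ValueCopy(nil)
			if err != nil {
				return err
			}
			cur = binary.BigEndian.Uint64(val)
		case badger.ErrKeyNotFound:
			// first write for this counter
		default:
			return err
		}
		buf := make([]byte, 8)
		binary.BigEndian.PutUint64(buf, cur+delta)
		return txn.Set(key, buf)
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Note that with 64 goroutines bumping the same counter keys, db.Update can return badger.ErrConflict, which real code would need to retry; the sketch above only mirrors the flow as described.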
Backtraces:
1.
There are some more routines with their own traces, dunno if they are related to the issue or not: