Boltdb Deleting Bucket? #429
Is the transaction shared across goroutines? I ask because it's not safe to do that, and you mentioned that you can cause the error by adding prints, which changes timing characteristics and therefore smells like a race condition. If you're deleting different IDs from different goroutines using the same transaction, the type of behavior you describe could certainly happen, due to the fact that transactions are not safe for concurrent use.
@davecgh thanks for dropping in. The tx is confined to a single goroutine. The program does spawn hundreds of goroutines, but the tx is only active inside one of them. I'll double-check that, though. Additionally, I have the race detector on, and it's not complaining at all.
I've now been able to get it to drop from 18 items to 10 items in a single delete operation by increasing the aggressiveness of the stress test. edit: I'm now pretty confident there's only 1 goroutine accessing the db at a time; I restructured the program so that the extra goroutines are never spawned. The error is still happening.
Running `tx.Check()` is not turning up any errors either. Should `tx.Check()` be able to catch an error like this?
I was able to prevent the problem (even under extreme load) by periodically copying the bucket to memory, deleting it, and then repopulating it. Once the dataset is too large for this to be manageable, I imagine using a temporary bucket will work just as well. I strongly suspect that there is an error somewhere in bolt's node/leaf logic. The fact that the failures are nondeterministic does suggest a race condition on my end, though I did my best to isolate any possibility of that. To the best of my knowledge, the problem is not from accessing the database using multiple goroutines. I am able to reproduce the problem by disabling the 'bucket refreshing', and can assist with debugging if needed.
Can you share something that reproduces this behavior?
https://github.com/NebulousLabs/Sia/tree/bolt-narrowing run My next step is going to be trying to reproduce the error with just the calls to the database, none of the extra logic. I'll capture the calls with some form of logging.
https://github.com/NebulousLabs/Sia/tree/bolt-narrowing Stripped out as much logic as time allowed. I tried to reproduce the issues by making a boltdb instance that just copied the reads and writes, but I was unsuccessful. I was only managing the reads + writes for 1 bucket, as opposed to the whole database. The entire repository runs on a single goroutine, but the failures are still nondeterministic. Would appreciate more guidance. I'm not sure how to continue. |
Confirmed that this behavior still appears as early as 6903c74 |
I realized why the failures are nondeterministic - the test inserts random data seeded by a timestamp. That behavior is fairly well buried, but I should have realized it sooner. |
Further testing indicates that the deletions are probably happening correctly, but our code to check the number of entries is faulty. We are using a cursor to walk the bucket and count the entries. EDIT: my latest guess is that the cursor is encountering an empty leaf page after the deletions.
Update: calling
@benbjohnson can you provide your input? Can you think of a reason that this would be happening?
This commit fixes an issue where keys are skipped by cursors after deletions occur in a bucket. This occurred because the cursor seeks to the leaf page but does not check if it is empty. Fixes boltdb#429, boltdb#450
I've started poking around in the bolt codebase, but it seems like, nondeterministically, though after a significant number of changes within a single tx, calling `Delete` on a bucket can result in the entire bucket being deleted.

edit: it doesn't always drop the whole bucket. Sometimes calling `Delete` will just drop 3/5 items. Maybe it's getting confused about a leaf node vs. a root/branch node and dropping a bunch of children?

edit2: inside of `Bucket.Delete`, a call to `c.node().del(key)` is made. It seems like the failure only occurs when the index of the leaf is `0` and `len(node.inodes)` is `1`. Further testing continues to confirm this trend.

Things I know about my code:

- I haven't seen it fail with more than 6 items remaining in the bucket. (I got it to drop from 18 to 10 items in a single delete operation by increasing the stress.)
- My whole codebase is pretty complicated. It's completely possible that I'm doing something incorrect somewhere else and the error isn't showing up until this point. I will be digging through the bolt code, but was hoping someone had ideas.