
Boltdb Deleting Bucket? #429

Closed
DavidVorick opened this issue Sep 25, 2015 · 12 comments · Fixed by #452

Comments

@DavidVorick

I've started poking around in the bolt codebase. It seems that, nondeterministically, after a significant number of changes within a single tx, calling Delete for a single key can end up emptying the entire bucket.

edit: it doesn't always drop the whole bucket. Sometimes calling Delete will only drop 3 of 5 items. Maybe it's getting confused between a leaf node and a root/branch node and dropping a bunch of children?

edit2: inside Bucket.Delete, a call to c.node().del(key) is made. The failure seems to occur only when the index of the leaf is 0 and len(node.inodes) is 1. Further testing continues to confirm this trend.

Things I know about my code:

  1. A lot is happening in a single tx. Probably on the order of thousands of reads and writes to the bucket.
  2. The change that is causing the bucket to be emptied is not happening inside of a ForEach statement or a cursor.
  3. The bucket seems most likely to wipe when there are only a few elements remaining. The stress test brings it up as high as 50 items, but it usually fails at 2 or 3 items. I haven't seen it fail with more than 6 items remaining in the bucket. (got it to drop from 18 to 10 items in a single delete operation by increasing the stress)
  4. The failure is nondeterministic. Sometimes there won't be a failure, but I can get the failure happening again by adding printlns in arbitrary places in my code (the logic is not changed at all, but it can have an effect on whether the test fails or not). The failing sequence looks like this:
    lenScoBucket(tx) // prints 3
    err := scoBucket.Delete(id[:]) // err is nil
    lenScoBucket(tx) // prints 0

My whole codebase is pretty complicated. It's entirely possible that I'm doing something else incorrectly somewhere and the error just isn't showing up until this point. I will be digging through the bolt code, but I was hoping someone had ideas.
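
For reference, here is a condensed, hypothetical sketch of the access pattern described above (the bucket name, key sizes, and the lenScoBucket-style counting helper are stand-ins, not the actual Sia code):

```go
package main

import (
    "crypto/rand"
    "fmt"
    "log"

    "github.com/boltdb/bolt"
)

// lenBucket counts the key/value pairs in a bucket by walking it with
// ForEach, mirroring the lenScoBucket helper used in the snippet above.
func lenBucket(tx *bolt.Tx, name []byte) int {
    n := 0
    tx.Bucket(name).ForEach(func(k, v []byte) error {
        n++
        return nil
    })
    return n
}

func main() {
    db, err := bolt.Open("stress.db", 0600, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    name := []byte("scoBucket") // stand-in bucket name
    err = db.Update(func(tx *bolt.Tx) error {
        b, err := tx.CreateBucketIfNotExists(name)
        if err != nil {
            return err
        }

        // Thousands of writes within the same transaction, as in the report.
        var keys [][]byte
        for i := 0; i < 5000; i++ {
            k := make([]byte, 32)
            rand.Read(k)
            if err := b.Put(k, []byte("value")); err != nil {
                return err
            }
            keys = append(keys, k)
        }

        // Delete keys one at a time and watch the count after each delete.
        for _, k := range keys {
            before := lenBucket(tx, name)
            if err := b.Delete(k); err != nil {
                return err
            }
            if after := lenBucket(tx, name); after != before-1 {
                // The reported symptom: a single Delete appears to drop
                // several entries, or empty the bucket entirely.
                fmt.Printf("count went from %d to %d after one delete\n", before, after)
            }
        }
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}
```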

@davecgh
Contributor

davecgh commented Sep 25, 2015

Is the transaction shared across goroutines? I ask because that isn't safe, and you mentioned that you can trigger the error by adding prints, which changes the timing characteristics and therefore smells like a race condition.

If you're deleting different IDs from different goroutines using the same transaction, the type of behavior you describe could certainly happen: Delete involves a seek, which sets state in the transaction, and a race there could end up deleting the wrong node.
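
For illustration, this is the kind of unsafe pattern being asked about (it is not taken from the Sia code; the bucket name is hypothetical and the package scaffolding is omitted). Bolt transactions are not safe for concurrent use, so sharing one writable tx like this can corrupt the seek state that Delete relies on:

```go
import (
    "log"
    "sync"

    "github.com/boltdb/bolt"
)

// deleteConcurrently demonstrates the UNSAFE pattern: one write transaction
// shared across goroutines. Concurrent Delete calls can race on the cursor
// state inside the tx and end up removing the wrong node.
func deleteConcurrently(db *bolt.DB, ids [][]byte) error {
    return db.Update(func(tx *bolt.Tx) error {
        b := tx.Bucket([]byte("scoBucket")) // hypothetical bucket name
        var wg sync.WaitGroup
        for _, id := range ids {
            wg.Add(1)
            go func(id []byte) { // DON'T do this: the tx is shared across goroutines
                defer wg.Done()
                if err := b.Delete(id); err != nil {
                    log.Println("delete:", err)
                }
            }(id)
        }
        wg.Wait()
        return nil
    })
}
```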

@DavidVorick
Author

@davecgh thanks for dropping in.

The tx is happening in a single goroutine. The program does spawn hundreds of goroutines, but the tx is only active inside one of them. I'll double-check that, though. Additionally, I have the race detector on, and it's not complaining at all.

> The stress test brings it up as high as 50 items, but it usually fails at 2 or 3 items. I haven't seen it fail with more than 6 items remaining in the bucket.

I've now been able to get it to drop from 18 items to 10 items in a single delete operation, by increasing the aggressiveness of the stress test.

edit: I'm pretty confident there's only one goroutine accessing the db at a time; I restructured the program so that the extra goroutines are never spawned. The error is still happening.

@DavidVorick
Author

Running tx.Check() is not turning up any errors either. Should tx.Check() be able to catch an error like this?
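
For context, tx.Check() returns a channel of errors that has to be drained, roughly like this (a fragment, assuming an open *bolt.DB named db). Since Check mostly verifies page reachability and the freelist rather than logical key counts, it plausibly would not flag a cursor-level problem like this one:

```go
// Run bolt's built-in consistency check and drain the error channel.
// A clean database produces no errors on the channel.
err := db.View(func(tx *bolt.Tx) error {
    for err := range tx.Check() {
        log.Println("tx.Check:", err)
    }
    return nil
})
if err != nil {
    log.Fatal(err)
}
```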

@DavidVorick
Author

I was able to prevent the problem (even under extreme load) by periodically copying the bucket to memory, deleting it, and then repopulating it. Once the dataset is too large for this to be manageable, I imagine using a temporary bucket will work just as well.

I strongly suspect that there is an error somewhere in bolt's node/leaf logic. The fact that the failures are nondeterministic does suggest a race condition on my end, though I did my best to rule out any possibility of that. To the best of my knowledge, the problem is not from accessing the database from multiple goroutines.

I am able to reproduce the problem by disabling the 'bucket refreshing', and can assist with debugging if needed.
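
A sketch of the "bucket refreshing" workaround described above (illustrative only, not the actual Sia code; it assumes it runs inside the same writable transaction and that the bucket fits in memory):

```go
// refreshBucket copies every key/value pair in the named bucket to memory,
// drops the bucket, recreates it, and writes the pairs back.
func refreshBucket(tx *bolt.Tx, name []byte) error {
    // Copy the bucket contents out. The slices must be copied because bolt's
    // keys and values are only valid for the life of the transaction.
    type pair struct{ k, v []byte }
    var pairs []pair
    err := tx.Bucket(name).ForEach(func(k, v []byte) error {
        pairs = append(pairs, pair{
            k: append([]byte(nil), k...),
            v: append([]byte(nil), v...),
        })
        return nil
    })
    if err != nil {
        return err
    }

    // Drop and recreate the bucket.
    if err := tx.DeleteBucket(name); err != nil {
        return err
    }
    b, err := tx.CreateBucket(name)
    if err != nil {
        return err
    }

    // Repopulate.
    for _, p := range pairs {
        if err := b.Put(p.k, p.v); err != nil {
            return err
        }
    }
    return nil
}
```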

@chrsm

chrsm commented Sep 25, 2015

Can you share something that reproduces this behavior?

@DavidVorick
Author

https://github.com/NebulousLabs/Sia/tree/bolt-narrowing

Run make test to reproduce the issue. The whole repo is single-threaded. I stripped out substantial portions of the code to keep things as simple as possible. A combination of View, Update, and ForEach is used in the repo, but as far as I can see I'm not misusing anything anywhere.

My next step is going to be trying to reproduce the error with just the calls to the database and none of the extra logic. I'll capture the calls with some form of logging.

@DavidVorick
Author

https://github.com/NebulousLabs/Sia/tree/bolt-narrowing

I stripped out as much logic as time allowed. I tried to reproduce the issue by making a boltdb instance that just copied the reads and writes, but I was unsuccessful; I was only copying the reads and writes for one bucket, as opposed to the whole database.

The entire repository runs on a single goroutine, but the failures are still nondeterministic. I would appreciate more guidance; I'm not sure how to continue.

@lukechampine
Contributor

Confirmed that this behavior still appears as far back as commit 6903c74.

@DavidVorick
Author

I realized why the failures are nondeterministic: the test inserts random data seeded by a timestamp. That behavior is fairly well buried, but I should have realized it sooner.

@lukechampine
Contributor

Further testing indicates that the deletions are probably happening correctly, but our code to check the number of entries is faulty. We are using ForEach to count the number of key/value pairs in the bucket, inside the same transaction in which the entries are deleted. My guess is that this is the real source of the discrepancy.

Is it safe to call ForEach on a bucket that has been modified within the same transaction? (Obviously it's unsafe to modify the bucket inside the actual ForEach function.) If not, is there any way to determine the "post-commit" number of entries from within the transaction? Seems like this is related to #275.

EDIT: my latest guess is that the cursor is encountering a nil key in the middle of a bucket. In effect, this is a "hole" left by Delete. The correct behavior would be to patch the hole, or skip over it somehow. I was able to manually skip over the hole to prevent the problem, but that's pretty unsophisticated.
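
For reference, counting through the public API looks roughly like this (a sketch, not the actual counting code). Both a manual cursor loop and Bucket.ForEach, which is built on the same kind of cursor loop, treat the first nil key as the end of the bucket, so a nil "hole" left mid-bucket silently truncates the count:

```go
// countKeys walks the bucket with a cursor. The loop stops at the first nil
// key, so if a deletion leaves an empty page (a nil "hole") in the middle of
// the bucket, everything after the hole is skipped and the count comes up
// short.
func countKeys(tx *bolt.Tx, name []byte) int {
    n := 0
    c := tx.Bucket(name).Cursor()
    for k, _ := c.First(); k != nil; k, _ = c.Next() {
        n++
    }
    return n
}
```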

@lukechampine
Contributor

Update: calling b.rebalance() after the c.node().del(key) call in Bucket.Delete seems to resolve the problem. Again, I don't have a strong grasp of bolt's internals, so it will take me some time to find out why this works. At any rate, rebalancing after each delete incurs an acceptable cost for our purposes, so we may choose to vendor bolt and do it ourselves.
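
For concreteness, the change being described amounts to something like the following in a vendored copy of bolt. The body of Bucket.Delete here is only an approximate reconstruction of the upstream method (its validation checks are elided); the single addition is the b.rebalance() call:

```go
// Delete removes a key from the bucket. Vendored copy: upstream's checks for
// a closed tx, a read-only tx, and nested-bucket keys are elided here.
func (b *Bucket) Delete(key []byte) error {
    // Move the cursor to the key and remove it from its leaf node.
    c := b.Cursor()
    c.seek(key)
    c.node().del(key)

    // Workaround from this thread: rebalance immediately so that a later
    // cursor in the same transaction never lands on an emptied leaf page.
    b.rebalance()

    return nil
}
```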

@DavidVorick
Author

@benbjohnson can you provide your input? Can you think of a reason this would be happening and why b.rebalance() would solve the problem? Is our fix something you would consider merging into master?

benbjohnson pushed a commit to benbjohnson/bolt that referenced this issue Nov 6, 2015
This commit fixes an issue where keys are skipped by cursors after
deletions occur in a bucket. This occurred because the cursor seeks
to the leaf page but does not check if it is empty.

Fixes boltdb#429, boltdb#450
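
As a user-level illustration of the behavior the commit message describes (a hypothetical check, not taken from bolt's test suite; a fragment that assumes a populated bucket and the usual imports): delete keys inside a write transaction, then confirm that a cursor still visits every remaining key:

```go
// verifyCursorAfterDelete deletes every other key in the bucket and then
// checks that a cursor still visits all of the keys that should remain.
// Before the fix, the cursor could skip keys when it landed on a leaf page
// emptied by the deletions.
func verifyCursorAfterDelete(db *bolt.DB, name []byte) error {
    return db.Update(func(tx *bolt.Tx) error {
        b := tx.Bucket(name) // assumes the bucket already exists

        // Collect all keys, then delete every other one.
        var keys [][]byte
        b.ForEach(func(k, v []byte) error {
            keys = append(keys, append([]byte(nil), k...))
            return nil
        })
        remaining := make(map[string]bool)
        for i, k := range keys {
            if i%2 == 0 {
                if err := b.Delete(k); err != nil {
                    return err
                }
            } else {
                remaining[string(k)] = true
            }
        }

        // Walk the bucket again and make sure nothing was skipped.
        c := b.Cursor()
        for k, _ := c.First(); k != nil; k, _ = c.Next() {
            delete(remaining, string(k))
        }
        if len(remaining) != 0 {
            return fmt.Errorf("cursor skipped %d keys after deletions", len(remaining))
        }
        return nil
    })
}
```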