-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for metaslab_fastwrite_unmark() assert failure #4870
Conversation
Nice work! This one has been a thorn in our side for far too long. Your proposed fix looks entirely reasonable to me and nicely explains the problem. I'll do some local testing over the weekend to verify the patch as well. I'd suggest updating the patch when you get a chance to include a clear description on the problem, and added the following line to the commit message. This will instruct the buildbot to run zloop.sh for 2 hours when testing this commit instead of the default 15 minutes. That should give us a really good indication if it's been fixed.
|
@behlendorf Will do. To be clear, do you want the description in the commit message or the PR? |
d2ad59f
to
181f22b
Compare
I updated the pull as you requested. By the way, is there a way to look at the dumps from ztesting that buildbot does? I see that there will be ztest failures when everything is done, but I know this problem isn't the only issue plaguing ztest. Is there any way I can confirm this particular failure is fixed? |
Yes, a good commit message for the commit describing the problem and fix. Your comment from above in the PR would be fine. So zloop.sh is supposed to dump all backtraces from the core dump to the buildbot logs. In practice, I've seen this not work reliably on all the Linux distributions. It seems work more often than not on Debian 8 but not elsewhere, I haven't had a chance to run down the root cause. You can run the exact same script locally, grab it from here: https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/bb-test-ztest.sh Your right to be absolutely sure this is no longer happening and those failures are unrelated we'll need to at least get back traces out of the buildbot for the remaining ztest failures. |
@behlendorf OK. Well I'll get my machine running tests over the weekend and when I get in to work on Monday I'll look through the logs and make sure anything that happens is unrelated. |
Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost. TEST_ZTEST_TIMEOUT=7200
181f22b
to
60e9c7d
Compare
I checked in on my tests (2 instances of zloop.sh set to run for 2 days) after running them all night. So far there have been 4 crashes, all of which have been in zdb. Its also worth noting that one of the zloop instances hasn't crashed yet, which is surprising from my experience with it. Of the 4 failing test slaves, 1 failed in splat of all things. Something there looks a bit fishy; libc segfaulted during the splat test. Another failed slave died with ztest timing out, which I have also never seen before. The other 2 slaves had 3 legitimate ztest crashes between them which isn't bad considering we had 6 test instances running for 2 hours. I'm going to say I'm relatively sure the problem is fixed. @behlendorf Let me know if your tests prove otherwise. |
Test ran all weekend. Several failures, mostly from zdb, but metaslab_fastwrite_unmark() seems to be fixed. |
Similar results on my end after a weekend of testing. A few unrelated failures but it definitely looks like you've resolved this issue. I'll get it merged today! |
Thanks for fixing my code, @tcaputi! I can only apologize for not having found the time to find and fix my own mistakes. |
No problem, lets just get it in so my tests on the ZFS encryption PR stops looking like it failed. |
Merged as: 77943bc Fix for metaslab_fastwrite_unmark() assert failure |
Fixes Issue #4267
Currently there is an issue where
metaslab_fastwrite_unmark()
unmarks fastwrites onvdev_t
's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that thevdev_pending_fastwrite
field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assignsvdev_pending_fastwrite
from the old vdev to the new one so this count is not lost.