
Fix for metaslab_fastwrite_unmark() assert failure #4870

Closed
wants to merge 1 commit into from

Conversation

tcaputi
Contributor

@tcaputi tcaputi commented Jul 22, 2016

Fixes Issue #4267

Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost.

@behlendorf
Contributor

Nice work! This one has been a thorn in our side for far too long. Your proposed fix looks entirely reasonable to me and nicely explains the problem. I'll do some local testing over the weekend to verify the patch as well.

I'd suggest updating the patch when you get a chance to include a clear description of the problem, and adding the following line to the commit message. This will instruct the buildbot to run zloop.sh for 2 hours when testing this commit instead of the default 15 minutes. That should give us a really good indication of whether it's been fixed.

TEST_ZTEST_TIMEOUT=7200

@tcaputi
Contributor Author

tcaputi commented Jul 22, 2016

@behlendorf Will do. To be clear, do you want the description in the commit message or the PR?

@tcaputi tcaputi force-pushed the metaslab_fastwrite_fix branch from d2ad59f to 181f22b Compare July 23, 2016 01:13
@tcaputi
Contributor Author

tcaputi commented Jul 23, 2016

I updated the pull as you requested. By the way, is there a way to look at the core dumps from the ztest runs that buildbot does? I see that there will be ztest failures when everything is done, but I know this problem isn't the only issue plaguing ztest. Is there any way I can confirm this particular failure is fixed?

@behlendorf
Contributor

Yes, a good commit message describing the problem and fix. Your comment from above in the PR would be fine.

So zloop.sh is supposed to dump all backtraces from the core dump to the buildbot logs. In practice, I've seen this not work reliably on all the Linux distributions. It seems to work more often than not on Debian 8 but not elsewhere; I haven't had a chance to run down the root cause. You can run the exact same script locally; grab it from here:

https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/bb-test-ztest.sh

You're right. To be absolutely sure this is no longer happening and those failures are unrelated, we'll need to at least get backtraces out of the buildbot for the remaining ztest failures.

@tcaputi
Contributor Author

tcaputi commented Jul 23, 2016

@behlendorf OK. Well, I'll get my machine running tests over the weekend, and when I get in to work on Monday I'll look through the logs and make sure anything that happens is unrelated.

Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

TEST_ZTEST_TIMEOUT=7200
@tcaputi tcaputi force-pushed the metaslab_fastwrite_fix branch from 181f22b to 60e9c7d Compare July 23, 2016 04:02
@tcaputi
Contributor Author

tcaputi commented Jul 23, 2016

I checked in on my tests (2 instances of zloop.sh set to run for 2 days) after running them all night. So far there have been 4 crashes, all of which have been in zdb. It's also worth noting that one of the zloop instances hasn't crashed yet, which is surprising given my experience with it.

Of the 4 failing test slaves, 1 failed in splat of all things. Something there looks a bit fishy; libc segfaulted during the splat test. Another failed slave died with ztest timing out, which I have also never seen before. The other 2 slaves had 3 legitimate ztest crashes between them, which isn't bad considering we had 6 test instances running for 2 hours.

I'm going to say I'm relatively sure the problem is fixed.

@behlendorf Let me know if your tests prove otherwise.

@tcaputi
Contributor Author

tcaputi commented Jul 25, 2016

Test ran all weekend. Several failures, mostly from zdb, but metaslab_fastwrite_unmark() seems to be fixed.

@behlendorf
Contributor

Similar results on my end after a weekend of testing. A few unrelated failures but it definitely looks like you've resolved this issue. I'll get it merged today!

@dechamps
Contributor

Thanks for fixing my code, @tcaputi! I can only apologize for not having found the time to find and fix my own mistakes.

@tcaputi
Contributor Author

tcaputi commented Jul 25, 2016

No problem, let's just get it in so my tests on the ZFS encryption PR stop looking like they failed.

@behlendorf
Contributor

Merged as:

77943bc Fix for metaslab_fastwrite_unmark() assert failure

@behlendorf behlendorf closed this Jul 26, 2016