
Fix for metaslab_fastwrite_unmark() assert failure #4870

Closed
wants to merge 1 commit into from

Conversation

tcaputi
Contributor

@tcaputi tcaputi commented Jul 22, 2016

Fixes Issue #4267

Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost.

@behlendorf
Contributor

Nice work! This one has been a thorn in our side for far too long. Your proposed fix looks entirely reasonable to me and nicely explains the problem. I'll do some local testing over the weekend to verify the patch as well.

I'd suggest updating the patch when you get a chance to include a clear description of the problem, and adding the following line to the commit message. This will instruct the buildbot to run zloop.sh for 2 hours when testing this commit instead of the default 15 minutes. That should give us a really good indication of whether it's been fixed.

TEST_ZTEST_TIMEOUT=7200

@tcaputi
Contributor Author

tcaputi commented Jul 22, 2016

@behlendorf Will do. To be clear, do you want the description in the commit message or the PR?

@tcaputi tcaputi force-pushed the metaslab_fastwrite_fix branch from d2ad59f to 181f22b Compare July 23, 2016 01:13
@tcaputi
Contributor Author

tcaputi commented Jul 23, 2016

I updated the pull as you requested. By the way, is there a way to look at the core dumps from the ztest runs that buildbot does? I see that there will be ztest failures when everything is done, but I know this problem isn't the only issue plaguing ztest. Is there any way I can confirm this particular failure is fixed?

@behlendorf
Contributor

Yes, a good commit message describing the problem and fix. Your comment from above in the PR would be fine.

So zloop.sh is supposed to dump all backtraces from the core dump to the buildbot logs. In practice, I've seen this not work reliably on all the Linux distributions. It seems to work more often than not on Debian 8 but not elsewhere; I haven't had a chance to run down the root cause. You can run the exact same script locally; grab it from here:

https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/bb-test-ztest.sh

You're right. To be absolutely sure this is no longer happening and those failures are unrelated, we'll need to at least get backtraces out of the buildbot for the remaining ztest failures.

@tcaputi
Contributor Author

tcaputi commented Jul 23, 2016

@behlendorf OK. Well, I'll get my machine running tests over the weekend, and when I get in to work on Monday I'll look through the logs and make sure anything that happens is unrelated.

Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

TEST_ZTEST_TIMEOUT=7200
@tcaputi tcaputi force-pushed the metaslab_fastwrite_fix branch from 181f22b to 60e9c7d Compare July 23, 2016 04:02
@tcaputi
Contributor Author

tcaputi commented Jul 23, 2016

I checked in on my tests (2 instances of zloop.sh set to run for 2 days) after running them all night. So far there have been 4 crashes, all of which have been in zdb. It's also worth noting that one of the zloop instances hasn't crashed yet, which is surprising given my experience with it.

Of the 4 failing test slaves, 1 failed in splat of all things. Something there looks a bit fishy; libc segfaulted during the splat test. Another failed slave died with ztest timing out, which I have also never seen before. The other 2 slaves had 3 legitimate ztest crashes between them, which isn't bad considering we had 6 test instances running for 2 hours.

I'm going to say I'm relatively sure the problem is fixed.

@behlendorf Let me know if your tests prove otherwise.

@tcaputi
Contributor Author

tcaputi commented Jul 25, 2016

Test ran all weekend. Several failures, mostly from zdb, but metaslab_fastwrite_unmark() seems to be fixed.

@behlendorf
Contributor

Similar results on my end after a weekend of testing. A few unrelated failures but it definitely looks like you've resolved this issue. I'll get it merged today!

@dechamps
Contributor

Thanks for fixing my code, @tcaputi! I can only apologize for not having found the time to find and fix my own mistakes.

@tcaputi
Contributor Author

tcaputi commented Jul 25, 2016

No problem, let's just get it in so my tests on the ZFS encryption PR stop looking like they failed.

@behlendorf
Contributor

Merged as:

77943bc Fix for metaslab_fastwrite_unmark() assert failure

@behlendorf behlendorf closed this Jul 26, 2016