Fix Linux 4.1 compat regarding loop device on ZFS #3651

tuxoko · 2015-07-30T15:21:50Z

Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into
iter_read/iter_write. Notably, the loop device will pass bio_vec to backend
filesystem. However, current ZFS code assumes iovec without any check, so it
will always crash when using loop device.

With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC
set. The uio* functions are modified to handle bio_vec case separately.

The const uio_iov causes some warning in xuio related stuff, so explicit
convert them to non const.

Signed-off-by: Chunwei Chen [email protected]

tuxoko · 2015-07-30T15:23:00Z

Depends on openzfs/spl#468
Fix #3511 #3640

behlendorf · 2015-07-31T00:51:07Z

@tuxoko Nice fix! Thanks for tackling this issue.

tuxoko · 2015-07-31T05:43:57Z

update: account for iov_iter->iov_offset and do iov_iter_advance

l1k · 2015-07-31T07:31:03Z

Tested-by: Lukas Wunner [email protected]
Fixes: #3612

WFM, great job @tuxoko!

ryao · 2015-07-31T17:42:56Z

module/zcommon/zfs_uio.c

+		void *paddr;
+		cnt = MIN(bv->bv_len - skip, n);
+
+		paddr = zfs_kmap_atomic(bv->bv_page, KM_USER1);


This is unsafe on older kernels because interrupts must be disabled to prevent another thread from using the CPU's KM_USER1 slot. This is likely okay on newer kernels where the slots are dynamically allocated.

This is not true, KM_USER* are not used in interrupt context since... well it's "user". So it's perfectly safe.

When CONFIG_PREEMPT is set, the kernel scheduler should be able to preempt us. If the scheduler preempts us and schedules us to another CPU or runs something else on this CPU that uses this, the value will be clobbered. See the comment for __schedule:

http://lxr.free-electrons.com/source/kernel/sched/core.c#L2693

All we need is an IRQ to occur when the scheduler thinks something else should run. As the comment for __schedule() indicates, the code in arch/x86/entry/entry_64.S will call into the scheduler on return from the IRQ and preempt_schedule_irq() will be called. That will call schedule() and something else can have the CPU.

You are correct that there is code in the mainline kernel doing this, but I think those routines will need to be changed for the same reason. The reason why they have not yet been changed is that these events are rare enough that no one has encountered them.

@tuxoko It seems that I used the wrong term. It is unsafe to use kmap_atomic in a preemptible context. An interrupt context is definitely unsafe. Also, my statement about it being okay on newer kernels was wrong.

@ryao
kmap_atomic does turn off preemption itself.

@tuxoko Where does it do that?

@ryao
http://lxr.free-electrons.com/source/include/linux/uaccess.h#L16

7 /* 8 * These routines enable/disable the pagefault handler in that 9 * it will not take any locks and go straight to the fixup table. 10 * 11 * They have great resemblance to the preempt_disable/enable calls 12 * and in fact they are identical; this is because currently there is 13 * no other way to make the pagefault handlers do this. So we do 14 * disable preemption but we don't necessarily care about that. 15 */ 16 static inline void pagefault_disable(void)

tuxoko · 2015-07-31T18:12:44Z

@ryao
Yes, there's inconsistency in the code base. I'm not sure if there's already a rule for this.
But personally, I prefer memcpy and friend because:

bcopy is deprecated from POSIX, while memcpy is in standard.
The src/dst argument for memcpy is consistent with stuffs like strcpy.

ryao · 2015-07-31T20:21:18Z

@tuxoko POSIX doesn't apply to kernel code and the other platforms are using bcopy(). We should use it instead of memcpy() for consistency with current practice.

tuxoko · 2015-08-03T08:29:13Z

@ryao
Well..., zfs isn't kernel only code. But I change it to bcopy anyway.

Starting from linux-2.6.37, {kmap,kunmap}_atomic takes 1 argument instead of 2. We use zfs_{kmap,kunmap}_atomic as wrappers and always take 2 argument, but ignore the 2nd for newer kernel. Signed-off-by: Chunwei Chen <[email protected]>

ryao · 2015-08-07T03:53:58Z

This looks fine to me. I have tested this on Linux 4.1.3 by applying it to spl-0.6.4.2. Using losetup on a from my root pool works fine with it, but hangs my laptop without it.

infowolfe · 2015-08-12T20:17:49Z

working with sys-kernel/aufs-sources-4.1.5

behlendorf · 2015-08-17T18:44:54Z

@tuxoko aside from fixing the bug this is a really nice improvement! The patch looks good to me but let me manually to it against a wider range of kernels before merging it.

I also agree there's no issue with using kmap_atomic() here, this is exactly what it's designed for. As for using bcopy() or memcpy() a case could probably be made for either but it doesn't really matter since this is in ZoL specific code.

kernelOfTruth · 2015-08-17T20:47:37Z

Causes trouble with Chromium, Chrome and bookmark handling:

https://forums.gentoo.org/viewtopic-t-1026884.html [ZFS loop fix [upstream] breaks chromium bookmark handling ?]

the bookmarks can't be displayed or are even corrupt (they are in the .config subfolder but Chromium asks for bookmark import and shows no bookmarks),

rolling back a previously created snapshot (on other kernel, or without modifying bookmarks) helps but this isn't a long-term solution.

I hope that there wasn't any additional breakage or file corruption :/

behlendorf · 2015-08-17T23:13:34Z

@tuxoko with a trivial fix to the SPL part of this patch set it builds cleanly on all the buildbots and passes all the automated testing.

@kernelOfTruth thanks for commenting on the issue with chromium. We'll definitely want to explain that before merging these changes. Do you happen to have another reproducer which is easier to run which shows the issue?

tuxoko · 2015-08-18T02:14:07Z

@kernelOfTruth
Would remove iov_iter_advance in zpl_iter_read and zpl_iter_write fix this issue?

kernelOfTruth · 2015-08-18T03:43:19Z

@tuxoko like so:

kernelOfTruth@82db4a1

?

unfortunately no :(

tuxoko · 2015-08-18T04:04:55Z

@kernelOfTruth
You need to remove the if (ret > 0) too.

This would be very strange if every thing except chromium works fine.
I'll try to test it on my VM later.

kernelOfTruth · 2015-08-18T04:07:22Z

@tuxoko alright, will test again once I did at least one backup & modify accordingly

thanks

tuxoko · 2015-08-18T06:05:46Z

@kernelOfTruth
I'm not sure if I understand correctly. Does the issue only happens when you use both this patch and linux-4.1.5? But it's OK for this patch on older kernel and also linux-4.1.5 without this patch.

Edit: Also, I couldn't reproduce the bug with linux-4.1.5 on my ubuntu VM.

kernelOfTruth · 2015-08-18T12:15:06Z

@tuxoko

Just saw that the zfs branch I was observing these issues also contains #3672 (configure bdi_setup_and_register: don't blow config stack #3672)

https://github.com/kernelOfTruth/zfs/commits/zfs_kOT_13.08.2015 (minus the last commit)

so that actually could be the real cause (I'm running a branch without it and this pull-request - having accidentally forgotten to add),

it was too late the last time I wrote, so oversaw it :(

It happens with 4.1.4 to 4.1.6:

That's the range I started testing these changes and when I observed it, first I added the patches with the upgrade to 4.1.5, then also 4.1.6 - finally went back to 4.1.4, the "working" kernel and updated ZFS to the state used with the 4.1.5+ kernel and also observed it there.

First I had assumed that it was related to the up-to-date chrome & chromium versions, GCC (went also back from 5.2 to 4.9.3) or even CFLAGS - but to no avail. Only with going back to a previous patchset iteration without the loop fix and @chrisrd 's patch the behavior disappeared for good.

Once I've time again (currently /home is being scrubbed and I'm busy) I'll also test it against latest zfs master to see whether it's caused by the other patch stacks I'm using.

sorry for potentially leading you on a wrong track

kernelOfTruth · 2015-08-18T13:03:34Z

There's also something fishy going on with git's history (actually bash history) - not sure if it's connected but suddenly lots of entries are missing up to February of this year ...

@behlendorf unfortunately that's the only reproducer so far - I'll see if I can encircle it either to this pull request or @chrisrd 's that reduces stack usage (will add that solely in the next testing round)

kernelOfTruth · 2015-08-19T15:52:44Z

alright, just pulled the latest 2 commits from master into the tree from @chrisrd

and it's definitely this pull-request (the loop fix) that causes this issue with chromium,

will proceed to remove the

iov_iter_advance

statement and re-test ...

tuxoko · 2015-08-19T16:25:38Z

@kernelOfTruth
I've just found an error. Please have a go with the new patch.

kernelOfTruth · 2015-08-19T16:27:32Z

@tuxoko sure, will do

I just did a test with

iov_iter_advance

removed and more conservative CFLAGS

and that lead to no change

so now adding the new commit

Thanks !

tuxoko · 2015-08-19T16:41:25Z

@kernelOfTruth
I think I've got it. I just pull you branch and found out you seems to have ABD in it.
Unfortunately, I haven't update ABD for this fix, so it's currently incompatible.
I'll update the ABD tomorrow. Meanwhile, you don't have to waste time test anything.

kernelOfTruth · 2015-08-19T16:55:10Z

@tuxoko I was slowly suspecting that the problem has to do with ABD,

sorry, should have mentioned explicitly that I'm running a branch with it

Thanks

sempervictus · 2015-08-19T17:09:30Z

@tuxoko: thanks for the heads up, i just got to the point where i realized this PR might be what's causing #3684 and the obscene amount of data corruption i was seeing - inodes were being mauled on disk which hadn't been written to since this was deployed(gcc claimed it couldnt create elfs, git packs rended like bunnies in a blender, aptitude package index nuked, and a pool created with this exhibiting the worst behaviors, all sorts of fun stuff). This at least makes me think the SSE vectorized parity calculation branch is not the culprit, but i'll need to do more testing when my team gets to the DC and resets the "very locked-up" system i've been abusing for this stuff.

behlendorf · 2015-08-19T17:29:46Z

That explains things. Has anyone observed problems when running with just this patch?

l1k · 2015-08-19T18:58:06Z

When I tested the patch on July 31, a kernel compile failed 100% reproducably when linking vmlinux. I didn't report this here because I was unsure what the cause was: It was very hot in Germany back then (around 40° C) and I suspected that the RAM was either permanently damaged or had temporary issues due to the temperatures. I reverted the patch, waited for the machine to cool down, then retried a compile with "make -j1" to keep the temperature low and it worked fine. The patch did solve the loopback issue, this I had verified successfully. To this day I'm not sure what caused the kernel compile to fail but reading the other reports here I do get the impression this may be linked. I had used the vanilla ZoL code out of Turbo's Debian dailies repo, I believe it was zfsonlinux/pkg-zfs@2e97348.

behlendorf · 2015-08-20T00:20:29Z

Thus far I haven't been able to reproduce any issues when testing the updated version of this patch. It consistently passes:

All the automated buildbot testing. I've run it though multiple times and haven't observed any new issues.
Repeated builds of the Linux kernel on a ZFS filesystem (Running kernel Linux 4.2)
The fio surface_scan data integrity check (Running kernel Linux 4.2)
fsx, dbench, iozone (Running kernel Linux 4.2)

I'm going to leave much of the above testing running but I think it's safe to say the reported issues were due to mixing this with ABD which was known not to work. I'd love to get this merged in the next few days so any additional testing people can offer as always is appreciated!

Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into iter_read/iter_write. Notably, the loop device will pass bio_vec to backend filesystem. However, current ZFS code assumes iovec without any check, so it will always crash when using loop device. With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC set. The uio* functions are modified to handle bio_vec case separately. The const uio_iov causes some warning in xuio related stuff, so explicit convert them to non const. Signed-off-by: Chunwei Chen <[email protected]>

tuxoko · 2015-08-20T02:34:44Z

Update: nothing changed, just squashed the previous bvec fix.

infowolfe · 2015-08-21T14:25:13Z

I've been running this happily as 4.1.5-aufs on an X4540, Mellanox ConnectX EN 10GbE + Mellanox ConnectX VPI QDR IB both raidz2 (zil/l2arc on NVMe Intel 750) and raid10 (12x scratch disks) under varying loads (TBs of data in/out) without any issues.
Linux dojo 4.1.5-aufs #1 SMP Wed Aug 12 19:47:36 UTC 2015 x86_64 Six-Core AMD Opteron(tm) Processor 2435 AuthenticAMD GNU/Linux

behlendorf · 2015-08-24T17:23:27Z

After a weekend of testing without observing any issues and independent confirmation of the fix I'm comfortable merging this fix. Merged as: Thanks @tuxoko.

5475aad Linux 4.1 compat: loop device on ZFS
17888ae Add compatibility layer for {kmap,kunmap}_atomic
openzfs/spl@ae89cf0 Restructure uio to accommodate bio_vec

tuxoko force-pushed the loop_fix branch from 71e4ab0 to 07be6c9 Compare July 30, 2015 17:07

behlendorf added this to the 0.6.5 milestone Jul 31, 2015

tuxoko force-pushed the loop_fix branch from 07be6c9 to 1881554 Compare July 31, 2015 05:43

tuxoko force-pushed the loop_fix branch from 1881554 to 811b9c1 Compare July 31, 2015 06:02

l1k mentioned this pull request Jul 31, 2015

losetup/truecrypt/cryptsetup hard freeze on zfs and kernel 4.1 #3612

Closed

ryao reviewed Jul 31, 2015
View reviewed changes

tuxoko force-pushed the loop_fix branch from 3e35f5a to e51f422 Compare August 3, 2015 08:28

Add compatibility layer for {kmap,kunmap}_atomic

68b1acf

Starting from linux-2.6.37, {kmap,kunmap}_atomic takes 1 argument instead of 2. We use zfs_{kmap,kunmap}_atomic as wrappers and always take 2 argument, but ignore the 2nd for newer kernel. Signed-off-by: Chunwei Chen <[email protected]>

tuxoko force-pushed the loop_fix branch from e51f422 to 15da610 Compare August 3, 2015 08:35

kernelOfTruth mentioned this pull request Aug 18, 2015

Restructure uio to accommodate bio_vec openzfs/spl#468

Closed

behlendorf added the Bug - Major label Aug 19, 2015

behlendorf mentioned this pull request Aug 19, 2015

Large KMem Allocation When Using ZVOL on RAIDZ2 atop an LSI 1078 (MegaRAID 8888ELP) #3684

Closed

tuxoko force-pushed the loop_fix branch from e6fb56d to ed790a8 Compare August 20, 2015 02:34

This was referenced Aug 20, 2015

ABD on 0.6.4 stable branch #3696

Closed

ABD: linear/scatter dual typed buffer for ARC (ver 2) #3441

Closed

behlendorf closed this Aug 24, 2015

kernelOfTruth mentioned this pull request Sep 8, 2015

running "mount -o loop" system hang #3751

Closed

DeHackEd mentioned this pull request Sep 19, 2015

Mounting iso-file from ZFS hangs system #3802

Closed

Fix Linux 4.1 compat regarding loop device on ZFS #3651

Fix Linux 4.1 compat regarding loop device on ZFS #3651

Conversation

tuxoko commented Jul 30, 2015

tuxoko commented Jul 30, 2015

behlendorf commented Jul 31, 2015

tuxoko commented Jul 31, 2015

l1k commented Jul 31, 2015

ryao Jul 31, 2015

Choose a reason for hiding this comment

tuxoko Jul 31, 2015

Choose a reason for hiding this comment

ryao Jul 31, 2015

Choose a reason for hiding this comment

ryao Jul 31, 2015

Choose a reason for hiding this comment

tuxoko Aug 3, 2015

Choose a reason for hiding this comment

ryao Aug 4, 2015

Choose a reason for hiding this comment

tuxoko Aug 5, 2015

Choose a reason for hiding this comment

tuxoko commented Jul 31, 2015

ryao commented Jul 31, 2015

tuxoko commented Aug 3, 2015

ryao commented Aug 7, 2015

infowolfe commented Aug 12, 2015

behlendorf commented Aug 17, 2015

kernelOfTruth commented Aug 17, 2015

behlendorf commented Aug 17, 2015

tuxoko commented Aug 18, 2015

kernelOfTruth commented Aug 18, 2015

tuxoko commented Aug 18, 2015

kernelOfTruth commented Aug 18, 2015

tuxoko commented Aug 18, 2015

kernelOfTruth commented Aug 18, 2015

kernelOfTruth commented Aug 18, 2015

kernelOfTruth commented Aug 19, 2015

tuxoko commented Aug 19, 2015

kernelOfTruth commented Aug 19, 2015

tuxoko commented Aug 19, 2015

kernelOfTruth commented Aug 19, 2015

sempervictus commented Aug 19, 2015

behlendorf commented Aug 19, 2015

l1k commented Aug 19, 2015

behlendorf commented Aug 20, 2015

tuxoko commented Aug 20, 2015

infowolfe commented Aug 21, 2015

behlendorf commented Aug 24, 2015