Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Linux 4.1 compat regarding loop device on ZFS #3651

Closed
wants to merge 2 commits into from

Conversation

tuxoko
Copy link
Contributor

@tuxoko tuxoko commented Jul 30, 2015

Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into
iter_read/iter_write. Notably, the loop device will pass bio_vec to backend
filesystem. However, current ZFS code assumes iovec without any check, so it
will always crash when using loop device.

With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC
set. The uio* functions are modified to handle bio_vec case separately.

The const uio_iov causes some warning in xuio related stuff, so explicit
convert them to non const.

Signed-off-by: Chunwei Chen [email protected]

@tuxoko
Copy link
Contributor Author

tuxoko commented Jul 30, 2015

Depends on openzfs/spl#468
Fix #3511 #3640

@behlendorf
Copy link
Contributor

@tuxoko Nice fix! Thanks for tackling this issue.

@behlendorf behlendorf added this to the 0.6.5 milestone Jul 31, 2015
@tuxoko
Copy link
Contributor Author

tuxoko commented Jul 31, 2015

update: account for iov_iter->iov_offset and do iov_iter_advance

@l1k
Copy link
Contributor

l1k commented Jul 31, 2015

Tested-by: Lukas Wunner [email protected]
Fixes: #3612

WFM, great job @tuxoko!

void *paddr;
cnt = MIN(bv->bv_len - skip, n);

paddr = zfs_kmap_atomic(bv->bv_page, KM_USER1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsafe on older kernels because interrupts must be disabled to prevent another thread from using the CPU's KM_USER1 slot. This is likely okay on newer kernels where the slots are dynamically allocated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true, KM_USER* are not used in interrupt context since... well it's "user". So it's perfectly safe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When CONFIG_PREEMPT is set, the kernel scheduler should be able to preempt us. If the scheduler preempts us and schedules us to another CPU or runs something else on this CPU that uses this, the value will be clobbered. See the comment for __schedule:

http://lxr.free-electrons.com/source/kernel/sched/core.c#L2693

All we need is an IRQ to occur when the scheduler thinks something else should run. As the comment for __schedule() indicates, the code in arch/x86/entry/entry_64.S will call into the scheduler on return from the IRQ and preempt_schedule_irq() will be called. That will call schedule() and something else can have the CPU.

You are correct that there is code in the mainline kernel doing this, but I think those routines will need to be changed for the same reason. The reason why they have not yet been changed is that these events are rare enough that no one has encountered them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tuxoko It seems that I used the wrong term. It is unsafe to use kmap_atomic in a preemptible context. An interrupt context is definitely unsafe. Also, my statement about it being okay on newer kernels was wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryao
kmap_atomic does turn off preemption itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tuxoko Where does it do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryao
http://lxr.free-electrons.com/source/include/linux/uaccess.h#L16

  7 /*
  8  * These routines enable/disable the pagefault handler in that
  9  * it will not take any locks and go straight to the fixup table.
 10  *
 11  * They have great resemblance to the preempt_disable/enable calls
 12  * and in fact they are identical; this is because currently there is
 13  * no other way to make the pagefault handlers do this. So we do
 14  * disable preemption but we don't necessarily care about that.
 15  */
 16 static inline void pagefault_disable(void)

@tuxoko
Copy link
Contributor Author

tuxoko commented Jul 31, 2015

@ryao
Yes, there's inconsistency in the code base. I'm not sure if there's already a rule for this.
But personally, I prefer memcpy and friend because:

  1. bcopy is deprecated from POSIX, while memcpy is in standard.
  2. The src/dst argument for memcpy is consistent with stuffs like strcpy.

@ryao
Copy link
Contributor

ryao commented Jul 31, 2015

@tuxoko POSIX doesn't apply to kernel code and the other platforms are using bcopy(). We should use it instead of memcpy() for consistency with current practice.

@tuxoko
Copy link
Contributor Author

tuxoko commented Aug 3, 2015

@ryao
Well..., zfs isn't kernel only code. But I change it to bcopy anyway.

Starting from linux-2.6.37, {kmap,kunmap}_atomic takes 1 argument instead of 2.
We use zfs_{kmap,kunmap}_atomic as wrappers and always take 2 argument, but
ignore the 2nd for newer kernel.

Signed-off-by: Chunwei Chen <[email protected]>
@ryao
Copy link
Contributor

ryao commented Aug 7, 2015

This looks fine to me. I have tested this on Linux 4.1.3 by applying it to spl-0.6.4.2. Using losetup on a from my root pool works fine with it, but hangs my laptop without it.

@infowolfe
Copy link

working with sys-kernel/aufs-sources-4.1.5

@behlendorf
Copy link
Contributor

@tuxoko aside from fixing the bug this is a really nice improvement! The patch looks good to me but let me manually to it against a wider range of kernels before merging it.

I also agree there's no issue with using kmap_atomic() here, this is exactly what it's designed for. As for using bcopy() or memcpy() a case could probably be made for either but it doesn't really matter since this is in ZoL specific code.

@kernelOfTruth
Copy link
Contributor

Causes trouble with Chromium, Chrome and bookmark handling:

https://forums.gentoo.org/viewtopic-t-1026884.html [ZFS loop fix [upstream] breaks chromium bookmark handling ?]

the bookmarks can't be displayed or are even corrupt (they are in the .config subfolder but Chromium asks for bookmark import and shows no bookmarks),

rolling back a previously created snapshot (on other kernel, or without modifying bookmarks) helps but this isn't a long-term solution.

I hope that there wasn't any additional breakage or file corruption :/

@behlendorf
Copy link
Contributor

@tuxoko with a trivial fix to the SPL part of this patch set it builds cleanly on all the buildbots and passes all the automated testing.

@kernelOfTruth thanks for commenting on the issue with chromium. We'll definitely want to explain that before merging these changes. Do you happen to have another reproducer which is easier to run which shows the issue?

@tuxoko
Copy link
Contributor Author

tuxoko commented Aug 18, 2015

@kernelOfTruth
Would remove iov_iter_advance in zpl_iter_read and zpl_iter_write fix this issue?

@kernelOfTruth
Copy link
Contributor

@tuxoko like so:

kernelOfTruth@82db4a1

?

unfortunately no :(

@tuxoko
Copy link
Contributor Author

tuxoko commented Aug 18, 2015

@kernelOfTruth
You need to remove the if (ret > 0) too.

This would be very strange if every thing except chromium works fine.
I'll try to test it on my VM later.

@kernelOfTruth
Copy link
Contributor

@tuxoko alright, will test again once I did at least one backup & modify accordingly

thanks

@tuxoko
Copy link
Contributor Author

tuxoko commented Aug 18, 2015

@kernelOfTruth
I'm not sure if I understand correctly. Does the issue only happens when you use both this patch and linux-4.1.5? But it's OK for this patch on older kernel and also linux-4.1.5 without this patch.

Edit: Also, I couldn't reproduce the bug with linux-4.1.5 on my ubuntu VM.

@kernelOfTruth
Copy link
Contributor

@tuxoko

Just saw that the zfs branch I was observing these issues also contains #3672 (configure bdi_setup_and_register: don't blow config stack #3672)

https://github.com/kernelOfTruth/zfs/commits/zfs_kOT_13.08.2015 (minus the last commit)

so that actually could be the real cause (I'm running a branch without it and this pull-request - having accidentally forgotten to add),

it was too late the last time I wrote, so oversaw it :(

It happens with 4.1.4 to 4.1.6:

That's the range I started testing these changes and when I observed it, first I added the patches with the upgrade to 4.1.5, then also 4.1.6 - finally went back to 4.1.4, the "working" kernel and updated ZFS to the state used with the 4.1.5+ kernel and also observed it there.

First I had assumed that it was related to the up-to-date chrome & chromium versions, GCC (went also back from 5.2 to 4.9.3) or even CFLAGS - but to no avail. Only with going back to a previous patchset iteration without the loop fix and @chrisrd 's patch the behavior disappeared for good.

Once I've time again (currently /home is being scrubbed and I'm busy) I'll also test it against latest zfs master to see whether it's caused by the other patch stacks I'm using.

sorry for potentially leading you on a wrong track

@kernelOfTruth
Copy link
Contributor

There's also something fishy going on with git's history (actually bash history) - not sure if it's connected but suddenly lots of entries are missing up to February of this year ...

@behlendorf unfortunately that's the only reproducer so far - I'll see if I can encircle it either to this pull request or @chrisrd 's that reduces stack usage (will add that solely in the next testing round)

@kernelOfTruth
Copy link
Contributor

alright, just pulled the latest 2 commits from master into the tree from @chrisrd

and it's definitely this pull-request (the loop fix) that causes this issue with chromium,

will proceed to remove the

iov_iter_advance

statement and re-test ...

@tuxoko
Copy link
Contributor Author

tuxoko commented Aug 19, 2015

@kernelOfTruth
I've just found an error. Please have a go with the new patch.

@kernelOfTruth
Copy link
Contributor

@tuxoko sure, will do

I just did a test with

iov_iter_advance

removed and more conservative CFLAGS

and that lead to no change

so now adding the new commit

Thanks !

@tuxoko
Copy link
Contributor Author

tuxoko commented Aug 19, 2015

@kernelOfTruth
I think I've got it. I just pull you branch and found out you seems to have ABD in it.
Unfortunately, I haven't update ABD for this fix, so it's currently incompatible.
I'll update the ABD tomorrow. Meanwhile, you don't have to waste time test anything.

@kernelOfTruth
Copy link
Contributor

@tuxoko I was slowly suspecting that the problem has to do with ABD,

sorry, should have mentioned explicitly that I'm running a branch with it

Thanks

@sempervictus
Copy link
Contributor

@tuxoko: thanks for the heads up, i just got to the point where i realized this PR might be what's causing #3684 and the obscene amount of data corruption i was seeing - inodes were being mauled on disk which hadn't been written to since this was deployed(gcc claimed it couldnt create elfs, git packs rended like bunnies in a blender, aptitude package index nuked, and a pool created with this exhibiting the worst behaviors, all sorts of fun stuff). This at least makes me think the SSE vectorized parity calculation branch is not the culprit, but i'll need to do more testing when my team gets to the DC and resets the "very locked-up" system i've been abusing for this stuff.

@behlendorf
Copy link
Contributor

That explains things. Has anyone observed problems when running with just this patch?

@l1k
Copy link
Contributor

l1k commented Aug 19, 2015

When I tested the patch on July 31, a kernel compile failed 100% reproducably when linking vmlinux. I didn't report this here because I was unsure what the cause was: It was very hot in Germany back then (around 40° C) and I suspected that the RAM was either permanently damaged or had temporary issues due to the temperatures. I reverted the patch, waited for the machine to cool down, then retried a compile with "make -j1" to keep the temperature low and it worked fine. The patch did solve the loopback issue, this I had verified successfully. To this day I'm not sure what caused the kernel compile to fail but reading the other reports here I do get the impression this may be linked. I had used the vanilla ZoL code out of Turbo's Debian dailies repo, I believe it was zfsonlinux/pkg-zfs@2e97348.

@behlendorf
Copy link
Contributor

Thus far I haven't been able to reproduce any issues when testing the updated version of this patch. It consistently passes:

  • All the automated buildbot testing. I've run it though multiple times and haven't observed any new issues.
  • Repeated builds of the Linux kernel on a ZFS filesystem (Running kernel Linux 4.2)
  • The fio surface_scan data integrity check (Running kernel Linux 4.2)
  • fsx, dbench, iozone (Running kernel Linux 4.2)

I'm going to leave much of the above testing running but I think it's safe to say the reported issues were due to mixing this with ABD which was known not to work. I'd love to get this merged in the next few days so any additional testing people can offer as always is appreciated!

Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into
iter_read/iter_write. Notably, the loop device will pass bio_vec to backend
filesystem. However, current ZFS code assumes iovec without any check, so it
will always crash when using loop device.

With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC
set. The uio* functions are modified to handle bio_vec case separately.

The const uio_iov causes some warning in xuio related stuff, so explicit
convert them to non const.

Signed-off-by: Chunwei Chen <[email protected]>
@tuxoko
Copy link
Contributor Author

tuxoko commented Aug 20, 2015

Update: nothing changed, just squashed the previous bvec fix.

@infowolfe
Copy link

I've been running this happily as 4.1.5-aufs on an X4540, Mellanox ConnectX EN 10GbE + Mellanox ConnectX VPI QDR IB both raidz2 (zil/l2arc on NVMe Intel 750) and raid10 (12x scratch disks) under varying loads (TBs of data in/out) without any issues.
Linux dojo 4.1.5-aufs #1 SMP Wed Aug 12 19:47:36 UTC 2015 x86_64 Six-Core AMD Opteron(tm) Processor 2435 AuthenticAMD GNU/Linux

@behlendorf
Copy link
Contributor

After a weekend of testing without observing any issues and independent confirmation of the fix I'm comfortable merging this fix. Merged as: Thanks @tuxoko.

5475aad Linux 4.1 compat: loop device on ZFS
17888ae Add compatibility layer for {kmap,kunmap}_atomic
openzfs/spl@ae89cf0 Restructure uio to accommodate bio_vec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants