Revert "Implementation of SSE optimized Fletcher-4" #4861
Conversation
This reverts commit 35a76a0 to resolve the following panic which was introduced:

```
RIP  [<ffffffffa0241d11>] fletcher_4_sse2_fini+0x11/0xb0 [zcommon]
Pid: 813, comm: modprobe Tainted: P D -- ------------ 2.6.32-642.3.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81546c21>] ? panic+0xa7/0x179
 [<ffffffff8154baa4>] ? oops_end+0xe4/0x100
 [<ffffffff8101102b>] ? die+0x5b/0x90
 [<ffffffff8154b572>] ? do_general_protection+0x152/0x160
 [<ffffffff8100c4ae>] ? xen_hvm_callback_vector+0xe/0x20
 [<ffffffff8154ace5>] ? general_protection+0x25/0x30
 [<ffffffffa0241d00>] ? fletcher_4_sse2_fini+0x0/0xb0 [zcommon]
 [<ffffffffa0241d11>] ? fletcher_4_sse2_fini+0x11/0xb0 [zcommon]
 [<ffffffff81007bc1>] ? xen_clocksource_read+0x21/0x30
 [<ffffffff81007ca9>] ? xen_clocksource_get_cycles+0x9/0x10
 [<ffffffff810b1b65>] ? getrawmonotonic+0x35/0xc0
 [<ffffffffa0241077>] ? fletcher_4_init+0x227/0x260 [zcommon]
 [<ffffffff812a8490>] ? kvasprintf+0x70/0x90
 [<ffffffffa024d000>] ? zcommon_init+0x0/0xd [zcommon]
 [<ffffffffa024d009>] ? zcommon_init+0x9/0xd [zcommon]
 [<ffffffff810020d0>] ? do_one_initcall+0xc0/0x280
 [<ffffffff810c8371>] ? sys_init_module+0xe1/0x250
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
```

Disable zimport testing against master where this flaw exists:

```
TEST_ZIMPORT_VERSIONS="installed"
```

Signed-off-by: Brian Behlendorf <[email protected]>
@tj90241 @ironMann I may need to revert this change from master because it has introduced a panic under CentOS 6. It appears to be caused by the zfs_fletcher_sse_array alignment. Increasing the alignment from 16 to 32, like in the avx2 code, resolves the issue, but I'm not sure that's the best fix. What do you guys recommend?

```diff
diff --git a/module/zcommon/zfs_fletcher_sse.c b/module/zcommon/zfs_fletcher_sse.c
index 734ae08..a6a78c5 100644
--- a/module/zcommon/zfs_fletcher_sse.c
+++ b/module/zcommon/zfs_fletcher_sse.c
@@ -48,7 +48,7 @@
 #include <zfs_fletcher.h>
 
 struct zfs_fletcher_sse_array {
-	uint64_t v[2] __attribute__((aligned(16)));
+	uint64_t v[2] __attribute__((aligned(32)));
 };
 
 static void
```
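For context on why a missed alignment panics, here is a minimal standalone sketch (not from this PR) that checks whether a stack variable actually receives its requested 16-byte alignment; a misaligned address here is exactly the condition under which a movdqa store to that variable faults with a general protection error.

```c
/* Minimal sketch, assuming a GCC-style toolchain: prints whether a stack
 * variable carrying aligned(16) is actually 16-byte aligned at runtime. */
#include <stdio.h>
#include <stdint.h>

struct sse_array {
	uint64_t v[2] __attribute__((aligned(16)));
};

static void __attribute__((noinline))
check_alignment(void)
{
	struct sse_array a;

	/* If the compiler fails to realign the frame, this can print "NO". */
	printf("&a.v = %p, 16-byte aligned: %s\n", (void *)a.v,
	    ((uintptr_t)a.v & 0xf) == 0 ? "yes" : "NO");
}

int
main(void)
{
	check_alignment();
	return (0);
}
```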
@behlendorf This might be a compiler issue, failing to properly align stack variables when the incoming stack is not 16B aligned. Do you maybe have any object files around, or an objdump of the original code on that machine? The safest bet, and the right thing to do, is replacing all memory stores with unaligned variants. So in fini(): s/movdqa/movdqu/

EDIT: Leaving alignment at 16 to conserve stack space.
Apologies for being AWOL... What @ironMann suggested works. Looks like this was a bug in GCC that was fixed quasi-recently (in 4.6):
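As an aside, one hypothetical compiler-side workaround (not the fix this thread settled on) would be GCC's force_align_arg_pointer attribute, which makes a function realign its own frame; whether this papers over the pre-4.6 bug on the affected toolchain is an assumption, sketched below.

```c
/* Hypothetical alternative, not adopted in this PR: force the function to
 * realign its own stack frame so 16-byte-aligned locals are safe for
 * movdqa even when the incoming stack is only 8-byte aligned. The
 * attribute is a real GCC x86 feature; the body is illustrative. */
#include <stdint.h>

struct zfs_fletcher_sse_array {
	uint64_t v[2] __attribute__((aligned(16)));
};

static void __attribute__((force_align_arg_pointer, used))
fini_realigned(void)
{
	struct zfs_fletcher_sse_array a;

	/* With a realigned frame the aligned store should not fault. */
	asm volatile("movdqa %%xmm0, %0" : "=m" (a.v));
}
```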
@ironMann the compiler would be a very plausible explanation. I'm able to easily reproduce this with a stock CentOS 6 install which comes with the following gcc version.

Switching to the unaligned variants in fini appears to have resolved the issue as well. It looks like:

```diff
diff --git a/module/zcommon/zfs_fletcher_sse.c b/module/zcommon/zfs_fletcher_sse.c
index 734ae08..5db46fe 100644
--- a/module/zcommon/zfs_fletcher_sse.c
+++ b/module/zcommon/zfs_fletcher_sse.c
@@ -69,12 +69,12 @@ fletcher_4_sse2_fini(zio_cksum_t *zcp)
 	struct zfs_fletcher_sse_array a, b, c, d;
 	uint64_t A, B, C, D;
 
-	asm volatile("movdqa %%xmm0, %0":"=m" (a.v));
-	asm volatile("movdqa %%xmm1, %0":"=m" (b.v));
+	asm volatile("movdqu %%xmm0, %0":"=m" (a.v));
+	asm volatile("movdqu %%xmm1, %0":"=m" (b.v));
 	asm volatile("psllq $0x2, %xmm2");
-	asm volatile("movdqa %%xmm2, %0":"=m" (c.v));
+	asm volatile("movdqu %%xmm2, %0":"=m" (c.v));
 	asm volatile("psllq $0x3, %xmm3");
-	asm volatile("movdqa %%xmm3, %0":"=m" (d.v));
+	asm volatile("movdqu %%xmm3, %0":"=m" (d.v));
 
 	kfpu_end();
@@ -106,7 +106,7 @@ fletcher_4_sse2(const void *buf, uint64_t size, zio_cksum_t *zcp)
 	for (; ip < ipend; ip += 2) {
 		asm volatile("movdqu %0, %%xmm5" :: "m"(*ip));
-		asm volatile("movdqa %xmm5, %xmm6");
+		asm volatile("movdqu %xmm5, %xmm6");
 		asm volatile("punpckldq %xmm4, %xmm5");
 		asm volatile("punpckhdq %xmm4, %xmm6");
 		asm volatile("paddq %xmm5, %xmm0");
@@ -168,13 +168,13 @@ fletcher_4_ssse3_byteswap(const void *buf, uint64_t size, zio_cksum_t *zcp)
 	const uint64_t *ip = buf;
 	const uint64_t *ipend = (uint64_t *)((uint8_t *)ip + size);
 
-	asm volatile("movdqa %0, %%xmm7"::"m" (mask));
+	asm volatile("movdqu %0, %%xmm7"::"m" (mask));
 	asm volatile("pxor %xmm4, %xmm4");
 
 	for (; ip < ipend; ip += 2) {
 		asm volatile("movdqu %0, %%xmm5"::"m" (*ip));
 		asm volatile("pshufb %xmm7, %xmm5");
-		asm volatile("movdqa %xmm5, %xmm6");
+		asm volatile("movdqu %xmm5, %xmm6");
 		asm volatile("punpckldq %xmm4, %xmm5");
 		asm volatile("punpckhdq %xmm4, %xmm6");
 		asm volatile("paddq %xmm5, %xmm0");
```

@tj90241 perhaps you could propose a PR to address this, so we can avoid reverting the patch and instead just fix it.
Sure. Though: do we know if it's necessary to use movdqu in the core loop? While using unaligned loads for the fini isn't a problem (since it's only done once), unaligned memory accesses in the core loop could result in a performance hit.
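As a side note, if someone wanted to quantify that concern, a throwaway userspace microbenchmark along these lines could compare movdqa and movdqu loads over the same 16-byte-aligned buffer (names and sizes here are arbitrary; on most post-Nehalem CPUs movdqu on already-aligned data costs essentially nothing, but measuring is cheap):

```c
/* Hypothetical microbenchmark, not part of this PR. Build: gcc -O2 bench.c */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint8_t buf[1 << 20] __attribute__((aligned(16)));

static inline uint64_t
rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return (((uint64_t)hi << 32) | lo);
}

int
main(void)
{
	size_t i;
	uint64_t t0, t1, t2;

	memset(buf, 0xab, sizeof (buf));

	/* Aligned loads over an aligned buffer. */
	t0 = rdtsc();
	for (i = 0; i < sizeof (buf); i += 16)
		asm volatile("movdqa %0, %%xmm0" :: "m"(buf[i]) : "xmm0");
	t1 = rdtsc();

	/* Unaligned-capable loads over the same (aligned) buffer. */
	for (i = 0; i < sizeof (buf); i += 16)
		asm volatile("movdqu %0, %%xmm0" :: "m"(buf[i]) : "xmm0");
	t2 = rdtsc();

	printf("movdqa: %llu cycles, movdqu: %llu cycles\n",
	    (unsigned long long)(t1 - t0), (unsigned long long)(t2 - t1));
	return (0);
}
```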
@behlendorf Only the memory-referencing movs have to be "movdqu". The other ones just move content from one register to another (they are probably optimized out by the CPU via register renaming and not even executed). The one that should be changed is the following:

    asm volatile("movdqa %0, %%xmm7"::"m" (mask));

(it's probably fine, since the static mask is not on the stack, but just to be on the safe side)
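To spell out the distinction, here is an illustrative sketch (the mask values are placeholders, not the real byteswap mask) of the three movdqa/movdqu forms in play; only the memory-operand forms can fault on alignment:

```c
#include <stdint.h>

struct sse_array {
	uint64_t v[2] __attribute__((aligned(16)));
};

/* Placeholder values, not the PR's actual pshufb mask. */
static const struct sse_array mask = { { 0, 0 } };

void
movdqa_forms(const uint64_t *ip)
{
	/* reg -> reg: no memory access, so alignment is irrelevant. */
	asm volatile("movdqa %xmm5, %xmm6");

	/* mem -> reg, unaligned-tolerant: safe for any address. */
	asm volatile("movdqu %0, %%xmm5" :: "m"(*ip));

	/* mem -> reg, aligned-only: faults (#GP) if the effective address
	 * is not 16-byte aligned; safe here only because mask carries an
	 * explicit aligned(16) attribute. */
	asm volatile("movdqa %0, %%xmm7" :: "m"(mask));
}
```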
This reverts commit 35a76a0, which introduced a regression on CentOS 6.7 in automated testing. The issue appears not to have impacted newer kernels, even though the testing is performed on instances of the same type.