Vectorise BitArray for ARM64 #33749

Gnbrkm41 · 2020-03-19T09:08:33Z

Resolves #33309
Contributes towards #33308 (All the work items under BitArray)

Due to issues with my home network, I was not able to run tests / benchmarks on my only ARM machine (a Raspberry Pi). In the meantime, hopefully the CI will be able to verify whether anything is wrong with my code.

cc @BruceForstall, @echesakovMSFT, @tannergooding

EgorBo · 2020-03-19T10:38:53Z

src/libraries/System.Collections/src/System/Collections/BitArray.cs

+                            // shuffledLower = ZipLow(v3, v3)   - A0 A0 A0 A0 A0 A0 A0 A0 A1 A1 A1 A1 A1 A1 A1 A1
+                            // shuffledHigher = ZipHigh(v3, v3) - A2 A2 A2 A2 A2 A2 A2 A2 A3 A3 A3 A3 A3 A3 A3 A3
+
+                            Vector128<byte> vector = Vector128.Create(BinaryPrimitives.ReverseEndianness(bits)).AsByte();


Is BinaryPrimitives.ReverseEndianness optimized to a bswap equivalent on AArch64?

I do not think so, which is why I left a comment on #33308 about it. It would be REV instructions that correspond to x86's bswap instructions.
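For readers following along, here is a minimal sketch of what the byte swap in the diff computes. `BinaryPrimitives.ReverseEndianness` is the real API; the `ByteSwapDemo` class and its `SwapWithRotates` helper are hypothetical names used only for illustration, showing one well-known mask-and-rotate formulation that needs several instructions where a single `rev` would do.

```csharp
using System;
using System.Buffers.Binary;
using System.Numerics;

public static class ByteSwapDemo
{
    // Hypothetical helper: swaps the four bytes of a 32-bit value using two
    // masks and two rotates. A dedicated byte-reverse instruction (REV on
    // ARM64, bswap on x86) produces the same result in one step.
    public static uint SwapWithRotates(uint value) =>
        BitOperations.RotateRight(value & 0x00FF00FFu, 8) +
        BitOperations.RotateLeft(value & 0xFF00FF00u, 8);

    // The library call used in the PR diff, wrapped for easy comparison.
    public static uint SwapWithLibrary(uint value) =>
        BinaryPrimitives.ReverseEndianness(value);
}
```

Both helpers map `0x11223344` to `0x44332211`; whether the library call becomes a single instruction depends on the JIT in use.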

Gnbrkm41 · 2020-03-19T11:29:36Z

Is there a way to check which tests specifically have failed & their console outputs? Libraries Test Run release mono Linux arm64 Debug is the only ARM test leg that had run & is reporting failures, but it does not actually specify what has failed (AzDO tests page just say "The work item failed") and the logs don't help either.

tannergooding · 2020-03-19T15:28:32Z

@Gnbrkm41, it's not super intuitive, but once you are at the Azure logs (click Details on the failed job, then View more details on Azure Pipelines), you can click on the build number at the top of the screen (in this case 20200319.8) and then change from Summary to Tests, which will list the failed tests.

You then have to click on the test, go to Attachments, and download the log.
-- It's worth noting it isn't always this complex. I think this is just because normal unit test results couldn't be reported.

In this case, the log indicates:

exit code 139 means SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.

tannergooding · 2020-03-19T15:30:32Z

Given that this is a core dump, I'll try to build and run locally to see if I can repro.

Gnbrkm41 · 2020-03-19T15:36:09Z

Oh, huh. That is interesting... I'll take a look as well.

tannergooding · 2020-03-19T16:04:30Z

This actually looks like Mono, I'm not very familiar with how to debug or produce bits for that, do we have a guide somewhere? @EgorBo?

EgorBo · 2020-03-19T16:16:02Z

This actually looks like Mono, I'm not very familiar with how to debug or produce bits for that, do we have a guide somewhere? @EgorBo?

It crashes with StackOverflow in System.Collections.Tests.
I guess because AdvSimd.Arm64.IsSupported is not recognized as an intrinsic (the default impl for it is to call itself). I'll fix it, you can ignore this failure in this PR.

Gnbrkm41 · 2020-03-19T16:20:35Z

Do the remaining test legs get automatically cancelled when one of the legs fails? I see that all the non-Mono ARM CI runs are cancelled. It would be nice if we could selectively run those...

Fixes StackOverflow in dotnet/runtime#33749 (comment) Intrinsify all `get_IsSupported` under `System.Runtime.Intrinsics*` to just `false` (except the sets we support, see mono_emit_simd_intrinsics).

echesakov · 2020-03-19T20:13:25Z

Resolves #33309
Contributes towards #33308 (All the work items under BitArray)

Due to issues with my home network, I was not able to run tests / benchmarks on my only ARM machine (a Raspberry Pi). In the meantime, hopefully the CI will be able to verify whether anything is wrong with my code.

@Gnbrkm41 Thank you for your contribution! I will collect jitDisasm and post here. In the meantime, I will also try to run benchmarks - not sure if they are supported on arm64 yet.

Fixes StackOverflow in dotnet/runtime#33749 (comment) Intrinsify all `get_IsSupported` under `System.Runtime.Intrinsics*` to just `false` (except the sets we support, see mono_emit_simd_intrinsics). Co-authored-by: EgorBo <[email protected]>

echesakov · 2020-03-19T22:32:59Z

I managed to collect jitDisasms for all the methods in the PR. I attached the file, but the following are the interesting spots:

And

G_M9258_IG18:
        93407CC9          sxtw    x9, x6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        4E311E10          and     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M9258_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M9258_IG18

Or

G_M43100_IG18:
        93407CC9          sxtw    x9, x6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        4EB11E10          orr     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M43100_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M43100_IG18

Xor

G_M31876_IG18:
        93407CC9          sxtw    x9, x6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        6E311E10          eor     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M31876_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M31876_IG18

Not

G_M14226_IG13:
        93407C65          sxtw    x5, x3
        D37EF4A5          lsl     x5, x5, #2
        8B050085          add     x5, x4, x5
        4C4078B0          ld1     {v16.4s}, [x5]
        6E205A10          mvn     v16.16b, v16.16b
        4C0078B0          st1     {v16.4s}, [x5]
        11001063          add     w3, w3, #4
						;; bbWeight=2    PerfScore 14.00
G_M14226_IG14:
        51000C45          sub     w5, w2, #3
        6B05007F          cmp     w3, w5
        54FFFEEB          blt     G_M14226_IG13

CopyTo

G_M40488_IG28:
        52800018          mov     w24, #0
        710082FF          cmp     w23, #32
        54000B2B          blt     G_M40488_IG32
        D2804020          movz    x0, #513
        F2A10080          movk    x0, #0x804 LSL #16
        F2C40200          movk    x0, #0x2010 LSL #32
        F2F00800          movk    x0, #0x8040 LSL #48
        97FFF0DF          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|29_0(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
        4EA01C08          mov     v8.16b, v0.16b
        52800020          mov     w0, #1
        6E084509          mov     v9.d[0], v8.d[1]
        97FFF0A5          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|20_0(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
        4EA01C0A          mov     v10.16b, v0.16b
        B9400AC0          ldr     w0, [x22,#8]
        6B00029F          cmp     w20, w0
        54001BE2          bhs     G_M40488_IG44
        93407E80          sxtw    x0, x20
        91004000          add     x0, x0, #16
        8B0002C0          add     x0, x22, x0
        F9000BA0          str     x0, [fp,#16]	// [V42 loc39]
        F9400BB9          ldr     x25, [fp,#16]	// [V42 loc39]
        B9401260          ldr     w0, [x19,#16]
        7100801F          cmp     w0, #32
        6E180528          mov     v8.d[1], v9.d[0]
        5400052B          blt     G_M40488_IG30
						;; bbWeight=0.50 PerfScore 11.50
G_M40488_IG29:
        F9400660          ldr     x0, [x19,#8]
        131F7F01          asr     w1, w24, #31
        12001021          and     w1, w1, #31
        0B180021          add     w1, w1, w24
        13057C21          asr     w1, w1, #5
        B9400802          ldr     w2, [x0,#8]
        6B02003F          cmp     w1, w2
        540019C2          bhs     G_M40488_IG44
        93407C21          sxtw    x1, x1
        D37EF421          lsl     x1, x1, #2
        91004021          add     x1, x1, #16
        B8616800          ldr     w0, [x0, x1]
        12009C01          and     w1, w0, #0xff00ff
        13812021          ror     w1, w1, #8
        12089C00          and     w0, w0, #0xff00ff00
        13806000          ror     w0, w0, #24
        0B000020          add     w0, w1, w0
        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        6E180528          mov     v8.d[1], v9.d[0]
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E18056A          mov     v10.d[1], v11.d[0]
        6E2A6E31          umin    v17.16b, v17.16b, v10.16b
        93407F00          sxtw    x0, x24
        8B000320          add     x0, x25, x0
        4C007011          st1     {v17.16b}, [x0]
        4E107A10          zip2    v16.16b, v16.16b, v16.16b
        4E281E10          and     v16.16b, v16.16b, v8.16b
        6E2A6E10          umin    v16.16b, v16.16b, v10.16b
        91004000          add     x0, x0, #16
        4C007010          st1     {v16.16b}, [x0]
        11008318          add     w24, w24, #32
        11008300          add     w0, w24, #32
        B9401261          ldr     w1, [x19,#16]
        6B01001F          cmp     w0, w1
        54FFFB2D          ble     G_M40488_IG29

.ctor

G_M45590_IG07:
        93407EC1          sxtw    x1, x22
        8B010001          add     x1, x0, x1
        4C407030          ld1     {v16.16b}, [x1]
        4E010FF1          dup     v17.16b, wzr
        6E318E10          cmeq    v16.16b, v16.16b, v17.16b
        4E201E10          and     v16.16b, v16.16b, v0.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        91004021          add     x1, x1, #16
        4C407031          ld1     {v17.16b}, [x1]
        4E010FF2          dup     v18.16b, wzr
        6E328E31          cmeq    v17.16b, v17.16b, v18.16b
        4E201E31          and     v17.16b, v17.16b, v0.16b
        4E31BE31          addp    v17.16b, v17.16b, v17.16b
        4E31BE31          addp    v17.16b, v17.16b, v17.16b
        4E31BE31          addp    v17.16b, v17.16b, v17.16b
        4E513A10          zip1    v16.8h, v16.8h, v17.8h
        3D8007B0          str     q16, [fp,#16]	// [V39 tmp11]
        B94013A1          ldr     w1, [fp,#16]	// [V39 tmp11]
        F9400662          ldr     x2, [x19,#8]
        131F7EC3          asr     w3, w22, #31
        12001063          and     w3, w3, #31
        0B160063          add     w3, w3, w22
        13057C63          asr     w3, w3, #5
        B9400844          ldr     w4, [x2,#8]
        6B04007F          cmp     w3, w4
        54000782          bhs     G_M45590_IG18
        93407C63          sxtw    x3, x3
        D37EF463          lsl     x3, x3, #2
        91004063          add     x3, x3, #16
        2A2103E1          mvn     w1, w1
        B8236841          str     w1, [x2, x3]
        110082D6          add     w22, w22, #32
						;; bbWeight=2    PerfScore 90.00
G_M45590_IG08:
        110082C1          add     w1, w22, #32
        6B0102BF          cmp     w21, w1
        54FFFB8A          bge     G_M45590_IG07

A couple of notes here:

- The generated code for And, Or, Xor and Not looks reasonable to me.
- If we had a pre-indexed or post-indexed version of the Load intrinsic, the JIT could generate even more compact code for the loop bodies.
- I expect all these bl System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback calls to go away as soon as we implement "[Arm64] Load one single-element structure and Replicate to all lanes of one register" (#33490) and start using the intrinsic in the Vector128<T>.Create implementation.
- cmeq (register), when one of the operands is Vector128<T>.Zero, will be replaced with cmeq (zero) when we support this optimization.
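The And, Or and Xor loops above all share one shape: load four ints from each array, combine them, store the result, advance by four. A minimal sketch of that shape follows, assuming .NET 5 or later. This is not the PR's code: `BitwiseAndSketch` and `AndInPlace` are hypothetical names, and the sketch reinterprets spans with `MemoryMarshal.Cast` where the PR works with pointers and `AdvSimd.LoadVector128`.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class BitwiseAndSketch
{
    // Hypothetical helper: computes left[i] &= right[i] for every element.
    public static void AndInPlace(int[] left, int[] right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Arrays must have the same length.");

        int i = 0;
        if (AdvSimd.IsSupported)
        {
            // View the largest multiple-of-4 prefix as 128-bit vectors.
            int vectorisable = left.Length & ~3;
            Span<Vector128<int>> l =
                MemoryMarshal.Cast<int, Vector128<int>>(left.AsSpan(0, vectorisable));
            ReadOnlySpan<Vector128<int>> r =
                MemoryMarshal.Cast<int, Vector128<int>>(right.AsSpan(0, vectorisable));

            for (int v = 0; v < l.Length; v++)
                l[v] = AdvSimd.And(l[v], r[v]); // one 128-bit AND per four ints

            i = vectorisable;
        }

        // Scalar tail, and the whole array on hardware without AdvSimd.
        for (; i < left.Length; i++)
            left[i] &= right[i];
    }
}
```

The `IsSupported` check keeps the sketch runnable on non-ARM machines, where only the scalar loop executes.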

jitDump.txt

tannergooding · 2020-03-19T22:40:57Z

@TamarChristinaArm, would you or someone else from ARM be able to take a glance at the PR and codegen and make sure we are using the right tricks?

Also CC. @CarolEidt

danmoseley · 2020-03-19T22:49:19Z

I guess because AdvSimd.Arm64.IsSupported is not recognized as an intrinsic (the default impl for it is to call itself). I'll fix it, you can ignore this failure in this PR.

@stephentoub should we have an analyzer for that? public static new bool IsSupported { get => IsSupported; }. But, perhaps it's unlikely.

EgorBo · 2020-03-19T22:55:13Z

@stephentoub should we have an analyzer for that? public static new bool IsSupported { get => IsSupported; }. But, perhaps it's unlikely.

As far as I understand this hack is a trick to allow intrinsics to be used via reflection.

tannergooding · 2020-03-19T23:15:13Z

As far as I understand this hack is a trick to allow intrinsics to be used via reflection.

Yes, it's used for basically any form of indirect invocation, including: reflection, delegates, the debugger/immediate window, and when non-constant inputs are given to an intrinsic that requires them.
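To make the indirect-invocation case concrete, here is a small sketch that reads `AdvSimd.IsSupported` through reflection and through a delegate bound to the getter. The `IndirectIsSupported` class is a hypothetical name for illustration. Both paths run the getter's actual method body, so they only terminate if the runtime treats the call inside that body as an intrinsic; on a runtime that does not, the self-referential getter recurses, which matches the stack overflow reported earlier in this thread.

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics.Arm;

public static class IndirectIsSupported
{
    // Looks up the IsSupported property declared on AdvSimd itself.
    private static PropertyInfo GetProperty() =>
        typeof(AdvSimd).GetProperty(
            "IsSupported",
            BindingFlags.Public | BindingFlags.Static | BindingFlags.DeclaredOnly);

    // Invokes the getter via reflection.
    public static bool ViaReflection() =>
        (bool)GetProperty().GetValue(null);

    // Invokes the getter via a delegate bound to the getter method.
    public static bool ViaDelegate()
    {
        var getter = (Func<bool>)GetProperty().GetMethod.CreateDelegate(typeof(Func<bool>));
        return getter();
    }
}
```

On a runtime with working intrinsic support, both methods return the same value as a direct `AdvSimd.IsSupported` check.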

Gnbrkm41 · 2020-03-20T09:15:05Z

I don't see any Libraries test run for CoreCLR on ARM on the CI list (there's Mono one but I assume that Mono doesn't support HWIntrinsics - #33761). Is it correct to assume that Libraries tests are not run on ARM machines w/ CoreCLR? Surprising if this is the case.

TamarChristinaArm · 2020-03-20T11:27:15Z

@TamarChristinaArm, would you or someone else from ARM be able to take a glance at the PR and codegen and make sure we are using the right tricks?

Taking a look now, will have some feedback by end of today.

TamarChristinaArm · 2020-03-20T18:00:05Z

Implementation wise these are all correct and fine.

So I have comments mostly about the codegen out of the JIT than the actual implementation here.

For And, Or, Xor and Not the code is fine, it's mostly the addressing modes that need some fixup as @echesakovMSFT mentioned.

e.g. this code for And

G_M9258_IG18:
        93407CC9          sxtw    x9, w6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        4E311E10          and     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M9258_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M9258_IG18

should ideally be

G_M9258_IG18:
        93407CC9          sxtw    x9, w6
        D37EF529          lsl     x9, x9, #2
        4C407950          ldr     q16, [x7, x9]
        4C407931          ldr     q17, [x8, x9]
        4E311E10          and     v16.4s, v16.4s, v17.4s
        4C007950          str     q16, [x7, x9]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M9258_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M9258_IG18

Which means you don't need the adjustment for x7 and x8.
The writeback would work for the single use case (rightPtr) but
the multi-use cases become tricky as you have to do the writeback only once.

for CopyTo

	Vector128<byte> bitMask = Vector128.Create(0x80402010_08040201).AsByte();
	Vector128<byte> ones = Vector128.Create((byte)1);

currently generates

	D2804020          movz    x0, #513
	F2A10080          movk    x0, #0x804 LSL #16
	F2C40200          movk    x0, #0x2010 LSL #32
	F2F00800          movk    x0, #0x8040 LSL #48
	97FFF0DF          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|29_0(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
	4EA01C08          mov     v8.16b, v0.16b
	52800020          mov     w0, #1
	6E084509          mov     v9.d[0], v8.d[1]
	97FFF0A5          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|20_0(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
	4EA01C0A          mov     v10.16b, v0.16b

Seems like Vector.Create is not optimized for mask creations. What does the software fallback do?
I believe

	Vector128<byte> ones = Vector128.Create((byte)1);

could be movi v0, #1 (it's moving 1 into each lane no?) and the first one should be a dup. That would also get out of needing to make the copies.

Also am I missing something? I see

        6E084509          mov     v9.d[0], v8.d[1]
...
        6E180528          mov     v8.d[1], v9.d[0]

With no actual usages of v8 or v9. And it does the same odd thing again later:

        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E180528          mov     v8.d[1], v9.d[0]
        6E18056A          mov     v10.d[1], v11.d[0]

v9 and v11 are untouched. It only uses v8 and v10.

For the algorithm itself, yeah as @Gnbrkm41 says TBL would have been the perfect operation here.
However it only needs the single register version of the TBL which has been API approved.
So we should implement and use that here. The sequence would then become something like

ldr		s0, [x0, x1]
tbl		v1, { v0.8B }, v2 ({0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1})
tbl		v3, { v0.8B }, v3 ({2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3})
and     v1, ..
min     v1, ..
and     v3, ..
min     v3, ..

and you don't need the byte reversals, as they would be encoded in the TBL, so the sequence would be much shorter.
For now this does look like the fastest way of doing the bit compression using the mask and 3 ADDP.
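As a reading aid, the table-lookup sequence sketched above can be modelled in scalar code. This is a hedged model of the idea, not the PR's implementation: `CopyToModel` and `ExpandBits` are hypothetical names, and each loop iteration stands in for one byte lane of the two 16-byte vectors.

```csharp
using System;

public static class CopyToModel
{
    // Expands one 32-bit word into 32 bytes, each 0 or 1 (bit 0 first).
    public static byte[] ExpandBits(uint bits)
    {
        var result = new byte[32];
        for (int lane = 0; lane < 32; lane++)
        {
            // TBL step: lanes 0-7 select source byte 0, lanes 8-15 byte 1, etc.
            int sourceIndex = lane / 8;
            byte selected = (byte)(bits >> (8 * sourceIndex));

            // AND step: the mask repeats 0x01, 0x02, ..., 0x80 every 8 lanes.
            byte mask = (byte)(1 << (lane % 8));
            byte isolated = (byte)(selected & mask);

            // MIN step: clamp any non-zero value down to 1.
            result[lane] = Math.Min(isolated, (byte)1);
        }
        return result;
    }
}
```

For example, `ExpandBits(0x80000201)` yields ones at positions 0, 9 and 31 and zeros everywhere else.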

For the constructor

	Vector128<byte> lowerIsFalse = AdvSimd.CompareEqual(lowerVector, Vector128<byte>.Zero);

Aside from what @echesakovMSFT said, that this should be a cmeq against zero, it should also
never generate this:

	4E010FF2          dup     v18.16b, wzr

If it ever actually needs a 0 vector it should instead do

	movi v18.16b, #0
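The constructor loop in the .ctor disassembly packs 32 bools into one int using a compare, a mask, and three pairwise adds per 16-byte half. The scalar model below follows my reading of that disassembly and is only a sketch: `CtorModel`, `PackBools` and `AddPairwiseSelf` are hypothetical names, and byte arrays stand in for the vector registers.

```csharp
using System;

public static class CtorModel
{
    // Models addp v, v, v on 16 byte lanes: adjacent pairs are summed, and
    // because both operands are the same register the 8 sums appear twice.
    private static byte[] AddPairwiseSelf(byte[] lanes)
    {
        var result = new byte[16];
        for (int i = 0; i < 8; i++)
        {
            byte sum = (byte)(lanes[2 * i] + lanes[2 * i + 1]);
            result[i] = sum;
            result[i + 8] = sum;
        }
        return result;
    }

    // Packs exactly 32 bools into an int, bit i set when values[i] is true.
    public static int PackBools(ReadOnlySpan<bool> values)
    {
        if (values.Length != 32)
            throw new ArgumentException("Expected exactly 32 values.");

        uint falseBits = 0;
        for (int half = 0; half < 2; half++) // lower and upper 16-byte vector
        {
            var lanes = new byte[16];
            for (int i = 0; i < 16; i++)
            {
                // cmeq against zero: 0xFF where the bool is false.
                byte isFalse = values[half * 16 + i] ? (byte)0x00 : (byte)0xFF;
                // and with the repeating bit mask 0x01, 0x02, ..., 0x80.
                lanes[i] = (byte)(isFalse & (1 << (i % 8)));
            }

            // Three pairwise adds leave the two packed bytes in lanes 0 and 1.
            for (int round = 0; round < 3; round++)
                lanes = AddPairwiseSelf(lanes);

            // zip1 on halfwords places the lower half first, then the upper.
            uint halfword = (uint)(lanes[0] | (lanes[1] << 8));
            falseBits |= halfword << (16 * half);
        }

        // mvn: the accumulated bits mark false entries, so invert them.
        return (int)~falseBits;
    }
}
```

The pairwise adds never overflow a byte here, because within each group of eight lanes every lane contributes a distinct bit.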

tannergooding · 2020-03-20T18:26:07Z

Thanks a lot for the analysis @TamarChristinaArm

As for

Seems like Vector.Create is not optimized for mask creations. What does the software fallback do?

The actual implementation is here: https://source.dot.net/#System.Private.CoreLib/Vector128.cs,434. As you may be able to see, we currently have a specialized path for x86 but don't have one for Arm, so it just naively constructs the vector using the stack and a value copy.
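A rough sketch of the difference being described, assuming .NET 5 or later: a broadcast that uses an Arm intrinsic when available and otherwise builds the vector through a stack buffer. `BroadcastSketch` and `Broadcast` are hypothetical names, and this is not the CoreLib source; it only illustrates the two kinds of path.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class BroadcastSketch
{
    // Hypothetical helper: returns a vector with every byte lane set to value.
    public static Vector128<byte> Broadcast(byte value)
    {
        if (AdvSimd.IsSupported)
        {
            // Specialized path: a single duplicate-to-all-lanes operation.
            return AdvSimd.DuplicateToVector128(value);
        }

        // Fallback path: fill a 16-byte stack buffer, then copy it out.
        Span<byte> lanes = stackalloc byte[16];
        lanes.Fill(value);
        return MemoryMarshal.Read<Vector128<byte>>(lanes);
    }
}
```

Either path should produce the same value as `Vector128.Create((byte)1)` when called with 1; the point of a specialized path is only to avoid the call and the stack round trip.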

Fixing this is tracked by #33308 and #33496 and should be relatively straightforward given the APIs we have exposed now.

BruceForstall · 2020-03-20T23:48:38Z

I don't see any Libraries test run for CoreCLR on ARM on the CI list (there's Mono one but I assume that Mono doesn't support HWIntrinsics - #33761). Is it correct to assume that Libraries tests are not run on ARM machines w/ CoreCLR?

It appears that is true. @safern @ViktorHofer Is this intended? How can we get this libraries change properly tested on ARM64 (Linux, Windows) hardware?

Ideally, this should also be tested with a Checked CoreCLR. The ARM64 hardware intrinsics support is new, so we'd prefer to run code through a checked JIT to see if any asserts crop up.

safern · 2020-03-21T00:40:11Z

Due to hardware limitations those are not run in PRs but they are run on merge:

runtime/eng/pipelines/runtime.yml

Lines 669 to 672 in a43d6c4

          - ${{ if eq(variables['isFullMatrix'], true) }}:
            - Linux_arm
            - Linux_arm64
            - Linux_musl_arm64

However, we can just queue a manual build from this branch, which will cause those jobs to run. Which I just did: https://dev.azure.com/dnceng/public/_build/results?buildId=568064

safern · 2020-03-21T02:35:45Z

There were indeed some failures on arm: https://dev.azure.com/dnceng/public/_build/results?buildId=568064&view=ms.vss-test-web.build-test-results-tab&runId=17835402&resultId=134928&paneView=debug

echesakov · 2020-03-25T18:10:56Z

New test run https://dev.azure.com/dnceng/public/_build/results?buildId=572708 shows test failures on Linux arm64 Release with https://helix.dot.net/api/2019-06-17/jobs/1e9376b5-c757-4eae-aa25-a887a41cd095/workitems/System.Collections.Tests/console

    System.Collections.Tests.BitArray_CtorTests.Ctor_BitArray [FAIL]
      System.PlatformNotSupportedException : Operation is not supported on this platform.
      Stack Trace:
        /_/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs(3617,0): at System.Runtime.Intrinsics.Arm.AdvSimd.LoadVector128(Byte* address)
        /_/src/libraries/System.Collections/src/System/Collections/BitArray.cs(182,0): at System.Collections.BitArray..ctor(Boolean[] values)
           at System.Collections.Tests.BitArray_CtorTests.Ctor_BitArray_TestData()+MoveNext()
        /_/src/libraries/System.Linq/src/System/Linq/Select.cs(137,0): at System.Linq.Enumerable.SelectEnumerableIterator`2.MoveNext()
    System.Collections.Tests.BitArray_CtorTests.Ctor_BoolArray(values: [True, True, True, True, True, ...]) [FAIL]
      System.PlatformNotSupportedException : Operation is not supported on this platform.
      Stack Trace:
        /_/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs(3617,0): at System.Runtime.Intrinsics.Arm.AdvSimd.LoadVector128(Byte* address)
        /_/src/libraries/System.Collections/src/System/Collections/BitArray.cs(182,0): at System.Collections.BitArray..ctor(Boolean[] values)
        /_/src/libraries/System.Collections/tests/BitArray/BitArray_CtorTests.cs(102,0): at System.Collections.Tests.BitArray_CtorTests.Ctor_BoolArray(Boolean[] values)

I will run this locally to see what's going on

echesakov · 2020-03-26T02:54:32Z

I have been debugging this for a while, and it turned out this failure was Linux only. I didn't see the failure yesterday since I was running benchmarks on a Windows/arm64 laptop, and we didn't see this in CI before since your branch was based on top of an older commit (before #33936).

I submitted a PR to fix the issue (#34107); that fix should be merged before this change can go in.

BruceForstall · 2020-03-28T00:07:12Z

@echesakovMSFT Now that #34107 is merged, should someone kick off the libraries arm64 testing here again?

Maybe @Gnbrkm41 should rebase and push an update to ensure that the manually triggered jobs test the right thing?

echesakov · 2020-03-28T00:17:43Z

Yes, @Gnbrkm41 please do as Bruce suggested one more time (I hope this one is final)

Gnbrkm41 · 2020-03-28T08:45:34Z

@BruceForstall, @echesakovMSFT - Done 🙂

BruceForstall · 2020-03-28T17:54:44Z

I manually triggered https://dev.azure.com/dnceng/public/_build/results?buildId=577868&view=results

BruceForstall · 2020-03-28T21:00:21Z

@tannergooding @echesakovMSFT @GrabYourPitchforks All the tests have passed (except a first-pass set of flaky failures that are still visible?). Anyone want to do a final code review and sign-off so this can be merged?

tannergooding · 2020-03-28T21:38:51Z

src/libraries/System.Collections/src/System/Collections/BitArray.cs

+                            // Same logic as SSSE3 path, except we do not have Shuffle instruction.
+                            // (TableVectorLookup could be an alternative - dotnet/runtime#1277)


I'd like a tracking issue to be logged for this, so we don't forget to replace the implementation once TableVectorLookup is implemented.

FWIW, this should be addressed by #38780.

tannergooding

LGTM

echesakov

Looks Good. Thank you!

tannergooding · 2020-03-30T17:45:35Z

Thanks @Gnbrkm41

Gnbrkm41 · 2020-06-02T07:57:05Z

@echesakovMSFT, have there been any improvements regarding "cmeq (register) when one of the operands is Vector128.Zero will be replaced with cmeq (zero) when we support this optimization"?

echesakov · 2020-06-02T17:40:30Z

Have there been any improvements regarding "cmeq (register) when one of the operands is Vector128.Zero will be replaced with cmeq (zero) when we support this optimization"?

@Gnbrkm41 No, I haven't had time to work on this yet, but I have this on my backlog.

Dotnet-GitSync-Bot added the area-System.Collections label Mar 19, 2020

EgorBo reviewed Mar 19, 2020

EgorBo reviewed Mar 19, 2020

Gnbrkm41 changed the title ~~Vectorise BitArray for ARM64~~ [WIP] Vectorise BitArray for ARM64 Mar 19, 2020

EgorBo mentioned this pull request Mar 19, 2020

[mono] return false for AdvSimd.IsSupported and friends #33761

Merged

monojenkins mentioned this pull request Mar 19, 2020

[mono] return false for AdvSimd.IsSupported and friends mono/mono#19263

Merged

GrabYourPitchforks reviewed Mar 19, 2020

Gnbrkm41 force-pushed the vectorisebitarrayarm branch from 07d9f31 to 4e3891e Compare March 20, 2020 07:24

jaredpar mentioned this pull request Mar 25, 2020

Errors installing the SDK during builds #34015

Closed

echesakov mentioned this pull request Mar 26, 2020

Always set InstructionSet_ArmBase in PAL_GetJitCpuCapabilityFlags #34107

Merged

Gnbrkm41 added 7 commits March 28, 2020 17:44

Vectorise BitArray for ARM64

65be9c2

Make algorithms endianness agnostic

19db92e

Change loop counter to uint to prevent overflow

3b0ba24

Fix CopyTo(bool[])

41dee08

Move constant variables outside the loop

f26e8bc

More unsigned goodness

e1a2cad

Use span.Clear instead of Fill(0)

c97520d

Gnbrkm41 force-pushed the vectorisebitarrayarm branch from 9738573 to c97520d Compare March 28, 2020 08:44

tannergooding reviewed Mar 28, 2020

tannergooding approved these changes Mar 28, 2020

echesakov approved these changes Mar 30, 2020

tannergooding merged commit 8511b5b into dotnet:master Mar 30, 2020

Gnbrkm41 deleted the vectorisebitarrayarm branch March 30, 2020 17:51

EgorBo mentioned this pull request Apr 7, 2020

Implement BinaryPrimitives.ReverseEndianness for arm64 using rev #34617

Merged

Gnbrkm41 mentioned this pull request Jul 4, 2020

Evaluate if BitArray() can be further optimized using single AddPairwise operation #38719

Open

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
