Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorise BitArray for ARM64 #33749

Merged
merged 7 commits into from
Mar 30, 2020
Merged

Conversation

Gnbrkm41
Copy link
Contributor

Resolves #33309
Contributes towards #33308 (All the work items under BitArray)

Due to network issues I have with the home network I was not run able to run tests / benchmarks on my only ARM machine (Raspberry Pi). In the meantime, hopefully the CI would be able to verify if anything is wrong with my code.

cc @BruceForstall, @echesakovMSFT, @tannergooding

// shuffledLower = ZipLow(v3, v3) - A0 A0 A0 A0 A0 A0 A0 A0 A1 A1 A1 A1 A1 A1 A1 A1
// shuffledHigher = ZipHigh(v3, v3) - A2 A2 A2 A2 A2 A2 A2 A2 A3 A3 A3 A3 A3 A3 A3 A3

Vector128<byte> vector = Vector128.Create(BinaryPrimitives.ReverseEndianness(bits)).AsByte();
Copy link
Member

@EgorBo EgorBo Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is BinaryPrimitives.ReverseEndianness optimized to bswap equivalent on AArch64 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so, which is why I left a comment on #33308 about it. It would be REV instructions that correspond to x86's bswap instructions.

@Gnbrkm41
Copy link
Contributor Author

Is there a way to check which tests specifically have failed & their console outputs? Libraries Test Run release mono Linux arm64 Debug is the only ARM test leg that had run & is reporting failures, but it does not actually specify what has failed (AzDO tests page just say "The work item failed") and the logs don't help either.

@Gnbrkm41 Gnbrkm41 changed the title Vectorise BitArray for ARM64 [WIP] Vectorise BitArray for ARM64 Mar 19, 2020
@tannergooding
Copy link
Member

@Gnbrkm41, its not super intuitive, but once you are at the Azure logs (click Details on the failed job then View more details on Azure Pipelines), you can click on the build number at the top of the screen (in this case 20200319.8) and then change from Summary to Tests which will give you:
image

You then have to click on the test, goto Attachments, and download the log.
-- Its worth noting it isn't always this complex. I think this is just because normal unit test results couldn't be reported.

In this case, the log indicates:

exit code 139 means SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.

@tannergooding
Copy link
Member

Given that this is a core dump, I'll try to build and run locally to see if I can repro.

@Gnbrkm41
Copy link
Contributor Author

Oh, huh. That is interesting... I'll take a look as well.

@tannergooding
Copy link
Member

This actually looks like Mono, I'm not very familiar with how to debug or produce bits for that, do we have a guide somewhere? @EgorBo?

@EgorBo
Copy link
Member

EgorBo commented Mar 19, 2020

This actually looks like Mono, I'm not very familiar with how to debug or produce bits for that, do we have a guide somewhere? @EgorBo?

It crashes with StackOverflow in System.Collections.Tests.
I guess because AdvSimd.Arm64.IsSupported is not recognized as an intrinsic (the default impl for it is to call itself). I'll fix it, you can ignore this failure in this PR.

@Gnbrkm41
Copy link
Contributor Author

Do the remaining test legs get automatically cancelled when one of the legs fail? I see that all the non-Mono ARM CI runs are cancelled. It would be nice if we can selectively run those...

monojenkins pushed a commit to monojenkins/mono that referenced this pull request Mar 19, 2020
Fixes StackOverflow in dotnet/runtime#33749 (comment)
Intrinsify all `get_IsSupported` under `System.Runtime.Intrinsics*` to just `false` (except the sets we support, see mono_emit_simd_intrinsics).
@echesakov
Copy link
Contributor

Resolves #33309
Contributes towards #33308 (All the work items under BitArray)

Due to network issues I have with the home network I was not run able to run tests / benchmarks on my only ARM machine (Raspberry Pi). In the meantime, hopefully the CI would be able to verify if anything is wrong with my code.

@Gnbrkm41 Thank you for your contribution! I will collect jitDisasm and post here. In the meantime, I will also try to run benchmarks - not sure if they are supported on arm64 yet.

EgorBo added a commit to mono/mono that referenced this pull request Mar 19, 2020
Fixes StackOverflow in dotnet/runtime#33749 (comment)
Intrinsify all `get_IsSupported` under `System.Runtime.Intrinsics*` to just `false` (except the sets we support, see mono_emit_simd_intrinsics).

Co-authored-by: EgorBo <[email protected]>
@echesakov
Copy link
Contributor

I managed to collect jitDisasms for all the methods in PR. I attached the file but the following are the interesting spots:

And

G_M9258_IG18:
        93407CC9          sxtw    x9, x6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        4E311E10          and     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M9258_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M9258_IG18

Or

G_M43100_IG18:
        93407CC9          sxtw    x9, x6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        4EB11E10          orr     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M43100_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M43100_IG18

Xor

G_M31876_IG18:
        93407CC9          sxtw    x9, x6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        6E311E10          eor     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M31876_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M31876_IG18

Not

G_M14226_IG13:
        93407C65          sxtw    x5, x3
        D37EF4A5          lsl     x5, x5, #2
        8B050085          add     x5, x4, x5
        4C4078B0          ld1     {v16.4s}, [x5]
        6E205A10          mvn     v16.16b, v16.16b
        4C0078B0          st1     {v16.4s}, [x5]
        11001063          add     w3, w3, #4
						;; bbWeight=2    PerfScore 14.00
G_M14226_IG14:
        51000C45          sub     w5, w2, #3
        6B05007F          cmp     w3, w5
        54FFFEEB          blt     G_M14226_IG13

CopyTo

G_M40488_IG28:
        52800018          mov     w24, #0
        710082FF          cmp     w23, #32
        54000B2B          blt     G_M40488_IG32
        D2804020          movz    x0, #513
        F2A10080          movk    x0, #0x804 LSL #16
        F2C40200          movk    x0, #0x2010 LSL #32
        F2F00800          movk    x0, #0x8040 LSL #48
        97FFF0DF          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|29_0(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
        4EA01C08          mov     v8.16b, v0.16b
        52800020          mov     w0, #1
        6E084509          mov     v9.d[0], v8.d[1]
        97FFF0A5          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|20_0(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
        4EA01C0A          mov     v10.16b, v0.16b
        B9400AC0          ldr     w0, [x22,#8]
        6B00029F          cmp     w20, w0
        54001BE2          bhs     G_M40488_IG44
        93407E80          sxtw    x0, x20
        91004000          add     x0, x0, #16
        8B0002C0          add     x0, x22, x0
        F9000BA0          str     x0, [fp,#16]	// [V42 loc39]
        F9400BB9          ldr     x25, [fp,#16]	// [V42 loc39]
        B9401260          ldr     w0, [x19,#16]
        7100801F          cmp     w0, #32
        6E180528          mov     v8.d[1], v9.d[0]
        5400052B          blt     G_M40488_IG30
						;; bbWeight=0.50 PerfScore 11.50
G_M40488_IG29:
        F9400660          ldr     x0, [x19,#8]
        131F7F01          asr     w1, w24, #31
        12001021          and     w1, w1, #31
        0B180021          add     w1, w1, w24
        13057C21          asr     w1, w1, #5
        B9400802          ldr     w2, [x0,#8]
        6B02003F          cmp     w1, w2
        540019C2          bhs     G_M40488_IG44
        93407C21          sxtw    x1, x1
        D37EF421          lsl     x1, x1, #2
        91004021          add     x1, x1, #16
        B8616800          ldr     w0, [x0, x1]
        12009C01          and     w1, w0, #0xff00ff
        13812021          ror     w1, w1, #8
        12089C00          and     w0, w0, #0xff00ff00
        13806000          ror     w0, w0, #24
        0B000020          add     w0, w1, w0
        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        6E180528          mov     v8.d[1], v9.d[0]
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E18056A          mov     v10.d[1], v11.d[0]
        6E2A6E31          umin    v17.16b, v17.16b, v10.16b
        93407F00          sxtw    x0, x24
        8B000320          add     x0, x25, x0
        4C007011          st1     {v17.16b}, [x0]
        4E107A10          zip2    v16.16b, v16.16b, v16.16b
        4E281E10          and     v16.16b, v16.16b, v8.16b
        6E2A6E10          umin    v16.16b, v16.16b, v10.16b
        91004000          add     x0, x0, #16
        4C007010          st1     {v16.16b}, [x0]
        11008318          add     w24, w24, #32
        11008300          add     w0, w24, #32
        B9401261          ldr     w1, [x19,#16]
        6B01001F          cmp     w0, w1
        54FFFB2D          ble     G_M40488_IG29

.ctor

G_M45590_IG07:
        93407EC1          sxtw    x1, x22
        8B010001          add     x1, x0, x1
        4C407030          ld1     {v16.16b}, [x1]
        4E010FF1          dup     v17.16b, wzr
        6E318E10          cmeq    v16.16b, v16.16b, v17.16b
        4E201E10          and     v16.16b, v16.16b, v0.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        91004021          add     x1, x1, #16
        4C407031          ld1     {v17.16b}, [x1]
        4E010FF2          dup     v18.16b, wzr
        6E328E31          cmeq    v17.16b, v17.16b, v18.16b
        4E201E31          and     v17.16b, v17.16b, v0.16b
        4E31BE31          addp    v17.16b, v17.16b, v17.16b
        4E31BE31          addp    v17.16b, v17.16b, v17.16b
        4E31BE31          addp    v17.16b, v17.16b, v17.16b
        4E513A10          zip1    v16.8h, v16.8h, v17.8h
        3D8007B0          str     q16, [fp,#16]	// [V39 tmp11]
        B94013A1          ldr     w1, [fp,#16]	// [V39 tmp11]
        F9400662          ldr     x2, [x19,#8]
        131F7EC3          asr     w3, w22, #31
        12001063          and     w3, w3, #31
        0B160063          add     w3, w3, w22
        13057C63          asr     w3, w3, #5
        B9400844          ldr     w4, [x2,#8]
        6B04007F          cmp     w3, w4
        54000782          bhs     G_M45590_IG18
        93407C63          sxtw    x3, x3
        D37EF463          lsl     x3, x3, #2
        91004063          add     x3, x3, #16
        2A2103E1          mvn     w1, w1
        B8236841          str     w1, [x2, x3]
        110082D6          add     w22, w22, #32
						;; bbWeight=2    PerfScore 90.00
G_M45590_IG08:
        110082C1          add     w1, w22, #32
        6B0102BF          cmp     w21, w1
        54FFFB8A          bge     G_M45590_IG07

Couple notes here:

  1. The generated code for And, Or, Xor and Not looks reasonable to me
  2. If we had a pre-indexed or post-indexed version of the Load intrinsic JIT could generate even more compact code for the loops bodies
  3. I expect all these bl System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback to go away as soon as we implement [Arm64] Load one single-element structure and Replicate to all lanes of one register #33490 and start using the intrinsic in Vector128<T>.Create implementation
  4. cmeq (register) when one of the operands is Vector128<T>.Zero will be replaced with cmeq (zero) when we support this optimization

jitDump.txt

@tannergooding
Copy link
Member

@TamarChristinaArm, would you or someone else from ARM be able to take a glance at the PR and codegen and make sure we are using the right tricks?

Also CC. @CarolEidt

@danmoseley
Copy link
Member

I guess because AdvSimd.Arm64.IsSupported is not recognized as an intrinsic (the default impl for it is to call itself). I'll fix it, you can ignore this failure in this PR.

@stephentoub should we have an analyzer for that? public static new bool IsSupported { get => IsSupported; }. But, perhaps it's unlikely.

@EgorBo
Copy link
Member

EgorBo commented Mar 19, 2020

@stephentoub should we have an analyzer for that? public static new bool IsSupported { get => IsSupported; }. But, perhaps it's unlikely.

As far as I understand this hack is a trick to allow intrinsics to be used via reflection.

@tannergooding
Copy link
Member

As far as I understand this hack is a trick to allow intrinsics to be used via reflection.

Yes, its used for basically any form of indirect invocation including: reflection, delegates, the debugger/immediate window, and when non constant inputs are given to an intrinsic that requires them.

@Gnbrkm41 Gnbrkm41 force-pushed the vectorisebitarrayarm branch from 07d9f31 to 4e3891e Compare March 20, 2020 07:24
@Gnbrkm41
Copy link
Contributor Author

I don't see any Libraries test run for CoreCLR on ARM on the CI list (there's Mono one but I assume that Mono doesn't support HWIntrinsics - #33761). Is it correct to assume that Libraries tests are not run on ARM machines w/ CoreCLR? Surprising if this is the case.

@TamarChristinaArm
Copy link
Contributor

@TamarChristinaArm, would you or someone else from ARM be able to take a glance at the PR and codegen and make sure we are using the right tricks?

Taking a look now, will have some feedback by end of today.

@TamarChristinaArm
Copy link
Contributor

Implementation wise these are all correct and fine.

So I have comments mostly about the codegen out of the JIT than the actual implementation here.

For And, Or, Xor and Not the code is fine, it's mostly the addressing modes that need some fixup as @echesakovMSFT mentioned.

e.g. this code for And

G_M9258_IG18:
        93407CC9          sxtw    x9, w6
        D37EF529          lsl     x9, x9, #2
        8B0900EA          add     x10, x7, x9
        4C407950          ld1     {v16.4s}, [x10]
        8B090109          add     x9, x8, x9
        4C407931          ld1     {v17.4s}, [x9]
        4E311E10          and     v16.4s, v16.4s, v17.4s
        4C007950          st1     {v16.4s}, [x10]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M9258_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M9258_IG18

should ideally be

G_M9258_IG18:
        93407CC9          sxtw    x9, w6
        D37EF529          lsl     x9, x9, #2
        4C407950          ldr     q16, [x7, x9]
        4C407931          ldr     q17, [x8, x9]
        4E311E10          and     v16.4s, v16.4s, v17.4s
        4C007950          str     q16, [x7, x9]
        110010C6          add     w6, w6, #4
						;; bbWeight=2    PerfScore 21.00
G_M9258_IG19:
        51000CA9          sub     w9, w5, #3
        6B0900DF          cmp     w6, w9
        54FFFEAB          blt     G_M9258_IG18

Which means you don't need the adjustment for x7 and x8.
The writeback would work for the single use case (rightPtr) but
the multi-use cases become tricky as you have to do the writeback only once.

for CopyTo

	Vector128<byte> bitMask = Vector128.Create(0x80402010_08040201).AsByte();
	Vector128<byte> ones = Vector128.Create((byte)1);

currently generate

	D2804020          movz    x0, #513
	F2A10080          movk    x0, #0x804 LSL #16
	F2C40200          movk    x0, #0x2010 LSL #32
	F2F00800          movk    x0, #0x8040 LSL #48
	97FFF0DF          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|29_0(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
	4EA01C08          mov     v8.16b, v0.16b
	52800020          mov     w0, #1
	6E084509          mov     v9.d[0], v8.d[1]
	97FFF0A5          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|20_0(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
	4EA01C0A          mov     v10.16b, v0.16b

Seems like Vector.Create is not optimized for mask creations. What does the software fallback do?
I believe

	Vector128<byte> ones = Vector128.Create((byte)1);

could be movi v0, #1 (it's moving 1 into each lane no?) and the first one should be a dup. That would also get out of needing to make the copies.

Also am I missing something? I see

        6E084509          mov     v9.d[0], v8.d[1]
...
        6E180528          mov     v8.d[1], v9.d[0]

With no actual usages of v8 or v9. and it does the same odd thing again later

        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E180528          mov     v8.d[1], v9.d[0]
        6E18056A          mov     v10.d[1], v11.d[0]

v9 and v11 are untouched. it only uses v8 and v10.

For the algorithm itself, yeah as @Gnbrkm41 says TBL would have been the perfect operation here.
However it only needs the single register version of the TBL which has been API approved.
So we should implement and use that here. The sequence would then become something like

ldr		s0, [x0, x1]
tbl		v1, { v0.8B }, v2 ({0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1})
tbl		v3, { v0.8B }, v3 ({2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3})
and     v1, ..
min     v1, ..
and     v3, ..
min     v3, ..

and you don't need the byte reversals as they would be encoded in the TBL so would be much shorter..
For now this does look like the fastest way of doing the bit compression using the mask and 3 ADDP.

For the constructor

	Vector128<byte> lowerIsFalse = AdvSimd.CompareEqual(lowerVector, Vector128<byte>.Zero);

Aside from what @echesakovMSFT said that this should be a cmpeq against zero it should also
never generate this:

	4E010FF2          dup     v18.16b, wzr

If it ever actually need a 0 vector it should instead do

	movi v18.16b, #0

@tannergooding
Copy link
Member

Thanks a lot for the analysis @TamarChristinaArm

As for

Seems like Vector.Create is not optimized for mask creations. What does the software fallback do?

The actual implementation is here: https://source.dot.net/#System.Private.CoreLib/Vector128.cs,434, as you may be able to see, we currently have a specialized path for x86 but don't have one for Arm, so it just naively constructs the vector using the stack and a value copy.

Fixing this is tracked by #33308 and #33496 and should be relatively straightforward given the APIs we have exposed now.

@BruceForstall
Copy link
Member

I don't see any Libraries test run for CoreCLR on ARM on the CI list (there's Mono one but I assume that Mono doesn't support HWIntrinsics - #33761). Is it correct to assume that Libraries tests are not run on ARM machines w/ CoreCLR?

It appears that is true. @safern @ViktorHofer Is this intended? How can we get this libraries change properly tested on ARM64 (Linux, Windows) hardware?

Also, ideally, this should be tested with a Checked CoreCLR also. The ARM64 hardware intrinsics support is new, so we'd prefer to run code through a checked JIT to see if any asserts crop up.

@safern
Copy link
Member

safern commented Mar 21, 2020

Due to hardware limitations those are not run in PRs but they are run on merge:

- ${{ if eq(variables['isFullMatrix'], true) }}:
- Linux_arm
- Linux_arm64
- Linux_musl_arm64

However, we can just queue a manual build from this branch, which will cause those jobs to run. Which I just did: https://dev.azure.com/dnceng/public/_build/results?buildId=568064

@safern
Copy link
Member

safern commented Mar 21, 2020

@echesakov
Copy link
Contributor

New test run https://dev.azure.com/dnceng/public/_build/results?buildId=572708 shows test failures on Linux arm64 Release with https://helix.dot.net/api/2019-06-17/jobs/1e9376b5-c757-4eae-aa25-a887a41cd095/workitems/System.Collections.Tests/console

    System.Collections.Tests.BitArray_CtorTests.Ctor_BitArray [FAIL]
      System.PlatformNotSupportedException : Operation is not supported on this platform.
      Stack Trace:
        /_/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs(3617,0): at System.Runtime.Intrinsics.Arm.AdvSimd.LoadVector128(Byte* address)
        /_/src/libraries/System.Collections/src/System/Collections/BitArray.cs(182,0): at System.Collections.BitArray..ctor(Boolean[] values)
           at System.Collections.Tests.BitArray_CtorTests.Ctor_BitArray_TestData()+MoveNext()
        /_/src/libraries/System.Linq/src/System/Linq/Select.cs(137,0): at System.Linq.Enumerable.SelectEnumerableIterator`2.MoveNext()
    System.Collections.Tests.BitArray_CtorTests.Ctor_BoolArray(values: [True, True, True, True, True, ...]) [FAIL]
      System.PlatformNotSupportedException : Operation is not supported on this platform.
      Stack Trace:
        /_/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs(3617,0): at System.Runtime.Intrinsics.Arm.AdvSimd.LoadVector128(Byte* address)
        /_/src/libraries/System.Collections/src/System/Collections/BitArray.cs(182,0): at System.Collections.BitArray..ctor(Boolean[] values)
        /_/src/libraries/System.Collections/tests/BitArray/BitArray_CtorTests.cs(102,0): at System.Collections.Tests.BitArray_CtorTests.Ctor_BoolArray(Boolean[] values)

I will run this locally to see what's going on

@echesakov
Copy link
Contributor

I have been debugging this for awhile - turned out this failure was Linux only. I didn't see the failure yesterday since I was running benchmarks on Windows/arm64 laptop and we didn't see this in CI before since your branch was based on top of an older commit (before #33936).

I submitted a PR to fix the issue #34107 - the fix should be merged before this change can go in.

@BruceForstall
Copy link
Member

@echesakovMSFT Now that #34107 is merged, should someone kick off the libraries arm64 testing here again?

Maybe @Gnbrkm41 should rebase and push an update to ensure that the manually triggered jobs test the right thing?

@echesakov
Copy link
Contributor

Yes, @Gnbrkm41 please do as Bruce suggested one more time (I hope this one is final)

@Gnbrkm41 Gnbrkm41 force-pushed the vectorisebitarrayarm branch from 9738573 to c97520d Compare March 28, 2020 08:44
@Gnbrkm41
Copy link
Contributor Author

@BruceForstall, @echesakovMSFT - Done 🙂

@BruceForstall
Copy link
Member

I manually triggered https://dev.azure.com/dnceng/public/_build/results?buildId=577868&view=results

@BruceForstall
Copy link
Member

@tannergooding @echesakovMSFT @GrabYourPitchforks All the tests have passed (except a first-pass set of flaky failures that are still visible?). Anyone want to do a final code review and sign-off so this can be merged?

Comment on lines +994 to +995
// Same logic as SSSE3 path, except we do not have Shuffle instruction.
// (TableVectorLookup could be an alternative - dotnet/runtime#1277)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like a tracking issue to be logged for this, so we don't forget to replace the implementation once TableVectorLookup is implemented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this should be addressed by #38780.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@echesakov echesakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good. Thank you!

@tannergooding
Copy link
Member

Thanks @Gnbrkm41

@tannergooding tannergooding merged commit 8511b5b into dotnet:master Mar 30, 2020
@Gnbrkm41 Gnbrkm41 deleted the vectorisebitarrayarm branch March 30, 2020 17:51
@Gnbrkm41
Copy link
Contributor Author

Gnbrkm41 commented Jun 2, 2020

@echesakovMSFT, Has there been any improvements regarding "cmeq (register) when one of the operands is Vector128.Zero will be replaced with cmeq (zero) when we support this optimization"?

@echesakov
Copy link
Contributor

Has there been any improvements regarding "cmeq (register) when one of the operands is Vector128.Zero will be replaced with cmeq (zero) when we support this optimization"?

@Gnbrkm41 No, I haven't had time to work on this yet, but I have this on my backlog.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize System.Collections.BitArray using arm64 intrinsics
10 participants