-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorise BitArray for ARM64 #33749
Vectorise BitArray for ARM64 #33749
Conversation
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
// shuffledLower = ZipLow(v3, v3) - A0 A0 A0 A0 A0 A0 A0 A0 A1 A1 A1 A1 A1 A1 A1 A1 | ||
// shuffledHigher = ZipHigh(v3, v3) - A2 A2 A2 A2 A2 A2 A2 A2 A3 A3 A3 A3 A3 A3 A3 A3 | ||
|
||
Vector128<byte> vector = Vector128.Create(BinaryPrimitives.ReverseEndianness(bits)).AsByte(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is BinaryPrimitives.ReverseEndianness
optimized to bswap
equivalent on AArch64 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think so, which is why I left a comment on #33308 about it. It would be REV
instructions that correspond to x86's bswap instructions.
Is there a way to check which tests specifically have failed & their console outputs? |
@Gnbrkm41, its not super intuitive, but once you are at the Azure logs (click You then have to click on the test, goto In this case, the log indicates:
|
Given that this is a core dump, I'll try to build and run locally to see if I can repro. |
Oh, huh. That is interesting... I'll take a look as well. |
This actually looks like Mono, I'm not very familiar with how to debug or produce bits for that, do we have a guide somewhere? @EgorBo? |
It crashes with StackOverflow in System.Collections.Tests. |
Do the remaining test legs get automatically cancelled when one of the legs fail? I see that all the non-Mono ARM CI runs are cancelled. It would be nice if we can selectively run those... |
Fixes StackOverflow in dotnet/runtime#33749 (comment) Intrinsify all `get_IsSupported` under `System.Runtime.Intrinsics*` to just `false` (except the sets we support, see mono_emit_simd_intrinsics).
@Gnbrkm41 Thank you for your contribution! I will collect jitDisasm and post here. In the meantime, I will also try to run benchmarks - not sure if they are supported on arm64 yet. |
Fixes StackOverflow in dotnet/runtime#33749 (comment) Intrinsify all `get_IsSupported` under `System.Runtime.Intrinsics*` to just `false` (except the sets we support, see mono_emit_simd_intrinsics). Co-authored-by: EgorBo <[email protected]>
I managed to collect jitDisasms for all the methods in PR. I attached the file but the following are the interesting spots: And G_M9258_IG18:
93407CC9 sxtw x9, x6
D37EF529 lsl x9, x9, #2
8B0900EA add x10, x7, x9
4C407950 ld1 {v16.4s}, [x10]
8B090109 add x9, x8, x9
4C407931 ld1 {v17.4s}, [x9]
4E311E10 and v16.4s, v16.4s, v17.4s
4C007950 st1 {v16.4s}, [x10]
110010C6 add w6, w6, #4
;; bbWeight=2 PerfScore 21.00
G_M9258_IG19:
51000CA9 sub w9, w5, #3
6B0900DF cmp w6, w9
54FFFEAB blt G_M9258_IG18 Or G_M43100_IG18:
93407CC9 sxtw x9, x6
D37EF529 lsl x9, x9, #2
8B0900EA add x10, x7, x9
4C407950 ld1 {v16.4s}, [x10]
8B090109 add x9, x8, x9
4C407931 ld1 {v17.4s}, [x9]
4EB11E10 orr v16.4s, v16.4s, v17.4s
4C007950 st1 {v16.4s}, [x10]
110010C6 add w6, w6, #4
;; bbWeight=2 PerfScore 21.00
G_M43100_IG19:
51000CA9 sub w9, w5, #3
6B0900DF cmp w6, w9
54FFFEAB blt G_M43100_IG18 Xor G_M31876_IG18:
93407CC9 sxtw x9, x6
D37EF529 lsl x9, x9, #2
8B0900EA add x10, x7, x9
4C407950 ld1 {v16.4s}, [x10]
8B090109 add x9, x8, x9
4C407931 ld1 {v17.4s}, [x9]
6E311E10 eor v16.4s, v16.4s, v17.4s
4C007950 st1 {v16.4s}, [x10]
110010C6 add w6, w6, #4
;; bbWeight=2 PerfScore 21.00
G_M31876_IG19:
51000CA9 sub w9, w5, #3
6B0900DF cmp w6, w9
54FFFEAB blt G_M31876_IG18 Not G_M14226_IG13:
93407C65 sxtw x5, x3
D37EF4A5 lsl x5, x5, #2
8B050085 add x5, x4, x5
4C4078B0 ld1 {v16.4s}, [x5]
6E205A10 mvn v16.16b, v16.16b
4C0078B0 st1 {v16.4s}, [x5]
11001063 add w3, w3, #4
;; bbWeight=2 PerfScore 14.00
G_M14226_IG14:
51000C45 sub w5, w2, #3
6B05007F cmp w3, w5
54FFFEEB blt G_M14226_IG13 CopyTo G_M40488_IG28:
52800018 mov w24, #0
710082FF cmp w23, #32
54000B2B blt G_M40488_IG32
D2804020 movz x0, #513
F2A10080 movk x0, #0x804 LSL #16
F2C40200 movk x0, #0x2010 LSL #32
F2F00800 movk x0, #0x8040 LSL #48
97FFF0DF bl System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|29_0(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
4EA01C08 mov v8.16b, v0.16b
52800020 mov w0, #1
6E084509 mov v9.d[0], v8.d[1]
97FFF0A5 bl System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|20_0(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
4EA01C0A mov v10.16b, v0.16b
B9400AC0 ldr w0, [x22,#8]
6B00029F cmp w20, w0
54001BE2 bhs G_M40488_IG44
93407E80 sxtw x0, x20
91004000 add x0, x0, #16
8B0002C0 add x0, x22, x0
F9000BA0 str x0, [fp,#16] // [V42 loc39]
F9400BB9 ldr x25, [fp,#16] // [V42 loc39]
B9401260 ldr w0, [x19,#16]
7100801F cmp w0, #32
6E180528 mov v8.d[1], v9.d[0]
5400052B blt G_M40488_IG30
;; bbWeight=0.50 PerfScore 11.50
G_M40488_IG29:
F9400660 ldr x0, [x19,#8]
131F7F01 asr w1, w24, #31
12001021 and w1, w1, #31
0B180021 add w1, w1, w24
13057C21 asr w1, w1, #5
B9400802 ldr w2, [x0,#8]
6B02003F cmp w1, w2
540019C2 bhs G_M40488_IG44
93407C21 sxtw x1, x1
D37EF421 lsl x1, x1, #2
91004021 add x1, x1, #16
B8616800 ldr w0, [x0, x1]
12009C01 and w1, w0, #0xff00ff
13812021 ror w1, w1, #8
12089C00 and w0, w0, #0xff00ff00
13806000 ror w0, w0, #24
0B000020 add w0, w1, w0
6E084509 mov v9.d[0], v8.d[1]
6E08454B mov v11.d[0], v10.d[1]
97FFF096 bl System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
4E003810 zip1 v16.16b, v0.16b, v0.16b
4E103A10 zip1 v16.16b, v16.16b, v16.16b
4E103A11 zip1 v17.16b, v16.16b, v16.16b
6E180528 mov v8.d[1], v9.d[0]
4E281E31 and v17.16b, v17.16b, v8.16b
6E18056A mov v10.d[1], v11.d[0]
6E2A6E31 umin v17.16b, v17.16b, v10.16b
93407F00 sxtw x0, x24
8B000320 add x0, x25, x0
4C007011 st1 {v17.16b}, [x0]
4E107A10 zip2 v16.16b, v16.16b, v16.16b
4E281E10 and v16.16b, v16.16b, v8.16b
6E2A6E10 umin v16.16b, v16.16b, v10.16b
91004000 add x0, x0, #16
4C007010 st1 {v16.16b}, [x0]
11008318 add w24, w24, #32
11008300 add w0, w24, #32
B9401261 ldr w1, [x19,#16]
6B01001F cmp w0, w1
54FFFB2D ble G_M40488_IG29 .ctor G_M45590_IG07:
93407EC1 sxtw x1, x22
8B010001 add x1, x0, x1
4C407030 ld1 {v16.16b}, [x1]
4E010FF1 dup v17.16b, wzr
6E318E10 cmeq v16.16b, v16.16b, v17.16b
4E201E10 and v16.16b, v16.16b, v0.16b
4E30BE10 addp v16.16b, v16.16b, v16.16b
4E30BE10 addp v16.16b, v16.16b, v16.16b
4E30BE10 addp v16.16b, v16.16b, v16.16b
91004021 add x1, x1, #16
4C407031 ld1 {v17.16b}, [x1]
4E010FF2 dup v18.16b, wzr
6E328E31 cmeq v17.16b, v17.16b, v18.16b
4E201E31 and v17.16b, v17.16b, v0.16b
4E31BE31 addp v17.16b, v17.16b, v17.16b
4E31BE31 addp v17.16b, v17.16b, v17.16b
4E31BE31 addp v17.16b, v17.16b, v17.16b
4E513A10 zip1 v16.8h, v16.8h, v17.8h
3D8007B0 str q16, [fp,#16] // [V39 tmp11]
B94013A1 ldr w1, [fp,#16] // [V39 tmp11]
F9400662 ldr x2, [x19,#8]
131F7EC3 asr w3, w22, #31
12001063 and w3, w3, #31
0B160063 add w3, w3, w22
13057C63 asr w3, w3, #5
B9400844 ldr w4, [x2,#8]
6B04007F cmp w3, w4
54000782 bhs G_M45590_IG18
93407C63 sxtw x3, x3
D37EF463 lsl x3, x3, #2
91004063 add x3, x3, #16
2A2103E1 mvn w1, w1
B8236841 str w1, [x2, x3]
110082D6 add w22, w22, #32
;; bbWeight=2 PerfScore 90.00
G_M45590_IG08:
110082C1 add w1, w22, #32
6B0102BF cmp w21, w1
54FFFB8A bge G_M45590_IG07 Couple notes here:
|
@TamarChristinaArm, would you or someone else from ARM be able to take a glance at the PR and codegen and make sure we are using the right tricks? Also CC. @CarolEidt |
@stephentoub should we have an analyzer for that? |
As far as I understand this hack is a trick to allow intrinsics to be used via reflection. |
Yes, its used for basically any form of indirect invocation including: reflection, delegates, the debugger/immediate window, and when non constant inputs are given to an intrinsic that requires them. |
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
07d9f31
to
4e3891e
Compare
I don't see any Libraries test run for CoreCLR on ARM on the CI list (there's Mono one but I assume that Mono doesn't support HWIntrinsics - #33761). Is it correct to assume that Libraries tests are not run on ARM machines w/ CoreCLR? Surprising if this is the case. |
Taking a look now, will have some feedback by end of today. |
Implementation wise these are all correct and fine. So I have comments mostly about the codegen out of the JIT than the actual implementation here. For e.g. this code for
should ideally be
Which means you don't need the adjustment for for
currently generate
Seems like
could be Also am I missing something? I see
With no actual usages of
For the algorithm itself, yeah as @Gnbrkm41 says
and you don't need the byte reversals as they would be encoded in the For the constructor
Aside from what @echesakovMSFT said that this should be a
If it ever actually need a
|
Thanks a lot for the analysis @TamarChristinaArm As for
The actual implementation is here: https://source.dot.net/#System.Private.CoreLib/Vector128.cs,434, as you may be able to see, we currently have a specialized path for Fixing this is tracked by #33308 and #33496 and should be relatively straightforward given the APIs we have exposed now. |
It appears that is true. @safern @ViktorHofer Is this intended? How can we get this libraries change properly tested on ARM64 (Linux, Windows) hardware? Also, ideally, this should be tested with a Checked CoreCLR also. The ARM64 hardware intrinsics support is new, so we'd prefer to run code through a checked JIT to see if any asserts crop up. |
Due to hardware limitations those are not run in PRs but they are run on merge: runtime/eng/pipelines/runtime.yml Lines 669 to 672 in a43d6c4
However, we can just queue a manual build from this branch, which will cause those jobs to run. Which I just did: https://dev.azure.com/dnceng/public/_build/results?buildId=568064 |
There were indeed some failures on arm: https://dev.azure.com/dnceng/public/_build/results?buildId=568064&view=ms.vss-test-web.build-test-results-tab&runId=17835402&resultId=134928&paneView=debug |
New test run https://dev.azure.com/dnceng/public/_build/results?buildId=572708 shows test failures on Linux arm64 Release with https://helix.dot.net/api/2019-06-17/jobs/1e9376b5-c757-4eae-aa25-a887a41cd095/workitems/System.Collections.Tests/console
I will run this locally to see what's going on |
I have been debugging this for awhile - turned out this failure was Linux only. I didn't see the failure yesterday since I was running benchmarks on Windows/arm64 laptop and we didn't see this in CI before since your branch was based on top of an older commit (before #33936). I submitted a PR to fix the issue #34107 - the fix should be merged before this change can go in. |
Yes, @Gnbrkm41 please do as Bruce suggested one more time (I hope this one is final) |
9738573
to
c97520d
Compare
@BruceForstall, @echesakovMSFT - Done 🙂 |
I manually triggered https://dev.azure.com/dnceng/public/_build/results?buildId=577868&view=results |
@tannergooding @echesakovMSFT @GrabYourPitchforks All the tests have passed (except a first-pass set of flaky failures that are still visible?). Anyone want to do a final code review and sign-off so this can be merged? |
// Same logic as SSSE3 path, except we do not have Shuffle instruction. | ||
// (TableVectorLookup could be an alternative - dotnet/runtime#1277) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like a tracking issue to be logged for this, so we don't forget to replace the implementation once TableVectorLookup is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this should be addressed by #38780.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good. Thank you!
Thanks @Gnbrkm41 |
@echesakovMSFT, Has there been any improvements regarding "cmeq (register) when one of the operands is Vector128.Zero will be replaced with cmeq (zero) when we support this optimization"? |
@Gnbrkm41 No, I haven't had time to work on this yet, but I have this on my backlog. |
Resolves #33309
Contributes towards #33308 (All the work items under BitArray)
Due to network issues I have with the home network I was not run able to run tests / benchmarks on my only ARM machine (Raspberry Pi). In the meantime, hopefully the CI would be able to verify if anything is wrong with my code.
cc @BruceForstall, @echesakovMSFT, @tannergooding