Improve Ascii (and Utf8) encoding #85266

Daniel-Svensson · 2023-04-24T18:10:09Z

Reduce overhead of NarrowUtf16ToAscii when Vector128 is hardware accelerated.

Inline NarrowUtf16ToAscii_Intrinsified
For 33 char all ascii case the speedup is roughly

On Skylake-class hardware (i7-6700K), the time taken goes from > 6.6 ns to < 5.9 ns (< 90% of the original time)
On Zen 3, it goes from 5.1 ns to 3.9 ns (~77% of the original time)

Use more efficient code to store "half a vector"
Use a single movsd instruction instead of a long series of 10-16 instructions that store a temporary on the stack.

* from 10/17 instructions to 1 for 64/32-bit x86
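
As a rough illustration of the narrowing step described above, here is a minimal sketch for exactly 8 UTF-16 code units. The class and method names (`NarrowSketch`, `TryNarrow8`) are hypothetical and this is not the CoreLib implementation; it only assumes the public `Vector128` and `Unsafe` APIs available in .NET 7 and later.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class NarrowSketch
{
    // Hypothetical helper, for illustration only: narrows the first 8 UTF-16
    // code units of 'source' into 8 ASCII bytes at the start of 'destination'.
    // Returns false (and writes nothing) if any code unit is outside ASCII.
    public static bool TryNarrow8(ReadOnlySpan<char> source, Span<byte> destination)
    {
        if (source.Length < 8 || destination.Length < 8)
        {
            throw new ArgumentException("Both spans need at least 8 elements.");
        }

        ref ushort src = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(source));
        Vector128<ushort> utf16 = Vector128.LoadUnsafe(ref src);

        // An ASCII code unit has no bits set in 0xFF80.
        if ((utf16 & Vector128.Create((ushort)0xFF80)) != Vector128<ushort>.Zero)
        {
            return false;
        }

        // Narrow 8 x 16-bit to 8 x 8-bit; the lower half of the result holds the bytes we want.
        Vector128<byte> narrow = Vector128.Narrow(utf16, utf16);

        // Store only the lower 8 bytes ("half a vector") with one 64-bit write.
        Unsafe.WriteUnaligned<double>(ref MemoryMarshal.GetReference(destination), narrow.AsDouble().ToScalar());
        return true;
    }
}
```

Calling `NarrowSketch.TryNarrow8("ABCDEFGH", buffer)` with an 8-byte `buffer` should fill it with the bytes 0x41 through 0x48; an input containing a non-ASCII character such as 'é' should return false and leave the buffer untouched.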

ghost · 2023-04-24T18:10:22Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Author:	Daniel-Svensson
Assignees:	-
Labels:	`area-System.Text.Encoding`, `community-contribution`
Milestone:	-

EgorBo · 2023-04-24T19:48:32Z

> Inline NarrowUtf16ToAscii_Intrinsified
> For 33 char all ascii case the speedup is roughly

What about smaller size (which are more popular)?

We're seeing more and more problems because of AggressiveInlining attribute placed on large methods (like yours) in the inliner

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

Daniel-Svensson · 2023-05-02T10:51:02Z

> > Inline NarrowUtf16ToAscii_Intrinsified
> > For 33 char all ascii case the speedup is roughly
>
> What about smaller size (which are more popular)?
>
> We're seeing more and more problems because of AggressiveInlining attribute placed on large methods (like yours) in the inliner

I did not see any regression for smaller inputs; if anything, there may be an unexpected improvement.
I re-ran some benchmarks with smaller inputs on an Intel Core i7-6700K and an AMD Zen 3.

"Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower" is the numbers without inlining and
"Ascii_Local_NarrowUtf16ToAscii_v2_Inline" is with inlining. You can ignore the rest (they are experiments that I might create separate PRs for).

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1555/22H2/2022Update/SunValley2)
AMD Ryzen 9 5900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.2.23157.25
  [Host]     : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2
  Job-BXGTAG : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2

MaxRelativeError=0.01  IterationTime=300.0000 ms  WarmupCount=1

| Method | StringLengthInChars | Scenario | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---:|---|---:|---:|---:|---:|---:|---:|
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 5 | AsciiOnly | 2.048 ns | 0.0214 ns | 0.0376 ns | 2.038 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 5 | AsciiOnly | 1.713 ns | 0.0161 ns | 0.0143 ns | 1.714 ns | 0.83 | 0.02 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 8 | AsciiOnly | 2.499 ns | 0.0342 ns | 0.0722 ns | 2.465 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 8 | AsciiOnly | 2.096 ns | 0.0245 ns | 0.0229 ns | 2.091 ns | 0.84 | 0.03 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 15 | AsciiOnly | 3.167 ns | 0.0261 ns | 0.0218 ns | 3.168 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 15 | AsciiOnly | 2.723 ns | 0.0326 ns | 0.0694 ns | 2.693 ns | 0.89 | 0.02 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 16 | AsciiOnly | 3.099 ns | 0.0314 ns | 0.0294 ns | 3.109 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 16 | AsciiOnly | 2.736 ns | 0.0387 ns | 0.0414 ns | 2.717 ns | 0.88 | 0.02 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 19 | AsciiOnly | 3.405 ns | 0.0407 ns | 0.0381 ns | 3.401 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 19 | AsciiOnly | 3.097 ns | 0.0338 ns | 0.0299 ns | 3.090 ns | 0.91 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 31 | AsciiOnly | 4.393 ns | 0.0278 ns | 0.0260 ns | 4.388 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 31 | AsciiOnly | 4.280 ns | 0.0536 ns | 0.0573 ns | 4.283 ns | 0.97 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_simple_loop | 31 | AsciiOnly | 2.746 ns | 0.0349 ns | 0.0327 ns | 2.738 ns | 0.63 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v3 | 31 | AsciiOnly | 2.868 ns | 0.0378 ns | 0.0405 ns | 2.866 ns | 0.65 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v4_if | 31 | AsciiOnly | 2.900 ns | 0.0245 ns | 0.0230 ns | 2.893 ns | 0.66 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 33 | AsciiOnly | 3.908 ns | 0.0358 ns | 0.0525 ns | 3.881 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 33 | AsciiOnly | 2.916 ns | 0.0284 ns | 0.0266 ns | 2.905 ns | 0.74 | 0.01 |

EgorBo · 2023-05-06T15:18:23Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

+        {
+            // Below code translates to a single write on x86 (for both 32 and 64 bit)
+            // - we use double instead of long so that the JIT writes directly to memory without intermediate (register or stack in case of 32 bit)
+            Unsafe.WriteUnaligned<double>(ref Unsafe.Add(ref bytePtr, elementOffset), byteVector.AsDouble().ToScalar());


LGTM, but I don't think that comment is needed, because it is also expected to be a single instruction on Arm. Also, I'm not sure double is any better than long here; the JIT is expected to do the right thing anyway.

jkotas · 2023-05-07

I can believe that the double hack helps on 32-bit x86. We do not pay as much attention to the codegen quality on 32-bit x86 and there are definitely issues. I do not think it is worth it to be adding workarounds like this for x86 quality issues to CoreLib. If issues like this one are important to fix, it would be better to fix it in the JIT.

EgorBo · 2023-05-07

Ah, I missed the fact that the same code with long generates bad codegen on x86; I hoped the JIT would emit the same as for double 🙁 @jkotas do you mean the whole helper call is not needed? (It is needed because the original code was using Vector64, which we don't recommend using on x86/x64.) Or are you fine with changing this to long and removing the notes? (I agree that we'd better fix this in the JIT for 32-bit; we likely already have a few similar patterns in the BCL.)

jkotas · 2023-05-07

> @jkotas do you mean the whole helper call is not needed?

I assume that the helper call is needed to avoid Vector64, which does not produce good code on x64. Without the helper call, the alternative sequence that avoids Vector64 would have to be manually inlined in every place.
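
To make the `double` versus `long` discussion above concrete, here is a minimal sketch of the two store variants side by side. The names (`StoreLowerSketch`, `StoreLowerViaDouble`, `StoreLowerViaLong`, `Demo`) are hypothetical and are not the helper that was merged. The sketch only shows that both variants write the same 8 bytes; which machine instructions the JIT emits for each is the point debated in the thread and is not demonstrated here.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class StoreLowerSketch
{
    // Hypothetical variant 1: reinterpret the vector as doubles and write element 0.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void StoreLowerViaDouble(Vector128<byte> source, ref byte destination, nuint elementOffset = 0)
    {
        Unsafe.WriteUnaligned<double>(ref Unsafe.Add(ref destination, elementOffset), source.AsDouble().ToScalar());
    }

    // Hypothetical variant 2: the same store expressed through a 64-bit integer.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void StoreLowerViaLong(Vector128<byte> source, ref byte destination, nuint elementOffset = 0)
    {
        Unsafe.WriteUnaligned<long>(ref Unsafe.Add(ref destination, elementOffset), source.AsInt64().ToScalar());
    }

    // Writes the lower 8 bytes of a known vector at offset 2 with both variants.
    public static (byte[] ViaDouble, byte[] ViaLong) Demo()
    {
        Vector128<byte> vector = Vector128.Create(new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 });
        byte[] viaDouble = new byte[12];
        byte[] viaLong = new byte[12];
        StoreLowerViaDouble(vector, ref viaDouble[0], (nuint)2);
        StoreLowerViaLong(vector, ref viaLong[0], (nuint)2);
        return (viaDouble, viaLong);
    }
}
```

In `Demo`, both arrays should end up holding the bytes 1 through 8 at indices 2 through 9, with the remaining elements left at zero. Wrapping the store in one helper, as discussed above, keeps the choice of scalar type in a single place if the JIT behaviour changes later.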

EgorBo · 2023-05-07T14:34:55Z

@Daniel-Svensson can you please do the same for https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs#LL532C46-L532C46 ? (you can either inline Unsafe.WriteUnaligned there or move your helper to Vector128 as an internal API)

Daniel-Svensson · 2023-05-08T06:35:55Z

> @Daniel-Svensson can you please do the same for https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs#LL532C46-L532C46 ? (you can either inline Unsafe.WriteUnaligned there or move your helper to Vector128 as an internal API)

I moved the helper to Vector128 and called it from there.

I also found an old, unused helper (0 references) with the same purpose in that file. Do you want me to remove that method or update it to call the new helper?

runtime/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs

Lines 466 to 499 in 5f0d620

    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe void Widen8To16AndAndWriteTo(Vector128<byte> narrowVector, char* pDest, nuint destOffset)
    {
        if (Vector256.IsHardwareAccelerated)
        {
            Vector256<ushort> wide = Vector256.WidenLower(narrowVector.ToVector256Unsafe());
            wide.StoreUnsafe(ref *(ushort*)pDest, destOffset);
        }
        else
        {
            Vector128.WidenLower(narrowVector).StoreUnsafe(ref *(ushort*)pDest, destOffset);
            Vector128.WidenUpper(narrowVector).StoreUnsafe(ref *(ushort*)pDest, destOffset + 8);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe void Narrow16To8AndAndWriteTo(Vector128<ushort> wideVector, byte* pDest, nuint destOffset)
    {
        Vector128<byte> narrow = Vector128.Narrow(wideVector, wideVector);

        if (Sse2.IsSupported)
        {
            // MOVQ is supported even on x86, unaligned accesses allowed
            Sse2.StoreScalar((ulong*)(pDest + destOffset), narrow.AsUInt64());
        }
        else if (Vector64.IsHardwareAccelerated)
        {
            narrow.GetLower().StoreUnsafe(ref *pDest, destOffset);
        }
        else
        {
            Unsafe.WriteUnaligned<ulong>(pDest + destOffset, narrow.AsUInt64().ToScalar());
        }
    }

adamsitnik

Impressive optimization @Daniel-Svensson !

> I also found an old unused helper (0) references with the same purpose in that file,

Please just remove the unused helper.

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

adamsitnik

LGTM, again thank you for your contribution @Daniel-Svensson !

Daniel-Svensson added 2 commits April 2, 2023 20:15

Improve writing of lower vector part in ascii convertion

9f15f72

* from 10 /17 to 1 instruction for 64/32 bit x86

Add [MethodImpl(MethodImplOptions.AggressiveInlining)] to NarrowUtf16ToAscii_Intrinsified

64fca83

dotnet-issue-labeler bot added the area-System.Text.Encoding label Apr 24, 2023

ghost added the community-contribution label Apr 24, 2023

Daniel-Svensson mentioned this pull request Apr 24, 2023

[Perf] Linux/x64: 13 Regressions on 4/4/2023 11:36:00 PM dotnet/perf-autofiling-issues#15719

Closed

EgorBo reviewed Apr 24, 2023

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

This was referenced Apr 24, 2023

[wasm] interpreter timeouts when WebSocket closes unexpectedly #84101

Closed

Various WASM timeouts on CI #85304

Closed

rewrite StoreLower without Sse2.StoreScalar

c397631

This was referenced May 2, 2023

Test_EventSource_EtwManifestGeneration* tests failing in CI #48798

Closed

Failures in System.Net.Mail.Tests.SmtpClientTest tests #85637

Closed

System.IO.Tests.RandomAccess_NoBuffering.ReadUsingSingleBuffer timing out #85659

Closed

Daniel-Svensson requested a review from EgorBo May 6, 2023 13:22

EgorBo reviewed May 6, 2023

update comment

1103134

EgorBo approved these changes May 7, 2023

move helper to Vector128 and call in case conversion

70d71d0

BrennanConroy mentioned this pull request May 8, 2023

[API Proposal]: Ascii.ToUtf16 overload that treats \0 as invalid #80366

Closed

adamsitnik reviewed May 11, 2023

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

adamsitnik added this to the 8.0.0 milestone May 11, 2023

adamsitnik added the tenet-performance label May 11, 2023

adamsitnik self-assigned this May 11, 2023

adamsitnik mentioned this pull request May 11, 2023

Ascii benchmarks dotnet/performance#3016

Merged

Daniel-Svensson and others added 2 commits May 11, 2023 14:51

Update src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

c560cf3

Co-authored-by: Adam Sitnik

remove unused helpers

8411b06

Daniel-Svensson added 2 commits May 11, 2023 14:55

merge upstream/main

dcf8a70

remove unused methods after merge

3a99d13

build-analysis bot mentioned this pull request May 11, 2023

Checkout failure: "Git fetch failed with exit code 128" dotnet/arcade#9009

Open

2 tasks

adamsitnik approved these changes May 12, 2023

adamsitnik merged commit f1819bd into dotnet:main May 12, 2023

Daniel-Svensson deleted the ascii_speedup branch May 12, 2023 15:17

kunalspathak mentioned this pull request May 16, 2023

[Perf] Linux/x64: 1 Improvement on 5/12/2023 10:43:40 AM dotnet/perf-autofiling-issues#17867

Closed

radekdoulik mentioned this pull request May 17, 2023

[Perf] Linux/x64: 1 Regression on 5/11/2023 9:15:05 PM dotnet/perf-autofiling-issues#17838

Closed

ghost locked as resolved and limited conversation to collaborators Jun 11, 2023
