Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ChaCha and Salsa intrinsics. #128

Open
wants to merge 65 commits into
base: master
Choose a base branch
from

Conversation

TimothyMakkison
Copy link

@TimothyMakkison TimothyMakkison commented Oct 11, 2022

Added @macaba intrinsics implementation #58 for ChaCha20. Salsa20 has since been added to NaCl, so I created Salsa20BaseIntrinsics based on macaba's work. At minimum both got a 4x speed up.

  • Added ChaCha20 & Salsa20.
  • Added benchmarks for Salsa20 and XSalsa20.
  • Added checks for intrinsics support and failovers.
  • Added VariableLengthCiphers tests, to ensure that the intrinsics methods work when using length specific methods.

This pr will require a way of testing with and without intrinsics, I'm not sure how to do this but I think COMPlus_EnableAVX is related?

A future update could speed up ProcessKeyStreamblock by using the ProcessStream method.

private static byte[] _zeros = new byte[BLOCK_SIZE_IN_BYTES];

public override void ProcessKeyStreamBlock(ReadOnlySpan<byte> nonce, int counter, Span<byte> block)
{

    if (block.Length != BLOCK_SIZE_IN_BYTES)
        throw new CryptographicException($"The key stream block length is not valid. The length in bytes must be {BLOCK_SIZE_IN_BYTES}.");
#if INTRINSICS
    if (Sse2.IsSupported)
    {
        ProcessStream(nonce, block, _zeros, counter);
        return;
    }
...

@codecov-commenter
Copy link

codecov-commenter commented Oct 12, 2022

Codecov Report

Merging #128 (d7d04d4) into master (ed3bf52) will decrease coverage by 10.75%.
The diff coverage is 89.41%.

@@             Coverage Diff             @@
##           master     #128       +/-   ##
===========================================
- Coverage   99.78%   89.02%   -10.76%     
===========================================
  Files          14       26       +12     
  Lines         455     1421      +966     
  Branches       57       97       +40     
===========================================
+ Hits          454     1265      +811     
- Misses          1      146      +145     
- Partials        0       10       +10     
Impacted Files Coverage Δ
src/NaCl.Core/Base/ChaChaCore/ChaCha20Core.cs 0.00% <0.00%> (ø)
src/NaCl.Core/Base/SalsaCore/Salsa20Core.cs 0.00% <0.00%> (ø)
src/NaCl.Core/Base/Snuffle.cs 57.14% <ø> (-40.00%) ⬇️
src/NaCl.Core/ChaCha20.cs 100.00% <ø> (ø)
src/NaCl.Core/Salsa20.cs 100.00% <ø> (ø)
src/NaCl.Core/XChaCha20.cs 100.00% <ø> (ø)
src/NaCl.Core/XSalsa20.cs 100.00% <ø> (ø)
src/NaCl.Core/Base/ChaCha20Base.cs 73.46% <62.50%> (-26.54%) ⬇️
src/NaCl.Core/Base/Salsa20Base.cs 57.77% <62.50%> (-42.23%) ⬇️
src/NaCl.Core/Base/ChaCha20BaseIntrinsics.cs 77.77% <77.77%> (ø)
... and 13 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@daviddesmet
Copy link
Owner

Thanks for the contribution and for adding Salsa20 intrinsics too!
I wonder if we can reduce the method complexity of those intrinsics ones by breaking them apart. Afaik, the initial code was grabbed from another repository which I currently don't recall.

There's also XSalsa20 over there but I haven't promoted those to stable since I felt it needed more tests. The target was to also support XSalsa20Poly1305.

Regarding testing of intrinsics turned on and off, yes, it was something around COMPlus_EnableAVX, I'm not sure since I looked at this a while ago. For the CI, we might need to add tests for running with and without intrinsics I guess.

Btw, do you have some benchmark numbers you can share to see the perf increase?

I might need to take a look at why the Azure CI failed with Windows only.

@TimothyMakkison
Copy link
Author

I wonder if we can reduce the method complexity of those intrinsics ones by breaking them apart.

I've split Intrinsics up by chunk size and altered ChaCha64 and Salsa64 to add HChaCha/HSalsa support. Aside from that I haven't changed the other methods. I guess I could extract some of the Xor/Store/Unpack methods and give them nicer names. I'll add ProcessKeyStreamIntrinsics, the changes to the 64 chunk methods should make it painless.

There's also XSalsa20 over there but I haven't promoted those to stable since I felt it needed more tests. The target was to also support XSalsa20Poly1305.

Regarding testing of intrinsics turned on and off, yes, it was something around COMPlus_EnableAVX, I'm not sure since I looked at this a while ago. For the CI, we might need to add tests for running with and without intrinsics I guess.

I'll take a look, I've never used env variables in unit tests before, hopefully Google will help.

Btw, do you have some benchmark numbers you can share to see the perf increase?

My numbers are from some benchmarks I did yesterday. Sorry about the quality, my laptops about to die 😅

Salsa

image

ChaCha

image

@daviddesmet
Copy link
Owner

I've split Intrinsics up by chunk size and altered ChaCha64 and Salsa64 to add HChaCha/HSalsa support. Aside from that I haven't changed the other methods. I guess I could extract some of the Xor/Store/Unpack methods and give them nicer names. I'll add ProcessKeyStreamIntrinsics, the changes to the 64 chunk methods should make it painless.

Sounds good, thanks a lot for the effort you are putting into this.

I'll take a look, I've never used env variables in unit tests before, hopefully Google will help.

If you run the benchmarks from Powershell or cmd via dotnet run, you can set the variable using set VARIABLE=VALUE before the dotnet command.

My numbers are from some benchmarks I did yesterday. Sorry about the quality, my laptops about to die 😅

Salsa

image

ChaCha

image

Thanks for sharing those, the numbers look nice :)

src/NaCl.Core/Base/ChaCha20Base.cs Outdated Show resolved Hide resolved
src/NaCl.Core/Base/ChaCha20Base.cs Outdated Show resolved Hide resolved
src/NaCl.Core/Base/ChaCha20BaseIntrinsics.cs Show resolved Hide resolved
src/NaCl.Core/Base/ChaChaCore/ChaCha20CoreIntrinsics.cs Outdated Show resolved Hide resolved
test/NaCl.Core.SimdTests/ChaCha20IntrinsicsTests .cs Outdated Show resolved Hide resolved
test/NaCl.Core.SimdTests/ChaCha20IntrinsicsTests .cs Outdated Show resolved Hide resolved
test/NaCl.Core.SimdTests/ChaCha20ScalarTests.cs Outdated Show resolved Hide resolved
test/NaCl.Core.SimdTests/Salsa20IntrinsicsTests.cs Outdated Show resolved Hide resolved
test/NaCl.Core.SimdTests/Salsa20ScalarTests.cs Outdated Show resolved Hide resolved
@daviddesmet
Copy link
Owner

@TimothyMakkison my apologies for the late follow-up, it has been hectic over here.

Thanks a ton for pulling this big task off, I added a few comments to the PR; most of them are minor stuff.

I also noticed a PowerShell script but I didn't see it being called from DevOps, is it something you intended to do? The build defined in this project uses Cake, we could set the environment variable in there afaik; I will need to investigate further into that. Perhaps we can use that and just define tests that should be running with specific environment variables without the need to update the GitHub actions.

Last but not least, I wonder if you could work on the coverage a little bit more since we dropped quite a lot with this PR, you can check codecov from above.

src/NaCl.Core/Base/ChaCha20BaseIntrinsics.cs Outdated Show resolved Hide resolved
src/NaCl.Core/Base/ChaCha20BaseIntrinsics.cs Outdated Show resolved Hide resolved
src/NaCl.Core/Base/ChaCha20BaseIntrinsics.cs Outdated Show resolved Hide resolved
src/NaCl.Core/Base/Salsa20BaseIntrinsics.cs Outdated Show resolved Hide resolved
src/NaCl.Core/Base/Salsa20BaseIntrinsics.cs Outdated Show resolved Hide resolved
src/NaCl.Core/Base/Salsa20BaseIntrinsics.cs Outdated Show resolved Hide resolved
@TimothyMakkison
Copy link
Author

Thanks for the review. I got sidetracked and forgot about this PR so it's unfinished.

I wonder if we do need the preprocessor directive since we are checking if the hardware supports it

System.Runtime.Intrinsics.X86 is only supported for .NET 3.0+. Intrinsics code won't compile for libraries that multi target versions unless we use a preprocessor.

Could you make this access modifier implicit?

Not sure what you mean by implicit, do you mean private?

We could switch to scoped namespace declaration here since we already using it on the other files =)

Should I convert all the files in the solution to scoped namespaces?

I also noticed a PowerShell script but I didn't see it being called from DevOps, is it something you intended to do? The build defined in this project uses Cake, we could set the environment variable in there afaik;

Sorry I was using the benchmark code and powershell script for testing and forgot to remove them. I'll modify the cake file.

Do you have an opinion on the refactor?

The logic up to 202ff15 was fairly simple; the intrinsics code was added with pre processing and an IsSupported check. This has the advantage of not touching the api, keeping the code/code flow the same and adding minimal additional code and files.

Commit 4d68791 separates the intrinsic and non intrinsic code from SalsaBase/ChaChaBase creating individual files for them, SalsaBase/ChaChaBase act as a facade calling the internal IChaChaCore interface for all methods. This has the disadvantage of adding a lot of code, altering a lot of code to keep the api the same, duplicating code and arguably being overly complex.

I'll probably undo the refactor unless you prefer it, I thinks it's too complicated, harder to maintain, and isn't needed if the CI will test SIMD and non SIMD versions. Should also improve the code coverage and reduce duplication.

@daviddesmet
Copy link
Owner

Thanks for the review. I got sidetracked and forgot about this PR so it's unfinished.

No worries, I got a pile of work and also lost track about your PRs, sorry about that!

I wonder if we do need the preprocessor directive since we are checking if the hardware supports it

System.Runtime.Intrinsics.X86 is only supported for .NET 3.0+. Intrinsics code won't compile for libraries that multi target versions unless we use a preprocessor.

I think we can drop .NET 3 and just stick with the supported versions going forward. Supporting obsolete versions of .NET just adds a lot of maintainability headaches.

Could you make this access modifier implicit?

Not sure what you mean by implicit, do you mean private?

Yes, I mean adding to those the private modifier.

We could switch to scoped namespace declaration here since we already using it on the other files =)

Should I convert all the files in the solution to scoped namespaces?

I would opt to have consistency in all files whenever we can.

I also noticed a PowerShell script but I didn't see it being called from DevOps, is it something you intended to do? The build defined in this project uses Cake, we could set the environment variable in there afaik;

Sorry I was using the benchmark code and powershell script for testing and forgot to remove them. I'll modify the cake file.

No worries, it was just a question since I didn't know the purpose of it, you could leave that one if you want.

Do you have an opinion on the refactor?

The logic up to 202ff15 was fairly simple; the intrinsics code was added with pre processing and an IsSupported check. This has the advantage of not touching the api, keeping the code/code flow the same and adding minimal additional code and files.

Commit 4d68791 separates the intrinsic and non intrinsic code from SalsaBase/ChaChaBase creating individual files for them, SalsaBase/ChaChaBase act as a facade calling the internal IChaChaCore interface for all methods. This has the disadvantage of adding a lot of code, altering a lot of code to keep the api the same, duplicating code and arguably being overly complex.

I'll probably undo the refactor unless you prefer it, I thinks it's too complicated, harder to maintain, and isn't needed if the CI will test SIMD and non SIMD versions. Should also improve the code coverage and reduce duplication.

Yeah, I agree with that. It added a lot of maintainability complexity, if we limit the scope of the supported .NET versions we can stick with the first logic you proposed.

@TimothyMakkison
Copy link
Author

TimothyMakkison commented Nov 9, 2022

Yeah, I agree with that. It added a lot of maintainability complexity, if we limit the scope of the supported .NET versions we can stick with the first logic you proposed.
I think we can drop .NET 3 and just stick with the supported versions going forward. Supporting obsolete versions of .NET just adds a lot of maintainability headaches.

The "refactor" was added to make the code testable, with the CI and cake file this thankfully isn't needed. This pr is compatible with all the different versoins of .NET it just requires a couple of pre processor checks.

  • Updated the cake file, NET 7 RC 2 has broken my sdk so I can't install the cake tool and test it. Hopefully it works 🤞
  • Reverted the changes to Salsa20Base and ChaCha20Base.
  • Deleted intrinsic/scalar tests.
  • Optimized the shuffle function for Salsa64, this used a different instruction set so I updated the guard checks.
  • To reduce the scope of this PR I haven't updated all the namespaces. I'm tempted to create a code cleanup pr changing the namespaces, exception parameters, empty arrays and unused fields etc.

Do you think it would be worth getting someone who knows intrinsics to double check this? This was my first time using intrinisics/unsafe code in this way.

@daviddesmet
Copy link
Owner

  • To reduce the scope of this PR I haven't updated all the namespaces. I'm tempted to create a code cleanup pr changing the namespaces, exception parameters, empty arrays and unused fields etc.

Yeah, sounds great!

Do you think it would be worth getting someone who knows intrinsics to double check this? This was my first time using intrinisics/unsafe code in this way.

That would be nice but I really don't know anyone from my side. :/

@TimothyMakkison
Copy link
Author

I had the cunning idea of submitting a pr to the bouncycastle project. Give them a performance bump for large cipherstreams and have another set of eyes look at the implementations. Once I get a grasp of IDigest I'll make a pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants