This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Use Span overloads with Rfc2898DeriveBytes computations #23269

Merged: 1 commit into dotnet:master on Aug 15, 2017

bartonjs commented Aug 15, 2017

Change from ComputeHash(byte[])=>byte[] to TryComputeHash(src, dest) to
reduce the number of allocations involved.

For iteration counts of 1000, 10000, and 100000, this shows roughly a 15% reduction in run time
and an almost complete elimination of GC events (the reduced GC work accounts for most of that 15%).

Fixes https://github.com/dotnet/corefx/issues/16925.
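
For readers unfamiliar with the two call shapes, here is a minimal sketch of the difference, not the PR's actual code. `HashChainSketch` and its method names are hypothetical, and the repeated-hash loop is only a stand-in for the PBKDF2 inner loop. `ComputeHash(byte[])` returns a fresh array on every call, while `TryComputeHash(source, destination, out bytesWritten)` writes into a caller-owned buffer.

```csharp
using System;
using System.Security.Cryptography;

public static class HashChainSketch
{
    // Allocating shape: each round returns a new byte[] for the GC to collect.
    public static byte[] ChainAllocating(HMAC hmac, byte[] seed, int rounds)
    {
        if (rounds < 1) throw new ArgumentOutOfRangeException(nameof(rounds));
        byte[] current = seed;
        for (int i = 0; i < rounds; i++)
        {
            current = hmac.ComputeHash(current);
        }
        return current;
    }

    // Span shape: two buffers are allocated once and reused for every round.
    public static byte[] ChainSpan(HMAC hmac, byte[] seed, int rounds)
    {
        if (rounds < 1) throw new ArgumentOutOfRangeException(nameof(rounds));
        int hashSize = hmac.HashSize / 8;
        byte[] current = new byte[hashSize];
        byte[] next = new byte[hashSize];

        if (!hmac.TryComputeHash(seed, current, out int written) || written != hashSize)
            throw new CryptographicException();

        for (int i = 1; i < rounds; i++)
        {
            // Source and destination are kept separate so this sketch does not
            // rely on overlapping buffers being supported.
            if (!hmac.TryComputeHash(current, next, out written) || written != hashSize)
                throw new CryptographicException();

            byte[] swap = current;
            current = next;
            next = swap;
        }
        return current;
    }
}
```

Both methods should produce the same bytes for the same keyed HMAC and seed; only the number of intermediate arrays differs.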

{
temp = _hmac.ComputeHash(temp);
Span<byte> uiSpan = new Span<byte>(ui, 0, _blockSize);
Member

I think I'm confused. Why are we getting an array from the pool, hashing into it, then allocating a return array, and copying the results into that... why not just allocate the return array initially and do the initial hash into that? i.e. why bother with the pool at all?

If we're going to use the pool, seems like we'd want to use it for temp.

Member Author

I pooled the U_i, and allocated the return value.

F (P, S, c, i) = U_1 \xor U_2 \xor ... \xor U_c

where

U_1 = PRF (P, S || INT (i)) ,
U_2 = PRF (P, U_1) ,
...
U_c = PRF (P, U_{c-1}) .
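
A minimal sketch of the definition above, assuming the PRF is an HMAC already keyed with P. It is an illustration of the "pooled U_i, allocated return value" idea, not the merged implementation; `Pbkdf2BlockSketch` and `F` are hypothetical names, and the second rented buffer exists only so the sketch never hashes a buffer into itself.

```csharp
using System;
using System.Buffers;
using System.Security.Cryptography;

public static class Pbkdf2BlockSketch
{
    // Computes F(P, S, c, i) for one block; prf is assumed to be keyed with P.
    public static byte[] F(HMAC prf, byte[] salt, int iterations, uint blockIndex)
    {
        if (iterations < 1) throw new ArgumentOutOfRangeException(nameof(iterations));
        int blockSize = prf.HashSize / 8;

        // S || INT(i), where INT(i) is the 4-byte big-endian block index.
        byte[] temp = new byte[salt.Length + sizeof(uint)];
        Buffer.BlockCopy(salt, 0, temp, 0, salt.Length);
        temp[salt.Length] = (byte)(blockIndex >> 24);
        temp[salt.Length + 1] = (byte)(blockIndex >> 16);
        temp[salt.Length + 2] = (byte)(blockIndex >> 8);
        temp[salt.Length + 3] = (byte)blockIndex;

        // Rented arrays may be longer than requested, so always slice to blockSize.
        byte[] ui = ArrayPool<byte>.Shared.Rent(blockSize);
        byte[] scratch = ArrayPool<byte>.Shared.Rent(blockSize);
        try
        {
            // U_1 = PRF(P, S || INT(i))
            if (!prf.TryComputeHash(temp, new Span<byte>(ui, 0, blockSize), out int written) || written != blockSize)
                throw new CryptographicException();

            // The running XOR is the only array that outlives this call.
            byte[] ret = new byte[blockSize];
            Buffer.BlockCopy(ui, 0, ret, 0, blockSize);

            for (int n = 2; n <= iterations; n++)
            {
                // U_n = PRF(P, U_{n-1})
                if (!prf.TryComputeHash(new ReadOnlySpan<byte>(ui, 0, blockSize),
                                        new Span<byte>(scratch, 0, blockSize),
                                        out written) || written != blockSize)
                    throw new CryptographicException();

                byte[] swap = ui;
                ui = scratch;
                scratch = swap;

                for (int j = 0; j < blockSize; j++)
                {
                    ret[j] ^= ui[j];
                }
            }
            return ret;
        }
        finally
        {
            // Intermediate values are derived from the password; clear before returning to the pool.
            Array.Clear(ui, 0, blockSize);
            Array.Clear(scratch, 0, blockSize);
            ArrayPool<byte>.Shared.Return(ui);
            ArrayPool<byte>.Shared.Return(scratch);
        }
    }
}
```

The return value cannot come from the pool because it is handed to the caller, which is why only the intermediate U_i buffers are rented.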


bartonjs commented Aug 15, 2017

The test rig:

[Theory]
[InlineData("potato", 1000, 100, 32)]
[InlineData("potato", 10000, 100, 32)]
[InlineData("potato", 100000, 100, 32)]
[InlineData("potato", 1000, 100, 16)]
[InlineData("potato", 10000, 100, 16)]
[InlineData("potato", 100000, 100, 16)]
[InlineData("potato", 1000, 100, 17)]
[InlineData("potato", 10000, 100, 17)]
[InlineData("potato", 100000, 100, 17)]
[InlineData("potato", 1000, 100, 31)]
[InlineData("potato", 10000, 100, 31)]
[InlineData("potato", 100000, 100, 31)]
public static void ProfilerRun(string password, int iterationCount, int runCount, int byteLen)
{
    using (Rfc2898DeriveBytes pbkdf2 = new Rfc2898DeriveBytes(password, new byte[8], iterationCount))
    {
        var stopwatch = System.Diagnostics.Stopwatch.StartNew();

        for (int i = 0; i < runCount; i++)
        {
            pbkdf2.GetBytes(byteLen);
        }

        stopwatch.Stop();
        long millis = stopwatch.ElapsedMilliseconds;

        Console.WriteLine($"{iterationCount}, {runCount}, {byteLen} => {millis} ({(millis + runCount - 1) / runCount})");
    }
}

Before:

1000, 100, 32 => 110 (2)
10000, 100, 32 => 1098 (11)
100000, 100, 32 => 10701 (108)
1000, 100, 16 => 54 (1)
10000, 100, 16 => 528 (6)
100000, 100, 16 => 5243 (53)
1000, 100, 17 => 56 (1)
10000, 100, 17 => 557 (6)
100000, 100, 17 => 5553 (56)
1000, 100, 31 => 102 (2)
10000, 100, 31 => 1016 (11)
100000, 100, 31 => 10163 (102)

Total execution time: 35.319s. 1221 total GC events, happening about every 0.03s.

After:

1000, 100, 32 => 93 (1)
10000, 100, 32 => 926 (10)
100000, 100, 32 => 8917 (90)
1000, 100, 16 => 44 (1)
10000, 100, 16 => 441 (5)
100000, 100, 16 => 4406 (45)
1000, 100, 17 => 47 (1)
10000, 100, 17 => 468 (5)
100000, 100, 17 => 4739 (48)
1000, 100, 31 => 84 (1)
10000, 100, 31 => 845 (9)
100000, 100, 31 => 8530 (86)

Total execution time: 29.680s. 2 total GC events, both right at startup, and probably before this code started running.

Excel says my scenario-by-scenario time scale (after / before) was:


0.845455
0.843352
0.833287
0.814815
0.835227
0.840359
0.839286
0.840215
0.853413
0.823529
0.831693
0.839319

With a total scale of 0.840341. So the real answer is closer to 16%, but 15% is a "nice, round" number.
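
The arithmetic behind those figures can be reproduced with a small helper. This is only a sketch; `TimingScale` is a hypothetical name, and the inputs are the numbers reported above.

```csharp
using System;

public static class TimingScale
{
    // Ratio of the "after" time to the "before" time; below 1.0 means faster.
    public static double Scale(double before, double after) => after / before;

    // Percentage of the original time that was saved.
    public static double ReductionPercent(double before, double after) =>
        (1.0 - after / before) * 100.0;
}
```

For example, `TimingScale.Scale(110, 93)` gives the first per-scenario value (about 0.845455), and `TimingScale.ReductionPercent(35.319, 29.680)` gives just under 16% for the total execution times.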


for (int j = 0; j < _blockSize; j++)
{
ret[j] ^= ui[j];
Member

Oh, I think this answers my question above... we need to xor each time with the initial hashed data?

Member

Separately, looks like this should be pretty straightforward to vectorize, though I don't know how big _blockSize generally is or whether this is a bottleneck compared to the hashing.
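
As a rough idea of what vectorizing that XOR loop could look like, here is a sketch using `System.Numerics.Vector<byte>`. It is not part of this PR; `XorSketch` and `XorInto` are hypothetical names. Whether it helps is an open question, as noted above: the block is only as large as the hash output (20 bytes for HMAC-SHA1), so on hardware where `Vector<byte>.Count` is 32 the vector loop would not run at all for that case.

```csharp
using System;
using System.Numerics;

public static class XorSketch
{
    // XORs the first `count` bytes of ui into ret, a vector at a time where possible.
    public static void XorInto(byte[] ret, byte[] ui, int count)
    {
        int width = Vector<byte>.Count; // hardware-dependent, commonly 16 or 32
        int j = 0;

        // Vector part: process full vector-width chunks.
        for (; j <= count - width; j += width)
        {
            Vector<byte> a = new Vector<byte>(ret, j);
            Vector<byte> b = new Vector<byte>(ui, j);
            (a ^ b).CopyTo(ret, j);
        }

        // Scalar tail: whatever does not fill a whole vector.
        for (; j < count; j++)
        {
            ret[j] ^= ui[j];
        }
    }
}
```

The explicit `count` parameter matters when `ui` comes from an array pool, since a rented array can be longer than the block size.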

@@ -249,30 +250,45 @@ private void Initialize()
}

// This function is defined as follows:
- // Func (S, i) = HMAC(S || i) | HMAC2(S || i) | ... | HMAC(iterations) (S || i)
+ // Func (S, i) = HMAC(S || i) ^ HMAC2(S || i) ^ ... ^ HMAC(iterations) (S || i)
// where i is the block number.
private byte[] Func()
{
byte[] temp = new byte[_salt.Length + sizeof(uint)];
Member

How big is _salt, or is it arbitrarily large? Wondering if temp should be a stackalloc. If not, array pool?

Member Author

It's user-provided and arbitrarily large. I thought about removing this one, too, but the average number of calls per object lifetime is ~2. So I went with sanitization's 5-nines instead of sterilization's 9-nines.
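
For reference, the alternative raised in the question could be sketched as below: use the stack when the salt is small and fall back to the array pool when it is not. This PR deliberately left the `temp` allocation alone, so this is only an illustration. `SaltBufferSketch`, `HashSaltAndBlock`, and the 256-byte `StackLimit` are all hypothetical.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.Security.Cryptography;

public static class SaltBufferSketch
{
    // Hypothetical threshold; a real value would need measuring.
    private const int StackLimit = 256;

    // Hashes S || INT(i) without allocating a temporary array for the input.
    public static byte[] HashSaltAndBlock(HMAC prf, ReadOnlySpan<byte> salt, uint blockIndex)
    {
        int length = salt.Length + sizeof(uint);

        // Rent only when the input is too large for the stack buffer.
        byte[] rented = length > StackLimit
            ? ArrayPool<byte>.Shared.Rent(length)
            : Array.Empty<byte>();
        Span<byte> buffer = length <= StackLimit ? stackalloc byte[StackLimit] : rented;
        Span<byte> temp = buffer.Slice(0, length);

        try
        {
            salt.CopyTo(temp);
            BinaryPrimitives.WriteUInt32BigEndian(temp.Slice(salt.Length), blockIndex);

            byte[] result = new byte[prf.HashSize / 8];
            if (!prf.TryComputeHash(temp, result, out int written) || written != result.Length)
                throw new CryptographicException();
            return result;
        }
        finally
        {
            temp.Clear();
            if (rented.Length > 0)
            {
                ArrayPool<byte>.Shared.Return(rented);
            }
        }
    }
}
```

With roughly two calls per object lifetime, as stated above, the saving from this would be small compared with the per-iteration allocations the PR removes.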

stephentoub left a comment

Nice improvement in throughput and memory :)


for (int j = 0; j < _blockSize; j++)
if (!_hmac.TryComputeHash(temp, uiSpan, out int bytesWritten) || bytesWritten != _blockSize)
throw new CryptographicException();
Member

It's interesting that we're throwing when we didn't previously. In what situations could we hit this? Should this just be an assert instead?

Member Author

It should be unreachable (a CryptographicException should have already been thrown); but I'd rather be runtime defensive here than possibly have an edge case where we give back a non-compatible answer.
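
To make the API contract behind that check concrete: `TryComputeHash` reports failure by returning `false` when the destination span is too short for the hash. The sketch below only demonstrates that contract with a deliberately undersized buffer; it is not a scenario the PR's code is expected to hit, since its destination is sliced to exactly `_blockSize`. `TryComputeHashContract` and `TryHashInto` are hypothetical names.

```csharp
using System;
using System.Security.Cryptography;

public static class TryComputeHashContract
{
    // Attempts to hash `input` into a freshly created destination of the given length.
    public static bool TryHashInto(HMAC hmac, byte[] input, int destinationLength, out int bytesWritten)
    {
        Span<byte> destination = new byte[destinationLength];
        return hmac.TryComputeHash(input, destination, out bytesWritten);
    }
}
```

With HMAC-SHA1 (20-byte output), a 20-byte destination should succeed with `bytesWritten == 20`, while a 10-byte destination should return `false`. Throwing rather than asserting means a release build would also refuse to return a partially computed block.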

stephentoub merged commit b9010ec into dotnet:master Aug 15, 2017
bartonjs deleted the spanify_pbkdf2 branch August 16, 2017 19:16
bartonjs removed their assignment Aug 16, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Commit migrated from dotnet/corefx@b9010ec