-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Use Span overloads with Rfc2898DeriveBytes computations #23269
Conversation
Change from ComputeHash(byte[])=>byte[] to TryComputeHash(src, dest) to reduce the number of allocations involved. For iteration counts of 1000, 10000, and 100000 it shows a 15% reduction in time, and almost entire elimination of GC (most of that 15%).
{ | ||
temp = _hmac.ComputeHash(temp); | ||
Span<byte> uiSpan = new Span<byte>(ui, 0, _blockSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm confused. Why are we getting an array from the pool, hashing into it, then allocating a return array, and copying the results into that... why not just allocate the return array initially and do the initial hash into that? i.e. why bother with the pool at all?
If we're going to use the pool, seems like we'd want to use it for temp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pooled the U_i, and allocated the return value.
F (P, S, c, i) = U_1 \xor U_2 \xor ... \xor U_c
where
U_1 = PRF (P, S || INT (i)) ,
U_2 = PRF (P, U_1) ,
...
U_c = PRF (P, U_{c-1}) .
The test rig: [Theory]
[InlineData("potato", 1000, 100, 32)]
[InlineData("potato", 10000, 100, 32)]
[InlineData("potato", 100000, 100, 32)]
[InlineData("potato", 1000, 100, 16)]
[InlineData("potato", 10000, 100, 16)]
[InlineData("potato", 100000, 100, 16)]
[InlineData("potato", 1000, 100, 17)]
[InlineData("potato", 10000, 100, 17)]
[InlineData("potato", 100000, 100, 17)]
[InlineData("potato", 1000, 100, 31)]
[InlineData("potato", 10000, 100, 31)]
[InlineData("potato", 100000, 100, 31)]
public static void ProfilerRun(string password, int iterationCount, int runCount, int byteLen)
{
using (Rfc2898DeriveBytes pbkdf2 = new Rfc2898DeriveBytes(password, new byte[8], iterationCount))
{
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < runCount; i++)
{
pbkdf2.GetBytes(byteLen);
}
stopwatch.Stop();
long millis = stopwatch.ElapsedMilliseconds;
Console.WriteLine($"{iterationCount}, {runCount}, {byteLen} => {millis} ({(millis + runCount - 1) / runCount})");
}
} Before:
Total execution time: 35.319s. 1221 total GC events, happening about every 0.03s. After:
Total execution time: 29.680s. 2 total GC events, both right at startup, and probably before this code started Excel says my scenario by scenario time scale was
With a total scale of . |
|
||
for (int j = 0; j < _blockSize; j++) | ||
{ | ||
ret[j] ^= ui[j]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think this answers my question above... we need to xor each time with the initial hashed data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately, looks like this should be pretty straightforward to vectorize, though I don't know how big _blockSize generally is or whether this is a bottleneck compared to the hashing.
@@ -249,30 +250,45 @@ private void Initialize() | |||
} | |||
|
|||
// This function is defined as follows: | |||
// Func (S, i) = HMAC(S || i) | HMAC2(S || i) | ... | HMAC(iterations) (S || i) | |||
// Func (S, i) = HMAC(S || i) ^ HMAC2(S || i) ^ ... ^ HMAC(iterations) (S || i) | |||
// where i is the block number. | |||
private byte[] Func() | |||
{ | |||
byte[] temp = new byte[_salt.Length + sizeof(uint)]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How big is _salt, or is it arbitrarily large? Wondering if temp should be a stackalloc. If not, array pool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's user-provided and arbitrarily large. I thought about removing this one, too, but the average number of calls per object lifetime is ~2. So I went with sanitization's 5-nines instead of sterilizations 9-nines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement in throughput and memory :)
|
||
for (int j = 0; j < _blockSize; j++) | ||
if (!_hmac.TryComputeHash(temp, uiSpan, out int bytesWritten) || bytesWritten != _blockSize) | ||
throw new CryptographicException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's interesting that we're throwing when we didn't previously. In what situations could we hit this? Should this just be an assert instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be unreachable (a CryptographicException should have already been thrown); but I'd rather be runtime defensive here than possibly have an edge case where we give back a non-compatible answer.
…x#23269) Change from ComputeHash(byte[])=>byte[] to TryComputeHash(src, dest) to reduce the number of allocations involved. For iteration counts of 1000, 10000, and 100000 it shows a 15% reduction in time, and almost entire elimination of GC (most of that 15%). Commit migrated from dotnet/corefx@b9010ec
Change from ComputeHash(byte[])=>byte[] to TryComputeHash(src, dest) to
reduce the number of allocations involved.
For iteration counts of 1000, 10000, and 100000 it shows a 15% reduction in time,
and almost entire elimination of GC (most of that 15%).
Fixes https://github.com/dotnet/corefx/issues/16925.