
Optimize HTTP2 HPack huffman decoding #43603

Merged

Conversation

rokonec
Member

@rokonec rokonec commented Oct 19, 2020

Optimized to use an 8-bit lookup table tree, resulting in about 0.35× the CPU utilization of the former version.
The decoding table is lazily generated as a ushort[].

#1506

After these changes go through review and receive final approval, I will create a corresponding pull request to aspnetcore including these changes and a new benchmark similar to the existing hpack benchmark.

@ghost

ghost commented Oct 19, 2020

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

- removing var where unclear
- removing outdated comments
- using a local variable as opposed to a static
@rokonec rokonec requested a review from stephentoub October 22, 2020 15:55
@rokonec rokonec self-assigned this Oct 22, 2020
@rokonec rokonec closed this Oct 26, 2020
@rokonec rokonec reopened this Oct 26, 2020
@rokonec
Member Author

rokonec commented Oct 26, 2020

Closed/reopened to pick up fixes from master - worked :-)

Member

@ManickaP ManickaP left a comment


At the end of GenerateDecodingLookupTree, will continue on Monday.

- renaming LUT
- swapping leaf flag meaning (MSB)
- typos
Member

@ManickaP ManickaP left a comment


In general LGTM. Some little comments and questions. Thank you!

BTW, this should go through ASP.NET folks before merging. Do you have a point of contact or should I invoke someone?

@Tratcher
Member

Tratcher commented Nov 9, 2020

BTW, this should go through ASP.NET folks before merging. Do you have a point of contact or should I invoke someone?

That's me 😁.

The decoding table is lazily generated as a ushort[].

You're trading disk size for startup time. How long in ms does it take to generate the table?

rokonec and others added 6 commits November 9, 2020 23:01
- fix shift count computation
- change acc size to 16 bits and append to it by byte
This partially reverts commit a09b4d3 (all but Huffman.cs)
Contributor

@scalablecory scalablecory left a comment


LGTM once outstanding comments are addressed. I'm not an expert in Huffman -- let's please make sure tests are exercising everything.

@geoffkizer
Contributor

geoffkizer commented Nov 12, 2020

The "jump table" algorithm is an optimization of the classic, naive algorithm for decoding bit-by-bit using a binary tree.

In the classic algorithm, you start at the root of the tree and process one bit at a time. If the bit is 0, you walk to the left; if it's 1 you walk to the right. When you hit a leaf, you emit the associated character and start over at the root.

Of course you wouldn't literally represent this as a tree in memory. Instead you'd assign each internal node in the tree a number, from 0 (root) to 255. Then you'd have a lookup table that takes (node#, bit) and maps it to (isLeaf, emittedChar, nextNode). Note that emittedChar is only valid when isLeaf is true, and nextNode is always 0 for leaf nodes.

The interesting observation here is that you don't need to walk the tree just one bit at a time. You can build a lookup table that takes multiple bits at a time. The best choice here seems to be 4, as it's easy to get 4 bits at a time from a byte, and it keeps the lookup table from getting too huge.

So, your lookup table is now (node#, 4bits) -> (isCharEmitted, emittedChar, nextNode). Note that isCharEmitted means that a character was emitted in walking any of those 4 bits -- and since all chars need at least 5 bits to encode, we know only one can be emitted per 4 bits. And note that nextNode now is independent of whether a character was emitted or not -- we could have emitted a character after the first bit, but nextNode will give the node that we ended up on after returning to the root and walking the next three bits.

So the inner loop here basically looks like:

  bits = getnext4bits();
  (isCharEmitted, emittedChar, currentNode) = lookuptable[currentNode, bits];
  if (isCharEmitted)
    output[index++] = emittedChar;

There's a little bit more complexity to handle invalid sequences (EOS) and the end of the encoding.

@rokonec
Member Author

rokonec commented Nov 12, 2020

I have carefully read your description, and I do not see a difference between the "jump table" algorithm and my algorithm.

Here is what my loop body does:

bits = getnext8bits();
(isCharEmitted, emittedChar, unusedBits, nextNode) = lookuptable[currentNode, bits];
if (isCharEmitted)
{
  output[index++] = emittedChar;
  keepBits(unusedBits); // not all of the 8 bits are used by the emitted char; keep them for the next code
  currentNode = rootNode;
}
else
{
  currentNode = nextNode;
}

Because we consume 8 bits at a time, it could well be that a particular lookup contains enough bits to emit two characters, for example the final bit of a 9-bit code followed by a complete 5-bit code. I tried to write a data structure that would allow multiple chars to be emitted, but not only was the decoding table 50% bigger in byte size, it was also slower in benchmarks due to the added complexity of supporting multiple emitted chars.
So in such a situation we simply keep the 7 bits from the last lookup/jump, add 1 bit from the source, reset to the root, and start over.

We have chosen 8 bits as it renders the best performance, about twice as fast as 4 bits, with an acceptable size of the decoding tree.

My code is much less readable than the pseudocode above, as I have tried to write it as high-perf code with these optimizations:

  • the 2D array flattened into a 1D array
  • (isCharEmitted, emittedChar, unusedBits, nextNode) encoded into a ushort
  • the final 7 bits processed separately to keep the main hot loop as simple as possible

These made it about 30% faster than the non-optimized version.

@rokonec
Member Author

rokonec commented Nov 16, 2020

Can we get the benchmark code added to the PR as well? This is useful for measuring future improvements.

@geoffkizer After this PR gets approved, I will create a corresponding PR to aspnetcore, as described here, also creating a new benchmark close to the existing hpack benchmark.

The expected benchmark will look like: https://gist.github.com/rokonec/4f63d464fd9a359add75056789e0f9e1

@geoffkizer
Contributor

I will create corresponding PR to aspnetcore

It would be good to have the benchmark code in this repo too. Let's figure out how to get it in here.

@geoffkizer
Contributor

@Tratcher can you review this as well?

@Tratcher Tratcher requested a review from halter73 November 19, 2020 19:47
Member

@Tratcher Tratcher left a comment


I don't know Huffman, but I don't see anything alarming and I'm content with the test coverage we have here.

I've added @halter73 in case he wants to take a look.

Member

@ManickaP ManickaP left a comment


Just a few technicalities around the benchmark project, otherwise LGTM.
Thanks for this perf improvement!

@ManickaP
Member

And please also update .gitignore, because I'm seeing System.Net.Http/tests/PerformanceTests/HPackHuffmanBenchmark/BenchmarkDotNet.Artifacts in uncommitted changes.

@ManickaP ManickaP merged commit 7171407 into dotnet:master Nov 27, 2020
@rokonec rokonec deleted the rokonec/1506-optimize-huffman-decoding-using-lut branch November 27, 2020 14:23
@ghost ghost locked as resolved and limited conversation to collaborators Dec 27, 2020
@karelz karelz added this to the 6.0.0 milestone Jan 26, 2021