Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable avx512 support for base64 encoding. Reuse WojciechMula/base64-… #102

Closed
wants to merge 4 commits into from

Conversation

lucshi
Copy link

@lucshi lucshi commented Aug 18, 2022

Add AVX512 support for base64 encoding based on WojciechMula's code.

aklomp code base has been borrowed by Node.js upstream and I would like to enable avx512 support which can improve base64 encoding performance by additional +12% for node.js base64 on. The measurement of aklomp benchmark result also showed performance improvement for <100KB data, compared with AVX2. Improvements are around 36%-86%.

Testing with buffer size 100 KB, fastest of 10 * 100
AVX2    encode  17238.22 MB/sec
AVX2    decode  16882.58 MB/sec
AVX512  encode  23613.74 MB/sec
AVX512  decode  16260.03 MB/sec
Testing with buffer size 10 KB, fastest of 100 * 100
AVX2    encode  16852.97 MB/sec
AVX2    decode  16605.95 MB/sec
AVX512  encode  31398.70 MB/sec
AVX512  decode  16316.29 MB/sec
Testing with buffer size 1 KB, fastest of 100 * 1000
AVX2    encode  11344.30 MB/sec
AVX2    decode  10261.57 MB/sec
AVX512  encode  19500.83 MB/sec
AVX512  decode  10841.66 MB/sec

Above results were collected on Icelake cpu which is widely used in Amazon AWS EC2 cloud instances.

@htot
Copy link
Contributor

htot commented Aug 18, 2022

Ordinarily, you would not benefit from the memory cache. That means that depending on your L3 cache size 10MB or 100MB would be the buffer size you need for the benchmark.

@aklomp
Copy link
Owner

aklomp commented Aug 18, 2022

Thanks for the pull request. I'll try to comment on it in the next few days. I like the idea of adding an AVX512 codec if @WojciechMula's code is under a compatible license. But there are a few things that I don't like in this pull request, such as copypasting code from Chromium and not updating the user documentation. I'll try to respond more substantially in the coming days.

@lucshi
Copy link
Author

lucshi commented Aug 19, 2022

Hi htot, thank you for your prompt comments. I did not get your point exactly. Could you please kindly give more hints? Thanks!

@htot
Copy link
Contributor

htot commented Aug 20, 2022

Hi htot, thank you for your prompt comments. I did not get your point exactly. Could you please kindly give more hints? Thanks!

The benchmark repeatedly encodes the same array of data. If the array fits in your (L3) cache you measure very high speeds that you would normally not see when encoding a single image once. I think most of our benchmarks were done with 10MB, however I now have a system with 16MB cache.

@lucshi
Copy link
Author

lucshi commented Aug 21, 2022

Hi htot, thank you for your prompt comments. I did not get your point exactly. Could you please kindly give more hints? Thanks!

The benchmark repeatedly encodes the same array of data. If the array fits in your (L3) cache you measure very high speeds that you would normally not see when encoding a single image once. I think most of our benchmarks were done with 10MB, however I now have a system with 16MB cache.

L3 cache of each of CPU socket is 48MB (2 sockets are 96MB in total). Are you suggesting I should enlarge the benchmark data size bigger than 48MB/96MB, or just measure the first round performance to avoid cache benefit? I did not get your point why it would be a problem when benchmark data size is smaller than cache size. I thought L3 cache cannot make two algorithem with the same performance if the algorithms have performance gaps in nature.

@htot
Copy link
Contributor

htot commented Aug 21, 2022

That's a lot of cache :-) I was lazy and just added a benchmark with 100MB. I'm note sure how the split L3 cache works. But in principle I would say that caching code but not the data is fair. Caching data can be useful to compare the speed of the algorithms excluding time to access memory.

@lucshi
Copy link
Author

lucshi commented Sep 29, 2022

Hi @aklomp , I updated my patch and accepted your review comments of "adding readme and removing chromium code". Could you please have a review? Thanks! After it is landed it will be merged back into Node.js repo.

@lucshi
Copy link
Author

lucshi commented Oct 9, 2022

Hi @aklomp , I'm wondering if you have bandwidth to review this PR. Thanks!

@aklomp
Copy link
Owner

aklomp commented Oct 9, 2022

@lucshi Sorry for the delay, I'm still here! Your pull request is near the top of my mental to-do list of things to process (along with #105), but I've been really busy the past weeks and haven't found a moment to deeply engage with it. I do plan on giving it a closer look soon. If you feel that I'm taking too long, don't hesitate to reach out.

@lucshi
Copy link
Author

lucshi commented Oct 9, 2022

@lucshi Sorry for the delay, I'm still here! Your pull request is near the top of my mental to-do list of things to process (along with #105), but I've been really busy the past weeks and haven't found a moment to deeply engage with it. I do plan on giving it a closer look soon. If you feel that I'm taking too long, don't hesitate to reach out.

Thank you @aklomp for the response. I would like to make it landed if possible very soon to catch up with the Node.js upstream optimization. Do you require me to write some notes about the PR to make the review faster?

}

static inline void
dec_loop_avx2 (const uint8_t **s, size_t *slen, uint8_t **o, size_t *olen)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? Should be dec_loop_avx512

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This PR is only for the encoding part for AVX512 because Node.js only depends on base64 SIMD encoding. In general Base64 decoding cannot be vectorized when there are space chars in input. To not break your project in general, I reuse the AVX2 for decoding part in my PR.

@aklomp
Copy link
Owner

aklomp commented Oct 9, 2022

@lucshi The main question I had was whether WojciechMula's code was under a compatible licence, but apparently it is (3-clause BSD).

Before merging, I also want to do a thorough code review to ensure that the code conforms to the library's general structure. It's this code review that I'm blocked on, but I'll try to put it on the fast path. I do feel bad for making you wait this long for a proper PR review. I really do appreciate the effort that you put in this, and it's definitely a very nice enhancement.

@lucshi
Copy link
Author

lucshi commented Oct 9, 2022

@lucshi The main question I had was whether WojciechMula's code was under a compatible licence, but apparently it is (3-clause BSD).

Before merging, I also want to do a thorough code review to ensure that the code conforms to the library's general structure. It's this code review that I'm blocked on, but I'll try to put it on the fast path. I do feel bad for making you wait this long for a proper PR review. I really do appreciate the effort that you put in this, and it's definitely a very nice enhancement.

I'm totally OK for a careful code review. Per my understanding, BSD-3 is a most popular license nowadays and the code can be used freely in BSD-2 project as long as including the copyright text and the authors and projects names not be used in market promotion.

I'll also start the IP scan for the code base along with my PR by Protex IP Scanning tool to see if any other IP issues. Will update the results soon.

@lucshi
Copy link
Author

lucshi commented Oct 9, 2022

There may be other IP related revise, but the IP scanning tool Protex did not found IP conflictions.

@aklomp
Copy link
Owner

aklomp commented Oct 11, 2022

Hey, I've informally reviewed the updated PR and I think it looks good on the whole. It's something I think can be merged, but it needs some work to clean it up. I'm volunteering to do that; more on that below. The main sticking points regarding the code I have are:

  • The commit history is not nice, clean, composable and atomic. The first commit makes all kinds of changes, the next commit undoes a whole bunch of stuff. The commits are also either way too large or way too small. IMO, a patchset should tell a story and introduce (or remove) bits of functionality in a way that's atomic. A good split here would be "add codec", then "add cmake and makefile support for avx512", then "update the test suite", then "update the readme", or something along those lines.
  • There are a lot of code style errors: trailing whitespace, indents with spaces instead of tabs, overlong lines, overlong commit messages, inconsistent code formatting, and so on.
  • There's code duplication: the AVX2 decoder is copypasted. I understand the reason why, but it should be changed before merging. The "nicer" way of doing the same thing is to #include "../avx2/dec_loop.c" in lib/arch/avx512/codec.c.
  • There are some minor things missing, such as adding Daniel Lemire and Wojciech Muła to the copyright section of the LICENSE file.

I'm giving you this short list instead of a full review because I don't want to come across as mean by pointing out boring details like trailing whitespace in overlong detail. I'm saving us the effort. But if you want though, I can certainly do a full review.

For the way forward, what I had in mind was that I would take your PR and rework it to fix the issues above, retaining you as the author, and then come back to you for permission to merge that new branch. I think it's faster and less frustrating (and less work for you!) to do it that way. Of course, if you prefer, we can also do it classically and I'll be the reviewer/oracle while you push corrections. What do you say?

Then there are two ephemeral issues that should also be mentioned:

  • Muła's code is under BSD3, my code is under BSD2. Are they compatible? I'm not a license expert, but BSD3 is less permissive than BSD2, so I would guess that they might not be compatible. Any thoughts on this?
  • CI is currently failing in a number of places, but that's due to the issue mentioned in benchmarks on Atom crashes on AVX2 #77. It's not something that I can fix now without rearchitecting the library, so I suggest we leave it as-is.

(Sorry for accidentally closing this issue, I clicked the wrong button!)

@aklomp
Copy link
Owner

aklomp commented Oct 11, 2022

I pushed a branch called avx512 which squashes your work on top of master and makes some cleanups/fixes. I didn't break this up into smaller atomic commits yet, but it gives you an idea of the direction I'm moving in.

I also set up Intel SDE locally to test the AVX512 code.. very good for building confidence in the code! 😅

@lucshi
Copy link
Author

lucshi commented Oct 12, 2022

Hi @aklomp , Thank you for the quick review comments. I'm OK to go with the fastest way that you do the fixes and made me one of the authors.

When everything is done, I will submit a PR on node.js and merge updated code.

@lucshi
Copy link
Author

lucshi commented Oct 12, 2022

For the license issue. I have consulted the license expert in my company for the process of adding a BSD3 source file avx512.c into a BSD2 project.

The answer I got as below:
image

image

My understanding is that the bottom line is that you make sure the bsd-3 license text is in the avx512.c file.

@aklomp
Copy link
Owner

aklomp commented Oct 17, 2022

Again sorry for the delay. I intend to refactor the avx512 branch soon and merge it, so that we can both move on :)

Thanks for checking about the license. I agree with the proposed solution of putting the BSD-3 license text in the file with the imported code; that should indeed satisfy the license requirements.

I also created an issue, #106, to discuss moving this project from BSD-2 to BSD-3. It's a very small change that I think is for the better, and it will allow this project to import code from more locations. The one remaining question is if it would impact downstream users of this library, but I don't really think so because this library is generally included as a source bundle complete with license file.

@aklomp
Copy link
Owner

aklomp commented Oct 18, 2022

Regarding the license issue, I noticed that @WojciechMula has also published his AVX512 encoder in his base64simd project, specificially here. That library is under the BSD-2 license. That simplifies things, all we need to do is update the copyright year in Muła's mention in the LICENSE.

@lucshi
Copy link
Author

lucshi commented Oct 19, 2022

Regarding the license issue, I noticed that @WojciechMula has also published his AVX512 encoder in his base64simd project, specificially here. That library is under the BSD-2 license. That simplifies things, all we need to do is update the copyright year in Muła's mention in the LICENSE.

Great!

@aklomp
Copy link
Owner

aklomp commented Oct 19, 2022

Alright, I pushed an avx512 branch that contains four commits, crediting you as the author, that implement the AVX512 encoder. The branch contains a number of relatively minor fixes I made to your PR; you can review them with a git diff.

Since your name is on these commits, I have to ask you: are you OK if I merge this?

@lucshi
Copy link
Author

lucshi commented Oct 20, 2022

I'm OK to merge. Thank you!

@aklomp aklomp closed this in 6b1a8b8 Oct 20, 2022
@aklomp aklomp mentioned this pull request Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants