-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable avx512 support for base64 encoding. Reuse WojciechMula/base64-… #102
Conversation
Ordinarily, you would not benefit from the memory cache. That means that depending on your L3 cache size 10MB or 100MB would be the buffer size you need for the benchmark. |
Thanks for the pull request. I'll try to comment on it in the next few days. I like the idea of adding an AVX512 codec if @WojciechMula's code is under a compatible license. But there are a few things that I don't like in this pull request, such as copypasting code from Chromium and not updating the user documentation. I'll try to respond more substantially in the coming days. |
Hi htot, thank you for your prompt comments. I did not get your point exactly. Could you please kindly give more hints? Thanks! |
The benchmark repeatedly encodes the same array of data. If the array fits in your (L3) cache you measure very high speeds that you would normally not see when encoding a single image once. I think most of our benchmarks were done with 10MB, however I now have a system with 16MB cache. |
L3 cache of each of CPU socket is 48MB (2 sockets are 96MB in total). Are you suggesting I should enlarge the benchmark data size bigger than 48MB/96MB, or just measure the first round performance to avoid cache benefit? I did not get your point why it would be a problem when benchmark data size is smaller than cache size. I thought L3 cache cannot make two algorithem with the same performance if the algorithms have performance gaps in nature. |
That's a lot of cache :-) I was lazy and just added a benchmark with 100MB. I'm note sure how the split L3 cache works. But in principle I would say that caching code but not the data is fair. Caching data can be useful to compare the speed of the algorithms excluding time to access memory. |
Hi @aklomp , I updated my patch and accepted your review comments of "adding readme and removing chromium code". Could you please have a review? Thanks! After it is landed it will be merged back into Node.js repo. |
Hi @aklomp , I'm wondering if you have bandwidth to review this PR. Thanks! |
@lucshi Sorry for the delay, I'm still here! Your pull request is near the top of my mental to-do list of things to process (along with #105), but I've been really busy the past weeks and haven't found a moment to deeply engage with it. I do plan on giving it a closer look soon. If you feel that I'm taking too long, don't hesitate to reach out. |
Thank you @aklomp for the response. I would like to make it landed if possible very soon to catch up with the Node.js upstream optimization. Do you require me to write some notes about the PR to make the review faster? |
} | ||
|
||
static inline void | ||
dec_loop_avx2 (const uint8_t **s, size_t *slen, uint8_t **o, size_t *olen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo? Should be dec_loop_avx512
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This PR is only for the encoding part for AVX512 because Node.js only depends on base64 SIMD encoding. In general Base64 decoding cannot be vectorized when there are space chars in input. To not break your project in general, I reuse the AVX2 for decoding part in my PR.
@lucshi The main question I had was whether WojciechMula's code was under a compatible licence, but apparently it is (3-clause BSD). Before merging, I also want to do a thorough code review to ensure that the code conforms to the library's general structure. It's this code review that I'm blocked on, but I'll try to put it on the fast path. I do feel bad for making you wait this long for a proper PR review. I really do appreciate the effort that you put in this, and it's definitely a very nice enhancement. |
I'm totally OK for a careful code review. Per my understanding, BSD-3 is a most popular license nowadays and the code can be used freely in BSD-2 project as long as including the copyright text and the authors and projects names not be used in market promotion. I'll also start the IP scan for the code base along with my PR by Protex IP Scanning tool to see if any other IP issues. Will update the results soon. |
There may be other IP related revise, but the IP scanning tool Protex did not found IP conflictions. |
Hey, I've informally reviewed the updated PR and I think it looks good on the whole. It's something I think can be merged, but it needs some work to clean it up. I'm volunteering to do that; more on that below. The main sticking points regarding the code I have are:
I'm giving you this short list instead of a full review because I don't want to come across as mean by pointing out boring details like trailing whitespace in overlong detail. I'm saving us the effort. But if you want though, I can certainly do a full review. For the way forward, what I had in mind was that I would take your PR and rework it to fix the issues above, retaining you as the author, and then come back to you for permission to merge that new branch. I think it's faster and less frustrating (and less work for you!) to do it that way. Of course, if you prefer, we can also do it classically and I'll be the reviewer/oracle while you push corrections. What do you say? Then there are two ephemeral issues that should also be mentioned:
(Sorry for accidentally closing this issue, I clicked the wrong button!) |
I pushed a branch called I also set up Intel SDE locally to test the |
Hi @aklomp , Thank you for the quick review comments. I'm OK to go with the fastest way that you do the fixes and made me one of the authors. When everything is done, I will submit a PR on node.js and merge updated code. |
Again sorry for the delay. I intend to refactor the Thanks for checking about the license. I agree with the proposed solution of putting the BSD-3 license text in the file with the imported code; that should indeed satisfy the license requirements. I also created an issue, #106, to discuss moving this project from BSD-2 to BSD-3. It's a very small change that I think is for the better, and it will allow this project to import code from more locations. The one remaining question is if it would impact downstream users of this library, but I don't really think so because this library is generally included as a source bundle complete with license file. |
Regarding the license issue, I noticed that @WojciechMula has also published his AVX512 encoder in his base64simd project, specificially here. That library is under the BSD-2 license. That simplifies things, all we need to do is update the copyright year in Muła's mention in the |
Great! |
Alright, I pushed an Since your name is on these commits, I have to ask you: are you OK if I merge this? |
I'm OK to merge. Thank you! |
Add AVX512 support for base64 encoding based on WojciechMula's code.
aklomp code base has been borrowed by Node.js upstream and I would like to enable avx512 support which can improve base64 encoding performance by additional +12% for node.js base64 on. The measurement of aklomp benchmark result also showed performance improvement for <100KB data, compared with AVX2. Improvements are around 36%-86%.
Above results were collected on Icelake cpu which is widely used in Amazon AWS EC2 cloud instances.