Make nvLogBase2 more efficient #177
base: main
Why not use the compiler's builtin intrinsics for this? Most architectures have a dedicated CPU instruction for finding the most significant set bit.
Yes, GCC provides As for architecture-specific instructions, again, the same problem of portability arises. We could use them if there were only a single target architecture, but that's not the case.
Clang and GCC usually both have the same set of builtins.
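To make the builtin-versus-portability trade-off discussed above concrete, here is a minimal sketch of a floor-log2 helper that uses `__builtin_clz` when compiled with GCC or Clang and falls back to a plain shift loop elsewhere. The name `log2_floor_u32` and the choice to return 0 for an input of 0 are assumptions made for illustration; this is not the driver's `nvLogBase2` and does not claim to match its behavior. It also assumes `unsigned int` is at least 32 bits wide.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper (not the driver's nvLogBase2): floor(log2(v)).
 * Returning 0 for v == 0 is an illustrative convention only. */
static inline unsigned log2_floor_u32(uint32_t v)
{
    if (v == 0)
        return 0; /* __builtin_clz(0) is undefined, so guard explicitly */
#if defined(__GNUC__) || defined(__clang__)
    /* Index of the highest set bit = (bit width - 1) - leading zero count. */
    return (unsigned)(sizeof(unsigned int) * CHAR_BIT - 1)
           - (unsigned)__builtin_clz((unsigned int)v);
#else
    /* Portable fallback: count the right shifts needed to empty the value. */
    unsigned r = 0;
    while (v >>= 1)
        r++;
    return r;
#endif
}
```

With this shape, callers see one function regardless of toolchain, and the compiler-specific path stays confined to a single preprocessor branch.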
Hello and thank you for helping us optimize the driver. We do already have These (Yes, we are aware this is not the Linux way, but in a shared codebase it is a necessary evil)
* Implemented using compiler intrinsics and architecture-specific instructions, so even faster
Hi! Replacing all the usages does not seem to maintain the expected behavior of
Good point, let's preserve the exact behavior just to be on the safe side. This looks good!
I'll start the process to merge this into the internal p4 repo.
Merged into internal development branch. Unfortunately, due to our complex versioning and overlapping releases/QA, I don't know if it'll be available in the next release's code drop. So, if I merge the PR now, it might be undone by the next code drop. Thank you for the contribution!
I wonder what the performance difference is. Have you benchmarked this?
I am confident that a builtin will be faster than a higher-level loop, although compiler optimizations might lead to the same assembly being generated. This also might not be hot code, and it might not be reached very often. In any case, implementing the function by hand is redundant, since we already have a builtin for it (which is as fast as it can get).
The loop is simpler and easier to understand. Naive testing shows a speedup of about 2x.
In every test, the new loop performed better (each run used 100,000,000 random integers). The median execution time on my machine is approximately 5.25 seconds for the old loop and 2.6 seconds for the new loop.
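A naive timing comparison of the kind described above could be set up roughly as follows. This is only a sketch: `log2_shift_loop` and `log2_binary_steps` are hypothetical stand-ins, not the old and new loops from this PR, and `xorshift32` and `time_log2` are illustrative helpers. The measured time includes the cost of generating the inputs, which is the same for both candidates.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical candidate A: count the right shifts needed to empty the value. */
static unsigned log2_shift_loop(uint32_t v)
{
    unsigned r = 0;
    while (v >>= 1)
        r++;
    return r;
}

/* Hypothetical candidate B: binary search over the bit positions. */
static unsigned log2_binary_steps(uint32_t v)
{
    unsigned r = 0;
    if (v >> 16) { v >>= 16; r += 16; }
    if (v >> 8)  { v >>= 8;  r += 8; }
    if (v >> 4)  { v >>= 4;  r += 4; }
    if (v >> 2)  { v >>= 2;  r += 2; }
    if (v >> 1)  { r += 1; }
    return r;
}

/* Small xorshift32 generator so that every run sees the same inputs. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Times n calls of fn on pseudo-random inputs and returns CPU seconds.
 * The checksum is written out so the compiler cannot discard the calls. */
static double time_log2(unsigned (*fn)(uint32_t), size_t n, uint64_t *checksum)
{
    uint32_t state = 2463534242u; /* arbitrary non-zero seed */
    uint64_t sum = 0;
    clock_t start = clock();
    for (size_t i = 0; i < n; i++)
        sum += fn(xorshift32(&state));
    clock_t end = clock();
    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_log2` once per candidate with the same `n` and comparing the two checksums gives a basic sanity check that both implementations agree before their timings are compared. Repeating the measurement several times and taking the median, as done above, reduces noise from the rest of the system.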