Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make nvLogBase2 more efficient #177

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Make nvLogBase2 more efficient #177

wants to merge 3 commits into from

Conversation

DMaroo
Copy link

@DMaroo DMaroo commented May 14, 2022

The loop is more simple and easy to understand. Naive testing shows an improvement of 2x in speed.

In every single test, the new loop seems to perform better (tested using 100000000 random integers each). The median time of execution for the old loop on my machine is approximately 5.25 seconds, whereas the median time of execution for the new loop is 2.6 seconds.

@CLAassistant
Copy link

CLAassistant commented May 14, 2022

CLA assistant check
All committers have signed the CLA.

@matwey
Copy link

matwey commented May 14, 2022

Why don't use the builtin compiler intrinsics for this? There are special CPU instructions on most architectures to figure out most significant bit.

@DMaroo
Copy link
Author

DMaroo commented May 14, 2022

Yes, GCC provides __builtin_clz for leading zeroes. I don't know about Clang. Since the README states that both GCC and Clang toolchains are supported, it would not be a good idea to use GCC specific compiler intrinsics.

As for architecture specific instructions, again, the same problem of portability arises. We could use them if there was only a single target architecture, but that's not the case.

@matwey
Copy link

matwey commented May 14, 2022

clang and gcc usually both have the same set of builtins.

@mtijanic
Copy link
Collaborator

Hello and thank you for helping us optimize the driver.

We do already have portUtilCount{Leading,Trailing}Zeros{32,64}, as declared in nvport/util.h and generally included by any code also using this function. It would be better to replace all usages of this function with one of those.

These portXxx functions from nvport/ generally have an implementation for every architecture/OS/compiler that we support, often implemented by compiler intrinsics. Please try to use them instead of platform-specific options whenever possible.

(Yes, we are aware this is not the linux way, but in a shared codebase it is a necessary evil)

  * Implemented using compiler intrinsics and architecture specific
    instructions, so even faster
@DMaroo
Copy link
Author

DMaroo commented May 14, 2022

Hi! Replacing all the usages does not seem to maintain the expected behavior of nvLogBase2. It is expected that nvLogBase2 will fail the assert if 0 is passed to it, but portUtilCountTrailingZeros64 will just silently return 64. I suppose I will have to write a wrapper around nvLogBase2 everywhere it is used to preserve the behavior. I find changing the function definition to be a neater choice. I could convert nvLogBase2 to a macro though. Let me know if that is needed.

@mtijanic
Copy link
Collaborator

Good point, let's preserve the exact behavior just to be on the safe side. This looks good!
I'd make two more changes though:

  1. The comment is superfluous. The logic for splitting the asserts is good, and it is something we encourage generally, but there's no reason to add the comment explaining the pattern every time it is used.
  2. Just return portUtilCountTrailingZeros64(val); directly. The extra variable serves no purpose.

I'll start the process to merge this into the internal p4 repo.

@mtijanic mtijanic added the Implemented Fixed, in test prior to release integration label May 16, 2022
@mtijanic
Copy link
Collaborator

Merged into internal development branch. Unfortunately, due to our complex versioning and overlapping releases/QA, I don't know if it'll be available in the next release's code drop. So, if I merge the PR now, it might be undone by the next code drop.
I'll leave it open until I'm sure the next release will have it.

Thank you for the contribution!

@DonielMoins
Copy link

I wonder what the performance differences are, have you benchmarked this?

@DMaroo
Copy link
Author

DMaroo commented May 31, 2022

I am confident that using a builtin will be faster than a higher level loop. Compiler optimizations might lead to the same assembly being generated. Also, this might not be hot code, it also might not be reached very often. But in any case, implementing the function is redundant, since we already have a builtin for it (which is as fast as it can get).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Implemented Fixed, in test prior to release integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants