Make `nvLogBase2` more efficient #177

DMaroo · 2022-05-14T10:59:07Z

The loop is simpler and easier to understand. Naive testing shows roughly a 2x speedup.

In every test run, the new loop performed better (each run used 100000000 random integers). On my machine, the median execution time is approximately 5.25 seconds for the old loop and 2.6 seconds for the new loop.

CLAassistant · 2022-05-14T10:59:17Z

All committers have signed the CLA.

matwey · 2022-05-14T11:50:17Z

Why not use the builtin compiler intrinsics for this? Most architectures have special CPU instructions for finding the most significant bit.

DMaroo · 2022-05-14T12:29:54Z

Yes, GCC provides `__builtin_clz` for counting leading zeroes. I don't know about Clang. Since the README states that both GCC and Clang toolchains are supported, it would not be a good idea to use GCC-specific compiler intrinsics.

As for architecture-specific instructions, the same portability problem arises. We could use them if there were only a single target architecture, but that's not the case.

matwey · 2022-05-14T13:00:49Z

Clang and GCC usually both provide the same set of builtins.
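
To make the portability point concrete, here is a minimal sketch of a count-trailing-zeros helper that uses the builtin where GCC or Clang is detected and falls back to a plain loop elsewhere. The name `count_trailing_zeros64` is hypothetical and is not taken from the driver; the zero case is handled explicitly because `__builtin_ctzll(0)` is undefined.

```c
#include <stdint.h>

/* Hypothetical helper for illustration; not part of the driver. */
static inline uint32_t count_trailing_zeros64(uint64_t v)
{
    /* The builtin is undefined for 0, so pick a defined result here. */
    if (v == 0)
        return 64;
#if defined(__GNUC__) || defined(__clang__)
    /* Available under both GCC and Clang. */
    return (uint32_t)__builtin_ctzll(v);
#else
    /* Portable fallback: shift until the lowest set bit reaches bit 0. */
    uint32_t n = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
#endif
}
```

Both branches return the same values, so callers do not need to know which toolchain built the code.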

mtijanic · 2022-05-14T20:43:45Z

Hello and thank you for helping us optimize the driver.

We do already have portUtilCount{Leading,Trailing}Zeros{32,64}, as declared in nvport/util.h and generally included by any code also using this function. It would be better to replace all usages of this function with one of those.

These portXxx functions from nvport/ generally have an implementation for every architecture/OS/compiler that we support, often implemented by compiler intrinsics. Please try to use them instead of platform-specific options whenever possible.

(Yes, we are aware this is not the Linux way, but in a shared codebase it is a necessary evil.)


DMaroo · 2022-05-14T22:18:44Z

Hi! Replacing all the usages does not seem to maintain the expected behavior of nvLogBase2. It is expected that nvLogBase2 will fail the assert if 0 is passed to it, but portUtilCountTrailingZeros64 will just silently return 64. I suppose I will have to write a wrapper around nvLogBase2 everywhere it is used to preserve the behavior. I find changing the function definition to be a neater choice. I could convert nvLogBase2 to a macro though. Let me know if that is needed.

mtijanic · 2022-05-15T13:46:56Z

Good point, let's preserve the exact behavior just to be on the safe side. This looks good!
I'd make two more changes though:

* The comment is superfluous. The logic for splitting the asserts is good, and it is something we encourage generally, but there's no reason to add the comment explaining the pattern every time it is used.
* Just `return portUtilCountTrailingZeros64(val);` directly. The extra variable serves no purpose.

I'll start the process to merge this into the internal p4 repo.
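
For readers following along, the shape discussed above might look roughly like the sketch below. It is not the driver's code: `log_base2_pow2` is a hypothetical name, `ctz64_stand_in` is a stand-in for `portUtilCountTrailingZeros64` that mimics the "returns 64 for 0" behavior described in this thread, and the power-of-two check is an assumption about what the second assert tests, since the diff is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for portUtilCountTrailingZeros64 (assumed behavior: 64 for 0). */
static uint32_t ctz64_stand_in(uint64_t v)
{
    uint32_t n = 0;
    if (v == 0)
        return 64;
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical wrapper; the name and the second assert are assumptions. */
static uint32_t log_base2_pow2(uint64_t val)
{
    /* Two separate asserts, so a failure reports which condition broke. */
    assert(val != 0);
    assert((val & (val - 1)) == 0); /* exactly one bit set */

    /* Return directly, without an intermediate local variable. */
    return ctz64_stand_in(val);
}
```

Keeping the asserts in the wrapper preserves the failure on zero that a bare trailing-zero count would silently turn into 64.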

mtijanic · 2022-05-17T08:36:14Z

Merged into internal development branch. Unfortunately, due to our complex versioning and overlapping releases/QA, I don't know if it'll be available in the next release's code drop. So, if I merge the PR now, it might be undone by the next code drop.
I'll leave it open until I'm sure the next release will have it.

Thank you for the contribution!

DonielMoins · 2022-05-29T23:12:50Z

I wonder what the performance differences are. Have you benchmarked this?

DMaroo · 2022-05-31T15:12:19Z

I am confident that using a builtin will be faster than a higher-level loop. Compiler optimizations might lead to the same assembly being generated. Also, this might not be hot code and might not be reached very often. But in any case, implementing the function by hand is redundant, since we already have a builtin for it (which is as fast as it can get).
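
Whatever the timing difference turns out to be, the two approaches should at least agree on their results. The sketch below checks that for every power of two that fits in 64 bits. Both functions are illustrative: `log2_loop` is a generic shift loop, not the loop from this PR, and `log2_ctz` assumes a GCC or Clang builtin with a plain-loop fallback for other compilers.

```c
#include <stdint.h>

/* Illustrative shift loop; not the code from this PR. */
static uint32_t log2_loop(uint64_t val)
{
    uint32_t n = 0;
    while (val > 1) {
        val >>= 1;
        n++;
    }
    return n;
}

/* Trailing-zero count; equals log2 only when val is a power of two. */
static uint32_t log2_ctz(uint64_t val)
{
#if defined(__GNUC__) || defined(__clang__)
    return (uint32_t)__builtin_ctzll(val); /* val must be nonzero */
#else
    uint32_t n = 0;
    while ((val & 1u) == 0) {
        val >>= 1;
        n++;
    }
    return n;
#endif
}

/* Returns 1 if both versions give i for every input 2^i, 0 <= i < 64. */
static int implementations_agree(void)
{
    for (uint32_t i = 0; i < 64; i++) {
        uint64_t v = (uint64_t)1 << i;
        if (log2_loop(v) != i || log2_ctz(v) != i)
            return 0;
    }
    return 1;
}
```

A check like this says nothing about speed; it only confirms that swapping one implementation for the other does not change results for valid inputs.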

Make nvLogBase2 more efficient

8b24fa5

Use portUtilCountTrailingZeros64 from nvport

cf8d063

* Implemented using compiler intrinsics and architecture specific instructions, so even faster

Remove unnecessary comment and local variable

8ad52df

mtijanic added the "Implemented" label (Fixed, in test prior to release integration) on May 16, 2022
