BigInteger performance improvements #41495
Comments
Tagging subscribers to this area: @tannergooding, @pgovind
I'm generally happy to take perf PRs, provided all tests pass and numbers look good (both local and those on the perf lab hardware).
@adamsitnik please take a look. It seems that the issue can be closed with #83457
It looks like #83457 has included the
Yes, it needs to be clarified.
The BigInteger class is fast already, but after doing a bit-twiddling kata/review of the source code I realized there were some further improvements to be made. Because of the ongoing effort to spanify BigInteger (#35565), I will leave the proposed code changes and their rationales below, with performance measurements given as deltas from the existing code. I see no reason the same shouldn't apply to the spanified version as well. The code is spanified for later convenience.
All code can be found in the BigIntegerCalculator.* files.
If the proposed changes are accepted (with some further refinements, of course) I am willing to do the implementation on top of the spanification once it is merged. Further benchmarking also needs to be done to ensure this is not a "my machine" thing.
Configuration
.NET 5.0 (from repository, commit 3ac735b)
Windows 10 19041 x64
Intel i7-8700K (Unclocked for the benchmark runs)
Add
Add should impose a check for carry == 0 in the last loop of both the half trivial case and the "full add":
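A minimal sketch of that carry check, written in C rather than the C# of BigIntegerCalculator so it can be compiled stand-alone. The name add_limbs, the parameter layout, and the assumption that left_len >= right_len and that bits holds left_len + 1 limbs are illustrative, not the actual runtime code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: bits = left + right, little-endian 32-bit limbs.
   Assumes left_len >= right_len and bits has room for left_len + 1 limbs. */
static void add_limbs(const uint32_t *left, size_t left_len,
                      const uint32_t *right, size_t right_len,
                      uint32_t *bits)
{
    size_t i = 0;
    uint64_t carry = 0;

    /* Limbs present in both operands. */
    for (; i < right_len; i++) {
        uint64_t digit = (uint64_t)left[i] + right[i] + carry;
        bits[i] = (uint32_t)digit;
        carry = digit >> 32;
    }

    /* Tail of the longer operand: stop as soon as the carry dies out.
       The non-short-circuit & combines both tests into one condition. */
    for (; (carry != 0) & (i < left_len); i++) {
        uint64_t digit = (uint64_t)left[i] + carry;
        bits[i] = (uint32_t)digit;
        carry = digit >> 32;
    }

    /* Whatever is left needs no arithmetic, only a block copy. */
    if (i < left_len) {
        memmove(bits + i, left + i, (left_len - i) * sizeof(uint32_t));
    }
    bits[left_len] = (uint32_t)carry;
}
```

Both operands of & are side-effect free here, so evaluating them unconditionally is safe. Whether a C compiler (or the .NET JIT) actually emits a single branch for it has to be checked in the disassembly.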
Note the use of a single &. This is to keep a single branch instruction. The JIT seems to fold this nicely in the x64 disassembly (it is one more instruction). The single & could also be used in AddSelf. In general, the carry will very quickly reach zero, so this should enable big + small additions to be mainly a memmove. For some reason the change also gave better performance in cases where I did not expect it. The results are repeatable with new baselines as well.
Subtract
The same as for Add. The carry will tend toward zero.
The results are repeatable.
Divide
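For orientation, here is one way the per-limb borrow in a SubtractDivisor-style loop can be computed without a data-dependent branch, using 64-bit arithmetic. This is a C sketch under stated assumptions, not necessarily the exact change proposed in this issue: the name subtract_divisor and its signature are hypothetical, and q is assumed to fit in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: left -= right * q over right_len limbs.
   Returns the final carry/borrow out of the top limb.
   Assumes q < 2^32 so the 64-bit accumulator cannot overflow. */
static uint32_t subtract_divisor(uint32_t *left,
                                 const uint32_t *right, size_t right_len,
                                 uint64_t q)
{
    uint64_t carry = 0;

    for (size_t i = 0; i < right_len; i++) {
        carry += (uint64_t)right[i] * q;
        uint32_t digit = (uint32_t)carry;
        carry >>= 32;

        /* 64-bit subtraction wraps when left[i] < digit, which sets
           the top bit; that bit is the borrow, so no branch is needed. */
        uint64_t diff = (uint64_t)left[i] - digit;
        carry += diff >> 63;
        left[i] = (uint32_t)diff;
    }
    return (uint32_t)carry;
}
```

The branchy form would instead test left[i] < digit and increment carry. The trade-off is an extra 64-bit subtract and shift per limb, which is why 32-bit targets need separate measurement.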
This is where it gets wild. Changing SubtractDivisor accordingly removes a TON of branch mispredictions. Measurements are needed on 32-bit machines due to the 64-bit operations.
The results are a bit varied on this one, surprisingly. I do not fully understand why the first case degrades so much. It did have a much lower misprediction rate, so there could be further optimizations to be done, perhaps by checking the value of q (for example, single-binary-digit q values give a non-uniform distribution under truncated multiplication with random left and right, i.e. good for the current algorithm) and selecting either the current code or the proposed one. On average I would expect the proposed code to perform better, as the ratio of true to false branches is fairly close to 1 on average (slightly weighted toward branch taken).
Multiply
The last one I have is uncertain. It has to do with loop unrolling in the trivial case of Multiply. This enables the use of dirty result buffers. The effects can only be seen in PowMod (with the zeroing of the BitBuffers removed). However, there are some potential negative effects due to increased code size (double the instruction cache misses? Hard to say, processors are good at predictively streaming into them these days).
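One plausible reading of the Multiply idea, sketched in C: peel the first pass of the schoolbook loop so that it assigns to the result instead of accumulating into it. Every limb later passes read has then already been written, and the result buffer does not have to be zeroed first. The name multiply_limbs and its signature are hypothetical, and this is an interpretation of the proposal rather than the issue's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bits = left * right, little-endian 32-bit limbs.
   Assumes right_len >= 1 and bits has left_len + right_len limbs.
   bits may contain arbitrary garbage on entry. */
static void multiply_limbs(const uint32_t *left, size_t left_len,
                           const uint32_t *right, size_t right_len,
                           uint32_t *bits)
{
    uint64_t carry = 0;

    /* Peeled first pass (i = 0): plain stores, no reads from bits. */
    for (size_t j = 0; j < left_len; j++) {
        uint64_t digits = (uint64_t)left[j] * right[0] + carry;
        bits[j] = (uint32_t)digits;
        carry = digits >> 32;
    }
    bits[left_len] = (uint32_t)carry;

    /* Remaining passes accumulate into limbs that are already valid. */
    for (size_t i = 1; i < right_len; i++) {
        carry = 0;
        for (size_t j = 0; j < left_len; j++) {
            uint64_t digits = bits[i + j] + carry
                            + (uint64_t)left[j] * right[i];
            bits[i + j] = (uint32_t)digits;
            carry = digits >> 32;
        }
        bits[i + left_len] = (uint32_t)carry;
    }
}
```

The cost is a second copy of the inner loop body, which is the code-size concern raised above.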