BigInteger performance improvements #41495

Grevor · 2020-08-28T05:31:42Z

The BigInteger class is fast already, but after doing a bit-twiddling kata/review on the source code I realized there were some further improvements to be made. Due to the ongoing effort to spanify the BigInteger #35565 I will leave the proposed code changes and their rationales below, with performance measurements as delta from existing code. I see no reason the same shouldn't apply for the spanified version as well. The code is spanified for later convenience.

All code can be found in the BigIntegerCalculator.* files.

If the proposed changes are accepted (with some further refinements, of course) I am willing to do the implementation on top of the spanification once it is merged. Further benchmarking also needs to be done to ensure this is not a "my machine" thing.

Configuration

.NET 5.0 (from repository, commit 3ac735b)
Windows 10 19041 x64
Intel i7-8700K (Unclocked for the benchmark runs)

Add

Add should impose a check for carry == 0 in the last loop of both the half trivial case and the "full add":

            int i = 0;
            for ( ; carry != 0 & i < left.Length; i++)
            {
                long digit = left[i] + carry;
                bits[i] = unchecked((uint)digit);
                carry = digit >> 32;
            }
            left.Slice(i).CopyTo(bits.Slice(i));
            bits[left.Length] = (uint)carry;

Note the use of single &. This is to keep the single branch instructions. The JIT seem to fold this nicely in the x64 disassembly (It's one more instruction). The single & could also be used in AddSelf. In general, the carry will very quickly reach zero, so this should enable big + small additions to mainly be memmove. For some reason the change also gave better performance on cases I did not expect. The results are repeatable with new baselines as well.

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536,65536 bits)      |      1.14 |          2403.37 |          2102.15 |         |
| System.Numerics.Tests.Perf_BigInteger.Add(arguments: 16,16 bits)            |      1.14 |             7.30 |             6.43 |         |

Subtract

The same as for Add. The carry will tend toward zero.

The results are repeatable.

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 16,16 bits)       |      1.07 |             7.60 |             7.07 |         |
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits) |      1.14 |          2391.64 |          2094.65 |         |

Divide

This is where it gets wild. Changing the SubtractDivisor accordingly removes a TON of branch misspredictions. Measurements are needed on 32-bit machines due to the 64 bit operations.

           ulong carry = 0UL;

            for (int i = 0; i < right.Length; i++)
            {
                carry += right[i] * q;
                uint digit = unchecked((uint)carry);
                carry = carry >> 32;
                ulong newDigit = unchecked((ulong)left[i] - digit);
                carry += (newDigit  >> 32 ) & 0x1; // This is the same as if (leftElement < digit) ++carry
                left[i] = unchecked((uint)newDigit);
            }

            return (uint)carry;

The results are a bit varied on this one, surprisingly. I do not fully understand why the first case degrades so much. It did have much lower missprediction rate, so there could be further optimizations to be done by perhaps checking the value q (for example, single binary digit q values gives a non-uniform distribution under truncated multiplication with random left and right, IE. good for the current algorithm) and selecting the current code or the proposed one. On average I would suspect the proposed code to perform better, as the ratio of true to false branches is fairly close to 1 on average (slight overweight on branch taken).

| Slower                                                                 | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Divide(arguments: 1024,512 bits) |      1.27 |           524.26 |           665.83 |         |
| System.Numerics.Tests.Perf_BigInteger.Remainder(arguments: 1024,512 bits)        |      1.31 |           524.66 |           685.61 |         |

| Faster                                                                    | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Divide(arguments: 65536,32768 bits) |      2.63 |       4249482.81 |       1617236.56 |         |
| System.Numerics.Tests.Perf_BigInteger.Remainder(arguments: 65536,32768 bits) |      2.61 |       4257183.59 |       1633850.63 |

Multiply

The last one I have is uncertain. It has to do with loop-unrolling in the trivial case of Multiply . This enables the use of a dirty result buffers. The effects can only be seen in PowMod (with removed zeroing of the BitBuffers). However, there are some potential negative effects due to increased code size (double the instruction cache misses? Hard to say, processors are good at predictively streaming into them these days).

                if (right.Length <= 0)
                {
                    result.Clear();
                    return;
                }
                // Implied i = 0 and result = 0
                ulong carry = 0UL;
                for (int j = 0; j < left.Length; j++)
                {
                    ulong digits = carry
                        + (ulong)left[j] * right[0];
                    result[j] = unchecked((uint)digits);
                    carry = digits >> 32;
                }
                result[left.Length] = (uint)carry;

                for (int i = 1; i < right.Length; i++)
                {
                    carry = 0UL;
                    for (int j = 0; j < left.Length; j++)
                    {
                        ulong digits = result[i + j] + carry
                            + (ulong)left[j] * right[i];
                        result[i + j] = unchecked((uint)digits);
                        carry = digits >> 32;
                    }
                    result[i + left.Length] = (uint)carry;
                }

| Faster                                                                       | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.ModPow(arguments: 1024,1024,64 bits)   |      1.14 |        130853.59 |        114545.93 |         |
| System.Numerics.Tests.Perf_BigInteger.ModPow(arguments: 16384,16384,64 bits) |      1.09 |       2186651.56 |       1998425.00 |         |

The text was updated successfully, but these errors were encountered:

ghost · 2020-08-28T05:31:48Z

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

tannergooding · 2020-09-14T17:03:30Z

I'm generally happy to take perf PRs, provided all tests pass and numbers look good (both local and those on the perf lab hardware).

speshuric · 2023-07-19T06:13:07Z

@adamsitnik please take a look. It seems that the issue can be closed with #83457

adamsitnik · 2023-07-19T13:51:59Z

@adamsitnik please take a look. It seems that the issue can be closed with #83457

It looks like #83457 has included the + and - perf improvements, how about * and / ? It would be great to try them and see what perf benefits we can get. Since this seems to be a small change I can give it a quick try and report back.

speshuric · 2023-07-19T17:17:16Z

Yes, it need to be clarified.
But Karatsuba/Toom-Cook branch already use Add method which contains this improvement.

Grevor added the tenet-performance Performance related issue label Aug 28, 2020

Dotnet-GitSync-Bot added area-System.Numerics untriaged New issue has not been triaged by the area owner labels Aug 28, 2020

sakno mentioned this issue Aug 31, 2020

Spannified internals of BigInteger #35565

Merged

8 tasks

tannergooding added this to the Future milestone Sep 14, 2020

tannergooding removed the untriaged New issue has not been triaged by the area owner label Sep 14, 2020

jeffhandley mentioned this issue Oct 28, 2022

.NET 7.0 RC2 Microbenchmarks Performance Study Report #77150

Open

26 tasks

speshuric mentioned this issue Apr 17, 2023

System.Numerics.BigInteger: Add/Subtract performance can be improved when size of arguments are different #83457

Closed

6 tasks

adamsitnik mentioned this issue Jul 19, 2023

[DRAFT] BigInteger.Divide perf optimization #89211

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BigInteger performance improvements #41495

BigInteger performance improvements #41495

Grevor commented Aug 28, 2020

ghost commented Aug 28, 2020

tannergooding commented Sep 14, 2020

speshuric commented Jul 19, 2023

adamsitnik commented Jul 19, 2023

speshuric commented Jul 19, 2023

BigInteger performance improvements #41495

BigInteger performance improvements #41495

Comments

Grevor commented Aug 28, 2020

Configuration

Add

Subtract

Divide

Multiply

ghost commented Aug 28, 2020

tannergooding commented Sep 14, 2020

speshuric commented Jul 19, 2023

adamsitnik commented Jul 19, 2023

speshuric commented Jul 19, 2023