Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements of Random.GetItems<T> performance #82286

Open
sakno opened this issue Feb 17, 2023 · 9 comments
Open

Improvements of Random.GetItems<T> performance #82286

sakno opened this issue Feb 17, 2023 · 9 comments
Assignees
Labels
Milestone

Comments

@sakno
Copy link
Contributor

sakno commented Feb 17, 2023

.NET 8 introduces a new API for generating random sequences from the specified choices with Random.GetItems<T> and RandomNumberGenerator.GetItems<T> methods. I found that the recently submitted implementation is not as fast as it can be:

public void GetItems<T>(ReadOnlySpan<T> choices, Span<T> destination)
{
if (choices.IsEmpty)
throw new ArgumentException(SR.Arg_EmptySpan, nameof(choices));
for (int i = 0; i < destination.Length; i++)
{
destination[i] = choices[Next(choices.Length)];
}
}

Next is method is virtual and called within the loop that prevents it from inlining (as well as loop inside of XoshiroImpl.Next prevents this). Moreover, XoshiroImpl generates 64-bit number efficiently, but the current implementation only uses 32-bit (because span index is of type int). Thus, a half or random bits just dropped. If we can fully reuse all 64 bits, the performance can be improved twice.

This approach is implemented in my open-source library, which gives the following differences in the benchmark:

Method AllowedChars Mean Error StdDev Ratio
GetRandomItemsOptimized 1234567890abcdef 90.08 ns 0.625 ns 0.584 ns 0.40
GetRandomItemsDotNet 1234567890abcdef 227.49 ns 3.438 ns 3.216 ns 1.00
GetRandomItemsOptimized 1234567890abcdefg 155.86 ns 0.983 ns 0.919 ns 0.27
GetRandomItemsDotNet 1234567890abcdefg 571.74 ns 8.116 ns 7.195 ns 1.00

, where AllowedChars is a choices span. The length of the destination buffer is 36. When choices length is a power of 2, it is possible to perform additional optimizations that give even better performance.

Instead of calling Next method, I prefer to generate a vector of random 32-bit integers only once using NextBytes. Each element of the vector represents an index within choices. All I need here is to preserve range of each index (when length is a power of 2, modulo can be replaced effectively with bitwise AND). Otherwise, adjusting index can be done without division (or modulo) operator and the loop as described by Daniel Lemire in his paper on Arxiv.

Here is implementation of a whole algorithm from my library (MIT licensed).

For a relatively small destination, it is possible to allocate buffer on the stack. Otherwise, ArrayPool can be used to rent the necessary vector.

If this looks interesting, I'm ready to adopt and port the implementation.

P.S.: Benchmark results are obtained on .NET SDK 6.0.406 Linux

@sakno sakno added the tenet-performance Performance related issue label Feb 17, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Feb 17, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 17, 2023
@vcsjones vcsjones added area-System.Runtime and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Feb 17, 2023
@ghost
Copy link

ghost commented Feb 17, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

.NET 8 introduces a new API for generating random sequences from the specified choices with Random.GetItems<T> and RandomNumberGenerator.GetItems<T> methods. I found that the recently submitted implementation is not as fast as it can be:

public void GetItems<T>(ReadOnlySpan<T> choices, Span<T> destination)
{
if (choices.IsEmpty)
throw new ArgumentException(SR.Arg_EmptySpan, nameof(choices));
for (int i = 0; i < destination.Length; i++)
{
destination[i] = choices[Next(choices.Length)];
}
}

Next is method is virtual and called within the loop that prevents it from inlining (as well as loop inside of XoshiroImpl.Next prevents this). Moreover, XoshiroImpl generates 64-bit number efficiently, but the current implementation only uses 32-bit (because span index is of type int). Thus, a half or random bits just dropped. If we can fully reuse all 64 bits, the performance can be improved twice.

This approach is implemented in my open-source library, which gives the following differences in the benchmark:

Method AllowedChars Mean Error StdDev Ratio
GetRandomItemsOptimized 1234567890abcdef 90.08 ns 0.625 ns 0.584 ns 0.40
GetRandomItemsDotNet 1234567890abcdef 227.49 ns 3.438 ns 3.216 ns 1.00
GetRandomItemsOptimized 1234567890abcdefg 155.86 ns 0.983 ns 0.919 ns 0.27
GetRandomItemsDotNet 1234567890abcdefg 571.74 ns 8.116 ns 7.195 ns 1.00

, where AllowedChars is a choices span. The length of the destination buffer is 36. When choices length is a power of 2, it is possible to perform additional optimizations that give even better performance.

Instead of calling Next method, I prefer to generate a vector of random 32-bit integers only once using NextBytes. Each element of the vector represents an index within choices. All I need here is to preserve range of each index (when length is a power of 2, modulo can be replaced effectively with bitwise AND). Otherwise, adjusting index can be done without division (or modulo) operator and the loop as described by Daniel Lemire in his paper on Arxiv.

Here is implementation of a whole algorithm from my library (MIT licensed).

For a relatively small destination, it is possible to allocate buffer on the stack. Otherwise, ArrayPool can be used to rent the necessary vector.

If this looks interesting, I'm ready to adopt and port the implementation.

P.S.: Benchmark results are obtained on .NET SDK 6.0.406 Linux

Author: sakno
Assignees: -
Labels:

area-System.Runtime, tenet-performance, untriaged

Milestone: -

@danmoseley
Copy link
Member

danmoseley commented Feb 17, 2023

See #79790

let’s merge that first

@stephentoub
Copy link
Member

Next is method is virtual and called within the loop that prevents it from inlining (as well as loop inside of XoshiroImpl.Next prevents this

An alternative would be making GetItems<T> virtual (or a generic private protected virtual it delegates to). That'd be an issue for shorter outputs as well, because generic virtual methods have non-trivial overheads associated with their invocation.

For a relatively small destination, it is possible to allocate buffer on the stack. Otherwise, ArrayPool can be used to rent the necessary vector.

We could explore the approach of having a private protected virtual void Next(Span<Int32>) that's used to make a single virtual call to fill up the necessary ints and then iterate through those in GetItems. I'm hesitant, however, to use ArrayPool here. It's one thing to use ArrayPool when you have to have an array; you're replacing a guaranteed allocation with one that may or may not require an allocation. But it's another thing to introduce the need for something that might allocate. On occasion if the pool didn't have an appropriate length buffer, GetItems would become allocating, which is far from ideal.

Here is implementation

I would not want to use unsafe code here as in your example for indexing into choices, in particular if the values being randomly generated aren't controlled by us. In the existing implementation, for example, it's possible for a derived implementation to return a value from Next that's erroneously out of range; today that would result in an exception, but if the bounds checks were removed explicitly, it could result in walking off the span.

@sakno
Copy link
Contributor Author

sakno commented Feb 17, 2023

in particular if the values being randomly generated aren't controlled by us

True. But in my example, all we need is to make NextBytes call which is virtual, and only once. It's fine if we have no control over its implementation. But Lemire's NextUInt32 needs to be applied to the cached vector of random values. Thus, no need to expose this implementation detail as a public method. In that case, it's possible to skip bounds check in an unsafe way. In other words, GetItems<T> doesn't use public virtual int Next(int maxValue) at all.

@stephentoub
Copy link
Member

In other words, GetItems doesn't use public virtual int Next(int maxValue) at all.

Understood. Which leads into my previous comment about ArrayPool allocation. If we want to explore making a single call for sizes we're willing to stack allocate, that'd be ok, although it'd then be weird if we used a derived Random's NextBytes some of the time and Next(int) other times.

@sakno
Copy link
Contributor Author

sakno commented Feb 21, 2023

@danmoseley , #79790 has been merged. I'm ready to proceed with this task. Could you assign it to me?

@sakno
Copy link
Contributor Author

sakno commented Feb 21, 2023

Preliminary benchmark results (new changes from PR are taken into account):

Method AllowedChars Mean Error StdDev Ratio
GetItemsOptimized 1234567890abcdef 94.93 ns 1.194 ns 1.117 ns 0.45
GetItemsFromDotNet 1234567890abcdef 213.14 ns 2.100 ns 1.862 ns 1.00
GetItemsOptimized 1234567890abcdefg 150.76 ns 2.117 ns 1.876 ns 0.71
GetItemsFromDotNet 1234567890abcdefg 211.84 ns 2.756 ns 2.443 ns 1.00

Host : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2

Source code for the benchmark
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Order;
using System;
using System.Runtime.CompilerServices;

namespace DotNext;

[SimpleJob(runStrategy: RunStrategy.Throughput, launchCount: 1)]
[Orderer(SummaryOrderPolicy.FastestToSlowest)]
public class RandomStringBenchmark
{
    [Params("1234567890abcdef", "1234567890abcdefg")]
    public string AllowedChars;
    private readonly Random rnd = new();

    [Benchmark]
    public void GetItemsOptimized()
    {
        Span<char> destination = stackalloc char[36];
        RandomExtensions.NextChars(rnd, AllowedChars, destination);
    }

    [Benchmark(Baseline = true)]
    public void GetItemsFromDotNet()
    {
        Span<char> destination = stackalloc char[36];

        GetItems(rnd, AllowedChars, destination);

        static void GetItems<T>(Random rnd, ReadOnlySpan<T> choices, Span<T> destination)
        {
            if (choices.IsEmpty)
            {
                throw new ArgumentException(nameof(choices));
            }

            for (int i = 0; i < destination.Length; i++)
            {
                destination[i] = choices[(int)NextUInt32((uint)choices.Length, rnd)];
            }
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint NextUInt32(uint maxValue, Random rnd)
    {
        ulong randomProduct = (ulong)maxValue * (uint)rnd.Next();
        uint lowPart = (uint)randomProduct;

        if (lowPart < maxValue)
        {
            uint remainder = (0u - maxValue) % maxValue;

            while (lowPart < remainder)
            {
                randomProduct = (ulong)maxValue * (uint)rnd.Next();
                lowPart = (uint)randomProduct;
            }
        }

        return (uint)(randomProduct >> 32);
    }
}

@danmoseley
Copy link
Member

@sakno are you still working on this? would be a nice win.

@sakno
Copy link
Contributor Author

sakno commented Mar 28, 2023

@danmoseley , yes, I'm working on it. But now this work is on pause, I'm waiting for #83305 PR because it may have impact on my work: merge conflicts, new algorithm (which probably can be more efficient that my proposal).

@ericstj ericstj added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 14, 2023
@ericstj ericstj added this to the Future milestone Aug 14, 2023
@jeffhandley jeffhandley removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants