Optimize conversions between Half and Single #69667
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-runtime
Issue Details
Description
Currently the conversion between
Configuration
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.300-preview.22204.3
[Host] : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
DefaultJob : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
Regression?
No
Data
I benchmarked the code below.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
namespace HalfConversionBenchmarks
{
[SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
[DisassemblyDiagnoser(maxDepth: int.MaxValue)]
public class HalfToSingleConversionBenchmarks
{
[Params(65535)]
public int Frames { get; set; }
private float[] bufferDst;
private Half[] bufferA;
[GlobalSetup]
public void Setup()
{
var samples = Frames;
bufferDst = new float[samples];
bufferA = new Half[samples];
bufferA.AsSpan().Fill((Half)1.5f);
}
[Benchmark]
public void SimpleLoop()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
}
}
[Benchmark]
public void UnrolledLoop()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
var olen = length - 3;
for (; i < olen; i += 4)
{
Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
}
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
}
}
}
}
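Analysis
The current code looks like a source of inefficiency, using a lot of branches.
My proposal for new software fallback
I wrote this code for conversion from Half to float:
using System;
using System.Runtime.CompilerServices;
namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
        public static float ConvertHalfToSingle(Half value)
        {
            var h = BitConverter.HalfToInt16Bits(value);
            // Sign-extend to 32 bits so the sign lands in bit 31 after the shift below.
            var v = (uint)(int)h;
            // True when the exponent field is all ones (infinity or NaN).
            var b = (v & 0x7c00u) == 0x7c00u;
            // All-ones mask for infinity/NaN, zero otherwise.
            var hb = (ulong)-(long)Unsafe.As<bool, byte>(ref b);
            // Move the exponent and mantissa into their float positions.
            v <<= 13;
            // Keep only the sign bit and the 15 exponent/mantissa bits.
            v &= 0x8FFF_E000;
            // Exponent adjustment in double format: +112 (127 - 15), plus 1008 more for infinity/NaN.
            var j = 0x0700000000000000ul + (hb & 0x3F00000000000000ul);
            // Widen to double so that half subnormals (float subnormals at this point) become normal values.
            var d = BitConverter.DoubleToUInt64Bits((double)BitConverter.UInt32BitsToSingle(v));
            d += j;
            return (float)BitConverter.UInt64BitsToDouble(d);
        }
    }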
}
Test code:
using System;
using BetterHalfToSingleConversion;
using NUnit.Framework;
namespace BetterHalfConversionTests
{
    [TestFixture]
    public class BetterHalfToSingleConversionTests
    {
        [Test]
        public void ConvertHalfToSingleConvertsAllValuesCorrectly()
        {
            for (uint i = 0; i <= ushort.MaxValue; i++)
            {
                var h = BitConverter.UInt16BitsToHalf((ushort)i);
                var exp = (float)h;
                var act = HalfUtils.ConvertHalfToSingle(h);
                Assert.AreEqual(exp, act, $"Evaluating {i}th value:");
            }
        }
    }
}
And benchmarked with:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BetterHalfToSingleConversion;
namespace HalfConversionBenchmarks
{
[SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
[DisassemblyDiagnoser(maxDepth: int.MaxValue)]
public class HalfToSingleConversionBenchmarks
{
[Params(65535)]
public int Frames { get; set; }
private float[] bufferDst;
private Half[] bufferA;
[GlobalSetup]
public void Setup()
{
var samples = Frames;
bufferDst = new float[samples];
bufferA = new Half[samples];
bufferA.AsSpan().Fill((Half)1.5f);
}
[Benchmark]
public void SimpleLoopStandard()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
}
}
[Benchmark]
public void UnrolledLoopStandard()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
var olen = length - 3;
for (; i < olen; i += 4)
{
Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
}
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
}
}
[Benchmark]
public void SimpleLoopNew()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i));
}
}
[Benchmark]
public void UnrolledLoopNew()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
var olen = length - 3;
for (; i < olen; i += 4)
{
Unsafe.Add(ref rdi, i + 0) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 0));
Unsafe.Add(ref rdi, i + 1) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 1));
Unsafe.Add(ref rdi, i + 2) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 2));
Unsafe.Add(ref rdi, i + 3) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 3));
}
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i));
}
}
}
}
And the result is:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.300-preview.22204.3
[Host] : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
DefaultJob : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
I also added a new repository with some alternative approaches.
@adamsitnik I think this should maybe be left open until the half conversions get |
Description
Currently the conversion between Half and float is only implemented in software, leading to performance issues.
It would be ideal if Issue #62416 could be resolved, but a better software fallback is still needed for environments like Sandy Bridge, which does not support hardware conversion.
Configuration
Regression?
No
Data
I benchmarked the code below.
EDIT: Removed data biases.
EDIT2: Added random permutation.
Benchmark code for Half to Single conversion
The conversion of sequential values seems to be accelerated in some way, such as branch prediction.
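To keep a branch predictor from learning a sequential input pattern, the input can be shuffled before the benchmark runs. The following is only a minimal sketch of that idea, not the benchmark code used for this issue; the class name `HalfBenchmarkInputs`, the method `CreateShuffledHalfValues`, and the seed parameter are hypothetical. It fills an array with every 16-bit pattern once and applies a Fisher-Yates shuffle.

```csharp
using System;

public static class HalfBenchmarkInputs
{
    // Hypothetical helper: returns all 65536 Half bit patterns in pseudo-random order.
    public static Half[] CreateShuffledHalfValues(int seed)
    {
        var values = new Half[ushort.MaxValue + 1];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = BitConverter.UInt16BitsToHalf((ushort)i);
        }

        // Fisher-Yates shuffle with a fixed seed so that runs are repeatable.
        var rng = new Random(seed);
        for (int i = values.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // 0 <= j <= i
            (values[i], values[j]) = (values[j], values[i]);
        }

        return values;
    }
}
```

A `[GlobalSetup]` method could assign the returned array to the source buffer, so the shuffle cost stays outside the measured loop.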
Benchmark code for Single to Half conversion
Analysis
Converting Half to float
The current code looks like a source of inefficiency, using a lot of branches.
By getting rid of branches and utilizing floating-point tricks for solving subnormal issues, it IS an improvement for CPUs with fast FPUs.
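As an illustration of what such a floating-point trick can look like, here is a minimal sketch of a well-known "magic multiply" approach. It is not the code proposed in this issue; the class `HalfSketch` and method `HalfToSingle` are hypothetical names. The exponent and mantissa bits are shifted into float position, and a single multiplication by 2^112 both rebiases the exponent and normalizes subnormal inputs, so no subnormal-specific branch is needed. It assumes the runtime does not flush subnormal floats to zero.

```csharp
using System;

public static class HalfSketch
{
    // Hypothetical helper, shown only to illustrate the technique.
    public static float HalfToSingle(ushort bits)
    {
        // Exponent and mantissa in float position; the exponent still carries the half bias (15).
        uint shifted = (uint)(bits & 0x7FFF) << 13;

        // 0x7780_0000 is 2^112 = 2^(127 - 15). Multiplying rebiases the exponent,
        // and half subnormals (float subnormals at this point) come out normalized.
        float scaled = BitConverter.UInt32BitsToSingle(shifted)
                     * BitConverter.UInt32BitsToSingle(0x7780_0000u);

        // A half exponent of 31 (infinity/NaN) scales to at least 2^16;
        // force the float exponent field to all ones in that case.
        uint infNanMask = scaled >= 65536f ? 0x7F80_0000u : 0u;

        // Move the sign from bit 15 to bit 31.
        uint sign = (uint)(bits & 0x8000) << 16;

        return BitConverter.UInt32BitsToSingle(
            BitConverter.SingleToUInt32Bits(scaled) | infNanMask | sign);
    }
}
```

Whether the JIT compiles the conditional expression without a branch is an assumption that should be checked in the disassembly. The sketch also keeps NaN payload bits unchanged, so NaN results may differ at the bit level from the built-in conversion even though both are NaN.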
My proposal for new software fallback converting Half to float
EDIT: The previously proposed algorithm turned out to be slower with new input data.
The code below converts Half to float about twice as fast as the current implementation.
I've tested this code in a test project for all 65536 possible Half values.
Test and benchmark code is available in this repository, along with several alternative approaches.
The result is:
Converting float to Half
The current code has a lot of branches, which leads to possible inefficiency.
Again, by getting rid of branches and utilizing floating-point tricks for solving subnormal issues, it IS an improvement for CPUs with fast FPUs.
My proposal for new software fallback converting float to Half
The code below converts float to Half twice as fast as the current implementation.
I've tested this code in a test project for all 4,294,967,296 possible float values.
Test and benchmark code is available in this repository, along with several alternative approaches.
The benchmark result is as follows: