ByteAddressBuffer templated loads/stores with optional alignment argument #258

sebbbi · 2019-05-16T06:32:39Z

SM 6.2 added templated ByteAddressBuffer loads and templated stores were added in microsoft/DirectXShaderCompiler#2176. With these new additions the usability of raw buffers has increased drastically. Code looks nice:

ByteAddressBuffer buffer;
float4 v = buffer.Load<float4>(idx);
buffer.Store<float4>(idx, v);

There's one remaining problem in DirectX raw buffer: Both the original raw load API and the new templated one use alignment of 4 bytes for all load widths. There's no way to define a custom alignment. This results in poor codegen on some GPUs which require natural alignment (8 or 16 bytes) for 2/4 wide load/store instructions. Seems that their internal compiler chops these wide loads into 2/3/4 individual 32 bit loads, clearly decreasing performance in mem issue bound cases.

I propose a new optional syntax to provide alignment as second template parameter to give compilers more knowledge, allowing them to emit wide load/store instructions on all architectures:

float4 v = buffer.Load<float4, 16>(idx);
buffer.Store<float4, 16>(idx, v);

Investigation:

I analyzed the performance of raw loads with my L1$ load benchmark:
https://github.com/sebbbi/perftest

On AMD GPUs Load2/3/4 instructions only require alignment of 4 bytes, matching DirectX specification. This results in peak throughput for all widths:

ByteAddressBuffer.Load random: 22.153ms 1.004x
ByteAddressBuffer.Load2 random: 22.435ms 0.992x
ByteAddressBuffer.Load4 random: 23.508ms 0.947x

On Nvidia GPUs (Kepler, Maxwell, Pascal), we get clear linear regression with load width:

ByteAddressBuffer.Load random: 24.336ms 1.479x
ByteAddressBuffer.Load2 random: 48.744ms 0.738x
ByteAddressBuffer.Load4 random: 209.280ms 0.172x

Structured buffer doesn't show the same regression on Nvidia GPUs, as StructuredBuffer guarantees alignment. Now we get 100% performance with float2 (64 bit) loads. Nvidia has half rate 128 bit loads (including textures). That explains half rate for float4 case. Still much better than raw load case:

StructuredBuffer<float>.Load random: 34.092ms 1.056x
StructuredBuffer<float2>.Load random: 34.125ms 1.055x
StructuredBuffer<float4>.Load random: 67.950ms 0.530x

Nvidia has vec2 and vec4 wide loads exposed in CUDA (https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/), but these instructions require alignment of 8 and 16 bytes. Because DirectX only guarantees alignment of 4 bytes, these instructions can't be used by the compiler. Instead the compiler needs to be conservative and emit multiple 32 bit load instructions. Which is clearly visible in the benchmark results.

The text was updated successfully, but these errors were encountered:

ChasBoyd · 2019-05-16T18:36:42Z

@sebbbi Matias Goldberg just proposed a syntax closer to standard LLVM:
May 15 Replying to @SebAaltonen

struct MyStruct
{
float4 a [[align(16)]];
float b [[align(16)]];
};

RawRWViewvar = rawBuffer.reinterpret(offsetStart, align);

xhcao · 2020-04-08T00:51:02Z

Hi, Currently, Does dxc support RWByteAddressBuffer 16-bit float templated Store method in shader model 6.2 in -eanble-16bit-types mode? For example, "buf.Store<float16_t2>(0, float16_t2(0.9, 0.9));".
I use dxc compile the shader with above statement, no error happens. But I read the data back to CPU side, the data is incorrect and zeros.

john-h-k · 2021-05-09T09:02:57Z

Using raw byte buffers on NV without this is nigh-on impossible with maxwell/pascal. Has there been any feedback on this issue?

john-h-k · 2021-05-09T09:25:00Z

float4 v = buffer.Load<float4, 16>(idx);
buffer.Store<float4, 16>(idx, v);

Could also have it at resource site, if the entire resource has an alignment.

[align(16)]
ByteAddressBuffer myBuffer;

Should allow the compiler to workout which loads are aligned most of the time,

// https://devblogs.microsoft.com/directx/hlsl-shader-model-6-6/ // https://microsoft.github.io/DirectX-Specs/d3d/HLSL_ShaderModel6_6.html // https://devblogs.microsoft.com/directx/in-the-works-hlsl-shader-model-6-6/ https://github.com/microsoft/DirectXShaderCompiler/issues/2193 Bindless只能ByteAddressBuffer或者RWByteAddressBuffer，所以Shader里面通过ByteAddressBuffer去索引，同时在SM6.2支持了Template Lod，所以可以获取Float4这样的数据了，而不必直接强转

simondeschenes · 2023-10-29T21:52:26Z

Could we get a new method on ByteAddressBuffer such as this?
MyType myData = myBuffer.LoadAligned<MyType, 16>(offsetInMultipleOfAlignment);

If I wrote the following liine, the actual offset in bytes would be 32:
MyType myData = myBuffer.LoadAligned<MyType, 16>(2);

I think that would solve the problem, it guarantees aligned load to a specific multiple while also supporting templated loads. Alignment would only allow power of 2 higher or equal to 4 and that would be checked at compilation time.

We could also add a matching Store<T, Align>(...) method too.

tristanlabelle added the enhancement New feature or request label May 17, 2019

jeremyong-az mentioned this issue Aug 30, 2021

Add "sledgehammer" option for buffer packing/unpacking for bound shader data o3de/o3de#3726

Open

jaebaek mentioned this issue Feb 4, 2022

[spirv] support 'T vk::RawBufferLoad<T = uint>(uint64_t address [, uint alignment = 4] )' microsoft/DirectXShaderCompiler#4226

Merged

llvm-beanz transferred this issue from microsoft/DirectXShaderCompiler Jun 11, 2024

llvm-beanz added the needs-triage label Jun 13, 2024

llvm-beanz added Theme:Alignment Issues related to data alignment or packing Requires Shader Model Feature requests that require DXIL and Shader Model updates and removed needs-triage labels Aug 6, 2024

llvm-beanz added this to the Shader Model Backlog milestone Aug 6, 2024

llvm-beanz moved this to Triaged in HLSL Triage Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ByteAddressBuffer templated loads/stores with optional alignment argument #258

ByteAddressBuffer templated loads/stores with optional alignment argument #258

sebbbi commented May 16, 2019 •

edited

Loading

ChasBoyd commented May 16, 2019

xhcao commented Apr 8, 2020

john-h-k commented May 9, 2021

john-h-k commented May 9, 2021

simondeschenes commented Oct 29, 2023 •

edited

Loading

ByteAddressBuffer templated loads/stores with optional alignment argument #258

ByteAddressBuffer templated loads/stores with optional alignment argument #258

Comments

sebbbi commented May 16, 2019 • edited Loading

ChasBoyd commented May 16, 2019

xhcao commented Apr 8, 2020

john-h-k commented May 9, 2021

john-h-k commented May 9, 2021

simondeschenes commented Oct 29, 2023 • edited Loading

sebbbi commented May 16, 2019 •

edited

Loading

simondeschenes commented Oct 29, 2023 •

edited

Loading