Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteAddressBuffer templated loads/stores with optional alignment argument #258

Open
sebbbi opened this issue May 16, 2019 · 5 comments
Open
Labels
enhancement New feature or request Requires Shader Model Feature requests that require DXIL and Shader Model updates Theme:Alignment Issues related to data alignment or packing

Comments

@sebbbi
Copy link

sebbbi commented May 16, 2019

SM 6.2 added templated ByteAddressBuffer loads and templated stores were added in microsoft/DirectXShaderCompiler#2176. With these new additions the usability of raw buffers has increased drastically. Code looks nice:

ByteAddressBuffer buffer;
float4 v = buffer.Load<float4>(idx);
buffer.Store<float4>(idx, v);

There's one remaining problem in DirectX raw buffer: Both the original raw load API and the new templated one use alignment of 4 bytes for all load widths. There's no way to define a custom alignment. This results in poor codegen on some GPUs which require natural alignment (8 or 16 bytes) for 2/4 wide load/store instructions. Seems that their internal compiler chops these wide loads into 2/3/4 individual 32 bit loads, clearly decreasing performance in mem issue bound cases.

I propose a new optional syntax to provide alignment as second template parameter to give compilers more knowledge, allowing them to emit wide load/store instructions on all architectures:

float4 v = buffer.Load<float4, 16>(idx);
buffer.Store<float4, 16>(idx, v);

Investigation:

I analyzed the performance of raw loads with my L1$ load benchmark:
https://github.com/sebbbi/perftest

On AMD GPUs Load2/3/4 instructions only require alignment of 4 bytes, matching DirectX specification. This results in peak throughput for all widths:

ByteAddressBuffer.Load random: 22.153ms 1.004x
ByteAddressBuffer.Load2 random: 22.435ms 0.992x
ByteAddressBuffer.Load4 random: 23.508ms 0.947x

On Nvidia GPUs (Kepler, Maxwell, Pascal), we get clear linear regression with load width:

ByteAddressBuffer.Load random: 24.336ms 1.479x
ByteAddressBuffer.Load2 random: 48.744ms 0.738x
ByteAddressBuffer.Load4 random: 209.280ms 0.172x

Structured buffer doesn't show the same regression on Nvidia GPUs, as StructuredBuffer guarantees alignment. Now we get 100% performance with float2 (64 bit) loads. Nvidia has half rate 128 bit loads (including textures). That explains half rate for float4 case. Still much better than raw load case:

StructuredBuffer<float>.Load random: 34.092ms 1.056x
StructuredBuffer<float2>.Load random: 34.125ms 1.055x
StructuredBuffer<float4>.Load random: 67.950ms 0.530x

Nvidia has vec2 and vec4 wide loads exposed in CUDA (https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/), but these instructions require alignment of 8 and 16 bytes. Because DirectX only guarantees alignment of 4 bytes, these instructions can't be used by the compiler. Instead the compiler needs to be conservative and emit multiple 32 bit load instructions. Which is clearly visible in the benchmark results.

@ChasBoyd
Copy link

@sebbbi Matias Goldberg just proposed a syntax closer to standard LLVM:
May 15 Replying to @SebAaltonen

struct MyStruct
{
float4 a [[align(16)]];
float b [[align(16)]];
};

RawRWViewvar = rawBuffer.reinterpret(offsetStart, align);

@tristanlabelle tristanlabelle added the enhancement New feature or request label May 17, 2019
@xhcao
Copy link

xhcao commented Apr 8, 2020

Hi, Currently, Does dxc support RWByteAddressBuffer 16-bit float templated Store method in shader model 6.2 in -eanble-16bit-types mode? For example, "buf.Store<float16_t2>(0, float16_t2(0.9, 0.9));".
I use dxc compile the shader with above statement, no error happens. But I read the data back to CPU side, the data is incorrect and zeros.

@john-h-k
Copy link

john-h-k commented May 9, 2021

Using raw byte buffers on NV without this is nigh-on impossible with maxwell/pascal. Has there been any feedback on this issue?

@john-h-k
Copy link

john-h-k commented May 9, 2021

float4 v = buffer.Load<float4, 16>(idx);
buffer.Store<float4, 16>(idx, v);

Could also have it at resource site, if the entire resource has an alignment.

[align(16)]
ByteAddressBuffer myBuffer;

Should allow the compiler to workout which loads are aligned most of the time,

@simondeschenes
Copy link

simondeschenes commented Oct 29, 2023

Could we get a new method on ByteAddressBuffer such as this?
MyType myData = myBuffer.LoadAligned<MyType, 16>(offsetInMultipleOfAlignment);

If I wrote the following liine, the actual offset in bytes would be 32:
MyType myData = myBuffer.LoadAligned<MyType, 16>(2);

I think that would solve the problem, it guarantees aligned load to a specific multiple while also supporting templated loads. Alignment would only allow power of 2 higher or equal to 4 and that would be checked at compilation time.

We could also add a matching Store<T, Align>(...) method too.

@llvm-beanz llvm-beanz transferred this issue from microsoft/DirectXShaderCompiler Jun 11, 2024
@llvm-beanz llvm-beanz added Theme:Alignment Issues related to data alignment or packing Requires Shader Model Feature requests that require DXIL and Shader Model updates and removed needs-triage labels Aug 6, 2024
@llvm-beanz llvm-beanz added this to the Shader Model Backlog milestone Aug 6, 2024
@llvm-beanz llvm-beanz moved this to Triaged in HLSL Triage Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Requires Shader Model Feature requests that require DXIL and Shader Model updates Theme:Alignment Issues related to data alignment or packing
Projects
Status: Triaged
Development

No branches or pull requests

7 participants