ByteAddressBuffer templated loads/stores with optional alignment argument #258
Comments
@sebbbi Matias Goldberg just proposed a syntax closer to standard LLVM: struct MyStruct RawRWViewvar = rawBuffer.reinterpret(offsetStart, align);
Hi, does dxc currently support the templated Store method on RWByteAddressBuffer for 16-bit floats in Shader Model 6.2 with -enable-16bit-types? For example: "buf.Store<float16_t2>(0, float16_t2(0.9, 0.9));".
Using raw byte buffers on NV without this is nigh-on impossible on Maxwell/Pascal. Has there been any feedback on this issue?
Could also have it at the resource site, if the entire resource has an alignment.
That should allow the compiler to work out which loads are aligned most of the time,
// https://devblogs.microsoft.com/directx/hlsl-shader-model-6-6/ // https://microsoft.github.io/DirectX-Specs/d3d/HLSL_ShaderModel6_6.html // https://devblogs.microsoft.com/directx/in-the-works-hlsl-shader-model-6-6/ https://github.com/microsoft/DirectXShaderCompiler/issues/2193 Bindless access can only use ByteAddressBuffer or RWByteAddressBuffer, so the shader indexes through a ByteAddressBuffer. SM 6.2 also added templated Load, so data such as float4 can be fetched directly instead of being cast by hand.
Could we get a new method on ByteAddressBuffer such as this? If I wrote the following line, the actual offset in bytes would be 32: I think that would solve the problem: it guarantees an aligned load at a specific multiple while also supporting templated loads. Alignment would only allow powers of 2 greater than or equal to 4, and that would be checked at compile time. We could also add a matching
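The snippet from the comment above did not survive in this copy of the thread, so the following is only a C++ sketch of the idea as described: the method takes an index rather than a byte offset, and the byte offset is the index multiplied by the alignment. The name `alignedByteOffset` and the example values (alignment 16, index 2) are assumptions chosen to reproduce the 32-byte offset mentioned in the comment. They are not the syntax that was actually proposed.

```cpp
#include <cstdint>

// Hypothetical helper: converts an element index into a byte offset that is
// guaranteed to be a multiple of Align. Align is validated at compile time,
// matching the "power of 2, at least 4" rule described in the comment.
template <std::uint32_t Align>
constexpr std::uint32_t alignedByteOffset(std::uint32_t index) {
    static_assert(Align >= 4 && (Align & (Align - 1)) == 0,
                  "alignment must be a power of two and at least 4");
    return index * Align;
}

// Assumed example values: index 2 with 16-byte alignment lands at byte 32.
static_assert(alignedByteOffset<16>(2) == 32, "index 2 * 16 bytes == 32");
```

Because the offset is produced by a multiplication with a compile-time constant, a compiler can prove the alignment without any runtime check, which is the point of the suggestion.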
SM 6.2 added templated ByteAddressBuffer loads, and templated stores were added in microsoft/DirectXShaderCompiler#2176. With these additions the usability of raw buffers has increased drastically. Code looks nice:
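The original HLSL snippet is missing from this copy of the issue. As a rough CPU-side analogy, templated raw-buffer access amounts to reading or writing a typed value at a byte offset. The sketch below is C++, not HLSL. `RawBuffer` and `Float4` are hypothetical stand-ins invented for illustration, and only the method shape (`Load<T>(byteOffset)`, `Store<T>(byteOffset, value)`) follows the templated methods discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical CPU-side stand-in for a raw (byte address) buffer.
struct RawBuffer {
    std::vector<std::byte> bytes;

    explicit RawBuffer(std::size_t size) : bytes(size) {}

    // Reads a T starting at byteOffset, without any manual casting at the call site.
    template <typename T>
    T Load(std::uint32_t byteOffset) const {
        static_assert(std::is_trivially_copyable_v<T>, "T must be plain data");
        assert(byteOffset + sizeof(T) <= bytes.size());
        T value;
        std::memcpy(&value, bytes.data() + byteOffset, sizeof(T));
        return value;
    }

    // Writes a T starting at byteOffset.
    template <typename T>
    void Store(std::uint32_t byteOffset, const T& value) {
        static_assert(std::is_trivially_copyable_v<T>, "T must be plain data");
        assert(byteOffset + sizeof(T) <= bytes.size());
        std::memcpy(bytes.data() + byteOffset, &value, sizeof(T));
    }
};

// Hypothetical 16-byte vector type, standing in for HLSL's float4.
struct Float4 { float x, y, z, w; };
```

A call such as `buf.Load<Float4>(16)` then reads four floats in one expression, which is the usability gain the templated methods bring over loading raw 32-bit words and converting them by hand.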
There's one remaining problem with DirectX raw buffers: both the original raw load API and the new templated one assume an alignment of 4 bytes for all load widths. There's no way to specify a custom alignment. This results in poor codegen on some GPUs that require natural alignment (8 or 16 bytes) for 2/4 wide load/store instructions. It seems that their internal compilers chop these wide loads into 2/3/4 individual 32 bit loads, clearly decreasing performance in cases bound by memory instruction issue rate.
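The effect described above can be illustrated with a small cost model. This is a simplified sketch, not how any real driver compiler works: it assumes load instructions of 4, 8 and 16 bytes exist, that each one requires natural alignment, and that the compiler knows only the guaranteed alignment of the base address. The function name and the model itself are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr bool isPowerOfTwo(std::uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Counts the load instructions needed to fetch loadBytes bytes when every
// instruction must be naturally aligned and only guaranteedAlign is known
// about the starting address. Greedy: take the widest legal load at each step.
inline std::uint32_t loadInstructionCount(std::uint32_t loadBytes,
                                          std::uint32_t guaranteedAlign,
                                          std::uint32_t maxInstrBytes = 16) {
    assert(loadBytes > 0 && loadBytes % 4 == 0);
    assert(guaranteedAlign >= 4 && isPowerOfTwo(guaranteedAlign));
    assert(maxInstrBytes >= 4 && isPowerOfTwo(maxInstrBytes));

    std::uint32_t count = 0;
    std::uint32_t offset = 0;  // bytes already fetched, relative to the base address
    while (offset < loadBytes) {
        std::uint32_t width = maxInstrBytes;
        while (width > 4 &&
               (width > loadBytes - offset ||  // would fetch more than requested
                width > guaranteedAlign ||     // base not known to be aligned enough
                offset % width != 0)) {        // current position not naturally aligned
            width /= 2;
        }
        offset += width;
        ++count;
    }
    return count;
}
```

Under this model a 16-byte load costs four instructions when only 4-byte alignment is guaranteed and a single instruction when 16-byte alignment is guaranteed, which matches the 2/3/4-way split described for 8, 12 and 16 byte loads.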
I propose a new optional syntax that provides alignment as a second template parameter to give compilers more knowledge, allowing them to emit wide load/store instructions on all architectures:
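The proposed HLSL syntax is not preserved in this copy of the issue. The C++ sketch below only illustrates the general shape of the idea: the alignment is an optional second template argument that defaults to 4, the current DirectX guarantee. `loadWithAlignment`, `isValidAlignment` and `Float4` are hypothetical names, and the power-of-two check is borrowed from a comment in this thread rather than from the proposal itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed rule: alignment must be a power of two and at least 4 bytes.
constexpr bool isValidAlignment(std::uint32_t a) {
    return a >= 4 && (a & (a - 1)) == 0;
}

// Hypothetical load with an optional alignment promise. Align is a compile-time
// constant, so a compiler could use it to select a wider load instruction.
template <typename T, std::uint32_t Align = 4>
T loadWithAlignment(const std::vector<std::byte>& buffer, std::uint32_t byteOffset) {
    static_assert(isValidAlignment(Align), "Align must be a power of two >= 4");
    assert(byteOffset % Align == 0);                  // the caller's promise
    assert(byteOffset + sizeof(T) <= buffer.size());  // stay inside the buffer
    T value;
    std::memcpy(&value, buffer.data() + byteOffset, sizeof(T));
    return value;
}

// Hypothetical 16-byte vector type, standing in for HLSL's float4.
struct Float4 { float x, y, z, w; };
```

In this sketch the promise is checked at run time with an assert. In a shader compiler the alignment would more likely be treated as an unchecked guarantee, with misaligned offsets being undefined behavior, but that is a design question for the actual proposal.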
Investigation:
I analyzed the performance of raw loads with my L1$ load benchmark:
https://github.com/sebbbi/perftest
On AMD GPUs Load2/3/4 instructions only require alignment of 4 bytes, matching DirectX specification. This results in peak throughput for all widths:
On Nvidia GPUs (Kepler, Maxwell, Pascal), we get clear linear regression with load width:
Structured buffer doesn't show the same regression on Nvidia GPUs, as StructuredBuffer guarantees alignment. Now we get 100% performance with float2 (64 bit) loads. Nvidia has half rate 128 bit loads (including textures). That explains half rate for float4 case. Still much better than raw load case:
Nvidia has vec2 and vec4 wide loads exposed in CUDA (https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/), but these instructions require alignment of 8 and 16 bytes. Because DirectX only guarantees alignment of 4 bytes, the compiler can't use these instructions. Instead it has to be conservative and emit multiple 32 bit load instructions, which is clearly visible in the benchmark results.