The documentation update in January clarified the language around wave intrinsics (https://github.com/microsoft/DirectXShaderCompiler/wiki/Wave-Intrinsics/f9ac5be748c4bcd694c91a44a30ba5e810015d3d).
As far as I can tell the DirectX Specs and GDK documentation both link that page so I'm assuming it's the authority on their expected behaviour.
I have been verifying the behaviour we get with wave intrinsics across various platforms and have found a few issues with the documentation as written.
Driver support for excluding helper lanes from inputs
The Wave Vote intrinsics and Wave Reduction intrinsics are now marked as explicitly excluding helper lanes from the input.
The semantics described here do seem to apply on the various AMD drivers I have tested locally, and shader model doesn't appear to make a difference, so this is all consistent with the new docs. However, I only had RDNA GPUs available for PC testing, so I can't be sure about GCN or much older drivers.
These new rules don't seem to apply to many NVIDIA drivers. The drivers I tested from ~January 2021 and ~November 2019 (before the doc update) do include helper lanes as inputs to these intrinsics. Up-to-date drivers do exclude helper lanes. I think all drivers that support SM6.6 have the new behaviour.
If this is intended to be guaranteed behaviour only from SM6.6 onwards, then the documentation shouldn't define the behaviour unconditionally.
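To make the difference concrete, here is a toy model of a single quad with two helper lanes. It is only a sketch of the two input rules, not a reproduction of any driver: `Lane`, `wave_active_sum` and the `include_helpers` flag are hypothetical names, and the lane values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    value: int
    active: bool = True
    helper: bool = False


def wave_active_sum(lanes, include_helpers):
    """Toy model of a WaveActiveSum-style reduction over one wave.

    include_helpers=True  models the older observed behaviour (helpers contribute).
    include_helpers=False models the updated wording (helpers are excluded).
    """
    return sum(
        lane.value
        for lane in lanes
        if lane.active and (include_helpers or not lane.helper)
    )


# One pixel quad where only two lanes are covered; the other two are helpers.
quad = [Lane(1), Lane(2), Lane(3, helper=True), Lane(4, helper=True)]

sum_with_helpers = wave_active_sum(quad, include_helpers=True)
sum_without_helpers = wave_active_sum(quad, include_helpers=False)
print(sum_with_helpers, sum_without_helpers)
```

In this model the same shader code yields two different sums depending only on which rule the driver follows, which is why an unconditional statement in the documentation is a problem for code that has to run on both.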
WaveReadLaneFirst return value uniformity
Before the documentation update, WaveReadLaneFirst & WaveReadLaneAt were declared as "The following routines enable all active lanes in the current wave to receive the value from the specified lane, effectively broadcasting it.".
After the documentation update, WaveReadLaneFirst & WaveReadLaneAt are now declared as "The following routines enable all active non-helper lanes in the current wave to receive the value(s) from the specified lane(s)."
In both versions, the WaveReadLaneFirst description also claims "The resulting value is thus uniform across the wave".
If helper lanes do not receive the output from WaveReadLaneFirst, then the output would not be uniform.
While inspecting codegen on AMD GPUs, I always see the result of this function end up in a vector register, instead of a scalar register as I would naturally expect. It is a little hard to be certain what this means, as the driver compiler is within its rights to use vector instructions for uniform values if it can generate better code that way. The values do seem consistent between helper and non-helper lanes in simple tests, even if the generated code is slower than I would have liked.
So, assuming (hopefully) that the result is still guaranteed to be uniform, the wording shouldn’t exclude helper lanes from receiving the value (in the non-shuffle case).
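The contradiction can be shown with a small model of the two wordings. This is a sketch under stated assumptions, not the documented or driver behaviour: `Lane`, `wave_read_lane_first`, `is_uniform` and the `helpers_receive` flag are hypothetical names, `None` stands for a lane whose result is left unspecified, and the source lane is taken to be the lowest-indexed active lane (lane 0 is a non-helper in the example, so the question of a helper source lane does not arise).

```python
from dataclasses import dataclass


@dataclass
class Lane:
    value: int
    active: bool = True
    helper: bool = False


def wave_read_lane_first(lanes, helpers_receive):
    """Toy model: return the per-lane results of a WaveReadLaneFirst-style broadcast.

    helpers_receive=True  models the old wording (all active lanes receive the value).
    helpers_receive=False models the new wording read literally (only non-helper lanes do).
    None marks a lane whose result is unspecified in this model.
    """
    source = next(lane for lane in lanes if lane.active)  # assumed: lowest active lane
    results = []
    for lane in lanes:
        if not lane.active or (lane.helper and not helpers_receive):
            results.append(None)
        else:
            results.append(source.value)
    return results


def is_uniform(lanes, results):
    """True if every active lane, helpers included, holds the same defined value."""
    seen = [r for lane, r in zip(lanes, results) if lane.active]
    return None not in seen and len(set(seen)) == 1


wave = [Lane(7), Lane(8), Lane(9, helper=True), Lane(10, helper=True)]

old_wording = wave_read_lane_first(wave, helpers_receive=True)
new_wording = wave_read_lane_first(wave, helpers_receive=False)
print(old_wording, is_uniform(wave, old_wording))
print(new_wording, is_uniform(wave, new_wording))
```

Under the literal reading of the new wording, the helper lanes hold no defined value, so the "uniform across the wave" claim can only hold if "the wave" silently means "the non-helper lanes".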
Wave Reduction Intrinsics output uniformity
Before the documentation update, Wave Reduction Intrinsics were declared as "These intrinsics compute the specified operation across all active lanes in the wave and broadcast the final result to all active lanes.".
After the documentation update, Wave Reduction Intrinsics are now declared as "These intrinsics compute the specified operation across all active non-helper lanes in the wave and broadcast the final result to all active non-helper lanes.".
In both versions, the Wave Reduction Intrinsics section also claims "Therefore, the final output is guaranteed uniform across the wave."
The old wording guarantees that the output on helper lanes is consistent with the output on non-helper lanes.
The new wording sort of does too, via the "Therefore, the final output is guaranteed uniform across the wave." line, but the change of wording in the first half implies that it does not. All codegen I have seen so far does give scalar output.
Is this meant to say "These intrinsics compute the specified operation across all active non-helper lanes in the wave and broadcast the final result to all active lanes."?