Support for setting floating-point rounding modes #1765
This isn't even supported by CUDA.jl. AFAIK, it would require us to emit special versions of library calls. In summary, no, this is currently not possible and would require (potentially a fair bit of) compiler work to make it possible.
Since this requires work on CUDA.jl, I've transferred the issue there. cc @vchuravy, you may have thoughts (we possibly could implement this somewhat generally in GPUCompiler.jl, but it seems hard).
Demo:

# the default
julia> fadd_rn(x,y) = ccall("llvm.nvvm.add.rn.f", llvmcall, Float32, (Float32, Float32), x, y)
fadd_rn (generic function with 1 method)
julia> CUDA.code_llvm(fadd_rn, Tuple{Float32, Float32})
; @ REPL[3]:1 within `fadd_rn`
define float @julia_fadd_rn_1936(float %0, float %1) local_unnamed_addr #0 {
top:
%2 = fadd float %0, %1
ret float %2
}
# round to zero
julia> fadd_rz(x,y) = ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
fadd_rz (generic function with 1 method)
julia> CUDA.code_llvm(fadd_rz, Tuple{Float32, Float32})
; @ REPL[5]:1 within `fadd_rz`
define float @julia_fadd_rz_3211(float %0, float %1) local_unnamed_addr #0 {
top:
%2 = call float @llvm.nvvm.add.rz.f(float %0, float %1)
ret float %2
}
julia> CUDA.code_ptx(fadd_rz, Tuple{Float32, Float32})
//
// Generated by LLVM NVPTX Back-End
//
.version 6.3
.target sm_75
.address_size 64
// .globl julia_fadd_rz_3252 // -- Begin function julia_fadd_rz_3252
// @julia_fadd_rz_3252
.visible .func (.param .b32 func_retval0) julia_fadd_rz_3252(
.param .b32 julia_fadd_rz_3252_param_0,
.param .b32 julia_fadd_rz_3252_param_1
)
{
.reg .f32 %f<4>;
// %bb.0: // %top
ld.param.f32 %f1, [julia_fadd_rz_3252_param_0];
ld.param.f32 %f2, [julia_fadd_rz_3252_param_1];
add.rz.f32 %f3, %f1, %f2;
st.param.f32 [func_retval0+0], %f3;
ret;
// -- End function
}

Which means that in principle we could do:

@device_override Base.:(+)(x::Float32, y::Float32) =
ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(+)(x::Float64, y::Float64) =
ccall("llvm.nvvm.add.rz.d", llvmcall, Float64, (Float64, Float64), x, y)
@device_override Base.:(-)(x::Float32, y::Float32) =
ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, -y)
@device_override Base.:(-)(x::Float64, y::Float64) =
ccall("llvm.nvvm.add.rz.d", llvmcall, Float64, (Float64, Float64), x, -y)
@device_override Base.:(*)(x::Float32, y::Float32) =
ccall("llvm.nvvm.mul.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(*)(x::Float64, y::Float64) =
ccall("llvm.nvvm.mul.rz.d", llvmcall, Float64, (Float64, Float64), x, y)
@device_override Base.:(/)(x::Float32, y::Float32) =
ccall("llvm.nvvm.div.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(/)(x::Float64, y::Float64) =
ccall("llvm.nvvm.div.rz.d", llvmcall, Float64, (Float64, Float64), x, y)
julia> @device_code_ptx CUDA.rand(1) .+ CUDA.rand(1)
ld.global.f32 %f8, [%rd82];
add.rz.f32 %f9, %f7, %f8;
add.s64 %rd83, %rd50, %rd94;
st.global.f32 [%rd83], %f9;

I would have expected this to run into JuliaGPU/GPUCompiler.jl#384 and to heavily pessimize codegen, but surprisingly it's just some accuracy tests that fail.
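As a minimal sketch (not something CUDA.jl provides), the per-mode overrides above could also be generated with an @eval loop over the llvm.nvvm.<op>.rz.<f|d> naming pattern from the demo, rather than written out by hand:

# Sketch only: generate round-toward-zero device overrides for +, * and /,
# assuming the llvm.nvvm.<op>.rz.<f|d> intrinsics shown above.
# (Subtraction would still need the negate-and-add form used above.)
for (op, intr) in ((:+, "add"), (:*, "mul"), (:/, "div")),
    (T, suffix) in ((Float32, "f"), (Float64, "d"))

    intrinsic = "llvm.nvvm.$(intr).rz.$(suffix)"
    @eval @device_override Base.$op(x::$T, y::$T) =
        ccall($intrinsic, llvmcall, $T, ($T, $T), x, y)
end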
Thank you for looking into this issue @maleadt. If it is possible, this would be really groundbreaking: computing the enclosure using Rump's algorithm (correct me if I am wrong @lucaferranti). The code for matrix multiplication is the following:

mA, mB, R, Csup = setrounding(T, RoundUp) do
mA = Ainf + 0.5 * (Asup - Ainf)
mB = Binf + 0.5 * (Bsup - Binf)
rA = mA - Ainf
rB = mB - Binf
R = abs.(mA) * rB + rA * (abs.(mB) + rB)
Csup = mA * mB + R
return mA, mB, R, Csup
end
Cinf = setrounding(T, RoundDown) do
mA * mB - R
end

As you can see, it runs three independent matrix products with rounding up and one matrix product with rounding down.
I was wondering if it is possible to set directed rounding modes in the GEMM kernels,
i.e., compile one kernel with RoundUp and one with RoundDown.
This would make it possible to use GPU acceleration for interval linear algebra.
I know that on Nvidia GPUs the rounding mode is selected at compile time rather than at runtime;
I'm curious to know if GemmKernels could allow us to compile different kernels for different rounding modes.
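A hypothetical sketch (not an existing GemmKernels.jl or CUDA.jl API) of what compile-time selection could look like: pass the rounding mode as a Val type parameter, so each mode compiles to its own kernel specialization. Only the .rn and .rz intrinsics appear above; the .rp (toward +Inf) and .rm (toward -Inf) suffixes are assumed to follow the same llvm.nvvm naming scheme, and matmul_rnd! is just a naive kernel for illustration.

using CUDA

# Assumed intrinsics, following the naming pattern from the demo above:
# .rp = round toward +Inf (RoundUp), .rm = round toward -Inf (RoundDown).
add_rnd(::Val{:rp}, x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rp.f", llvmcall, Float32, (Float32, Float32), x, y)
add_rnd(::Val{:rm}, x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rm.f", llvmcall, Float32, (Float32, Float32), x, y)
mul_rnd(::Val{:rp}, x::Float32, y::Float32) =
    ccall("llvm.nvvm.mul.rp.f", llvmcall, Float32, (Float32, Float32), x, y)
mul_rnd(::Val{:rm}, x::Float32, y::Float32) =
    ccall("llvm.nvvm.mul.rm.f", llvmcall, Float32, (Float32, Float32), x, y)

# Naive matmul kernel; the Val argument is a compile-time constant, so
# Val(:rp) and Val(:rm) yield two separately compiled kernels.
function matmul_rnd!(C, A, B, mode::Val)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        acc = 0.0f0
        for k in axes(A, 2)
            acc = add_rnd(mode, acc, mul_rnd(mode, A[i, k], B[k, j]))
        end
        C[i, j] = acc
    end
    return nothing
end

# e.g. for the upward-rounded products in Rump's algorithm:
# @cuda threads=(16, 16) blocks=(cld(size(C, 1), 16), cld(size(C, 2), 16)) matmul_rnd!(C, A, B, Val(:rp))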