Support for setting floating-point rounding modes #1765

Open
orkolorko opened this issue Feb 10, 2023 · 4 comments

Comments

@orkolorko

I was wondering if it is possible to set directed rounding modes in the GEMM kernels,
i.e., compile one kernel with RoundUp and one with RoundDown.

This would allow using GPU acceleration for interval linear algebra.
I know that on NVIDIA GPUs the rounding mode is selected at compile time rather than at runtime;
I'm curious whether GemmKernels could let us compile different kernels for different rounding modes.

@maleadt
Member

maleadt commented Feb 10, 2023

This isn't even supported by CUDA.jl. AFAIK, it would require us to emit special versions of library calls (e.g. __nv_fmaf_rn instead of __nv_fmaf) as well as to specify the rounding mode for each floating-point operation (e.g. instead of a plain fadd, emit call float @llvm.experimental.constrained.fadd.f32(float, float, metadata !"round.tonearest", metadata !"fpexcept.strict") or call float @llvm.nvvm.add.rn.f(float, float)). Maybe we could do this using overlay methods, by dispatching on GPUInterpreter properties, or with a post-optimization rewrite pass.

In summary: no, this is currently not possible, and it would require (potentially a fair bit of) compiler work to make it possible.
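To make the libdevice option above concrete, here is a minimal, hedged sketch (the helper names fma_ru and fma_ru_kernel! are purely illustrative): it binds the rounding-specific libdevice entry point __nv_fmaf_ru (fused multiply-add rounded toward +∞) using the same ccall("extern ...", llvmcall, ...) pattern CUDA.jl uses internally for its own libdevice wrappers. Note that this only covers explicit calls; it does not change what plain +, * and / lower to.

using CUDA

# Bind libdevice's round-toward-+inf fused multiply-add. The "extern" ccall/llvmcall
# form only resolves when compiled for the device, so call it from kernel code only.
fma_ru(x::Float32, y::Float32, z::Float32) =
    ccall("extern __nv_fmaf_ru", llvmcall, Cfloat, (Cfloat, Cfloat, Cfloat), x, y, z)

function fma_ru_kernel!(out, a, b, c)
    i = threadIdx().x
    @inbounds out[i] = fma_ru(a[i], b[i], c[i])   # a*b + c, rounded toward +inf
    return
end

a = CUDA.rand(Float32, 32); b = CUDA.rand(Float32, 32); c = CUDA.rand(Float32, 32)
out = similar(a)
@cuda threads=32 fma_ru_kernel!(out, a, b, c)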

maleadt changed the title from "Setting Rounding Modes" to "Support for setting floating-point rounding modes" on Feb 10, 2023
maleadt transferred this issue from JuliaGPU/GemmKernels.jl on Feb 10, 2023
@maleadt
Member

maleadt commented Feb 10, 2023

Since this requires work on CUDA.jl, I've transferred the issue there. cc @vchuravy, you may have thoughts (we could possibly implement this somewhat generally in GPUCompiler.jl, but it seems hard).

@maleadt
Member

maleadt commented Feb 10, 2023

Demo:

# the default
julia> fadd_rn(x,y) = ccall("llvm.nvvm.add.rn.f", llvmcall, Float32, (Float32, Float32), x, y)
fadd_rn (generic function with 1 method)

julia> CUDA.code_llvm(fadd_rn, Tuple{Float32, Float32})
;  @ REPL[3]:1 within `fadd_rn`
define float @julia_fadd_rn_1936(float %0, float %1) local_unnamed_addr #0 {
top:
  %2 = fadd float %0, %1
  ret float %2
}


# round to zero
julia> fadd_rz(x,y) = ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
fadd_rz (generic function with 1 method)

julia> CUDA.code_llvm(fadd_rz, Tuple{Float32, Float32})
;  @ REPL[5]:1 within `fadd_rz`
define float @julia_fadd_rz_3211(float %0, float %1) local_unnamed_addr #0 {
top:
  %2 = call float @llvm.nvvm.add.rz.f(float %0, float %1)
  ret float %2
}

julia> CUDA.code_ptx(fadd_rz, Tuple{Float32, Float32})
//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_75
.address_size 64

	// .globl	julia_fadd_rz_3252      // -- Begin function julia_fadd_rz_3252
                                        // @julia_fadd_rz_3252
.visible .func  (.param .b32 func_retval0) julia_fadd_rz_3252(
	.param .b32 julia_fadd_rz_3252_param_0,
	.param .b32 julia_fadd_rz_3252_param_1
)
{
	.reg .f32 	%f<4>;

// %bb.0:                               // %top
	ld.param.f32 	%f1, [julia_fadd_rz_3252_param_0];
	ld.param.f32 	%f2, [julia_fadd_rz_3252_param_1];
	add.rz.f32 	%f3, %f1, %f2;
	st.param.f32 	[func_retval0+0], %f3;
	ret;
                                        // -- End function
}

Which means that in principle we could do:

@device_override Base.:(+)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(+)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.add.rz.d", llvmcall, Float64, (Float64, Float64), x, y)

@device_override Base.:(-)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, -y)
@device_override Base.:(-)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.add.rz.d", llvmcall, Float64, (Float64, Float64), x, -y)

@device_override Base.:(*)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.mul.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(*)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.mul.rz.d", llvmcall, Float64, (Float64, Float64), x, y)

@device_override Base.:(/)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.div.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(/)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.div.rz.d", llvmcall, Float64, (Float64, Float64), x, y)

julia> @device_code_ptx CUDA.rand(1) .+ CUDA.rand(1)

	ld.global.f32 	%f8, [%rd82];
	add.rz.f32 	%f9, %f7, %f8;
	add.s64 	%rd83, %rd50, %rd94;
	st.global.f32 	[%rd83], %f9;

I would have expected this to run into JuliaGPU/GPUCompiler.jl#384 and heavily pessimize codegen, but surprisingly it's just some accuracy tests that fail.
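For the directed modes the original question asks about, the relevant PTX modifiers are .rp (round toward +∞, i.e. RoundUp) and .rm (round toward −∞, i.e. RoundDown). Assuming the corresponding NVVM intrinsics are exposed alongside the .rn/.rz ones used above (the names below are inferred from that naming scheme, not verified), the same trick would give:

# Directed-rounding additions; intrinsic names assumed by analogy with
# llvm.nvvm.add.rn.f / llvm.nvvm.add.rz.f shown above.
fadd_ru(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rp.f", llvmcall, Float32, (Float32, Float32), x, y)   # RoundUp
fadd_rd(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rm.f", llvmcall, Float32, (Float32, Float32), x, y)   # RoundDown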

@orkolorko
Author

Thank you for looking into this issue @maleadt.

If this is possible, it would be really groundbreaking: computing the enclosure with Rump's algorithm
uses several independent matrix products with different rounding modes, which could be run in parallel on the GPU.
This could make performance much better for Julia-based libraries.

Correct me if I am wrong @lucaferranti.

The code for the matrix multiplication is the following.

mA, mB, R, Csup = setrounding(T, RoundUp) do
        mA = Ainf + 0.5 * (Asup - Ainf)     # midpoint of A
        mB = Binf + 0.5 * (Bsup - Binf)     # midpoint of B

        rA = mA - Ainf                      # radius of A
        rB = mB - Binf                      # radius of B

        R = abs.(mA) * rB + rA * (abs.(mB) + rB)   # rigorous error term
        Csup = mA * mB + R                         # upper bound of the product

        return mA, mB, R, Csup
end

Cinf = setrounding(T, RoundDown) do
        mA * mB - R                                # lower bound of the product
end

As you can see, it performs three independent matrix products with rounding up and one with rounding down;
since the products are independent, in principle they can be run concurrently on the GPU.
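To illustrate that scheduling, here is a rough sketch assuming hypothetical rounding-mode-specific products mul_up and mul_down were available (the placeholders below just compute A * B with the default rounding, and the additions inside the enclosure would also need directed rounding). Since each Julia task uses its own task-local CUDA stream, the four products can overlap on the device.

using CUDA

# Stand-ins for the hypothetical rounding-mode-specific products; NOT existing
# CUDA.jl API. With compiler support, mul_up/mul_down would be compiled with
# every operation rounded toward +inf / -inf respectively.
mul_up(A, B)   = A * B   # placeholder: should round toward +inf
mul_down(A, B) = A * B   # placeholder: should round toward -inf

# Launch the four independent products of Rump's enclosure concurrently:
# each Julia task gets its own task-local CUDA stream, so the GEMMs can overlap.
function rump_products(mA::CuMatrix, mB::CuMatrix, rA::CuMatrix, rB::CuMatrix)
    absA, absB = abs.(mA), abs.(mB)
    t1 = Threads.@spawn mul_up(absA, rB)          # term of R         (RoundUp)
    t2 = Threads.@spawn mul_up(rA, absB .+ rB)    # term of R         (RoundUp)
    t3 = Threads.@spawn mul_up(mA, mB)            # product for Csup  (RoundUp)
    t4 = Threads.@spawn mul_down(mA, mB)          # product for Cinf  (RoundDown)
    return fetch(t1), fetch(t2), fetch(t3), fetch(t4)
end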
