Support for setting floating-point rounding modes #1765

Open
orkolorko opened this issue Feb 10, 2023 · 4 comments

Comments

@orkolorko

I was wondering if it is possible to set directed rounding modes in the GEMM kernels,
i.e., compile one kernel with RoundUp and one with RoundDown.

This would allow using GPU acceleration for interval linear algebra.
I know that on NVIDIA GPUs the rounding mode is selected at compile time rather than at runtime;
I'm curious whether GemmKernels could let us compile different kernels for different rounding modes.

@maleadt
Member

maleadt commented Feb 10, 2023

This isn't even supported by CUDA.jl. AFAIK, it would require us to emit special versions of library calls (e.g. __nv_fmaf_rn instead of __nv_fmaf) as well as to specify the rounding mode for each floating-point operation (e.g. instead of a plain fadd, emit call float @llvm.experimental.constrained.fadd.f32(float, float, metadata !"round.tonearest", metadata !"fpexcept.strict") or call float @llvm.nvvm.add.rn.f(float, float)). Maybe we could do this using overlay methods, by dispatching on GPUInterpreter properties, or with a post-optimization rewrite pass.

In summary: no, this is currently not possible, and it would require (potentially a fair bit of) compiler work to make it possible.
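To make the libdevice option above concrete, here is a minimal, hedged sketch (the helper names fma_ru and fma_ru_kernel! are purely illustrative): it binds the rounding-specific libdevice entry point __nv_fmaf_ru (fused multiply-add rounded toward +∞) using the same ccall("extern ...", llvmcall, ...) pattern CUDA.jl uses internally for its own libdevice wrappers. Note that this only covers explicit calls; it does not change what plain +, * and / lower to.

using CUDA

# Bind libdevice's round-toward-+inf fused multiply-add. The "extern" ccall/llvmcall
# form only resolves when compiled for the device, so call it from kernel code only.
fma_ru(x::Float32, y::Float32, z::Float32) =
    ccall("extern __nv_fmaf_ru", llvmcall, Cfloat, (Cfloat, Cfloat, Cfloat), x, y, z)

function fma_ru_kernel!(out, a, b, c)
    i = threadIdx().x
    @inbounds out[i] = fma_ru(a[i], b[i], c[i])   # a*b + c, rounded toward +inf
    return
end

a = CUDA.rand(Float32, 32); b = CUDA.rand(Float32, 32); c = CUDA.rand(Float32, 32)
out = similar(a)
@cuda threads=32 fma_ru_kernel!(out, a, b, c)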

maleadt changed the title from "Setting Rounding Modes" to "Support for setting floating-point rounding modes" on Feb 10, 2023
maleadt transferred this issue from JuliaGPU/GemmKernels.jl on Feb 10, 2023
@maleadt
Member

maleadt commented Feb 10, 2023

Since this requires work on CUDA.jl, I've transferred the issue there. cc @vchuravy, you may have thoughts (we could possibly implement this somewhat generally in GPUCompiler.jl, but it seems hard).

@maleadt
Member

maleadt commented Feb 10, 2023

Demo:

# the default
julia> fadd_rn(x,y) = ccall("llvm.nvvm.add.rn.f", llvmcall, Float32, (Float32, Float32), x, y)
fadd_rn (generic function with 1 method)

julia> CUDA.code_llvm(fadd_rn, Tuple{Float32, Float32})
;  @ REPL[3]:1 within `fadd_rn`
define float @julia_fadd_rn_1936(float %0, float %1) local_unnamed_addr #0 {
top:
  %2 = fadd float %0, %1
  ret float %2
}


# round to zero
julia> fadd_rz(x,y) = ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
fadd_rz (generic function with 1 method)

julia> CUDA.code_llvm(fadd_rz, Tuple{Float32, Float32})
;  @ REPL[5]:1 within `fadd_rz`
define float @julia_fadd_rz_3211(float %0, float %1) local_unnamed_addr #0 {
top:
  %2 = call float @llvm.nvvm.add.rz.f(float %0, float %1)
  ret float %2
}

julia> CUDA.code_ptx(fadd_rz, Tuple{Float32, Float32})
//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_75
.address_size 64

	// .globl	julia_fadd_rz_3252      // -- Begin function julia_fadd_rz_3252
                                        // @julia_fadd_rz_3252
.visible .func  (.param .b32 func_retval0) julia_fadd_rz_3252(
	.param .b32 julia_fadd_rz_3252_param_0,
	.param .b32 julia_fadd_rz_3252_param_1
)
{
	.reg .f32 	%f<4>;

// %bb.0:                               // %top
	ld.param.f32 	%f1, [julia_fadd_rz_3252_param_0];
	ld.param.f32 	%f2, [julia_fadd_rz_3252_param_1];
	add.rz.f32 	%f3, %f1, %f2;
	st.param.f32 	[func_retval0+0], %f3;
	ret;
                                        // -- End function
}

Which means that in principle we could do:

@device_override Base.:(+)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(+)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.add.rz.d", llvmcall, Float64, (Float64, Float64), x, y)

@device_override Base.:(-)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rz.f", llvmcall, Float32, (Float32, Float32), x, -y)
@device_override Base.:(-)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.add.rz.d", llvmcall, Float64, (Float64, Float64), x, -y)

@device_override Base.:(*)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.mul.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(*)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.mul.rz.d", llvmcall, Float64, (Float64, Float64), x, y)

@device_override Base.:(/)(x::Float32, y::Float32) =
    ccall("llvm.nvvm.div.rz.f", llvmcall, Float32, (Float32, Float32), x, y)
@device_override Base.:(/)(x::Float64, y::Float64) =
    ccall("llvm.nvvm.div.rz.d", llvmcall, Float64, (Float64, Float64), x, y)

julia> @device_code_ptx CUDA.rand(1) .+ CUDA.rand(1)

	ld.global.f32 	%f8, [%rd82];
	add.rz.f32 	%f9, %f7, %f8;
	add.s64 	%rd83, %rd50, %rd94;
	st.global.f32 	[%rd83], %f9;

I would have expected this to run into JuliaGPU/GPUCompiler.jl#384 and heavily pessimize codegen, but surprisingly it's just some accuracy tests that fail.
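For the directed modes the original question asks about, the relevant PTX modifiers are .rp (round toward +∞, i.e. RoundUp) and .rm (round toward −∞, i.e. RoundDown). Assuming the corresponding NVVM intrinsics are exposed alongside the .rn/.rz ones used above (the names below are inferred from that naming scheme, not verified), the same trick would give:

# Directed-rounding additions; intrinsic names assumed by analogy with
# llvm.nvvm.add.rn.f / llvm.nvvm.add.rz.f shown above.
fadd_ru(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rp.f", llvmcall, Float32, (Float32, Float32), x, y)   # RoundUp
fadd_rd(x::Float32, y::Float32) =
    ccall("llvm.nvvm.add.rm.f", llvmcall, Float32, (Float32, Float32), x, y)   # RoundDown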

@orkolorko
Author

Thank you for looking into this issue @maleadt.

If this is possible, it would be really groundbreaking: computing the enclosure with Rump's algorithm
uses several independent matrix products with different rounding modes, which could be run in parallel on the GPU.
This could make performance much better for Julia-based libraries.

Correct me if I am wrong @lucaferranti.

The code for the matrix multiplication is the following.

mA, mB, R, Csup = setrounding(T, RoundUp) do
        mA = Ainf + 0.5 * (Asup - Ainf)     # midpoint of A
        mB = Binf + 0.5 * (Bsup - Binf)     # midpoint of B

        rA = mA - Ainf                      # radius of A
        rB = mB - Binf                      # radius of B

        R = abs.(mA) * rB + rA * (abs.(mB) + rB)   # rigorous error term
        Csup = mA * mB + R                         # upper bound of the product

        return mA, mB, R, Csup
end

Cinf = setrounding(T, RoundDown) do
        mA * mB - R                                # lower bound of the product
end

As you can see, it performs three independent matrix products with rounding up and one with rounding down;
since the products are independent, in principle they can be run concurrently on the GPU.
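To illustrate that scheduling, here is a rough sketch assuming hypothetical rounding-mode-specific products mul_up and mul_down were available (the placeholders below just compute A * B with the default rounding, and the additions inside the enclosure would also need directed rounding). Since each Julia task uses its own task-local CUDA stream, the four products can overlap on the device.

using CUDA

# Stand-ins for the hypothetical rounding-mode-specific products; NOT existing
# CUDA.jl API. With compiler support, mul_up/mul_down would be compiled with
# every operation rounded toward +inf / -inf respectively.
mul_up(A, B)   = A * B   # placeholder: should round toward +inf
mul_down(A, B) = A * B   # placeholder: should round toward -inf

# Launch the four independent products of Rump's enclosure concurrently:
# each Julia task gets its own task-local CUDA stream, so the GEMMs can overlap.
function rump_products(mA::CuMatrix, mB::CuMatrix, rA::CuMatrix, rB::CuMatrix)
    absA, absB = abs.(mA), abs.(mB)
    t1 = Threads.@spawn mul_up(absA, rB)          # term of R         (RoundUp)
    t2 = Threads.@spawn mul_up(rA, absB .+ rB)    # term of R         (RoundUp)
    t3 = Threads.@spawn mul_up(mA, mB)            # product for Csup  (RoundUp)
    t4 = Threads.@spawn mul_down(mA, mB)          # product for Cinf  (RoundDown)
    return fetch(t1), fetch(t2), fetch(t3), fetch(t4)
end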
