Provide mechanism to automatically partition kernels #43
We have a large set of kernels in ClimaAtmos, and we have to partition them by hand, because we pretty easily run into

ERROR: LoadError: Kernel invocation uses too much parameter memory; 4.586 KiB exceeds the 4.000 KiB limit imposed by sm_60 / PTX v8.2

Since this limit is device-dependent, we should probably offer a mechanism to split the fused broadcasts into segments that are bounded by the parameter memory.
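For illustration, manual partitioning means splitting one large fused block into several smaller ones. The `@fused` macro below is a stand-in for the fusion entry point, and the fields and functions are made up; this is a sketch of the workflow, not the package's exact API:

```julia
# Illustrative only: `@fused` stands in for the fusion macro; y1, y2, f1, f2,
# and the x's are hypothetical fields and functions.

# One big fused block can blow past the parameter-memory limit...
@fused begin
    @. y1 = f1(x1, x2)
    @. y2 = f2(x2, x3)
    # ...many more broadcast statements...
end

# ...so today it has to be split by hand into smaller fused blocks:
@fused begin
    @. y1 = f1(x1, x2)
end
@fused begin
    @. y2 = f2(x2, x3)
end
```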
Figuring out how to partition these kernels makes working with MultiBroadcastFusion.jl a bit brittle. This is perhaps somewhat related to #24.
This is what I have so far:

```julia
##### Should work, but requires CUDA/GPUCompiler
using LLVM.Interop: isghosttype # for filtering out zero-size "ghost" argument types

function get_usage_lim(f, args)
    cuda = CUDA.active_state()
    Base.@lock CUDA.cufunction_lock begin
        kernel_f = CUDA.cudaconvert(f)
        kernel_args = map(CUDA.cudaconvert, args)
        kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}
        # methodinstance takes the function *type*, not the function object
        source = GPUCompiler.methodinstance(Core.Typeof(kernel_f), kernel_tt)
        config = CUDA.compiler_config(cuda.device)
        (; ptx, cap) = config.params
        # validate use of parameter memory
        argtypes = filter([CUDA.KernelState, source.specTypes.parameters...]) do dt
            !isghosttype(dt) && !Core.Compiler.isconstType(dt)
        end
        param_usage = sum(sizeof, argtypes)
        param_limit = cap >= v"7.0" && ptx >= v"8.1" ? 32764 : 4096
        # also return ptx/cap so that `msg` below can be rendered by the caller
        return (; param_usage, param_limit, ptx, cap)
    end
end

function msg(; ptx, cap, param_limit, param_usage)
    return """Kernel invocation uses too much parameter memory.
    $(Base.format_bytes(param_usage)) exceeds the $(Base.format_bytes(param_limit)) limit imposed
    by sm_$(cap.major)$(cap.minor) / PTX v$(ptx.major).$(ptx.minor)."""
end
```
```julia
##### Should also work; requires CUDA and LLVM (for isghosttype), but not GPUCompiler
using LLVM.Interop: isghosttype

function get_usage_lim(f, args)
    cuda = CUDA.active_state() # was missing; needed for `cuda.device` below
    kernel_f = CUDA.cudaconvert(f)
    kernel_args = map(CUDA.cudaconvert, args)
    kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}
    config = CUDA.compiler_config(cuda.device)
    (; ptx, cap) = config.params
    # validate use of parameter memory; filter over the tuple type's parameters,
    # since `filter` cannot be applied to a `Tuple` type directly
    argtypes = filter([kernel_tt.parameters...]) do dt
        !isghosttype(dt) && !Core.Compiler.isconstType(dt)
    end
    param_usage = sum(sizeof, argtypes)
    param_limit = cap >= v"7.0" && ptx >= v"8.1" ? 32764 : 4096
    return (; param_usage, param_limit, ptx, cap)
end
```

Unlike the first version, this skips `GPUCompiler.methodinstance` and measures the converted argument types directly, so it does not account for the implicit `CUDA.KernelState` argument. It pairs with the same `msg` helper shown above.
```julia
##### May work, only requires CUDA
function get_usage_lim(f, args)
    cuda = CUDA.active_state() # was missing; needed for `cuda.device` below
    kernel_f = CUDA.cudaconvert(f)
    kernel_args = map(CUDA.cudaconvert, args)
    kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}
    config = CUDA.compiler_config(cuda.device)
    (; ptx, cap) = config.params
    # validate use of parameter memory; without isghosttype, only constant
    # types are dropped (ghost types contribute zero to the sum anyway)
    argtypes = filter(dt -> !Core.Compiler.isconstType(dt), [kernel_tt.parameters...])
    param_usage = sum(sizeof, argtypes)
    param_limit = cap >= v"7.0" && ptx >= v"8.1" ? 32764 : 4096
    return (; param_usage, param_limit, ptx, cap)
end
```

We need some sort of strategy for splitting kernels; one possible direction is sketched below.
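Purely a sketch under assumptions: represent each fused assignment as a `(dest, bc)` pair, estimate its parameter cost from the `cudaconvert`-ed types, and greedily pack pairs into groups that stay under the device limit. The names `estimate_param_cost` and `partition_pairs`, and the fixed `overhead` term, are hypothetical, not existing MultiBroadcastFusion API:

```julia
import CUDA

# Hypothetical cost model: parameter bytes contributed by one (dest, bc) pair,
# estimated from the types the arguments convert to on the device.
estimate_param_cost(pair) = sum(x -> sizeof(Core.Typeof(CUDA.cudaconvert(x))), pair)

# Hypothetical greedy packer: fill each group until adding the next pair would
# exceed `param_limit`, then start a new group. `overhead` is a guessed
# allowance for the kernel's implicit arguments (e.g. CUDA.KernelState).
function partition_pairs(pairs; param_limit, overhead = 64)
    groups = Vector{Vector{eltype(pairs)}}()
    current, usage = eltype(pairs)[], overhead
    for pair in pairs
        cost = estimate_param_cost(pair)
        if !isempty(current) && usage + cost > param_limit
            push!(groups, current) # close the group that is about to overflow
            current, usage = eltype(pairs)[], overhead
        end
        push!(current, pair)
        usage += cost
    end
    isempty(current) || push!(groups, current)
    return groups # each group becomes one fused kernel launch
end
```

Each group could then be fused and launched separately, with `get_usage_lim` as a final check. Note this estimate overcounts when pairs within a group share arguments, since a shared array is passed to the kernel only once.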