
CUDA? #5

Open
AshtonSBradley opened this issue Nov 10, 2024 · 7 comments

Comments

@AshtonSBradley

Hi, very interested to see this happening. Is there any chance this will support CUDA.jl?

@fgerick
Owner

fgerick commented Nov 12, 2024

I was planning to look into this when I find the time, yes. I haven't done that before and it might take me some time to get a Yggdrasil recipe ready that compiles the CUDA library. After that it should be pretty straightforward. Perhaps it is also worth writing the wrappers around the cushtns_ functions beforehand, so one can use a compiled library instead of SHTns_jll.

@nschaeff

I'm not sure (because I don't know Julia well enough), but it seems to me that directly calling the GPU routines when the array is on the GPU should be straightforward.
This is what I did to support cupy: if a cupy array (which resides on the GPU) is passed to the wrapper routine, it calls the GPU routines.

Is there a simple way to know if an array resides on the GPU?

@fgerick
Owner

fgerick commented Nov 12, 2024

Yes, once the wrapper routines are implemented (I believe several arrays of the config, grid, etc. need to be CuArrays on the GPU), writing the ccall for a GPU array is straightforward, and thanks to Julia's multiple dispatch it will be very high-level for the user, without much code change needed.
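
To illustrate the pattern (the names here are hypothetical, not the actual SHTns.jl API), the same generic function can route to the CPU or the GPU routine purely from the argument type:

using CUDA

# Toy sketch of the dispatch pattern (not the real SHTns.jl API):
# the same generic function picks a CPU or GPU method from the array type.
transform_path(qlm::Array)   = "cpu path (would ccall SH_to_spat)"
transform_path(qlm::CuArray) = "gpu path (would ccall cu_SH_to_spat)"

transform_path(zeros(ComplexF64, 10))      # "cpu path (would ccall SH_to_spat)"
transform_path(CUDA.zeros(ComplexF64, 10)) # "gpu path (would ccall cu_SH_to_spat)"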

@AshtonSBradley
Author

I was planning to look into this when I find the time, yes. I haven't done that before and it might take me some time to get a Yggdrasil recipe ready that compiles the CUDA library. After that it should be pretty straightforward. Perhaps it is also worth writing the wrappers around the cushtns_ functions beforehand, so one can use a compiled library instead of SHTns_jll.

I could be misunderstanding, but I am fairly sure that you don't have to do this: CUDA.jl installs all of this for you. As long as the drivers are installed, all you need is CUDA.jl, which is very mature, easy to use, and supports the array interface. It basically uses a "transpiler" to lower a subset of Julia to native CUDA, so everything in that subset "just works":

julia> using CUDA

julia> cu
cu (generic function with 1 method)

julia> cu(randn(1000,1000))
1000×1000 CuArray{Float32, 2, CUDA.DeviceMemory}:
 -2.20339     1.44234    -0.177612   …   0.196209  -0.676572   -0.627904
 -0.345614    1.85168    -2.25062        0.620883  -0.0916078  -1.01014
  0.301055   -0.475698    0.262541       0.163865   1.73029     0.785342
 -0.0333346  -1.44696     0.24885       -0.689432   0.579899    0.0947351
 -1.47289     0.438792    1.10364       -0.656165  -0.808387    0.402221
  1.88226     0.33446    -0.134357   …   0.389446  -0.07918     0.4424
 -0.234209    0.296305    1.3775        -1.02765    0.113215   -0.287571
 -0.712484   -0.625674    0.738943      -0.361901  -0.850789    0.245523
  1.60886    -0.0674538  -0.387519       0.512214  -0.597321    0.785137
 -1.40604     1.34734    -0.415162      -0.82778   -1.6658     -1.45493
  ⋮                                  ⋱                         
  0.165213    0.107774    0.225846      -1.08948    1.67        1.51029
  0.0838663  -1.98239    -1.27673        2.04546    0.958453   -1.31763
 -0.417       0.740629   -0.132767      -1.67897    1.44722     1.29719
 -0.513159   -1.43663    -0.0568169      1.00033   -0.776083    1.39976
  0.2583      0.418289    0.172831   …   0.107581   0.465972   -2.46566
 -0.926377    0.36303     0.566888      -0.670945   0.123907   -0.316887
  0.704769    0.492781    1.56844        1.26098   -1.83675    -0.303527
 -0.488667   -1.13031     0.956114       0.302873  -0.180226   -0.191216
 -0.771964   -0.173448    0.967076      -0.267228   1.30337    -1.03358

So determining whether an array is on the device is just a matter of checking its type.
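
For example, a quick check in the REPL:

julia> using CUDA

julia> x = cu(randn(4));

julia> x isa CuArray
true

julia> randn(4) isa CuArray
false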

@AshtonSBradley
Author

Running the tests gives some info about the stack that CUDA installs:

(@v1.11) pkg> test CUDA
...
✓ CUDA_Driver_jll
  ✓ JuliaNVTXCallbacks_jll
  ✓ LLVMExtra_jll
  ✓ demumble_jll
  ✓ SortingAlgorithms
  ✓ LLVM
  ✓ CUDA_Runtime_jll
  ✓ ColorTypes
  ✓ PrettyTables
  ✓ Colors
  ✓ StaticArrays → StaticArraysStatisticsExt
  ✓ Adapt → AdaptStaticArraysExt
  ✓ LLVM → BFloat16sExt
  ✓ GPUArraysCore
  ✓ NVTX
  ✓ UnsafeAtomicsLLVM
  ✓ GPUCompiler
  ✓ GPUArrays
...
    Testing Running tests...
┌ Info: System information:
│ CUDA runtime 11.8, artifact installation
│ CUDA driver 11.7
│ NVIDIA driver 515.105.1
│ 
│ CUDA libraries: 
│ - CUBLAS: 11.11.3
│ - CURAND: 10.3.0
│ - CUFFT: 10.9.0
│ - CUSOLVER: 11.4.1
│ - CUSPARSE: 11.7.5
│ - CUPTI: 2022.3.0 (API 18.0.0)
│ - NVML: 11.0.0+515.105.1
│ 
│ Julia packages: 
│ - CUDA: 5.5.2
│ - CUDA_Driver_jll: 0.10.3+0
│ - CUDA_Runtime_jll: 0.15.3+0
│ 
│ Toolchain:
│ - Julia: 1.11.1
│ - LLVM: 16.0.6
│ 
│ 1 device:
└   0: NVIDIA TITAN V (sm_70, 10.266 GiB / 12.000 GiB available)

@nschaeff

I believe several arrays of the config, grid, etc. need to be CuArrays on the GPU

If the shtns library is compiled with CUDA, then when you create a plan there will be internal copies of everything needed on the GPU. The shtns plan itself is an object that stays on the CPU.
All that is needed would be functions that accept CuArrays and then call the cu_SH_to_spat routine (for instance) instead of SH_to_spat.
The same plan can run on the CPU or the GPU:

SH_to_spat(shtns_config,  qlm, q);     // run on cpu: qlm and q are arrays in cpu memory
cu_SH_to_spat(shtns_config,  qlm_gpu,  q_gpu);    // run on gpu: qlm_gpu and q_gpu are arrays in gpu memory

For the shtns C library, if the CUDA toolkit is installed and the environment variable CUDA_PATH is correctly set, then
./configure --enable-cuda gives you GPU support.

So I guess a quick hack would be:

  1. Install the SHTns C library on your system, with --enable-cuda.

  2. Use this "external library" as described here:
    https://github.com/fgerick/SHTns.jl/blob/master/docs/src/index.md

  3. Replicate the functions you need, for instance synth from here:
    https://github.com/fgerick/SHTns.jl/blob/9ffc35aa3c07c8949450499e4518b7ab8e443743/src/synth.jl#L6C1-L12C4

with a CuArray as the type for qlm, and make sure to call cu_SH_to_spat instead of SH_to_spat.
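
A rough sketch of such a GPU method (the library path LIBSHTNS, the bare Ptr{Cvoid} config handle, and the argument types are assumptions for illustration; check them against the SHTns headers and the existing synth.jl):

using CUDA

# Assumed path to a locally built, CUDA-enabled libshtns (adjust as needed).
const LIBSHTNS = "/usr/local/lib/libshtns.so"

# GPU synthesis: qlm and the output grid q live in GPU memory,
# so cu_SH_to_spat is called instead of SH_to_spat.
function synth_gpu!(cfg::Ptr{Cvoid}, qlm::CuArray{ComplexF64}, q::CuArray{Float64})
    ccall((:cu_SH_to_spat, LIBSHTNS), Cvoid,
          (Ptr{Cvoid}, CuPtr{ComplexF64}, CuPtr{Float64}),
          cfg, qlm, q)
    return q
end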

This should work, I think. @AshtonSBradley could you try it? Otherwise, I may give it a try before the end of the year.

@fgerick
Owner

fgerick commented Nov 13, 2024

What I meant to say is that the compilation of the C library SHTns using BinaryBuilder, which produces the Julia artifact SHTns_jll (the compiled shared-library version of SHTns), requires some alteration of the Yggdrasil build script in order to have the CUDA functions compiled and callable.

I agree that afterwards it's all taken care of by CUDA.jl. In fact, I compiled a CUDA-enabled version of SHTns and made a simple synthesis work. Over the next week or so I will try to make a more complete wrapper and upload it.

Then one can use a custom SHTns build until I manage to make the GPU-enabled SHTns_jll work.

@fgerick fgerick mentioned this issue Nov 17, 2024