
CUDA? #5

Open
AshtonSBradley opened this issue Nov 10, 2024 · 7 comments

Comments

@AshtonSBradley

Hi, very interested to see this happening. Is there any chance this will support CUDA.jl?

@fgerick
Owner

fgerick commented Nov 12, 2024

I was planning to look into this when I find the time, yes. I haven't done that before and it might take me some time to get a Yggdrasil recipe ready that compiles the CUDA library. After that it should be pretty straightforward. Perhaps it is also worth writing the wrappers around the cushtns_ functions beforehand, so one can use a compiled library instead of SHTns_jll.

@nschaeff

I'm not sure (because I don't know Julia well enough), but it seems to me that directly calling the GPU routines when the array is on the GPU should be straightforward.
This is what I did to support cupy: if a cupy array (which resides on the GPU) is passed to the wrapper routine, it calls the GPU routines.

Is there a simple way to know if an array resides on the GPU?

@fgerick
Owner

fgerick commented Nov 12, 2024

Yes, once the wrapper routines are implemented (I believe several arrays of the config, grid, etc. need to be CuArrays on the GPU), writing the ccall for a GPU array is straightforward, and thanks to Julia's multiple dispatch it will be very high-level for the user, without much code change needed.
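
To illustrate the pattern (the names here are hypothetical, not the actual SHTns.jl API), the same generic function can route to the CPU or the GPU routine purely from the argument type:

using CUDA

# Toy sketch of the dispatch pattern (not the real SHTns.jl API):
# the same generic function picks a CPU or GPU method from the array type.
transform_path(qlm::Array)   = "cpu path (would ccall SH_to_spat)"
transform_path(qlm::CuArray) = "gpu path (would ccall cu_SH_to_spat)"

transform_path(zeros(ComplexF64, 10))      # "cpu path (would ccall SH_to_spat)"
transform_path(CUDA.zeros(ComplexF64, 10)) # "gpu path (would ccall cu_SH_to_spat)"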

@AshtonSBradley
Author

I was planning to look into this when I find the time, yes. I haven't done that before and it might take me some time to get a Yggdrasil recipe ready that compiles the CUDA library. After that it should be pretty straightforward. Perhaps it is also worth writing the wrappers around the cushtns_ functions beforehand, so one can use a compiled library instead of SHTns_jll.

I could be misunderstanding, but I am fairly sure that you don't have to do this: CUDA.jl installs all of this for you. As long as the drivers are installed, all you need is CUDA.jl, which is very mature, easy to use, and supports the array interface. It basically uses a "transpiler" to lower a subset of Julia to native CUDA, so everything in that subset "just works":

julia> using CUDA

julia> cu
cu (generic function with 1 method)

julia> cu(randn(1000,1000))
1000×1000 CuArray{Float32, 2, CUDA.DeviceMemory}:
 -2.20339     1.44234    -0.177612   …   0.196209  -0.676572   -0.627904
 -0.345614    1.85168    -2.25062        0.620883  -0.0916078  -1.01014
  0.301055   -0.475698    0.262541       0.163865   1.73029     0.785342
 -0.0333346  -1.44696     0.24885       -0.689432   0.579899    0.0947351
 -1.47289     0.438792    1.10364       -0.656165  -0.808387    0.402221
  1.88226     0.33446    -0.134357   …   0.389446  -0.07918     0.4424
 -0.234209    0.296305    1.3775        -1.02765    0.113215   -0.287571
 -0.712484   -0.625674    0.738943      -0.361901  -0.850789    0.245523
  1.60886    -0.0674538  -0.387519       0.512214  -0.597321    0.785137
 -1.40604     1.34734    -0.415162      -0.82778   -1.6658     -1.45493
  ⋮                                  ⋱                         
  0.165213    0.107774    0.225846      -1.08948    1.67        1.51029
  0.0838663  -1.98239    -1.27673        2.04546    0.958453   -1.31763
 -0.417       0.740629   -0.132767      -1.67897    1.44722     1.29719
 -0.513159   -1.43663    -0.0568169      1.00033   -0.776083    1.39976
  0.2583      0.418289    0.172831   …   0.107581   0.465972   -2.46566
 -0.926377    0.36303     0.566888      -0.670945   0.123907   -0.316887
  0.704769    0.492781    1.56844        1.26098   -1.83675    -0.303527
 -0.488667   -1.13031     0.956114       0.302873  -0.180226   -0.191216
 -0.771964   -0.173448    0.967076      -0.267228   1.30337    -1.03358

So determining whether an array is on the device is just a matter of checking its type.
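
For example, a quick check in the REPL:

julia> using CUDA

julia> x = cu(randn(4));

julia> x isa CuArray
true

julia> randn(4) isa CuArray
false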

@AshtonSBradley
Author

Running the tests gives some info about the stack that CUDA installs:

(@v1.11) pkg> test CUDA
...
✓ CUDA_Driver_jll
  ✓ JuliaNVTXCallbacks_jll
  ✓ LLVMExtra_jll
  ✓ demumble_jll
  ✓ SortingAlgorithms
  ✓ LLVM
  ✓ CUDA_Runtime_jll
  ✓ ColorTypes
  ✓ PrettyTables
  ✓ Colors
  ✓ StaticArrays → StaticArraysStatisticsExt
  ✓ Adapt → AdaptStaticArraysExt
  ✓ LLVM → BFloat16sExt
  ✓ GPUArraysCore
  ✓ NVTX
  ✓ UnsafeAtomicsLLVM
  ✓ GPUCompiler
  ✓ GPUArrays
...
    Testing Running tests...
┌ Info: System information:
│ CUDA runtime 11.8, artifact installation
│ CUDA driver 11.7
│ NVIDIA driver 515.105.1
│ 
│ CUDA libraries: 
│ - CUBLAS: 11.11.3
│ - CURAND: 10.3.0
│ - CUFFT: 10.9.0
│ - CUSOLVER: 11.4.1
│ - CUSPARSE: 11.7.5
│ - CUPTI: 2022.3.0 (API 18.0.0)
│ - NVML: 11.0.0+515.105.1
│ 
│ Julia packages: 
│ - CUDA: 5.5.2
│ - CUDA_Driver_jll: 0.10.3+0
│ - CUDA_Runtime_jll: 0.15.3+0
│ 
│ Toolchain:
│ - Julia: 1.11.1
│ - LLVM: 16.0.6
│ 
│ 1 device:
└   0: NVIDIA TITAN V (sm_70, 10.266 GiB / 12.000 GiB available)

@nschaeff

I believe several arrays of the config, grid, etc. need to be CuArrays on the GPU

If the shtns library is compiled with CUDA, then when you create a plan there will be internal copies of everything needed on the GPU. The shtns plan itself is an object that stays on the CPU.
All that is needed would be functions that accept CuArrays and then call the cu_SH_to_spat routine (for instance) instead of SH_to_spat.
The same plan can run on the CPU or the GPU:

SH_to_spat(shtns_config,  qlm, q);     // run on cpu: qlm and q are arrays in cpu memory
cu_SH_to_spat(shtns_config,  qlm_gpu,  q_gpu);    // run on gpu: qlm_gpu and q_gpu are arrays in gpu memory

For the shtns C library, if the CUDA toolkit is installed and the environment variable CUDA_PATH is correctly set, then
./configure --enable-cuda gives you GPU support.

So I guess a quick hack would be:

  1. Install the SHTns C library on your system, with --enable-cuda.

  2. Use this "external library" as described here:
    https://github.com/fgerick/SHTns.jl/blob/master/docs/src/index.md

  3. Replicate the functions you need, for instance synth from here:
    https://github.com/fgerick/SHTns.jl/blob/9ffc35aa3c07c8949450499e4518b7ab8e443743/src/synth.jl#L6C1-L12C4

with a CuArray as the type for qlm, and make sure to call cu_SH_to_spat instead of SH_to_spat.
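
A rough sketch of such a GPU method (the library path LIBSHTNS, the bare Ptr{Cvoid} config handle, and the argument types are assumptions for illustration; check them against the SHTns headers and the existing synth.jl):

using CUDA

# Assumed path to a locally built, CUDA-enabled libshtns (adjust as needed).
const LIBSHTNS = "/usr/local/lib/libshtns.so"

# GPU synthesis: qlm and the output grid q live in GPU memory,
# so cu_SH_to_spat is called instead of SH_to_spat.
function synth_gpu!(cfg::Ptr{Cvoid}, qlm::CuArray{ComplexF64}, q::CuArray{Float64})
    ccall((:cu_SH_to_spat, LIBSHTNS), Cvoid,
          (Ptr{Cvoid}, CuPtr{ComplexF64}, CuPtr{Float64}),
          cfg, qlm, q)
    return q
end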

This should work, I think. @AshtonSBradley could you try it? Otherwise, I may give it a try before the end of the year.

@fgerick
Owner

fgerick commented Nov 13, 2024

What I meant to say is that the compilation of the C library SHTns using BinaryBuilder, which produces the Julia artifact SHTns_jll (the compiled shared-library version of SHTns), requires some alteration of the Yggdrasil build script in order to have the CUDA functions compiled and callable.

I agree that afterwards it's all taken care of by CUDA.jl. In fact, I compiled a CUDA-enabled version of SHTns and made a simple synthesis work. Over the next week or so I will try to make a more complete wrapper and upload it.

Then one can use a custom SHTns build until I manage to make the GPU-enabled SHTns_jll work.

@fgerick fgerick mentioned this issue Nov 17, 2024