[NDTensors] Add AMDGPU.jl (ROCm) based extension for NDTensors #1325

Conversation
At first glance this looks good, thanks! This is a nice payoff for a lot of work @kmp5VT and I have been doing to make the overload requirements for new GPU backends as minimal as possible; I'm glad to see this doesn't seem to require very much code to get working. The SVD code definitely looks like it is the most complicated part that we will have to look at carefully, but besides that it all looks similar to what we needed to overload for the …
Looks like there is some package compatibility issue in the tests.
Absolutely, this was far easier to get (mostly) working than I expected it to be, thanks to your and @kmp5VT's great work! And yes, there seem to be more parameters to the … I will try disabling scalar indexing to work on the `dot` issue.
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main    #1325       +/-   ##
===========================================
- Coverage   84.40%   53.78%   -30.62%
===========================================
  Files         100       99        -1
  Lines        8581     8528       -53
===========================================
- Hits         7243     4587     -2656
- Misses       1338     3941     +2603
One suggestion I would have to make this first PR simpler to review would be to just support an SVD implementation that transfers to CPU, and then add in other SVD backends in a future PR. I think it is a subtle question to figure out how to select different SVD backends for different devices (say you want to specify that if you run on CUDA you want to use a certain cuLAPACK implementation, but then on AMD you want to transfer to CPU to perform SVD, then on Metal you want to...). That will be a longer discussion and may require us to think about the higher level SVD interface. It may be that we should add support to the `svd` interface for passing something like:

```julia
svd_alg(::Type{<:CuArray}) = "cpu"
svd_alg(::Type{<:ROCArray}) = "jacobi_algorithm"
svd_alg(::Type{<:MtlArray}) = "cpu"

u, s, v = svd(t, (i, j); alg=svd_alg)
```

We would also like to add automated testing for AMD GPUs; @kmp5VT will start to look into that.
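For illustration, one way a function-valued `alg` like this could be resolved internally (a hypothetical sketch, not code from this PR or from NDTensors):

```julia
# Hypothetical sketch: turn a function-valued `alg`, like `svd_alg` above,
# into a concrete backend name based on the storage array type.
resolve_svd_alg(alg::AbstractString, ::Type) = alg      # already a backend name
resolve_svd_alg(alg, arraytype::Type) = alg(arraytype)  # per-device lookup

# With the `svd_alg` methods above, `resolve_svd_alg(svd_alg, ROCArray{Float64,2})`
# would return "jacobi_algorithm", while `CuArray`/`MtlArray` storage would give "cpu".
```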
@wbernoudy I am currently working on finding a way to access an AMD GPU device. Would it be possible to provide me access to your branch so I can help accelerate your development? I am not that familiar with the …
Of course, I added you as a collaborator on my fork. To be clear, this extension is already using the accelerated BLAS functions provided by …
Believe I figured this out, can skip to the next comment.

@mtfishman I tried the idea of disabling scalar indexing to see if some generic matmul function is using that in the `dot` call. Here you can see that the standard `dot` hits disallowed scalar indexing, that computing the same value via `dag(A) * B` works but is slow, and that a simple `dot` on the underlying `ROCArray` data is much faster:

```julia
julia> using AMDGPU

julia> using ITensors

julia> using NDTensors # modified from current branch to disallow scalar indexing

julia> using LinearAlgebra

julia> i, j = Index(6152, "i"), Index(1324, "j")
((dim=6152|id=857|"i"), (dim=1324|id=611|"j"))

julia> A = ITensor(randomTensor(ROCArray, (i, j)))
ITensor ord=2 (dim=6152|id=857|"i") (dim=1324|id=611|"j")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> B = ITensor(randomTensor(ROCArray, (j, i)))
ITensor ord=2 (dim=1324|id=611|"j") (dim=6152|id=857|"i")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> dot(A, B)
ERROR: Scalar indexing is disallowed.
(rest of stack trace)

julia> dag(A) * B
ITensor ord=0
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> AMDGPU.@allowscalar (dag(A) * B)[]
-3066.643034152186

julia> AMDGPU.@elapsed dag(A) * B
0.29850277f0

julia> function LinearAlgebra.dot(A::ITensor, B::ITensor)
         pB = B
         if (inds(A) != inds(B))
           pB = permute(B, inds(A))
         end
         return dot(data(tensor(A)), data(tensor(pB)))
       end

julia> dot(A, B)
-3066.643034153356

julia> AMDGPU.@elapsed dot(A, B)
0.000623521f0
```
Actually I think I have figured out the slowdown with … Calling …
This is exactly what I added the `roc_to_cpu_svd` option for. @mtfishman would it be better if I removed the Jacobi implementation from this PR?
Yes, I understand that. I just don't like that interface. How would you choose the CPU SVD backend? Would we need to define a new …
Yes, I think so. The code you wrote seems more appropriate to add as an …
Very much agreed. An orthogonal argument that allows you to choose the device (in addition to the SVD algorithm) sounds great to me. Either way, I'm happy to either wait until a more precise API has been figured out to finalize this PR, or keep the awkward …
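For concreteness, such an interface might look something like the following (a sketch only; the `device` keyword is hypothetical and not part of the current `svd` interface):

```julia
# Hypothetical: choose the SVD algorithm and, independently, the device to run it on.
u, s, v = svd(t, (i, j); alg="divide_and_conquer", device=NDTensors.cpu)
u, s, v = svd(t, (i, j); alg="jacobi_algorithm", device=NDTensors.roc)
```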
I figured this might be the case, and it certainly makes sense to handle the out-of-memory error properly in …
Glad we're in agreement about that. I think since there will be only one SVD backend for …
Sorry, I didn't see you referring to …
Oh, I didn't think about just overloading …
@wbernoudy Hi, just giving you an update. I now have access to an AMD GPU and have made a few changes (nothing significant). One thing to note is that the changes I am making in another PR also affect the work here. My plan is to finish that other PR first, then pull the changes in here and make the corresponding modifications so that we don't have to write this code twice. I am hoping to finish #1331 this week and put more time into your PR next week.
Looks good! Thanks @wbernoudy and @kmp5VT.
Description
This PR adds an AMDGPU.jl extension for NDTensors following the same structure as the CUDA extension.
It is still a WIP, but I am hoping to get feedback on a few points before finishing it up.
Calling `dot` between `ROCArray`-backed `ITensor`s sometimes fails or falls back on slow default methods. If I implement a simple method that makes a copy and permutes the indices if they are different, I can then just call `dot` on the underlying `ROCArray`s, like so:
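A sketch following the `LinearAlgebra.dot` overload from the REPL session earlier in the conversation:

```julia
using ITensors, NDTensors, LinearAlgebra

# Permute one tensor if the index orders differ, then call `dot` directly
# on the underlying data arrays (e.g. `ROCArray`s).
function LinearAlgebra.dot(A::ITensor, B::ITensor)
  pB = B
  if inds(A) != inds(B)
    pB = permute(B, inds(A))
  end
  return dot(data(tensor(A)), data(tensor(pB)))
end
```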
I'm having trouble understanding which low-level functions (presumably in `LinearAlgebra`) `ITensors.dot` actually calls. Trying to trace the functions in the Julia REPL brings me to `NDTensors.contract`, but I can't figure out what it calls past that; this brings me to a file in `SimpleTraits.jl`. I expect it's very similar to an issue I was having with other slow tensor contractions with `ROCArray`-backed `ITensor`s, which was fixed by exposing the right parameters for `LinearAlgebra.mul!` in `AMDGPU.jl` (JuliaGPU/AMDGPU.jl#585).

Though this PR does implement the rocSOLVER Jacobi method for SVD, I found it to be significantly slower than divide-and-conquer on CPU for the TDVP simulations I'm doing. To make using the CPU SVD work I added a `roc_to_cpu_svd` option for `svd_alg`, which simply copies the matrix to host, calls SVD, and then converts U, S, and V back to `ROCArray`s. However, I'm guessing there may be a cleaner or more general way to do this (maybe to allow the other `NDTensors` extensions, like the CUDA extension, to also easily use CPU for SVD).
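To make the intended behavior concrete, here is a minimal sketch of the CPU-transfer strategy (illustrative only; `cpu_svd_for_roc` is a hypothetical name and the actual `roc_to_cpu_svd` code in this PR may differ):

```julia
using AMDGPU, LinearAlgebra

# Copy the device matrix to host, run LAPACK's SVD there, and move the
# resulting factors back to the device.
function cpu_svd_for_roc(A::ROCArray{T,2}) where {T}
  A_cpu = Array(A)                 # device -> host copy
  F = svd(A_cpu)                   # divide-and-conquer SVD on CPU
  U, S, V = Matrix(F.U), F.S, Matrix(F.V)
  return ROCArray(U), ROCArray(S), ROCArray(V)  # host -> device copies
end
```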
When calling the rocSOLVER SVD algorithm, the function may try to allocate extra memory on the GPU (even though I can't find an explanation for this in AMD's rocSOLVER documentation), so if the memory pool has allocated the entire GPU's memory (i.e. `AMDGPU.Mem.free()` returns 0) the function may fail. This is why I am trimming the pool if there is less than 1 GB free before calling `gesvdj!`. I will try to investigate this more to see if there is a proper way to handle it.

There is a small bug in the rocSOLVER SVD algorithms in `AMDGPU.jl` which has been fixed but not released yet ("Use adjoint instead of transpose when returning V in SVD", JuliaGPU/AMDGPU.jl#588).

How Has This Been Tested?
I added the new `NDTensors.roc` function to the devices list in the `NDTensors` tests. However, there are several tests failing due to:

- The `dot` issue described above.
- `svd` called directly on `ROCArray`s, which seems to return junk (see ITensors.jl/NDTensors/src/lib/Unwrap/test/runtests.jl, line 94 in 98b95a2; possibly an issue in `AMDGPU.jl`).
- `ql` on `ROCArray`s, which is not handled by `AMDGPU.jl` (similar to `CUDA.jl`). There seems to be an exception made for CUDA arrays in various QL decomposition functions, e.g. ITensors.jl/NDTensors/src/linearalgebra/linearalgebra.jl, line 392 in 98b95a2, so something similar may be needed for `ROCArray`s here.

I also added an example script similar to the CUDA one.
Checklist:
- Run `using JuliaFormatter; format(".")` in the base directory of the repository (`~/.julia/dev/ITensors`) to format your code according to our style guidelines.