
[NDTensors] Add AMDGPU.jl (ROCm) based extension for NDTensors #1325

Merged (41 commits, Mar 21, 2024)

Conversation

wbernoudy
Contributor

Description

This PR adds an AMDGPU.jl extension for NDTensors following the same structure as the CUDA extension.

It is still a WIP, but I am hoping to get feedback on a few points before finishing it up.

  1. Dot product on ITensors sometimes fails or falls back on slow default methods. If I implement a simple method that makes a copy and permutes the indices if they are different, I can then just call dot on the underlying ROCArrays, like so:
import LinearAlgebra: dot  # extend dot so the ITensor method below takes precedence

function dot(A::ITensor, B::ITensor)
  pB = B
  if inds(A) != inds(B)
    pB = permute(B, inds(A))  # align the index order of B with A
  end
  # dot on the underlying ROCArrays dispatches to the fast GPUArrays reduction
  return dot(NDTensors.data(tensor(A)), NDTensors.data(tensor(pB)))
end

I'm having trouble understanding which low-level functions (presumably in LinearAlgebra) ITensors.dot actually calls. Tracing the calls in the Julia REPL brings me to NDTensors.contract, but I can't figure out what it calls past that.

Tracing further with @edit (shown below) brings me to a file in SimpleTraits.jl:

julia> i, j = Index(10, "i"), Index(5, "j")
((dim=10|id=395|"i"), (dim=5|id=574|"j"))

julia> A = ITensor(randomTensor(ROCArray, (i, j)))
ITensor ord=2 (dim=10|id=395|"i") (dim=5|id=574|"j")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> B = ITensor(randomTensor(ROCArray, (i, j)))
ITensor ord=2 (dim=10|id=395|"i") (dim=5|id=574|"j")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> labelsA, labelsB = ITensors.compute_contraction_labels(inds(A), inds(B))
((-1, -2), (-1, -2))

julia> @edit NDTensors.contract(A.tensor, labelsA, B.tensor, labelsB)

I expect it's very similar to an issue I had with another slow tensor contraction on ROCArray-backed ITensors, which I fixed by exposing the right parameters for LinearAlgebra.mul! in AMDGPU.jl (JuliaGPU/AMDGPU.jl#585).

  2. Though this PR does implement the rocSOLVER Jacobi method for SVD, I found it to be significantly slower than divide-and-conquer on the CPU for the TDVP simulations I'm doing. To make the CPU SVD usable I added a roc_to_cpu_svd option for svd_alg, which simply copies the matrix to the host, calls svd there, and converts U, S, and V back to ROCArrays (see the sketch after this list). However, I'm guessing there may be a cleaner or more general way to do this (maybe one that also lets the other NDTensors extensions, like the CUDA extension, easily use the CPU for SVD).

  3. When calling the rocSOLVER SVD algorithm, the function may try to allocate extra memory on the GPU (I can't find any explanation for this in AMD's rocSOLVER documentation), so if the memory pool has already claimed the entire GPU's memory (i.e. AMDGPU.Mem.free() returns 0) the call may fail. This is why I trim the pool if there is less than 1 GB free before calling gesvdj!. I will investigate this further to see if there is a proper way to handle it.

  4. There is a small bug in the rocSOLVER SVD algorithms in AMDGPU.jl which has been fixed but not yet released (Use adjoint instead of transpose when returning V in SVD, JuliaGPU/AMDGPU.jl#588).
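
For concreteness, here is a minimal sketch of what the roc_to_cpu_svd path amounts to (illustrative only: the helper name and return convention here are not the exact code in this PR):

using LinearAlgebra: svd
using AMDGPU: ROCArray, ROCMatrix

function roc_to_cpu_svd(A::ROCMatrix)
  F = svd(Array(A))  # copy to the host and run LAPACK's divide-and-conquer SVD on the CPU
  # Move U, S, and V back to the GPU; collect materializes the lazy adjoint F.V.
  return ROCArray(F.U), ROCArray(F.S), ROCArray(collect(F.V))
end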

How Has This Been Tested?

I added the new NDTensors.roc function to the devices list in the NDTensors tests. However, there are several tests failing due to:

  1. The aforementioned dot issue
  2. Tests that call svd directly on ROCArrays, which seems to return junk:
    U, S, V, = svd(expose(mp))
    I believe this should be fixed on the AMDGPU.jl side.
  3. Many tests call ql on ROCArrays, which is not handled by AMDGPU.jl (similar to CUDA.jl). There seems to be an exception made for CUDA arrays in various QL decomposition functions; I'm not sure if I should add similar code for ROCArrays there.

I also added an example script similar to the CUDA one.

Checklist:

  • My code follows the style guidelines of this project. Please run using JuliaFormatter; format(".") in the base directory of the repository (~/.julia/dev/ITensors) to format your code according to our style guidelines.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that verify the behavior of the changes I made.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • Any dependent changes have been merged and published in downstream modules.

@mtfishman
Member

mtfishman commented Feb 2, 2024

At first glance this looks good, thanks! This is a nice payoff for a lot of work @kmp5VT and I have been doing to make the overload requirements for new GPU backends as minimal as possible; I'm glad to see this doesn't seem to require very much code to get working.

The SVD code definitely looks like it is the most complicated part that we will have to look at carefully, but besides that it all looks similar to what we needed to overload for the CUDA and Metal backends.

In terms of dot, it just calls tensor contraction: https://github.com/ITensor/ITensors.jl/blob/v0.3.55/src/itensor.jl#L1905, so maybe you are hitting a corner case of tensor contraction that is going wrong for the AMDGPU backend? You would have to share a minimal example for us to have more information. Sorry, I see now that you shared an example and started investigating the call stack. I can't remember if we have a special code path for inner-product-like contractions; we would have to look into that. Something you could do is use a profiler to investigate where the slow code is, or disable scalar indexing and see if it errors somewhere.
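
For example, scalar indexing can be switched off globally through GPUArraysCore (which AMDGPU.jl builds on), and the contraction can be run under the standard sampling profiler; a rough sketch, assuming A and B are the ROCArray-backed ITensors from your example:

using GPUArraysCore: allowscalar
allowscalar(false)  # scalar getindex/setindex! on GPU arrays now throws instead of silently falling back

using Profile
@profile dag(A) * B          # run the slow contraction under the sampling profiler
Profile.print(mincount=50)   # only show frames that were sampled a reasonable number of times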

@mtfishman
Member

Looks like there is some package compatibility issue in the tests.

@wbernoudy
Contributor Author

Absolutely, this was far easier to get (mostly) working than I expected, thanks to your and @kmp5VT's great work!

And yes, there seem to be more parameters available for the gesvdj functions in rocSOLVER than in cuSOLVER. We can definitely reconsider the defaults I have set there, or perhaps allow a way to pass them in from the DMRG algorithms.

I will try disabling scalar indexing to work on the dot issue; I hadn't tried that yet.

@codecov-commenter

codecov-commenter commented Feb 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.78%. Comparing base (f4ad958) to head (c03a9bc).

❗ Current head c03a9bc differs from pull request most recent head b0356a2. Consider uploading reports for the commit b0356a2 to get more accurate results


Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1325       +/-   ##
===========================================
- Coverage   84.40%   53.78%   -30.62%     
===========================================
  Files         100       99        -1     
  Lines        8581     8528       -53     
===========================================
- Hits         7243     4587     -2656     
- Misses       1338     3941     +2603     


@mtfishman
Member

One suggestion I would have to make this first PR simpler to review would be to just support an SVD implementation that transfers to CPU, and then add in other SVD backends in a future PR.

I think it is a subtle question to figure out how to select different SVD backends for different devices (say you want to specify that if you run on CUDA you want to use a certain cuLAPACK implementation, but then on AMD you want to transfer to CPU to perform SVD, then on Metal you want to...). That will be a longer discussion and may require us to think about the higher level SVD interface.

It may be that we should add support in the svd function so that the alg keyword argument could, in general, be a function which takes in information about the backend and returns a string or Algorithm object depending on that backend, i.e.:

svd_alg(::Type{<:CuArray}) = "cpu"
svd_alg(::Type{<:ROCArray}) = "jacobi_algorithm"
svd_alg(::Type{<:MtlArray}) = "cpu"

u, s, v = svd(t, (i, j); alg=svd_alg)

We would also like to add automated testing for AMD GPUs, @kmp5VT will start to look into that.

@kmp5VT
Collaborator

kmp5VT commented Feb 2, 2024

@wbernoudy I am currently working on finding a way to access an AMD GPU device. Would it be possible to provide me access to your branch so I can help accelerate your development? I am not that familiar with the AMDGPU.jl package, but it does look like there are rocblas and rocsolver directories which link the Julia code to the C libraries in ROCm. We should be able to access these functions through linearalgebra.jl; if we can't, that's something we could bring up with AMDGPU.jl and it should be a relatively easy implementation. Regarding the dot function, I believe we are using LinearAlgebra.mul! to launch a BLAS dot function.

@wbernoudy
Contributor Author

wbernoudy commented Feb 2, 2024

Would it be possible to provide me access to your branch so I can help accelerate your development?

Of course, I added you as a collaborator on my fork.

To be clear, this extension is already using the accelerated BLAS functions provided by AMDGPU.jl for tensor operations, and it is using the SVD provided in rocSOLVER (and even eigen methods).

AMDGPU.jl now provides both the three- and five-argument versions of LinearAlgebra.mul!. But I'm guessing there is a certain transformation on one or more of the matrices used somewhere in the call stack of dot that is not yet handled by AMDGPU.jl. I found another of these cases, which I fixed here: https://github.com/wbernoudy/ITensors.jl/blob/01f6c71844114eb92a9a50c4b23343161b9b028f/NDTensors/ext/NDTensorsROCmExt/mul.jl#L27
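
To give an idea of the kind of overload involved, here is a rough, hypothetical sketch (the real code in mul.jl dispatches on NDTensors' Exposed wrappers, and mul_with_transposed_dest! is just an illustrative name): rewriting transpose(C) = α*A*B + β*transpose(C) as C = α*transpose(B)*transpose(A) + β*C lets rocBLAS see a plain, unwrapped destination.

using LinearAlgebra
using AMDGPU: ROCMatrix

# Hypothetical helper, not the actual method defined in mul.jl.
function mul_with_transposed_dest!(
  Ct::Transpose{<:Any,<:ROCMatrix}, A::ROCMatrix, B::ROCMatrix, α::Number, β::Number
)
  # Equivalent multiplication with an ordinary ROCMatrix destination:
  mul!(parent(Ct), transpose(B), transpose(A), α, β)
  return Ct
end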

@wbernoudy
Contributor Author

wbernoudy commented Feb 9, 2024

(Edit: I believe I figured this out; you can skip to the next comment.)

@mtfishman I tried disabling scalar indexing to see if some generic matmul function is using it in the dot code path, but no luck. I disabled the indexing simply by commenting out the relevant functions in NDTensors/ext/NDTensorsAMDGPUExt/indexing.jl, which I believe was the right way to test this (and I do see it reach ERROR: Scalar indexing is disallowed. on some calls). Is there another way it could be allowing scalar indexing that I might be missing?

Here you can see the standard dot function does not hit any scalar indexing (other than the final read of the zero-dimensional result tensor) but is ~500x slower than calling dot on the underlying arrays.

julia> using AMDGPU

julia> using ITensors

julia> using NDTensors  # modified from current branch to disallow scalar indexing

julia> using LinearAlgebra

julia> i, j = Index(6152, "i"), Index(1324, "j")
((dim=6152|id=857|"i"), (dim=1324|id=611|"j"))

julia> A = ITensor(randomTensor(ROCArray, (i, j)))
ITensor ord=2 (dim=6152|id=857|"i") (dim=1324|id=611|"j")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> B = ITensor(randomTensor(ROCArray, (j, i)))
ITensor ord=2 (dim=1324|id=611|"j") (dim=6152|id=857|"i")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> dot(A, B)
ERROR: Scalar indexing is disallowed.
(rest of stack trace)

julia> dag(A) * B
ITensor ord=0
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> AMDGPU.@allowscalar (dag(A) * B)[]
-3066.643034152186

julia> AMDGPU.@elapsed dag(A) * B
0.29850277f0

julia> function LinearAlgebra.dot(A::ITensor, B::ITensor)
         pB = B
         if (inds(A) != inds(B))
           pB = permute(B, inds(A))
         end
         return dot(data(tensor(A)), data(tensor(pB)))
       end

julia> dot(A, B)
-3066.643034153356

julia> AMDGPU.@elapsed dot(A, B)
0.000623521f0

@wbernoudy
Contributor Author

I think I have now figured out the slowdown with dot (so my previous comment can be ignored). The issue is that it is calling the AMDGPU gemm method, but that method is quite slow when used for a dot product on large vectors. Perhaps other GPU libraries (e.g. CUDA) handle that case better, but I'm guessing the rocblas_{T}gemm functions are doing the naive thing and having threads write to the same output value atomically.

Calling dot directly on the arrays calls the method defined in the GPUArrays library, which uses mapreduce and seems to be far faster. This makes me think there should be a definition of dot in ITensors.jl that calls dot on the underlying arrays instead of implicitly calling general matmul by contracting as normal. I'd be happy to make a separate PR for this.
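
For illustration, the GPUArrays-style approach amounts to a single fused reduction over both arrays rather than a GEMM call; a sketch (GPUArrays.jl already ships an optimized generic version of this, so the function name here is purely illustrative):

using AMDGPU: ROCArray

# One reduction kernel over both arrays, no rocBLAS gemm involved.
mapreduce_dot(x::ROCArray, y::ROCArray) = mapreduce((a, b) -> conj(a) * b, +, x, y)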

@wbernoudy
Contributor Author

One suggestion I would have to make this first PR simpler to review would be to just support an SVD implementation that transfers to CPU, and then add in other SVD backends in a future PR.

This is exactly what I added the "roc_to_cpu_svd" alg type for. I would be happy to make this the default in this PR (instead of "jacobi_algorithm").

@mtfishman Would it be better if I removed the Jacobi implementation from this PR?

@mtfishman
Member

mtfishman commented Feb 12, 2024

One suggestion I would have to make this first PR simpler to review would be to just support an SVD implementation that transfers to CPU, and then add in other SVD backends in a future PR.

This is exactly what I added the "roc_to_cpu_svd" alg type for. I would be happy to make this the default in this PR (instead of "jacobi_algorithm").

Yes, I understand that. I just don't like that interface. How would you choose the CPU SVD backend? Would we need to define a new "*_to_cpu_svd" for every GPU backend? That doesn't seem like a sustainable forward-looking API. I think we will have to think more carefully about how to allow users to control device transfers in low level operations more generally, perhaps with an API that is orthogonal to the alg keyword argument.

@mtfishman Would it be better if I removed the Jacobi implementation from this PR?

Yes, I think so. The code you wrote seems more appropriate to add as an svd function to AMDGPU.jl, and given that it doesn't appear to help with performance it doesn't seem very compelling to add it here and have to support it. It would also simplify the code a lot and let us push off a decision about what to name the alg arguments.

@wbernoudy
Contributor Author

Yes, I understand that. I just don't like that interface. How would you choose the CPU SVD backend? Would we need to define a new "*_to_cpu_svd" for every GPU backend? That doesn't seem like a sustainable forward-looking API.

Very much agreed. An orthogonal argument that allows you to choose the device (in addition to the SVD algorithm) sounds great to me.

Either way, I'm happy to either wait until after a more precise API has been figured out to finalize this PR, or keep the awkward roc_to_cpu_svd for now. Or perhaps there is a cleaner way to do this now without the API change? My understanding is that we need something for the name of the SVD alg type for ROCMatrix, as it must be used here:

alg = replace_nothing(alg, default_svd_alg(T))

The code you wrote seems more appropriate to add as an svd function to AMDGPU.jl

I figured this might be the case, and it certainly makes sense to handle the out-of-memory issue properly in AMDGPU.jl instead. I will take this out 👍

@mtfishman
Member

Yes, I understand that. I just don't like that interface. How would you choose the CPU SVD backend? Would we need to define a new "*_to_cpu_svd" for every GPU backend? That doesn't seem like a sustainable forward-looking API.

Very much agreed. An orthogonal argument that allows you to choose the device (in addition to the SVD algorithm) sounds great to me.

Either way, I'm happy to either wait until after a more precise API has been figured out to finalize this PR, or keep the awkward roc_to_cpu_svd for now. Or perhaps there is a cleaner way to do this now without the API change? My understanding is that we need something for the name of the SVD alg type for ROCMatrix, as it must be used here:

Glad we're in agreement about that. I think since there will be only one SVD backend for ROCMatrix with the current plan (one where we transfer to CPU), we wouldn't need to involve the alg keyword argument in this PR; see, for example, the factorizations defined for Metal: https://github.com/ITensor/ITensors.jl/blob/main/NDTensors/ext/NDTensorsMetalExt/linearalgebra.jl. So we could still finish this PR in a simpler form and then add back other SVD backends later as they become available, with a better API for choosing between those backends on GPU or transferring to CPU.

@mtfishman
Member

mtfishman commented Feb 12, 2024

Sorry, I didn't see that you were referring to default_svd_alg(T). I forget how that is handled for Metal, but it could just be handled the same way; presumably it chooses the default that would be used for the CPU SVD and then forwards that choice to the CPU SVD backend.

@wbernoudy
Contributor Author

Oh, I didn't think about just overloading LinearAlgebra.svd directly! That's much more straightforward for this. I will make that change.
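
For reference, a minimal sketch of what that direct overload might look like, modeled on the Metal extension linked above (the Exposed wrapper, NDTensors.cpu, and NDTensors.roc adaptor names are assumptions taken from this discussion, not the final code of this PR):

using LinearAlgebra: LinearAlgebra, svd
using AMDGPU: ROCMatrix
using NDTensors: NDTensors               # assumed to provide the cpu and roc adaptors
using NDTensors.Expose: Exposed, expose  # assumed wrapper type used by the GPU extensions

function LinearAlgebra.svd(A::Exposed{<:ROCMatrix}; kwargs...)
  # Transfer to the host, run the CPU SVD, and adapt the factors back to ROCArrays.
  U, S, V = svd(expose(NDTensors.cpu(A)); kwargs...)
  return NDTensors.roc.((U, S, V))
end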

@kmp5VT
Collaborator

kmp5VT commented Feb 20, 2024

@wbernoudy Hi, just giving you an update. I now have access to an AMD GPU and have made a few changes (nothing significant). One thing is that the changes I am making in another PR also affect the work here. My plan is to finish that other PR, then pull the changes in here and make those modifications here as well, so that we don't have to write this code twice. I am hoping to finish 1331 this week and put more time into your PR next week.

@kmp5VT kmp5VT marked this pull request as ready for review March 21, 2024 13:07
@mtfishman
Member

Looks good! Thanks @wbernoudy and @kmp5VT.

@mtfishman mtfishman merged commit 093d339 into ITensor:main Mar 21, 2024
7 of 8 checks passed
@mtfishman mtfishman changed the title [WIP] [NDTensors] Add AMDGPU.jl (ROCm) based extension for NDTensors [NDTensors] Add AMDGPU.jl (ROCm) based extension for NDTensors Mar 21, 2024