
Add GPU support #137

Merged
fjebaker merged 15 commits into main from fergus/gpu on Aug 1, 2023

Conversation

@fjebaker (Member) commented Jul 30, 2023

Closes #8

There's still a lot of work to do here:

  • properly embed the GPU ensemble calls into the tracing pipeline, so that e.g. kwargshandle isn't passed to solve and dt / adaptive are set correctly
  • investigate why GPU tracing is returning poor trajectories
  • bundle into an extension package that loads if the user is also using DiffEqGPU.jl (a Julia 1.9 feature); see the sketch after this list
  • fix type promotion issues with using GPU in rendergeodesics
  • batch solves fail non-deterministically
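
As a sketch of what the extension packaging in the third bullet could look like (module layout and names here are assumptions, not what this PR actually does), a Julia 1.9 package extension would gather the DiffEqGPU-specific dispatches in a module that only loads when DiffEqGPU.jl is present:

# Hypothetical extension module; names are illustrative only.
module GradusDiffEqGPUExt

using Gradus
using DiffEqGPU

# GPU-specific ensemble dispatches would live here, so that the
# EnsembleGPUKernel code paths are only compiled when DiffEqGPU is loaded.

end # module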

This PR currently includes a temporary fix for handling sin / cos duals in ForwardDiff when dispatching on Metal.
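
As an illustration of the kind of workaround involved (this is a sketch of one possible shape, not the actual patch in this PR), the duals can be routed through separate sin and cos calls so the Metal kernel never relies on a fused sincos intrinsic:

# Hedged sketch of a sin/cos workaround for ForwardDiff duals; not the fix in this PR.
using ForwardDiff: Dual, value, partials

function _sincos(d::Dual{T}) where {T}
    s, c = sin(value(d)), cos(value(d))
    # d/dx sin(x) = cos(x), d/dx cos(x) = -sin(x)
    (Dual{T}(s, c * partials(d)), Dual{T}(c, -s * partials(d)))
end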

State of the device

Rudimentary benchmarks look very promising (fully Float32; the comments give timings for 100 and 10,000 geodesics):

sols = @btime tracegeodesics(m, us, vs, 2000.0f0)
# 100:      13.175 ms (98917 allocations: 38.06 MiB)
# 10_000:   1.727 s (9873983 allocations: 3.72 GiB)

sols = @btime tracegeodesics(m, us, vs, 2000.0f0,
    solver = GPUTsit5(), ensemble = EnsembleGPUKernel(Metal.MetalBackend())
)
# 100:      31.596 ms (3095 allocations: 2.17 MiB)
# 10_000:   231.454 ms (187074 allocations: 204.85 MiB)

However, the traces themselves do not look right. On the CPU, we get:
[Screenshot: CPU trajectories (2023-07-30, 22:53:42)]

Whereas on the GPU:
[Screenshot: GPU trajectories (2023-07-30, 22:53:48)]

Clearly there is something very wrong here. Since the impact parameters are set for $\alpha$ between 5 and 10, the lack of spread in the GPU picture might suggest that the initial steps of the integrator are poor, which then propagates further into the integration.

The performance is promising, and provided it doesn't degrade while fixing the numerical issues, the GPU support should be very worthwhile.

@codecov-commenter

Codecov Report

Merging #137 (db2d8d6) into main (dd23b71) will decrease coverage by 0.19%.
The diff coverage is 57.89%.


@@            Coverage Diff             @@
##             main     #137      +/-   ##
==========================================
- Coverage   68.10%   67.91%   -0.19%     
==========================================
  Files          56       56              
  Lines        2414     2425      +11     
==========================================
+ Hits         1644     1647       +3     
- Misses        770      778       +8     
| Files Changed | Coverage Δ |
| --- | --- |
| src/Gradus.jl | 28.57% <0.00%> (-6.73%) ⬇️ |
| src/tracing/tracing.jl | 90.38% <ø> (ø) |
| src/tracing/geodesic-problem.jl | 97.05% <100.00%> (+0.08%) ⬆️ |
| src/tracing/method-implementations/auto-diff.jl | 97.87% <100.00%> (+0.09%) ⬆️ |

@fjebaker (Member Author) commented

I discovered I was only plotting the CPU offload solutions. With adaptive timestepping we don't have anything to plot beyond the start and end points, but with a fixed timestep we actually get (slow) curves to reconstruct:

[Screenshot: fixed-timestep GPU trajectories (2023-07-31, 11:25:09)]
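
For context, forcing a fixed step on the GPU path looks something like the following (assuming tracegeodesics forwards the adaptive and dt keywords to solve; the dt value is illustrative):

# Hedged sketch: fixed timestep so intermediate points are saved; dt is illustrative.
sols = tracegeodesics(m, us, vs, 2000.0f0,
    solver = GPUTsit5(), ensemble = EnsembleGPUKernel(Metal.MetalBackend()),
    adaptive = false, dt = 0.1f0
)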

Integration termination via the callback functions doesn't seem to be working at the moment; similarly, the status codes don't update.
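
For reference, termination on the CPU path goes through DiffEq-style callbacks; a generic sketch (the radial index and threshold below are assumptions, not Gradus internals) would be:

# Illustrative terminate-on-radius callback, not code from this PR.
using OrdinaryDiffEq  # re-exports ContinuousCallback and terminate!

r_min = 1.0f0                                 # hypothetical inner radius
condition(u, t, integrator) = u[2] - r_min    # assumes u[2] is the radial coordinate
affect!(integrator) = terminate!(integrator)  # stop the integration at the root
cb = ContinuousCallback(condition, affect!)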

@fjebaker (Member Author) commented Aug 1, 2023

GPU:

+ Starting trace...
Rendering: 100%[========================================] Time: 0:00:09 (57.58 μs/it)
+ Trace complete.
  9.217382 seconds (3.04 M allocations: 246.754 MiB, 0.43% gc time)
[Screenshot: GPU shadow render (2023-08-01, 10:16:11)]

CPU:

+ Starting trace...
Rendering: 100%[========================================] Time: 0:00:19 ( 0.12 ms/it)
+ Trace complete.
 19.528674 seconds (1.68 M allocations: 202.045 MiB, 0.09% compilation time)

But I think the CPU Float32 implementation is all over the place, with all sorts of hidden conversions going on. This is reinforced by the shadow image it projects:

[Screenshot: CPU Float32 shadow render (2023-08-01, 10:16:16)]
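
(Not from this PR, but as a reminder of how easily such conversions creep in: any Float64 literal in the hot path silently promotes the whole computation.)

# Illustration of hidden promotion; not code from this PR.
x = 1.0f0      # Float32
y = x * 2.0    # 2.0 is a Float64 literal, so y is promoted to Float64
z = x / 3      # integer literals do not promote, z stays Float32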

@fjebaker (Member Author) commented Aug 1, 2023

For reference, the above 400x400 renders are still around a factor of 2 faster on CPU Float64 (6 threads) than on GPU Float32 (Metal).

@fjebaker (Member Author) commented Aug 1, 2023

For a 1000x1000 shadow render:

GPU Float32: 26.861899 seconds (19.00 M allocations: 1.505 GiB, 0.91% gc time)
CPU Float64: 24.358143 seconds (10.14 M allocations: 1.722 GiB, 1.04% gc time)

There are a few things here that aren't quite fair to the GPU, since the point function evaluation is still being done on the CPU.

@fjebaker fjebaker marked this pull request as ready for review August 1, 2023 22:53
@fjebaker (Member Author) commented Aug 1, 2023

I suspect the failing batch solves may be related to fast-math calls, but I'm going to leave that for now and investigate at a later stage.

@fjebaker fjebaker merged commit bd920f6 into main Aug 1, 2023
@fjebaker fjebaker deleted the fergus/gpu branch August 1, 2023 23:02
fjebaker added a commit that referenced this pull request Aug 22, 2023