WIP: MatrixFree and Assembled GPU operators #766
base: master
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #766 +/- ##
==========================================
- Coverage 93.29% 92.77% -0.53%
==========================================
Files 36 33 -3
Lines 5235 4952 -283
==========================================
- Hits 4884 4594 -290
- Misses 351 358 +7
end
### TODO Adapt dofhandler

# TODO not sure how to do this automatically
Driveby comment: I think we could change Ferrite.jl/src/interpolations.jl, line 521 in 3cff601,
throw(ArgumentError("no shape function $i for interpolation $ip"))
to
@boundscheck throw(ArgumentError("no shape function $i for interpolation $ip"))
(also in all other places) and then use @inbounds in this example?
Well, this is rather hacky. Give me a bit more time to think about this.
Right, this is actually not type stable. It should probably return eltype(ξ)(NaN) after the @boundscheck, and the operations should be updated so they do not multiply by floats. But I wouldn't call that hacky, since it just allows @inbounds to be used to elide the bounds check?
…ate NaNs at the speed of light.
The first SpMV kernels run, but there are still quite a few features and optimizations to implement. Here are some first benchmarks. cuSPARSE currently wins by a landslide, and the full matrix-free kernel is about 3x slower than the CPU implementation for now.
julia> @benchmark CUDA.@sync mul!($u,$Agpu,$bgpu)
BenchmarkTools.Trial: 27 samples with 1 evaluation.
Range (min … max): 185.123 ms … 186.871 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 186.140 ms ┊ GC (median): 0.00%
Time (mean ± σ): 186.102 ms ± 469.661 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ ▃ █ ▃ █
▇▁▁▁▁▁▁▁▇▁▁▁▁▁▇▇▁▁▁▇█▁█▇▁▁▁▇▇▁▁▁▁▁▇█▁▁▁▁▇▇▁▁▁▁█▁▁▇▁▇▁█▁▁▁▁▇▁▇ ▁
185 ms Histogram: frequency by time 187 ms <
Memory estimate: 46.61 KiB, allocs estimate: 475.
julia> @benchmark mul!($ucpu, $Kcpu, $bcpu)
BenchmarkTools.Trial: 63 samples with 1 evaluation.
Range (min … max): 79.763 ms … 83.167 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 80.435 ms ┊ GC (median): 0.00%
Time (mean ± σ): 80.446 ms ± 512.121 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▆▁▅▆▁▆▅▆▅▁▆██▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
79.8 ms Histogram: log(frequency) by time 83 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mul!($ugpu,$Kgpu,$bgpu)
BenchmarkTools.Trial: 8654 samples with 1 evaluation.
Range (min … max): 18.390 μs … 2.875 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 610.724 μs ┊ GC (median): 0.00%
Time (mean ± σ): 574.855 μs ± 144.657 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▃▂▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂█ ▂
18.4 μs Histogram: frequency by time 614 μs <
Memory estimate: 1.44 KiB, allocs estimate: 60.
julia> @benchmark mul!($ugpu,$Aeagpu,$bgpu)
BenchmarkTools.Trial: 201 samples with 1 evaluation.
Range (min … max): 24.722 ms … 30.899 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 24.823 ms ┊ GC (median): 0.00%
Time (mean ± σ): 24.871 ms ± 437.805 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃█▅▄▁
▂▂▃▃▄███████▆▅▄▄▄▃▁▁▁▁▂▂▂▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▃▂ ▃
24.7 ms Histogram: frequency by time 25.4 ms <
Memory estimate: 4.42 KiB, allocs estimate: 151.
This PR shows how to implement matrix-free operators on the GPU and provides the necessary infrastructure.
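As a rough illustration of the idea, the following CPU-only sketch applies an operator element by element without ever forming the global sparse matrix. It is not the implementation in this PR: `ElementAssemblySketch` and `laplace_1d_sketch` are hypothetical names, and the 1D Laplace problem with linear elements is chosen only to keep the example short.

```julia
using LinearAlgebra

# Hypothetical element-assembly operator: one dense matrix per element plus
# the element-to-dof map. The global sparse matrix is never formed.
struct ElementAssemblySketch{T}
    Ke::Vector{Matrix{T}}        # element matrices
    dofs::Vector{Vector{Int}}    # global dof indices of each element
    n::Int                       # number of global dofs
end

Base.size(A::ElementAssemblySketch) = (A.n, A.n)

# y = A*x via gather, local product, and scatter-add.
function LinearAlgebra.mul!(y::AbstractVector, A::ElementAssemblySketch, x::AbstractVector)
    fill!(y, zero(eltype(y)))
    for (Ke, d) in zip(A.Ke, A.dofs)
        y[d] .+= Ke * x[d]
    end
    return y
end

# 1D Laplace operator on a uniform mesh of linear elements with mesh size h.
function laplace_1d_sketch(nel::Int, h::Float64)
    Ke = [[1.0 -1.0; -1.0 1.0] ./ h for _ in 1:nel]
    dofs = [[e, e + 1] for e in 1:nel]
    return ElementAssemblySketch(Ke, dofs, nel + 1)
end

A = laplace_1d_sketch(4, 0.25)
x = collect(1.0:5.0)
y = mul!(zeros(5), A, x)
```

In a GPU kernel the scatter-add is the delicate step, because several elements write to the same dof. That usually calls for atomic additions or a coloring of the elements, neither of which this serial sketch needs.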
TODOs
Future work