Cuda heat example w quaditer #913
base: master
Conversation
What I have done so far (it's still work in progress):
This is still work in progress; as per my discussion with @termi-official last week, I still need to work on the assembler and the coloring algorithm. Some problems I have encountered that might not be so straightforward to tackle:
Great to see some quick progress here!
I think that is straightforward to solve. We never really need the `Dict`s directly during assembly. We should be able to get away with just converting the `Vector`s (once) to GPU vectors and running the assembly with these. This might require 2 structs: one holding the full information (e.g. `GPUGrid`) and one which we use in the kernels (e.g. `GPUGridView`). Maybe the latter could be something like

```julia
struct GPUGridView{TEA, TNA, TSA <: Union{Nothing, <:AbstractVector{Int}, <:AbstractVector{FaceIndex}, ...}, TCA} <: AbstractGrid  # (?)
    cells::TEA
    nodes::TNA
    subdomain::TSA
    color::TCA
end
```

where `subdomain` just holds the data which we want to iterate over (or `nothing` for all cells) and `color` is a vector for the elements with one color of the current subdomain.
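For illustration, a minimal sketch of how this two-struct split could be used: the host-side data is converted once and the resulting view is what the kernel works with. `to_gpu_gridview` is a hypothetical helper (not Ferrite API), CUDA.jl is assumed as the backend, and the type constraints from above are dropped for brevity.

```julia
using CUDA

struct GPUGridView{TEA, TNA, TSA, TCA}
    cells::TEA
    nodes::TNA
    subdomain::TSA   # cell indices to iterate over, or `nothing` for all cells
    color::TCA       # cell indices belonging to one color of the current subdomain
end

# Convert the host-side Vectors once; the kernels then only ever see GPU arrays.
function to_gpu_gridview(cells::Vector, nodes::Vector, subdomain, color::Vector{Int})
    return GPUGridView(
        CuArray(cells),
        CuArray(nodes),
        subdomain === nothing ? nothing : CuArray(subdomain),
        CuArray(color),
    )
end
```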
A longer-term thing, just to throw out the idea, but perhaps a more slim `Grid` could be nice?

```julia
struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    gridsets::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, gridsets) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(gridsets)}(cells, nodes, gridsets)
    end
end

struct GridSets
    facetsets::Dict{String, OrderedSet{FacetIndex}}
    cellsets::Dict{String, OrderedSet{Int}}
    ....
end
```

allowing also
I have thought about this quite a bit already, i.e. whether we should have our grid in the form

```julia
struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    subdomain_info::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, subdomain_info) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(subdomain_info)}(cells, nodes, subdomain_info)
    end
end
```

where `subdomain_info` contains any kind of subdomain information. This could potentially also include some optional topology information, which we need for some problems. In the simplest case it would be just facetsets and cellsets. However, we should do this in a separate PR. What do you think @fredrikekre?
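For concreteness, a minimal sketch of what `subdomain_info` could carry in the simplest case described above. `SubdomainInfo` and its field names are hypothetical; `FacetIndex` is assumed to come from Ferrite and `OrderedSet` from OrderedCollections.

```julia
using Ferrite: FacetIndex
using OrderedCollections: OrderedSet

struct SubdomainInfo{TT}
    cellsets::Dict{String, OrderedSet{Int}}
    facetsets::Dict{String, OrderedSet{FacetIndex}}
    topology::TT   # optional topology information; `nothing` when not needed
end
```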
So far:
```diff
- K = spzeros!!(Float64, I, J, ndofs(dh), ndofs(dh)) # old code
+ K = spzeros!!(T, I, J, ndofs(dh), ndofs(dh))       # my proposal
```

I don't know whether this was intentional, but I found it worth mentioning.
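As a small runnable illustration of why parameterizing the element type matters downstream (e.g. so a `Float32` system can be assembled for the GPU without a conversion pass); plain `spzeros` stands in here for the `spzeros!!` helper in the diff above, and `n` for `ndofs(dh)`.

```julia
using SparseArrays

T = Float32           # element type requested by the caller, e.g. Float32 for GPU assembly
n = 10                # stand-in for ndofs(dh)
K = spzeros(T, n, n)  # the eltype now follows T instead of being hard-coded to Float64
@assert eltype(K) === Float32
```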
Thanks for putting this together so far! Some quick comments for you before the next meeting.
Indeed, and I have started refactoring some of the assembly code here: #916. I also think that we cannot get away with reusing the existing assembler and that we need a custom one.
Indeed, but you should be able to write into `nzval` directly. Your GPUSparseMatrixCSC struct already has a very similar structure to the CUSPARSE one, so switching should be straightforward.
Indeed. Fredrik has already put something great together to fix this here: #888, and I hope we can merge it in the not-too-distant future to have more direct support for different formats.
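To make the "write into `nzval` directly" suggestion above concrete, here is a minimal CPU-side sketch of the lookup it involves: given that `(row, col)` is already part of the sparsity pattern, find its slot in `nzval` and add into it. On the GPU the addition would need to be atomic (or races avoided via coloring); the function name is hypothetical.

```julia
using SparseArrays

function addindex_csc!(K::SparseMatrixCSC, v, row::Int, col::Int)
    # The nonzeros of column `col` are stored at indices colptr[col]:(colptr[col+1]-1).
    for k in K.colptr[col]:(K.colptr[col + 1] - 1)
        if K.rowval[k] == row
            K.nzval[k] += v   # write straight into the stored values
            return K
        end
    end
    return error("entry ($row, $col) is not in the sparsity pattern")
end
```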
A GPU benchmark on a 1000 × 1000 grid, with biquadratic Lagrange approximation functions and a 3 × 3 quadrature rule for the numerical integration.
Profiler ran for 477.09 ms, capturing 2844 events.
Host-side activity: calling CUDA APIs took 303.82 ms (63.68% of the trace)
┌──────┬───────────┬───────────┬────────┬─────────────────────────┬────────────────────────────┐
│ ID │ Start │ Time │ Thread │ Name │ Details │
├──────┼───────────┼───────────┼────────┼─────────────────────────┼────────────────────────────┤
│ 5 │ 19.55 µs │ 8.82 µs │ 1 │ cuMemAllocFromPoolAsync │ 15.274 MiB, device memory │
│ 19 │ 38.39 µs │ 953.67 ns │ 1 │ cuStreamSynchronize │ - │
│ 28 │ 44.35 µs │ 2.66 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 33 │ 2.71 ms │ 7.15 µs │ 1 │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│ 170 │ 2.84 ms │ 1.43 µs │ 1 │ cuStreamSynchronize │ - │
│ 179 │ 2.84 ms │ 44.51 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 184 │ 47.38 ms │ 27.18 µs │ 1 │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│ 309 │ 47.47 ms │ 3.58 µs │ 1 │ cuStreamSynchronize │ - │
│ 318 │ 47.49 ms │ 41.89 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 323 │ 89.42 ms │ 38.86 µs │ 1 │ cuMemAllocFromPoolAsync │ 15.274 MiB, device memory │
│ 335 │ 89.48 ms │ 64.61 µs │ 1 │ cuMemsetD32Async │ - │
│ 356 │ 98.81 ms │ 30.76 µs │ 1 │ cuMemAllocFromPoolAsync │ 34.332 MiB, device memory │
│ 370 │ 98.86 ms │ 4.29 µs │ 1 │ cuStreamSynchronize │ - │
│ 379 │ 98.88 ms │ 5.4 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 388 │ 104.32 ms │ 21.46 µs │ 1 │ cuMemAllocFromPoolAsync │ 30.518 MiB, device memory │
│ 516 │ 104.41 ms │ 2.38 µs │ 1 │ cuStreamSynchronize │ - │
│ 525 │ 104.42 ms │ 4.74 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 534 │ 110.25 ms │ 24.56 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 548 │ 110.29 ms │ 4.53 µs │ 1 │ cuStreamSynchronize │ - │
│ 557 │ 110.31 ms │ 797.99 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 566 │ 111.12 ms │ 13.35 µs │ 1 │ cuMemAllocFromPoolAsync │ 7.645 MiB, device memory │
│ 1054 │ 111.3 ms │ 1.19 µs │ 1 │ cuStreamSynchronize │ - │
│ 1063 │ 111.3 ms │ 1.39 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 1072 │ 114.73 ms │ 13.59 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 1086 │ 114.75 ms │ 1.91 µs │ 1 │ cuStreamSynchronize │ - │
│ 1095 │ 114.76 ms │ 598.67 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 1112 │ 115.43 ms │ 8.11 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.164 MiB, device memory │
│ 1124 │ 115.44 ms │ 50.54 µs │ 1 │ cuMemsetD32Async │ - │
│ 1129 │ 115.49 ms │ 6.91 µs │ 1 │ cuMemAllocFromPoolAsync │ 360.000 KiB, device memory │
│ 1141 │ 115.5 ms │ 6.44 µs │ 1 │ cuMemsetD32Async │ - │
│ 1162 │ 149.56 ms │ 1.22 ms │ 1 │ cuMemAllocFromPoolAsync │ 34.332 MiB, device memory │
│ 1176 │ 150.8 ms │ 4.29 µs │ 1 │ cuStreamSynchronize │ - │
│ 1185 │ 150.82 ms │ 7.98 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 1194 │ 158.81 ms │ 722.41 µs │ 1 │ cuMemAllocFromPoolAsync │ 30.518 MiB, device memory │
│ 1208 │ 159.55 ms │ 3.58 µs │ 1 │ cuStreamSynchronize │ - │
│ 1217 │ 159.56 ms │ 6.86 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 1226 │ 167.29 ms │ 15.02 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 1240 │ 167.34 ms │ 3.58 µs │ 1 │ cuStreamSynchronize │ - │
│ 1249 │ 167.35 ms │ 777.24 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 1258 │ 168.14 ms │ 6.68 µs │ 1 │ cuMemAllocFromPoolAsync │ 7.645 MiB, device memory │
│ 1965 │ 168.38 ms │ 1.19 µs │ 1 │ cuStreamSynchronize │ - │
│ 1974 │ 168.39 ms │ 1.41 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 1983 │ 171.42 ms │ 12.87 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 1997 │ 171.46 ms │ 2.15 µs │ 1 │ cuStreamSynchronize │ - │
│ 2006 │ 171.47 ms │ 816.58 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 2023 │ 172.42 ms │ 84.4 µs │ 1 │ cuLaunchKernel │ - │
│ 2801 │ 172.93 ms │ 13.11 µs │ 2 │ cuMemFreeAsync │ 3.815 MiB, device memory │
│ 2806 │ 172.95 ms │ 2.38 µs │ 2 │ cuMemFreeAsync │ 7.645 MiB, device memory │
│ 2811 │ 172.96 ms │ 2.38 µs │ 2 │ cuMemFreeAsync │ 3.815 MiB, device memory │
│ 2816 │ 172.96 ms │ 2.62 µs │ 2 │ cuMemFreeAsync │ 30.518 MiB, device memory │
│ 2821 │ 172.97 ms │ 2.86 µs │ 2 │ cuMemFreeAsync │ 34.332 MiB, device memory │
│ 2824 │ 172.97 ms │ 303.8 ms │ 2 │ cuStreamSynchronize │ - │
└──────┴───────────┴───────────┴────────┴─────────────────────────┴────────────────────────────┘
Device-side activity: GPU was busy for 418.87 ms (87.80% of the trace)
┌──────┬───────────┬───────────┬─────────┬────────┬──────┬─────────────┬──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │ Start │ Time │ Threads │ Blocks │ Regs │ Size │ Throughput │ Name ⋯
├──────┼───────────┼───────────┼─────────┼────────┼──────┼─────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────
│ 28 │ 247.24 µs │ 2.6 ms │ - │ - │ - │ 15.274 MiB │ 5.741 GiB/s │ [copy pageable to device memory] ⋯
│ 179 │ 3.02 ms │ 44.45 ms │ - │ - │ - │ 244.202 MiB │ 5.365 GiB/s │ [copy pageable to device memory] ⋯
│ 318 │ 47.62 ms │ 41.88 ms │ - │ - │ - │ 244.202 MiB │ 5.694 GiB/s │ [copy pageable to device memory] ⋯
│ 335 │ 89.97 ms │ 250.1 µs │ - │ - │ - │ 15.274 MiB │ 59.640 GiB/s │ [set device memory] ⋯
│ 379 │ 99.14 ms │ 5.27 ms │ - │ - │ - │ 34.332 MiB │ 6.362 GiB/s │ [copy pageable to device memory] ⋯
│ 525 │ 104.54 ms │ 4.76 ms │ - │ - │ - │ 30.518 MiB │ 6.262 GiB/s │ [copy pageable to device memory] ⋯
│ 557 │ 110.52 ms │ 791.31 µs │ - │ - │ - │ 3.815 MiB │ 4.708 GiB/s │ [copy pageable to device memory] ⋯
│ 1063 │ 111.5 ms │ 1.35 ms │ - │ - │ - │ 7.645 MiB │ 5.519 GiB/s │ [copy pageable to device memory] ⋯
│ 1095 │ 114.97 ms │ 537.16 µs │ - │ - │ - │ 3.815 MiB │ 6.935 GiB/s │ [copy pageable to device memory] ⋯
│ 1124 │ 115.95 ms │ 58.41 µs │ - │ - │ - │ 3.164 MiB │ 52.898 GiB/s │ [set device memory] ⋯
│ 1141 │ 116.02 ms │ 13.11 µs │ - │ - │ - │ 360.000 KiB │ 26.182 GiB/s │ [set device memory] ⋯
│ 1185 │ 153.41 ms │ 5.53 ms │ - │ - │ - │ 34.332 MiB │ 6.064 GiB/s │ [copy pageable to device memory] ⋯
│ 1217 │ 161.78 ms │ 4.81 ms │ - │ - │ - │ 30.518 MiB │ 6.200 GiB/s │ [copy pageable to device memory] ⋯
│ 1249 │ 167.66 ms │ 730.51 µs │ - │ - │ - │ 3.815 MiB │ 5.100 GiB/s │ [copy pageable to device memory] ⋯
│ 1974 │ 168.68 ms │ 1.34 ms │ - │ - │ - │ 7.645 MiB │ 5.582 GiB/s │ [copy pageable to device memory] ⋯
│ 2006 │ 171.79 ms │ 721.93 µs │ - │ - │ - │ 3.815 MiB │ 5.160 GiB/s │ [copy pageable to device memory] ⋯
│ 2023 │ 172.92 ms │ 303.78 ms │ 256 │ 40 │ 95 │ - │ - │ assemble_gpu_(CuSparseDeviceMatrixCSC<Float32, Int32, 1l>, CuDeviceArray<Float32, 1l, 1l>, Stat ⋯
└──────┴───────────┴───────────┴─────────┴────────┴──────┴─────────────┴──────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────
Here is the first review round, covering the example, parts of the assembly logic, and some of the kernel infrastructure.
ext/GPU/adapt.jl
Outdated
Can we make these dispatches CUDA-specific?
I believe yes, but just out of curiosity, is there any performance-related reason that I might have overlooked?
No, this is more about making the code extensible (e.g. to allow using AMD).
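As a sketch of what making the dispatch CUDA-specific could look like: constrain the `Adapt` method to the CUDA backend so that another backend (e.g. an AMDGPU extension) can later add its own method next to it. `GPUDofHandlerData` is a hypothetical wrapper, and dispatching on `CUDA.KernelAdaptor` is an assumption about current CUDA.jl internals.

```julia
using Adapt, CUDA

# Hypothetical host-side wrapper; its fields are expected to already be CuArrays.
struct GPUDofHandlerData{A <: AbstractVector{Int}}
    cell_dofs::A
    cell_dofs_offset::A
end

# CUDA-specific method: only used when CUDA.jl converts kernel arguments,
# instead of a fully generic `adapt_structure(to, x)` in the main package.
function Adapt.adapt_structure(to::CUDA.KernelAdaptor, d::GPUDofHandlerData)
    return GPUDofHandlerData(
        adapt(to, d.cell_dofs),
        adapt(to, d.cell_dofs_offset),
    )
end
```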
ext/GPU/gpu_assembler.jl
Outdated
The allocation dispatches (`Ferrite.allocate_matrix`) are missing. Just allocate the analogous CPU matrix and shove it into CUSPARSE, à la

```julia
Kcpu = allocate_matrix(SparseMatrixCSC{Float32, Int32}, dh)
allocate_matrix(CUSPARSE.CuSparseMatrixCSC{Float32, Int32}, dh)
```

where we extract the type parameters from the dispatch.
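A hedged sketch of such a dispatch, following the recipe above (build the CPU sparsity pattern, then hand it to CUSPARSE); the exact method signature and its placement in the extension are assumptions, not the actual PR code.

```julia
using Ferrite, SparseArrays
using CUDA.CUSPARSE: CuSparseMatrixCSC

# Extract Tv/Ti from the requested GPU matrix type, allocate the analogous
# CPU matrix with the existing machinery, and upload it to the device.
function Ferrite.allocate_matrix(::Type{CuSparseMatrixCSC{Tv, Ti}}, dh::DofHandler) where {Tv, Ti}
    Kcpu = allocate_matrix(SparseMatrixCSC{Tv, Ti}, dh)
    return CuSparseMatrixCSC(Kcpu)
end
```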
heatflow_qp_values.jl
Outdated
What does this file belong to?
Honestly, no idea; this is the first time I have noticed it 😂
## gpu_kernel = init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
## gpu_kernel()
## end
I think we are missing the analogous benchmark using `QuadraturePointIterator`.
GPU Setup Benchmarking

```julia
function setup_bench_gpu(n_cells, n_basefuncs, cellvalues, dh)
    Kgpu = allocate_matrix(CUSPARSE.CuSparseMatrixCSC{Float32, Int32}, dh)
    fgpu = CUDA.zeros(eltype(Kgpu), ndofs(dh))
    gpu_kernel = init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
end
```

Profiler ran for 3.21 s, capturing 107 events.
Host-side activity: calling CUDA APIs took 375.8 ms (11.72% of the trace)
┌─────┬────────┬───────────┬─────────────────────────┬────────────────────────────┐
│ ID │ Start │ Time │ Name │ Details │
├─────┼────────┼───────────┼─────────────────────────┼────────────────────────────┤
│ 5 │ 2.83 s │ 92.27 µs │ cuMemAllocFromPoolAsync │ 15.274 MiB, device memory │
│ 19 │ 2.83 s │ 4.77 µs │ cuStreamSynchronize │ - │
│ 28 │ 2.83 s │ 10.39 ms │ cuMemcpyHtoDAsync │ - │
│ 33 │ 2.84 s │ 9.14 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│ 47 │ 2.85 s │ 12.16 µs │ cuStreamSynchronize │ - │
│ 56 │ 2.85 s │ 195.62 ms │ cuMemcpyHtoDAsync │ - │
│ 61 │ 3.04 s │ 7.37 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│ 75 │ 3.05 s │ 5.01 µs │ cuStreamSynchronize │ - │
│ 84 │ 3.05 s │ 152.64 ms │ cuMemcpyHtoDAsync │ - │
│ 89 │ 3.2 s │ 46.49 µs │ cuMemAllocFromPoolAsync │ 15.274 MiB, device memory │
│ 101 │ 3.2 s │ 399.59 µs │ cuMemsetD32Async │ - │
└─────┴────────┴───────────┴─────────────────────────┴────────────────────────────┘
Device-side activity: GPU was busy for 301.98 ms (9.42% of the trace)
┌─────┬────────┬───────────┬─────────────┬───────────────┬──────────────────────────────────┐
│ ID │ Start │ Time │ Size │ Throughput │ Name │
├─────┼────────┼───────────┼─────────────┼───────────────┼──────────────────────────────────┤
│ 28 │ 2.83 s │ 10.54 ms │ 15.274 MiB │ 1.416 GiB/s │ [copy pageable to device memory] │
│ 56 │ 2.88 s │ 168.79 ms │ 244.202 MiB │ 1.413 GiB/s │ [copy pageable to device memory] │
│ 84 │ 3.08 s │ 122.56 ms │ 244.202 MiB │ 1.946 GiB/s │ [copy pageable to device memory] │
│ 101 │ 3.21 s │ 91.31 µs │ 15.274 MiB │ 163.349 GiB/s │ [set device memory] │
└─────┴────────┴───────────┴─────────────┴───────────────┴──────────────────────────────────┘

GPU Kernel Benchmarking

```julia
CUDA.@profile trace = true gpu_kernel()
```

Profiler ran for 373.97 ms, capturing 1731 events.
Host-side activity: calling CUDA APIs took 315.64 ms (84.40% of the trace)
┌──────┬──────────┬───────────┬────────┬─────────────────────────┬────────────────────────────┐
│ ID │ Start │ Time │ Thread │ Name │ Details │
├──────┼──────────┼───────────┼────────┼─────────────────────────┼────────────────────────────┤
│ 21 │ 11.25 ms │ 39.1 µs │ 1 │ cuMemAllocFromPoolAsync │ 34.332 MiB, device memory │
│ 35 │ 11.32 ms │ 5.48 µs │ 1 │ cuStreamSynchronize │ - │
│ 44 │ 11.34 ms │ 5.62 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 53 │ 17.0 ms │ 28.85 µs │ 1 │ cuMemAllocFromPoolAsync │ 30.518 MiB, device memory │
│ 139 │ 17.07 ms │ 3.1 µs │ 1 │ cuStreamSynchronize │ - │
│ 148 │ 17.08 ms │ 4.88 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 157 │ 22.87 ms │ 27.89 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 171 │ 22.92 ms │ 3.34 µs │ 1 │ cuStreamSynchronize │ - │
│ 180 │ 22.93 ms │ 706.2 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 189 │ 23.65 ms │ 8.58 µs │ 1 │ cuMemAllocFromPoolAsync │ 7.645 MiB, device memory │
│ 305 │ 23.81 ms │ 1.43 µs │ 1 │ cuStreamSynchronize │ - │
│ 314 │ 23.81 ms │ 1.13 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 323 │ 26.6 ms │ 12.4 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 337 │ 26.63 ms │ 1.67 µs │ 1 │ cuStreamSynchronize │ - │
│ 346 │ 26.63 ms │ 582.93 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 363 │ 27.29 ms │ 8.34 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.164 MiB, device memory │
│ 375 │ 27.3 ms │ 65.09 µs │ 1 │ cuMemsetD32Async │ - │
│ 380 │ 27.37 ms │ 4.05 µs │ 1 │ cuMemAllocFromPoolAsync │ 360.000 KiB, device memory │
│ 392 │ 27.37 ms │ 36.95 µs │ 1 │ cuMemsetD32Async │ - │
│ 413 │ 38.88 ms │ 27.89 µs │ 1 │ cuMemAllocFromPoolAsync │ 34.332 MiB, device memory │
│ 427 │ 38.93 ms │ 4.29 µs │ 1 │ cuStreamSynchronize │ - │
│ 436 │ 38.95 ms │ 5.96 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 445 │ 44.94 ms │ 23.6 µs │ 1 │ cuMemAllocFromPoolAsync │ 30.518 MiB, device memory │
│ 555 │ 45.05 ms │ 5.48 µs │ 1 │ cuCtxSetCurrent │ - │
│ 556 │ 45.05 ms │ 239.13 µs │ 1 │ cuCtxGetDevice │ - │
│ 557 │ 45.33 ms │ 715.26 ns │ 1 │ cuDeviceGetCount │ - │
│ 560 │ 45.34 ms │ 12.16 µs │ 1 │ cuMemFreeAsync │ 3.815 MiB, device memory │
│ 565 │ 45.36 ms │ 2.38 µs │ 1 │ cuMemFreeAsync │ 7.645 MiB, device memory │
│ 570 │ 45.36 ms │ 2.15 µs │ 1 │ cuMemFreeAsync │ 3.815 MiB, device memory │
│ 575 │ 45.37 ms │ 3.1 µs │ 1 │ cuMemFreeAsync │ 30.518 MiB, device memory │
│ 580 │ 45.37 ms │ 3.81 µs │ 1 │ cuMemFreeAsync │ 34.332 MiB, device memory │
│ 586 │ 45.51 ms │ 4.05 µs │ 1 │ cuStreamSynchronize │ - │
│ 595 │ 45.52 ms │ 4.83 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 604 │ 51.4 ms │ 14.78 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 618 │ 51.43 ms │ 2.86 µs │ 1 │ cuStreamSynchronize │ - │
│ 627 │ 51.44 ms │ 727.89 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 636 │ 52.18 ms │ 15.97 µs │ 1 │ cuMemAllocFromPoolAsync │ 7.645 MiB, device memory │
│ 881 │ 52.4 ms │ 1.67 µs │ 1 │ cuStreamSynchronize │ - │
│ 890 │ 52.41 ms │ 1.3 ms │ 1 │ cuMemcpyHtoDAsync │ - │
│ 899 │ 55.51 ms │ 17.4 µs │ 1 │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 913 │ 55.54 ms │ 2.15 µs │ 1 │ cuStreamSynchronize │ - │
│ 922 │ 55.55 ms │ 687.12 µs │ 1 │ cuMemcpyHtoDAsync │ - │
│ 939 │ 56.33 ms │ 882.39 µs │ 1 │ cuLaunchKernel │ - │
│ 1715 │ 58.13 ms │ 315.64 ms │ 2 │ cuStreamSynchronize │ - │
└──────┴──────────┴───────────┴────────┴─────────────────────────┴────────────────────────────┘
Device-side activity: GPU was busy for 343.25 ms (91.78% of the trace)
┌─────┬──────────┬───────────┬─────────┬────────┬──────┬─────────────┬──────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │ Start │ Time │ Threads │ Blocks │ Regs │ Size │ Throughput │ Name ⋯
├─────┼──────────┼───────────┼─────────┼────────┼──────┼─────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 44 │ 11.67 ms │ 5.43 ms │ - │ - │ - │ 34.332 MiB │ 6.175 GiB/s │ [copy pageable to device memory] ⋯
│ 148 │ 17.26 ms │ 4.85 ms │ - │ - │ - │ 30.518 MiB │ 6.139 GiB/s │ [copy pageable to device memory] ⋯
│ 180 │ 23.28 ms │ 532.15 µs │ - │ - │ - │ 3.815 MiB │ 7.000 GiB/s │ [copy pageable to device memory] ⋯
│ 314 │ 23.97 ms │ 1.16 ms │ - │ - │ - │ 7.645 MiB │ 6.456 GiB/s │ [copy pageable to device memory] ⋯
│ 346 │ 26.81 ms │ 599.15 µs │ - │ - │ - │ 3.815 MiB │ 6.218 GiB/s │ [copy pageable to device memory] ⋯
│ 375 │ 27.47 ms │ 40.53 µs │ - │ - │ - │ 3.164 MiB │ 76.235 GiB/s │ [set device memory] ⋯
│ 392 │ 27.52 ms │ 9.54 µs │ - │ - │ - │ 360.000 KiB │ 36.000 GiB/s │ [set device memory] ⋯
│ 436 │ 39.24 ms │ 5.8 ms │ - │ - │ - │ 34.332 MiB │ 5.778 GiB/s │ [copy pageable to device memory] ⋯
│ 595 │ 45.71 ms │ 4.88 ms │ - │ - │ - │ 30.518 MiB │ 6.101 GiB/s │ [copy pageable to device memory] ⋯
│ 627 │ 51.69 ms │ 735.76 µs │ - │ - │ - │ 3.815 MiB │ 5.063 GiB/s │ [copy pageable to device memory] ⋯
│ 890 │ 52.58 ms │ 1.37 ms │ - │ - │ - │ 7.645 MiB │ 5.467 GiB/s │ [copy pageable to device memory] ⋯
│ 922 │ 55.78 ms │ 664.71 µs │ - │ - │ - │ 3.815 MiB │ 5.604 GiB/s │ [copy pageable to device memory] ⋯
│ 939 │ 56.52 ms │ 317.17 ms │ 256 │ 40 │ 95 │ - │ - │ assemble_gpu_(CuSparseDeviceMatrixCSC<Float32, Int32, 1l>, CuDeviceArray<Float32, 1l, 1l>, StaticCellValues<StaticInterpolationValues<La ⋯
└─────┴──────────┴───────────┴─────────┴────────┴──────┴─────────────┴──────────────┴─────────────────────────────────────────────────────────────────

CPU Setup Benchmarking

```julia
function setup_bench_cpu(dh)
    K = allocate_matrix(SparseMatrixCSC{Float64, Int}, dh)
    f = zeros(eltype(K), ndofs(dh))
    return K, f
end
```

BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.188 s … 10.796 s ┊ GC (min … max): 5.31% … 1.65%
Time (median): 6.492 s ┊ GC (median): 2.27%
Time (mean ± σ): 6.492 s ± 6.086 s ┊ GC (mean ± σ): 2.27% ± 2.59%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.19 s Histogram: frequency by time 10.8 s <
Memory estimate: 2.12 GiB, allocs estimate: 2009.

CPU Assemble Benchmarking

1. Standard Assembly

```julia
@benchmark assemble_global_std!($cellvalues, $dh, $K, $f)
```

BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.404 s … 1.439 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.437 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.429 s ± 16.653 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█ ▁
1.4 s Histogram: frequency by time 1.44 s <
Memory estimate: 1.45 KiB, allocs estimate: 15.

2. QuadratureValueIterator

```julia
@benchmark assemble_global_qp!($cellvalues, $dh, $K, $f)
```

BenchmarkTools.Trial: 5 samples with 1 evaluation.
Range (min … max): 1.201 s … 1.246 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.215 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.221 s ± 18.214 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █ █
█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.2 s Histogram: frequency by time 1.25 s <
Memory estimate: 1.45 KiB, allocs estimate: 15.
GPU Benchmarking

Setup:

Profiler ran for 3.52 s, capturing 389 events.
Host-side activity: calling CUDA APIs took 276.41 ms (7.86% of the trace)
┌─────┬────────┬───────────┬─────────────────────────┬────────────────────────────┐
│ ID │ Start │ Time │ Name │ Details │
├─────┼────────┼───────────┼─────────────────────────┼────────────────────────────┤
│ 5 │ 3.21 s │ 1.11 ms │ cuMemAllocFromPoolAsync │ 15.274 MiB, device memory │
│ 19 │ 3.21 s │ 2.86 µs │ cuStreamSynchronize │ - │
│ 27 │ 3.21 s │ 14.58 ms │ cuMemcpyHtoDAsync │ - │
│ 32 │ 3.22 s │ 6.52 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│ 46 │ 3.23 s │ 3.1 µs │ cuStreamSynchronize │ - │
│ 54 │ 3.23 s │ 157.46 ms │ cuMemcpyHtoDAsync │ - │
│ 59 │ 3.39 s │ 4.67 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│ 73 │ 3.39 s │ 2.38 µs │ cuStreamSynchronize │ - │
│ 81 │ 3.39 s │ 54.99 ms │ cuMemcpyHtoDAsync │ - │
│ 86 │ 3.45 s │ 41.25 µs │ cuMemAllocFromPoolAsync │ 15.274 MiB, device memory │
│ 98 │ 3.45 s │ 393.39 µs │ cuMemsetD32Async │ - │
│ 105 │ 3.47 s │ 1.37 ms │ cuMemAllocFromPoolAsync │ 34.332 MiB, device memory │
│ 119 │ 3.47 s │ 5.01 µs │ cuStreamSynchronize │ - │
│ 127 │ 3.47 s │ 8.08 ms │ cuMemcpyHtoDAsync │ - │
│ 132 │ 3.47 s │ 1.38 ms │ cuMemAllocFromPoolAsync │ 30.518 MiB, device memory │
│ 146 │ 3.48 s │ 2.15 µs │ cuStreamSynchronize │ - │
│ 154 │ 3.48 s │ 11.79 ms │ cuMemcpyHtoDAsync │ - │
│ 159 │ 3.49 s │ 815.87 µs │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 173 │ 3.49 s │ 1.43 µs │ cuStreamSynchronize │ - │
│ 181 │ 3.49 s │ 2.56 ms │ cuMemcpyHtoDAsync │ - │
│ 186 │ 3.49 s │ 82.97 µs │ cuMemAllocFromPoolAsync │ 7.645 MiB, device memory │
│ 305 │ 3.49 s │ 1.91 µs │ cuStreamSynchronize │ - │
│ 313 │ 3.49 s │ 1.55 ms │ cuMemcpyHtoDAsync │ - │
│ 318 │ 3.5 s │ 11.21 µs │ cuMemAllocFromPoolAsync │ 3.815 MiB, device memory │
│ 332 │ 3.5 s │ 953.67 ns │ cuStreamSynchronize │ - │
│ 340 │ 3.5 s │ 606.3 µs │ cuMemcpyHtoDAsync │ - │
│ 347 │ 3.5 s │ 7.64 ms │ cuMemAllocFromPoolAsync │ 308.990 MiB, device memory │
│ 359 │ 3.51 s │ 55.31 µs │ cuMemsetD32Async │ - │
│ 364 │ 3.51 s │ 427.01 µs │ cuMemAllocFromPoolAsync │ 34.332 MiB, device memory │
│ 376 │ 3.51 s │ 34.09 µs │ cuMemsetD32Async │ - │
└─────┴────────┴───────────┴─────────────────────────┴────────────────────────────┘
Device-side activity: GPU was busy for 210.53 ms (5.98% of the trace)
┌─────┬────────┬───────────┬─────────────┬───────────────┬──────────────────────────────────┐
│ ID │ Start │ Time │ Size │ Throughput │ Name │
├─────┼────────┼───────────┼─────────────┼───────────────┼──────────────────────────────────┤
│ 27 │ 3.21 s │ 10.63 ms │ 15.274 MiB │ 1.403 GiB/s │ [copy pageable to device memory] │
│ 54 │ 3.25 s │ 131.97 ms │ 244.202 MiB │ 1.807 GiB/s │ [copy pageable to device memory] │
│ 81 │ 3.4 s │ 46.19 ms │ 244.202 MiB │ 5.163 GiB/s │ [copy pageable to device memory] │
│ 98 │ 3.45 s │ 90.36 µs │ 15.274 MiB │ 165.073 GiB/s │ [set device memory] │
│ 127 │ 3.47 s │ 5.96 ms │ 34.332 MiB │ 5.629 GiB/s │ [copy pageable to device memory] │
│ 154 │ 3.48 s │ 10.47 ms │ 30.518 MiB │ 2.847 GiB/s │ [copy pageable to device memory] │
│ 181 │ 3.49 s │ 881.67 µs │ 3.815 MiB │ 4.225 GiB/s │ [copy pageable to device memory] │
│ 313 │ 3.49 s │ 1.45 ms │ 7.645 MiB │ 5.154 GiB/s │ [copy pageable to device memory] │
│ 340 │ 3.5 s │ 609.87 µs │ 3.815 MiB │ 6.108 GiB/s │ [copy pageable to device memory] │
│ 359 │ 3.52 s │ 2.07 ms │ 308.990 MiB │ 145.709 GiB/s │ [set device memory] │
│ 376 │ 3.52 s │ 210.29 µs │ 34.332 MiB │ 159.439 GiB/s │ [set device memory] │
└─────┴────────┴───────────┴─────────────┴───────────────┴──────────────────────────────────┘

Kernel launch

Profiler ran for 479.87 ms, capturing 829 events.
Host-side activity: calling CUDA APIs took 478.53 ms (99.72% of the trace)
┌─────┬──────────┬───────────┬────────┬─────────────────────┐
│ ID │ Start │ Time │ Thread │ Name │
├─────┼──────────┼───────────┼────────┼─────────────────────┤
│ 49 │ 89.65 µs │ 352.62 µs │ 1 │ cuLaunchKernel │
│ 825 │ 1.02 ms │ 478.53 ms │ 2 │ cuStreamSynchronize │
└─────┴──────────┴───────────┴────────┴─────────────────────┘
Device-side activity: GPU was busy for 478.59 ms (99.73% of the trace)
┌────┬───────────┬───────────┬─────────┬────────┬──────┬──────┐
│ ID │ Start │ Time │ Threads │ Blocks │ Regs │ Name │
├────┼───────────┼───────────┼─────────┼────────┼──────┼──────┤
│ 49 │ 949.62 µs │ 478.59 ms │ 256 │ 40 │ 255 │ _4 │
└────┴───────────┴───────────┴─────────┴────────┴──────┴──────┘

CPU Benchmarking

Setup

BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.599 s … 5.068 s ┊ GC (min … max): 6.36% … 1.17%
Time (median): 3.834 s ┊ GC (median): 2.93%
Time (mean ± σ): 3.834 s ± 1.746 s ┊ GC (mean ± σ): 2.93% ± 3.67%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.6 s Histogram: frequency by time 5.07 s <
Memory estimate: 2.12 GiB, allocs estimate: 2009.

Kernel launch

BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.406 s … 1.526 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.454 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.460 s ± 49.409 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.41 s Histogram: frequency by time 1.53 s <
Memory estimate: 1.45 KiB, allocs estimate: 15.
Heat Example Prototype using CUDA.jl and StaticCellValues