
Cuda heat example w quaditer #913

Draft · wants to merge 154 commits into master

Conversation

Abdelrahman912
Contributor

Heat Example Prototype using CUDA.jl and StaticCellValues

@Abdelrahman912
Contributor Author

What I have done so far (still a work in progress):

  1. I added some higher-level abstractions and restructured the code to match the original example (some refactoring is still needed).
  2. I used the QuadratureValuesIterator and edited the StaticCellValues object to be compatible with the GPU.

This is still a work in progress; per my discussion with @termi-official last week, I still need to work on the assembler and the coloring algorithm.

Some problems I have encountered that might not be so straightforward to tackle:

  1. The `Grid` object contains `Dict` fields, which are not GPU compatible.

@termi-official
Member

Great to see some quick progress here!

Some problems I have encountered that might not be so straightforward to tackle:

1. `Grid` object contains `Dict` type which is not GPU compatible.

I think that is straightforward to solve. We never really need the Dicts directly during assembly. We should be able to get away with just converting the Vectors (once) to GPUVectors and running the assembly with these. This might require two structs: one holding the full information (e.g. GPUGrid) and one which we use in the kernels (e.g. GPUGridView). Maybe the latter could be something like

struct GPUGridView{TEA, TNA, TSA <: Union{Nothing, <:AbstractVector{Int}, <:AbstractVector{FaceIndex}, ...}, TCA} <: AbstractGrid # (?)
    cells::TEA
    nodes::TNA
    subdomain::TSA
    color::TCA
end

where `subdomain` just holds the data which we want to iterate over (or `nothing` for all cells) and `color` is a vector holding the elements of one color within the current subdomain.
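As a rough sketch of that idea (all names here are hypothetical, not existing Ferrite API): the Dict-based sets are materialized into plain vectors once on the host, and the kernel only ever sees the flat view. Plain `Vector`s are used so the sketch runs on the CPU; on the device these fields would be `CuVector`s.

```julia
# Hypothetical kernel-side view of a grid: only flat, GPU-friendly containers.
struct GPUGridView{TEA, TNA, TSA, TCA}
    cells::TEA
    nodes::TNA
    subdomain::TSA   # cell indices to iterate over, or `nothing` for all cells
    color::TCA       # cells of the current color within the subdomain
end

# Build a view for one color of one subdomain; the Dict lookup happens once
# here, on the host, so the kernel never touches a Dict.
function build_grid_view(cells, nodes, cellsets::Dict{String, Vector{Int}},
                         setname::String, colors::Vector{Vector{Int}}, c::Int)
    subdomain = cellsets[setname]   # flat Vector, convertible to a CuVector
    return GPUGridView(cells, nodes, subdomain, colors[c])
end

cells = [:c1, :c2, :c3, :c4]
nodes = [1.0, 2.0, 3.0]
sets  = Dict("left" => [1, 2], "right" => [3, 4])
gv = build_grid_view(cells, nodes, sets, "left", [[1], [2]], 1)
```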

@KnutAM
Member

KnutAM commented May 23, 2024

A longer-term thing, just to throw out the idea, but perhaps a more slim `Grid` could be nice?

struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    gridsets::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, gridsets) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(gridsets)}(cells, nodes, gridsets)
    end
end
struct GridSets
    facetsets::Dict{String, OrderedSet{FacetIndex}}
    cellsets::Dict{String, OrderedSet{Int}}
    ....
end

allowing also gridsets=nothing

@termi-official
Member

termi-official commented Jun 4, 2024

A longer-term thing, just to throw out the idea, but perhaps a more slim `Grid` could be nice?

struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    gridsets::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, gridsets) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(gridsets)}(cells, nodes, gridsets)
    end
end
struct GridSets
    facetsets::Dict{String, OrderedSet{FacetIndex}}
    cellsets::Dict{String, OrderedSet{Int}}
    ....
end

allowing also gridsets=nothing

I have already thought about this quite a bit, namely whether we should have our grid in the form

struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    subdomain_info::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, subdomain_info) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(subdomain_info)}(cells, nodes, subdomain_info)
    end
end

where `subdomain_info` contains any kind of subdomain information. This could also include some optional topology information, which we need for some problems. In the simplest case it would be just facetsets and cellsets.

However, we should do this in a separate PR. What do you think @fredrikekre ?

@Abdelrahman912
Contributor Author

So far:

  1. I implemented the example using the coloring algorithm, and I did my best to follow the same abstraction as in the CPU case (one could also circumvent this by using metaprogramming to set up the kernel before launching, but that might be relevant for a later discussion).
  2. I had to implement a custom assembler (a naive implementation, `gpu_assembler`) because the existing one cannot be used: its `permutation` and `sorteddofs` attributes are mutable (in both elements and size), so it is only valid for sequential code, not to mention the `resize!`.
  3. Also, setting an index of a `CuSparseMatrixCSC` is not allowed inside a kernel (ref: https://discourse.julialang.org/t/cuda-jl-atomic-addition-error-to-a-sparse-array-inside-cuda-kernel/78789/2), so I wrote a very naive `GPUSparseMatrixCSC`.
  4. A final observation: `create_sparsity_pattern` only creates a sparse matrix with `nzvals` of type `Float64`, as follows:
    - K = spzeros!!(Float64, I, J, ndofs(dh), ndofs(dh)) # old code
    + K = spzeros!!(T, I, J, ndofs(dh), ndofs(dh)) # my proposal

I don't know whether this was intended, but I found it worth mentioning.
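For illustration, a minimal sketch of what such a "very naive" CSC type could look like (type and function names here are hypothetical, not the actual PR code; plain `Vector`s stand in for the device arrays so the sketch runs on the CPU):

```julia
# A naive CSC container mirroring SparseMatrixCSC's field layout.
struct NaiveSparseMatrixCSC{Tv, Ti}
    m::Int
    n::Int
    colptr::Vector{Ti}   # column j's entries live at colptr[j]:(colptr[j+1]-1)
    rowval::Vector{Ti}   # row index of each stored entry
    nzval::Vector{Tv}    # stored values
end

# Kernel-style write: locate (i, j) in the FIXED sparsity pattern and
# overwrite the stored value; structural zeros cannot be created on the fly.
function setentry!(A::NaiveSparseMatrixCSC, v, i::Integer, j::Integer)
    for k in A.colptr[j]:(A.colptr[j + 1] - 1)
        if A.rowval[k] == i
            A.nzval[k] = v
            return A
        end
    end
    error("entry ($i, $j) is not in the sparsity pattern")
end

# 2x2 pattern with stored entries at (1,1) and (2,2).
A = NaiveSparseMatrixCSC(2, 2, [1, 2, 3], [1, 2], zeros(Float32, 2))
setentry!(A, 5.0f0, 2, 2)
```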

@termi-official
Member

Thanks for putting this together so far! Some quick comments for you before the next meeting.

2. I had to implement a custom assembler (a naive implementation, `gpu_assembler`) because the existing one cannot be used: its `permutation` and `sorteddofs` attributes are mutable (in both elements and size), so it is only valid for sequential code, not to mention the `resize!`.

Indeed, and I started refactoring some of the assembly code in #916. I also think that we cannot get away with reusing the existing assembler and that we need a custom one.

3. Also, setting an index of a `CuSparseMatrixCSC` is not allowed inside a kernel (ref: https://discourse.julialang.org/t/cuda-jl-atomic-addition-error-to-a-sparse-array-inside-cuda-kernel/78789/2), so I wrote a very naive `GPUSparseMatrixCSC`.

Indeed, but you should be able to write into `nzval` directly. Your `GPUSparseMatrixCSC` struct already has a structure very similar to the CUSPARSE one, so switching should be straightforward.
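A CPU sketch of the "write into `nzval` directly" idea, using the standard-library SparseArrays accessors (`addentry!` is a hypothetical helper; the device-side atomic accumulation is only hinted at in a comment):

```julia
using SparseArrays

# Given a FIXED sparsity pattern, assembly can find the storage slot of
# (i, j) and accumulate into the nonzeros buffer directly, bypassing
# setindex! (which is what fails inside a CUDA kernel).
function addentry!(A::SparseMatrixCSC, v, i::Integer, j::Integer)
    for k in nzrange(A, j)              # stored entries of column j
        if rowvals(A)[k] == i
            nonzeros(A)[k] += v         # on the GPU: CUDA.@atomic on nzval[k]
            return A
        end
    end
    error("entry ($i, $j) is not in the sparsity pattern")
end

# Pattern with stored (possibly zero) entries at (1,1), (2,1), (2,2).
K = sparse([1, 2, 2], [1, 1, 2], zeros(3))
addentry!(K, 1.5, 2, 1)
```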

4. A final observation: `create_sparsity_pattern` only creates a sparse matrix with `nzvals` of type `Float64`, as follows:
    - K = spzeros!!(Float64, I, J, ndofs(dh), ndofs(dh)) #old code
    + K = spzeros!!(T, I, J, ndofs(dh), ndofs(dh)) # my proposal

I don't know whether this was intended, but I found it worth mentioning.

Indeed. Fredrik has already put something great together to fix this in #888, and I hope we can merge it in the near future to have more direct support for different formats.

@Abdelrahman912
Contributor Author

A GPU benchmark with a 1000 × 1000 grid, biquadratic Lagrange approximation functions, and a 3 × 3 quadrature rule for numerical integration.

Profiler ran for 477.09 ms, capturing 2844 events.
Host-side activity: calling CUDA APIs took 303.82 ms (63.68% of the trace)
┌──────┬───────────┬───────────┬────────┬─────────────────────────┬────────────────────────────┐
│   ID │     Start │      Time │ Thread │ Name                    │                    Details │
├──────┼───────────┼───────────┼────────┼─────────────────────────┼────────────────────────────┤
│    5 │  19.55 µs │   8.82 µs │      1 │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│   19 │  38.39 µs │ 953.67 ns │      1 │ cuStreamSynchronize     │                          - │
│   28 │  44.35 µs │   2.66 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│   33 │   2.71 ms │   7.15 µs │      1 │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  170 │   2.84 ms │   1.43 µs │      1 │ cuStreamSynchronize     │                          - │
│  179 │   2.84 ms │  44.51 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  184 │  47.38 ms │  27.18 µs │      1 │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  309 │  47.47 ms │   3.58 µs │      1 │ cuStreamSynchronize     │                          - │
│  318 │  47.49 ms │  41.89 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  323 │  89.42 ms │  38.86 µs │      1 │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│  335 │  89.48 ms │  64.61 µs │      1 │ cuMemsetD32Async        │                          - │
│  356 │  98.81 ms │  30.76 µs │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│  370 │  98.86 ms │   4.29 µs │      1 │ cuStreamSynchronize     │                          - │
│  379 │  98.88 ms │    5.4 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  388 │ 104.32 ms │  21.46 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│  516 │ 104.41 ms │   2.38 µs │      1 │ cuStreamSynchronize     │                          - │
│  525 │ 104.42 ms │   4.74 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  534 │ 110.25 ms │  24.56 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  548 │ 110.29 ms │   4.53 µs │      1 │ cuStreamSynchronize     │                          - │
│  557 │ 110.31 ms │ 797.99 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  566 │ 111.12 ms │  13.35 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│ 1054 │  111.3 ms │   1.19 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1063 │  111.3 ms │   1.39 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1072 │ 114.73 ms │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 1086 │ 114.75 ms │   1.91 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1095 │ 114.76 ms │ 598.67 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1112 │ 115.43 ms │   8.11 µs │      1 │ cuMemAllocFromPoolAsync │   3.164 MiB, device memory │
│ 1124 │ 115.44 ms │  50.54 µs │      1 │ cuMemsetD32Async        │                          - │
│ 1129 │ 115.49 ms │   6.91 µs │      1 │ cuMemAllocFromPoolAsync │ 360.000 KiB, device memory │
│ 1141 │  115.5 ms │   6.44 µs │      1 │ cuMemsetD32Async        │                          - │
│ 1162 │ 149.56 ms │   1.22 ms │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│ 1176 │  150.8 ms │   4.29 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1185 │ 150.82 ms │   7.98 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1194 │ 158.81 ms │ 722.41 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│ 1208 │ 159.55 ms │   3.58 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1217 │ 159.56 ms │   6.86 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1226 │ 167.29 ms │  15.02 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 1240 │ 167.34 ms │   3.58 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1249 │ 167.35 ms │ 777.24 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1258 │ 168.14 ms │   6.68 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│ 1965 │ 168.38 ms │   1.19 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1974 │ 168.39 ms │   1.41 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1983 │ 171.42 ms │  12.87 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 1997 │ 171.46 ms │   2.15 µs │      1 │ cuStreamSynchronize     │                          - │
│ 2006 │ 171.47 ms │ 816.58 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 2023 │ 172.42 ms │   84.4 µs │      1 │ cuLaunchKernel          │                          - │
│ 2801 │ 172.93 ms │  13.11 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 2806 │ 172.95 ms │   2.38 µs │      2 │ cuMemFreeAsync          │   7.645 MiB, device memory │
│ 2811 │ 172.96 ms │   2.38 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 2816 │ 172.96 ms │   2.62 µs │      2 │ cuMemFreeAsync          │  30.518 MiB, device memory │
│ 2821 │ 172.97 ms │   2.86 µs │      2 │ cuMemFreeAsync          │  34.332 MiB, device memory │
│ 2824 │ 172.97 ms │  303.8 ms │      2 │ cuStreamSynchronize     │                          - │
└──────┴───────────┴───────────┴────────┴─────────────────────────┴────────────────────────────┘

Device-side activity: GPU was busy for 418.87 ms (87.80% of the trace)
┌──────┬───────────┬───────────┬─────────┬────────┬──────┬─────────────┬──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────
│   ID │     Start │      Time │ Threads │ Blocks │ Regs │        Size │   Throughput │ Name                                                                                            
├──────┼───────────┼───────────┼─────────┼────────┼──────┼─────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────
│   28 │ 247.24 µs │    2.6 ms │       - │      - │    - │  15.274 MiB │  5.741 GiB/s │ [copy pageable to device memory]
│  179 │   3.02 ms │  44.45 ms │       - │      - │    - │ 244.202 MiB │  5.365 GiB/s │ [copy pageable to device memory]
│  318 │  47.62 ms │  41.88 ms │       - │      - │    - │ 244.202 MiB │  5.694 GiB/s │ [copy pageable to device memory]
│  335 │  89.97 ms │  250.1 µs │       - │      - │    - │  15.274 MiB │ 59.640 GiB/s │ [set device memory]
│  379 │  99.14 ms │   5.27 ms │       - │      - │    - │  34.332 MiB │  6.362 GiB/s │ [copy pageable to device memory]
│  525 │ 104.54 ms │   4.76 ms │       - │      - │    - │  30.518 MiB │  6.262 GiB/s │ [copy pageable to device memory]
│  557 │ 110.52 ms │ 791.31 µs │       - │      - │    - │   3.815 MiB │  4.708 GiB/s │ [copy pageable to device memory]
│ 1063 │  111.5 ms │   1.35 ms │       - │      - │    - │   7.645 MiB │  5.519 GiB/s │ [copy pageable to device memory]
│ 1095 │ 114.97 ms │ 537.16 µs │       - │      - │    - │   3.815 MiB │  6.935 GiB/s │ [copy pageable to device memory]
│ 1124 │ 115.95 ms │  58.41 µs │       - │      - │    - │   3.164 MiB │ 52.898 GiB/s │ [set device memory]
│ 1141 │ 116.02 ms │  13.11 µs │       - │      - │    - │ 360.000 KiB │ 26.182 GiB/s │ [set device memory]
│ 1185 │ 153.41 ms │   5.53 ms │       - │      - │    - │  34.332 MiB │  6.064 GiB/s │ [copy pageable to device memory]
│ 1217 │ 161.78 ms │   4.81 ms │       - │      - │    - │  30.518 MiB │  6.200 GiB/s │ [copy pageable to device memory]
│ 1249 │ 167.66 ms │ 730.51 µs │       - │      - │    - │   3.815 MiB │  5.100 GiB/s │ [copy pageable to device memory]
│ 1974 │ 168.68 ms │   1.34 ms │       - │      - │    - │   7.645 MiB │  5.582 GiB/s │ [copy pageable to device memory]
│ 2006 │ 171.79 ms │ 721.93 µs │       - │      - │    - │   3.815 MiB │  5.160 GiB/s │ [copy pageable to device memory]
│ 2023 │ 172.92 ms │ 303.78 ms │     256 │   4095 │    - │           - │            - │ assemble_gpu_(CuSparseDeviceMatrixCSC<Float32, Int32, 1l>, CuDeviceArray<Float32, 1l, 1l>, Stat
└──────┴───────────┴───────────┴─────────┴────────┴──────┴─────────────┴──────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────

Member

@termi-official left a comment


Here the first review round covering the example, parts of the assembly logic and some of the kernel infrastructure.

docs/src/literate-tutorials/gpu_qp_heat_equation.jl (5 resolved review comments)
ext/GPU/adapt.jl
Member


Can we make these dispatches CUDA-specific?

Contributor Author


I believe yes, but just out of curiosity, is there any performance-related reason that I might have overlooked?

Member


No, this is more about making the code extensible (e.g. to allow using AMD).

Member


The allocation dispatches (`Ferrite.allocate_matrix`) are missing. Just allocate the analogous CPU matrix and shove it into CUSPARSE à la

Kcpu = allocate_matrix(SparseMatrixCSC{Float32, Int32}, dh)
allocate_matrix(CUSPARSE.CuSparseMatrixCSC{Float32, Int32}, dh)

where we extract the type parameters from the dispatch.
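A sketch of how the value/index type parameters can be extracted from such a dispatch (`allocate_matrix` and `dh` are Ferrite names; this hypothetical stand-in takes just a matrix size, and a CUDA extension would then wrap the resulting CPU matrix in a `CuSparseMatrixCSC`):

```julia
using SparseArrays

# Dispatch on the requested matrix type and pull Tv/Ti out of its parameters.
function allocate_matrix_sketch(::Type{S}, n::Int) where {Tv, Ti, S <: SparseMatrixCSC{Tv, Ti}}
    # Build the CPU matrix with the requested value and index types; the
    # CUDA path would then do: CUSPARSE.CuSparseMatrixCSC(Kcpu).
    return spzeros(Tv, Ti, n, n)
end

K = allocate_matrix_sketch(SparseMatrixCSC{Float32, Int32}, 4)
```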

Member


To what does this file belong?

Contributor Author


Honestly, no idea; this is the first time I've noticed it 😂

## gpu_kernel = init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
## gpu_kernel()
## end

Member


I think we are missing the analogous benchmark using `QuadraturePointIterator`.

src/GPU/gpu_grid.jl
@Abdelrahman912
Contributor Author

GPU Setup Benchmarking

function setup_bench_gpu(n_cells, n_basefuncs, cellvalues, dh)
    Kgpu = allocate_matrix(CUSPARSE.CuSparseMatrixCSC{Float32, Int32}, dh)
    fgpu = CUDA.zeros(eltype(Kgpu), ndofs(dh));
    gpu_kernel = init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
end
Profiler ran for 3.21 s, capturing 107 events.

Host-side activity: calling CUDA APIs took 375.8 ms (11.72% of the trace)
┌─────┬────────┬───────────┬─────────────────────────┬────────────────────────────┐
│  ID │  Start │      Time │ Name                    │                    Details │
├─────┼────────┼───────────┼─────────────────────────┼────────────────────────────┤
│   5 │ 2.83 s │  92.27 µs │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│  19 │ 2.83 s │   4.77 µs │ cuStreamSynchronize     │                          - │
│  28 │ 2.83 s │  10.39 ms │ cuMemcpyHtoDAsync       │                          - │
│  33 │ 2.84 s │   9.14 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  47 │ 2.85 s │  12.16 µs │ cuStreamSynchronize     │                          - │
│  56 │ 2.85 s │ 195.62 ms │ cuMemcpyHtoDAsync       │                          - │
│  61 │ 3.04 s │   7.37 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  75 │ 3.05 s │   5.01 µs │ cuStreamSynchronize     │                          - │
│  84 │ 3.05 s │ 152.64 ms │ cuMemcpyHtoDAsync       │                          - │
│  89 │  3.2 s │  46.49 µs │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│ 101 │  3.2 s │ 399.59 µs │ cuMemsetD32Async        │                          - │
└─────┴────────┴───────────┴─────────────────────────┴────────────────────────────┘

Device-side activity: GPU was busy for 301.98 ms (9.42% of the trace)
┌─────┬────────┬───────────┬─────────────┬───────────────┬──────────────────────────────────┐
│  ID │  Start │      Time │        Size │    Throughput │ Name                             │
├─────┼────────┼───────────┼─────────────┼───────────────┼──────────────────────────────────┤
│  28 │ 2.83 s │  10.54 ms │  15.274 MiB │   1.416 GiB/s │ [copy pageable to device memory] │
│  56 │ 2.88 s │ 168.79 ms │ 244.202 MiB │   1.413 GiB/s │ [copy pageable to device memory] │
│  84 │ 3.08 s │ 122.56 ms │ 244.202 MiB │   1.946 GiB/s │ [copy pageable to device memory] │
│ 101 │ 3.21 s │  91.31 µs │  15.274 MiB │ 163.349 GiB/s │ [set device memory]              │
└─────┴────────┴───────────┴─────────────┴───────────────┴──────────────────────────────────┘

GPU Kernel Benchmarking

CUDA.@profile trace = true gpu_kernel()
Profiler ran for 373.97 ms, capturing 1731 events.

Host-side activity: calling CUDA APIs took 315.64 ms (84.40% of the trace)
┌──────┬──────────┬───────────┬────────┬─────────────────────────┬────────────────────────────┐
│   ID │    Start │      Time │ Thread │ Name                    │                    Details │
├──────┼──────────┼───────────┼────────┼─────────────────────────┼────────────────────────────┤
│   21 │ 11.25 ms │   39.1 µs │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│   35 │ 11.32 ms │   5.48 µs │      1 │ cuStreamSynchronize     │                          - │
│   44 │ 11.34 ms │   5.62 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│   53 │  17.0 ms │  28.85 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│  139 │ 17.07 ms │    3.1 µs │      1 │ cuStreamSynchronize     │                          - │
│  148 │ 17.08 ms │   4.88 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  157 │ 22.87 ms │  27.89 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  171 │ 22.92 ms │   3.34 µs │      1 │ cuStreamSynchronize     │                          - │
│  180 │ 22.93 ms │  706.2 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  189 │ 23.65 ms │   8.58 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│  305 │ 23.81 ms │   1.43 µs │      1 │ cuStreamSynchronize     │                          - │
│  314 │ 23.81 ms │   1.13 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  323 │  26.6 ms │   12.4 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  337 │ 26.63 ms │   1.67 µs │      1 │ cuStreamSynchronize     │                          - │
│  346 │ 26.63 ms │ 582.93 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  363 │ 27.29 ms │   8.34 µs │      1 │ cuMemAllocFromPoolAsync │   3.164 MiB, device memory │
│  375 │  27.3 ms │  65.09 µs │      1 │ cuMemsetD32Async        │                          - │
│  380 │ 27.37 ms │   4.05 µs │      1 │ cuMemAllocFromPoolAsync │ 360.000 KiB, device memory │
│  392 │ 27.37 ms │  36.95 µs │      1 │ cuMemsetD32Async        │                          - │
│  413 │ 38.88 ms │  27.89 µs │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│  427 │ 38.93 ms │   4.29 µs │      1 │ cuStreamSynchronize     │                          - │
│  436 │ 38.95 ms │   5.96 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  445 │ 44.94 ms │   23.6 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│  555 │ 45.05 ms │   5.48 µs │      1 │ cuCtxSetCurrent         │                          - │
│  556 │ 45.05 ms │ 239.13 µs │      1 │ cuCtxGetDevice          │                          - │
│  557 │ 45.33 ms │ 715.26 ns │      1 │ cuDeviceGetCount        │                          - │
│  560 │ 45.34 ms │  12.16 µs │      1 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│  565 │ 45.36 ms │   2.38 µs │      1 │ cuMemFreeAsync          │   7.645 MiB, device memory │
│  570 │ 45.36 ms │   2.15 µs │      1 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│  575 │ 45.37 ms │    3.1 µs │      1 │ cuMemFreeAsync          │  30.518 MiB, device memory │
│  580 │ 45.37 ms │   3.81 µs │      1 │ cuMemFreeAsync          │  34.332 MiB, device memory │
│  586 │ 45.51 ms │   4.05 µs │      1 │ cuStreamSynchronize     │                          - │
│  595 │ 45.52 ms │   4.83 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  604 │  51.4 ms │  14.78 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  618 │ 51.43 ms │   2.86 µs │      1 │ cuStreamSynchronize     │                          - │
│  627 │ 51.44 ms │ 727.89 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  636 │ 52.18 ms │  15.97 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│  881 │  52.4 ms │   1.67 µs │      1 │ cuStreamSynchronize     │                          - │
│  890 │ 52.41 ms │    1.3 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  899 │ 55.51 ms │   17.4 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  913 │ 55.54 ms │   2.15 µs │      1 │ cuStreamSynchronize     │                          - │
│  922 │ 55.55 ms │ 687.12 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  939 │ 56.33 ms │ 882.39 µs │      1 │ cuLaunchKernel          │                          - │
│ 1715 │ 58.13 ms │ 315.64 ms │      2 │ cuStreamSynchronize     │                          - │
└──────┴──────────┴───────────┴────────┴─────────────────────────┴────────────────────────────┘

Device-side activity: GPU was busy for 343.25 ms (91.78% of the trace)
┌─────┬──────────┬───────────┬─────────┬────────┬──────┬─────────────┬──────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  ID │    Start │      Time │ Threads │ Blocks │ Regs │        Size │   Throughput │ Name                                                                                                                                     
├─────┼──────────┼───────────┼─────────┼────────┼──────┼─────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  44 │ 11.67 ms │   5.43 ms │       - │      - │    - │  34.332 MiB │  6.175 GiB/s │ [copy pageable to device memory]
│ 148 │ 17.26 ms │   4.85 ms │       - │      - │    - │  30.518 MiB │  6.139 GiB/s │ [copy pageable to device memory]
│ 180 │ 23.28 ms │ 532.15 µs │       - │      - │    - │   3.815 MiB │  7.000 GiB/s │ [copy pageable to device memory]
│ 314 │ 23.97 ms │   1.16 ms │       - │      - │    - │   7.645 MiB │  6.456 GiB/s │ [copy pageable to device memory]
│ 346 │ 26.81 ms │ 599.15 µs │       - │      - │    - │   3.815 MiB │  6.218 GiB/s │ [copy pageable to device memory]
│ 375 │ 27.47 ms │  40.53 µs │       - │      - │    - │   3.164 MiB │ 76.235 GiB/s │ [set device memory]
│ 392 │ 27.52 ms │   9.54 µs │       - │      - │    - │ 360.000 KiB │ 36.000 GiB/s │ [set device memory]
│ 436 │ 39.24 ms │    5.8 ms │       - │      - │    - │  34.332 MiB │  5.778 GiB/s │ [copy pageable to device memory]
│ 595 │ 45.71 ms │   4.88 ms │       - │      - │    - │  30.518 MiB │  6.101 GiB/s │ [copy pageable to device memory]
│ 627 │ 51.69 ms │ 735.76 µs │       - │      - │    - │   3.815 MiB │  5.063 GiB/s │ [copy pageable to device memory]
│ 890 │ 52.58 ms │   1.37 ms │       - │      - │    - │   7.645 MiB │  5.467 GiB/s │ [copy pageable to device memory]
│ 922 │ 55.78 ms │ 664.71 µs │       - │      - │    - │   3.815 MiB │  5.604 GiB/s │ [copy pageable to device memory]
│ 939 │ 56.52 ms │ 317.17 ms │     256 │   4095 │    - │           - │            - │ assemble_gpu_(CuSparseDeviceMatrixCSC<Float32, Int32, 1l>, CuDeviceArray<Float32, 1l, 1l>, StaticCellValues<StaticInterpolationValues<La
└─────┴──────────┴───────────┴─────────┴────────┴──────┴─────────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

CPU Setup Benchmarking

function setup_bench_cpu(dh)
    K = allocate_matrix(SparseMatrixCSC{Float64, Int}, dh)
    f = zeros(eltype(K), ndofs(dh))
    return K, f
end
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.188 s … 10.796 s  ┊ GC (min … max): 5.31% … 1.65%
 Time  (median):     6.492 s             ┊ GC (median):    2.27%
 Time  (mean ± σ):   6.492 s ±  6.086 s  ┊ GC (mean ± σ):  2.27% ± 2.59%

  █                                                      █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.19 s        Histogram: frequency by time        10.8 s <

 Memory estimate: 2.12 GiB, allocs estimate: 2009.

CPU Assemble Benchmarking

1. Standard Assembly

@benchmark assemble_global_std!($cellvalues, $dh, $K, $f)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.404 s …  1.439 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.437 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.429 s ± 16.653 ms ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁                                                 ▁     █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█ ▁
  1.4 s          Histogram: frequency by time        1.44 s <

 Memory estimate: 1.45 KiB, allocs estimate: 15.

2. QuadratureValueIterator

@benchmark assemble_global_qp!($cellvalues, $dh, $K, $f)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.201 s …  1.246 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.215 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.221 s ± 18.214 ms ┊ GC (mean ± σ):  0.00% ± 0.00%

  █          █     █                      █               █  
  █▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.2 s          Histogram: frequency by time        1.25 s <

 Memory estimate: 1.45 KiB, allocs estimate: 15.

@Abdelrahman912
Contributor Author

GPU Benchmarking

Setup:

Profiler ran for 3.52 s, capturing 389 events.

Host-side activity: calling CUDA APIs took 276.41 ms (7.86% of the trace)
┌─────┬────────┬───────────┬─────────────────────────┬────────────────────────────┐
│  ID │  Start │      Time │ Name                    │                    Details │
├─────┼────────┼───────────┼─────────────────────────┼────────────────────────────┤
│   5 │ 3.21 s │   1.11 ms │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│  19 │ 3.21 s │   2.86 µs │ cuStreamSynchronize     │                          - │
│  27 │ 3.21 s │  14.58 ms │ cuMemcpyHtoDAsync       │                          - │
│  32 │ 3.22 s │   6.52 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  46 │ 3.23 s │    3.1 µs │ cuStreamSynchronize     │                          - │
│  54 │ 3.23 s │ 157.46 ms │ cuMemcpyHtoDAsync       │                          - │
│  59 │ 3.39 s │   4.67 ms │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  73 │ 3.39 s │   2.38 µs │ cuStreamSynchronize     │                          - │
│  81 │ 3.39 s │  54.99 ms │ cuMemcpyHtoDAsync       │                          - │
│  86 │ 3.45 s │  41.25 µs │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│  98 │ 3.45 s │ 393.39 µs │ cuMemsetD32Async        │                          - │
│ 105 │ 3.47 s │   1.37 ms │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│ 119 │ 3.47 s │   5.01 µs │ cuStreamSynchronize     │                          - │
│ 127 │ 3.47 s │   8.08 ms │ cuMemcpyHtoDAsync       │                          - │
│ 132 │ 3.47 s │   1.38 ms │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│ 146 │ 3.48 s │   2.15 µs │ cuStreamSynchronize     │                          - │
│ 154 │ 3.48 s │  11.79 ms │ cuMemcpyHtoDAsync       │                          - │
│ 159 │ 3.49 s │ 815.87 µs │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 173 │ 3.49 s │   1.43 µs │ cuStreamSynchronize     │                          - │
│ 181 │ 3.49 s │   2.56 ms │ cuMemcpyHtoDAsync       │                          - │
│ 186 │ 3.49 s │  82.97 µs │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│ 305 │ 3.49 s │   1.91 µs │ cuStreamSynchronize     │                          - │
│ 313 │ 3.49 s │   1.55 ms │ cuMemcpyHtoDAsync       │                          - │
│ 318 │  3.5 s │  11.21 µs │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 332 │  3.5 s │ 953.67 ns │ cuStreamSynchronize     │                          - │
│ 340 │  3.5 s │  606.3 µs │ cuMemcpyHtoDAsync       │                          - │
│ 347 │  3.5 s │   7.64 ms │ cuMemAllocFromPoolAsync │ 308.990 MiB, device memory │
│ 359 │ 3.51 s │  55.31 µs │ cuMemsetD32Async        │                          - │
│ 364 │ 3.51 s │ 427.01 µs │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│ 376 │ 3.51 s │  34.09 µs │ cuMemsetD32Async        │                          - │
└─────┴────────┴───────────┴─────────────────────────┴────────────────────────────┘

Device-side activity: GPU was busy for 210.53 ms (5.98% of the trace)
┌─────┬────────┬───────────┬─────────────┬───────────────┬──────────────────────────────────┐
│  ID │  Start │      Time │        Size │    Throughput │ Name                             │
├─────┼────────┼───────────┼─────────────┼───────────────┼──────────────────────────────────┤
│  27 │ 3.21 s │  10.63 ms │  15.274 MiB │   1.403 GiB/s │ [copy pageable to device memory] │
│  54 │ 3.25 s │ 131.97 ms │ 244.202 MiB │   1.807 GiB/s │ [copy pageable to device memory] │
│  81 │  3.4 s │  46.19 ms │ 244.202 MiB │   5.163 GiB/s │ [copy pageable to device memory] │
│  98 │ 3.45 s │  90.36 µs │  15.274 MiB │ 165.073 GiB/s │ [set device memory]              │
│ 127 │ 3.47 s │   5.96 ms │  34.332 MiB │   5.629 GiB/s │ [copy pageable to device memory] │
│ 154 │ 3.48 s │  10.47 ms │  30.518 MiB │   2.847 GiB/s │ [copy pageable to device memory] │
│ 181 │ 3.49 s │ 881.67 µs │   3.815 MiB │   4.225 GiB/s │ [copy pageable to device memory] │
│ 313 │ 3.49 s │   1.45 ms │   7.645 MiB │   5.154 GiB/s │ [copy pageable to device memory] │
│ 340 │  3.5 s │ 609.87 µs │   3.815 MiB │   6.108 GiB/s │ [copy pageable to device memory] │
│ 359 │ 3.52 s │   2.07 ms │ 308.990 MiB │ 145.709 GiB/s │ [set device memory]              │
│ 376 │ 3.52 s │ 210.29 µs │  34.332 MiB │ 159.439 GiB/s │ [set device memory]              │
└─────┴────────┴───────────┴─────────────┴───────────────┴──────────────────────────────────┘

Kernel launch

Profiler ran for 479.87 ms, capturing 829 events.

Host-side activity: calling CUDA APIs took 478.53 ms (99.72% of the trace)
┌─────┬──────────┬───────────┬────────┬─────────────────────┐
│  ID │    Start │      Time │ Thread │ Name                │
├─────┼──────────┼───────────┼────────┼─────────────────────┤
│ 498 │  9.65 µs │ 352.62 µs │      1 │ cuLaunchKernel      │
│ 825 │  1.02 ms │ 478.53 ms │      2 │ cuStreamSynchronize │
└─────┴──────────┴───────────┴────────┴─────────────────────┘

Device-side activity: GPU was busy for 478.59 ms (99.73% of the trace)
┌─────┬──────────┬───────────┬─────────┬────────┬──────┬──────┐
│  ID │    Start │      Time │ Threads │ Blocks │ Regs │ Name │
├─────┼──────────┼───────────┼─────────┼────────┼──────┼──────┤
│ 499 │ 49.62 µs │ 478.59 ms │     256 │     40 │  255 │ _4   │
└─────┴──────────┴───────────┴─────────┴────────┴──────┴──────┘
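The host/device traces above can be reproduced with CUDA.jl's integrated profiler. A hedged sketch, where the kernel name `assemble_gpu!` and its arguments are assumptions standing in for this PR's actual assembly kernel:

```julia
# Sketch: profiling a single assembly kernel launch with CUDA.jl.
# `assemble_gpu!` and its argument list are placeholders for the
# kernel in this PR; launch configuration matches the trace above.
using CUDA

CUDA.@profile begin
    @cuda threads = 256 blocks = 40 assemble_gpu!(Kgpu, fgpu, dh_gpu, cv_gpu)
    CUDA.synchronize()  # appears as cuStreamSynchronize in the host-side trace
end
```

Note that `cuStreamSynchronize` dominating the host trace (99.72%) simply means the CPU is blocked waiting for the kernel, so the device-side kernel time (478.59 ms) is the number to compare against the CPU assembly.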

CPU Benchmarking

Setup

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min  max):  2.599 s  5.068 s  ┊ GC (min  max): 6.36%  1.17%
 Time  (median):     3.834 s            ┊ GC (median):    2.93%
 Time  (mean ± σ):   3.834 s ± 1.746 s  ┊ GC (mean ± σ):  2.93% ± 3.67%

  █                                                     █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.6 s         Histogram: frequency by time       5.07 s <

 Memory estimate: 2.12 GiB, allocs estimate: 2009.

Kernel launch

BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min  max):  1.406 s    1.526 s  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.454 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.460 s ± 49.409 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                   █    █                              █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.41 s         Histogram: frequency by time        1.53 s <

 Memory estimate: 1.45 KiB, allocs estimate: 15.
