Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Nh to type parameter space #1894

Merged
merged 1 commit into from
Jul 23, 2024
Merged

Add Nh to type parameter space #1894

merged 1 commit into from
Jul 23, 2024

Conversation

charleskawczynski
Copy link
Member

@charleskawczynski charleskawczynski commented Jul 22, 2024

This PR adds Nh to the type parameter space. This seems to be needed for GPU performance, as indicated by the recently added performance benchmark script results:

at_dot_call!($X_array, $Y_array):
     6 milliseconds, 775 microseconds
at_dot_call!($X_vector, $Y_vector):
     2 milliseconds, 834 microseconds
custom_sol_kernel!($X_vector, $Y_vector, $(Val(N))):
     2 milliseconds, 547 microseconds
custom_kernel_bc!($X_vector, $Y_vector, $us; printtb = false):
     2 milliseconds, 561 microseconds
custom_kernel_bc!($X_array, $Y_array, $us; printtb = false, use_pw = false):
     4 milliseconds, 160 microseconds
custom_kernel_bc!($X_array, $Y_array, $us; printtb = false, use_pw = true):
     2 milliseconds, 584 microseconds
custom_kernel_bc!($X_vector, $Y_vector, $uss; printtb = false):
     2 milliseconds, 540 microseconds
custom_kernel_bc!($X_array, $Y_array, $uss; printtb = false, use_pw = false):
     2 milliseconds, 715 microseconds
custom_kernel_bc!($X_array, $Y_array, $uss; printtb = false, use_pw = true):
     2 milliseconds, 547 microseconds

Where us and uss are the "universal sizes" and "static universal sizes" respectively. The only difference is how Nh is stored (dynamically vs statically). It's clear from this benchmark that simply moving Nh into the type domain, (some) of our kernels can improve.

Not all of them will because, for example, our copyto! kernel is using static ranges since it only accesses size(dest, 4):

function knl_copyto!(dest, src)

    i = CUDA.threadIdx().x
    j = CUDA.threadIdx().y

    h = CUDA.blockIdx().x
    v = CUDA.blockDim().z * (CUDA.blockIdx().y - 1) + CUDA.threadIdx().z

    if v <= size(dest, 4)
        I = CartesianIndex((i, j, 1, v, h))
        @inbounds dest[I] = src[I]
    end
    return nothing
end

This is likely why the "flat" implementation:

function knl_copyto_flat!(dest::AbstractData, bc)
    @inbounds begin
        n = size(dest)
        tidx = thread_index()
        if valid_range(tidx, prod(n))
            I = kernel_indexes(tidx, n)
            dest[I] = bc[I]
        end
    end
    return nothing
end

was being outperformed.

Also, I noticed in our single_field_solve_kernel! can improve on this:

function single_field_solve_kernel!(device, cache, x, A, b)
    idx = CUDA.threadIdx().x + (CUDA.blockIdx().x - 1) * CUDA.blockDim().x
    Ni, Nj, _, _, Nh = size(Fields.field_values(A))
    if idx <= Ni * Nj * Nh
        (i, j, h) = CartesianIndices((1:Ni, 1:Nj, 1:Nh))[idx].I

        _single_field_solve!(
            device,
            Spaces.column(cache, i, j, h),
            Spaces.column(x, i, j, h),
            Spaces.column(A, i, j, h),
            Spaces.column(b, i, j, h),
        )
    end
    return nothing
end

Rather than using Val everywhere (which will result in uglier interfaces and many creations of runtime types on the CPU), I think it's worth just moving this into the type space, so that we get this performance improvement by default.

Closes #11 for good.

@charleskawczynski
Copy link
Member Author

Ahh, fortunately, our stencils should see an improvement:

    ::Val{Nv},
    ::Val{Nq},
    Nh,
) where {Nv, Nq}
    @inbounds begin
        gid = threadIdx().x + (blockIdx().x - 1) * blockDim().x
        if gid  Nv * Nq * Nq * Nh
            (li, lw, rw, ri) = bds
            (v, i, j, h) = cart_ind((Nv, Nq, Nq, Nh), gid).I

@charleskawczynski charleskawczynski merged commit 98f4047 into main Jul 23, 2024
15 of 18 checks passed
@charleskawczynski charleskawczynski deleted the ck/STATIC branch July 23, 2024 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add size to parameter space of data layout
1 participant