Fix correctness in cuda_mapreduce #2106

Merged
merged 1 commit into from
Dec 21, 2024

Sbozzolo
Member

cuda_mapreduce was not working correctly with certain spaces.

Why was this happening?

I added a comment to describe the algorithm in the commit.

In a nutshell, the algorithm did not account for the final block being only partially filled with points to process. As a result, the reduction included padding elements that held the value 0 instead of real points.
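
To make the failure mode concrete, here is a minimal CPU-only sketch of a block-wise reduction. It is not the code from this PR: `blocked_reduce`, `blocksize`, and `mask_tail` are hypothetical names, and the zero-initialized `buffer` merely stands in for a block's shared memory. It assumes the bug shows up whenever the number of points is not a multiple of the block size.

```julia
# CPU emulation of a block-wise reduction (illustrative only).
# `blocked_reduce`, `blocksize`, and `mask_tail` are hypothetical names.
function blocked_reduce(op, data::Vector{T}, blocksize::Int; mask_tail::Bool) where {T}
    nblocks = cld(length(data), blocksize)
    partials = T[]
    for b in 1:nblocks
        first_idx = (b - 1) * blocksize + 1
        # Number of slots in this block that hold real data.
        nvalid = min(blocksize, length(data) - first_idx + 1)
        # Stand-in for the block's shared-memory buffer, zero-initialized.
        buffer = zeros(T, blocksize)
        buffer[1:nvalid] .= data[first_idx:(first_idx + nvalid - 1)]
        # Unmasked variant reduces every slot; masked variant only the valid ones.
        slots = mask_tail ? buffer[1:nvalid] : buffer
        push!(partials, reduce(op, slots))
    end
    return reduce(op, partials)
end

# 5 points with blocks of 4: the last block has a single valid slot.
data = [3.0, 5.0, 2.0, 7.0, 4.0]
polluted = blocked_reduce(min, data, 4; mask_tail = false)  # 0.0, padding wins
correct = blocked_reduce(min, data, 4; mask_tail = true)    # 2.0
```

In this sketch a sum is unaffected by the padding because 0 is the identity for `+`, which may explain why the problem surfaced for `minimum` in particular.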

Closes #2097

@Sbozzolo Sbozzolo force-pushed the gb/fix_cuda_reductions branch from 63a7ef8 to cafbbcf Compare December 18, 2024 00:59
@Sbozzolo Sbozzolo force-pushed the gb/fix_cuda_reductions branch 5 times, most recently from b0eea6e to 0298139 Compare December 19, 2024 16:02
@sriharshakandala
Member

One option is to use binary-op appropriate initialization. For example,

function _init_val_for_reduction(f::Function, ::Type{T}) where {T}
    f == min && return typemax(T)
    f == max && return typemin(T)
    return T(0)
end

with

reduction[tidx] = _init_val_for_reduction(op, T)

@Sbozzolo
Member Author

> One option is to use binary-op appropriate initialization. For example,
>
> function _init_val_for_reduction(f::Function, ::Type{T}) where {T}
>     f == min && return typemax(T)
>     f == max && return typemin(T)
>     return T(0)
> end
>
> with
>
> reduction[tidx] = _init_val_for_reduction(op, T)

This would require defining the init value for every function, which doesn't seem optimal.

Is there any issue with the fix in this PR?
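
The trade-off in this exchange can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the implementation merged here: `identity_for`, `reduce_padded`, and `reduce_valid_only` are hypothetical names. Approach A pads with an operator-specific identity and therefore needs one method per operator; approach B only ever touches valid elements and so works for any associative `op` without knowing its identity.

```julia
# Approach A: pad with an operator-specific identity.
# `identity_for` is a hypothetical helper; each operator needs its own method.
identity_for(::typeof(min), ::Type{T}) where {T} = typemax(T)
identity_for(::typeof(max), ::Type{T}) where {T} = typemin(T)
identity_for(::typeof(+), ::Type{T}) where {T} = zero(T)
identity_for(::typeof(*), ::Type{T}) where {T} = one(T)

function reduce_padded(op, data::Vector{T}, blocksize::Int) where {T}
    # Round the length up to a whole number of blocks and fill with the identity.
    padded = fill(identity_for(op, T), cld(length(data), blocksize) * blocksize)
    padded[1:length(data)] .= data
    return reduce(op, padded)
end

# Approach B: reduce only the valid elements of each block, so no identity is needed.
function reduce_valid_only(op, data::Vector, blocksize::Int)
    partials = [reduce(op, data[i:min(i + blocksize - 1, length(data))])
                for i in 1:blocksize:length(data)]
    return reduce(op, partials)
end

data = [3.0, 5.0, 2.0, 7.0, 4.0]
a_min = reduce_padded(min, data, 4)
b_min = reduce_valid_only(min, data, 4)
# `gcd` has no `identity_for` method, yet approach B handles it.
b_gcd = reduce_valid_only(gcd, [12, 18, 24, 30, 42], 4)
```

Note that a catch-all fallback of zero, as in the quoted suggestion, would be incorrect for operators such as `*`, which is one reason a per-operator table is hard to keep complete.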

@@ -31,6 +31,34 @@ function mapreduce_cuda(
weighted_jacobian = OnesArray(parent(data)),
opargs...,
)
# This function implements the following parallel reduction algorithm:
#
# Blocks processes multiple data points at the same time (n_ops_on_load)
Member

Each thread loads multiple data points in shmem!
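
The load phase described in this review comment could look roughly like the CPU emulation below. Only `n_ops_on_load` comes from the diff; `load_block`, `nthreads`, `active`, and the strided index layout are assumptions made for illustration and may differ from the actual kernel.

```julia
# CPU emulation of one block's load phase (illustrative only).
# All names except `n_ops_on_load` are hypothetical, and the index layout
# is an assumption rather than the kernel's actual layout.
function load_block(op, data::Vector{T}, block::Int, nthreads::Int, n_ops_on_load::Int) where {T}
    shmem = Vector{T}(undef, nthreads)  # stand-in for shared memory
    active = falses(nthreads)           # threads that loaded at least one real point
    block_start = (block - 1) * nthreads * n_ops_on_load
    for tidx in 1:nthreads
        for k in 0:(n_ops_on_load - 1)
            # Strided layout: consecutive threads read consecutive elements.
            i = block_start + k * nthreads + tidx
            i > length(data) && break   # past the end: nothing more for this thread
            shmem[tidx] = active[tidx] ? op(shmem[tidx], data[i]) : data[i]
            active[tidx] = true
        end
    end
    return shmem, active
end

data = collect(1.0:10.0)
shmem, active = load_block(min, data, 1, 4, 2)    # block 1 covers elements 1:8
shmem2, active2 = load_block(min, data, 2, 4, 2)  # block 2 covers elements 9:10 only
```

In the second block only two threads hold real data, so a subsequent in-block reduction would have to skip the inactive slots rather than read whatever value they contain.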

@Sbozzolo Sbozzolo force-pushed the gb/fix_cuda_reductions branch from 0298139 to 8cdf3f3 Compare December 20, 2024 22:29
@Sbozzolo Sbozzolo enabled auto-merge December 20, 2024 22:29
@Sbozzolo Sbozzolo merged commit 6539b89 into main Dec 21, 2024
32 of 34 checks passed
@Sbozzolo Sbozzolo deleted the gb/fix_cuda_reductions branch December 21, 2024 00:20
Development

Successfully merging this pull request may close these issues.

Minimum on a field does not return the correct value on GPU
3 participants