Allow remapper to take multiple Fields #1669

Merged (3 commits) on Apr 22, 2024
Conversation

@Sbozzolo (Member) commented Apr 11, 2024

This PR increases the rank of some internal arrays of the remapper by 1. The new dimension allows the remapper to process multiple Fields at the same time.

The idea is the following: when creating a Remapper, one can specify a buffer_length. When buffer_length is larger than one, the Remapper preallocates enough space to interpolate buffer_length Fields at the same time. Then, when interpolate is called, the Remapper can work with any number of Fields, and the work is divided into batches of buffer_length size.

E.g., if buffer_length is 10 and 22 fields are to be interpolated, the work is processed in groups of 10 + 10 + 2. The only cost of choosing a large buffer_length is memory (there shouldn't be any runtime penalty in interpolating 1 field with a buffer_length of 100).
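
For illustration, the 10 + 10 + 2 split can be reproduced with the tiny sketch below (Iterators.partition is standard Julia; the actual Remapper internals may differ):

```julia
# Sketch of the batching scheme: 22 fields with buffer_length = 10 are
# processed in groups of 10, 10, and 2.
buffer_length = 10
fields = 1:22                                  # stand-ins for 22 Fields
batches = collect(Iterators.partition(fields, buffer_length))
println(length.(batches))                      # prints [10, 10, 2]
```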

The memory cost of a Remapper scales as B × H × V, where B is the buffer length, H is the number of horizontal points, and V is the number of vertical points. For H = 180 × 90 and V = 50, each buffer costs 51_840_000 bytes (~50 MB) for double precision on the root process, plus 50 MB / N_tasks on each task, plus a base cost that is independent of B.

In terms of functionality, this implementation is strictly a superset of the previous one, and it is almost fully backward compatible.

To make it work nicely with CUDA, I had to add a constraint to interpolate!: the destination array has to be a CuArray. Previously, the destination array could be a normal Array and the code would copy it over automatically. This does not seem possible using views (without scalar indexing). As such, this PR is technically breaking. To make the interface uniform, interpolate returns an array of the same type as the field (e.g., a CuArray for a Field defined on the GPU). It is up to the user to move it to the CPU. I actually think this is better behavior.
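
A hedged usage sketch of the new contract is shown below; the constructor arguments (space, target_hcoords, target_zcoords) and the buffer_length keyword are assumptions based on this description, not a verbatim copy of the API:

```julia
# Sketch only: names and signatures below are assumptions, not the exact API.
using ClimaCore.Remapping: Remapper, interpolate, interpolate!
using CUDA

# `space`, `target_hcoords`, `target_zcoords`, and `gpu_field` are placeholders
# assumed to be defined elsewhere (a GPU-backed space and Field).
remapper = Remapper(space, target_hcoords, target_zcoords; buffer_length = 10)

# interpolate returns an array of the same type as the field (a CuArray for a
# GPU Field); moving the result to the CPU is now the caller's responsibility.
gpu_result = interpolate(remapper, gpu_field)
cpu_result = Array(gpu_result)

# With interpolate!, the destination must be a CuArray for GPU fields; a plain
# Array is no longer copied into automatically.
dest = CUDA.zeros(Float64, size(gpu_result))
interpolate!(dest, remapper, gpu_field)
```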

The new module has additional allocations that lead to a ~50% slowdown (resulting in a 1-2% reduction in SYPD). I haven't spent too much time trying to hyper-optimize this, and I think this PR is already in good shape to be merged. In the future, we can try to find ways to further reduce allocations and improve performance. At that point, we should also look at data locality a little more closely (I haven't put much thought into the order of the for loops in the GPU kernels).

@Sbozzolo force-pushed the gb/remapper_nfields branch 13 times, most recently from 767d0aa to cd9f632 on April 15, 2024 20:21
@Sbozzolo Sbozzolo marked this pull request as ready for review April 15, 2024 22:05
@Sbozzolo force-pushed the gb/remapper_nfields branch 7 times, most recently from f0bfd0d to 3270fda on April 18, 2024 22:02
@charleskawczynski (Member) commented:

I looked through the changes in more detail; I think there are some internals that could be improved, but I'd rather not delay the PR. The API looks much more suitable for further internal optimization, so I'm happy with the new design. Remapper and interpolate! are the main user-facing pieces, correct? Also, what is the 3rd dimension of field_values? It seems like it's always indexed into via nothing; is this increasing the stride? (Again, these are internals, so not super important.)

Just for future reference, regarding potential internal improvements, we could:

  • Move buffer_length into the type space, so that the compiler can leverage that being a constant
  • Make colons in Remapper lazy (and put num_dims in the type space), so that the compiler can leverage that being a constant
  • Reduce code duplication between the kernels by using multidimensional techniques described in https://julialang.org/blog/2016/02/iteration/
  • Address the `# TODO: Check the memory access pattern. This was not optimized and likely inefficient!` comment
  • Change from using step ranges (e.g., hindex:totalThreadsX:num_horiz) to while loops to alleviate register pressure (https://cuda.juliagpu.org/dev/tutorials/performance/#Avoiding-StepRange); a sketch of this pattern follows the list
  • Effectively move _reset_interpolated_values! into _set_interpolated_values! (_set_interpolated_values! should also initialize to zero)
  • See if we can fix allocations by delaying the use of views, and instead passing statically known ranges
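
For reference, the StepRange-to-while-loop change suggested above looks roughly like the generic grid-stride sketch below (CUDA.jl); this is not the remapper's actual kernel:

```julia
using CUDA

# StepRange pattern: iterating `hindex:stride:num_horiz` can increase
# register pressure in the generated kernel.
function kernel_steprange!(out, num_horiz)
    stride = gridDim().x * blockDim().x
    hindex = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for h in hindex:stride:num_horiz
        out[h] = h
    end
    return nothing
end

# Equivalent while-loop pattern recommended by the CUDA.jl performance tutorial.
function kernel_while!(out, num_horiz)
    stride = gridDim().x * blockDim().x
    h = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    while h <= num_horiz
        out[h] = h
        h += stride
    end
    return nothing
end

out = CUDA.zeros(Int, 1000)
@cuda threads = 256 blocks = 2 kernel_while!(out, length(out))
```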

@Sbozzolo (Member, Author) replied:

Thank you!

Yes, the public interface is the constructors for Remapper, interpolate, and interpolate! and everything else should be considered internal.

field_values is used to implement Operators.get_node more efficiently; as such, it mirrors its structure:

```julia
Base.@propagate_inbounds function get_node(
    parent_space,
    field::Fields.Field,
    ij::CartesianIndex{2},
    slabidx,
)
    space = reconstruct_placeholder_space(axes(field), parent_space)
    i, j = Tuple(ij)
    if space isa Spaces.FaceExtrudedFiniteDifferenceSpace
        v = slabidx.v + half
    elseif space isa Spaces.CenterExtrudedFiniteDifferenceSpace ||
           space isa Spaces.AbstractSpectralElementSpace
        v = slabidx.v
    else
        error("invalid space")
    end
    h = slabidx.h
    fv = Fields.field_values(field)
    return fv[i, j, nothing, v, h]
end
```

The third index is probably the vertical index, and I think that it is nothing because we are working with fixed horizontal slabs.

@Sbozzolo Sbozzolo merged commit 3faa8da into main Apr 22, 2024
16 checks passed
@Sbozzolo Sbozzolo deleted the gb/remapper_nfields branch April 22, 2024 20:20