Allow remapper to take multiple Fields #1669
I looked through the changes in more detail. I think there are some internals that could be improved, but I'd rather not delay the PR. The API looks much more suitable for further internal optimization, so I'm happy with the new design. Just for future reference, regarding potential internal improvements, we could:
Thank you! Yes, the public interface is the constructors for Remapper, together with `interpolate` and `interpolate!`.
ClimaCore.jl/src/Operators/spectralelement.jl, lines 537 to 556 (at 79039d4):
The third index is probably the vertical index.
This PR increases the rank of some internal arrays of the remapper by 1. The new dimension is to allow the remapper to process multiple Fields at the same time.
The idea is the following: when creating a Remapper, one can specify a buffer_length. When buffer_length is larger than one, the Remapper will preallocate enough space to interpolate buffer_length Fields at the same time. Then, when `interpolate` is called, the Remapper can work with any number of Fields, and the work is divided into batches of buffer_length size. E.g., if buffer_length is 10 and 22 Fields are to be interpolated, the work is processed in groups of 10 + 10 + 2 (see the sketch below). The only cost of choosing a large buffer_length is memory (there shouldn't be any runtime penalty in interpolating 1 Field with a buffer_length of 100).
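For concreteness, here is a minimal sketch of the batching logic described above (an illustration only, not the actual implementation; `batch_ranges` is a hypothetical helper):

```julia
# Split `num_fields` jobs into consecutive batches of at most `buffer_length`.
# Hypothetical helper, for illustration only.
function batch_ranges(num_fields, buffer_length)
    return [i:min(i + buffer_length - 1, num_fields) for i in 1:buffer_length:num_fields]
end

batch_ranges(22, 10)   # [1:10, 11:20, 21:22], i.e. batches of 10 + 10 + 2
```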
The memory cost of a Remapper scales as O(B × H × V), where B is the buffer length, H is the number of horizontal points, and V is the number of vertical points. For H = 180 × 90 and V = 50, this means that each buffer costs 51_840_000 bytes (50 MB) in double precision on the root process, plus 50 MB / N_tasks on each task, plus a base cost that is independent of B.
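As a rough back-of-the-envelope check of the scaling (an illustrative sketch only; the actual per-buffer cost is a multiple of this, depending on how many internal arrays the Remapper keeps per buffer slot):

```julia
# Rough estimate for one internal array of size H × V in double precision.
H = 180 * 90                               # horizontal interpolation points
V = 50                                     # vertical interpolation points
bytes_per_array = H * V * sizeof(Float64)  # 6_480_000 bytes per array per buffer slot
```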
When it comes to functionality, this implementation is strictly a superset of the previous one, and it is almost fully backward compatible.
To make it work nicely with CUDA, I had to add a constraint in `interpolate!`: the destination array has to be a CuArray. Previously, the destination array could be a normal Array and the code would copy the data over automatically. This does not seem possible using views (without scalar indexing). As such, this PR is technically breaking. To make the interface uniform, `interpolate` will return an array of the same type as the Field (e.g., a CuArray for a Field defined on the GPU). It is up to the user to move it to the CPU (see the sketch below). I actually think that this is a better behavior.

The new module has additional allocations that lead to a ~50% slowdown (resulting in a reduction of 1-2% in SYPD). I haven't spent too much time trying to hyper-optimize this, and I think that this PR is already in good shape to be merged. In the future, we can try to find ways to further reduce allocations and improve performance. At that point, we should also look at data locality more carefully (I haven't put too much thought into the order of the for loops in the GPU kernels).
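As a sketch of the GPU-facing behavior described above (`remapper` and `gpu_field` are placeholders, the destination shape is illustrative, and the exact call signatures are assumptions, not the definitive API):

```julia
using CUDA

# Illustrative sizes taken from the example above (180 × 90 horizontal points,
# 50 vertical points); the actual destination shape is whatever the Remapper targets.
num_hpoints, num_vpoints = 180 * 90, 50

# The destination passed to interpolate! now has to be a CuArray for a GPU Field;
# a plain Array is no longer copied over automatically.
dest = CUDA.zeros(Float64, num_hpoints, num_vpoints)
interpolate!(dest, remapper, gpu_field)

# interpolate returns an array of the same type as the Field (a CuArray here);
# moving the result to host memory is left to the caller.
cpu_result = Array(interpolate(remapper, gpu_field))
```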