Allow remapper to take multiple Fields #1669
I looked through the changes in more detail. I think there are some internals that could be improved, but I'd rather not delay the PR. The API looks much more suitable for further internal optimization, so I'm happy with the new design. Just for future reference, regarding potential internal improvements, we could:
Thank you! Yes, the public interface is the constructors for Remapper, together with `interpolate` and `interpolate!`.
ClimaCore.jl/src/Operators/spectralelement.jl, lines 537 to 556 (at 79039d4):
The third index is probably the vertical index.
This PR increases the rank of some internal arrays of the remapper by 1. The new dimension is to allow the remapper to process multiple Fields at the same time.
The idea is the following: when creating a Remapper, one can specify a buffer_length. When buffer_length is larger than one, the Remapper will preallocate enough space to interpolate buffer_length Fields at the same time. Then, when `interpolate` is called, the Remapper can work with any number of Fields, and the work is divided into batches of buffer_length size. E.g., if buffer_length is 10 and 22 Fields are to be interpolated, the work is processed in groups of 10 + 10 + 2 (see the sketch below). The only cost of choosing a large buffer_length is memory (there shouldn't be any runtime penalty in interpolating 1 Field with a buffer_length of 100).
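For concreteness, here is a minimal sketch of the batching logic described above (an illustration only, not the actual implementation; `batch_ranges` is a hypothetical helper):

```julia
# Split `num_fields` jobs into consecutive batches of at most `buffer_length`.
# Hypothetical helper, for illustration only.
function batch_ranges(num_fields, buffer_length)
    return [i:min(i + buffer_length - 1, num_fields) for i in 1:buffer_length:num_fields]
end

batch_ranges(22, 10)   # [1:10, 11:20, 21:22], i.e. batches of 10 + 10 + 2
```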
The memory cost of a Remapper scales as O(B × H × V), where B is the buffer length, H is the number of horizontal points, and V is the number of vertical points. For H = 180 × 90 and V = 50, this means that each buffer costs 51_840_000 bytes (50 MB) in double precision on the root process, plus 50 MB / N_tasks on each task, plus a base cost that is independent of B.
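As a rough back-of-the-envelope check of the scaling (an illustrative sketch only; the actual per-buffer cost is a multiple of this, depending on how many internal arrays the Remapper keeps per buffer slot):

```julia
# Rough estimate for one internal array of size H × V in double precision.
H = 180 * 90                               # horizontal interpolation points
V = 50                                     # vertical interpolation points
bytes_per_array = H * V * sizeof(Float64)  # 6_480_000 bytes per array per buffer slot
```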
When it comes to functionality, this implementation is strictly a superset of the previous one, and it is almost fully backward compatible.
To make it work nicely with CUDA, I had to add a constraint in `interpolate!`: the destination array has to be a CuArray. Previously, the destination array could be a normal Array and the code would copy the data over automatically. This does not seem possible using views (without scalar indexing). As such, this PR is technically breaking. To make the interface uniform, `interpolate` will return an array of the same type as the Field (e.g., a CuArray for a Field defined on the GPU). It is up to the user to move it to the CPU (see the sketch below). I actually think that this is a better behavior.

The new module has additional allocations that lead to a ~50% slowdown (resulting in a reduction of 1-2% in SYPD). I haven't spent too much time trying to hyper-optimize this, and I think that this PR is already in good shape to be merged. In the future, we can try to find ways to further reduce allocations and improve performance. At that point, we should also look at data locality more carefully (I haven't put too much thought into the order of the for loops in the GPU kernels).
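As a sketch of the GPU-facing behavior described above (`remapper` and `gpu_field` are placeholders, the destination shape is illustrative, and the exact call signatures are assumptions, not the definitive API):

```julia
using CUDA

# Illustrative sizes taken from the example above (180 × 90 horizontal points,
# 50 vertical points); the actual destination shape is whatever the Remapper targets.
num_hpoints, num_vpoints = 180 * 90, 50

# The destination passed to interpolate! now has to be a CuArray for a GPU Field;
# a plain Array is no longer copied over automatically.
dest = CUDA.zeros(Float64, num_hpoints, num_vpoints)
interpolate!(dest, remapper, gpu_field)

# interpolate returns an array of the same type as the Field (a CuArray here);
# moving the result to host memory is left to the caller.
cpu_result = Array(interpolate(remapper, gpu_field))
```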