[cartesian] Loop blocking in vertical direction #1618

Closed
xyuan opened this issue Aug 20, 2024 · 3 comments

xyuan commented Aug 20, 2024

In the current implementation, the library node expansion pattern does not support loop blocking in the vertical direction. The generated C++ code is shown below:

    {
        #pragma omp parallel for collapse(2)
        for (auto __tile_j = 0; __tile_j < __J; __tile_j += 8) {
            for (auto __tile_i = 0; __tile_i < __I; __tile_i += 8) {
                {
                    for (auto __i = (__tile_i - 1); __i < ((__tile_i + Min(8, (__I - __tile_i))) + 1); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            for (auto __k = 0; __k < __K; __k += 1) {
                                nested_state_0_0_0_0_11(__state, &dxa[((__dxa_I_stride * __i) + (__dxa_J_stride * __j))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 1)))], &__0_al[(((__K * (__j - __tile_j)) + ((8 * __K) * ((__i - __tile_i) + 1))) + __k)], __I, __K, __dxa_I_stride, __dxa_J_stride, __i, __q_I_stride, __q_J_stride, __q_K_stride);
                            }
                        }
                    }
                }
                {
                    for (auto __i = __tile_i; __i < (__tile_i + Min(8, (__I - __tile_i))); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            for (auto __k = 0; __k < __K; __k += 1) {
                                nested_state_0_0_0_0_12(__state, &__0_al[((((8 * __K) * (__i - __tile_i)) + (__K * (__j - __tile_j))) + __k)], &courant[(((__courant_I_stride * __i) + (__courant_J_stride * __j)) + (__courant_K_stride * __k))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 2)))], &xflux[(((__i * __xflux_I_stride) + (__j * __xflux_J_stride)) + (__k * __xflux_K_stride))], __K, __courant_I_stride, __courant_J_stride, __courant_K_stride, __q_I_stride, __q_J_stride, __q_K_stride, __xflux_I_stride, __xflux_J_stride, __xflux_K_stride);
                            }
                        }
                    }
                }
            }
        }
    }

}

This is fine when the vertical loop is small. However, when we run at high resolution with a larger vertical dimension, the vertical loop is no longer cache friendly. We will need to add a new library node expansion pattern that supports loop blocking in the vertical direction, so that more cache-friendly code can be generated, as follows:

    {
        #pragma omp parallel for collapse(2)
        for (auto __tile_j = 0; __tile_j < __J; __tile_j += 8) {
            for (auto __tile_i = 0; __tile_i < __I; __tile_i += 8) {
                 **for (auto __tile_k = 0; __tile_k < __K; __tile_k += 8) {**
                {
                    for (auto __i = (__tile_i - 1); __i < ((__tile_i + Min(8, (__I - __tile_i))) + 1); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            **for (auto __k = __tile_k; __k < (__tile_k + Min(8, (__K - __tile_k))); __k += 1) {**
                                nested_state_0_0_0_0_11(__state, &dxa[((__dxa_I_stride * __i) + (__dxa_J_stride * __j))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 1)))], &__0_al[(((__K * (__j - __tile_j)) + ((8 * __K) * ((__i - __tile_i) + 1))) + __k)], __I, __K, __dxa_I_stride, __dxa_J_stride, __i, __q_I_stride, __q_J_stride, __q_K_stride);
                            }
                        }
                    }
                }
                {
                    for (auto __i = __tile_i; __i < (__tile_i + Min(8, (__I - __tile_i))); __i += 1) {
                        for (auto __j = __tile_j; __j < (__tile_j + Min(8, (__J - __tile_j))); __j += 1) {
                            **for (auto __k = __tile_k; __k < (__tile_k + Min(8, (__K - __tile_k))); __k += 1) {**
                                nested_state_0_0_0_0_12(__state, &__0_al[((((8 * __K) * (__i - __tile_i)) + (__K * (__j - __tile_j))) + __k)], &courant[(((__courant_I_stride * __i) + (__courant_J_stride * __j)) + (__courant_K_stride * __k))], &q[(((__j * __q_J_stride) + (__k * __q_K_stride)) + (__q_I_stride * (__i + 2)))], &xflux[(((__i * __xflux_I_stride) + (__j * __xflux_J_stride)) + (__k * __xflux_K_stride))], __K, __courant_I_stride, __courant_J_stride, __courant_K_stride, __q_I_stride, __q_J_stride, __q_K_stride, __xflux_I_stride, __xflux_J_stride, __xflux_K_stride);
                            }
                        }
                    }
                }
              }
            }
        }
    }

}

The lines marked with `**` (intended as bold) are the new code generated by the new library node expansion pattern. The corresponding PR will be linked here for further review once it is opened.
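
To make the idea easier to discuss independently of the generated code, here is a minimal, self-contained sketch of vertical loop blocking. It is not the DaCe expansion itself: `update_point`, `run_unblocked`, `run_k_blocked`, the contiguous K-fastest layout, and the `tile_k` parameter are all hypothetical names and assumptions chosen for illustration. It only shows that a `__tile_k`-style outer loop with a `min`-clamped upper bound visits exactly the same points as the plain loop over `K`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-point work done by nested_state_*.
inline double update_point(double q) { return 2.0 * q + 1.0; }

// Assumed layout for this sketch only: contiguous storage, K fastest.
inline std::size_t flat_index(int i, int j, int k, int J, int K) {
    return (static_cast<std::size_t>(i) * J + j) * K + k;
}

// Unblocked traversal: the full K range is walked for every (i, j).
std::vector<double> run_unblocked(const std::vector<double>& q,
                                  int I, int J, int K) {
    std::vector<double> out(q.size());
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            for (int k = 0; k < K; ++k)
                out[flat_index(i, j, k, J, K)] =
                    update_point(q[flat_index(i, j, k, J, K)]);
    return out;
}

// Blocked traversal: an outer loop steps through K in chunks of tile_k,
// and the inner k loop is clamped so the last chunk may be shorter.
std::vector<double> run_k_blocked(const std::vector<double>& q,
                                  int I, int J, int K, int tile_k) {
    assert(tile_k > 0);
    std::vector<double> out(q.size());
    for (int tile_start = 0; tile_start < K; tile_start += tile_k) {
        // Same role as Min(8, __K - __tile_k) in the generated code.
        const int tile_end = std::min(tile_start + tile_k, K);
        for (int i = 0; i < I; ++i)
            for (int j = 0; j < J; ++j)
                for (int k = tile_start; k < tile_end; ++k)
                    out[flat_index(i, j, k, J, K)] =
                        update_point(q[flat_index(i, j, k, J, K)]);
    }
    return out;
}
```

Calling `run_k_blocked(q, I, J, K, 8)` on a field whose `K` is not a multiple of 8 should give the same result as `run_unblocked(q, I, J, K)`. Whether blocking actually improves cache behaviour depends on the real strides and on the tile sizes in the other dimensions, which this sketch does not model.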

@FlorianDeconinck FlorianDeconinck changed the title from "loop blocking in vertical direction" to "[cartesian] Loop blocking in vertical direction" Aug 20, 2024
@FlorianDeconinck
Contributor

Pinging @twicki and @romanc to track this for NASA.

@romanc
Contributor

romanc commented Jan 22, 2025

Note: this issue is missing the gt4py.cartesian label.

@FlorianDeconinck
Contributor

This is a limitation of the dace:X backends that comes from how we set up the sections in the DaCe extensions. It will be fixed as part of a more holistic review of CPU optimization, since vertical loop blocking is only one of many shortcomings of the CPU strategy with dace:X backends.

@FlorianDeconinck FlorianDeconinck closed this as not planned Jan 22, 2025