Templated MD arrays #39

Closed
wants to merge 27 commits into from

Conversation

@mauneyc-LANL (Collaborator) commented Oct 2, 2023

PR Summary

This is a prototype for "templated MD", that is, the reduction of code of the form

PORTABLE_FUNCTION PortableMDArray(T *data, int nx1) noexcept
      : pdata_(data), nx1_(nx1), nx2_(1), nx3_(1), nx4_(1), nx5_(1), nx6_(1) {}
  PORTABLE_FUNCTION
  PortableMDArray(T *data, int nx2, int nx1) noexcept
      : pdata_(data), nx1_(nx1), nx2_(nx2), nx3_(1), nx4_(1), nx5_(1), nx6_(1) {
  }
  PORTABLE_FUNCTION
  PortableMDArray(T *data, int nx3, int nx2, int nx1) noexcept
      : pdata_(data), nx1_(nx1), nx2_(nx2), nx3_(nx3), nx4_(1), nx5_(1),
        nx6_(1) {}

To

  template <typename... NXs>
  PORTABLE_FUNCTION PortableMDArray(T *p, NXs... nxs) noexcept
      : PortableMDArray(p, nx_arr(nxs...), sizeof...(NXs)) {}

The single-value dimension variables int nxN_ have been replaced with a statically sized container std::array<size_t, MAXDIM> nxs_. I also added a PortableMDArray::rank_ member to avoid some reevaluations, though I'm not sure it's necessary.
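For illustration, a minimal sketch of the new member layout (names taken from the description above; the padding-with-1 behavior and the exact type of rank_ are assumptions, not necessarily what the branch does):

  #include <array>
  #include <cstddef>

  constexpr std::size_t MAXDIM = 6; // maximum number of dimensions, as before

  template <typename T>
  class PortableMDArray {
    // ... constructors, operator(), etc. ...
   private:
    T *pdata_;                             // pointer to the flat data buffer (unchanged)
    std::array<std::size_t, MAXDIM> nxs_;  // per-dimension extents, replacing nx1_..nx6_
    int rank_;                             // number of extents actually supplied
  };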

Currently WIP, so documentation is wanting and a few function names/namespaces are silly/inconsistent/gibberish. I know there can be some reluctance about this coding style, so I wanted to get an "at least compiles" version up and gauge interest in iterating on it. (It actually passes a few tests, but I haven't tried spiner with this yet.)

It was working fine, why change it?

  • Update code to more modern style
  • Code is easier to maintain and diagnose
  • Allows for forward planning when C++ standards are updated
  • No change in down-facing API
  • For developers, the code is easier to read at a glance

What are the downsides?

  • Some extra boilerplate to handle C++14's constraints on variadic programming (e.g. no std::apply(), no std::tuple, no fold expressions). C++14 can be "recursion heavy" in this regard, which may degrade performance (see the sketch after this list).
  • API not "explicit" w.r.t. index ordering
  • (possibly) Longer compile times
  • For downstream users, code or errors passing through/originating in expanded template types can be agitating
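For context on the "recursion heavy" point above, here is a minimal sketch (not the exact code in the branch) of the C++14-style recursion that packs the extents into the fixed-size array; the padding-with-1 behavior of nx_arr is an assumption:

  #include <array>
  #include <cstddef>

  constexpr std::size_t MAXDIM = 6;
  using narr = std::array<std::size_t, MAXDIM>;

  // end case: write the final (fastest-moving) extent
  template <std::size_t Ind, typename NX>
  void set_value(narr &ndat, NX value) {
    ndat[Ind] = static_cast<std::size_t>(value);
  }

  // general case: write this extent, then recurse on the remaining pack
  template <std::size_t Ind, typename NX, typename... Tail>
  void set_value(narr &ndat, NX value, Tail... tail) {
    ndat[Ind] = static_cast<std::size_t>(value);
    set_value<Ind + 1>(ndat, tail...);
  }

  // pack the extents into the fixed-size array, padding unused slots with 1
  template <typename... NXs>
  narr nx_arr(NXs... nxs) {
    narr out;
    out.fill(1);
    set_value<0>(out, nxs...);
    return out;
  }

With C++17 fold expressions much of this recursion can collapse into a single expression, which is part of why the later C++17 discussion matters here.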

Any suggestions, comments, questions welcome! @jonahm-LANL @chadmeyer @dholladay00 @jhp-lanl @jdolence

PR Checklist

  • Any changes to code are appropriately documented.
  • Code is formatted.
  • Install test passes.
  • Docs build.
  • If preparing for a new release, update the version in cmake.

// set_value end-case
template <std::size_t Ind, typename NX>
constexpr void set_value(narr &ndat, NX value) {
  ndat[Ind] = value;
Collaborator

is operator[] of std::array marked constexpr? I thought this was something that was preventing more widespread usage.

Collaborator Author

Sorry, I intended to replace all of these with the appropriate PORTABLE_ macros but missed a few.

Collaborator Author

Oh, you probably meant something else. We can use a raw array, if std::array is messing with things

Collaborator

We should check that it runs on device. I thought std::array worked with --expt-relaxed-constexpr, but I'm not sure.

Collaborator

@mauneyc-LANL you may be able to use std::get<Ind>(arr) = value;. But we really need to check and make sure it works on device.

Collaborator Author

Using --expt-relaxed-constexpr, this runs smoothly on the volta-x86 partition with kokkos+cuda cuda_arch=70 +wrapper and cmake .. -DPORTS_OF_CALL_BUILD_TESTING -DPORTABILITY_STRATEGY_KOKKOS=ON

PORTABLE_FORCEINLINE_FUNCTION int GetSize() const {
  return nx1_ * nx2_ * nx3_ * nx4_ * nx5_ * nx6_;         // removed
  return std::accumulate(nxs_.cbegin(), nxs_.cend(), 1,   // added (excerpt truncated)
Collaborator

Have you tested this on GPUs? I know Brendan had an issue and we had to write a custom accumulate. One could still use binary operators provided by the STL, but the accumulate itself wouldn't run. Perhaps we did something wrong, but I want to make sure this is still GPU callable.

Collaborator

I would also be concerned about std::accumulate. Might be better to hardcode this one, or write a recursive thing ourselves.

Collaborator Author

It "worked" but gave warnings (even with --expt-relaxed-constexpr). I rewrote it as an explicit loop.

@Yurlungur (Collaborator) left a comment

This is a nice cleanup. Before merging, I'd like to:
a) Get a sense of compile time differences
b) Know for sure it all works on device

// maximum number of dimensions
constexpr std::size_t MAXDIM = 6;
// array type of dimensions/strides
using narr = std::array<std::size_t, MAXDIM>;
Collaborator

I don't love that this is lower-case. I'd find this type easier to interpret if it were, e.g.,

Suggested change
using narr = std::array<std::size_t, MAXDIM>;
using Narr_t = std::array<std::size_t, MAXDIM>;

Comment on lines 65 to 78
// compute_index base case, i.e. fastest moving index
template <std::size_t Ind>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
                                              const size_t index) {
  return index;
}

// compute_index general case, computing slower moving index strides
template <std::size_t Ind, typename... Tail>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
                                              const size_t index,
                                              const Tail... tail) {
  return index * nd[Ind] + compute_index<Ind + 1>(nd, tail...);
}
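For readers following the recursion, a rank-3 call unrolls mechanically at compile time as:

  // compute_index<0>(nd, k, j, i)
  //   == k * nd[0] + compute_index<1>(nd, j, i)
  //   == k * nd[0] + j * nd[1] + compute_index<2>(nd, i)
  //   == k * nd[0] + j * nd[1] + i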
Collaborator

This is nice---assuming the array on device issues work.

Collaborator Author

Tests complete on volta-x86, though we need to put together some more tests for ports-of-call


target_compile_options(${POCLIB}
  INTERFACE
    $<${with_cxx}:$<${ps_kokkos}:--expt-relaxed-constexpr>>)
Collaborator

In the event of +kokkos~cuda, wouldn't this result in a compile error?

Collaborator Author

Looks like it, but POC doesn't have a USE_CUDA or similar like singularity-eos does, so I need to probe which compiler is being used.

Collaborator

This is how I've done it in other projects:

  get_target_property(kokkos_link_libs Kokkos::kokkoscore
                      INTERFACE_LINK_LIBRARIES)
  string(REGEX MATCH "CUDA"
         kokkos_has_cuda_libs "${kokkos_link_libs}")
  ...
  if(kokkos_has_cuda_libs)
    # do cuda stuff...
  endif()

@mauneyc-LANL (Collaborator Author)

I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which led to the elimination of some constexpr initialization (not that it couldn't be done; it just saved time to get working code).

One question to ask is how much time PortableMDArray should devote to reconstructing internal layout configuration data - in particular, strides. These are the offset widths associated with a particular dimensional spec, e.g.

// ordering based on what MDArray is doing
dims = {NT, NZ, NY, NX};
strides = {NZ * NY * NX, NY * NX, NX, 1};

The options are:

  1. at each compute_index call, recompute the strides. This obviously costs some FLOPS but strides are simple to compute and GPU compute is cheap. Further, because the rank is known at this point, the stride data only needs to go up to rank, rather than compute/store values up to MAXDIM
  2. (current commit) at each construction or reshaping, recompute the strides (see the sketch below). This is the "natural" path but does introduce more data to move around, to a first approximation doubling the data that PortableMDArray needs to move between contexts. Contra to (1), this array needs MAXDIM storage, which is likely going to be mostly dead weight.
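As a reference point for option (2), a minimal sketch of precomputing strides from the dims array above, assuming the row-major ordering shown and reusing the narr alias; make_strides is a hypothetical helper name:

  narr make_strides(const narr &dims, std::size_t rank) {
    narr strides;
    strides.fill(1);
    // fastest-moving index has stride 1; each slower index multiplies in the
    // extent of the next-faster dimension
    for (std::size_t i = rank; i-- > 1;) {
      strides[i - 1] = strides[i] * dims[i];
    }
    return strides;
  }

For dims = {NT, NZ, NY, NX} and rank 4 this yields strides = {NZ * NY * NX, NY * NX, NX, 1}, matching the example above.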

@Yurlungur (Collaborator)

I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which led to the elimination of some constexpr initialization (not that it couldn't be done; it just saved time to get working code).

One question to ask is how much time PortableMDArray should devote to reconstructing internal layout configuration data - in particular, strides. These are the offset widths associated with a particular dimensional spec, e.g.

// ordering based on what MDArray is doing
dims = {NT, NZ, NY, NX};
strides = {NZ * NY * NX, NY * NX, NX, 1};

The options are:

1. at each `compute_index` call, recompute the strides. This obviously costs some FLOPS but strides are simple to compute and GPU compute is cheap. Further, because the rank is known at this point, the stride data only needs to go up to `rank`, rather than compute/store values up to `MAXDIM`

2. (current commit) at each construction or reshaping, recompute the strides. This is the "natural" path but does introduce more data to move around, to a first approximation doubling the data that `PortableMDArray` needs to move between contexts. Contra to (1), this array needs `MAXDIM` storage, which is likely going to be mostly dead weight.

I don't know the answer to this a priori. I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?

@mauneyc-LANL (Collaborator Author)

I don't know the answer to this a priori. I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?

It would be worth considering the context that PortableMDArray is used in. Does it reshape often, does it get moved on/off device frequently, etc.? (2.) would be "best" for host, since it avoids recomputation, but (1.) may be better for device since it would minimize data transfer (tho also it's like 48 bytes extra so probably over-optimizing here).

I've included a simple test with the PR. Since originally there wasn't very much being tested directly in poc, we may need to add some more tests for checking these details. I glanced over the spiner tests for ideas, but most of what is there looks more numerics-based.

@Yurlungur (Collaborator)

Yeah fair questions:

Does it reshape often?

No, almost never.

does it get move on/off device frequently

Also no.

(2.) would be "best" for host, since it avoids recomputation, but (1.) may be better for device since it would minimize data transfer (tho also it's like 48 bytes extra so probably over-optimizing here).

Like you said, 48 bytes is basically nothing, so I lean towards (2).

@dholladay00 (Collaborator)

It seems we previously recalculated at every access. The number of adds will be the same, and the number of multiplies saved will only be noticeable with high-rank arrays. Performance data for ranks 2-6 would be helpful in this case.

@dholladay00 (Collaborator)

The more I think about this, the more I think we should recompute the strides. Even so, I'd like to see performance differences. Integer operations can be noticeable on GPUs, so pre-computed may be more performant, but a larger object may use more registers. There are a lot of competing factors, so data is the best guide.

@dholladay00 (Collaborator)

Now that the singularity-eos 1.9.0 release has been cut, I think we should move up to C++17 and get this merged in prep for the next release.

@Yurlungur (Collaborator)

Now that the singularity-eos 1.9.0 release has been cut, I think we should move up to C++17 and get this merged in prep for the next release.

I am in favor of moving to C++17 at this time. However, I would prefer to decouple the shift to 17 from this MR. Let's just update the cmake builds to require 17 and then merge other MRs when they are ready.

@cmauney (Contributor) commented Aug 14, 2024

@jonahm-LANL @dholladay00 I've pushed some benchmarks using catch2, right now just doing index computation through contiguous memory - (i, j, k) to n. I haven't tested this on device yet but I would be interested to see the results. I intend to put in the same benchmarks on random (rather than contiguous) memory.

I can take off Draft and move forward with review if you want to get this in. There are things I'd like to modify and add, but there's a danger of code-creep and I don't want to hold things up; the PR 'as-is' should be ready to go.

@Yurlungur (Collaborator) left a comment

I like this---it's a significant cleanup from the original implementation, which was quite low-level. I want @dholladay00 's build system concerns addressed before merge. Also, just to confirm, this doesn't change the API or functionality at all, right? It's just more general.

return r;
}
PORTABLE_FORCEINLINE_FUNCTION
decltype(auto) vp_prod() {
Collaborator

is the decltype needed? Why not just an auto return value?

Collaborator Author

Eh, I'm returning a generic lambda and that's my habit. But it's not necessary here (though I don't think it's harmful either).

Collaborator

I don't think the decltype adds anything and it's a bit harder to read but I won't push back too much on this.

PORTABLE_FORCEINLINE_FUNCTION auto make_nxs_array(NX... nxs) {
  std::array<std::size_t, MAXDIM> a;
  std::array<std::size_t, N> t{static_cast<std::size_t>(nxs)...};
  for (auto i = 0; i < N; ++i) {
Collaborator

I'm gonna insist this be int or size_t. IMO this is not a style thing but a self-documenting thing. It's an index, not a double.
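As a sketch of the suggested change (the loop body is illustrative, since the excerpt above is truncated; the point is the explicit index type):

  for (std::size_t i = 0; i < N; ++i) {
    a[i] = t[i]; // illustrative body; the real loop's body is not shown in the excerpt
  }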

@cmauney (Contributor) commented Aug 15, 2024

If C++17 gets in before this is merged, then I do want to do another draft - it will make things cleaner and clearer.

@Yurlungur (Collaborator)

@mauneyc-LANL let us know when this is ready for re-review

@Yurlungur (Collaborator)

@mauneyc-LANL what's the status of this?

@mauneyc-LANL (Collaborator Author)

@jonahm-LANL The code proper is good, but I'm fleshing out the tests and benchmarks and making sure they work on other machines. That could be separated into another PR.

@Yurlungur (Collaborator)

Sounds good. I think one test on GitHub is enough to merge this, but I'd like at least one. We can do the broader test set later down the line.

@mauneyc-LANL changed the title from "Draft: Templated MD arrays" to "Templated MD arrays" on Dec 10, 2024
@mauneyc-LANL (Collaborator Author)

Ready for review @jonahm-LANL @jhp-lanl

@Yurlungur (Collaborator) left a comment

Some minor changes below. Also one high-level question: given that this infrastructure assumes contiguous data and only supports contiguous slices, is there a reason to include both sizes and strides? I think we only need to carry around sizes.

As discussed, we can skip benchmarking tests for now. Though I am somewhat concerned about performance.

Collaborator

👍 these are nice

#ifndef _PORTSOFCALL_UTILITY_INDEX_ALGO_HPP_
#define _PORTSOFCALL_UTILITY_INDEX_ALGO_HPP_

#include "../portability.hpp"
Collaborator

might be best to use <ports-of-call/portability.hpp>

Comment on lines 65 to 67
class PortableMDArray {
 public:
  static constexpr int MAXDIM = 6;
  using this_type = PortableMDArray<T, D>;
Collaborator

We should have some introspection on the maximum dimension of PortableMDArray. Can we add a member field that reports D?
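One possible shape for that, as a sketch (the member name MaxRank and the declared type of D are assumptions, not the PR's actual code):

  template <typename T, std::size_t D>
  class PortableMDArray {
   public:
    static constexpr std::size_t MaxRank = D; // compile-time introspection on the maximum dimension
    // ...
  };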

PORTABLE_INLINE_FUNCTION size_type GetRank() const noexcept { return rank_; }
template <typename... NXs>
PORTABLE_INLINE_FUNCTION void Reshape(NXs... nxs) {
  assert(util::array_reduce(IArray<D>{nxs...}, 1, std::multiplies<size_type>{}) ==
Collaborator

I know this was an assert previously, but now we have PORTABLE_REQUIRE. Let's use that instead as it will provide more useful/verbose output.
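Illustrative only: assuming PORTABLE_REQUIRE takes a condition plus a message string, the check could read along these lines (new_size and old_size stand in for the expressions in the truncated assert above):

  PORTABLE_REQUIRE(new_size == old_size,
                   "Reshape must preserve the total number of elements");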

@Yurlungur (Collaborator)

Ah regarding strides, never mind. I see @dholladay00 's point above. I forgot about it.

@Yurlungur (Collaborator)

@mauneyc-LANL I tried updating spiner to use your branch and spiner fails with a huge swath of test failures, including several segfaults. We need to resolve that before this branch can be merged. Branch jmm/test-md-refactor in spiner.

@Yurlungur (Collaborator)

I think the issue is fast_index: the original PortableMDArray implementation allows one to "pun" the dimension of the array to smaller sizes. In particular, indexing in 1D always accesses the flat index, which is a useful feature.
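A hypothetical illustration of that punning behavior, using the original constructor and operator() conventions (slowest-to-fastest index ordering):

  // data holds 2 * 3 * 4 = 24 contiguous doubles
  PortableMDArray<double> a(data, 2, 3, 4);
  double x = a(1, 2, 3); // rank-3 access -> flat index (1 * 3 + 2) * 4 + 3 = 23
  double y = a(23);      // rank-1 ("punned") access reads the same flat element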

@Yurlungur (Collaborator)

OK, I fixed the segfault, but several tests are still failing. @mauneyc-LANL these need to be fixed before merge. That said, the SAP integration effort is a higher priority. If you're ok with it, let's punt on this for now.

Comment on lines 191 to 196
if constexpr (sizeof...(Is) == 0) {
  idx = 0;
} else if constexpr (sizeof...(Is) == 1) {
  idx = get_first(idxs...);
} else {
  idx = util::fast_findex({static_cast<size_type>(idxs)...}, nxs_, strides_);
Collaborator Author

fast_findex already checks the size of the input indices. It's missing the empty case, but that should be added there.

What is the difference between

fast_findex(A const &ijk, A const &dim, A const &stride) {
  constexpr auto N = get_size(A{});
  if constexpr (N == 1) {
    return ijk[0];
  }
 //...
}

and get_first ?

Collaborator

You could modify fast index instead of adding get_first, I think. Either should be fine.

mauneyc-LANL

This comment was marked as resolved.

@Yurlungur (Collaborator)

Can you clarify?

Sure--I'll share later today. But you can also just try updating ports of call in spiner and running the tests yourself locally.

@mauneyc-LANL (Collaborator Author)

Need to rethink and reconfigure some of this work. Will try to split out the ideas into more focused PRs.
