
Adjust to handle full array types #153

Draft · wants to merge 4 commits into master

Conversation

markcmiller86 (Member)

This currently adjusts the can_apply method for cases where the dataset type is H5T_ARRAY. The set_local() method needs adjustment too. In addition, this initial work assumes all dimensions in the type are smooth. We can relax that after some experience using it.
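
For illustration, a minimal sketch (not the PR's actual diff; the helper name probe_array_type is hypothetical) of the kind of check can_apply needs, using standard HDF5 type-introspection calls:

static int
probe_array_type(hid_t type_id)
{
    if (H5Tget_class(type_id) == H5T_ARRAY)
    {
        hsize_t adims[H5S_MAX_RANK];
        int andims = H5Tget_array_ndims(type_id);
        hid_t super_id = H5Tget_super(type_id);
        size_t dsize = H5Tget_size(super_id); /* size of the constituent scalar type */
        H5Tget_array_dims2(type_id, adims);
        if (!(dsize == 4 || dsize == 8))
        {
            H5Tclose(super_id);
            return -1; /* unsupported constituent size */
        }
        /* ...fold adims into the zfp dimensions... */
        H5Tclose(super_id);
        return andims; /* rank of the array datatype */
    }
    return 0; /* not an array type; proceed as before */
}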

@markcmiller86 (Member Author)

@lindstro I have been doing development on a Raspberry Pi Zero 2W just to see what's involved there. Very nifty little machine that fits in the palm of my hand.

Anyways, this is a bit hackish at the moment (well, more so than my typical hackishness anyways).

But, this does appear to now handle array types. It is passing all tests on the pi with HDF5-1.14.5.

The CI/CD here is, for some reason, using HDF5-1.10. We need to fix that.

I need to confirm I am handling the array type dimensions in the correct order.

But, this is a start.

We need to talk about "correlated" and "uncorrelated" dimensions.

H5Z_ZFP_PUSH_AND_GOTO(H5E_PLINE, H5E_BADVALUE, 0,
#if ZFP_VERSION_NO < 0x0530
Member:

Should be <=, I believe; 4D was introduced in 0.5.4. See <= usage above.
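
In other words, the guard would presumably become:

#if ZFP_VERSION_NO <= 0x0530  /* versions through 0.5.3 lack 4D support */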

@@ -183,29 +205,29 @@ H5Z_zfp_can_apply(hid_t dcpl_id, hid_t type_id, hid_t chunk_space_id)

 if (!(dsize == 4 || dsize == 8))
     H5Z_ZFP_PUSH_AND_GOTO(H5E_PLINE, H5E_BADTYPE, 0,
-                          "requires datatype size of 4 or 8");
+                          "requires primitive datatype size of 4 or 8");
Member:

What is dsize for arrays? Is it still the constituent scalar type?

markcmiller86 (Member Author):

That's right. In the case of an array type, we overwrite dsize with the size of the array's super type.

Now that I am thinking about it, it could be an array of an array. I mean, that is one way of doing 2D arrays. And this code assumes there is only one level of indirection to get to a primitive, language-native type.
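
A sketch of how that could be relaxed later (hypothetical helper, not in this PR): keep taking super types until something other than H5T_ARRAY turns up.

static hid_t
unwrap_array_type(hid_t type_id)
{
    hid_t t = H5Tcopy(type_id);
    while (H5Tget_class(t) == H5T_ARRAY)
    {
        hid_t super = H5Tget_super(t); /* peel off one level of array */
        H5Tclose(t);
        t = super;
    }
    return t; /* the primitive (e.g., float) type; caller must H5Tclose it */
}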

@lindstro (Member)

> But, this does appear to now handle array types. It is passing all tests on the pi with HDF5-1.14.5.

Very cool. But how is zfp informed of the array dimensions? I see no obvious part of the code where the H5T_ARRAY dimensions get passed on to the zfp_field struct.

> I need to confirm I am handling the array type dimensions in the correct order.

Right, and presumably those would be the fastest varying dimensions. Thus, you can't just add extra outer dimensions to zfp_field. For example, an nx * ny array of vectors of length nz needs to be fed to zfp as (nz, nx, ny), with the leftmost dimension varying fastest, while an nx * ny array of floats would be specified as (nx, ny). So there ought to be some extra logic that consolidates dependent and independent dimensions.
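
For illustration, a sketch of that consolidation (hypothetical names, assuming both HDF5 dimension lists arrive in C order, slowest varying first, while zfp wants fastest first):

static int
consolidate_dims(int nrange, const hsize_t* range_dims,   /* H5T_ARRAY dims */
                 int ndomain, const hsize_t* domain_dims, /* dataset/chunk dims */
                 hsize_t* zfp_dims) /* out: fastest first, length nrange+ndomain */
{
    int i, n = 0;
    for (i = nrange - 1; i >= 0; i--)  /* range dims vary fastest */
        zfp_dims[n++] = range_dims[i];
    for (i = ndomain - 1; i >= 0; i--) /* then the domain dims */
        zfp_dims[n++] = domain_dims[i];
    return n; /* total rank; must not exceed zfp's limit of 4 */
}

For the vector example, range_dims = {nz} and domain_dims = {ny, nx} (C order) yield (nz, nx, ny), as described.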

> We need to talk about "correlated" and "uncorrelated" dimensions.

Yes. As you pointed out to me, chunking is the way to deal with this in the domain, but no such capability exists for the range. In most cases, I would expect all dimensions in the range to be uncorrelated, though we've seen an example where two such dimensions are correlated. So somehow the user must pass this information to the filter.

@markcmiller86 (Member Author)

> But, this does appear to now handle array types. It is passing all tests on the pi with HDF5-1.14.5.

> Very cool. But how is zfp informed of the array dimensions? I see no obvious part of the code where the H5T_ARRAY dimensions get passed on to the zfp_field struct.

Well, "...passing all tests..." gives the wrong impression. It passes all the tests we had before plus a couple of new ones I added involving H5T_ARRAY types. And, I can drive it to some large array types too and it still works.

> Right, and presumably those would be the fastest varying dimensions. Thus, you can't just add extra outer dimensions to zfp_field. For example, an nx * ny array of vectors of length nz needs to be fed to zfp as (nz, nx, ny), with the leftmost dimension varying fastest, while an nx * ny array of floats would be specified as (nx, ny). So there ought to be some extra logic that consolidates dependent and independent dimensions.

Ok, so I am definitely doing it wrong then because I am letting the array type represent the fastest varying dimensions and just sort of appending the dataset dimensions onto it. Well, that is what I think I am doing. I need to develop an example that produces data I can then examine with HDF5 tools and confirm it's organized how I expect.

So, I would say...this is probably crap at the moment. But, I am learning what's involved in handling it correctly.

BTW...what about complex numbers and ZFP? HDF5 recently added some support for them as a special case of the H5T_COMPOUND type...I think.

@lindstro (Member)

> Right, and presumably those would be the fastest varying dimensions. Thus, you can't just add extra outer dimensions to zfp_field. For example, an nx * ny array of vectors of length nz needs to be fed to zfp as (nz, nx, ny), with the leftmost dimension varying fastest, while an nx * ny array of floats would be specified as (nx, ny). So there ought to be some extra logic that consolidates dependent and independent dimensions.

> Ok, so I am definitely doing it wrong then because I am letting the array type represent the fastest varying dimensions

By "array type," do you mean the H5T_ARRAY datum type (the range dimensions) or the dataset domain dimensions? Let's take an example:

typedef float tensor_type[3][2];

tensor_type tensor_field[5][4];

So we have a 4 x 5 tensor field of rank-2 tensors of size 2 x 3. In this case, you essentially end up with a 4D C array layout float array[5][4][3][2] and you would then pass this to zfp as nx = 2, ny = 3, nz = 4, nw = 5, with x (part of tensor_type) being the fastest varying dimension. Here tensor_type is an H5T_ARRAY.

So I think you're thinking of this correctly, but whereas normally we would think of nx = 4, ny = 5 in terms of the (outer) domain dimensions, we'd now have to bump those to nz and nw as we have two additional (inner) range dimensions. That's why I said you can't just "append" dimensions--they must be remapped from how you normally treat them when dealing with scalar fields.
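
For the record, a sketch of that mapping through zfp's field API (zfp_field_4d is the real zfp call; the wrapper is hypothetical and assumes the contiguous C layout above):

static zfp_field*
make_tensor_field(float data[5][4][3][2])
{
    /* dimensions are passed fastest-first: nx = 2, ny = 3, nz = 4, nw = 5 */
    return zfp_field_4d(data, zfp_type_float, 2, 3, 4, 5);
}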

> and just sort of appending the dataset dimensions onto it.

Appending vs. prepending can be a bit confusing since C and HDF5 use C order (rightmost dimension varies fastest) while zfp uses Fortran order (leftmost dimension varies fastest) to describe the layout. The point I was trying to make is that you can't just assign the fastest-varying domain dimension to nx in case the datum type is H5T_ARRAY.

> Well, that is what I think I am doing. I need to develop an example that produces data I can then examine with HDF5 tools and confirm it's organized how I expect.

Right. Though it would be difficult to imagine that HDF5 would not treat this like the C code example above, where the dimensions in the datum type (range) vary fastest.

> So, I would say...this is probably crap at the moment. But, I am learning what's involved in handling it correctly.

> BTW...what about complex numbers and ZFP? HDF5 recently added some support for them as a special case of the H5T_COMPOUND type...I think.

zfp currently does not handle complex numbers (except as separate real and imaginary fields). We've done some work to generalize zfp's real basis to the complex domain, and the correct generalization is the $d$-dimensional Fourier transform for blocks of $4^d$ complex values. Although a lot of zfp code is templated to make it reasonably easy to support complex-valued data, the amount of coding effort is still substantial with respect to API extensions, tests, and documentation for a niche use case. Additionally, the Cartesian product of types (float and double, in this case), dimensions (1D-4D), compression modes (rate, precision, accuracy, reversible), back ends (serial, OpenMP, CUDA, HIP), languages (C, C++, Python, Fortran), and APIs (low-level, high-level, array classes) makes any such addition a huge undertaking.
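
To make the parenthetical concrete, a minimal sketch (hypothetical helper) of the one route available today, deinterleaving into two real fields that are compressed separately:

#include <complex.h>
#include <stddef.h>

static void
split_complex(const float _Complex* z, float* re, float* im, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
    {
        re[i] = crealf(z[i]); /* compress re[] as one zfp field... */
        im[i] = cimagf(z[i]); /* ...and im[] as another */
    }
}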

Folks have also asked for support for FP16, which would be higher on my priority list but also is something that would require months of development time. If enough people ask for complex support, we'll consider it, but so far requests have been few.

@lindstro (Member)

> Appending vs. prepending can be a bit confusing since C and HDF5 use C order (rightmost dimension varies fastest) while zfp uses Fortran order (leftmost dimension varies fastest) to describe the layout. The point I was trying to make is that you can't just assign the fastest-varying domain dimension to nx in case the datum type is H5T_ARRAY.

Actually, let me make one more point about inner vs. outer dimensions. It would be possible to always use x, y, ... as the domain dimensions, even when the fastest varying dimensions are part of the range. This is because zfp supports strided layouts, where the leftmost dimension does not necessarily vary fastest. So instead of a default stride of 1 for x and 4 for y in the tensor field example above, you could set them to 6 and 24, respectively, with the strides for z and w set to 1 and 2. There would possibly be a performance penalty, however, as zfp loops over x in the innermost loop, and the data would be gathered and scattered non-contiguously.
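
A sketch of that strided setup for the tensor field example (zfp_field_4d and zfp_field_set_stride_4d are real zfp calls; the wrapper is hypothetical):

static zfp_field*
make_strided_tensor_field(float data[5][4][3][2])
{
    /* domain dims stay as x = 4, y = 5; range dims become z = 2, w = 3 */
    zfp_field* field = zfp_field_4d(data, zfp_type_float, 4, 5, 2, 3);
    zfp_field_set_stride_4d(field, 6, 24, 1, 2); /* sx, sy, sz, sw */
    return field;
}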

@markcmiller86 (Member Author)

> Actually, let me make one more point about inner vs. outer dimensions. It would be possible to always use x, y, ... as the domain dimensions, even when the fastest varying dimensions are part of the range. This is because zfp supports strided layouts, where the leftmost dimension does not necessarily vary fastest. So instead of a default stride of 1 for x and 4 for y in the tensor field example above, you could set them to 6 and 24, respectively, with the strides for z and w set to 1 and 2. There would possibly be a performance penalty, however, as zfp loops over x in the innermost loop, and the data would be gathered and scattered non-contiguously.

Ok, so when a chunk arrives at the filter and we use ZFP's striding feature (which sounds like a great thing to do, by the way), does that mean we walk through that chunk multiple times, each with a different stride, each time appending to the final output buffer that gets returned for that chunk?

@lindstro (Member)

> Ok, so when a chunk arrives at the filter and we use ZFP's striding feature (which sounds like a great thing to do, by the way), does that mean we walk through that chunk multiple times, each with a different stride, each time appending to the final output buffer that gets returned for that chunk?

No, the data is traversed only once, but each zfp block is "gathered" using non-contiguous memory accesses. Take for example the assembly of a single 4D block during compression:

/* gather 4*4*4*4 block from strided array */
static void
_t2(gather, Scalar, 4)(Scalar* q, const Scalar* p, ptrdiff_t sx, ptrdiff_t sy, ptrdiff_t sz, ptrdiff_t sw)
{
  uint x, y, z, w;
  for (w = 0; w < 4; w++, p += sw - 4 * sz)
    for (z = 0; z < 4; z++, p += sz - 4 * sy)
      for (y = 0; y < 4; y++, p += sy - 4 * sx)
        for (x = 0; x < 4; x++, p += sx)
          *q++ = *p;
}

There's an equivalent scatter operation for decompression. Normally, the strides are sx = 1, sy = nx, sz = nx * ny, sw = nx * ny * nz for an array (or chunk) of size nx * ny * nz * nw. In this case, we have sequential access in the innermost loop, then skip over the minimal number of elements sy - 4 * sx when incrementing y, and so forth. Of course, if the data is not already arranged as blocks, access is only partially sequential (i.e., only along x). Though if nx = ny = nz = 4, then all 256 values of the block are fetched contiguously. But whenever sx != 1, all memory accesses are non-contiguous, unless sy = 3 * sx + 1 (and similarly for other dimensions).

For large chunks, it's probably preferable to maximize locality of reference, though I don't know if this is a serious concern in practice.
