
May not be correct to cast MPI_Offset pointers to size_t pointers in some cases #79

Closed
edhartnett opened this issue Jun 20, 2016 · 6 comments

@edhartnett
Collaborator

edhartnett commented Jun 20, 2016

NetCDF uses size_t for sizes of things, like the sizes of data arrays.

PIO uses PIO_Offset for this purpose. PIO_Offset is defined in pio.h to be MPI_OFFSET, or MPI_LONG_LONG if MPI_OFFSET is not defined.

As far as I can tell from the MPICH mpi.h header file, it is using long long as MPI_OFFSET.

The size_t type is an unsigned long long on every platform I'm familiar with.

If the sizes of MPI_OFFSET and size_t ever differ, the PIO code will fail quite dramatically. But that is unlikely: it is hard to imagine an MPI platform that does not use some kind of 64-bit integer, and size_t is a 64-bit integer with every modern C compiler on a 64-bit platform.

However, the real problem is that size_t is unsigned and MPI_OFFSET is signed. Could someone be trying to set sizes larger than the maximum signed 64-bit integer (2^63 − 1, or 9,223,372,036,854,775,807)? I don't know; that's a pretty big number.
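To make the signed/unsigned concern concrete, here is a minimal, self-contained sketch (plain long long stands in for MPI_Offset, and a typical 64-bit platform is assumed; this is an illustration, not PIO code):

```c
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    /* Stand-in for an MPI_Offset array (long long under MPICH). */
    long long start[1] = {-1};   /* a bogus negative offset */

    /* The kind of cast currently used to silence the netCDF prototype
     * warnings. (Where size_t is unsigned long rather than unsigned
     * long long, as with glibc on x86_64, this access is technically
     * also an aliasing violation, even though the sizes match.) */
    size_t *sp = (size_t *)start;

    /* On a typical 64-bit platform this prints
     * 18446744073709551615, not -1. */
    printf("%zu\n", sp[0]);
    return 0;
}
```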

An additional source of confusion is that both PIO_OFFSET and PIO_Offset are defined, as are MPI_OFFSET and MPI_Offset.

It is PIO_Offset that is actually used in the code, and that is defined to be MPI_Offset, which is defined in pio_internal.h to be a long long. So MPI_OFFSET is not actually used.

Some possible approaches:

  • Would be nice to remove PIO_OFFSET and MPI_Offset, leaving MPI_OFFSET (an MPI-defined macro, in most cases) and PIO_Offset, which is used all over the code. We should probably set PIO_Offset to MPI_OFFSET, since that is what the user will expect.
  • Would be nice to confirm that there is even a possibility of an issue. Some things (such as ndims) cannot be larger than NC_MAX_DIM, so by looking carefully at the netCDF API we may find that numbers larger than 2^15 − 1 are not even possible.
  • Right now we are just suppressing all the warnings by casting MPI_OFFSET pointers to size_t pointers. This is a bit sleazy; we should find a better way.
  • We could create (in every function that currently casts) size_t arrays, copy the values over from the MPI_OFFSET arrays, and then no cast is needed (see the sketch after this list). This is probably the best solution, though somewhat laborious.
  • We could change PIO_OFFSET to size_t, but that would introduce these warnings into all existing user code. :-(
  • We could check the size of MPI_OFFSET against size_t in the CMake step. If they differ, notify the user and fail the build. That way we can rely on safely casting between these types. (But if anyone ever builds a system where MPI_OFFSET and size_t have different sizes, PIO is hosed there and needs major editing to work.)
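To sketch the copy-and-convert option (the fourth bullet), here is roughly what such a helper could look like. The function name is hypothetical and long long stands in for MPI_Offset so the sketch compiles without mpi.h; it is not existing PIO code.

```c
#include <stdlib.h>

/* Hypothetical helper: copy a start/count array of signed offsets
 * (long long standing in for MPI_Offset) into a freshly allocated
 * size_t array, so netCDF functions that expect size_t pointers can
 * be called without any pointer cast. Returns NULL on allocation
 * failure or if any value is negative (which would otherwise wrap
 * around to a huge unsigned size). The caller frees the result. */
static size_t *offsets_to_size_t(const long long *in, int ndims)
{
    size_t *out = malloc((size_t)ndims * sizeof *out);
    if (!out)
        return NULL;
    for (int i = 0; i < ndims; i++)
    {
        if (in[i] < 0)
        {
            free(out);
            return NULL;
        }
        out[i] = (size_t)in[i];
    }
    return out;
}
```

Since start/count arrays have at most NC_MAX_DIM entries, the extra copy is a handful of words per call and should be negligible next to the I/O itself. The build-time size check in the last bullet could then shrink to a check_type_size() test in CMake (CheckTypeSize module) or a static assertion that sizeof(MPI_Offset) is not larger than sizeof(size_t).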

Any suggestions welcome. I wanted to capture this in an issue because I will probably forget all about it when I go on vacation next week. ;-)

@jayeshkrishna
Contributor

Another possibility is a safe typecast macro that checks corner cases for debug builds (and does a plain cast for non-debug builds).
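For concreteness, a minimal sketch of what such a macro might look like (the name PIO_SIZE_T_CAST is hypothetical, not an existing PIO symbol). Note that it checks values, not pointers, so it would pair with copying rather than with the current pointer casts:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checked conversion from a signed offset value to
 * size_t. In debug builds the assert fires on a negative value; with
 * NDEBUG defined, assert() compiles away and only the plain cast
 * remains. In debug builds the argument is evaluated twice, so it
 * must not have side effects. */
#define PIO_SIZE_T_CAST(x) (assert((x) >= 0), (size_t)(x))
```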

@edhartnett
Collaborator Author

I'm not a huge fan of macros with any kind of logic in them. That makes the C code work very poorly with a debugger.

@jedwards4b
Contributor

NetCDF uses size_t for sizes of things

How does this work for parallel netCDF? Is it using size_t and then casting back to MPI_Offset when interfacing with MPI-IO? I would say option 2 is best, but I worry about too many copies slowing things down.


@edhartnett
Collaborator Author

I need to check carefully to see if it is ever even possible for this to cause a problem, before any action is taken. So I will do that. It may be just fine as it is, in which case I will document my results and move on. I just didn't want to forget this issue. ;-)

The parallel-netcdf library does not call the netCDF library directly, so it does not face the same problem. They use MPI_Offset (instead of size_t) in their prototypes. I see at least one case where they cast a size_t to an MPI_Offset, but use a macro to check that the value is not too large.

In terms of performance, it will only be a factor in the data reads and writes. Metadata operations all take place once, whereas data operations take place over and over. As far as I can see there is no way to have a valid start, count, or stride which is longer than NC_MAX_DIM, so we can just check at build time that this is OK and we're good. That leaves the metadata functions to check for potential issues.
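For reference, the overflow check in the other direction (size_t into a signed offset, as described above for parallel-netcdf) only takes a few lines. This is a generic sketch, not parallel-netcdf's actual macro, with long long again standing in for MPI_Offset:

```c
#include <limits.h>
#include <stddef.h>

/* Sketch: convert a size_t to a signed 64-bit offset, failing
 * cleanly instead of wrapping if the value is too large to
 * represent. Returns 0 on success, -1 on overflow. */
static int size_to_offset(size_t in, long long *out)
{
    if (in > (size_t)LLONG_MAX)
        return -1;              /* would not fit in a signed offset */
    *out = (long long)in;
    return 0;
}
```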

@jedwards4b
Contributor

I meant to say netCDF-4/HDF5.


@edhartnett
Collaborator Author

I will investigate that question.

@edhartnett changed the title from "not correct to cast MPI_Offset pointers to size_t pointers" to "May not be correct to cast MPI_Offset pointers to size_t pointers in some cases" on Aug 4, 2016
@edhartnett self-assigned this Aug 4, 2016