
Can argument start be NULL in get/put APIs? #231

Closed
wkliao opened this issue Mar 5, 2016 · 24 comments · Fixed by #1112

@wkliao
Contributor

wkliao commented Mar 5, 2016

I tested the var1, vara, vars, and varm APIs using NULL for the argument start.
All resulted in a coredump except vars. (Version 4.4.0 was used.)

The core file shows the following for the varm case.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000004043a8 in NCDEFAULT_put_varm (ncid=65536, varid=0, start=0x0, edges=0x7ffee1af7730, stride=0x7ffee1af7710, imapp=0x7ffee1af76f0, value0=0x7ffee1af74b0, memtype=5) at dvarput.c:479
479 mystart[idim] = start[idim];
(gdb) where
#0  0x00000000004043a8 in NCDEFAULT_put_varm (ncid=65536, varid=0, start=0x0, edges=0x7ffee1af7730, stride=0x7ffee1af7710, imapp=0x7ffee1af76f0, value0=0x7ffee1af74b0, memtype=5) at dvarput.c:479
#1  0x00000000004044d2 in NC_put_varm (ncid=65536, varid=0, start=0x0, edges=0x7ffee1af7730, stride=0x7ffee1af7710, map=0x7ffee1af76f0, value=0x7ffee1af74b0, memtype=5) at dvarput.c:524
#2  0x0000000000405d9c in nc_put_varm_float (ncid=65536, varid=0, startp=0x0, countp=0x7ffee1af7730, stridep=0x7ffee1af7710, imapp=0x7ffee1af76f0, op=0x7ffee1af74b0) at dvarput.c:1448

Other coredump files point to:
#0  0x00000000004197bb in NCcoordck (ncp=0x25830f0, varp=0x25872f0, coord=0x0) at putget.c:736
736 if(coord > X_UINT_MAX) /* rkr: bug fix from previous X_INT_MAX */

The netCDF C reference does not explicitly say that a NULL start argument is illegal, but the error code NC_EINVALCOORDS is defined and could be used to report it. So the question is whether NULL is allowed or not. If allowed, then netCDF needs to allocate and use an array of all 1s internally.
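
For reference, here is a minimal sketch of the call pattern being tested (the file name, dimension size, and variable name are made up for illustration). With the behavior described above, the NULL start is dereferenced and the program segfaults instead of receiving an error code:

#include <netcdf.h>
#include <stdio.h>

int
main(void)
{
   int ncid, dimid, varid, err;
   float val = 1.0f;

   /* Create a small file with one 1-D float variable. */
   nc_create("tst_null_start.nc", NC_CLOBBER, &ncid);
   nc_def_dim(ncid, "x", 10, &dimid);
   nc_def_var(ncid, "var", NC_FLOAT, 1, &dimid, &varid);
   nc_enddef(ncid);

   /* Pass NULL for start. The question in this issue is whether this
    * should return an error (e.g. NC_EINVALCOORDS) rather than
    * dereference the NULL pointer and crash. */
   err = nc_put_var1_float(ncid, varid, NULL, &val);
   printf("NULL start returned %d (%s)\n", err, nc_strerror(err));

   nc_close(ncid);
   return 0;
}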

@edhartnett
Contributor

Why do you want to do this?

If you want to read the entire variable, use the nc_get/put_var() functions.

Ed
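
For example, a whole-variable read or write takes no start/count arguments at all. A minimal sketch, assuming ncid and varid refer to an open 10-element float variable:

#include <netcdf.h>

/* Sketch: whole-variable access with the nc_get/put_var() family.
 * Assumes ncid/varid refer to an open 10-element float variable. */
static int
write_and_read_whole_var(int ncid, int varid)
{
   float out[10] = {0}, in[10];
   int err;

   if ((err = nc_put_var_float(ncid, varid, out)))  /* write everything */
      return err;
   if ((err = nc_get_var_float(ncid, varid, in)))   /* read it all back */
      return err;
   return NC_NOERR;
}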


@wkliao
Contributor Author

wkliao commented Mar 6, 2016

I am a developer of PnetCDF. One of my goals is to check whether PnetCDF can follow netCDF's conventions for returned error codes. The problem I found is an inconsistency among netCDF APIs in handling a NULL argument, and an unclear specification in the documentation.

@edhartnett
Contributor

Well, I agree it's a bug if passing NULL causes a segfault. That should never happen.

But in terms of the error code that you get back, that's a tougher question. I would expect to see NC_EINVAL, but I see from the code that the condition of NULL start, count, and stride arrays is not checked for that, as far as I can find.

Ed


@wkliao
Contributor Author

wkliao commented Mar 6, 2016

The NULL checking of the argument start appears only in the vars APIs. See line 198 of dvarput.c:
mystart[i] = (start == NULL ? 0 : start[i]);
So, the inconsistency needs to be resolved.

Either NC_EINVAL or NC_EINVALCOORDS is fine, as long as the causes are well defined in the documentation. However, I would prefer an error code that provides more information about the problem; NC_EINVALCOORDS is more specific to the argument start.

I agree that invalid pointer arguments are hard to check; a segfault is sometimes unavoidable. However, NULL, when used, has special meanings in many netCDF APIs. In this case, it could be defined as reading/writing an entire variable, ignoring the count, stride, and imap arguments (see the sketch below). Maybe a similar approach can be applied to count, stride, and imap. I just hope this can be clarified in the netCDF documentation.
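
A purely illustrative sketch of that interpretation (this is not the library's current behavior, and the helper name is hypothetical): substitute a zero start and a full-extent count when NULL is passed.

#include <netcdf.h>

/* Hypothetical: treat NULL start/count as "the whole variable" by building
 * a zero start vector and a full-extent count vector, then calling the
 * normal vara API. Not what netCDF currently does. */
static int
get_whole_var_float(int ncid, int varid, float *data)
{
   int ndims, dimids[NC_MAX_VAR_DIMS], err, i;
   size_t start[NC_MAX_VAR_DIMS], count[NC_MAX_VAR_DIMS];

   if ((err = nc_inq_varndims(ncid, varid, &ndims)))
      return err;
   if ((err = nc_inq_vardimid(ncid, varid, dimids)))
      return err;
   for (i = 0; i < ndims; i++)
   {
      start[i] = 0;                                  /* NULL start -> all 0s */
      if ((err = nc_inq_dimlen(ncid, dimids[i], &count[i])))
         return err;                                 /* NULL count -> full extent */
   }
   return nc_get_vara_float(ncid, varid, start, count, data);
}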

@DennisHeimbigner
Collaborator

I think your analysis is correct. We should be checking for NULL arguments as appropriate, but it should probably be done in a unified place in the files in libdispatch. Checking for other kinds of legality (like short dimension vectors) is generally way too difficult.
For the start, count, and stride vectors, we should return NC_EINVALCOORDS.
I don't think we should be trying to interpret NULL arguments as meaning something.

@WardF
Member

WardF commented Mar 7, 2016

I agree with @DennisHeimbigner; NULL checking and returning the proper error code is the way to go, and doing all of that in libdispatch feels like the natural place. I also agree that interpreting NULL isn't the way we want to go; it might disguise other problems when the NULL is inadvertent.

@WardF WardF added this to the 4.4.1 milestone Mar 7, 2016
@WardF WardF self-assigned this Mar 7, 2016
@WardF WardF modified the milestones: future, 4.4.1 Mar 7, 2016
@edhartnett
Contributor

Howdy Wei-keng Liao at the parallel-netcdf team!

Did you change pnetcdf to exactly imitate netcdf for the 1.7.0 release? Because I just upgraded to that release and now the PIO tests are failing with exactly this sort of problem - a segmentation fault that seems to be caused by nulls as parameters.

Thanks,
Ed

@wkliao
Contributor Author

wkliao commented Mar 21, 2016

Hi, Ed

No, PnetCDF 1.7.0 checks and returns NC_ENULLSTART if the start argument is NULL, where NC_ENULLSTART is a PnetCDF error code.

A special case is when the variable is a scalar, in which case start is ignored.
Please check whether your test runs are actually calling the PnetCDF dispatch, instead of netCDF-3 or -4.

Wei-keng


@edhartnett
Contributor

As you can see from the core dump output, it is happening in ncmpii_sanity_check, which is a PnetCDF function. Also, this happened when I updated to PnetCDF 1.7.0.

Backtrace for this error:
#0 0x7F7589930D58
#1 0x7F758992FEE0
#2 0x303E234A4F
#3 0x45FF34 in ncmpii_sanity_check
#4 0x45B97C in ncmpi_iput_varn
#5 0x43BEC9 in pio_write_darray_multi_nc at pio_darray.c:525
#6 0x43D27D in PIOc_write_darray_multi at pio_darray.c:844
#7 0x44004F in flush_buffer at pio_darray.c:1760
#8 0x42EB17 in PIOc_sync at pio_file.c:419
#9 0x42E77A in PIOc_closefile at pio_file.c:284
#10 0x41A7A1 in __piolib_mod_MOD_closefile at piolib_mod.F90:1410
#11 0x41076E in __basic_tests_MOD_test_open at basic_tests.F90:236
#12 0x4129A5 in pio_unit_test_driver at driver.F90:156

@edhartnett
Contributor

However, Kate and Jim have apparently been using PnetCDF 1.7.0 for a few weeks with no problems, so this is likely a problem on my machine. Let me work on it for a while and see if I can figure out what is going on.

Also, if this is still a problem, I will move this conversation to the PnetCDF mailing list, where Ward and Dennis will not have to read about it. ;-)

@wkliao
Contributor Author

wkliao commented Mar 1, 2017

I created a test program for this issue. It tests whether the expected error codes (NC_EINVALCOORDS or NC_EEDGE) are returned when NULL pointers are used for the arguments start, count, stride, or imap. When run against the latest master branch, it may cause a coredump due to dereferencing NULL pointers.

https://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/test/testcases/nc_null_args.c

I also added a function to the test program:
char* nc_err_code_name(int err)
It returns the error code names instead of error messages, which is convenient for netCDF developers. The same function also appears in nc_test/util.c under branch ghpull-319-arm. In case you want to add this program to netCDF, you may want to use the one in util.c.
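
For readers following along, a minimal sketch of what such a helper looks like (only a handful of codes shown here; the real version in nc_null_args.c and nc_test/util.c covers the full set, and this sketch returns const char * for simplicity):

#include <netcdf.h>

/* Abbreviated sketch: map a netCDF error code to its symbolic name. */
static const char *
nc_err_code_name(int err)
{
   switch (err)
   {
   case NC_NOERR:        return "NC_NOERR";
   case NC_EINVAL:       return "NC_EINVAL";
   case NC_EINVALCOORDS: return "NC_EINVALCOORDS";
   case NC_EEDGE:        return "NC_EEDGE";
   case NC_ENOMEM:       return "NC_ENOMEM";
   default:              return "unknown error code";
   }
}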

@edhartnett
Contributor

edhartnett commented Aug 9, 2018

OK, I have hit this issue when working on lazy vars for netcdf-4.

This is now fixed for all netCDF-4 code. However, instead of returning an error (which I think is not actually the correct behavior), the netCDF-4 functions assume starts of 0, counts of the full dimension extent, and a stride of 1.

That is:

nc_put_vars(ncid, varid, NULL, NULL, NULL, data);
and

nc_put_vara(ncid, varid, NULL, NULL, data);

behave exactly like:

nc_put_var(ncid, varid, data);

(Similarly for gets).

Unfortunately, the classic code does not do this; it segfaults. ;-(

As Dennis suggests, I am looking at putting the solution I am currently using for netCDF-4 files into the dispatch layer code, so that it works for all dispatch layers. This makes sense, since how NULL starts, counts, and strides are treated is part of the netCDF API that all dispatch layers should respect.

@DennisHeimbigner
Collaborator

I think either defaulting start to all zeros or reporting an error is an equally good solution. I lean toward returning NC_EINVALCOORDS as the solution. Clearly this does not happen in the wild.
Also, vectors of ones and of zeros are already defined in libdispatch/ddispatch.c

size_t nc_sizevector0[NC_MAX_VAR_DIMS];
size_t nc_sizevector1[NC_MAX_VAR_DIMS];
ptrdiff_t nc_ptrdiffvector1[NC_MAX_VAR_DIMS];
size_t NC_coord_zero[NC_MAX_VAR_DIMS];
size_t NC_coord_one[NC_MAX_VAR_DIMS];
I see that there is some duplication, so the set above could probably
be reduced.
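
As an illustration of how those predefined vectors could serve as defaults inside a libdispatch wrapper (a sketch only; it assumes, as the names suggest, that nc_sizevector0 is all zeros and nc_ptrdiffvector1 is all ones, and the wrapper function itself is hypothetical):

#include <stddef.h>
#include <netcdf.h>

/* Declarations from libdispatch/ddispatch.c, assumed visible here. */
extern size_t nc_sizevector0[NC_MAX_VAR_DIMS];       /* assumed all zeros */
extern ptrdiff_t nc_ptrdiffvector1[NC_MAX_VAR_DIMS]; /* assumed all ones */

/* Hypothetical wrapper: substitute the predefined vectors for NULL
 * start/stride. A NULL count is left alone, since full extents require
 * the dimension lengths and cannot come from a fixed vector. */
static int
get_vars_with_defaults(int ncid, int varid, const size_t *start,
                       const size_t *count, const ptrdiff_t *stride,
                       float *data)
{
   if (start == NULL)
      start = nc_sizevector0;       /* default start: all 0s */
   if (stride == NULL)
      stride = nc_ptrdiffvector1;   /* default stride: all 1s */
   return nc_get_vars_float(ncid, varid, start, count, stride, data);
}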

@edhartnett
Contributor

Well, the vara functions handle a missing (i.e., NULL) count by using the maximum extent of each dimension. This is done in the dispatch layer. So we don't know whether anyone is relying on that, but we probably don't want to change it.

So for a NULL start we can assume all 0s or return an error, but we still have the problem of a NULL count for vars on classic files. It causes a segfault.

I think the cleanest solution is to assume NULL means the same as in nc_get_var(), that is, 0 for start, full extent for count, and 1 for stride. This should be done in the dispatch layer. It's already partially being done (i.e., for count in the vara functions), but it needs to be done for start and stride as well, which will be very easy.
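
For example, the vara convenience described above already lets a caller pass a NULL count and get the full extent (a sketch; ncid, varid, and the assumption that data is large enough are illustrative):

#include <netcdf.h>

/* Sketch of the existing vara behavior: zero start plus NULL count reads
 * the full extent of every dimension. Assumes data is large enough. */
static int
read_full_extent(int ncid, int varid, float *data)
{
   size_t start[NC_MAX_VAR_DIMS] = {0};   /* start at the origin */

   /* NULL count: the dispatch layer substitutes each dimension's length. */
   return nc_get_vara_float(ncid, varid, start, NULL, data);
}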

@wkliao
Contributor Author

wkliao commented Aug 9, 2018

In PnetCDF, NC_EINVALCOORDS is returned when start is NULL for var1, vara, vars, and varm APIs, and NC_EEDGE is returned when count is NULL for vara, vars, and varm APIs.

@edhartnett
Contributor

I am perfectly happy to do things the pnetcdf way, but that would break any user code that relies on nc_get/put_vara calls to automatically fill in the count array. Not sure how many users depend on that, probably not many.

But if we now return NC_EEDGE for that, it may break user code.

So it seems we have three paths to choose from:

1 - Return NC_EINVALCOORDS for NULL start everywhere, return NC_EEDGE for NULL count in vars functions, fill in counts for vara functions. (What error do we return for NULL stride?) (Inconsistent, but nothing breaks.)

2 - Return NC_EINVALCOORDS for NULL start, return NC_EEDGE for NULL count everywhere. (Consistent, but user code that depends on NULL counts for vara calls will break).

3 - Treat NULL start as all 0s, NULL count as dim full extent, and NULL stride as all 1s. This is consistent and nothing breaks, so I think it is what we should do.

@DennisHeimbigner or @WardF if you want to make a definitive choice, I will code it up and put up a PR. If you don't feel strongly about it, I will go with option 3.

@wkliao
Contributor Author

wkliao commented Aug 10, 2018

PnetCDF has checked start and count against NULL since its official release 1.2.0.
assert() was used between 1.2.0 and 1.5.0; starting from 1.6.0, error codes have been returned.

From the earlier discussion in this issue, both @DennisHeimbigner and @WardF agreed with the approach to check NULL and return error codes. Here is the quote from @WardF.

NULL checking and returning the proper error code is the way to go, and doing all of that in libdispatch feels like the natural place. I also agree that interpreting NULL isn't the way we want to go; it might disguise other problems when the NULL is inadvertent.

@edhartnett
Contributor

So @wkliao, do you suggest option 1 or 2?

The vara functions already interpret a NULL count as the full extent of the dimensions. What should we do there? Continue as we have been, or return an error code?

@DennisHeimbigner
Collaborator

If we enforce these rules in libdispatch/dvar{get/put}.c, then none of the underlying dispatch tables will have to worry about this.
Personally, I am in favor of a variant of #2: always disallow NULL start and stride arguments, and treat a NULL count argument as we do now.

@edhartnett
Contributor

OK, #2 sounds good to me. I will do this on my lazy var branch and put up a PR in a bit.

@edhartnett
Contributor

If anyone is interested, here's how the answer looks right now.

I have this newly added function that does the work:

/**
 * @internal Check the start, count, and stride parameters for gets
 * and puts, and handle NULLs.
 *
 * @param ncid The file ID.
 * @param varid The variable ID.
 * @param start Pointer to start array. If NULL NC_EINVALCOORDS will
 * be returned for non-scalar variable.
 * @param count Pointer to pointer to count array. If *count is NULL,
 * an array of the correct size will be allocated, and filled with
 * counts that represent the full extent of the variable. In this
 * case, the memory must be freed by the caller.
 * @param stride Pointer to pointer to stride array. If NULL, stride is
 * ignored. If *stride is NULL an array of the correct size will be
 * allocated, and filled with ones. In this case, the memory must be
 * freed by the caller.
 *
 * @return ::NC_NOERR No error.
 * @return ::NC_EBADID Bad ncid.
 * @return ::NC_ENOTVAR Variable not found.
 * @return ::NC_ENOMEM Out of memory.
 * @return ::NC_EINVALCOORDS Missing start array.
 * @author Ed Hartnett
 */
int
NC_check_nulls(int ncid, int varid, const size_t *start, size_t **count,
               ptrdiff_t **stride)
{
   int varndims;
   int stat;

   if ((stat = nc_inq_varndims(ncid, varid, &varndims)))
      return stat;

   /* For non-scalar vars, start is required. */
   if (!start && varndims)
      return NC_EINVALCOORDS;

   /* If count is NULL, assume full extent of var. */
   if (!*count)
   {
      if (!(*count = malloc(varndims * sizeof(size_t))))
         return NC_ENOMEM;
      if ((stat = NC_getshape(ncid, varid, varndims, *count)))
         return stat;
   }

   /* If stride is NULL, do nothing, if *stride is NULL use all 1s. */
   if (stride && !*stride)
   {
      int i;

      if (!(*stride = malloc(varndims * sizeof(ptrdiff_t))))
         return NC_ENOMEM;
      for (i = 0; i < varndims; i++)
         (*stride)[i] = 1;
   }

   return NC_NOERR;
}

This is used in the vara and vars functions like this:

static int
NC_get_vars(int ncid, int varid, const size_t *start,
	    const size_t *edges, const ptrdiff_t *stride, void *value,
	    nc_type memtype)
{
   NC* ncp;
   size_t *my_count = (size_t *)edges;
   ptrdiff_t *my_stride = (ptrdiff_t *)stride;
   int stat;

   stat = NC_check_id(ncid, &ncp);
   if(stat != NC_NOERR) return stat;

   if(start == NULL || edges == NULL || stride == NULL) {
      stat = NC_check_nulls(ncid, varid, start, &my_count, &my_stride);
      if(stat != NC_NOERR) return stat;
   }

   stat = ncp->dispatch->get_vars(ncid,varid,start,my_count,my_stride,
                                  value,memtype);
   if(edges == NULL) free(my_count);
   if(stride == NULL) free(my_stride);
   return stat;
}

Note that nothing is done unless a NULL argument is passed, to minimally impact performance in the normal case.
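
From the caller's point of view, a sketch of the behavior once NC_check_nulls is in place (the variable is assumed to be non-scalar and data large enough): a NULL count or stride is filled in, while a NULL start is rejected.

#include <netcdf.h>

/* Sketch of caller-visible behavior with NC_check_nulls in the dispatch
 * layer: NULL count -> full extent, NULL stride -> all 1s,
 * NULL start -> NC_EINVALCOORDS (for non-scalar variables). */
static int
demo_null_args(int ncid, int varid, float *data)
{
   size_t start[NC_MAX_VAR_DIMS] = {0};
   int err;

   /* Reads the whole variable: count and stride are filled in for us. */
   if ((err = nc_get_vars_float(ncid, varid, start, NULL, NULL, data)))
      return err;

   /* A NULL start now returns an error instead of segfaulting. */
   err = nc_get_vars_float(ncid, varid, NULL, NULL, NULL, data);
   return (err == NC_EINVALCOORDS) ? NC_NOERR : err;
}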

@edhartnett
Contributor

OK, this is ready to go on branch ejh_vars_null_count_issue_2.

It turns out that for every put and get we were doing an extra file lookup. I have removed them.

@gsjaardema
Contributor

A minor issue is that there can be memory leaks if errors occur in NC_check_nulls. For example, if NC_getshape returns an error, the function will return early and the memory allocated for count will not be freed.

I agree that this is a minor issue with a very low probability of happening, and that the fix involves uglifying the code.

@edhartnett
Contributor

@gsjaardema thanks, I have fixed that.
