HDF5 Limit on attribute creation order tracking #2054

Open
gsjaardema opened this issue Aug 5, 2021 · 18 comments


@gsjaardema
Contributor

[This discussion was taking place over email, but I think I should put it here for easier searching and tracking]

Original Issue:

It looks like I am hitting a limit of 65,536 for creating attributes with attribute creation order tracking enabled. Is that the limit in HDF5? If so, is it possible to increase this number by changing a #define, or is it an inherent limit in the data format?

This is happening in a NetCDF-4 format file, so the HDF5 is being created via the NetCDF API:

HDF5-DIAG: Error detected in HDF5 (1.10.7) thread 0:
  #000: /scratch/gdsjaar/seacas/TPL/hdf5/hdf5-1.10.7/src/H5A.c line 285 in H5Acreate2(): unable to create attribute
    major: Attribute
    minor: Unable to initialize object
  #001: /scratch/gdsjaar/seacas/TPL/hdf5/hdf5-1.10.7/src/H5Aint.c line 275 in H5A__create(): unable to create attribute in object header
    major: Attribute
    minor: Unable to insert object
  #002: /scratch/gdsjaar/seacas/TPL/hdf5/hdf5-1.10.7/src/H5Oattribute.c line 296 in H5O__attr_create(): attribute creation index can't be incremented
    major: Attribute
    minor: Unable to increment reference count

HDF5 Response:

According to section IV.A.2.v, "The Attribute Info Message", in the File Format Spec, the maximum creation index is stored in 2 bytes, so this is a file format issue. I think we do have a problem, since there is currently no limit on the number of attributes. We will need to introduce changes to the file format. We need to have a conversation about HDF5 size limitations and how much work this will be.
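
A 2-byte unsigned field can hold 65,536 distinct values (0 through 65,535), which is consistent with the limit reported above. The following is a minimal sketch in plain C of a 16-bit index that refuses to advance once its range is used up. It makes no HDF5 calls, the function name `increment_creation_index` is hypothetical, and it is not taken from the HDF5 source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 2-byte creation index; not HDF5 source code. */
static bool increment_creation_index(uint16_t *idx)
{
    assert(idx != NULL);
    if (*idx == UINT16_MAX)
        return false;   /* one more increment would wrap around to 0 */
    (*idx)++;
    return true;
}
```

Starting from 0, the index can be advanced until it reaches `UINT16_MAX`; after that every call reports failure, which is the same kind of condition as the "attribute creation index can't be incremented" message in the error stack.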

@gsjaardema
Contributor Author

The issue is that NetCDF-4 uses "attribute creation order" tracking as described in docs/file_format_specification.md

\subsection creation_order Creation Order

The netCDF API maintains the creation order of objects that are
created in the file. The same is not true in HDF5, which maintains the
objects in alphabetical order. Starting in version 1.8 of HDF5, the
ability to maintain creation order was added. This must be explicitly
turned on in the HDF5 data file in several ways.

Each group must have link and attribute creation order set. The
following code (from libsrc4/nc4hdf.c) shows how the netCDF-4 library
sets these when creating a group.

\code
           /* Create group, with link_creation_order set in the group
            * creation property list. */
           if ((gcpl_id = H5Pcreate(H5P_GROUP_CREATE)) < 0)
              return NC_EHDFERR;
           if (H5Pset_link_creation_order(gcpl_id, H5P_CRT_ORDER_TRACKED|H5P_CRT_ORDER_INDEXED) < 0)
              BAIL(NC_EHDFERR);
           if (H5Pset_attr_creation_order(gcpl_id, H5P_CRT_ORDER_TRACKED|H5P_CRT_ORDER_INDEXED) < 0)
              BAIL(NC_EHDFERR);
           if ((grp->hdf_grpid = H5Gcreate2(grp->parent->hdf_grpid, grp->name,
                                            H5P_DEFAULT, gcpl_id, H5P_DEFAULT)) < 0)
              BAIL(NC_EHDFERR);
           if (H5Pclose(gcpl_id) < 0)
              BAIL(NC_EHDFERR);
\endcode

Each dataset in the HDF5 file must be created with a property list for
which the attribute creation order has been set to creation
ordering. The H5Pset_attr_creation_order function is used to set the
creation ordering of attributes of a variable.

The following example code (from libsrc4/nc4hdf.c) shows how the
creation ordering is turned on by the netCDF library.

\code
        /* Turn on creation order tracking. */
        if (H5Pset_attr_creation_order(plistid, H5P_CRT_ORDER_TRACKED|
                                       H5P_CRT_ORDER_INDEXED) < 0)
           BAIL(NC_EHDFERR);
\endcode

@gsjaardema
Contributor Author

I recently hit this issue again and (due to limited long-term memory) spent some time triaging it before remembering that I had already reported it...

I pinged The HDF Group (THG) again, and they are willing to look into increasing the maximum attribute creation index from 16 bits to a larger value, but that probably won't happen until 1.14.0 at the earliest...

I have users who need to write their files now (although writing to netcdf-3 / netcdf-5 format works), and I will probably have users hitting the limit more frequently as model complexity increases, so I decided to experiment. I removed all calls to H5Pset_attr_creation_order() and was able to "successfully" create the file, and it seems to run through subsequent read/write calls with no discernible issues.

Question

What is the disadvantage of writing a netCDF-4 file with the attribute creation order turned off? Will this cause issues? If so, what issues and are they problem-specific (i.e., do some applications/uses of netCDF need this turned on and some can function fine without it?)

Proposal

If it is OK for some applications to work with netCDF files with the attribute creation ordering turned off, would it be possible to add a configuration option (preferably run-time, or less desirably compile-time) to the netCDF library which would disable the setting of the attribute creation ordering? This would give me a solution now instead of in a year or two (plus the time spent waiting for associated applications to catch up and be able to handle hdf5-1.14.X format files...)

I can work up a PR for consideration if this seems like a possibility...

@gsjaardema
Contributor Author

Note that it is probably OK to retain the attribute creation order tracking for the groups. The main area I am hitting the limit on is with datasets.

@DennisHeimbigner
Collaborator

The difference is that when opening an existing dataset, the assigned attribute numbers
would differ from those assigned at creation time. It is possible, I suppose, that some users do a bad
thing and access attributes by attribute number rather than name when reading a dataset.

@edwardhartnett
Contributor

This is unfortunately very important. If we don't turn on creation ordering, the varids will change. They will be reordered into alphabetical order.

Plenty of codes out there depend on var 0 being something, var 1 being something else, etc. So reordering the vars will break all kinds of user code and be a major violation of backwards compatibility.

@gsjaardema
Contributor Author

I am not suggesting this as a change to default behavior; I'm asking whether it would be possible to provide an option that the application / library could set to disable the use of the ordering if and only if the application / library knew that it could work correctly with non-deterministically-ordered variables...

@edwardhartnett
Contributor

I believe the attribute table that keeps track of dimensions may know about varids - I don't know how what you propose can work, but I am certainly open to suggestions...

@gsjaardema
Contributor Author

I have run a few tests on my netCDF library that has the attribute creation order tracking disabled and other than a different ordering for a few attributes, I can see no differences in the files. All of the vars appear in the exact same ordering in the file as they do with the attribute creation order tracking enabled (which I think makes sense...)

I'm not sure where to look to verify that the attribute table bookkeeping does not get changed, but the files that I am creating seem to be valid and are readable by tools linked with either a "pre-change" netcdf library or a "post-change" library.

If I run "ctest", I get some failures, but they all seem to be related to attribute ordering. I can see that if an application relied on the attributes appearing in a certain order, this change would break it, but in my uses, I always query the attribute by name and not position, so unless the library is internally relying on a specific ordering of the "hidden" attributes, I'm not sure how this would affect my subset of files. [I do agree that this breaks backward compatibility, so it must be selectable at run-time and not the default behavior]

gsjaardema pushed a commit to gsjaardema/netcdf-c that referenced this issue Aug 6, 2021
Work in progress / Proof of concept:

Add a capability to disable the tracking of attribute creation order.
See Unidata#2054 for details.

This PR adds a `NC_NOATTCREORD` define which can be passed in the
`mode` argument to `nc_create`.  If it is present, then the
calls that set the attribute creation order tracking are disabled.
This should only be used for files in which you *know* that the
ordering of the attributes does not matter to *any* potential
readers of this database.

@gsjaardema
Contributor Author

See #2056 for a proof-of-concept implementation of what I am proposing.
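
The commit message above describes a flag carried in the `mode` argument of `nc_create`. As a rough illustration of how such a bit could gate the tracking calls, here is a minimal sketch in plain C. The bit values and the names `SKETCH_NETCDF4`, `SKETCH_NOATTCREORD`, and `want_attr_creation_order` are hypothetical; the thread does not state the value of `NC_NOATTCREORD`, and this is not the code in the proof-of-concept PR.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration only. The real mode flags are
 * defined in netcdf.h, and the proposed flag's value is not given here. */
#define SKETCH_NETCDF4     0x1000
#define SKETCH_NOATTCREORD 0x0004

/* Return true when attribute creation order tracking should be requested,
 * i.e. when the opt-out bit is absent from the creation mode. */
static bool want_attr_creation_order(int cmode)
{
    assert(cmode >= 0);
    return (cmode & SKETCH_NOATTCREORD) == 0;
}
```

In a library built along these lines, a `true` result would correspond to making the `H5Pset_attr_creation_order` call shown in the quoted documentation, and a `false` result to skipping it. Keeping tracking on unless the bit is set preserves the default behavior for every existing caller.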

@gsjaardema
Contributor Author

gsjaardema commented Aug 6, 2021

@edwardhartnett

> This is unfortunately very important. If we don't turn on creation ordering, the varids will change.

I guess I am not understanding how the attribute creation ordering affects the varids. Could you explain? In my tests, it looks like the only thing that changes is the order of the attributes themselves.

@DennisHeimbigner
Collaborator

Do you really have a variable with 2^16 attributes?

@gsjaardema
Contributor Author

No, but I have a file that has more than 2^16 attributes since each variable has a few hidden attributes and the count is global over the file and not local to a particular variable.
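
Taking that description at face value (a single count shared by the whole file), the arithmetic can be sketched as below. This is only an illustration: the limit constant uses the figure reported in the original issue, the exact boundary is treated as approximate, and the function names and the per-variable counts used with them are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Figure reported in the original issue; the exact boundary is approximate. */
#define SKETCH_ATTR_LIMIT 65536u

/* Total attributes in a file if every variable carries the same number of
 * user-visible and hidden attributes (a simplifying assumption). */
static uint64_t total_attributes(uint64_t nvars, uint64_t user_per_var,
                                 uint64_t hidden_per_var)
{
    return nvars * (user_per_var + hidden_per_var);
}

/* True when a file-wide count of that size would pass the reported limit. */
static bool exceeds_limit(uint64_t nvars, uint64_t user_per_var,
                          uint64_t hidden_per_var)
{
    /* Bounds keep the multiplication well inside 64 bits. */
    assert(nvars <= UINT32_MAX);
    assert(user_per_var <= UINT16_MAX && hidden_per_var <= UINT16_MAX);
    return total_attributes(nvars, user_per_var, hidden_per_var)
           > SKETCH_ATTR_LIMIT;
}
```

For example, 20,000 variables with one user attribute and three hidden attributes each would give 80,000 attributes in total, even though no single variable has more than four.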

@DennisHeimbigner
Collaborator

Is there a separate counter for variables and for attributes? That would lessen the problem
since it would only affect users who access attributes by attribute number (rather than name).
Not sure how common that is.

@edwardhartnett
Contributor

OK, I thought you were talking about changing the order of varids.

I agree that changing the order of attids has a far smaller impact.

One way to do this would be as a mode flag for the whole file at nc_create time. Another would be a variable setting. Unfortunately we don't have a mode flag for vars, so this would require a new function nc_def_var_att_ordering() or something.

In either case, I think old versions of netCDF-4 would still be able to work with the file. The user would have to know that attribute ordering is alphabetical not by creation.
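
To make the difference between the two orderings concrete, here is a minimal sketch in plain C that makes no netCDF or HDF5 calls. It treats "alphabetical" as `strcmp` order, which is an assumption standing in for whatever order the HDF5 library actually presents, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_names(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Return the position of `name` in `names`, or -1 if it is absent. */
static int find_index(const char **names, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(names[i], name) == 0)
            return (int)i;
    return -1;
}

/* Reorder `names` in place from creation order into strcmp order. */
static void alphabetize(const char **names, size_t n)
{
    assert(names != NULL || n == 0);
    qsort(names, n, sizeof *names, cmp_names);
}
```

With attributes created in the order `units`, `long_name`, `axis`, the name `units` sits at position 0 in creation order but moves to position 2 after `alphabetize`. A lookup by name finds it either way; code that assumes a fixed position does not.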

@DennisHeimbigner
Collaborator

Another solution is for netcdf to track creation order itself, much as we track dimension ids.

@edhartnett
Contributor

edhartnett commented Aug 9, 2021 via email

@DennisHeimbigner
Collaborator

We also need to consider the reverse situation:

  1. library with creation order disabled creates a file
  2. someone else reads the file but their library has creation order enabled.

Will they experience problems?

@edwardhartnett
Contributor

No, I don't think that would be a problem; the read code takes the attributes in whatever order the HDF5 library presents them. Creation order is determined at dataset creation time for the HDF5 file.
