Documentation - variable-length datatype description seems incomplete (doesn't mention global heap) #1682

Closed
gringer opened this issue Apr 24, 2022 · 4 comments
Labels
Component - Documentation Doxygen, markdown, etc. Priority - 2. Medium ⏹ It would be nice to have this in the next release Type - Improvement Improvements that don't add a new feature or functionality

gringer commented Apr 24, 2022

I'm working on trying to decode an HDF5 file (attached) manually using the specification (for the purpose of writing rust code for a decoder that quickly extracts raw signal data from nanopore FAST5 files), and am getting tripped up by the definition of variable-length data. I'm currently looking at the attribute message starting at location 0x14e62 in the file, which I've expanded out here in 8-byte chunks for clarity:

00014e62  0c 00 48 00 04 00 00 00  01 00 0d 00 14 00 08 00  ## attribute header / lengths
00014e72  66 69 6c 65 5f 76 65 72  73 69 6f 6e 00 00 00 00  ## attribute name
00014e82  19 01 01 00 10 00 00 00  10 00 00 00 01 00 00 00  ## datatype
00014e92  00 00 08 00 00 00 00 00  01 00 00 00 00 00 00 00  ## datatype / dataspace
00014ea2  03 00 00 00 00 08 00 00  00 00 00 00 73 00 00 00  ## dataspace

I can understand the first two lines of this:

  • Attribute message; 0x48 bytes excluding header; not shared; version 1; 0xd bytes for name; 0x14 bytes for datatype; 0x08 bytes for dataspace
  • Name [file_version\0] + padding to 8-byte boundary
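The reading of those first two rows can be checked with a short Python sketch. It assumes little-endian fields, the 8-byte header message prefix (type, size, flags, three reserved bytes) and the version 1 attribute layout, in which the name, datatype and dataspace are each padded to a multiple of 8 bytes. The names `raw` and `pad8` are illustrative only.

```python
import struct

# First two rows of the dump above (offsets 0x14e62 to 0x14e81).
raw = bytes.fromhex(
    "0c 00 48 00 04 00 00 00  01 00 0d 00 14 00 08 00"
    "66 69 6c 65 5f 76 65 72  73 69 6f 6e 00 00 00 00"
)

def pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7

# Header message prefix: type, data size, flags (3 reserved bytes follow).
msg_type, msg_size, msg_flags = struct.unpack_from("<HHB", raw, 0)

# Version 1 attribute header: version, reserved byte, then three sizes.
version, _reserved, name_size, dtype_size, dspace_size = struct.unpack_from(
    "<BBHHH", raw, 8
)

# The name starts right after the 8-byte attribute header.
name = raw[16:16 + name_size].rstrip(b"\x00").decode("ascii")

# Whatever is left of the message after the padded fields is the data.
data_size = msg_size - 8 - pad8(name_size) - pad8(dtype_size) - pad8(dspace_size)

print(hex(msg_type), msg_size, version, name, dtype_size, dspace_size, data_size)
```

Under these assumptions the 0x48-byte message leaves 16 bytes for the attribute data, which is the whole of the fifth row of the dump.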

The first 4 bytes of the third line define a v1 variable-length datatype; type string; null-terminated; with UTF-8 encoding. Beyond this I think the specification indicates a length should follow, so that's another 0x10 bytes for the datatype (consistent with the attribute information)... but then I get lost.

The datatype message specification suggests the next information that follows is "Properties", and for variable-length datatypes it states, "Each variable-length type is based on some parent type. The information for that parent type is described recursively by this field." Unfortunately, I don't understand what this means. The next four bytes in the datatype definition are 10 00 00 00, and I can't find anything in the specification to help decode them. If I assume this is a variable class/version information segment, then I would expect it to decode to v1, fixed point. If I assume this is the property of a string, the specification tells me, "There are no properties defined for the string class." If I assume this is the property of an array, I end up with a dimensionality of 16, and there's not enough space in the datatype to define 16 dimensions.

Moving on to the first eight bytes of the dataspace section (from 0x14e9a), I get v1, 0 dimensions (i.e. scalar value), and no set flags (with reserved bytes set to zero). That all seems fine, and consistent with what I expect. Following on from this (fifth line), I get lost again. The attribute message section tells me what should follow is the data itself, but that's not correct; this is not a null-terminated string.

After a lot of hunting through the file, and comparing with the output of h5dump, and checking h5debug, I found the version string I was looking for, in the global heap starting in the file at position 0x800, index 115 (0x73).

I found a little nugget of information in the specification for the global heap, which stated, "For example, data of variable-length datatype elements is stored in the global heap and is accessed via a global heap ID. The format for global heap IDs is described at the end of this section." It would have been really helpful for me if this information were in the section on variable-length datatype elements (it's not, from what I can tell). This clued me in to the last 12 bytes of this attribute section probably being the address of the global heap collection and the index of the data within that heap (following what is described in the specification as, "The format for the ID used to locate an object in the global heap is described here:"). But I have no idea what the first 4 bytes on that line relate to (03 00 00 00). Are these flags relating to the heap, or the data? Are there any other variable types that use the global heap, or is the variable-length datatype the only one?

There seems to be additional information in this attribute section that is being parsed by h5dump / h5debug (e.g. CTYPE H5T_C_S1; H5T_LOC_0), but I can't see how it's defined in the HDF5 specification. Could someone please help me understand this?

$ h5debug perfect_guppy_3.6.0_LAST_gci_vs_Nb_mtDNA.fast5 96
...
   Message ID (sequence number):                   0x000c `attribute' (0)
   Dirty:                                          FALSE
   Message flags:                                  <DS>
   Chunk number:                                   1
   Raw message data (offset, size) in chunk:       (32, 72) bytes
   Message Information:                           
      Name:                                        "file_version"
      Character Set of Name:                       ASCII
      Object opened:                               FALSE
      Object:                                      0
      Creation Index:                              0
      Datatype...
         Encoded Size:                             20
         Type class:                               vlen
         Size:                                     16 bytes
         Version:                                  1
         Vlen type:                                string
         Location:                                 H5T_LOC_0
         Character Set:                            UTF-8
         String Padding:                           NULL Terminated
      Dataspace...
         Encoded Size:                             8
         Space class:                              H5S_SCALAR
...

$ h5debug ../../perfect_guppy_3.6.0_LAST_gci_vs_Nb_mtDNA.fast5 0x800
...
Object 115
   Obffset in block: 3760
   Reference count: 0
   Size of object body: 3/8
      0000: 32 2e 30                                         2.0

$ h5dump perfect_guppy_3.6.0_LAST_gci_vs_Nb_mtDNA.fast5
...
   ATTRIBUTE "file_version" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "2.0"
      }
   }
...

perfect_guppy.tar.gz

@gringer gringer changed the title Documentation - variable-length datatype description seems incomplete Documentation - variable-length datatype description seems incomplete (doesn't mention global heap) Apr 24, 2022
@andygotz

@gringer you might be interested in the Rust binding for HDF5, in case you are not aware of it: https://github.com/aldanor/hdf5-rust


gringer commented May 3, 2022

I did have a look at this; the readme suggests that hdf5-rust is a wrapper around the existing C library. I'm interested in pulling a specific dataset out of a specific HDF5 implementation, and want a leaner, read-only approach.

Regardless of whether or not a wrapper would be suitable, the documentation still seems incomplete.

@gheber gheber added Component - Documentation Doxygen, markdown, etc. and removed documentation labels Mar 3, 2023
@derobins derobins added Priority - 2. Medium ⏹ It would be nice to have this in the next release Type - Improvement Improvements that don't add a new feature or functionality labels May 3, 2023
@mattjala mattjala added this to the 1.14.4 milestone Jan 19, 2024
@mattjala
Contributor

mattjala commented Jan 22, 2024

> Each variable-length type is based on some parent type. The information for that parent type is described recursively by this field.

This means that the 'Properties' field contains a complete datatype message for the parent type, from its class and version byte through to its own properties. If you had DatatypeA, a variable-length sequence of DatatypeB, the order of elements in the message would be:

Class/Version A , Class Bit Fields A, Datatype Size A, Properties A where Properties A = Class/Version B, Class Bit Field B, Datatype Size B, Properties B.

> The next four bytes in the datatype definition are 10 00 00 00, and I can't find anything in the specification to help decode them.

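In the special case of a variable-length string, the 'parent' type is considered to be an unsigned character, H5T_NATIVE_UCHAR, which is treated as having the H5T_INTEGER datatype class. The bytes 10 00 00 00 are the class and version byte of that integer (version 1, class 0, fixed-point) followed by its three class bit field bytes. The following bytes 01 00 00 00 are its size (one byte), and the final 00 00 08 00 are its properties: a bit offset of 0 and a bit precision of 8. The four zero bytes after that pad the 20-byte datatype to an 8-byte boundary.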
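The recursion can be sketched in Python against the 20 datatype bytes from the dump. This is a minimal sketch that handles only the two classes seen here. It assumes a version 1 datatype message and that the fixed-point properties are a 2-byte bit offset followed by a 2-byte bit precision. `parse_datatype` is an illustrative name, not a library function.

```python
import struct

# The 20-byte datatype from the dump (offsets 0x14e82 to 0x14e95).
DTYPE = bytes.fromhex("19 01 01 00 10 00 00 00 10 00 00 00 01 00 00 00 00 00 08 00")

def parse_datatype(buf, pos=0):
    """Parse one datatype message at pos; return (description, next position)."""
    class_and_version = buf[pos]
    info = {
        "version": class_and_version >> 4,     # high nibble
        "class": class_and_version & 0x0F,     # low nibble
        "bit_fields": int.from_bytes(buf[pos + 1:pos + 4], "little"),
        "size": struct.unpack_from("<I", buf, pos + 4)[0],
    }
    pos += 8
    if info["class"] == 0:
        # Fixed-point: properties are bit offset and bit precision.
        info["bit_offset"], info["bit_precision"] = struct.unpack_from("<HH", buf, pos)
        pos += 4
    elif info["class"] == 9:
        # Variable-length: properties are a whole datatype message for the base type.
        info["base"], pos = parse_datatype(buf, pos)
    else:
        raise NotImplementedError(f"class {info['class']} is not handled in this sketch")
    return info, pos

dtype, end = parse_datatype(DTYPE)

# Variable-length class bit fields: type, padding type, character set.
vlen_type = dtype["bit_fields"] & 0xF          # 1 = string
vlen_pad = (dtype["bit_fields"] >> 4) & 0xF    # 0 = null terminate
vlen_cset = (dtype["bit_fields"] >> 8) & 0xF   # 1 = UTF-8

print(dtype, end)
```

The outer message reports a size of 16 bytes per element, and the nested base type is a 1-byte fixed-point type with 8 bits of precision; all 20 bytes are consumed.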

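The last section of the attribute message is the variable-length element as it is stored in the file: a 4-byte length followed by a global heap ID, which is composed of a collection address (with a length determined by Size of Offsets) followed by a 4-byte object index. The 03 00 00 00 bytes are the length (3, matching the three bytes of "2.0"). The collection address is 00 08 00 00 00 00 00 00 (0x800) and the object index is 73 00 00 00 (115).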
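As a check against the dump, here is a minimal Python sketch that splits the final 16 bytes of the attribute. It assumes the on-disk form of a variable-length element is a 4-byte length followed by the global heap ID, and that Size of Offsets is 8 for this file. `parse_vlen_element` is an illustrative name.

```python
import struct

# Last row of the dump (offsets 0x14ea2 to 0x14eb1).
DATA = bytes.fromhex("03 00 00 00 00 08 00 00 00 00 00 00 73 00 00 00")

def parse_vlen_element(buf, size_of_offsets=8):
    """Split a stored variable-length element into (length, address, index)."""
    length = struct.unpack_from("<I", buf, 0)[0]
    address = int.from_bytes(buf[4:4 + size_of_offsets], "little")
    index = struct.unpack_from("<I", buf, 4 + size_of_offsets)[0]
    return length, address, index

length, address, index = parse_vlen_element(DATA)
print(length, hex(address), index)  # 3 0x800 115
```

The result agrees with the h5debug output in the original post: the heap collection at 0x800, object 115, with a 3-byte body.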

CTYPE H5T_C_S1 is known to h5dump/h5debug because the datatype is a variable-length string. H5T_C_S1 is the one-character C string datatype from which a variable-length string type is created.

I'm not sure what H5T_LOC_0 means - I suspect it may be something left over from an older version of the library.

> Are there any other variable types that use the global heap, or is the variable-length datatype the only one?

Both variable-length data and region reference data store their data in the global heap.
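Following a global heap ID to its data could look roughly like the Python sketch below. The layout is an assumption taken from the global heap section of the specification: a "GCOL" signature, a version byte, three reserved bytes and a collection size, followed by objects that each have a 2-byte index, a 2-byte reference count, four reserved bytes, an object size and a body padded to 8 bytes, with Size of Lengths taken to be 8. The collection is synthetic and built by a hypothetical helper, `build_collection`, because the real heap bytes are not shown in this thread.

```python
import struct

def pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7

def build_collection(objects):
    """Build a synthetic collection for this example only: {index: payload}."""
    body = b""
    for index, payload in objects.items():
        body += struct.pack("<HH4xQ", index, 0, len(payload))
        body += payload.ljust(pad8(len(payload)), b"\x00")
    header = b"GCOL" + bytes([1, 0, 0, 0]) + struct.pack("<Q", 16 + len(body))
    return header + body

def find_heap_object(collection, wanted):
    """Walk the objects in a collection and return the body of one index."""
    if collection[:4] != b"GCOL":
        raise ValueError("not a global heap collection")
    total = struct.unpack_from("<Q", collection, 8)[0]
    pos = 16
    while pos + 16 <= total:
        index, _refcount, size = struct.unpack_from("<HH4xQ", collection, pos)
        if index == 0:
            break  # object 0 describes the free space at the end
        if index == wanted:
            return collection[pos + 16:pos + 16 + size]
        pos += 16 + pad8(size)
    raise KeyError(wanted)

heap = build_collection({114: b"example", 115: b"2.0"})
value = find_heap_object(heap, 115)
print(value.decode("utf-8"))
```

In a real reader, the collection would be read from the file at the address given in the heap ID (0x800 in the attribute discussed here) rather than built in memory.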

It's true that the global heap should probably be mentioned in the variable length datatype message description - I'll update the documentation for this.

@mattjala
Contributor

Resolved in #3950
