B-tree v2 for links from LINK_INFO messages #47
I've been stuck for a while on reading a fractal heap.
import h5py
import pyfive

filename = "test.h5"
n = 10
with h5py.File(filename, mode="w", track_order=True) as f:
    for i in range(n):
        f.create_group(str(i)*10)
with pyfive.File(filename) as f:
    assert len(f.keys()) == n
prints this
FRACTAL HEAP DIRECT BLOCK
h5debug test.h5 8909 6917 512
OrderedDict([('signature', b'FHDB'),
('version', 0),
('heap_header_adddress', 6917),
('block_offset', 0),
('checksum', 436527222)])
header size 21
data size 491
managed object ID size 4
block data 010400000000000000000a3030303030 ...
first managed object (0, 0, 1) address 4 size 0
B-TREE V2 RECORDS
[{'namehash': 1032326176, 'objectid': 31885837240576},
{'namehash': 1328140396, 'objectid': 31885837248000},
{'namehash': 1679198546, 'objectid': 31885837255424},
{'namehash': 1779768826, 'objectid': 31885837210880},
{'namehash': 2418919233, 'objectid': 31885837233152},
{'namehash': 2529727048, 'objectid': 31885837262848},
{'namehash': 3052067229, 'objectid': 31885837270272},
{'namehash': 3140234048, 'objectid': 31885837218304},
{'namehash': 3744768751, 'objectid': 31885837277696},
{'namehash': 3976419996, 'objectid': 31885837225728}]
h5debug test.h5 7221 7063 10
B-TREE V2 RECORDS
[{'creationorder': 0, 'objectid': 31885837210880},
{'creationorder': 1, 'objectid': 31885837218304},
{'creationorder': 2, 'objectid': 31885837225728},
{'creationorder': 3, 'objectid': 31885837233152},
{'creationorder': 4, 'objectid': 31885837240576},
{'creationorder': 5, 'objectid': 31885837248000},
{'creationorder': 6, 'objectid': 31885837255424},
{'creationorder': 7, 'objectid': 31885837262848},
{'creationorder': 8, 'objectid': 31885837270272},
{'creationorder': 9, 'objectid': 31885837277696}]
h5debug test.h5 7733 7101 10
I'm assuming that the data of a direct block in a fractal heap is a sequence of object IDs, but that doesn't make sense given the output above.
Also, I have no idea what the "objectid" in the B-tree is supposed to represent (it's not an address; the numbers are too big). It refers to an object managed in the fractal heap, but I'm not sure how. Each record (and hence each fractal heap object) corresponds to one HDF5 group (there are 10 HDF5 groups in the example). Any help @jjhelmus would be appreciated. |
I think the objectid should not be parsed as an int (as is currently being done in BTreeV2GroupNames._parse_record and BTreeV2GroupOrders._parse_record). For example, for the first group (name == '0000000000') in test.h5, the objectid as bytes is [0, 21, 0, 0, 0, 29, 0], where 21 is the offset of the data in the fractal heap and 29 is the length of the data. For the second group the objectid is [0, 50, 0, 0, 0, 29, 0], and so on. The referenced data does not look like a fractal heap object ID: it seems to be 0x01 and 0x04, followed by a unique id in the third byte (0, 1, 2, 3... for the 10 objects) and the data size == 10 in byte 11, followed by the data (and a checksum after the data?), with some more zero padding after that for unknown reasons. I was not able to match this up to anything in the HDF5 specification yet. EDIT: it does match the Link Message as far as I can see: |
In FractalHeap, self._managed_object_offset_size is being calculated incorrectly: the value "maximum_heap_size" in the header is log2 of the real value (it says so in the spec), so self._managed_object_offset_size can be calculated exactly the same way as block_offset_size, namely n // 8 + min(n % 8, 1). Since n is the number of bits, this gives the number of bytes needed to represent n bits. If these fixes are made, then self._managed_object_offset_size = 4 for test.h5, and the objectid returned in _read_node can be parsed correctly as a "Fractal Heap ID for Managed Objects". |
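For reference, the byte-count expression from the comment above can be checked in isolation. This is only a sketch: the helper name bytes_for_bits is made up for illustration, and the value 32 for the stored "maximum_heap_size" is an assumption (the thread only reports the resulting size of 4 for test.h5).

```python
import math


def bytes_for_bits(n):
    """Number of whole bytes needed to hold n bits (hypothetical helper)."""
    # n // 8 counts the full bytes; min(n % 8, 1) adds one more byte
    # when there are leftover bits.
    return n // 8 + min(n % 8, 1)


# The expression agrees with rounding n / 8 up to the next integer.
for n in range(0, 65):
    assert bytes_for_bits(n) == math.ceil(n / 8)

# Assumed header value: maximum_heap_size stored as 32 (log2 of the real size).
print(bytes_for_bits(32))
```

If the header really stores 32, this gives the 4-byte offset mentioned above.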
I very much appreciate your help @bmaranville ! I'll come back to this asap and try out your correction. |
This indeed fixes one problem. However what is still unclear from the HDF5 specs is what the "heap ID" actually is:
Then when looking at the fractal heap
How is the "heap ID" from the b-tree record related to the direct block data of the fractal heap? I will ask on the HDF5 forum: https://forum.hdfgroup.org/t/hdf5-specs-unclear-b-tree-v2-and-fractal-heap/8723 |
The fractal heap ID in the spec has different forms depending on what type of object it is: for "tiny" objects the ID contains the data; for "huge" objects it contains either a B-tree key or a direct pointer; for "managed objects" (like the ones we've been looking at) it contains the offset (within the fractal heap) and length of the data. For example, the first complete "link message" is found at exactly offset 21 with length 29 in the fractal heap, which is within the direct data block (though the addressing is from the beginning of the heap, not the data block). One thing that remains unclear to me is how you are supposed to know which type of heap ID you are reading: are you supposed to try to match the structure to the various sub-types and then use the first one that matches? |
If we know the offset of the fractal heap and the "heap ID" itself contains an offset (with respect to the heap offset) and a size, then a pure reader doesn't need to analyze the fractal heap at all it seems ... |
But how do we know that the 7-byte "heap ID" in the "Version 2 B-tree, Type 5 Record Layout - Link Name for Indexed Group" corresponds to a managed object ID? I don't see that explicitly stated anywhere in the spec. |
From the first byte to the heap ID:
ID type is 0: managed, 1: huge, 2: tiny |
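Putting the pieces of this thread together, a managed-object heap ID could be decoded roughly as follows. This is a sketch rather than the pyfive implementation: the function name is made up, the bit positions (version in bits 6-7, type in bits 4-5) are taken from my reading of the spec, and the field widths of 4 and 2 bytes are assumptions inferred from the 7-byte IDs and the offset size of 4 discussed above.

```python
def parse_managed_heap_id(heap_id, offset_size=4, length_size=2):
    """Split a managed-object heap ID into (version, offset, length).

    Hypothetical helper; offset_size and length_size are assumed values.
    """
    first = heap_id[0]
    version = first >> 6            # bits 6-7 of the first byte
    id_type = (first >> 4) & 0x3    # bits 4-5 of the first byte
    if id_type != 0:
        # Only managed objects (type 0) are handled in this sketch.
        raise NotImplementedError(f"heap ID type {id_type} is not a managed object")
    end_offset = 1 + offset_size
    offset = int.from_bytes(heap_id[1:end_offset], "little")
    length = int.from_bytes(heap_id[end_offset:end_offset + length_size], "little")
    return version, offset, length


# The record with creation order 0 was printed as this integer; turning it
# back into 7 little-endian bytes recovers the raw heap ID.
heap_id = (31885837210880).to_bytes(7, "little")
print(list(heap_id))
print(parse_managed_heap_id(heap_id))
```

The same steps applied to 31885837218304 (creation order 1) give offset 50 and length 29, matching the bytes quoted earlier.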
Ah! Wonderful. Somehow I didn't notice that the first byte of all the ID sub-types had those same bits defined. |
Revisiting the size issue, in the spec I see two definitions
So in python for me this is
It's not the same thing. |
Ah ok but it is saved as log2 in the header sigh |
We can now retrieve the fractal heap data used by the b-tree that stores sorted HDF5 groups:
FRACTAL HEAP DIRECT BLOCK
h5debug test.h5 8959 6967 512
OrderedDict([('signature', b'FHDB'),
('version', 0),
('heap_header_adddress', 6967),
('block_offset', 0),
('checksum', 991944457)])
header size 21
data size 491
block data b'\x01\x04\x00\x00\x00\x00\x00\x00\x00\x00\x0b00000000000[\x01\x00\x00\x00\x00\x00\x00' ...
B-TREE V2 RECORDS
h5debug test.h5 7783 7151 10
TODO: decode fractal heap data
b'\x01\x04\x00\x00\x00\x00\x00\x00\x00\x00\x0b00000000000[\x01\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x01\x00\x00\x00\x00\x00\x00\x00\x0b11111111111\x1b\x04\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x02\x00\x00\x00\x00\x00\x00\x00\x0b22222222222\xdb\x06\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x03\x00\x00\x00\x00\x00\x00\x00\x0b33333333333\x9b\t\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x04\x00\x00\x00\x00\x00\x00\x00\x0b44444444444[\x0c\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x05\x00\x00\x00\x00\x00\x00\x00\x0b55555555555G\x0f\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x06\x00\x00\x00\x00\x00\x00\x00\x0b66666666666W\x12\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x07\x00\x00\x00\x00\x00\x00\x00\x0b77777777777g\x15\x00\x00\x00\x00\x00\x00'
b'\x01\x04\x08\x00\x00\x00\x00\x00\x00\x00\x0b88888888888w\x18\x00\x00\x00\x00\x00\x00'
b'\x01\x04\t\x00\x00\x00\x00\x00\x00\x00\x0b99999999999\x1b\x0f\x00\x00\x00\x00\x00\x00'
I didn't find any specification of how to decode that information. You can recognize the group index and name, so we probably have the correct data blobs. |
In the link info message definition, it specifies that the data found in the heap should be decoded as a Link Message - I decoded one by hand in one of the comments above. |
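As an illustration of that decoding, here is a minimal sketch of a Link Message parser applied to the first blob printed above. It is not the code from this PR: the function name and returned dict keys are made up, the field order and flag bits follow my reading of the Link Message layout in the spec, only hard links are handled, and the 8-byte object header address (offset_size=8) is an assumption about this file.

```python
import struct


def parse_link_message(buf, offset_size=8):
    """Decode a version 1 Link Message (hypothetical helper, hard links only)."""
    version, flags = buf[0], buf[1]
    pos = 2
    link_type = 0  # a missing link type field means a hard link
    if flags & 0x08:  # bit 3: link type field present
        link_type = buf[pos]
        pos += 1
    creation_order = None
    if flags & 0x04:  # bit 2: creation order field present
        (creation_order,) = struct.unpack_from("<Q", buf, pos)
        pos += 8
    encoding = "ascii"
    if flags & 0x10:  # bit 4: character set field present
        encoding = "utf-8" if buf[pos] == 1 else "ascii"
        pos += 1
    # Bits 0-1 give the width of the name length field: 1, 2, 4 or 8 bytes.
    name_len_size = 1 << (flags & 0x03)
    name_len = int.from_bytes(buf[pos:pos + name_len_size], "little")
    pos += name_len_size
    name = buf[pos:pos + name_len].decode(encoding)
    pos += name_len
    address = None
    if link_type == 0:  # hard link: address of the object header follows
        address = int.from_bytes(buf[pos:pos + offset_size], "little")
    return {"version": version, "creation_order": creation_order,
            "name": name, "address": address}


blob = b'\x01\x04\x00\x00\x00\x00\x00\x00\x00\x00\x0b00000000000[\x01\x00\x00\x00\x00\x00\x00'
print(parse_link_message(blob))
```

For this blob the flags byte 0x04 means only the creation order field is present, so the name length is a single byte (0x0b) and the trailing 8 bytes are read as the object header address.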
Perfect, thanks @bmaranville ! This can be reviewed. |
Overall this looks good! I had only minor comments.
@woutdenolf and @jjhelmus I already ported this enhancement to a new branch of the jsfive library... I want to make sure that project makes a proper attribution of your work. If you want me to alter the LICENSE.txt please let me know. |
Oeph that's a lot of code duplication. I will probably add more things to |
@jjhelmus I guess you are no longer actively managing this project? Any objection if @bmaranville and myself merge things? |
I've let this project idle for too long, sorry about that. @woutdenolf and @bmaranville, if you are still interested in helping maintain or taking over this project I'd be happy to help make this possible. I've invited both of you as collaborators to this repository. |
Closes #46