Forms should have optional keys, so that they can be used to look up data in regular, named containers #235
Comments
This also implies support for a utility method that takes a keyed Form and a dictionary of arrays and returns an Awkward Array.
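A minimal sketch of what such a utility could look like, using plain Python lists in place of NumPy buffers. The function name `from_keyed_form` is hypothetical and is not part of Awkward Array; the form layout and the `-offsets` naming convention are borrowed from the `ak.to_arrayset` example later in this thread, and only the two node types that appear there are handled.

```python
def from_keyed_form(form, container):
    """Rebuild nested Python lists from a keyed form and a dict of flat arrays.

    Hypothetical helper for illustration only; not an Awkward Array function.
    """
    key = form["form_key"]
    if form["class"] == "NumpyArray":
        # Leaf node: the flat data is stored directly under the form key.
        return list(container[key])
    if form["class"].startswith("ListOffsetArray"):
        # List node: its offsets are stored under "<form_key>-offsets".
        offsets = container[key + "-offsets"]
        content = from_keyed_form(form["content"], container)
        return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]
    raise NotImplementedError(form["class"])


form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {"class": "NumpyArray", "primitive": "int64", "form_key": "node1"},
    "form_key": "node0",
}
container = {"node0-offsets": [0, 3, 3, 5], "node1": [1, 2, 3, 4, 5]}

result = from_keyed_form(form, container)
print(result)  # [[1, 2, 3], [], [4, 5]]
```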
Or alternatively, a dictionary of callables to be used as generators, or a single callable that receives the form key as an argument, etc.
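The single-callable variant could be sketched as a read-only mapping that defers every lookup to a loader function. `CallableContainer` and `loader` are hypothetical names chosen for this illustration; nothing here is an Awkward Array API, and the key names follow the `ak.to_arrayset` example later in this thread.

```python
from collections.abc import Mapping


class CallableContainer(Mapping):
    """Read-only mapping that fetches each array by calling loader(key)."""

    def __init__(self, loader, keys):
        self._loader = loader
        self._keys = list(keys)

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        # Nothing is fetched until a key is actually requested.
        return self._loader(key)

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)


calls = []  # records which keys were requested, to show that loading is on demand


def loader(key):
    calls.append(key)
    backing_store = {"node0-offsets": [0, 3, 3, 5], "node1": [1, 2, 3, 4, 5]}
    return backing_store[key]


container = CallableContainer(loader, ["node0-offsets", "node1"])
offsets = container["node0-offsets"]
```

Because the object behaves like a dictionary, any code that expects a named container of arrays could accept it unchanged, while the actual fetching stays in the data delivery layer.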
There would be an advantage to having all the metadata be serializable as JSON. Data structures of callables are powerful, but not very portable.
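To illustrate the portability point with the standard library only: a keyed form made of strings and nested dictionaries survives a JSON round trip, whereas a structure holding callables does not serialize at all. The form below copies the layout from the `ak.to_arrayset` example later in this thread.

```python
import json

form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "l",
        "primitive": "int64",
        "form_key": "node1",
    },
    "form_key": "node0",
}

# A form containing only keys and primitive descriptions round-trips through JSON.
restored = json.loads(json.dumps(form))

# A dictionary of callables cannot be written out the same way.
try:
    json.dumps({"node1": lambda: [1, 2, 3, 4, 5]})
    callable_error = None
except TypeError as err:
    callable_error = str(err)
```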
It depends on where you want to set the interface between Awkward and the data delivery library. Perhaps Awkward knows the column key and partition index (for chunked arrays) and expects an external call to provide the flat array.
What I had in mind when I wrote the original thing was that each one-dimensional array would have a (partition-id, position-in-tree) coordinate and that the Form would have a reference to the position-in-tree part. For example, in old awkward serialization, the full array was broken down into a bunch of one-dimensional arrays that the backend knew how to store. HDF5 was one of those backends: anything that can name binary blobs would do.
Zarr's v3 extension could be put to the same task and it could resolve the biggest problem that the old serialization had: the HDF5 file didn't know it was an Awkward HDF5 file. Having an explicit extension mechanism would at least let Zarr users know that a particular array is supposed to be read with the Awkward library, so that it can raise an error if the library isn't available. I'm optimistic about this as a way to do it and wrote it up at zarr-developers/zarr-specs#62.
The Form would take the role that the old schema.json took, with better separation between the description of where one-dimensional arrays go and those one-dimensional arrays themselves. You raised the point that we'd rather have one JSON that can be reinterpreted for all partitions than having a JSON per partition that's tightly glued to that partition. Whereas the old schema.json included names of actual arrays to find in the blob store (with their lengths), the new Forms can include keys that when joined with a partition-id tell you the names of actual arrays in the store. In Zarr's terminology, each partition can be a group and each group can have arrays with the same names in it.
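As a rough sketch of "keys joined with a partition-id", the functions below derive store names from a single partition-independent form. The helpers `array_name` and `buffer_names` are hypothetical; the `-offsets` and `-part<N>` suffixes are an assumption modelled on the container keys shown in the `ak.to_arrayset` example later in this thread, and a Zarr layout could instead use the partition as a group name.

```python
def array_name(form_key, attribute=None, partition=None):
    """Join a form key, an optional attribute, and an optional partition id."""
    name = form_key if attribute is None else f"{form_key}-{attribute}"
    return name if partition is None else f"{name}-part{partition}"


def buffer_names(form, partition=None):
    """List every one-dimensional array that a form needs from the store."""
    if form["class"] == "NumpyArray":
        return [array_name(form["form_key"], None, partition)]
    if form["class"].startswith("ListOffsetArray"):
        names = [array_name(form["form_key"], "offsets", partition)]
        return names + buffer_names(form["content"], partition)
    raise NotImplementedError(form["class"])


# One form, reused unchanged for every partition.
form = {
    "class": "ListOffsetArray64",
    "content": {"class": "NumpyArray", "form_key": "node1"},
    "form_key": "node0",
}

names = [buffer_names(form, partition) for partition in range(2)]
print(names)
# [['node0-offsets-part0', 'node1-part0'], ['node0-offsets-part1', 'node1-part1']]
```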
As soon as #348 can be merged, we'll have an equivalent of Awkward 0's persistence layer. The new functions are:
>>> original = ak.Array([[1, 2, 3], [], [4, 5]])
>>> form, container, num_partitions = ak.to_arrayset(original)
>>> form
{
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "l",
"primitive": "int64",
"form_key": "node1" # <-- the new form_key that makes this possible
},
"form_key": "node0" # <-- the new form_key that makes this possible
}
>>> container
{'node0-offsets': array([0, 3, 3, 5], dtype=int64),
'node1': array([1, 2, 3, 4, 5])}
>>> print(num_partitions)
None
Read it back with:
>>> ak.from_arrayset(form, container)
<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>
Write to partitions with:
>>> container = {}
>>> form, _, _ = ak.to_arrayset(ak.Array([[1, 2, 3], [], [4, 5]]), container, 0)
>>> form, _, _ = ak.to_arrayset(ak.Array([[6, 7, 8, 9]]), container, 1)
>>> form, _, _ = ak.to_arrayset(ak.Array([[], [], []]), container, 2)
>>> form, _, _ = ak.to_arrayset(ak.Array([[10]]), container, 3)
>>> form
{
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "l",
"primitive": "int64",
"form_key": "node1"
},
"form_key": "node0"
}
>>> container
{'node0-offsets-part0': array([0, 3, 3, 5], dtype=int64),
'node1-part0': array([1, 2, 3, 4, 5]),
'node0-offsets-part1': array([0, 4], dtype=int64),
'node1-part1': array([6, 7, 8, 9]),
'node0-offsets-part2': array([0, 0, 0, 0], dtype=int64),
'node1-part2': array([], dtype=float64),
'node0-offsets-part3': array([0, 1], dtype=int64),
'node1-part3': array([10])}
Read it back with:
>>> ak.from_arrayset(form, container, 4)
<Array [[1, 2, 3], [], [4, ... [], [], [10]] type='8 * var * int64'>
>>> ak.partitions(ak.from_arrayset(form, container, 4))
[3, 1, 3, 1]
Or read it back lazily:
>>> lazy = ak.from_arrayset(form, container, 4, lazy=True, lazy_lengths=[3, 1, 3, 1])
>>> lazy.metadata["cache"]
{}
>>> lazy
<Array [[1, 2, 3], [], [4, ... [], [], [10]] type='8 * var * int64'>
>>> len(lazy.metadata["cache"])
3
>>> lazy + 100
<Array [[101, 102, 103], [], ... [], [], [110]] type='8 * var * int64'>
>>> len(lazy.metadata["cache"])
4
This is the same partitioned/lazy interface as
As a side-note, Awkward Arrays can now be pickled. (cPickle protocol 2+ only; pybind11 constraints satisfied by Python 3. I could loosen that by serializing the Forms as JSON text and back again, but I don't think that's necessary; I can just say that this feature isn't automatic in Python 2.)
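Returning to the lazy read-back shown earlier: the `lazy_lengths` values can be derived from the stored offsets, since each partition holds `len(offsets) - 1` lists. The sketch below shows that arithmetic together with a toy on-demand loader. `LazyPartitions` is a hypothetical class written for this illustration with plain lists; it is not how Awkward implements its cache and does not reproduce the cache counts in the session above.

```python
# Container contents copied from the partitioned example, as plain lists.
container = {
    "node0-offsets-part0": [0, 3, 3, 5], "node1-part0": [1, 2, 3, 4, 5],
    "node0-offsets-part1": [0, 4], "node1-part1": [6, 7, 8, 9],
    "node0-offsets-part2": [0, 0, 0, 0], "node1-part2": [],
    "node0-offsets-part3": [0, 1], "node1-part3": [10],
}
num_partitions = 4

# Each partition contains one list per consecutive pair of offsets.
lazy_lengths = [
    len(container[f"node0-offsets-part{p}"]) - 1 for p in range(num_partitions)
]
print(lazy_lengths)  # [3, 1, 3, 1]


class LazyPartitions:
    """Toy loader: builds a partition on first access and caches the result."""

    def __init__(self, container):
        self.container = container
        self.cache = {}

    def partition(self, p):
        if p not in self.cache:
            offsets = self.container[f"node0-offsets-part{p}"]
            content = self.container[f"node1-part{p}"]
            self.cache[p] = [
                content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])
            ]
        return self.cache[p]


lazy = LazyPartitions(container)
cache_before = len(lazy.cache)   # nothing has been read yet
second = lazy.partition(1)       # reads only partition 1
cache_after = len(lazy.cache)
```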
I haven't explicitly created the HDF5 read/write and the
@nsmith-: just so that you don't write something around Forms to give them this capability (i.e. to fetch objects from an object store with a naming convention), it really ought to be part of the Forms.
These keys aren't needed to check that VirtualArrays make sense, but they might be needed for bookkeeping in a system that reconstructs Awkward Arrays from one-dimensional arrays. Only the minimum needed so far has been written, but we can definitely add such a thing so that you don't have to work around them not having it.