Forms should have optional keys, so that they can be used to look up data in regular, named containers #235

jpivarski · 2020-05-04T19:54:13Z

@nsmith-: so that you don't have to write a wrapper around Forms to give them this capability (i.e. fetching objects from an object store by naming convention), it really ought to be part of the Forms themselves.

These keys aren't needed to check that VirtualArrays make sense, but they might be needed for bookkeeping in a system that reconstructs Awkward Arrays from one-dimensional arrays. So far, only the minimum that Forms need has been written, but we can definitely add keys so that you don't have to work around their absence.

nsmith- · 2020-05-04T20:45:40Z

This also implies support for a utility method that takes a keyed Form and a dictionary of arrays and returns an Awkward Array.

nsmith- · 2020-05-04T20:46:51Z

Or alternatively, a dictionary of callables to be used as generators, or a single callable to which the form key is passed as an argument, etc.

jpivarski · 2020-05-04T22:09:35Z

There would be an advantage to having all the metadata be serializable as JSON. Data structures of callables are powerful, but not very portable.
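To illustrate the portability point, here is a minimal sketch using only the standard library. The Form dictionary is an illustrative stand-in (its field names follow the example later in this thread), and `generators` is a hypothetical dictionary of callables, not anything Awkward provides.

```python
import json

# Illustrative Form with string keys; every value is plain JSON data.
form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {"class": "NumpyArray", "primitive": "int64", "form_key": "node1"},
    "form_key": "node0",
}

# A Form made only of strings, numbers, and dicts survives a JSON round trip.
round_tripped = json.loads(json.dumps(form))

# Hypothetical alternative: a data structure of callables.
generators = {"node1": lambda: [1, 2, 3, 4, 5]}

try:
    json.dumps(generators)
    callables_serializable = True
except TypeError:
    # json cannot encode function objects
    callables_serializable = False
```

The string-keyed Form can be written to a file or sent to another process unchanged, whereas the callables would need some other mechanism (such as pickling) that ties them to a Python environment.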

nsmith- · 2020-05-04T22:18:53Z

Depends where you want to set the interface between awkward and the data delivery library. Perhaps awkward knows the column key and partition index (for chunked arrays) and expects an external call to provide the flat array.
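A rough sketch of what that interface could look like, with everything in it hypothetical: `STORE` stands in for the data delivery library, `fetch` is the external call, and `load_flat_arrays` is the only part that would sit on the Awkward side. Plain lists stand in for flat arrays.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a data delivery library:
# (column key, partition index) -> flat array.
STORE: Dict[Tuple[str, int], List[int]] = {
    ("node0-offsets", 0): [0, 3, 3, 5],
    ("node1", 0): [1, 2, 3, 4, 5],
    ("node0-offsets", 1): [0, 4],
    ("node1", 1): [6, 7, 8, 9],
}


def fetch(column_key: str, partition: int) -> List[int]:
    """External call: the delivery library resolves the key however it likes."""
    return STORE[(column_key, partition)]


def load_flat_arrays(
    column_keys: List[str],
    partition: int,
    fetch_one: Callable[[str, int], List[int]],
) -> Dict[str, List[int]]:
    """Library side: only knows column keys and a partition index."""
    return {key: fetch_one(key, partition) for key in column_keys}


arrays = load_flat_arrays(["node0-offsets", "node1"], 1, fetch)
```

With this split, the library never needs to know whether the flat arrays come from a file, an object store, or a network service.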

jpivarski · 2020-05-05T00:48:41Z

What I had in mind when I wrote the original comment was that each one-dimensional array would have a (partition-id, position-in-tree) coordinate and that the Form would have a reference to the position-in-tree part.

For example, in old awkward serialization, the full array was broken down into a bunch of one-dimensional arrays that the backend knew how to store. HDF5 was one of those backends: anything that can name binary blobs would do. Zarr's v3 extension could be put to the same task and it could resolve the biggest problem that the old serialization had: the HDF5 file didn't know it was an Awkward HDF5 file. Having an explicit extension mechanism would at least let Zarr users know that a particular array is supposed to be read with the Awkward library, so that it can raise an error if the library isn't available. I'm optimistic about this as a way to do it and wrote it up at zarr-developers/zarr-specs#62.

The Form would take the role that the old schema.json took, with better separation between the description of where one-dimensional arrays go and those one-dimensional arrays themselves. You raised the point that we'd rather have one JSON that can be reinterpreted for all partitions than a JSON per partition that's tightly glued to that partition. Whereas the old schema.json included names of actual arrays to find in the blob store (with their lengths), the new Forms can include keys that, when joined with a partition-id, tell you the names of actual arrays in the store. In Zarr's terminology, each partition can be a group and each group can have arrays with the same names in it.
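As a sketch of the joining step, the helper below builds a store name from a Form key, an optional attribute, and an optional partition-id. The function `array_name` is hypothetical; the `-` separator and `part` prefix are assumptions chosen to match the container names shown in the examples later in this thread.

```python
from typing import Optional


def array_name(
    form_key: str,
    attribute: Optional[str] = None,
    partition: Optional[int] = None,
) -> str:
    """Join a Form key with an attribute and a partition-id (hypothetical helper)."""
    parts = [form_key]
    if attribute is not None:
        # e.g. "offsets" for the index buffer of a list node
        parts.append(attribute)
    if partition is not None:
        parts.append(f"part{partition}")
    return "-".join(parts)


unpartitioned = array_name("node1")
offsets_name = array_name("node0", "offsets", 0)
content_name = array_name("node1", partition=2)
```

Because the Form only holds the position-in-tree part, the same Form can be reused for every partition.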

jpivarski · 2020-07-22T23:53:33Z

As soon as #348 can be merged, we'll have an equivalent of Awkward 0's persistence layer. The new functions are ak.to_arrayset and ak.from_arrayset. The documentation should go online soon after it's merged to master as well, but here are some examples:

>>> original = ak.Array([[1, 2, 3], [], [4, 5]])
>>> form, container, num_partitions = ak.to_arrayset(original)
>>> form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "l",
        "primitive": "int64",
        "form_key": "node1"          # <-- the new form_key that makes this possible
    },
    "form_key": "node0"              # <-- the new form_key that makes this possible
}
>>> container
{'node0-offsets': array([0, 3, 3, 5], dtype=int64),
 'node1': array([1, 2, 3, 4, 5])}
>>> print(num_partitions)
None

Read it back with:

>>> ak.from_arrayset(form, container)
<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>
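
Conceptually, reading back means walking the Form and looking up each node's buffers by key. The pure-Python sketch below shows that idea for the two node types in this example only; `rebuild` is a hypothetical function, plain lists stand in for NumPy arrays, and this is not how ak.from_arrayset is actually implemented.

```python
def rebuild(form, container, partition=None):
    """Rebuild nested Python lists from a keyed Form and a dict of flat arrays."""
    key = form["form_key"]
    suffix = "" if partition is None else f"-part{partition}"

    if form["class"] == "NumpyArray":
        # Leaf node: the buffer is stored directly under the form_key.
        return list(container[key + suffix])

    if form["class"] == "ListOffsetArray64":
        # List node: offsets delimit slices of the rebuilt content.
        offsets = container[f"{key}-offsets{suffix}"]
        content = rebuild(form["content"], container, partition)
        return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]

    raise NotImplementedError(form["class"])


form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {"class": "NumpyArray", "primitive": "int64", "form_key": "node1"},
    "form_key": "node0",
}
container = {"node0-offsets": [0, 3, 3, 5], "node1": [1, 2, 3, 4, 5]}

result = rebuild(form, container)
```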

Write to partitions with:

>>> container = {}
>>> form, _, _ = ak.to_arrayset(ak.Array([[1, 2, 3], [], [4, 5]]), container, 0)
>>> form, _, _ = ak.to_arrayset(ak.Array([[6, 7, 8, 9]]), container, 1)
>>> form, _, _ = ak.to_arrayset(ak.Array([[], [], []]), container, 2)
>>> form, _, _ = ak.to_arrayset(ak.Array([[10]]), container, 3)
>>> form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "l",
        "primitive": "int64",
        "form_key": "node1"
    },
    "form_key": "node0"
}
>>> container
{'node0-offsets-part0': array([0, 3, 3, 5], dtype=int64),
 'node1-part0': array([1, 2, 3, 4, 5]),
 'node0-offsets-part1': array([0, 4], dtype=int64),
 'node1-part1': array([6, 7, 8, 9]),
 'node0-offsets-part2': array([0, 0, 0, 0], dtype=int64),
 'node1-part2': array([], dtype=float64),
 'node0-offsets-part3': array([0, 1], dtype=int64),
 'node1-part3': array([10])}

Read it back with:

>>> ak.from_arrayset(form, container, 4)
<Array [[1, 2, 3], [], [4, ... [], [], [10]] type='8 * var * int64'>
>>> ak.partitions(ak.from_arrayset(form, container, 4))
[3, 1, 3, 1]

Or read it back lazily:

>>> lazy = ak.from_arrayset(form, container, 4, lazy=True, lazy_lengths=[3, 1, 3, 1])
>>> lazy.metadata["cache"]
{}
>>> lazy
<Array [[1, 2, 3], [], [4, ... [], [], [10]] type='8 * var * int64'>
>>> len(lazy.metadata["cache"])
3
>>> lazy + 100
<Array [[101, 102, 103], [], ... [], [], [110]] type='8 * var * int64'>
>>> len(lazy.metadata["cache"])
4
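
The lazy_lengths above are the lengths of each partition. When the top-level node is a ListOffsetArray, as it is here, they could be recovered from the offsets buffers alone, since a partition with n lists has n + 1 offsets. A small sketch, assuming the container naming shown above; `partition_lengths` is a hypothetical helper and plain lists stand in for the arrays:

```python
def partition_lengths(container, top_key, num_partitions):
    """Length of each partition, from the top-level offsets (hypothetical helper)."""
    return [
        len(container[f"{top_key}-offsets-part{i}"]) - 1
        for i in range(num_partitions)
    ]


# Only the offsets buffers are needed; contents are never touched.
container = {
    "node0-offsets-part0": [0, 3, 3, 5],
    "node0-offsets-part1": [0, 4],
    "node0-offsets-part2": [0, 0, 0, 0],
    "node0-offsets-part3": [0, 1],
}

lengths = partition_lengths(container, "node0", 4)
```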

This is the same partitioned/lazy interface as ak.from_parquet.

As a side note, Awkward Arrays can now be pickled. (cPickle protocol 2+ only; the pybind11 constraints are satisfied by Python 3. I could loosen that by serializing the Forms as JSON text and back again, but I don't think that's necessary; I can just say that this feature isn't automatic in Python 2.)

jpivarski · 2020-07-22T23:53:50Z

I haven't explicitly created the HDF5 read/write and the *.awkd file read/write (the format would be different in both cases). These "deconstructed" Awkward Arrays are not a good storage format, as @wctaylor discovered in scikit-hep/awkward-0.x#246.

jpivarski added the feature New feature or request label May 4, 2020

jpivarski linked a pull request Jul 22, 2020 that will close this issue

Added form_key (optional string) to all Forms. #348

Merged

jpivarski closed this as completed in #348 Jul 23, 2020
