Forms should have optional keys, so that they can be used to look up data in regular, named containers #235

Closed
jpivarski opened this issue May 4, 2020 · 7 comments · Fixed by #348
Labels
feature New feature or request

Comments

@jpivarski
Member

@nsmith-: so that you don't have to write a wrapper around Forms to give them this capability (i.e. fetching objects from an object store by a naming convention), it really ought to be part of the Forms themselves.

These keys aren't needed to check that VirtualArrays make sense, but they might be needed for bookkeeping in a system that reconstructs Awkward Arrays from one-dimensional arrays. So far only the minimum has been written, but we can definitely add keys to the Forms so that you don't have to work around their absence.

@jpivarski jpivarski added the feature New feature or request label May 4, 2020
@nsmith-
Member

nsmith- commented May 4, 2020

This also implies support for a utility method that takes a keyed form and a dictionary of arrays and returns an Awkward Array.
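
A minimal sketch of what such a utility might do, written with plain Python lists rather than against the Awkward API. The function name `rebuild` is hypothetical, the `"<form_key>-offsets"` naming convention is borrowed from the `ak.to_arrayset` output shown later in this thread, and only the two node classes that appear there are handled.

```python
def rebuild(form, container):
    """Rebuild nested Python lists from a keyed form and a dict of flat arrays.

    Hypothetical helper: handles only NumpyArray and ListOffsetArray64 nodes.
    """
    key = form["form_key"]
    if form["class"] == "NumpyArray":
        # Leaf node: the flat array is stored under the form_key itself.
        return list(container[key])
    if form["class"] == "ListOffsetArray64":
        # Assumed naming convention: "<form_key>-offsets".
        offsets = container[key + "-offsets"]
        content = rebuild(form["content"], container)
        # Consecutive offsets delimit each sublist of the content.
        return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]
    raise NotImplementedError(form["class"])


form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {"class": "NumpyArray", "primitive": "int64", "form_key": "node1"},
    "form_key": "node0",
}
container = {"node0-offsets": [0, 3, 3, 5], "node1": [1, 2, 3, 4, 5]}

result = rebuild(form, container)
print(result)
```

The form only describes where each flat array belongs in the tree; all the data comes from the container, looked up by key.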

@nsmith-
Member

nsmith- commented May 4, 2020

Alternatively, it could take a dictionary of callables to be used as generators, or a single callable that receives the form key as an argument, etc.
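
The single-callable variant could look roughly like the sketch below. The signature `fetch(form_key, attribute)` and the function name `rebuild_with` are assumptions made for illustration, not an Awkward API, and plain lists stand in for real arrays.

```python
def rebuild_with(form, fetch):
    """Rebuild nested lists, asking `fetch` for each flat array by key.

    `fetch(form_key, attribute)` is a hypothetical signature.
    """
    key = form["form_key"]
    if form["class"] == "NumpyArray":
        return list(fetch(key, "data"))
    if form["class"] == "ListOffsetArray64":
        offsets = fetch(key, "offsets")
        content = rebuild_with(form["content"], fetch)
        return [content[a:b] for a, b in zip(offsets[:-1], offsets[1:])]
    raise NotImplementedError(form["class"])


# Stand-in for a data delivery library: an in-memory store plus a request log.
store = {("node0", "offsets"): [0, 3, 3, 5], ("node1", "data"): [1, 2, 3, 4, 5]}
requested = []


def fetch(form_key, attribute):
    requested.append((form_key, attribute))
    return store[(form_key, attribute)]


form = {
    "class": "ListOffsetArray64",
    "content": {"class": "NumpyArray", "form_key": "node1"},
    "form_key": "node0",
}

result = rebuild_with(form, fetch)
print(result)
print(requested)
```

With a callable, the reconstruction code never sees the store itself, so the same walk over the form works whether the arrays come from memory, a file, or a network service.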

@jpivarski
Member Author

There would be an advantage to having all the metadata be serializable as JSON. Data structures of callables are powerful, but not very portable.
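
To illustrate the portability point with the standard library only: a form made of strings and nested dicts survives a JSON round trip, while a dictionary of callables cannot be serialized that way at all. The form below is abbreviated, and the `generators` dictionary is a made-up example.

```python
import json

# A keyed form is plain data, so it round-trips through JSON text.
form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {"class": "NumpyArray", "primitive": "int64", "form_key": "node1"},
    "form_key": "node0",
}
restored = json.loads(json.dumps(form))

# A dictionary of callables (hypothetical generators) is not JSON-serializable.
generators = {"node1": lambda: [1, 2, 3, 4, 5]}
try:
    json.dumps(generators)
    callables_serializable = True
except TypeError:
    callables_serializable = False

print(restored == form, callables_serializable)
```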

@nsmith-
Member

nsmith- commented May 4, 2020

Depends where you want to set the interface between awkward and the data delivery library. Perhaps awkward knows the column key and partition index (for chunked arrays) and expects an external call to provide the flat array.

@jpivarski
Member Author

What I had in mind when I wrote the original thing was that each one-dimensional array would have a (partition-id, position-in-tree) coordinate and that the Form would have a reference to the position-in-tree part.

For example, in old awkward serialization, the full array was broken down into a bunch of one-dimensional arrays that the backend knew how to store. HDF5 was one of those backends: anything that can name binary blobs would do. Zarr's v3 extension could be put to the same task and it could resolve the biggest problem that the old serialization had: the HDF5 file didn't know it was an Awkward HDF5 file. Having an explicit extension mechanism would at least let Zarr users know that a particular array is supposed to be read with the Awkward library, so that it can raise an error if the library isn't available. I'm optimistic about this as a way to do it and wrote it up at zarr-developers/zarr-specs#62.

The Form would take the role that the old schema.json took, with better separation between the description of where one-dimensional arrays go and those one-dimensional arrays themselves. You raised the point that we'd rather have one JSON that can be reinterpreted for all partitions than having a JSON per partition that's tightly glued to that partition. Whereas the old schema.json included names of actual arrays to find in the blob store (with their lengths), the new Forms can include keys that when joined with a partition-id tell you the names of actual arrays in the store. In Zarr's terminology, each partition can be a group and each group can have arrays with the same names in it.
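
As a rough sketch of joining form keys with a partition-id, the generator below lists the array names that one form would need for a given partition. The name `array_names` is hypothetical, and the `-offsets` and `-partN` suffixes follow the `ak.to_arrayset` output shown later in this thread; a Zarr layout could just as well use the partition as a group name instead of a suffix.

```python
def array_names(form, partition=None):
    """Yield the names of the flat arrays that a form needs for one partition.

    Hypothetical helper: handles only NumpyArray and ListOffsetArray64 nodes.
    """
    suffix = "" if partition is None else "-part{}".format(partition)
    key = form["form_key"]
    if form["class"] == "NumpyArray":
        yield key + suffix
    elif form["class"] == "ListOffsetArray64":
        yield key + "-offsets" + suffix
        yield from array_names(form["content"], partition)
    else:
        raise NotImplementedError(form["class"])


form = {
    "class": "ListOffsetArray64",
    "content": {"class": "NumpyArray", "form_key": "node1"},
    "form_key": "node0",
}

# One form, reinterpreted for every partition.
for partition in range(2):
    print(list(array_names(form, partition)))
```

The form is written once and carries no partition-specific information; only the suffix changes from one partition to the next.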

@jpivarski jpivarski linked a pull request Jul 22, 2020 that will close this issue
@jpivarski
Member Author

As soon as #348 can be merged, we'll have an equivalent of Awkward 0's persistence layer. The new functions are ak.to_arrayset and ak.from_arrayset. The documentation should go online soon after it's merged to master as well, but here are some examples:

>>> import awkward as ak
>>> original = ak.Array([[1, 2, 3], [], [4, 5]])
>>> form, container, num_partitions = ak.to_arrayset(original)
>>> form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "l",
        "primitive": "int64",
        "form_key": "node1"          # <-- the new form_key that makes this possible
    },
    "form_key": "node0"              # <-- the new form_key that makes this possible
}
>>> container
{'node0-offsets': array([0, 3, 3, 5], dtype=int64),
 'node1': array([1, 2, 3, 4, 5])}
>>> print(num_partitions)
None

Read it back with:

>>> ak.from_arrayset(form, container)
<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

Write to partitions with:

>>> container = {}
>>> form, _, _ = ak.to_arrayset(ak.Array([[1, 2, 3], [], [4, 5]]), container, 0)
>>> form, _, _ = ak.to_arrayset(ak.Array([[6, 7, 8, 9]]), container, 1)
>>> form, _, _ = ak.to_arrayset(ak.Array([[], [], []]), container, 2)
>>> form, _, _ = ak.to_arrayset(ak.Array([[10]]), container, 3)
>>> form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "l",
        "primitive": "int64",
        "form_key": "node1"
    },
    "form_key": "node0"
}
>>> container
{'node0-offsets-part0': array([0, 3, 3, 5], dtype=int64),
 'node1-part0': array([1, 2, 3, 4, 5]),
 'node0-offsets-part1': array([0, 4], dtype=int64),
 'node1-part1': array([6, 7, 8, 9]),
 'node0-offsets-part2': array([0, 0, 0, 0], dtype=int64),
 'node1-part2': array([], dtype=float64),
 'node0-offsets-part3': array([0, 1], dtype=int64),
 'node1-part3': array([10])}

Read it back with:

>>> ak.from_arrayset(form, container, 4)
<Array [[1, 2, 3], [], [4, ... [], [], [10]] type='8 * var * int64'>
>>> ak.partitions(ak.from_arrayset(form, container, 4))
[3, 1, 3, 1]

Or read it back lazily:

>>> lazy = ak.from_arrayset(form, container, 4, lazy=True, lazy_lengths=[3, 1, 3, 1])
>>> lazy.metadata["cache"]
{}
>>> lazy
<Array [[1, 2, 3], [], [4, ... [], [], [10]] type='8 * var * int64'>
>>> len(lazy.metadata["cache"])
3
>>> lazy + 100
<Array [[101, 102, 103], [], ... [], [], [110]] type='8 * var * int64'>
>>> len(lazy.metadata["cache"])
4

This is the same partitioned/lazy interface as ak.from_parquet.
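
The reason lazy reading needs the partition lengths up front can be shown with a toy stand-in in plain Python. This is not how Awkward implements lazy arrays; the class `LazyPartitions` and the `load_partition` callable are hypothetical, and the cache here counts whole partitions rather than individual buffers.

```python
class LazyPartitions:
    """Toy stand-in for a lazily loaded, partitioned array."""

    def __init__(self, load_partition, lengths):
        self._load = load_partition    # hypothetical callable: partition index -> list
        self._lengths = list(lengths)  # known up front, like lazy_lengths
        self.cache = {}                # partition index -> loaded data

    def __len__(self):
        # The total length needs no loading because the lengths were supplied.
        return sum(self._lengths)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        for partition, length in enumerate(self._lengths):
            if index < length:
                # Load only the partition that contains the requested entry.
                if partition not in self.cache:
                    self.cache[partition] = self._load(partition)
                return self.cache[partition][index]
            index -= length


partitions = [[[1, 2, 3], [], [4, 5]], [[6, 7, 8, 9]], [[], [], []], [[10]]]
lazy = LazyPartitions(lambda i: partitions[i], [3, 1, 3, 1])

total = len(lazy)                # answered without loading anything
loaded_before = len(lazy.cache)
last = lazy[7]                   # touches only the final partition
loaded_after = sorted(lazy.cache)
print(total, loaded_before, last, loaded_after)
```

Without the lengths, even `len` would force every partition to be read.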

As a side note, Awkward Arrays can now be pickled. (This requires pickle protocol 2 or higher, a pybind11 constraint that Python 3 satisfies by default. I could loosen that by serializing the Forms as JSON text and back again, but I don't think that's necessary; I can just say that this feature isn't automatic in Python 2.)

@jpivarski
Member Author

I haven't explicitly created the HDF5 read/write and the *.awkd file read/write (the format would be different in both cases). These "deconstructed" Awkward Arrays are not a good storage format, as @wctaylor discovered in scikit-hep/awkward-0.x#246.
