-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
uproot.dask(<some tree specifier>) does not know about behaviors? #100
Comments
At a quick glance (I'm not closely tracking |
I think @agoose77 is right, to mirror the |
dask-awkard supports |
Cool, we heavily use behaviors in coffea for a number for pieces of functionality. It would be useful to chat, in general, about how our NanoEvents representation layer operates using dask-awkward. |
Just uploaded dask-awkward 2022.11a1 to PyPI, which includes a commit that added a |
Awesome - thank you. I will keep this issue alive as I work through getting a few things going with NanoEvents and dask-awkward, if that's fine with you. There are some major architectural choices we should discuss once the baseline (i.e. current user interface) is recovered. |
@nsmith- FYI |
@douglasdavis @agoose77 We will also need awkward.with_name support for dask arrays, corresponding to behaviors. Right now using
Would it be possible to fix this in another pre-release? |
I get the same error when using |
When you're working with Dask objects, you should use the |
Ah - OK I wasn't sure how y'all were passing things through or otherwise supplementing the functionality. Thanks! |
Well - different error but same outcome @agoose77. :-) |
@lgray This should be straightforward to add; would you happen to have a short example using |
The most rudimentary example that is also immediately useful:
|
P.S. in the case where an operation is relatively trivial; e.g. x = dak.from_parquet("file.parquet")
dak.num(x.some_field, axis=1) can be represented with x = dak.from_parquet("file.parquet")
dak.map_partitions(ak.num, x.some_field, axis=1) This may be useful for parts of the |
We also use it to do things like annotate doc strings and such. |
Ah - using We could do |
Right, it's a bit more of an intermediate-user feature. A lot of the |
In any case - we do a huge amount of array metadata manipulation and restructuring in NanoEvents having all the facilities to all of that through dask-awkward is where I'd like to be at, since it would be a good factorization. To get a basic idea of the interface: It looks like once we get some of the more basic pieces in place it's not going to be hard at all (though I have a question about numba functions and dask, hopefully these aren't incompatible). |
numba functions and dask are indeed compatible |
|
@douglasdavis moving further a bit on this - does dask_awkward want to support the usual notion of array typing in awkward? If I, for instance, do
whereas
This is often helpful for debugging types or generally a user knowing what they are dealing with at any given moment. I think it's OK that the length may be unknown, but the type of thing being indexed is really important! |
Lindsey Gray ***@***.***> writes:
@douglasdavis moving further a bit on this - does dask_awkward want to
support the usual notion of array typing in awkward?
Definitely! What you show below is a known issue. We need the
`Array.type` call to _not_ depend on length if the underlying layout is
a typetracer layout (because we rely on the "typetracer Array", which is
a length-less highlevel `Array` object).
This is just a guess.. but
maybe scikit-hep/awkward#1849 will help here. cc
@agoose77 - he might have a non-guess answer :)
… If I, for instance, do `print(x.type) if x is a dask_awkward array then I get as an error:
```Traceback (most recent call last):
File "testing.py", line 10, in <module>
print(y.type)
File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 927, in __getattr__
if self._maybe_behavior_method(attr):
File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 912, in _maybe_behavior_method
res = getattr(self._meta, attr)
File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/highlevel.py", line 460, in type
self._layout.form.type_from_behavior(self._behavior), len(self._layout)
TypeError: 'UnknownLengthType' object cannot be interpreted as an integer
```
whereas `print(x.compute().type)` could give:
```
40 * event
```
This is often helpful for debugging types or generally a user knowing
what they are dealing with at any given moment. I think it's OK that
the length may be unknown, but the type of thing being indexed is
really important!
|
What we'll use to support the class Array:
...
@property
def type(self):
return self._meta.type
... but this problem would show up: In [4]: a = dak.from_lists([[1,2,3],[4]])
In [5]: a._meta
Out[5]: <Array-typetracer type='?? * int64'>
In [6]: a._meta.type
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [6], line 1
----> 1 a._meta.type
File ~/.pyenv/versions/3.10.8/envs/dev/lib/python3.10/site-packages/awkward/highlevel.py:460, in Array.type(self)
448 @property
449 def type(self):
450 """
451 The high-level type of this Array; same as #ak.type.
452
(...)
457 wrapped by an #ak.types.ArrayType.
458 """
459 return ak.types.ArrayType(
--> 460 self._layout.form.type_from_behavior(self._behavior), len(self._layout)
461 )
TypeError: 'UnknownLengthType' object cannot be interpreted as an integer |
Yes - the type string assumes that we know the length. For dak arrays that have unknown length we will need to do something. We will want to show the user that the outer component is unknown. I would opt to let a length of None indicate an unknown array length. Will look at this after thanksgiving! |
Ok you got me to dig a little more into this 😛. We would be able to support @property
def type(self):
t = ak.types.ArrayType(a._meta._layout.form.type_from_behavior(a._meta._behavior), 0)
t._length = "??"
return t This feels very hacky! |
Thanks, @agoose77! |
Another one kind of in this vein: when Typically we fill the Example of it not forwarding info presently:
This is a trivial example, for most NanoAOD arrays the branch titles are a bit more explanatory. I know this is a bit of a creature comfort, but people use it when first jumping into looking at data for an analysis. |
And in particular the |
@lgray is referring to scikit-hep/uproot5#784. |
The workflow looks something like:
|
Thanks for the little workflow! In that example would you expect the |
Yes, and that is exactly what we do here within coffea (this is in awkward-1.0, so the analogues of what would be done in dask): Much, actually nearly all, of the functionality we have in coffea (as presently awkward-1.0) for mapping to files and abstracting data to arrays suits dask_awkward much better. |
Great, thanks for the explanation. I think we have one piece of the puzzle with: https://github.com/ContinuumIO/dask-awkward/blob/76860c2343058ca5693ce5756d026ce001cc6122/src/dask_awkward/lib/core.py#L1700-L1708 using it will ensure whatever new collection that gets created will have the proper |
There we go - thank you for translating concepts for me! This is quite a bit of the missing piece. |
I hadn't been thinking of I suppose that a |
Ah - small misreading. We don't use |
I suppose one question is if you want to make |
@jpivarski Not sure what the direction we are heading is very precisely yet. At least now I can get a type tracer built for a specialized NanoEvents schema and see what's missing from there. I suspect a more concrete picture will appear fairly soon. |
We have talked about that, and the technique (but not a public API for it) has turned up in Uproot as well: For now, the fact that this is "One Weird Trick" and not a specialized function in Awkward is less than ideal, but not a high priority because it's easy to implement and won't change. |
I think |
After all back and forth in this thread we are getting very close!
|
My next question is then: how do I take the type tracer and turn it into something that I can do: |
You'll want a collection that is represented by your typetracer. However you overwrite the form of a real awkward array (not a typetracer) you'll need to apply to the collection as a step in a task graph (most easily done with a map_partitions call). I'm guessing you get def overwrite_form(concrete_ak_array, form):
# do something that changes the form of some _real_ data (i.e. ak.from_buffers(form=...,))
# overwritten = ak.from_buffers(form, ..., some_use_of_concrete_ak_array)
return overwritten
data = uproot.dask(...)
z = something()
desired_form = z[1]._form
new_typetracer = dak.typetracer_from_form(desired_form)
data = data.map_partitions(overwrite_form, desired_form, meta=new_typetracer) If
We can add a function to the high level data = dak.update_form(data, form) |
OK - this perhaps exposes a bit of a problem. What could be done possibly, strangely enough, is to use the original Or is there a pleasant way to extract the data that is actually desired by the user? |
Hmm. So we have a column projection optimization that will ensure only necessary columns are read from a dataset; For example say data = uproot.dask({"file.root": "tree"})
thing = data.column001 + data.column100
thing.compute() The fact that you remap events_Photon_pt to events.Photon.pt may be a problem for what is currently implemented, I'm not sure. With the "brute force" optimization method (takes it bit more time than the version that just searches for
|
Ah but the from_buffers call in the graph may end up screwing the optimization and making everything seem necessary. I'll keep thinking about this, but a small workflow would still be useful |
Coming back to this - are there specific reasons you cannot use the "form_key" property of the form to separate what you're calling the data members in code from how they are located on disk? This is how it was done with the awkward-1.0 and I think the pattern still applies just fine? @jpivarski commentary? |
The "form_key" mechanism hasn't changed. (The only v1 → v2 change in that area is that the arguments to I've been reading these comments, and I don't see how the form key could be used to solve this problem. If you did have a solution that leveraged form keys, it should still work, but I don't know how that would have been done. |
Oh - my thinking was that the form key could encode/describe the data product you want to read from the original source file regardless of how it is nested in the dask_awkward object that's being made on top of it. Essentially being the connection to the raw data for dask's io layer to deal with. |
In order to do this column-removal optimization, (I think) Dask needs to know the difference between a source node and other nodes in its graph. But also, I asked about this at one of our meetings, and I remember @douglasdavis showing that the information exists and is being used. The Form's keys wouldn't be a place where that could go, since everything in the Form is source. Is |
In some sense the source nodes are already defined since I'm starting from |
Here it is in detail: uproot is using On the dask side we create an That function class has a On the dask side, we figure out which branches are necessary and then recreate the layer with a new instance of |
The way we determine the necessary branches is test if the task graph executes successfully when operating only on metadata. So in theory, if whatever operations you do without dask can be done to the metadata, we should still be able take advantage of the optimization. This is truly the great power of typetracer arrays! Here is where we call dask-awkward/src/dask_awkward/lib/optimize.py Line 175 in 3f30e0a
This causes the layer to recreate itself with a new function class instance that has redefined the columns. We need to document this but really any function class that defines a |
Thanks for going over this! What would be your proposed manner of implementing this with nanoevents? I could certainly implement something that does what is necessary but I would prefer to make sure I'm not reinventing wheels or otherwise choosing poor factorization. My first guess for this is to do something like:
data = uproot.dask(<something>)
desired_form, new_typetracer = apply_schema(desired_form_maker, data.layout.form)
remap = _NanoEventsRemap(desired_form, <some args>)
data = data.map_partitions(remap, desired_form, meta=new_typetracer) Where I'm not entirely sure exactly how to do that, actually, since I don't quite see what dataformat the |
If I should rather be just building my own mapping and circumventing uproot then we should say that now. I was very much under the impression I'd be able to put most of the handling of what's in the root file on the uproot/dask_awkward side of the fence. |
Lindsey Gray ***@***.***> writes:
If I should rather be just building my own mapping and circumventing
uproot then we should say that now. I was very much under the
impression I'd be able to put most of the handling of what's in the
root file on the uproot/dask_awkward side of the fence.
I don't think a custom mapping of data-on-disk -->
dask-awkward-collection is necessary, especially because `uproot.dask`
already handles column projection. I think the path you've started to
jot down is promising.
With this file here:
https://github.com/CoffeaTeam/coffea/blob/master/tests/samples/nano_dy.root?raw=true
can you write out a code snippet that implements `_NanoEventsRemap` for
the photon pt workflow on _eager_ data (don't worry about dask usage). I
can start playing with building a task graph and experiment with the
optimization.
|
So: is the eager-mode implementation of this using awkward2. It probably covers a bit too much in scope, but the main things to know are that:
If you check out that branch of the coffea repo, do To have something to poke at. If you'd rather me extract it I can do that, but I figured the full implementation in-situ might be illuminating for you! |
This is largely taken care of now. |
Hi!
When I try the following using: awkward 2.0.0rc3, uproot 5.0.0.rc6, dask-awkward 2022.11a0, I get a handle that appears to be an dask-awkward record array which should have a behavior:
$ wget https://github.com/CoffeaTeam/coffea/blob/master/tests/samples/nano_dy.root?raw=true $ python
but it results in:
Is this expected? In either case how do I add to or alter the behavior of a dask array?
The text was updated successfully, but these errors were encountered: