Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uproot.dask(<some tree specifier>) does not know about behaviors? #100

Closed
lgray opened this issue Nov 14, 2022 · 72 comments
Closed

uproot.dask(<some tree specifier>) does not know about behaviors? #100

lgray opened this issue Nov 14, 2022 · 72 comments

Comments

@lgray
Copy link
Collaborator

lgray commented Nov 14, 2022

Hi!

When I try the following using: awkward 2.0.0rc3, uproot 5.0.0.rc6, dask-awkward 2022.11a0, I get a handle that appears to be an dask-awkward record array which should have a behavior:

$ wget https://github.com/CoffeaTeam/coffea/blob/master/tests/samples/nano_dy.root?raw=true
$ python
>>> import uproot
>>> tree = uproot.dask({"./nano_dy.root": "Events"})
>>> tree.behavior

but it results in:

>>> tree.behavior
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 927, in __getattr__
    return self._call_behavior_property(attr)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 894, in _call_behavior_property
    return self.map_partitions(
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 967, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1295, in map_partitions
    return new_array_object(
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1114, in new_array_object
    actual_meta = compute_typetracer(dsk, name)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1061, in compute_typetracer
    return typetracer_array(Delayed(key, dsk.cull({key}), layer=name).compute())
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1573, in typetracer_array
    raise TypeError(msg)
TypeError: `a` should be an awkward array or a Dask awkward collection.
Got type <class 'NoneType'>

Is this expected? In either case how do I add to or alter the behavior of a dask array?

@agoose77
Copy link
Collaborator

At a quick glance (I'm not closely tracking dask-awkward just at the moment), it looks like we don't proxy the behaviour attribute to the meta object, e.g. like we do for layout. I think this might be the issue here, although the error itself is also interesting from a "what is happening instead" perspective

@douglasdavis
Copy link
Collaborator

I think @agoose77 is right, to mirror the awkward API we'll need to add a property to the dask-awkward Array that reports the behavior via its _meta.

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 14, 2022

how do I add to or alter the behavior of a dask array?

dask-awkard supports {ak,dak}.with_name(a, behavior=...). I'd be happy to chat about behavior support, it's an area of dask-awkward that could use some improvements w.r.t. mirroring awkward's API. (current tests: https://github.com/ContinuumIO/dask-awkward/blob/main/tests/test_behavior.py)

@lgray
Copy link
Collaborator Author

lgray commented Nov 14, 2022

Cool, we heavily use behaviors in coffea for a number for pieces of functionality. It would be useful to chat, in general, about how our NanoEvents representation layer operates using dask-awkward.

@douglasdavis
Copy link
Collaborator

Just uploaded dask-awkward 2022.11a1 to PyPI, which includes a commit that added a behavior property to the Array collection.

@lgray
Copy link
Collaborator Author

lgray commented Nov 14, 2022

Awesome - thank you. I will keep this issue alive as I work through getting a few things going with NanoEvents and dask-awkward, if that's fine with you.

There are some major architectural choices we should discuss once the baseline (i.e. current user interface) is recovered.

@lgray
Copy link
Collaborator Author

lgray commented Nov 15, 2022

@nsmith- FYI

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

@douglasdavis @agoose77 We will also need awkward.with_name support for dask arrays, corresponding to behaviors. Right now using ak.with_name results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lgray/coffea-dev/coffea/coffea/nanoevents/schemas/base.py", line 115, in apply_to_dask
    dask_record = awkward.with_name(dask_record, "NanoEvents")
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/operations/ak_with_name.py", line 35, in with_name
    return _impl(array, name, highlevel, behavior)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/operations/ak_with_name.py", line 40, in _impl
    layout = ak.operations.to_layout(array)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/operations/ak_to_layout.py", line 45, in to_layout
    return _impl(array, allow_record, allow_other, numpytype)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/operations/ak_to_layout.py", line 134, in _impl
    ak.operations.from_iter(array, highlevel=False),
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/operations/ak_from_iter.py", line 64, in from_iter
    return _impl(iterable, highlevel, behavior, allow_record, initial, resize)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/operations/ak_from_iter.py", line 89, in _impl
    builder.fromiter(iterable)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 930, in wrapper
    return self._call_behavior_method(attr, *args, **kwargs)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 891, in _call_behavior_method
    return self.map_partitions(
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 975, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1303, in map_partitions
    return new_array_object(
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1122, in new_array_object
    actual_meta = compute_typetracer(dsk, name)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1069, in compute_typetracer
    return typetracer_array(Delayed(key, dsk.cull({key}), layer=name).compute())
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 1581, in typetracer_array
    raise TypeError(msg)
TypeError: `a` should be an awkward array or a Dask awkward collection.
Got type <class 'list'>

Would it be possible to fix this in another pre-release?

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

I get the same error when using ak.with_parameter, we use parameters quite a lot to annotate arrays with TTree metadata.

@agoose77
Copy link
Collaborator

When you're working with Dask objects, you should use the import dask_awkward as dak namespace - the Awkward high level API won't be Dask aware :)

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

Ah - OK I wasn't sure how y'all were passing things through or otherwise supplementing the functionality. Thanks!

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

>>> BaseSchema.apply_to_dask(x)
/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py:1276: UserWarning: metadata could not be determined; a compute on the first partition will occur.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lgray/coffea-dev/coffea/coffea/nanoevents/schemas/base.py", line 116, in apply_to_dask
    dask_record = dask_awkward.with_parameter(dask_record, "metadata", {})
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/structure.py", line 481, in with_parameter
    raise DaskAwkwardNotImplemented("TODO")
dask_awkward.utils.DaskAwkwardNotImplemented: TODO

If you would like this unsupported call to be supported by
dask-awkward please open an issue at:
https://github.com/ContinuumIO/dask-awkward.

Well - different error but same outcome @agoose77. :-)

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 16, 2022

@lgray This should be straightforward to add; would you happen to have a short example using with_parameter?

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

The most rudimentary example that is also immediately useful:

dask_record = uproot.dask({"some/nanoaod/file.root": "Events"})
dask_record = dask_awkward.with_parameter(dask_record, "metadata", {})

@douglasdavis
Copy link
Collaborator

P.S. in the case where an operation is relatively trivial; dak.map_partitions provides a generic API for applying an awkward function to a collection:

e.g.

x = dak.from_parquet("file.parquet")
dak.num(x.some_field, axis=1)

can be represented with

x = dak.from_parquet("file.parquet")
dak.map_partitions(ak.num, x.some_field, axis=1)

This may be useful for parts of the ak. namespace that don't currently have support in the dak. namespace. It's still useful for us to know which parts of ak. are needed!

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

We also use it to do things like annotate doc strings and such.

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

Ah - using dak.map_partitions would be extremely disruptive to people currently doing analysis. It would be best to be in a position where a user could import dask_awkward as ak and operations-wise be able to execute the same code.

We could do dak.map_partitions behind the scenes.

@douglasdavis
Copy link
Collaborator

We could do dak.map_partitions behind the scenes.

Right, it's a bit more of an intermediate-user feature. A lot of the dak. <--> ak. connection is made with map_partitions.

@lgray
Copy link
Collaborator Author

lgray commented Nov 16, 2022

In any case - we do a huge amount of array metadata manipulation and restructuring in NanoEvents having all the facilities to all of that through dask-awkward is where I'd like to be at, since it would be a good factorization.

To get a basic idea of the interface:
https://www.youtube.com/watch?v=McKSS_WjLwU&t=1346s

It looks like once we get some of the more basic pieces in place it's not going to be hard at all (though I have a question about numba functions and dask, hopefully these aren't incompatible).

@douglasdavis
Copy link
Collaborator

numba functions and dask are indeed compatible

@douglasdavis
Copy link
Collaborator

with_parameter and without_parameters are in 2022.11a2

@lgray
Copy link
Collaborator Author

lgray commented Nov 18, 2022

@douglasdavis moving further a bit on this - does dask_awkward want to support the usual notion of array typing in awkward?

If I, for instance, do print(x.type) if x is a dask_awkward array then I get as an error:

  File "testing.py", line 10, in <module>
    print(y.type)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 927, in __getattr__
    if self._maybe_behavior_method(attr):
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py", line 912, in _maybe_behavior_method
    res = getattr(self._meta, attr)
  File "/Users/lgray/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/awkward/highlevel.py", line 460, in type
    self._layout.form.type_from_behavior(self._behavior), len(self._layout)
TypeError: 'UnknownLengthType' object cannot be interpreted as an integer

whereas print(x.compute().type) could give:

40 * event

This is often helpful for debugging types or generally a user knowing what they are dealing with at any given moment. I think it's OK that the length may be unknown, but the type of thing being indexed is really important!

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 18, 2022 via email

@douglasdavis
Copy link
Collaborator

What we'll use to support the type property will just be x._meta.type:

class Array:
    ...

    @property
    def type(self):
        return self._meta.type

    ...

but this problem would show up:

In [4]: a = dak.from_lists([[1,2,3],[4]])

In [5]: a._meta
Out[5]: <Array-typetracer type='?? * int64'>

In [6]: a._meta.type
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [6], line 1
----> 1 a._meta.type

File ~/.pyenv/versions/3.10.8/envs/dev/lib/python3.10/site-packages/awkward/highlevel.py:460, in Array.type(self)
    448 @property
    449 def type(self):
    450     """
    451     The high-level type of this Array; same as #ak.type.
    452 
   (...)
    457     wrapped by an #ak.types.ArrayType.
    458     """
    459     return ak.types.ArrayType(
--> 460         self._layout.form.type_from_behavior(self._behavior), len(self._layout)
    461     )

TypeError: 'UnknownLengthType' object cannot be interpreted as an integer

@agoose77
Copy link
Collaborator

Yes - the type string assumes that we know the length. For dak arrays that have unknown length we will need to do something. We will want to show the user that the outer component is unknown. I would opt to let a length of None indicate an unknown array length. Will look at this after thanksgiving!

@douglasdavis
Copy link
Collaborator

Ok you got me to dig a little more into this 😛. We would be able to support .type with this implementation:

@property
def type(self):
    t = ak.types.ArrayType(a._meta._layout.form.type_from_behavior(a._meta._behavior), 0)
    t._length = "??"
    return t

This feels very hacky!

@douglasdavis
Copy link
Collaborator

Thanks, @agoose77!

@lgray
Copy link
Collaborator Author

lgray commented Nov 18, 2022

Another one kind of in this vein: when __doc__ is set in an awkward array's parameters it's forwarded to ipython/jupyter help and this is a pretty useful tool to have when poking around one's data.

Typically we fill the __doc__ with the TTree's or TBranch's title, but it could be used for all sorts of information or helpful hints. Could this functionality be added at the dak level as opposed to having to go through compute() to get it?

Example of it not forwarding info presently:

In [5]: z = y.compute()

In [6]: z?
Type:            NanoEventsArray
String form:     [<NanoEventsRecord {run: 1, luminosityBlock: 13889, ...} type='event'>, ...]
Length:          40
File:            ~/coffea-dev/coffea/coffea/nanoevents/methods/base.py
Docstring:       Events
Class docstring:
NanoEvents mixin class

This mixin class is used as the top-level type for NanoEvents objects.

In [7]: y?
Type:        Array
String form: dask.awkward<with-parameter, npartitions=1>
Length:      40
File:        ~/miniforge3/envs/coffea-dev/lib/python3.8/site-packages/dask_awkward/lib/core.py
Docstring:  
Partitioned, lazy, and parallel Awkward Array Dask collection.

The class constructor is not intended for users. Instead use
factory functions like :py:func:`~dask_awkward.from_parquet`,
:py:func:`~dask_awkward.from_json`, etc.

Within dask-awkward the ``new_array_object`` factory function is
used for creating new **instances.**

This is a trivial example, for most NanoAOD arrays the branch titles are a bit more explanatory. I know this is a bit of a creature comfort, but people use it when first jumping into looking at data for an analysis.

@lgray
Copy link
Collaborator Author

lgray commented Nov 18, 2022

And in particular the dask_awkward version of the form/layout doesn't say what it is expected to be once computed, which may be confusing. Perhaps we can say "dask-awkward proxy for XYZClass" or something to that effect? This should be possible knowing the behavior.

@jpivarski
Copy link
Collaborator

when __doc__ is set in an awkward array's parameters

@lgray is referring to scikit-hep/uproot5#784.

@lgray
Copy link
Collaborator Author

lgray commented Nov 21, 2022

The workflow looks something like:

base_array = opener("some/file/some/where.something") # uproot.dask or dak.from_<serialized format>
desired_form = apply_schema(base_array, specified_schema) # this is in derived library code (e.g. nanoevents)
desired_array = remapping(base_array, desired_form) # this is presently done with ak.from_buffers
user_function(desired_array) # this could be a whole physics analysis that returns histograms

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 21, 2022

Thanks for the little workflow! In that example would you expect the remapping function use ak.to_buffers(base_array) and return ak.from_buffers(form=desired_form, ...) in its body?

@lgray
Copy link
Collaborator Author

lgray commented Nov 21, 2022

Yes, and that is exactly what we do here within coffea (this is in awkward-1.0, so the analogues of what would be done in dask):
https://github.com/CoffeaTeam/coffea/blob/master/coffea/nanoevents/factory.py#L382-L397

Much, actually nearly all, of the functionality we have in coffea (as presently awkward-1.0) for mapping to files and abstracting data to arrays suits dask_awkward much better.

@douglasdavis
Copy link
Collaborator

Great, thanks for the explanation. I think we have one piece of the puzzle with: https://github.com/ContinuumIO/dask-awkward/blob/76860c2343058ca5693ce5756d026ce001cc6122/src/dask_awkward/lib/core.py#L1700-L1708

using it will ensure whatever new collection that gets created will have the proper _meta attr

@lgray
Copy link
Collaborator Author

lgray commented Nov 21, 2022

There we go - thank you for translating concepts for me! This is quite a bit of the missing piece.

@jpivarski
Copy link
Collaborator

I hadn't been thinking of from_buffers as a function that would be "dak-able," but I can see how it occupies a central place. It is the most flexible way to make new arrays, and some things would have been implemented using it if they hadn't come first.

I suppose that a dak.from_buffers would be possible, taking dask.array values in the dict, assuming that they're partitioned in a compatible way. But it doesn't sound to me like that's the direction you're headed with this.

@lgray
Copy link
Collaborator Author

lgray commented Nov 21, 2022

Ah - small misreading. We don't use to_buffers at all but instead we build up the form and its various keys from the TTree metdata and then from_buffers that.

@lgray
Copy link
Collaborator Author

lgray commented Nov 21, 2022

I suppose one question is if you want to make typetracer_from_form a public interface, it's not exported at present so I'm not sure how much I want to build off of it. I can certainly prototype something up, will be ready to change a lot if necessary.

@lgray
Copy link
Collaborator Author

lgray commented Nov 21, 2022

@jpivarski Not sure what the direction we are heading is very precisely yet. At least now I can get a type tracer built for a specialized NanoEvents schema and see what's missing from there. I suspect a more concrete picture will appear fairly soon.

@jpivarski
Copy link
Collaborator

We have talked about that, and the technique (but not a public API for it) has turned up in Uproot as well:

https://github.com/scikit-hep/uproot5/blob/b36a0229391c00e101fdbfd56925afa9e2bd2ebb/src/uproot/_dask.py#L610-L613

For now, the fact that this is "One Weird Trick" and not a specialized function in Awkward is less than ideal, but not a high priority because it's easy to implement and won't change.

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 21, 2022

I suppose one question is if you want to make typetracer_from_form a public interface, it's not exported at present so I'm not sure how much I want to build off of it. I can certainly prototype something up, will be ready to change a lot if necessary.

I think dak.typetracer_from_form is a good addition; added with ContinuumIO/dask-awkward@ac41e4c

@lgray
Copy link
Collaborator Author

lgray commented Nov 23, 2022

After all back and forth in this thread we are getting very close!

>>> events = dak.typetracer_from_form(z[1]._form)
>>> events.Photon.pt
<Array-typetracer type='?? * var * float32'>
>>> events.Photon
<Array-typetracer type='?? * var * Photon[eCorr: float32, energyErr: float3...'>
>>> events
<Array-typetracer type='?? * NanoEvents[ChsMET: MissingET[phi: float32[para...'>
>>> events.fields
['ChsMET', 'LHEWeight', 'genWeight', 'L1simulation', 'SoftActivityJetNjets5', 'L1', 'Muon', 'GenVisTau', 'SoftActivityJetHT10', 'LHEScaleWeight', 'PSWeight', 'IsoTrack', 'PuppiMET', 'Photon', 'GenPart', 'FatJet', 'fixedGridRhoFastjetCentralCalo', 'event', 'SV', 'MET', 'LHEPart', 'SoftActivityJetHT2', 'L1Reco', 'genTtbarId', 'LHE', 'SoftActivityJetNjets2', 'GenDressedLepton', 'CaloMET', 'LHEReweightingWeight', 'Generator', 'Pileup', 'Jet', 'fixedGridRhoFastjetCentralChargedPileUp', 'luminosityBlock', 'HLTriggerFirstPath', 'SubGenJetAK8', 'TkMET', 'CorrT1METJet', 'HTXS', 'Flag', 'SoftActivityJetHT', 'Electron', 'fixedGridRhoFastjetAll', 'SoftActivityJet', 'TrigObj', 'run', 'HLTriggerFinalPath', 'OtherPV', 'PV', 'SoftActivityJetHT5', 'fixedGridRhoFastjetCentral', 'SubJet', 'SoftActivityJetNjets10', 'FsrPhoton', 'GenMET', 'GenJetAK8', 'HLT', 'fixedGridRhoFastjetCentralNeutral', 'GenJet', 'LHEPdfWeight', 'RawMET', 'btagWeight', 'Tau']

@lgray
Copy link
Collaborator Author

lgray commented Nov 23, 2022

My next question is then: how do I take the type tracer and turn it into something that I can do:
events.Photon.pt.compute()
on? (sorry if I missed it!)

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 23, 2022

You'll want a collection that is represented by your typetracer. However you overwrite the form of a real awkward array (not a typetracer) you'll need to apply to the collection as a step in a task graph (most easily done with a map_partitions call).

I'm guessing you get z from somewhere in this workflow and data from somewhere in this workflow:

def overwrite_form(concrete_ak_array, form):
    # do something that changes the form of some _real_ data (i.e. ak.from_buffers(form=...,))
    # overwritten = ak.from_buffers(form, ..., some_use_of_concrete_ak_array)
    return overwritten

data = uproot.dask(...)
z = something()
desired_form = z[1]._form
new_typetracer = dak.typetracer_from_form(desired_form)
data = data.map_partitions(overwrite_form, desired_form, meta=new_typetracer)

If data has the same fields as your events.fields you should now be able to do

data.Photon.pt.compute()

We can add a function to the high level dak API that updates the form of a collection (something like dak.update_form) essentially does that code block above but is as simple as:

data = dak.update_form(data, form)

@lgray
Copy link
Collaborator Author

lgray commented Nov 23, 2022

OK - this perhaps exposes a bit of a problem.
What I want to be able to do is use dask's lazy evaluation to read only the data that is asked for by the end-user analysis.
Since NanoEvents describes the entire transformed data frame it would cause the reading of all the data when we ask for ak.from_buffers, which is definitely not what we want.

What could be done possibly, strangely enough, is to use the original data = uproot.dask as the data mapping and use the form keys to extract what is expected (and call compute, when/if needed I guess?).

Or is there a pleasant way to extract the data that is actually desired by the user?

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 23, 2022

Hmm. So we have a column projection optimization that will ensure only necessary columns are read from a dataset; uproot.dask already benefits from this.

For example say file.root has tree with 100 branches: column{001-100}. If I use uproot.dask without specifying the branches I want, we will automatically read only ["column001", "column100"] with this workflow (the optimization happens during compute):

data = uproot.dask({"file.root": "tree"})
thing = data.column001 + data.column100
thing.compute()

The fact that you remap events_Photon_pt to events.Photon.pt may be a problem for what is currently implemented, I'm not sure.

With the "brute force" optimization method (takes it bit more time than the version that just searches for getitem calls). We execute the dask graph on only the metadata and check to see if the computation fails due to a missing field. If the computation fails, we know that the given field that we've held out is necessary. In theory if one of the steps in the task graph is just modifying the form then the brute force method may work out of the box. If we manually overwrite the metadata at some point in the task graph we may run into trouble. But I'm not sure. We'll need a small example workflow to play with that goes through the process of

  1. make an awkward array collection with uproot.dask
  2. do some kind of updating of the form
  3. do a computation dependent on the new form
  4. check if successful computation while logging what was read by uproot (to make sure we read only necessary branches)

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 23, 2022

Ah but the from_buffers call in the graph may end up screwing the optimization and making everything seem necessary. I'll keep thinking about this, but a small workflow would still be useful

@lgray
Copy link
Collaborator Author

lgray commented Nov 28, 2022

Coming back to this - are there specific reasons you cannot use the "form_key" property of the form to separate what you're calling the data members in code from how they are located on disk?

This is how it was done with the awkward-1.0 and I think the pattern still applies just fine?

@jpivarski commentary?

@jpivarski
Copy link
Collaborator

The "form_key" mechanism hasn't changed. (The only v1 → v2 change in that area is that the arguments to from_buffers for constructing buffer names from form keys has been simplified. In particular, there's no "part#-" substring for partitions, since there are no partitions anymore.)

I've been reading these comments, and I don't see how the form key could be used to solve this problem. If you did have a solution that leveraged form keys, it should still work, but I don't know how that would have been done.

@lgray
Copy link
Collaborator Author

lgray commented Nov 28, 2022

Oh - my thinking was that the form key could encode/describe the data product you want to read from the original source file regardless of how it is nested in the dask_awkward object that's being made on top of it. Essentially being the connection to the raw data for dask's io layer to deal with.

@jpivarski
Copy link
Collaborator

In order to do this column-removal optimization, (I think) Dask needs to know the difference between a source node and other nodes in its graph. But also, I asked about this at one of our meetings, and I remember @douglasdavis showing that the information exists and is being used. The Form's keys wouldn't be a place where that could go, since everything in the Form is source.

Is uproot.dask adding the metadata that distinguishes source nodes from other nodes, and @lgray needs to be told how to add that metadata? Is that all that's needed here?

@lgray
Copy link
Collaborator Author

lgray commented Nov 28, 2022

In some sense the source nodes are already defined since I'm starting from uproot.dask and applying a new form off of that. Purely metadata changes. So I suppose all I am missing is the way to "forward" what uproot.dask has into however I've remapped it.

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 29, 2022

uproot.dask indeed takes advantage of what I think you are calling the source node vs regular node difference. "Source nodes" (i.e. nodes that do the data loading) are generated from AwkwardIOLayers. The use of dask_awkward.from_map is where this happens (from_map will create an AwkwardIOLayer); the optimization searches for layers of this type and will call the project_columns method on the layers' io_func (the actual callable that does the loading).

Here it is in detail:

uproot is using from_map and passes a function class called _UprootRead to be the callable that does the loading:

https://github.com/scikit-hep/uproot5/blob/3bb79bcecb3c0df2d4787d28258e3054614ea374/src/uproot/_dask.py#L770-L775

On the dask side we create an AwkwardIOLayer which will create a graph that has nodes that will call _UprootRead.__call__.

That function class has a project_columns method that remakes the class with a new attribute that controls which branches are read:

https://github.com/scikit-hep/uproot5/blob/3bb79bcecb3c0df2d4787d28258e3054614ea374/src/uproot/_dask.py#L583-L599

On the dask side, we figure out which branches are necessary and then recreate the layer with a new instance of _UprootRead that now reads only the necessary branches.

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 29, 2022

The way we determine the necessary branches is test if the task graph executes successfully when operating only on metadata. So in theory, if whatever operations you do without dask can be done to the metadata, we should still be able take advantage of the optimization. This is truly the great power of typetracer arrays!

Here is where we call project_columns such that we instantiate a version of the callable where only the necessary columns will be used:

new_layer = v.project_columns(necessary_cols)

This causes the layer to recreate itself with a new function class instance that has redefined the columns. We need to document this but really any function class that defines a project_columns method is compatible with the column projection optimization (and _UprootRead indeed implements a project_columns method: scikit-hep/uproot5#755).

@lgray
Copy link
Collaborator Author

lgray commented Nov 29, 2022

Thanks for going over this! What would be your proposed manner of implementing this with nanoevents? I could certainly implement something that does what is necessary but I would prefer to make sure I'm not reinventing wheels or otherwise choosing poor factorization.

My first guess for this is to do something like:

  • Have a class _NanoEventsRemap that creates the correspondence between the altered schema calls so I can just use map_partitions on it and therefore I can do (from above):
data = uproot.dask(<something>)
desired_form, new_typetracer = apply_schema(desired_form_maker, data.layout.form)
remap = _NanoEventsRemap(desired_form, <some args>)
data = data.map_partitions(remap, desired_form, meta=new_typetracer)

Where _NanoEventsRemap essentially stores the mapping from events.Photon.pt in the NanoEvents form to Photon_pt in the uproot.dask raw interpretation. So the internals of it would have to examine the type tracer to see what to actually call from?

I'm not entirely sure exactly how to do that, actually, since I don't quite see what dataformat the _NanoEventsRemap class remap instance would receive into __call__. I can certainly always just go find out but beter to ask first. There would need to be some functionality of generating new calls in the task graph on the fly (there are synthetic or otherwise augmented columns in NanoEvents for convenience).

@lgray
Copy link
Collaborator Author

lgray commented Nov 30, 2022

If I should rather be just building my own mapping and circumventing uproot then we should say that now. I was very much under the impression I'd be able to put most of the handling of what's in the root file on the uproot/dask_awkward side of the fence.

@douglasdavis
Copy link
Collaborator

douglasdavis commented Nov 30, 2022 via email

@lgray
Copy link
Collaborator Author

lgray commented Nov 30, 2022

So:
https://github.com/CoffeaTeam/coffea/blob/awkward2_dev/coffea/nanoevents/mapping/uproot.py

is the eager-mode implementation of this using awkward2. It probably covers a bit too much in scope, but the main things to know are that:

  • it constructs the desired form and applies it to data
  • uses the form_key in the awkward array to specify where to get the data from and possibly any transformations to that data (with a rather small embedded DSL)

If you check out that branch of the coffea repo, do cd coffea && pip install -e '.[dev]' you can then run:
pytest tests/test_nanoevents.py (https://github.com/CoffeaTeam/coffea/blob/awkward2_dev/tests/test_nanoevents.py)

To have something to poke at. If you'd rather me extract it I can do that, but I figured the full implementation in-situ might be illuminating for you!

@lgray
Copy link
Collaborator Author

lgray commented Feb 22, 2023

This is largely taken care of now.

@lgray lgray closed this as completed Feb 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants