Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: structure removal order for unions #3397

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

pfackeldey
Copy link
Collaborator

Fixes: #3180

Copy link
Collaborator

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfackeldey - Great! Thanks for providing the fix so quickly! Indeed, iterating over the unique index fixes the issue.

@pfackeldey
Copy link
Collaborator Author

@pfackeldey - Great! Thanks for providing the fix so quickly! Indeed, iterating over the unique index fixes the issue.

this PR does however make the assumption that the number of unique tags corresponds to the len(self._contents), which I think should be always true. Could you confirm this is true? If yes, then I'm confident that this is the right fix 👍

@agoose77
Copy link
Collaborator

@pfackeldey just reading the comments and not the diff -- len(contents) may not equal unique(tags) if a content is not indexed. I think we have code-paths that can break this invariant, e.g.

import awkward as ak
import numpy as np
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
assert np.unique(y.layout.tags).size == len(y.layout.contents)

@pfackeldey
Copy link
Collaborator Author

@pfackeldey just reading the comments and not the diff -- len(contents) may not equal unique(tags) if a content is not indexed. I think we have code-paths that can break this invariant, e.g.

import awkward as ak
import numpy as np
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
assert np.unique(y.layout.tags).size == len(y.layout.contents)

Ah I see. In this case this implementation is still working correct though:

# with this PR
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
ak.ravel(y)
# <Array [1, 2] type='2 * int64'>

on main this actually fails, which I would consider wrong aswell (so this PR might fix another bug aswell?):

x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
ak.ravel(y)
# ... > AssertionError: cannot merge NumpyArray with ListOffsetArray

The loop over the unique ordered tags will pick (in order) the contents where tags point to. This is always less or equal to the length of contents. Do you see any problem with that @agoose77 ?

@agoose77
Copy link
Collaborator

agoose77 commented Feb 10, 2025

@pfackeldey the original code has two bugs (which are the same kind of bug)! This PR fixes one, but not the other I think.

The intention of remove_structure is to allow the caller to obtain a collection of in-order zero-copy (where possible) 1D buffers such that (approximately) A_ij -> B_k (in Einstein notation). That's only partially possible here. I think of unions a bit like a flattened matrix, i.e. B_k where B_k1 and B_k2 may come from different contents. As such, a contiguous array of B_k will require copying each union item into a contiguous buffer.

For trivial-unions (e.g. a union of integer and bool), the proper way to flatten the layout in an order-preserving way is to write a kernel that fills a new buffer with result = [contents[tags[i]][i] for i in self.length]. But, for non-trivial unions that's a lot harder to figure out. For each content contents[j], there may be N associated values that end up in the structureless result.

Moreover, as we are dealing with a union, to remove the structure probably means we want to simplify a union over the results such that the union disappears and the result is simplified to a common type.

So, maybe the proper way to do this is to build a new union via .simplified over [content[i]._carry(...)._remove_structure() ...] and throw an error if the union doesn't vanish.

But, I also feel compelled to say -- _remove_structure does a few different things. It might be that there are some contexts where we don't care about the order invariant in the final result (e.g. non-positional reducers). In that case, perhaps we want to accept a flag that lets us take a fast path involving less carrying. Although, even then we would need to ensure that the type of the result is correct (i.e. figure out the result type and cast each content[i]._carry(...)._remove_structure() ...] to that type).

@ianna ianna force-pushed the pfackeldey/fix_ravel_order_uniontags branch from ff36fdf to d5b2d9c Compare March 3, 2025 14:51
@ianna ianna marked this pull request as ready for review March 3, 2025 14:58
@ianna ianna requested a review from agoose77 March 3, 2025 14:59
@ianna
Copy link
Collaborator

ianna commented Mar 3, 2025

@pfackeldey and @agoose77 - please, check when you have time. I'm sure, it's not the most efficient implementation - I just want to make sure it is correct and I'm not missing anything. Thanks!

…kit-hep/awkward into pfackeldey/fix_ravel_order_uniontags
Copy link
Collaborator Author

@pfackeldey pfackeldey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right direction 👍, but the code won't be compatible with typetracers as is right now. I'm not sure though how to fix this properly - I'll give this some thoughts...

We should make sure that the following works (this is on main):

import awkward as ak

a = ak.Array([[1, 2, 3], 4, [5, 6], 7, 8], backend="typetracer")
ak.ravel(a)
# <Array-typetracer [...] type='## * int64'>

@ianna
Copy link
Collaborator

ianna commented Mar 4, 2025

@pfackeldey - #3409 should fix the failing tests. It will be auto-merged soon. This PR will need to be re-based.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fill_none followed by ravel reorders arrays
3 participants