[Data] Groupby aggregation (min/max/count) hangs when the group column has NaN #49276

Closed
wingkitlee0 opened this issue Dec 15, 2024 · 2 comments · Fixed by #49420
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P1: Issue that should be fixed within a few weeks

Comments

@wingkitlee0
Contributor

What happened + What you expected to happen

Groupby aggregations such as min, max, and count hang (while the number of tasks keeps increasing) when the group column contains np.nan. For example:

ds = ray.data.from_items([1.0, 1.0, 2.0, np.nan])
ds.groupby("item").count().take_all()

However, map_groups works on the same dataset:

ds.groupby("item").map_groups(lambda x: {"count": [len(x["item"])]}).take_all()

A string column containing None also works.

In pandas, the behavior depends on the dropna argument of groupby: with dropna=False, np.nan is treated as a separate group; with the default dropna=True, rows whose key is np.nan are dropped.
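
For comparison, here is a minimal sketch of the pandas behavior described above, using the same four values as the Ray example. The variable names (df, dropped, kept) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the Ray reproduction, in a single "item" column.
df = pd.DataFrame({"item": [1.0, 1.0, 2.0, np.nan]})

# Default (dropna=True): the row whose key is NaN is excluded from the result.
dropped = df.groupby("item").size()

# dropna=False: NaN becomes its own group alongside 1.0 and 2.0.
kept = df.groupby("item", dropna=False).size()

print(dropped)
print(kept)
```

Either of these would be a reasonable result for Ray Data to return; the point of this issue is that the aggregation hangs instead of returning at all.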

Versions / Dependencies

ray 2.40.0
numpy 2.1.3

Reproduction script

import numpy as np
import ray
ds = ray.data.from_items([1.0, 1.0, 2.0, np.nan])
ds.groupby("item").count().take_all()  # hangs instead of returning

Issue Severity

Medium: It is a significant difficulty but I can work around it.

wingkitlee0 added the bug and triage labels on Dec 15, 2024
jcotant1 added the data label on Dec 16, 2024
richardliaw added the P1 label and removed the triage label on Dec 17, 2024
@richardliaw
Contributor

Great catch! Let me know if you can find the root cause. It's probably in our BlockAccessor or transform functions code.
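
One thing worth keeping in mind while looking for the root cause: NaN never compares equal to itself, so any grouping logic that detects group boundaries with == can misbehave on NaN keys. The sketch below is purely illustrative and is not Ray's code; count_sorted_groups and same_key are hypothetical helpers, and this thread does not establish that this is what causes the hang.

```python
import math
import operator


def same_key(a, b):
    """Hypothetical NaN-aware equality: two NaN keys count as the same key."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def count_sorted_groups(keys, eq):
    """Split an already-sorted key list into runs, using eq to compare keys."""
    groups = []
    start = 0
    for i in range(1, len(keys) + 1):
        # Close the current run at the end of the list or when the key changes.
        if i == len(keys) or not eq(keys[start], keys[i]):
            groups.append((keys[start], i - start))
            start = i
    return groups


nan = float("nan")
keys = [1.0, 1.0, 2.0, nan, nan]

# Plain == never matches NaN with NaN, so each NaN row becomes its own run.
naive = count_sorted_groups(keys, operator.eq)

# NaN-aware comparison keeps the NaN rows together in a single run.
nan_aware = count_sorted_groups(keys, same_key)

print(len(naive), len(nan_aware))
```

A loop that only advances while the current key equals the previous one could fail to make progress on such input, which would be consistent with the hang, but that remains an assumption until someone checks the BlockAccessor code.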

@richardliaw
Contributor

Otherwise we may try to get to this in the next couple of weeks.
