BUG: Incorrect output when using groupby().transform() in eval #5511

Closed
3 tasks done
susmitpy opened this issue Jan 2, 2023 · 2 comments
Labels
bug 🦗 Something isn't working External Pull requests and issues from people who do not regularly contribute to modin P1 Important tasks that we should complete soon

Comments

@susmitpy
Contributor

susmitpy commented Jan 2, 2023

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as mpd
df = mpd.DataFrame()
df["num"] = range(1,1001)
df["group"] = ["A"]*500 + ["B"]*500
print(df.groupby("group")["num"].transform('min').unique()) # correct [1,501]
print(df.eval("num.groupby(group).transform('min')").unique()) # incorrect [1,251,501,751]

Issue Description

Whenever groupby and transform are used within df.eval(), the aggregation appears to be performed on individual partitions, so the final result is incorrect (my guess).

In the example, since there are only two groups, the count of unique minimum values in the result should be only 2.
This is correctly demonstrated by [1, 501] when the operation is performed normally.
However when the same operation is performed and the expression is passed as a string, the result is wrong.
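The partition guess can be illustrated with plain pandas. This is only a minimal sketch: the split into 4 row partitions of 250 rows is an assumption chosen for illustration (the real partition count depends on the Modin configuration), and the helper group_min is a hypothetical stand-in for evaluating the expression on a single frame.

```python
import pandas as pd

# Same data as the reproducible example, but in plain pandas.
df = pd.DataFrame({
    "num": range(1, 1001),
    "group": ["A"] * 500 + ["B"] * 500,
})


def group_min(frame):
    # Hypothetical helper: stands in for evaluating
    # "num.groupby(group).transform('min')" on one frame.
    return frame["num"].groupby(frame["group"]).transform("min")


# Whole-frame evaluation: each group sees all of its rows.
full_result = group_min(df)

# Assumed split into 4 row partitions of 250 rows each.
n_partitions = 4
chunk = len(df) // n_partitions
parts = [df.iloc[i * chunk:(i + 1) * chunk] for i in range(n_partitions)]

# Partition-wise evaluation: each partition computes its own group minimum,
# so a group spread over two partitions ends up with two different minima.
partition_result = pd.concat([group_min(p) for p in parts])

print(full_result.unique().tolist())       # [1, 501]
print(partition_result.unique().tolist())  # [1, 251, 501, 751]
```

Under this assumed split, the partition-wise result matches the incorrect output reported above, which is consistent with the expression being evaluated separately on each row partition.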

Expected Behavior

The result from df.eval() should match the result of calling groupby().transform() directly.
Each group should have exactly one aggregated (minimum) value, i.e. 1 for group A and 501 for group B.

Error Logs

No response

Installed Versions

Checked on two different versions
Check 1

Modin dependencies

modin : 0.18.0
ray : 2.2.0

pandas dependencies

pandas : 1.5.2
numpy : 1.22.2

Check 2

Modin dependencies

modin : 0.15.3
ray : 1.9.0

pandas dependencies

pandas : 1.4.4
numpy : 1.21.2

@susmitpy susmitpy added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Jan 2, 2023
@RehanSD
Collaborator

RehanSD commented Jan 3, 2023

Hi @susmitpy! Thank you so much for opening this issue! I've verified that I can reproduce it locally, and confirmed that the offending lines seem to be these:

new_modin_frame = self._modin_frame.apply_full_axis(
    1,
    lambda df: pandas.DataFrame(df.eval(expr, inplace=False, **kwargs)),
    new_index=self.index,
    new_columns=new_columns,
)

where the eval is applied full-axis across the column axis (to make sure we don't get KeyErrors on the column names in the eval expression), but not full-axis across the row axis. I'm not 100% sure what the best solution is here. We could try parsing the expression to see whether it requires the full column axis or can be satisfied along the row axis, broadcasting columns as necessary, and then always perform the eval full-axis across the rows, but I'm not sure that's the best solution. Would love to hear your thoughts as well!
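As a rough illustration of the "parse the expression" idea, here is a minimal sketch using Python's standard ast module. The function name may_need_all_rows and the rule it applies (any call in the expression might aggregate across rows) are assumptions made for this example, not Modin's actual logic. It also assumes the expression parses as a plain Python expression and answers conservatively when it does not.

```python
import ast


def may_need_all_rows(expr: str) -> bool:
    """Hypothetical heuristic: True if `expr` might aggregate across rows."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        # eval-specific syntax (e.g. "@var" or "c = a + b") is not a plain
        # Python expression, so be conservative and ask for all rows.
        return True
    # Any call (e.g. .groupby(...).transform(...)) could depend on rows
    # outside the current partition.
    return any(isinstance(node, ast.Call) for node in ast.walk(tree))


print(may_need_all_rows("num.groupby(group).transform('min')"))
print(may_need_all_rows("a + b * 2"))
```

A check like this would only pick which strategy to use. Purely elementwise expressions could stay partition-wise, while anything flagged would need to be evaluated with the full row axis available.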

Tagging @mvashishtha and @vnlitvinov to get their opinions as well!

@RehanSD RehanSD added P1 Important tasks that we should complete soon and removed Triage 🩹 Issues that need triage labels Jan 3, 2023
@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023
@anmyachev
Collaborator

print(df.eval("num.groupby(group).transform('min')").unique()) returns the correct [1, 501] on 0e42667 (current master).
