[BUG] drop_duplicates does not drop all duplicates #1959
Comments
Here is another issue with drop_duplicates: it sometimes drops rows in which every value is null, but pandas doesn't, so the behavior is not consistent:
import modin.pandas as pd
import pandas
nan=float("nan")
pdf=pandas.DataFrame(
[[5.0, 4.0], [0.0, nan], [nan, 8.0], [3.0, 1.0], [7.0, 1.0],
[nan, 7.0], [nan, nan], [5.0, nan], [nan, nan], [4.0, 6.0],
[7.0, nan], [nan, nan], [nan, nan], [nan, nan], [1.0, nan],
[nan, nan], [nan, 4.0], [7.0, 4.0], [8.0, 6.0], [1.0, nan],
[5.0, 4.0], [nan, nan], [nan, nan], [nan, nan], [4.0, 8.0],
[3.0, 4.0], [nan, nan], [3.0, 7.0], [nan, 5.0], [nan, 5.0]], columns=['a', 'b'])
mdf = pd.DataFrame(pdf)
print(pdf.drop_duplicates().sort_values(["a","b"]))
print(mdf.drop_duplicates().sort_values(["a","b"]))  # this does not have the (nan, nan) row
The null-dropping behavior is not even consistent within Modin's drop_duplicates. The following works fine:
import modin.pandas as pd
import pandas
nan=float("nan")
pdf=pandas.DataFrame([[5.0, 4.0], [nan, nan]], columns=['a', 'b'])
mdf = pd.DataFrame(pdf)
print(pdf.drop_duplicates().sort_values(["a","b"]))
print(mdf.drop_duplicates().sort_values(["a","b"]))
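For reference, the pandas side of the comparison can be checked without Modin. The sketch below uses a smaller frame than the report, and `count_all_null_rows` is a hypothetical helper introduced only for this illustration. It shows that pandas treats all-null rows as duplicates of one another and keeps the first one, which is the behavior the report expects Modin to match.

```python
import pandas as pd

nan = float("nan")


def count_all_null_rows(df):
    """Return how many rows consist entirely of nulls (hypothetical helper)."""
    return int(df.isna().all(axis=1).sum())


pdf = pd.DataFrame(
    [[5.0, 4.0], [nan, nan], [0.0, nan], [nan, nan], [5.0, 4.0]],
    columns=["a", "b"],
)
deduped = pdf.drop_duplicates()

print(count_all_null_rows(pdf))      # all-null rows before deduplication
print(count_all_null_rows(deduped))  # all-null rows after deduplication
print(deduped)
```

Running the same helper on the Modin result would be one way to spot the missing (nan, nan) row without reading the printed frames by eye.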
Thanks @goodwanghan for the report! I can reproduce the mismatch. This is implemented with a deterministic hashing function. In this case, for some reason, the hash function is not producing identical hashes for some of the identical rows. This will be good to investigate.
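To make the hashing idea concrete, here is a rough sketch of hash-based duplicate detection in plain pandas. It is not Modin's implementation: `pandas.util.hash_pandas_object` is used as a stand-in for whatever hash function Modin applies, and `drop_duplicates_by_hash` is a hypothetical name. The point is only that identical rows, including all-null ones, should produce identical hashes.

```python
import pandas as pd
from pandas.util import hash_pandas_object

nan = float("nan")


def drop_duplicates_by_hash(df):
    """Keep the first row for each distinct row hash (index excluded).

    Hypothetical helper: a stand-in for a hash-based implementation.
    """
    row_hashes = hash_pandas_object(df, index=False)
    return df[~row_hashes.duplicated(keep="first")]


pdf = pd.DataFrame(
    [[5.0, 4.0], [nan, nan], [5.0, 4.0], [nan, nan], [1.0, nan]],
    columns=["a", "b"],
)
by_hash = drop_duplicates_by_hash(pdf)

# Compare against the pandas reference result.
print(by_hash.equals(pdf.drop_duplicates()))
```

A hash-based approach can in principle merge distinct rows on a collision, so this is an illustration of the technique rather than a drop-in replacement for drop_duplicates.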
The root cause is in #1987. It looks like the functions work in each partition separately, without any reduction.
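The effect of skipping the reduction can be imitated in plain pandas. This is a sketch under assumptions, not Modin code: the row partitioning is simulated with `iloc` slices, `dedupe_per_partition` is a hypothetical helper, and the final global pass is just one simple way to perform a reduction. It illustrates the "does not drop all duplicates" symptom from the issue title; the missing (nan, nan) row reported in the comments may have a different cause.

```python
import pandas as pd

nan = float("nan")


def dedupe_per_partition(df, n_parts):
    """Drop duplicates inside each row partition only (no reduction step).

    Hypothetical helper that imitates partition-local processing.
    """
    size = max(1, -(-len(df) // n_parts))  # ceiling division
    parts = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    return pd.concat([part.drop_duplicates() for part in parts])


pdf = pd.DataFrame(
    [[5.0, 4.0], [nan, nan], [3.0, 1.0], [5.0, 4.0], [nan, nan], [nan, nan]],
    columns=["a", "b"],
)

# Duplicates that span two partitions survive the partition-local pass.
local_only = dedupe_per_partition(pdf, n_parts=2)

# A global pass over the combined result acts as the reduction.
with_reduction = local_only.drop_duplicates()

print(len(local_only), len(with_reduction), len(pdf.drop_duplicates()))
```

Rows such as (5.0, 4.0) appear once in each partition, so only a step that sees the partitions together can remove the second copy.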
There should be a reduction, because it is happening with a call to modin/modin/pandas/dataframe.py, lines 288 to 292 in f6b6040.
@devin-petersohn All works when
…licated` and `drop_duplicates` functions Signed-off-by: Alexey Prutskov <[email protected]>
) Signed-off-by: Alexey Prutskov <[email protected]>
…_duplicates` functions (modin-project#1994) Signed-off-by: Alexey Prutskov <[email protected]>
System information
Modin version (modin.__version__): 0.8.0
Describe the problem
Modin does not drop all duplicates (with ray backend)
Source code / logs