[BUG] drop_duplicates does not drop all duplicates #1959

Closed
goodwanghan opened this issue Aug 25, 2020 · 5 comments · Fixed by #1994

@goodwanghan

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
  • Modin version (modin.__version__): 0.8.0
  • Python version: 3.6.9
  • Code we can use to reproduce:
import modin.pandas as pd
import pandas

pdf=pandas.DataFrame(
    [[5, 'ssssss0'], [0, 'ssssss4'], [3, 'ssssss5'], [3, 'ssssss5'], [7, 'ssssss6'], 
     [9, 'ssssss8'], [3, 'ssssss4'], [5, 'ssssss1'], [2, 'ssssss4'], [4, 'ssssss9'], 
     [7, 'ssssss8'], [6, 'ssssss1'], [8, 'ssssss1'], [8, 'ssssss7'], [1, 'ssssss9'], 
     [6, 'ssssss9'], [7, 'ssssss3'], [7, 'ssssss6'], [8, 'ssssss7'], [1, 'ssssss2'], 
     [5, 'ssssss0'], [9, 'ssssss3'], [8, 'ssssss5'], [9, 'ssssss9'], [4, 'ssssss4'], 
     [3, 'ssssss4'], [0, 'ssssss6'], [3, 'ssssss4'], [5, 'ssssss4'], [0, 'ssssss3'], 
     [2, 'ssssss4'], [3, 'ssssss4'], [8, 'ssssss8'], [1, 'ssssss4'], [3, 'ssssss3'], 
     [3, 'ssssss7'], [3, 'ssssss5'], [7, 'ssssss5'], [0, 'ssssss0'], [1, 'ssssss1'], 
     [9, 'ssssss5'], [9, 'ssssss9'], [0, 'ssssss3'], [4, 'ssssss0'], [7, 'ssssss5'], 
     [3, 'ssssss0'], [2, 'ssssss1'], [7, 'ssssss2'], [2, 'ssssss4'], [0, 'ssssss2']], columns=['a', 'b'])

mdf = pd.DataFrame(pdf)

print(pdf.drop_duplicates().shape)
print(mdf.drop_duplicates().shape)

Describe the problem

Modin does not drop all duplicates (with the Ray backend). In the output below, pandas keeps 37 rows while Modin keeps 41.

Source code / logs

(37, 2)
(41, 2)
@goodwanghan goodwanghan added the bug 🦗 Something isn't working label Aug 25, 2020
@goodwanghan
Author

Here is another issue with drop_duplicates: it sometimes drops rows that are all nulls, but pandas doesn't, so the behavior is inconsistent.

import modin.pandas as pd
import pandas

nan=float("nan")
pdf=pandas.DataFrame(
    [[5.0, 4.0], [0.0, nan], [nan, 8.0], [3.0, 1.0], [7.0, 1.0], 
     [nan, 7.0], [nan, nan], [5.0, nan], [nan, nan], [4.0, 6.0], 
     [7.0, nan], [nan, nan], [nan, nan], [nan, nan], [1.0, nan], 
     [nan, nan], [nan, 4.0], [7.0, 4.0], [8.0, 6.0], [1.0, nan], 
     [5.0, 4.0], [nan, nan], [nan, nan], [nan, nan], [4.0, 8.0], 
     [3.0, 4.0], [nan, nan], [3.0, 7.0], [nan, 5.0], [nan, 5.0]], columns=['a', 'b'])

mdf = pd.DataFrame(pdf)

print(pdf.drop_duplicates().sort_values(["a","b"]))
print(mdf.drop_duplicates().sort_values(["a","b"])) # this does not have (nan,nan) row

The null-dropping behavior is not even consistent within Modin's drop_duplicates.

The following works fine:

import modin.pandas as pd
import pandas

nan=float("nan")
pdf=pandas.DataFrame([[5.0, 4.0], [nan, nan]], columns=['a', 'b'])

mdf = pd.DataFrame(pdf)

print(pdf.drop_duplicates().sort_values(["a","b"]))
print(mdf.drop_duplicates().sort_values(["a","b"]))

@devin-petersohn devin-petersohn added this to the 0.8.2 milestone Aug 27, 2020
@devin-petersohn
Collaborator

Thanks @goodwanghan for the report! I can reproduce the mismatch.

This is implemented with a deterministic hashing function. In this case, for some reason, the hash function is not producing identical hashes for some of the duplicate rows. This is worth investigating.

@prutskov prutskov self-assigned this Aug 28, 2020
@prutskov
Contributor

The root cause is in #1987. It looks like the functions work on each partition separately, without any reduction.
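To illustrate what "each partition separately without any reduction" would look like, here is a minimal sketch in plain pandas. It does not use Modin's internals: the two-way row split, the toy data, and the names `partitions`, `per_partition`, and `global_result` are assumptions made only for this illustration.

```python
import pandas as pd

# Toy frame with duplicates that span the (hypothetical) partition boundary.
df = pd.DataFrame({"a": [3, 5, 3, 7, 5, 3], "b": ["x", "y", "x", "z", "y", "x"]})

# Hypothetical row-wise partitioning into two blocks of three rows each.
partitions = [df.iloc[:3], df.iloc[3:]]

# Deduplicate each block on its own, then concatenate with no final reduction.
per_partition = pd.concat([p.drop_duplicates() for p in partitions])

# Deduplicate over the whole frame at once.
global_result = df.drop_duplicates()

print(per_partition.shape)  # (5, 2): duplicates across blocks survive
print(global_result.shape)  # (3, 2)
```

Duplicates that sit in different blocks are only removed once a step looks at all rows together, which is the kind of surplus seen in the report (41 rows instead of 37).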

@devin-petersohn
Collaborator

There should be a reduction, because this happens through a call to apply, which handles the whole axis together.

if len(df.columns) > 1:
    hashed = df.apply(lambda s: hash(tuple(s)), axis=1).to_frame()
else:
    hashed = df
duplicates = hashed.apply(lambda s: s.duplicated(keep=keep)).squeeze(axis=1)
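For reference, the same row-hashing idea can be tried in plain pandas, inside a single process. This is only a sketch of the technique in the snippet above, not Modin's code path: the toy data is invented, and `dup_by_hash` is a name introduced here.

```python
import pandas as pd

# Mixed dtypes (int and str), as in the original report, but tiny.
df = pd.DataFrame({"a": [3, 5, 3, 7], "b": ["x", "y", "x", "z"]})

# Hash every row as a tuple, then look for repeated hashes.
hashed = df.apply(lambda row: hash(tuple(row)), axis=1)
dup_by_hash = hashed.duplicated(keep="first")

# Within one interpreter, equal rows hash equally, so this agrees with pandas.
print(dup_by_hash.equals(df.duplicated(keep="first")))
```

When everything runs in one interpreter, the hash-based result matches `DataFrame.duplicated`, so any mismatch would have to come from how the hashes are produced or combined across partitions.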

@prutskov
Contributor

prutskov commented Sep 1, 2020

@devin-petersohn Everything works when the column dtypes are numeric or boolean (our tests check this), but if the columns have different dtypes (numeric and string, for example), we hit the problem described here (our tests don't check this). I will fix this in #1994.
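One possible contributor to the numeric-versus-string difference, not confirmed anywhere in this thread, is that Python's built-in `hash()` salts `str` hashes per interpreter process, while `int` hashes are not salted. The sketch below shows that language behavior only; it says nothing about how Modin or Ray actually start their workers. The helper `hash_in_fresh_process` and the seed values are assumptions for illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(expr: str, seed: str) -> int:
    """Evaluate hash(<expr>) in a new interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return int(out.stdout.strip())


# A tuple of ints hashes identically regardless of the hash seed.
int_hashes = {hash_in_fresh_process("(5, 3)", seed) for seed in ("1", "2")}

# A tuple containing a str hashes differently under different seeds.
str_hashes = {hash_in_fresh_process("(5, 'ssssss0')", seed) for seed in ("1", "2")}

print(len(int_hashes), len(str_hashes))
```

If row hashes were computed in separate processes with different seeds, identical rows containing strings could receive different hashes, whereas all-numeric rows would not. Whether that is what happened here would need to be checked against #1987 and #1994.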

prutskov added a commit to prutskov/modin that referenced this issue Sep 7, 2020
…licated` and

`drop_duplicates` functions

Signed-off-by: Alexey Prutskov <[email protected]>
anmyachev pushed a commit that referenced this issue Sep 7, 2020
aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020