[BUG] drop_duplicates does not drop all duplicates #1959

goodwanghan · 2020-08-25T23:45:34Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
Modin version (modin.__version__): 0.8.0
Python version: 3.6.9
Code we can use to reproduce:

import modin.pandas as pd
import pandas

pdf=pandas.DataFrame(
    [[5, 'ssssss0'], [0, 'ssssss4'], [3, 'ssssss5'], [3, 'ssssss5'], [7, 'ssssss6'], 
     [9, 'ssssss8'], [3, 'ssssss4'], [5, 'ssssss1'], [2, 'ssssss4'], [4, 'ssssss9'], 
     [7, 'ssssss8'], [6, 'ssssss1'], [8, 'ssssss1'], [8, 'ssssss7'], [1, 'ssssss9'], 
     [6, 'ssssss9'], [7, 'ssssss3'], [7, 'ssssss6'], [8, 'ssssss7'], [1, 'ssssss2'], 
     [5, 'ssssss0'], [9, 'ssssss3'], [8, 'ssssss5'], [9, 'ssssss9'], [4, 'ssssss4'], 
     [3, 'ssssss4'], [0, 'ssssss6'], [3, 'ssssss4'], [5, 'ssssss4'], [0, 'ssssss3'], 
     [2, 'ssssss4'], [3, 'ssssss4'], [8, 'ssssss8'], [1, 'ssssss4'], [3, 'ssssss3'], 
     [3, 'ssssss7'], [3, 'ssssss5'], [7, 'ssssss5'], [0, 'ssssss0'], [1, 'ssssss1'], 
     [9, 'ssssss5'], [9, 'ssssss9'], [0, 'ssssss3'], [4, 'ssssss0'], [7, 'ssssss5'], 
     [3, 'ssssss0'], [2, 'ssssss1'], [7, 'ssssss2'], [2, 'ssssss4'], [0, 'ssssss2']], columns=['a', 'b'])

mdf = pd.DataFrame(pdf)

print(pdf.drop_duplicates().shape)
print(mdf.drop_duplicates().shape)

Describe the problem

Modin does not drop all duplicates (with the Ray backend): for the DataFrame above, pandas returns 37 rows but Modin returns 41.

Source code / logs

(37, 2)
(41, 2)


goodwanghan · 2020-08-26T05:47:31Z

Here is another issue with drop_duplicates: it sometimes drops rows in which every value is null, whereas pandas doesn't, so the behavior is not consistent with pandas.

import modin.pandas as pd
import pandas

nan=float("nan")
pdf=pandas.DataFrame(
    [[5.0, 4.0], [0.0, nan], [nan, 8.0], [3.0, 1.0], [7.0, 1.0], 
     [nan, 7.0], [nan, nan], [5.0, nan], [nan, nan], [4.0, 6.0], 
     [7.0, nan], [nan, nan], [nan, nan], [nan, nan], [1.0, nan], 
     [nan, nan], [nan, 4.0], [7.0, 4.0], [8.0, 6.0], [1.0, nan], 
     [5.0, 4.0], [nan, nan], [nan, nan], [nan, nan], [4.0, 8.0], 
     [3.0, 4.0], [nan, nan], [3.0, 7.0], [nan, 5.0], [nan, 5.0]], columns=['a', 'b'])

mdf = pd.DataFrame(pdf)

print(pdf.drop_duplicates().sort_values(["a","b"]))
print(mdf.drop_duplicates().sort_values(["a","b"]))  # the Modin result is missing the (nan, nan) row

The null-dropping behavior is not even consistent within Modin's own drop_duplicates.

The following works fine:

import modin.pandas as pd
import pandas

nan=float("nan")
pdf=pandas.DataFrame([[5.0, 4.0], [nan, nan]], columns=['a', 'b'])

mdf = pd.DataFrame(pdf)

print(pdf.drop_duplicates().sort_values(["a","b"]))
print(mdf.drop_duplicates().sort_values(["a","b"]))

devin-petersohn · 2020-08-27T15:26:02Z

Thanks @goodwanghan for the report! I can reproduce the mismatch.

This is implemented with a deterministic hashing function. In this case, for some reason, the hash function is not producing identical hashes for some of the duplicate rows. This will be good to investigate.

prutskov · 2020-08-31T14:35:23Z

The root cause is in #1987. It looks like the functions work on each partition separately, without any reduction.

devin-petersohn · 2020-09-01T12:38:25Z

There should be a reduction, because it happens through a call to apply, which handles the whole axis together.

modin/modin/pandas/dataframe.py, lines 288 to 292 in f6b6040:

    if len(df.columns) > 1:
        hashed = df.apply(lambda s: hash(tuple(s)), axis=1).to_frame()
    else:
        hashed = df
    duplicates = hashed.apply(lambda s: s.duplicated(keep=keep)).squeeze(axis=1)
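The hash-then-duplicated pattern quoted above can be illustrated without Modin. The following is a minimal pure-Python sketch, not Modin's implementation; the helper names drop_duplicates_by_hash and drop_duplicates_by_value are hypothetical. It shows one general limitation of comparing row hashes instead of row values: distinct rows can share a hash. The collision used here relies on a CPython detail (hash(-1) and hash(-2) are both -2). On Python versions before 3.10, hash(float("nan")) is 0, the same as hash(0.0), so rows such as (0.0, nan) and (nan, nan) could collide in the same way. The thread does not confirm collisions as a cause of the reported mismatches.

```python
def drop_duplicates_by_hash(rows):
    """Keep the first row seen for each distinct hash(tuple(row))."""
    seen_hashes = set()
    kept = []
    for row in rows:
        row_hash = hash(tuple(row))
        if row_hash not in seen_hashes:
            seen_hashes.add(row_hash)
            kept.append(row)
    return kept


def drop_duplicates_by_value(rows):
    """Keep the first row seen for each distinct row value."""
    seen_rows = set()
    kept = []
    for row in rows:
        key = tuple(row)
        if key not in seen_rows:
            seen_rows.add(key)
            kept.append(row)
    return kept


# CPython detail (an assumption about the interpreter): hash(-1) == hash(-2),
# so these two distinct rows produce the same tuple hash.
rows = [(-1, 0), (-2, 0), (-1, 0)]

by_hash = drop_duplicates_by_hash(rows)
by_value = drop_duplicates_by_value(rows)

print(by_hash)   # hash-based: the distinct row (-2, 0) is lost on CPython
print(by_value)  # value-based: both distinct rows survive
```

Comparing by value avoids the problem because set membership falls back to equality when two hashes match.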

prutskov · 2020-09-01T12:50:09Z

@devin-petersohn Everything works when the column dtypes are numeric or boolean (our tests check this), but if the columns have different dtypes (numeric and string, for example), we hit the problem described here (our tests don't check this). I will fix this in #1994.
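One possible reason string columns behave differently from numeric ones: CPython salts str hashes per process (controlled by the PYTHONHASHSEED environment variable), while int and float hashes are not salted. The sketch below assumes that partitions are hashed in separate worker processes, as with a Ray backend, and computes hash(tuple(row)) in fresh interpreters started with different seeds. The helper row_hash_in_fresh_process is hypothetical and not part of Modin; the thread itself only states that mixed dtypes trigger the problem.

```python
import os
import subprocess
import sys


def row_hash_in_fresh_process(row_repr, seed):
    """Return hash(<row>) as computed by a new interpreter with the given hash seed.

    row_repr is the source text of a tuple literal, e.g. "(3, 'ssssss5')".
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({row_repr}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


numeric_row = "(3, 5.0)"
mixed_row = "(3, 'ssssss5')"

# Two seeds stand in for two independent worker processes.
numeric_hashes = [row_hash_in_fresh_process(numeric_row, seed) for seed in (1, 2)]
mixed_hashes = [row_hash_in_fresh_process(mixed_row, seed) for seed in (1, 2)]

print("numeric row:", numeric_hashes)  # same value in both processes
print("mixed row:  ", mixed_hashes)    # values differ between processes
```

If identical rows land in partitions handled by different processes, hashes of rows containing strings would not be comparable across those partitions, which would be consistent with the numeric-only tests passing.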


goodwanghan added the bug label Aug 25, 2020

devin-petersohn added this to the 0.8.2 milestone Aug 27, 2020

prutskov self-assigned this Aug 28, 2020

prutskov mentioned this issue Sep 1, 2020 in pull request #1994 (merged): FIX-#1959 #1987: Fix incorrect work of duplicated and drop_duplicates functions

prutskov added a commit to prutskov/modin that referenced this issue Sep 7, 2020

f52578c: FIX-modin-project#1959 modin-project#1987: Fix incorrect work of `duplicated` and `drop_duplicates` functions. Signed-off-by: Alexey Prutskov <[email protected]>

anmyachev closed this as completed in #1994 Sep 7, 2020

anmyachev pushed a commit that referenced this issue Sep 7, 2020

8265b71: FIX-#1959 #1987: Fix duplicated and drop_duplicates functions (#1994). Signed-off-by: Alexey Prutskov <[email protected]>

aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020

c49b9bc: FIX-modin-project#1959 modin-project#1987: Fix duplicated and `drop_duplicates` functions (modin-project#1994). Signed-off-by: Alexey Prutskov <[email protected]>
