[BUG] left join style merge reorders resultant dataframe #2259

Closed
beckernick opened this issue Mar 1, 2019 · 1 comment
Labels
dask Dask issue

Comments

@beckernick
Member

Description

As a user, I expect the dataframe returned by a left join to preserve the row order of my left dataframe. Instead, a dask_cudf left merge returns the rows in a different order. The reordering occurs both when the left and right dataframes have all keys in common and when the right dataframe has only a subset of the left's keys.

Example

import cudf
import numpy as np
import pandas as pd
import dask_cudf
import dask.dataframe as dd

## Subset of keys
df_a = cudf.DataFrame()
df_a['key'] = [0, 1, 2, 3, 4]
df_a['vals_a'] = [float(i + 10) for i in range(5)]

df_b = cudf.DataFrame()
df_b['key'] = [1, 2, 4]
df_b['vals_b'] = [float(i+100) for i in range(3)]

ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)
ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)

merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()
print(merged)
   key  vals_a  vals_b
0    1    11.0   100.0
1    2    12.0   101.0
2    4    14.0   102.0
3    0    10.0        
0    3    13.0   


## All keys in common
df_a = cudf.DataFrame()
df_a['key'] = [0, 1, 2, 3, 4]
df_a['vals_a'] = [float(i + 10) for i in range(5)]

df_b = cudf.DataFrame()
df_b['key'] = [0, 1, 2, 3, 4]
df_b['vals_b'] = [float(i+100) for i in range(5)]

ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)
ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)

merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()
print(merged)
   key  vals_a  vals_b
0    0    10.0   100.0
1    1    11.0   101.0
2    2    12.0   102.0
3    4    14.0   104.0
0    3    13.0   103.0
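
One way to see why a join that works partition by partition can return rows out of order, and how an explicit position column restores the left ordering, is the plain-Python sketch below. It is not dask_cudf's implementation: `partitioned_left_join`, `ordered_left_join`, and the `_order` column are hypothetical names, and the hash bucketing is only assumed to resemble what a partitioned merge does.

```python
from collections import defaultdict


def partitioned_left_join(left, right, key, npartitions=2):
    """Left-join two lists of dict rows, one hash bucket at a time.

    Illustrative only: this mimics a partitioned join, it is not dask_cudf code.
    """
    # Index the right rows by join key.
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[key]].append(row)

    # Columns to fill with None when a left row has no match.
    null_row = {col: None for col in (right[0] if right else {}) if col != key}

    # Assign each left row to a bucket based on the hash of its key.
    buckets = [[] for _ in range(npartitions)]
    for row in left:
        buckets[hash(row[key]) % npartitions].append(row)

    # Join bucket by bucket and concatenate; this is where the order changes.
    out = []
    for bucket in buckets:
        for row in bucket:
            for match in right_by_key.get(row[key]) or [null_row]:
                out.append({**row, **match})
    return out


def ordered_left_join(left, right, key, npartitions=2):
    """Same join, but tag left rows with their position and sort afterwards."""
    tagged = [{**row, "_order": i} for i, row in enumerate(left)]
    joined = partitioned_left_join(tagged, right, key, npartitions)
    joined.sort(key=lambda r: r["_order"])  # stable sort keeps duplicate matches together
    for row in joined:
        del row["_order"]  # drop the helper column
    return joined


# Same data as the "subset of keys" case above.
left = [{"key": k, "vals_a": float(k + 10)} for k in range(5)]
right = [{"key": k, "vals_b": float(i + 100)} for i, k in enumerate([1, 2, 4])]

print([r["key"] for r in partitioned_left_join(left, right, "key")])  # [0, 2, 4, 1, 3] in CPython
print([r["key"] for r in ordered_left_join(left, right, "key")])      # [0, 1, 2, 3, 4]
```

The same idea should carry over to the dataframes above as a workaround: add a helper column holding the original row position to the left dataframe before the merge, then sort on that column after `compute()` and drop it.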
mike-wendt transferred this issue from rapidsai/dask-cudf on Jul 15, 2019
@kkraus14
Collaborator

Duplicate of #1781

kkraus14 marked this as a duplicate of #1781 on Jul 15, 2019
vyasr added the dask (Dask issue) label and removed the dask-cudf label on Feb 23, 2024