Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Merging on index gives incorrect index in result #1806

Closed
ayushdg opened this issue May 20, 2019 · 8 comments · Fixed by #1840
Closed

[BUG] Merging on index gives incorrect index in result #1806

ayushdg opened this issue May 20, 2019 · 8 comments · Fixed by #1840
Labels
bug Something isn't working Needs Triage Need team to review and classify

Comments

@ayushdg
Copy link
Member

ayushdg commented May 20, 2019

Describe the bug
When merging two dataframes based on index the result does not seem consistent with pandas

Steps/Code to reproduce bug

import cudf

df = cudf.DataFrame()
df['a'] = [1,2,3,4,5]
df['b'] = ['a','b','c','d','e']
df.index=[5,3,4,2,1]

df2 = cudf.DataFrame()
df2['a2'] = [1,2,3,4,5]
df2['res'] = ['a','b','c','d','e']

df.merge(df2,left_index=True,right_index=True,how='left').to_pandas()

Output:

  a b a2 res
5 2 b 4 d
4 3 c 5 e
3 4 d 3 c
2 5 e 2 b
1 1 a -1 None

Expected behavior

import pandas as pd

df = pd.DataFrame()
df['a'] = [1,2,3,4,5]
df['b'] = ['a','b','c','d','e']
df.index=[5,3,4,2,1]

df2 = pd.DataFrame()
df2['a2'] = [1,2,3,4,5]
df2['res'] = ['a','b','c','d','e']

df.merge(df2,left_index=True,right_index=True,how='left')

Output:

  a b a2 res
5 1 a NaN NaN
3 2 b 4.0 d
4 3 c 5.0 e
2 4 d 3.0 c
1 5 e 2.0 b

Environment details (please complete the following information):

  • Environment location: Docker
  • Method of cuDF install: Source
  • Please run and attach the output of the cudf/print_env.sh script to gather relevant environment details: On 0.8 at commit a66db9e

Additional context
Add any other context about the problem here.

@ayushdg ayushdg added Needs Triage Need team to review and classify bug Something isn't working labels May 20, 2019
@ayushdg ayushdg changed the title [BUG] Merging on index gives incorrect result [BUG] Merging on index gives incorrect index in result May 20, 2019
@ayushdg
Copy link
Member Author

ayushdg commented May 20, 2019

Might be somewhat related to the discussion here: #1781 and how cudf handles non deterministic ordering and indices.

@thomcom
Copy link
Contributor

thomcom commented May 23, 2019

It looks like it is just the ordering problem, coupled with a type promotion that cudf didn't do. My vote is not a bug. I'm not sure if we'll ever guarantee same order as Pandas?

@ayushdg
Copy link
Member Author

ayushdg commented May 23, 2019

The ordering my not match pandas but indices corresponding to the new ordering by cudf is incorrect.

If we were to change the order up the indices of the result should match the order:
Expected output for cudf with new ordering

  a b a2 res
3 2 b 4 d
4 3 c 5 e
2 4 d 3 c
1 5 e 2 b
5 1 a -1 None

@thomcom
Copy link
Contributor

thomcom commented May 23, 2019

The indices are the implicit ordering of the data, we can't construct an index without an order. When sort based groupby is implemented we can likely match pandas exactly. Depends on #1342

@ayushdg
Copy link
Member Author

ayushdg commented May 23, 2019

Sorry for the confusion. This is w.r.t the merge method.
I might be completely wrong here but: https://github.com/rapidsai/cudf/blob/branch-0.8/python/cudf/dataframe/dataframe.py#L1479

Assigns the index to be a column as well that gets passed to the cpp layer during the merge. When the ordering or columns gets changed it should also affect the ordering of this column (lhs[on]/rhs[on]) which was originally the index of my df.

So after getting the output if we set the output index to match the lhs[on] column, the results should be consistent.

I'm sorry for being relatively vague, I'll try to debug and play around with my example a bit more to see check if my theory is correct or not.

@thomcom
Copy link
Contributor

thomcom commented May 23, 2019

Hmm I was under the impression that merge, join, groupby, etc all use the same hash based join method, which discards ordering as a result.

I think you're right, though, that this is a slightly different issue. I suspect that our "index" object is being completely dropped from the merge, ignored, and then added back to the result of the merge. If the index was treated as another column during the merge, the index data should persist as you suggested in Expected output for cudf with new ordering

@thomcom
Copy link
Contributor

thomcom commented May 23, 2019

This is something that can probably be fixed in cudf, in fact.

@thomcom
Copy link
Contributor

thomcom commented May 23, 2019

Like this lol


In [1]: import cudf
   ...:
   ...: df = cudf.DataFrame()
   ...: df['a'] = [1,2,3,4,5]
   ...: df['b'] = ['a','b','c','d','e']
   ...: df.index=[5,3,4,2,1]
   ...:
   ...: df2 = cudf.DataFrame()
   ...: df2['a2'] = [1,2,3,4,5]
   ...: df2['res'] = ['a','b','c','d','e']
   ...:
   ...: df.merge(df2,left_index=True,right_index=True,how='left').to_pandas()
Out[1]:
   a  b   a2   res
3  2  b    4     d
4  3  c    5     e
2  4  d    3     c
1  5  e    2     b
5  1  a  NaN  None

Thanks for setting me straight :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Needs Triage Need team to review and classify
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants