[BUG] Merging on index gives incorrect index in result #1806

ayushdg · 2019-05-20T23:01:06Z

Describe the bug
When merging two dataframes based on index the result does not seem consistent with pandas

Steps/Code to reproduce bug

import cudf

df = cudf.DataFrame()
df['a'] = [1,2,3,4,5]
df['b'] = ['a','b','c','d','e']
df.index=[5,3,4,2,1]

df2 = cudf.DataFrame()
df2['a2'] = [1,2,3,4,5]
df2['res'] = ['a','b','c','d','e']

df.merge(df2,left_index=True,right_index=True,how='left').to_pandas()

Output:

	a	b	a2	res
5	2	b	4	d
4	3	c	5	e
3	4	d	3	c
2	5	e	2	b
1	1	a	-1	None

Expected behavior

import pandas as pd

df = pd.DataFrame()
df['a'] = [1,2,3,4,5]
df['b'] = ['a','b','c','d','e']
df.index=[5,3,4,2,1]

df2 = pd.DataFrame()
df2['a2'] = [1,2,3,4,5]
df2['res'] = ['a','b','c','d','e']

df.merge(df2,left_index=True,right_index=True,how='left')

Output:

	a	b	a2	res
5	1	a	NaN	NaN
3	2	b	4.0	d
4	3	c	5.0	e
2	4	d	3.0	c
1	5	e	2.0	b

Environment details (please complete the following information):

Environment location: Docker
Method of cuDF install: Source
Please run and attach the output of the cudf/print_env.sh script to gather relevant environment details: On 0.8 at commit a66db9e

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

ayushdg · 2019-05-20T23:17:13Z

Might be somewhat related to the discussion here: #1781 and how cudf handles non deterministic ordering and indices.

thomcom · 2019-05-23T15:37:32Z

It looks like it is just the ordering problem, coupled with a type promotion that cudf didn't do. My vote is not a bug. I'm not sure if we'll ever guarantee same order as Pandas?

ayushdg · 2019-05-23T17:13:28Z

The ordering my not match pandas but indices corresponding to the new ordering by cudf is incorrect.

If we were to change the order up the indices of the result should match the order:
Expected output for cudf with new ordering

	a	b	a2	res
3	2	b	4	d
4	3	c	5	e
2	4	d	3	c
1	5	e	2	b
5	1	a	-1	None

thomcom · 2019-05-23T17:19:14Z

The indices are the implicit ordering of the data, we can't construct an index without an order. When sort based groupby is implemented we can likely match pandas exactly. Depends on #1342

ayushdg · 2019-05-23T17:31:41Z

Sorry for the confusion. This is w.r.t the merge method.
I might be completely wrong here but: https://github.com/rapidsai/cudf/blob/branch-0.8/python/cudf/dataframe/dataframe.py#L1479

Assigns the index to be a column as well that gets passed to the cpp layer during the merge. When the ordering or columns gets changed it should also affect the ordering of this column (lhs[on]/rhs[on]) which was originally the index of my df.

So after getting the output if we set the output index to match the lhs[on] column, the results should be consistent.

I'm sorry for being relatively vague, I'll try to debug and play around with my example a bit more to see check if my theory is correct or not.

thomcom · 2019-05-23T17:48:33Z

Hmm I was under the impression that merge, join, groupby, etc all use the same hash based join method, which discards ordering as a result.

I think you're right, though, that this is a slightly different issue. I suspect that our "index" object is being completely dropped from the merge, ignored, and then added back to the result of the merge. If the index was treated as another column during the merge, the index data should persist as you suggested in Expected output for cudf with new ordering

thomcom · 2019-05-23T17:51:33Z

This is something that can probably be fixed in cudf, in fact.

thomcom · 2019-05-23T17:59:35Z

Like this lol


In [1]: import cudf
   ...:
   ...: df = cudf.DataFrame()
   ...: df['a'] = [1,2,3,4,5]
   ...: df['b'] = ['a','b','c','d','e']
   ...: df.index=[5,3,4,2,1]
   ...:
   ...: df2 = cudf.DataFrame()
   ...: df2['a2'] = [1,2,3,4,5]
   ...: df2['res'] = ['a','b','c','d','e']
   ...:
   ...: df.merge(df2,left_index=True,right_index=True,how='left').to_pandas()
Out[1]:
   a  b   a2   res
3  2  b    4     d
4  3  c    5     e
2  4  d    3     c
1  5  e    2     b
5  1  a  NaN  None

Thanks for setting me straight :)

ayushdg added Needs Triage Need team to review and classify bug Something isn't working labels May 20, 2019

ayushdg changed the title ~~[BUG] Merging on index gives incorrect result~~ [BUG] Merging on index gives incorrect index in result May 20, 2019

thomcom mentioned this issue May 23, 2019

[REVIEW] Retain index correctly during merge left_on right_on #1840

Merged

kkraus14 closed this as completed in #1840 May 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Merging on index gives incorrect index in result #1806

[BUG] Merging on index gives incorrect index in result #1806

ayushdg commented May 20, 2019 •

edited

Loading

ayushdg commented May 20, 2019

thomcom commented May 23, 2019

ayushdg commented May 23, 2019

thomcom commented May 23, 2019

ayushdg commented May 23, 2019

thomcom commented May 23, 2019

thomcom commented May 23, 2019

thomcom commented May 23, 2019

[BUG] Merging on index gives incorrect index in result #1806

[BUG] Merging on index gives incorrect index in result #1806

Comments

ayushdg commented May 20, 2019 • edited Loading

ayushdg commented May 20, 2019

thomcom commented May 23, 2019

ayushdg commented May 23, 2019

thomcom commented May 23, 2019

ayushdg commented May 23, 2019

thomcom commented May 23, 2019

thomcom commented May 23, 2019

thomcom commented May 23, 2019

ayushdg commented May 20, 2019 •

edited

Loading