Skip to content

Commit

Permalink
Fix some legacy tests
Browse files Browse the repository at this point in the history
Original change here #3188
Why were we casting to "float64" in the old testcase?
Maybe related to this comment? #3188 (comment)
  • Loading branch information
er-eis committed Apr 30, 2024
1 parent 9cebaa2 commit 0736e4e
Show file tree
Hide file tree
Showing 2 changed files with 15 additions and 14 deletions.
15 changes: 9 additions & 6 deletions python/cudf/cudf/core/reshape.py
Original file line number Diff line number Diff line change
Expand Up @@ -438,13 +438,16 @@ def concat(objs, axis=0, join="outer", ignore_index=False, sort=None):
else:
df[col_label] = col

if ignore_index:
# with ignore_index the column names change to numbers
df.columns = pd.RangeIndex(len(result_columns))
elif not only_series:
df.columns = cudf.MultiIndex.from_tuples(df._column_names)
if keys is None:
df.columns = result_columns.unique()
if ignore_index:
df.columns = pd.RangeIndex(len(result_columns.unique()))
else:
pass
if ignore_index:
# with ignore_index the column names change to numbers
df.columns = pd.RangeIndex(len(result_columns))
elif not only_series:
df.columns = cudf.MultiIndex.from_tuples(df._column_names)

if empty_inner:
# if join is inner and it contains an empty df
Expand Down
14 changes: 6 additions & 8 deletions python/cudf/cudf/tests/test_concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -218,7 +218,8 @@ def test_concat_columns(axis):
assert_eq(expect, got, check_index_type=True)


def test_concat_multiindex_dataframe():
@pytest.mark.parametrize("axis", [0, 1])
def test_concat_multiindex_dataframe(axis):
gdf = cudf.DataFrame(
{
"w": np.arange(4),
Expand All @@ -233,14 +234,11 @@ def test_concat_multiindex_dataframe():
pdg2 = pdg.iloc[:, 1:]
gdg1 = cudf.from_pandas(pdg1)
gdg2 = cudf.from_pandas(pdg2)
expected = pd.concat([pdg1, pdg2], axis=axis)
result = cudf.concat([gdg1, gdg2], axis=axis)
assert_eq(
cudf.concat([gdg1, gdg2]).astype("float64"),

This comment has been minimized.

Copy link
@wence-

wence- May 1, 2024

Contributor

tl;dr: this change is correct and a good one.

This cast might have been needed in the past because pd.concat([pdg1, pdg2]) turns the integer columns into float ones and fills the missing values with NaNs:

pd.concat([pdg1, pdg2], axis=0)
       y    z
w x          
1 1  1.0  NaN
0 0  0.0  NaN
3 3  3.0  NaN
2 2  2.0  NaN
1 1  NaN  1.0
0 0  NaN  0.0
3 3  NaN  3.0
2 2  NaN  2.0

Conversely, cudf uses nullable dtypes in (basically all) cases and produces integer columns with nulls:

cudf.concat([gdg1, gdg2], axis=0)
        y     z
w x            
1 1     1  <NA>
0 0     0  <NA>
3 3     3  <NA>
2 2     2  <NA>
1 1  <NA>     1
0 0  <NA>     0
3 3  <NA>     3
2 2  <NA>     2

assert_eq calls .to_pandas() on the cudf dataframe and then calls the pandas testing assert_frame_equal function. If to_pandas() were to produce a nullable Int64 dtype for columns then the dtype check would fail. By casting to float64 first, this is avoided.

As it happens, it's unnecessary, because to_pandas() by default produces non-nullable columns in the pandas dataframe, so converting the cudf dataframe to_pandas will produce float64 columns anyway.

pd.concat([pdg1, pdg2]),
check_index_type=True,
)
assert_eq(
cudf.concat([gdg1, gdg2], axis=1),
pd.concat([pdg1, pdg2], axis=1),
expected,
result,
check_index_type=True,
)

Expand Down

0 comments on commit 0736e4e

Please sign in to comment.