[BUG] groupby.transform('cumsum') fails due to non-floating values #12621

mattf · 2023-01-26T19:32:20Z

Describe the bug
working with import cudf as pd

Steps/Code to reproduce bug

In [1]: import cudf as pd

In [2]: pd.__version__
Out[2]: '22.12.0'

In [3]: df = pd.DataFrame({'salary': [30000,40000,50000,85000,75000], 'gender': list('MFMFM')})

In [4]: df.groupby('gender')['salary'].transform('cumsum')
/home/matt/.local/lib/python3.9/site-packages/cudf/core/join/_join_helpers.py:105: UserWarning: Can't safely cast column from int64 to object, upcasting to None.
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [4], line 1
----> 1 df.groupby('gender')['salary'].transform('cumsum')

File ~/.local/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:1076, in GroupBy.transform(self, function)
   1071 except TypeError as e:
   1072     raise NotImplementedError(
   1073         "Currently, `transform()` supports only aggregations."
   1074     ) from e
-> 1076 return self._broadcast(result)

File ~/.local/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:1029, in GroupBy._broadcast(self, values)
   1013 """
   1014 Broadcast the results of an aggregation to the group
   1015 
   (...)
   1026 ``self.obj``.
   1027 """
   1028 if not values.index.equals(self.grouping.keys):
-> 1029     values = values._align_to_index(
   1030         self.grouping.keys, how="right", allow_non_unique=True
   1031     )
   1032     values.index = self.obj.index
   1033 return values

File ~/.local/lib/python3.9/site-packages/cudf/core/indexed_frame.py:2184, in IndexedFrame._align_to_index(self, index, how, sort, allow_non_unique)
   2181 elif how == "right":
   2182     rhs[sort_col_id] = cudf.core.column.arange(len(rhs))
-> 2184 result = lhs.join(rhs, how=how, sort=sort)
   2185 if how in ("left", "right"):
   2186     result = result.sort_values(sort_col_id)

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:4038, in DataFrame.join(self, other, on, how, lsuffix, rsuffix, sort)
   4035 if on is not None:
   4036     raise NotImplementedError("The on parameter is not yet supported")
-> 4038 df = self.merge(
   4039     other,
   4040     left_index=True,
   4041     right_index=True,
   4042     how=how,
   4043     suffixes=(lsuffix, rsuffix),
   4044     sort=sort,
   4045 )
   4046 df.index.name = (
   4047     None if self.index.name != other.index.name else self.index.name
   4048 )
   4049 return df

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:3987, in DataFrame.merge(self, right, on, left_on, right_on, left_index, right_index, how, sort, lsuffix, rsuffix, indicator, suffixes)
   3984 elif how in {"leftsemi", "leftanti"}:
   3985     merge_cls = MergeSemi
-> 3987 return merge_cls(
   3988     lhs,
   3989     rhs,
   3990     on=on,
   3991     left_on=left_on,
   3992     right_on=right_on,
   3993     left_index=left_index,
   3994     right_index=right_index,
   3995     how=how,
   3996     sort=sort,
   3997     indicator=indicator,
   3998     suffixes=suffixes,
   3999 ).perform_merge()

File ~/.local/lib/python3.9/site-packages/cudf/core/join/join.py:166, in Merge.perform_merge(self)
    164 lcol = left_key.get(self.lhs)
    165 rcol = right_key.get(self.rhs)
--> 166 lcol_casted, rcol_casted = _match_join_keys(lcol, rcol, self.how)
    167 left_join_cols.append(lcol_casted)
    168 right_join_cols.append(rcol_casted)

File ~/.local/lib/python3.9/site-packages/cudf/core/join/_join_helpers.py:110, in _match_join_keys(lcol, rcol, how)
    104     else:
    105         warnings.warn(
    106             f"Can't safely cast column from {rtype} to {ltype}, "
    107             f"upcasting to {common_type}."
    108         )
--> 110 return lcol.astype(common_type), rcol.astype(common_type)

File ~/.local/lib/python3.9/site-packages/cudf/core/column/column.py:892, in ColumnBase.astype(self, dtype, **kwargs)
    890     return self.as_timedelta_column(dtype, **kwargs)
    891 else:
--> 892     return self.as_numerical_column(dtype, **kwargs)

File ~/.local/lib/python3.9/site-packages/cudf/core/column/string.py:5384, in StringColumn.as_numerical_column(self, dtype, **kwargs)
   5382 elif out_dtype.kind == "f":
   5383     if not libstrings.is_float(string_col).all():
-> 5384         raise ValueError(
   5385             "Could not convert strings to float "
   5386             "type due to presence of non-floating values."
   5387         )
   5389 result_col = _str_to_numeric_typecast_functions[out_dtype](string_col)
   5390 return result_col

ValueError: Could not convert strings to float type due to presence of non-floating values.

In [5]: df.to_pandas().groupby('gender')['salary'].transform('cumsum')
Out[5]: 
0     30000
1     40000
2     80000
3    125000
4    155000
Name: salary, dtype: int64

In [6]: df.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   salary  5 non-null      int64
 1   gender  5 non-null      object
dtypes: int64(1), object(1)
memory usage: 69.0+ bytes

Expected behavior
same behavior as pandas

The text was updated successfully, but these errors were encountered:

When performing a groupby-transform with a scan aggregation, the intermediate result obtained from calling groupby-agg is already the correct shape and does not need to be broadcast to align with the grouping keys. To handle this, make sure that if the requested transform is a scan that we don't try and broadcast. While here, tighten up the input checking: transform only applies to a single aggregation, rather than the more general interface offered by agg. - Closes rapidsai#12621 - Closes rapidsai#15448

When performing a groupby-transform with a scan aggregation, the intermediate result obtained from calling groupby-agg is already the correct shape and does not need to be broadcast to align with the grouping keys. To handle this, make sure that if the requested transform is a scan that we don't try and broadcast. While here, tighten up the input checking: transform only applies to a single aggregation, rather than the more general interface offered by agg. - Closes #12621 - Closes #15448 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) URL: #15450

mattf added Needs Triage Need team to review and classify bug Something isn't working labels Jan 26, 2023

GregoryKimball modified the milestone: Pandas API Alignment and Coverage Feb 7, 2023

GregoryKimball closed this as completed Jun 6, 2023

GregoryKimball reopened this Jun 6, 2023

GregoryKimball added 0 - Backlog In queue waiting for assignment Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 6, 2023

wence- self-assigned this Apr 3, 2024

wence- mentioned this issue Apr 3, 2024

[BUG] groupby-transform silently broadcasts scan aggregations, producing wrong result #15448

Closed

wence- mentioned this issue Apr 3, 2024

Handle case of scan aggregation in groupby-transform #15450

Merged

3 tasks

rapids-bot bot closed this as completed in #15450 Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] groupby.transform('cumsum') fails due to non-floating values #12621

[BUG] groupby.transform('cumsum') fails due to non-floating values #12621

mattf commented Jan 26, 2023

[BUG] groupby.transform('cumsum') fails due to non-floating values #12621

[BUG] groupby.transform('cumsum') fails due to non-floating values #12621

Comments

mattf commented Jan 26, 2023