Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] groupby.transform('cumsum') fails due to non-floating values #12621

Closed
mattf opened this issue Jan 26, 2023 · 0 comments · Fixed by #15450
Closed

[BUG] groupby.transform('cumsum') fails due to non-floating values #12621

mattf opened this issue Jan 26, 2023 · 0 comments · Fixed by #15450
Assignees
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working Python Affects Python cuDF API.

Comments

@mattf
Copy link

mattf commented Jan 26, 2023

Describe the bug
working with import cudf as pd

Steps/Code to reproduce bug

In [1]: import cudf as pd

In [2]: pd.__version__
Out[2]: '22.12.0'

In [3]: df = pd.DataFrame({'salary': [30000,40000,50000,85000,75000], 'gender': list('MFMFM')})

In [4]: df.groupby('gender')['salary'].transform('cumsum')
/home/matt/.local/lib/python3.9/site-packages/cudf/core/join/_join_helpers.py:105: UserWarning: Can't safely cast column from int64 to object, upcasting to None.
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [4], line 1
----> 1 df.groupby('gender')['salary'].transform('cumsum')

File ~/.local/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:1076, in GroupBy.transform(self, function)
   1071 except TypeError as e:
   1072     raise NotImplementedError(
   1073         "Currently, `transform()` supports only aggregations."
   1074     ) from e
-> 1076 return self._broadcast(result)

File ~/.local/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:1029, in GroupBy._broadcast(self, values)
   1013 """
   1014 Broadcast the results of an aggregation to the group
   1015 
   (...)
   1026 ``self.obj``.
   1027 """
   1028 if not values.index.equals(self.grouping.keys):
-> 1029     values = values._align_to_index(
   1030         self.grouping.keys, how="right", allow_non_unique=True
   1031     )
   1032     values.index = self.obj.index
   1033 return values

File ~/.local/lib/python3.9/site-packages/cudf/core/indexed_frame.py:2184, in IndexedFrame._align_to_index(self, index, how, sort, allow_non_unique)
   2181 elif how == "right":
   2182     rhs[sort_col_id] = cudf.core.column.arange(len(rhs))
-> 2184 result = lhs.join(rhs, how=how, sort=sort)
   2185 if how in ("left", "right"):
   2186     result = result.sort_values(sort_col_id)

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:4038, in DataFrame.join(self, other, on, how, lsuffix, rsuffix, sort)
   4035 if on is not None:
   4036     raise NotImplementedError("The on parameter is not yet supported")
-> 4038 df = self.merge(
   4039     other,
   4040     left_index=True,
   4041     right_index=True,
   4042     how=how,
   4043     suffixes=(lsuffix, rsuffix),
   4044     sort=sort,
   4045 )
   4046 df.index.name = (
   4047     None if self.index.name != other.index.name else self.index.name
   4048 )
   4049 return df

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:3987, in DataFrame.merge(self, right, on, left_on, right_on, left_index, right_index, how, sort, lsuffix, rsuffix, indicator, suffixes)
   3984 elif how in {"leftsemi", "leftanti"}:
   3985     merge_cls = MergeSemi
-> 3987 return merge_cls(
   3988     lhs,
   3989     rhs,
   3990     on=on,
   3991     left_on=left_on,
   3992     right_on=right_on,
   3993     left_index=left_index,
   3994     right_index=right_index,
   3995     how=how,
   3996     sort=sort,
   3997     indicator=indicator,
   3998     suffixes=suffixes,
   3999 ).perform_merge()

File ~/.local/lib/python3.9/site-packages/cudf/core/join/join.py:166, in Merge.perform_merge(self)
    164 lcol = left_key.get(self.lhs)
    165 rcol = right_key.get(self.rhs)
--> 166 lcol_casted, rcol_casted = _match_join_keys(lcol, rcol, self.how)
    167 left_join_cols.append(lcol_casted)
    168 right_join_cols.append(rcol_casted)

File ~/.local/lib/python3.9/site-packages/cudf/core/join/_join_helpers.py:110, in _match_join_keys(lcol, rcol, how)
    104     else:
    105         warnings.warn(
    106             f"Can't safely cast column from {rtype} to {ltype}, "
    107             f"upcasting to {common_type}."
    108         )
--> 110 return lcol.astype(common_type), rcol.astype(common_type)

File ~/.local/lib/python3.9/site-packages/cudf/core/column/column.py:892, in ColumnBase.astype(self, dtype, **kwargs)
    890     return self.as_timedelta_column(dtype, **kwargs)
    891 else:
--> 892     return self.as_numerical_column(dtype, **kwargs)

File ~/.local/lib/python3.9/site-packages/cudf/core/column/string.py:5384, in StringColumn.as_numerical_column(self, dtype, **kwargs)
   5382 elif out_dtype.kind == "f":
   5383     if not libstrings.is_float(string_col).all():
-> 5384         raise ValueError(
   5385             "Could not convert strings to float "
   5386             "type due to presence of non-floating values."
   5387         )
   5389 result_col = _str_to_numeric_typecast_functions[out_dtype](string_col)
   5390 return result_col

ValueError: Could not convert strings to float type due to presence of non-floating values.

In [5]: df.to_pandas().groupby('gender')['salary'].transform('cumsum')
Out[5]: 
0     30000
1     40000
2     80000
3    125000
4    155000
Name: salary, dtype: int64

In [6]: df.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   salary  5 non-null      int64
 1   gender  5 non-null      object
dtypes: int64(1), object(1)
memory usage: 69.0+ bytes

Expected behavior
same behavior as pandas

@mattf mattf added Needs Triage Need team to review and classify bug Something isn't working labels Jan 26, 2023
@GregoryKimball GregoryKimball reopened this Jun 6, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 6, 2023
@wence- wence- self-assigned this Apr 3, 2024
wence- added a commit to wence-/cudf that referenced this issue Apr 3, 2024
When performing a groupby-transform with a scan aggregation, the
intermediate result obtained from calling groupby-agg is already the
correct shape and does not need to be broadcast to align with the
grouping keys.

To handle this, make sure that if the requested transform is a scan
that we don't try and broadcast.

While here, tighten up the input checking: transform only applies to a
single aggregation, rather than the more general interface offered by
agg.

- Closes rapidsai#12621
- Closes rapidsai#15448
rapids-bot bot pushed a commit that referenced this issue Apr 16, 2024
When performing a groupby-transform with a scan aggregation, the intermediate result obtained from calling groupby-agg is already the correct shape and does not need to be broadcast to align with the grouping keys.

To handle this, make sure that if the requested transform is a scan that we don't try and broadcast.

While here, tighten up the input checking: transform only applies to a single aggregation, rather than the more general interface offered by agg.

- Closes #12621
- Closes #15448

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #15450
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants