Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] mismatch in up-casting Series.__setitem__ between pandas and cudf #11901

Closed
wence- opened this issue Oct 11, 2022 · 1 comment · Fixed by #11904
Closed

[BUG] mismatch in up-casting Series.__setitem__ between pandas and cudf #11901

wence- opened this issue Oct 11, 2022 · 1 comment · Fixed by #11904
Assignees
Labels
2 - In Progress Currently a work in progress bug Something isn't working Python Affects Python cuDF API.

Comments

@wence-
Copy link
Contributor

wence- commented Oct 11, 2022

Describe the bug

import pandas as pd
import cudf
import numpy as np

series = pd.Series([1, 2, 3], dtype=np.int64)
ceries = cudf.from_pandas(series)

series[1] = 1.5
ceries[1] = 1.5

assert np.allclose(series.values, ceries.values)

series.dtype => dtype('float64)
ceries.dtype => dtype('int64')

Expected behavior

cudf should mimic pandas behaviour here.

[Aside, I hate hate hate this upcasting from ints to floats since it is lossy for round-tripping, but I suppose we must live with it.]

Things work if I specify a list of indices:

series[[1, 2]] = 1.5
ceries[[1, 2]] = 1.5
assert np.allclose(series.values, ceries.values) => Fine

The relevant code on a series does this:

https://github.com/rapidsai/cudf/blob/branch-22.12/python/cudf/cudf/core/series.py#L216-L234

Note that if the key is a single integer, we don't upcast the column to the union dtype.

This logic goes back to #2442 which introduced __setitem__ (@shwina can you remember why?). I note that the __setitem__ implementation on Column has the opposite casting rules (if col[idx] = value does value.astype(col.dtype)), so there must be an explicit choice in Series.__setitem__ to mimic pandas.

Plausibly this is a fix:

diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py
index f11052096e..b1e2773a55 100644
--- a/python/cudf/cudf/core/series.py
+++ b/python/cudf/cudf/core/series.py
@@ -222,14 +222,13 @@ class _SeriesIlocIndexer(_FrameIndexer):
             and _is_non_decimal_numeric_dtype(value.dtype)
         ):
             # normalize types if necessary:
-            if not is_integer(key):
-                to_dtype = np.result_type(
-                    value.dtype, self._frame._column.dtype
-                )
-                value = value.astype(to_dtype)
-                self._frame._column._mimic_inplace(
-                    self._frame._column.astype(to_dtype), inplace=True
-                )
+            to_dtype = np.result_type(
+                value.dtype, self._frame._column.dtype
+            )
+            value = value.astype(to_dtype)
+            self._frame._column._mimic_inplace(
+                self._frame._column.astype(to_dtype), inplace=True
+            )
 
         self._frame._column[key] = value
 
@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify pandas labels Oct 11, 2022
@shwina
Copy link
Contributor

shwina commented Oct 11, 2022

(@shwina can you remember why?

I cannot. I agree that automatically upcasting is the better behaviour.

@wence- wence- self-assigned this Oct 11, 2022
wence- added a commit to wence-/cudf that referenced this issue Oct 11, 2022
To mimic pandas, we must upcast a column to the numpy result_type of
the column itself and the input value dtype. This previously occurred
in all relevant cases except when the index provided to __setitem__
was a single integer (originally introduced in rapidsai#2442). Closes rapidsai#11901.
@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Oct 21, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 4, 2022
To mimic pandas, we must upcast a column to the numpy result_type of the column itself and the input value dtype. This previously occurred in all relevant cases except when the index provided to __setitem__ was a single integer (originally introduced in #2442). Closes #11901.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #11904
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants