[ENH/QST]: Behaviour of type promotion in __setitem__
#12039
Comments
What a fantastic analysis.
Just a comment about the One of the factors of
And in this example from your table had
I think this is "lossy" because of an
This is easily one of the most detailed issues I have ever seen!
Depends what you mean by lossy. float64 only has enough precision to exactly represent int53 (because the mantissa has 53 bits). Hence, mapping between int64 and float64 is neither surjective (you can't reach nan and inf) nor injective (multiple int64s can map to the same float64), so you can't round-trip successfully.

```python
In [36]: np.int64(2**63 - 10) == np.int64(2**63 - 100)
Out[36]: False

In [37]: np.int64(2**63 - 10).astype(np.float64) == np.int64(2**63 - 100).astype(np.float64)
Out[37]: True
```
These are all true because the right operand of the equality is promoted to `float64`.
I characterise this as "lossy" because you didn't round-trip successfully. As noted, it isn't really a pandas issue per se, but rather numpy (and thence promotion via C rules). I note that numpy doesn't quite implement C/C++ promotion rules for integer and floating point types, since it widens the floating point type in addition between ints and floats, whereas C++ converts the integer operand to the (unwidened) floating point type. The C++ promotions are always in-range. The numpy promotions are bijective (if we ignore nan and inf) if the integer type is narrower than the floating point type (so the int32 promotion can be undone losslessly). These kinds of automatic promotions are basically a fight I lost before I was born in old programming languages, though some more modern ones force you to always explicitly cast so that you're aware that you might be doing A Bad Thing.
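The code snippets illustrating the C++ vs numpy rules didn't survive this copy; a minimal sketch of the difference, using `np.result_type` (the round-trip example uses `2**62 + 1`, chosen here because it needs more than 53 bits):

```python
import numpy as np

# In C++, int64_t + float converts the integer operand to float; the
# floating point type is never widened. NumPy instead widens the float
# so that it can hold (more of) the integer range:
print(np.result_type(np.int64, np.float32))   # float64, not float32
print(np.result_type(np.int32, np.float32))   # float64 as well
print(np.result_type(np.int16, np.float32))   # float32: int16 fits in a 24-bit mantissa

# Even the widened promotion is not lossless for int64 -> float64:
x = np.int64(2**62 + 1)                       # needs 63 significant bits
print(np.int64(np.float64(x)) == x)           # False: the round trip failed
```

The third line shows the general rule: the float type is only widened when it cannot exactly represent every value of the integer type.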
Updated the code to generate tables for a bunch of different scenarios. You can now run it and it produces a separate file for each scenario, which you can diff against each other to see the points of difference. Mostly things work (with #11904), except that #12072 means equality is busted. And slice-based setitem is broken for length-one ranges if setting with an array rather than a scalar (that's #12073).
Sorry for the long post, I hope it doesn't derail or is off topic. I will mainly try to explain the state of NumPy and what NEP 50 does.

First, there are three things that come together to understand the NumPy behavior:
There is discussion about lossy conversion of Python scalars, which touches something that I am currently struggling with to push NEP 50. So bringing it up, because if you clearly want the notion of "lossy", it might inform how I do this for NEP 50 (it is not central to NEP 50, and doesn't have to be; but I need to settle on something). For NumPy dtypes we have that defined through the "casting safety" (which can be "no", "equiv", "safe", "same_kind", or "unsafe"). For values we don't really define it, and I don't think the categories quite work.
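For reference, numpy's casting-safety categories can be probed with `np.can_cast`; a small sketch of how they behave for the dtypes discussed in this thread:

```python
import numpy as np

# NumPy's casting safety levels: "no", "equiv", "safe", "same_kind", "unsafe".
print(np.can_cast(np.int32, np.int64, casting="safe"))           # True: widening int
print(np.can_cast(np.float64, np.float32, casting="same_kind"))  # True: narrowing within a kind
print(np.can_cast(np.float64, np.int64, casting="same_kind"))    # False: kind change
print(np.can_cast(np.float64, np.int64, casting="unsafe"))       # True: anything goes

# Note that numpy counts int64 -> float64 as "safe", even though it is
# value-lossy for integers above 2**53 (the point made above):
print(np.can_cast(np.int64, np.float64, casting="safe"))         # True
```

The last line is exactly why "casting safety" doesn't capture the value-level notion of "lossy" being discussed here.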
NEP 50 itself does not touch the annoying fact that
I think generally from the pandas point of view (with respect to

Here are some examples (related to your original table) where we differ from numpy due to the above:

1. "Don't truncate so upcast dtype" example
Unless we want to implement bignums in libcudf, I think this is not something we can really contemplate.
This is a reasonable philosophy, though it does rather depend on what you mean by "losing numeric precision".
This seems like a nice thing to do, except that it changes behaviour of the series:

I would argue that this choice did lose numeric precision, and particularly in existing values that weren't touched by the indexing.
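The code for the example above was lost in this copy; a minimal reproduction of the point, as of pandas 1.5 (the version used for the tables; newer pandas warns about, and may eventually forbid, this upcast):

```python
import numpy as np
import pandas as pd

# One single-element assignment upcasts the whole int64 column to float64:
s = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)
s.iloc[0] = 10.5
print(s.dtype)                          # float64

# The *untouched* second element has silently changed value, because
# 2**63 - 100 is not representable in float64:
print(int(s.iloc[1]) == 2**63 - 100)    # False
```

Note the `int(...)` in the comparison: comparing the float directly against the Python int would itself promote lossily and report (misleading) equality.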
So pandas (and cudf) use common DType to promote a column when necessary. That leads to two things:
The first issue is independent of NEP 50 and I am not sure there is much to do about it. You could consider disallowing it. It probably means making a clearer distinction between "common DType" (for storing into a single column) and "promotion" (for operations). For the second thing, NEP 50 might nudge towards making it an error in
That's fair. Not 100% sure of the origin of promoting to
This is also fair. Maybe this is a case of pandas being too accommodating, but we can always nudge users to
If you're going to do integral to floating type promotion and want to maintain accuracy in the largest number of cases (and don't care too much about memory footprint), then promoting to float64 is a reasonable choice.
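To make that concrete: float64's 53-bit mantissa exactly represents every int32 (and every integer up to 2**53), while float32's 24-bit mantissa does not. A small illustration:

```python
import numpy as np

i = np.int32(2**31 - 1)

# int32 -> float64 is exact, so the comparison holds:
print(np.float64(i) == i)    # True

# int32 -> float32 rounds 2**31 - 1 up to 2.0**31, so it does not:
print(np.float32(i) == i)    # False
```

So float64 is the narrowest standard float that covers all of int8/int16/int32 losslessly; only int64 (and uint64) escape it.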
Summary
CUDF is not consistent with Pandas (under a bunch of circumstances) in its behaviour when upcasting during `__setitem__`. In some cases, we might want to mimic pandas behaviour (though they are very keen to use value-based type promotion). In others, where we have more structured dtypes than pandas, we need to decide what to do (current behaviour is internally inconsistent and buggy in a bunch of cases).

I summarise what I think the current state is (by way of experiment), and then discuss some options. Opinions welcome!
cc: @vyasr, @mroeschke, @shwina
Pandas behaviour
Pandas version 1.5.1, MacOS (Apple Silicon)
Edit: updated code for generating more tables.
I should note that these tables are for single index `__setitem__` (`s.iloc[i] = value`). I should check if the same behaviour also occurs for:

- `__setitem__` with single value: `s.iloc[:1] = [value]`
- `__setitem__` with list of values: `s.iloc[:2] = [value for _ in range(2)]`
- `__setitem__` with singleton value: `s.iloc[[True, False]] = [value]`
- `__setitem__` with multiple values: `s.iloc[[True, False, True]] = [value, value]`
- `__setitem__` with single value: `s.iloc[[1]] = value`
- `__setitem__` with multiple values: `s.iloc[[1, 2]] = [value, value]`
Code to generate tables
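The generating script itself didn't survive this copy. A hypothetical, much-simplified sketch of what such a script might look like (the `upcast_results` helper name is illustrative, and the real script used per-dtype initial values as described in the sections below):

```python
import numpy as np
import pandas as pd

def upcast_results(dtypes, values):
    """For each (column dtype, value) pair, perform a single-index
    __setitem__ and record the resulting column dtype (or exception)."""
    rows = []
    for dtype in dtypes:
        for value in values:
            s = pd.Series([1, 2, 3], dtype=dtype)
            try:
                s.iloc[0] = value
                rows.append((np.dtype(dtype), value, s.dtype))
            except Exception as exc:        # record failures too
                rows.append((np.dtype(dtype), value, type(exc).__name__))
    return rows

for dtype, value, result in upcast_results(
        [np.int32, np.int64, np.float32, np.float64],
        [10, np.int64(10), 10.5, np.float64(10.5), np.float32(10.0)]):
    print(f"{dtype!s:10} {value!r:25} -> {result}")
```

Running the same driver against both pandas and cuDF (and diffing the outputs) is the methodology described above.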
Numeric columns
Integer column dtypes
dtype width < max integer width
Initial values `[2**31 - 10, 2**31 - 100, 3]`. `np.int32` is representative of any integer type that is smaller than the max width.
| column dtype | value | resulting dtype |
| --- | --- | --- |
| `np.dtype[int32]` | `10` | `np.dtype[int32]` |
| `np.dtype[int32]` | `np.int64(10)` | `np.dtype[int32]` |
| `np.dtype[int32]` | `1099511627776` | `np.dtype[longlong]` |
| `np.dtype[int32]` | `np.int64(1099511627776)` | `np.dtype[longlong]` |
| `np.dtype[int32]` | `1208925819614629174706176` | `np.dtype[object_]` |
| `np.dtype[int32]` | `10.5` | `np.dtype[float64]` |
| `np.dtype[int32]` | `np.float64(10.0)` | `np.dtype[int32]` |
| `np.dtype[int32]` | `np.float64(10.5)` | `np.dtype[float64]` |
| `np.dtype[int32]` | `np.float32(10.0)` | `np.dtype[int32]` |
| `np.dtype[int32]` | `np.float32(10.5)` | `np.dtype[float64]` |
dtype width == max integer width
Initial values `[2**63 - 10, 2**63 - 100, 3]`. These provoke edge cases in upcasting because:
| column dtype | value | resulting dtype |
| --- | --- | --- |
| `np.dtype[int64]` | `10` | `np.dtype[int64]` |
| `np.dtype[int64]` | `np.int64(10)` | `np.dtype[int64]` |
| `np.dtype[int64]` | `1099511627776` | `np.dtype[int64]` |
| `np.dtype[int64]` | `np.int64(1099511627776)` | `np.dtype[int64]` |
| `np.dtype[int64]` | `1208925819614629174706176` | `np.dtype[object_]` |
| `np.dtype[int64]` | `10.5` | `np.dtype[float64]` |
| `np.dtype[int64]` | `np.float64(10.0)` | `np.dtype[int64]` |
| `np.dtype[int64]` | `np.float64(10.5)` | `np.dtype[float64]` |
| `np.dtype[int64]` | `np.float32(10.0)` | `np.dtype[int64]` |
| `np.dtype[int64]` | `np.float32(10.5)` | `np.dtype[float64]` |
Float column dtypes
dtype width < max float width
Initial values `[np.finfo(np.float32).max, np.float32(np.inf), 3]`.
| column dtype | value | resulting dtype |
| --- | --- | --- |
| `np.dtype[float32]` | `10` | `np.dtype[float32]` |
| `np.dtype[float32]` | `np.int64(10)` | `np.dtype[float32]` |
| `np.dtype[float32]` | `1099511627776` | `np.dtype[float32]` |
| `np.dtype[float32]` | `np.int64(1099511627776)` | `np.dtype[float32]` |
| `np.dtype[float32]` | `np.int32(2147483548)` | `np.dtype[float64]` |
| `np.dtype[float32]` | `np.int64(9223372036854775708)` | `np.dtype[float32]` |
| `np.dtype[float32]` | `1208925819614629174706076` | `np.dtype[object_]` |
| `np.dtype[float32]` | `10.5` | `np.dtype[float32]` |
| `np.dtype[float32]` | `np.float64(10.0)` | `np.dtype[float32]` |
| `np.dtype[float32]` | `np.float64(10.5)` | `np.dtype[float32]` |
| `np.dtype[float32]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` |
| `np.dtype[float32]` | `np.float32(10.0)` | `np.dtype[float32]` |
| `np.dtype[float32]` | `np.float32(10.5)` | `np.dtype[float32]` |
dtype width == max float width
Initial values `[np.finfo(np.float64).max, np.float64(np.inf), 3]`.
| column dtype | value | resulting dtype |
| --- | --- | --- |
| `np.dtype[float64]` | `10` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int64(10)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `1099511627776` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int64(1099511627776)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int32(2147483548)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int64(9223372036854775708)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `1208925819614629174706076` | `np.dtype[object_]` |
| `np.dtype[float64]` | `10.5` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float64(10.0)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float64(10.5)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float32(10.0)` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float32(10.5)` | `np.dtype[float64]` |
Everything else
Basically, you can put anything in a column and you get an object out, but numpy types are converted to `object` first.

CUDF behaviour
CUDF trunk, and state in #11904.
Numeric columns
Integer column dtypes
dtype width < max integer width
Initial values `[2**31 - 10, 2**31 - 100, 3]`. `np.int32` is representative of any integer type that is smaller than the max width.
| column dtype | value | trunk | with #11904 |
| --- | --- | --- | --- |
| `np.dtype[int32]` | `10` | `np.dtype[int32]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int32]` | `np.int64(10)` | `np.dtype[int32]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int32]` | `1099511627776` | `np.dtype[int32]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int32]` | `np.int64(1099511627776)` | `np.dtype[int32]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int32]` | `1208925819614629174706176` | | |
| `np.dtype[int32]` | `10.5` | `np.dtype[int32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int32]` | `np.float64(10.0)` | `np.dtype[int32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int32]` | `np.float64(10.5)` | `np.dtype[int32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int32]` | `np.float32(10.0)` | `np.dtype[int32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int32]` | `np.float32(10.5)` | `np.dtype[int32]` [8] | `np.dtype[float64]` [9] |

dtype width == max integer width
Initial values `[2**63 - 10, 2**63 - 100, 3]`.
| column dtype | value | trunk | with #11904 |
| --- | --- | --- | --- |
| `np.dtype[int64]` | `10` | `np.dtype[int64]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int64]` | `np.int64(10)` | `np.dtype[int64]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int64]` | `1099511627776` | `np.dtype[int64]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int64]` | `np.int64(1099511627776)` | `np.dtype[int64]` [8] | `np.dtype[int64]` [9] |
| `np.dtype[int64]` | `1208925819614629174706176` | | |
| `np.dtype[int64]` | `10.5` | `np.dtype[int64]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int64]` | `np.float64(10.0)` | `np.dtype[int64]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int64]` | `np.float64(10.5)` | `np.dtype[int64]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int64]` | `np.float32(10.0)` | `np.dtype[int64]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[int64]` | `np.float32(10.5)` | `np.dtype[int64]` [8] | `np.dtype[float64]` [9] |

Float column dtypes
dtype width < max float width
Initial values `[np.finfo(np.float32).max, np.float32(np.inf), 3]`.
| column dtype | value | trunk | with #11904 |
| --- | --- | --- | --- |
| `np.dtype[float32]` | `10` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.int64(10)` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `1099511627776` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.int64(1099511627776)` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.int32(2147483548)` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.int64(9223372036854775708)` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `1208925819614629174706076` | | |
| `np.dtype[float32]` | `10.5` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.float64(10.0)` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.float64(10.5)` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float32]` [8] | `np.dtype[float64]` [9] |
| `np.dtype[float32]` | `np.float32(10.0)` | `np.dtype[float32]` [8] | `np.dtype[float32]` [9] |
| `np.dtype[float32]` | `np.float32(10.5)` | `np.dtype[float32]` [8] | `np.dtype[float32]` [9] |

dtype width == max float width
Initial values `[np.finfo(np.float64).max, np.float64(np.inf), 3]`.
| column dtype | value | trunk | with #11904 |
| --- | --- | --- | --- |
| `np.dtype[float64]` | `10` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int64(10)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `1099511627776` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int64(1099511627776)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int32(2147483548)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.int64(9223372036854775708)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `1208925819614629174706076` | | |
| `np.dtype[float64]` | `10.5` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float64(10.0)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float64(10.5)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float32(10.0)` | `np.dtype[float64]` | `np.dtype[float64]` |
| `np.dtype[float64]` | `np.float32(10.5)` | `np.dtype[float64]` | `np.dtype[float64]` |
Everything else
This is where it starts to get really messy. This section is a work
in progress. We should decide what we want the semantics to be,
because in most cases pandas doesn't have the same dtypes that CUDF does.
Inserting strings into numerical columns
This "works", for some value of "works" on #11904 if the string value
is parseable as the target dtype.
So
And similarly for float strings and float dtypes.
This is probably a nice feature.
Inserting things into string columns
Works if the "thing" is convertible to a string (so numbers work), but scalars with list or struct dtypes don't work. I would argue that explicit casting from the user here is probably better.
List columns
The new value must have an identical dtype to that of the target column.
Struct columns
The new value must have leaf dtypes that are considered compatible in some sense, but then the leaves are downcast to the leaf dtypes of the target column. So this is lossy and likely a bug.
What I think we want (for composite columns)
For composite columns, if the dtype shapes match, I think the casting rule should be to traverse to the leaf dtypes and promote using the rules for non-composite columns. If shapes don't match, `__setitem__` should not be allowed.
This, to me, exhibits principle of least surprise.
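A sketch of that leaf-wise rule, using numpy structured dtypes as a stand-in for cuDF struct dtypes (the `promote_leaves` name is hypothetical, and `np.promote_types` stands in for whatever scalar promotion rule is chosen for non-composite columns):

```python
import numpy as np

def promote_leaves(col_dtype, value_dtype):
    """Hypothetical sketch: walk two struct dtypes in lockstep, promoting
    matching leaves; refuse if the shapes (field names) don't match."""
    if col_dtype.names is not None or value_dtype.names is not None:
        if col_dtype.names != value_dtype.names:
            raise TypeError("struct shape mismatch: __setitem__ not allowed")
        return np.dtype([
            (name, promote_leaves(col_dtype[name], value_dtype[name]))
            for name in col_dtype.names
        ])
    # Leaf: fall back to the scalar promotion rules.
    return np.promote_types(col_dtype, value_dtype)

a = np.dtype([("x", np.int32), ("y", np.float32)])
b = np.dtype([("x", np.int64), ("y", np.float32)])
print(promote_leaves(a, b))   # int64 leaf for 'x', float32 leaf for 'y'
```

Mismatched field names raise instead of silently downcasting, which is the "principle of least surprise" behaviour argued for above.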
Footnotes
1. value is exact in the initial dtype
2. next largest numpy type that contains the value
3. not representable in a numpy type, so coercion to object column
4. default float type is float64
5. `np.int32` is losslessly convertible to `np.float64`
6. `np.int64` is not losslessly convertible to `np.float64`
7. value is not losslessly representable, but also, expecting `np.float64`!
8. Bug fixed by #11904 (Fix type casting in `Series.__setitem__`)
9. CUDF doesn't inspect values, so type-based promotion (difference from pandas)
10. As for 6, but promotion from `np.int32` to `np.float32` is also not lossless.