[ENH/QST]: Behaviour of type promotion in __setitem__ #12039

Open
6 tasks done
Tracked by #12793
wence- opened this issue Nov 1, 2022 · 13 comments
Labels: 2 - In Progress, feature request, improvement, proposal, Python, question

Comments

wence- commented Nov 1, 2022

Summary

CUDF is not consistent with Pandas (under a bunch of circumstances) in
its behaviour when upcasting during __setitem__. In some cases, we
might want to mimic pandas behaviour (though they are very keen to use
value-based type promotion). In others, where we have more structured
dtypes than pandas, we need to decide what to do (current behaviour is
internally inconsistent and buggy in a bunch of cases).

I summarise what I think the current state is (by way of experiment),
and then discuss some options. Opinions welcome!

cc: @vyasr, @mroeschke, @shwina

Pandas behaviour

Pandas version 1.5.1, macOS (Apple Silicon)

Edit: updated code for generating more tables.

I should note that these tables are for single index __setitem__ (s.iloc[i] = value). I should check if the same behaviour also occurs for:

  • slice-based __setitem__ with single value s.iloc[:1] = [value]
  • slice-based __setitem__ with list of values s.iloc[:2] = [value for _ in range(2)]
  • mask-based __setitem__ with singleton value s.iloc[[True, False]] = [value]
  • mask-based __setitem__ with multiple values s.iloc[[True, False, True]] = [value, value]
  • index-based __setitem__ with single value s.iloc[[1]] = value
  • index-based __setitem__ with multiple values s.iloc[[1, 2]] = [value, value]
Code to generate tables
from __future__ import annotations

import os
from enum import Enum, IntEnum, auto
from itertools import filterfalse, repeat
from operator import not_
from pathlib import Path

import numpy as np
import pandas as pd
import typer

try:
    import cudf
    import cupy

    class Backend(str, Enum):
        PANDAS = "pandas"
        CUDF = "cudf"

except ImportError:

    class Backend(str, Enum):
        PANDAS = "pandas"


def numeric_series(values, dtype, *, pandas):
    if pandas:
        return pd.Series(values, dtype=dtype)
    else:
        return cudf.Series(values, dtype=dtype)


def format_val(v):
    try:
        dt = v.dtype
        return f"np.{dt.type.__name__}({v})"
    except AttributeError:
        return f"{v}"


class IndexType(IntEnum):
    SINGLE_INT = auto()
    SINGLETON_SLICE = auto()
    CONTIG_SLICE = auto()
    STRIDED_SLICE = auto()
    SINGLETON_MASK = auto()
    GENERAL_MASK = auto()
    SINGLETON_SCATTER = auto()
    GENERAL_SCATTER = auto()


def indexing(index_type: IndexType, n: int) -> tuple[int | slice | list, slice | list]:
    assert n >= 3
    if index_type == IndexType.SINGLE_INT:
        return n - 1, slice(0, n - 1, None)
    elif index_type == IndexType.SINGLETON_SLICE:
        return slice(1, 2, 1), [0, *range(2, n)]
    elif index_type == IndexType.CONTIG_SLICE:
        return slice(1, n - 2, 1), [0, *range(n - 2, n)]
    elif index_type == IndexType.STRIDED_SLICE:
        return slice(0, n, 2), slice(1, n, 2)
    elif index_type == IndexType.SINGLETON_MASK:
        yes = [False, True, *repeat(False, n - 2)]
        no = list(map(not_, yes))
        return yes, no
    elif index_type == IndexType.GENERAL_MASK:
        yes = [True, False, True, *repeat(False, n - 3)]
        no = list(map(not_, yes))
        return yes, no
    elif index_type == IndexType.SINGLETON_SCATTER:
        yes = [1]
        # Oh for Haskell-esque sections
        no = list(filterfalse(yes.__contains__, range(n)))
        return yes, no
    elif index_type == IndexType.GENERAL_SCATTER:
        yes = [0, 2]
        no = list(filterfalse(yes.__contains__, range(n)))
        return yes, no
    else:
        raise ValueError("Unhandled case")


def generate_table(f, initial_values, values_to_try, dtype, *, index_type, pandas):
    initial_values = np.asarray(initial_values, dtype=object)
    f.write("| Initial dtype | New value | Final dtype | Lossy? |\n")
    f.write("|---------------|-----------|-------------|--------|\n")

    yes, no = indexing(index_type, len(initial_values))
    for value in values_to_try:
        s = numeric_series(initial_values, dtype=dtype, pandas=pandas)
        otype = f"np.{type(s.dtype).__name__}"
        try:
            if index_type == IndexType.SINGLETON_SLICE:
                # Wrap the value in a length-1 array for the slice assignment;
                # use numpy for the pandas backend and cupy for the cudf one.
                value = (np.asarray if pandas else cupy.asarray)([value])
            s.iloc[yes] = value
        except BaseException as e:
            f.write(f"| `{otype}` | `{format_val(value)}` | N/A | {e} |\n")
            continue
        ntype = f"np.{type(s.dtype).__name__}"
        expect = (np.asarray if pandas else cupy.asarray)(
            initial_values[no], dtype=dtype
        )
        original_lost_info = (s.iloc[no].astype(dtype) != expect).any()
        try:
            new_vals = s.iloc[yes].astype(value.dtype)
        except AttributeError:
            if pandas:
                new_vals = np.asarray(s.iloc[yes])
            else:
                new_vals = cupy.asarray(s.iloc[yes])
        new_lost_info = (new_vals != value).any()
        lossy = "Yes" if original_lost_info or new_lost_info else "No"
        f.write(f"| `{otype}` | `{format_val(value)}` | `{ntype}` | {lossy} |\n")


def generate_tables(output_directory: Path, backend: Backend, index_type: IndexType):
    integer_column_values_to_try = [
        10,
        np.int64(10),
        2**40,
        np.int64(2**40),
        2**80,
        10.5,
        np.float64(10),
        np.float64(10.5),
        np.float32(10),
        np.float32(10.5),
    ]
    float_column_values_to_try = [
        10,
        np.int64(10),
        2**40,
        np.int64(2**40),
        np.int32(2**31 - 100),
        np.int64(2**63 - 100),
        2**80 - 100,
        10.5,
        np.float64(10),
        np.float64(10.5),
        np.float64(np.finfo(np.float32).max.astype(np.float64) * 10),
        np.float32(10),
        np.float32(10.5),
    ]

    pandas = backend == Backend.PANDAS
    filename = f"{backend}-setitem-{index_type.name}.md"
    with open(output_directory / filename, "w") as f:
        if pandas:
            f.write(f"Pandas {pd.__version__} behaviour for {index_type!r}\n\n")
        else:
            f.write(f"CUDF {cudf.__version__} behaviour for {index_type!r}\n\n")

        generate_table(
            f,
            [2**31 - 10, 2**31 - 100, 3, 4, 5],
            integer_column_values_to_try,
            np.int32,
            index_type=index_type,
            pandas=pandas,
        )
        f.write("\n")
        generate_table(
            f,
            [2**63 - 10, 2**63 - 100, 3, 4, 5],
            integer_column_values_to_try,
            np.int64,
            index_type=index_type,
            pandas=pandas,
        )
        f.write("\n")
        generate_table(
            f,
            [np.finfo(np.float32).max, np.float32(np.inf), 3, 4, 5],
            float_column_values_to_try,
            np.float32,
            index_type=index_type,
            pandas=pandas,
        )
        f.write("\n")
        generate_table(
            f,
            [np.finfo(np.float64).max, np.float64(np.inf), 3, 4, 5],
            float_column_values_to_try,
            np.float64,
            index_type=index_type,
            pandas=pandas,
        )


def main(
    output_directory: Path = typer.Argument(Path("."), help="Output directory for results"),
    backend: Backend = typer.Option("pandas", help="Dataframe backend to test"),
):
    os.makedirs(output_directory, exist_ok=True)
    for index_type in IndexType.__members__.values():
        generate_tables(output_directory, backend, index_type)


if __name__ == "__main__":
    typer.run(main)

Numeric columns

Integer column dtypes

dtype width < max integer width

Initial values [2**31 - 10, 2**31 - 100, 3]. np.int32 is
representative of any integer type that is smaller than the max width.

| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[int32]` | `10` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `np.int64(10)` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `1099511627776` | `np.dtype[longlong]` | No [^2] |
| `np.dtype[int32]` | `np.int64(1099511627776)` | `np.dtype[longlong]` | No [^2] |
| `np.dtype[int32]` | `1208925819614629174706176` | `np.dtype[object_]` | No [^3] |
| `np.dtype[int32]` | `10.5` | `np.dtype[float64]` | No [^4] |
| `np.dtype[int32]` | `np.float64(10.0)` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `np.float64(10.5)` | `np.dtype[float64]` | No [^2] |
| `np.dtype[int32]` | `np.float32(10.0)` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `np.float32(10.5)` | `np.dtype[float64]` | No [^5] |
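For instance, the `np.dtype[longlong]` rows can be reproduced directly (a minimal check, pandas 1.5.x assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([2**31 - 10, 2**31 - 100, 3], dtype=np.int32)
s.iloc[2] = 2**40            # value does not fit in int32
print(s.dtype)               # int64 (np.longlong), per the table above
print(s.iloc[2] == 2**40)    # True: nothing was lost
```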

dtype width == max integer width

Initial values [2 ** 63 - 10, 2 ** 63 - 100, 3]. These provoke edge
cases in upcasting because:

import numpy as np
np.find_common_type([], [np.int64, np.float64])
# => np.float64 Noooooo! Hates it
# Yes, I know this is the same as the integer to float promotion in
# C/C++, I'm allowed to hate that too.
| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[int64]` | `10` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.int64(10)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `1099511627776` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.int64(1099511627776)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `1208925819614629174706176` | `np.dtype[object_]` | No [^3] |
| `np.dtype[int64]` | `10.5` | `np.dtype[float64]` | Yes [^6] |
| `np.dtype[int64]` | `np.float64(10.0)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.float64(10.5)` | `np.dtype[float64]` | Yes [^6] |
| `np.dtype[int64]` | `np.float32(10.0)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.float32(10.5)` | `np.dtype[float64]` | Yes [^6] |
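Footnote 6 is the interesting case here; a quick way to see the loss (a minimal check, pandas 1.5.x assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)
s.iloc[2] = 10.5
print(s.dtype)                        # float64
# The untouched values no longer round-trip back to their original ints:
print(int(s.iloc[0]) == 2**63 - 10)   # False
```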

Float column dtypes

dtype width < max float width

Initial values [np.finfo(np.float32).max, np.float32(np.inf), 3]

| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[float32]` | `10` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.int64(10)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `1099511627776` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.int64(1099511627776)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.int32(2147483548)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float32]` | `np.int64(9223372036854775708)` | `np.dtype[float32]` | Yes [^7] |
| `np.dtype[float32]` | `1208925819614629174706076` | `np.dtype[object_]` | No [^3] |
| `np.dtype[float32]` | `10.5` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float64(10.0)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float64(10.5)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` | No [^2] |
| `np.dtype[float32]` | `np.float32(10.0)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float32(10.5)` | `np.dtype[float32]` | No [^1] |

dtype width == max float width

Initial values [np.finfo(np.float64).max, np.float64(np.inf), 3]

| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[float64]` | `10` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int64(10)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `1099511627776` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int64(1099511627776)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int32(2147483548)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int64(9223372036854775708)` | `np.dtype[float64]` | Yes [^6] |
| `np.dtype[float64]` | `1208925819614629174706076` | `np.dtype[object_]` | No [^3] |
| `np.dtype[float64]` | `10.5` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float64(10.0)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float64(10.5)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float32(10.0)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float32(10.5)` | `np.dtype[float64]` | No [^1] |

Everything else

Basically, you can put anything in a column and you get an object column out;
numpy types are converted to object first.
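For example (illustrative, pandas 1.5.x assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype=np.int64)
s.iloc[2] = "not a number"   # any Python object is accepted
print(s.dtype)               # object
```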

CUDF behaviour

CUDF trunk, and state in #11904.

Numeric columns

Integer column dtypes

dtype width < max integer width

Initial values [2**31 - 10, 2**31 - 100, 3]. np.int32 is
representative of any integer type that is smaller than the max width.

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[int32]` | `10` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int32]` | `np.int64(10)` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int32]` | `1099511627776` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | Yes | No |
| `np.dtype[int32]` | `np.int64(1099511627776)` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | Yes | No |
| `np.dtype[int32]` | `1208925819614629174706176` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[int32]` | `10.5` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | Yes | No |
| `np.dtype[int32]` | `np.float64(10.0)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[int32]` | `np.float64(10.5)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | Yes | No |
| `np.dtype[int32]` | `np.float32(10.0)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[int32]` | `np.float32(10.5)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | Yes | No |
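A concrete repro of the 2**40 row (a sketch; "trunk" here means the behaviour before #11904 is merged):

```python
import cudf

s = cudf.Series([2**31 - 10, 2**31 - 100, 3], dtype="int32")
s.iloc[2] = 2**40
# trunk: dtype stays int32 and the value is silently truncated (lossy);
# with #11904: dtype is promoted to int64 and the value is preserved.
print(s.dtype, s.iloc[2])
```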

dtype width == max integer width

Initial values [2 ** 63 - 10, 2 ** 63 - 100, 3].

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[int64]` | `10` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `np.int64(10)` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `1099511627776` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `np.int64(1099511627776)` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `1208925819614629174706176` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[int64]` | `10.5` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | Yes | Yes [^6] |
| `np.dtype[int64]` | `np.float64(10.0)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | No | Yes [^6] |
| `np.dtype[int64]` | `np.float64(10.5)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | Yes | Yes [^6] |
| `np.dtype[int64]` | `np.float32(10.0)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | No | Yes [^6] |
| `np.dtype[int64]` | `np.float32(10.5)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | Yes | Yes [^6] |

Float column dtypes

dtype width < max float width

Initial values [np.finfo(np.float32).max, np.float32(np.inf), 3]

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[float32]` | `10` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.int64(10)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `1099511627776` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.int64(1099511627776)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.int32(2147483548)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | Yes [^10] | No |
| `np.dtype[float32]` | `np.int64(9223372036854775708)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | Yes [^10] | Yes [^6] |
| `np.dtype[float32]` | `1208925819614629174706076` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[float32]` | `10.5` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.float64(10.0)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.float64(10.5)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | Yes [^8] | No |
| `np.dtype[float32]` | `np.float32(10.0)` | `np.dtype[float32]` [^8] | `np.dtype[float32]` [^9] | No | No |
| `np.dtype[float32]` | `np.float32(10.5)` | `np.dtype[float32]` [^8] | `np.dtype[float32]` [^9] | No | No |

dtype width == max float width

Initial values [np.finfo(np.float64).max, np.float64(np.inf), 3]

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[float64]` | `10` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int64(10)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `1099511627776` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int64(1099511627776)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int32(2147483548)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int64(9223372036854775708)` | `np.dtype[float64]` | `np.dtype[float64]` | Yes [^6] | Yes [^6] |
| `np.dtype[float64]` | `1208925819614629174706076` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[float64]` | `10.5` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float64(10.0)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float64(10.5)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float32(10.0)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float32(10.5)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |

Everything else

This is where it starts to get really messy. This section is a work
in progress. We should decide what we want the semantics to be,
because in most cases pandas doesn't have the same dtypes that CUDF does.

Inserting strings into numerical columns

This "works", for some value of "works" on #11904 if the string value
is parseable as the target dtype.

So

s = cudf.Series([1, 2, 3], dtype=int)
s.iloc[2] = "4" # works
s.iloc[2] = "0xf" # => ValueError: invalid literal for int() with base 10: '0xf'

And similarly for float strings and float dtypes.

This is probably a nice feature.

Inserting things into string columns

Works if the the "thing" is convertible to a string (so numbers work),
but Scalars with list or struct dtypes don't work.
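For example (an illustrative sketch):

```python
import cudf

s = cudf.Series(["a", "b", "c"])
s.iloc[0] = 1      # works: 1 is convertible to a string
print(s)
# List- or struct-typed scalars, by contrast, are rejected.
```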

I would argue that explicit casting from the user here is probably
better.

List columns

The new value must have an identical dtype to that of the target column.
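Illustrating that rule (a sketch with hypothetical values):

```python
import cudf

s = cudf.Series([[1, 2], [3]])   # list<int64> column
s.iloc[0] = [4, 5]               # accepted: same list<int64> dtype
s.iloc[0] = [4.5, 5.5]           # rejected: list<float64> is not list<int64>
```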

Struct columns

The new value must have leaf dtypes that are considered compatible in
some sense, but then the leaves are downcast to the leaf dtypes of the
target column. So this is lossy and likely a bug:

sr = cudf.Series([{"a": 1, "b": 2}])
sr.iloc[0] = {"a": 10.5, "b": 2}
sr[0]  # => {"a": 10, "b": 2} (lost data in "a")

What I think we want (for composite columns)

For composite columns, if the dtype shapes match, I think the casting
rule should be to traverse to the leaf dtypes and promote using the
rules for non-composite columns. If shapes don't match, __setitem__
should not be allowed.

This, to me, exhibits the principle of least surprise.
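A rough sketch of the rule I have in mind (a hypothetical helper; `np.promote_types` stands in for whatever leaf-level promotion rule we settle on):

```python
import numpy as np
import cudf


def promote_composite(col_dtype, val_dtype):
    """Recursively promote matching nested dtypes at their leaves.

    Raises if the dtype "shapes" (list/struct structure) differ, in which
    case __setitem__ would not be allowed.
    """
    if isinstance(col_dtype, cudf.ListDtype) and isinstance(val_dtype, cudf.ListDtype):
        return cudf.ListDtype(
            promote_composite(col_dtype.element_type, val_dtype.element_type)
        )
    if isinstance(col_dtype, cudf.StructDtype) and isinstance(val_dtype, cudf.StructDtype):
        if col_dtype.fields.keys() != val_dtype.fields.keys():
            raise TypeError("struct fields do not match; refuse to set")
        return cudf.StructDtype(
            {
                name: promote_composite(col_dtype.fields[name], val_dtype.fields[name])
                for name in col_dtype.fields
            }
        )
    if isinstance(col_dtype, (cudf.ListDtype, cudf.StructDtype)) or isinstance(
        val_dtype, (cudf.ListDtype, cudf.StructDtype)
    ):
        raise TypeError("dtype shapes do not match; refuse to set")
    # Leaf dtypes: promote with the non-composite rules (placeholder).
    return np.promote_types(col_dtype, val_dtype)
```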

Footnotes

[^1]: value is exact in the initial dtype
[^2]: next largest numpy type that contains the value
[^3]: not representable in a numpy type, so coercion to object column
[^4]: default float type is float64
[^5]: np.int32 is losslessly convertible to np.float64
[^6]: np.int64 is not losslessly convertible to np.float64
[^7]: value is not losslessly representable, but also, expecting np.float64!
[^8]: Bug fixed by "Fix type casting in Series.__setitem__" (#11904)
[^9]: CUDF doesn't inspect values, so type-based promotion (difference from pandas)
[^10]: As for 6, but promotion from np.int32 to np.float32 is also not lossless.

@wence- added the feature request, question, proposal, Needs Triage, pandas, and improvement labels on Nov 1, 2022
shwina commented Nov 2, 2022

What a fantastic analysis.

@mroeschke

Just a comment about the Lossy classification:

One of the factors of Lossy="Yes" is if:

expect = (np.asarray if pandas else cupy.asarray)(
    initial_values[:new_location], dtype=dtype
)
original_lost_info = (s.iloc[:new_location].astype(dtype) != expect).any()

And this example from your table had Lossy="Yes":

In [26]: ser = pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)

In [27]: ser
Out[27]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [29]: ser.iloc[2] = 10.5

In [32]: ser
Out[32]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

I think this is "lossy" because of an int64 -> float64 -> int64 conversion, not necessarily because there's a lossiness in value

In [47]: ser.iloc[0] == 2 ** 63 - 10
Out[47]: True

In [48]: ser.iloc[1] == 2 ** 63 - 100
Out[48]: True

In [49]: ser.iloc[2] == 10.5
Out[49]: True

In [50]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)
Out[50]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [51]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64).astype(np.float64).astype(np.int64)
Out[51]:
0   -9223372036854775808
1   -9223372036854775808
2                      3
dtype: int64

In [52]: np.array([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64).astype(np.float64).astype(np.int64)
Out[52]: array([-9223372036854775808, -9223372036854775808,                    3])

vyasr commented Nov 3, 2022

What a fantastic analysis.

This is easily one of the most detailed issues I have ever seen!

wence- commented Nov 4, 2022

Just a comment about the Lossy classification:

One of the factors of Lossy="Yes" is if:

expect = (np.asarray if pandas else cupy.asarray)(
    initial_values[:new_location], dtype=dtype
)
original_lost_info = (s.iloc[:new_location].astype(dtype) != expect).any()

And this example from your table had Lossy="Yes":

In [26]: ser = pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)

In [27]: ser
Out[27]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [29]: ser.iloc[2] = 10.5

In [32]: ser
Out[32]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

I think this is "lossy" because of an int64 -> float64 -> int64 conversion, not necessarily because there's a lossiness in value

Depends what you mean by lossy. float64 only has enough precision to exactly represent int53 (because the mantissa has 53 bits). Hence, mapping between int64 and float64 is neither surjective (you can't reach nan and inf) nor injective (multiple int64s can map to the same float64), so you can't roundtrip successfully.

In [36]: np.int64(2**63 - 10) == np.int64(2**63 - 100)
Out[36]: False

In [37]: np.int64(2**63 - 10).astype(np.float64) == np.int64(2**63 - 100).astype(np.float64)
Out[37]: True
In [47]: ser.iloc[0] == 2 ** 63 - 10
Out[47]: True

In [48]: ser.iloc[1] == 2 ** 63 - 100
Out[48]: True

In [49]: ser.iloc[2] == 10.5
Out[49]: True

These are all true because the right operand of equality is promoted to float64 before the comparison is performed.

In [50]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)
Out[50]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [51]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64).astype(np.float64).astype(np.int64)
Out[51]:
0   -9223372036854775808
1   -9223372036854775808
2                      3
dtype: int64

I characterise this as "lossy" because you didn't round-trip successfully. As noted, it isn't really a pandas issue per-se, but rather numpy (and thence promotion via C rules).

I note that numpy doesn't quite implement C/C++ promotion rules for integer and floating point types, since it widens the floating point type in mixed int/float addition.

C++:

typeof(1 + 1.0f) => float32
typeof(1L + 1.0f) => float32
typeof(1 + 1.0d) => float64
typeof(1L + 1.0d) => float64

These promotions are always in range, since max(float32) > max(int64), but lossy, since the mapping from integers to floating point types of the same width is not bijective.

numpy:

(np.int32(1) + np.float32(1)).dtype => float64
(np.int64(1) + np.float32(1)).dtype => float64
(np.int32(1) + np.float64(1)).dtype => float64
(np.int64(1) + np.float64(1)).dtype => float64

These promotions are injective (if we ignore nan and inf) if the integer type is narrower than the floating point type. (So the int32 promotion can be undone losslessly.)

These kinds of automatic promotions are basically a fight I lost before I was born in old programming languages, though some more modern ones force you to always explicitly cast so that you're aware that you might be doing A Bad Thing.

wence- commented Nov 4, 2022

Updated the code to generate tables for a bunch of different scenarios. You can now run it, and it produces a separate file for each scenario; you can diff the files against each other to see the points of difference.

Mostly things work (with #11904), except #12072 means that equality is busted.

And, slice-based setitem is broken for length-one ranges if setting with an array rather than a scalar (that's #12073).

seberg commented Nov 4, 2022

Sorry for the long post; I hope it doesn't derail things or go off topic. I will mainly try to explain the state of NumPy and what NEP 50 does. Happy to re-read again without that in mind!
(I will always make time to talk about these things, because just talking about it, even tangentially, helps me clear up things and move forward.)

N.B. please avoid np.find_common_type; it needs to be deprecated. np.result_type (for array inputs, or if you need the value-based or future NEP 50 behaviour) and np.promote_types are the weapons of choice.
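I.e., roughly:

```python
import numpy as np

# Purely type-based promotion:
print(np.promote_types(np.int64, np.float64))   # float64
# result_type additionally handles array/scalar inputs (and, in future, NEP 50):
print(np.result_type(np.int64, np.float32))     # float64
```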

First, there are three things that come together to understand the NumPy behavior:

  1. Integers are converted to long -> long long -> unsigned long long -> object by NumPy based on value. This is what happens if you do np.asarray(integer) and in many other contexts.
  2. Storing values may or may not go via np.asarray(), if it does not, NumPy will be more strict sometimes.
    In fact, I just deprecated storing out-of-bounds integers, e.g. to disallow np.asarray([-1], dtype=np.uint8). This will be part of the next NumPy release, unless someone complains (which could happen).
  3. Promotion/Common DType: i.e. what happens if you mix two dtypes
    • NumPy promotion rules are wonky: they try to be "safe", but make an exception by promoting int64 + float64 -> float64 (and worse, int64 + uint64 -> float64). The reason for this is that they fall out of the "safe casting" concept. I disagree with this nowadays; while related, I feel it is better to see promotion/common-dtype and safe/non-lossy casting as distinct. For numerical dtypes, I do agree that round-tripping correctly implies that promotion is acceptable. (NumPy is implemented that way now after NEP 42.)
      The C promotion rules, etc. have the advantage of being e.g. strictly associative (EDIT) while ours are not: (int16 + uint16) + float32 == int32 + float32 == float64 != float32 == (float32 + int16) + uint16 (I/NumPy has a heuristic so that np.result_type finds float32 if it is a single operation, it works always, but you can construct failures if you add new dtypes in the mix.)
    • Current promotion inspects values (of "scalars", including 0-D arrays, but not when all operands are scalars). This allows an integer to match any dtype that will fit its value, and floats also to be "down-promoted" so long as they are not approximately out of range.
      NEP 50 is the proposal to remove this. Instead, NEP 50 proposes to insert the Python integer, float, complex as undefined/lowest precision. For integers this means that np.uint8(1) + 300 will error, because we try to do uint8 + uint8 -> uint8 and that doesn't work (see the snippet after this list).
      NEP 50 tries to describe this; the JAX diagram also helps. This makes the Python scalars "weak".
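A small illustration of that last point (the printed result is current NumPy 1.x behaviour; the NEP 50 outcome is as described above):

```python
import numpy as np

# Value-based (legacy) promotion: the Python int 300 is inspected and the
# result is widened so that it fits.
print((np.uint8(1) + 300).dtype)   # int16 on NumPy 1.x

# Under NEP 50 the Python int is "weak": the operation is attempted as
# uint8 + uint8, and since 300 does not fit in uint8 it raises instead.
```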

There is discussion about lossy conversion of Python scalars, which touches something that I am currently struggling with in pushing NEP 50 forward. So I am bringing it up, because if you clearly want the notion of "lossy", it might inform how I do this for NEP 50 (it is not central to NEP 50, and doesn't have to be; but I need to settle on something).

For NumPy dtypes we have that defined through the "casting safety" (which can be "no", "equiv", "safe", "same kind", or "unsafe"). For values we don't really define it, and I don't think the categories quite work.
int64 -> int8 is same-kind, but 10000 -> int8 seems "unsafe", or is it "safe" but fails when you try it? Whether or not I should try to introduce the notion of "safety" for scalar assignments in NumPy is one of the things I am struggling with, because the notion of cast-safety doesn't seem to cleanly apply (but NumPy exposes it a lot).

seberg commented Nov 4, 2022

NEP 50 itself does not touch the annoying fact that np.array([2**63, 2**61]) returns a float array. But, since NEP 50 is likely a NumPy 2.0 change, I think it does make sense to stop doing that and only ever go to the default integer, otherwise raising an error.

@mroeschke

I think generally from the pandas point of view (with respect to __setitem__), I would say the general philosophy is "the resulting dtype should be able to store the existing values & new value(s) without losing numeric precision".

Here are some examples (related to your original table) where we differ from numpy due to the above:

1 . "Don't truncate so upcast dtype" example

In [6]: ser = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [7]: ser.iloc[2] = 10.5

In [8]: ser
Out[8]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

In [9]: np_arr = np.array([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [10]: np_arr[2] = 10.5

In [11]: np_arr
Out[11]: array([9223372036854775798, 9223372036854775708,                  10])

In [14]: print(np.__version__) ; print(pd.__version__)
1.23.1
2.0.0.dev0+577.g7e9ca6e8af
  1. "Bend over backwards to accommodate the values" examples
In [5]: ser = pd.Series([1, 2, 3], dtype=np.int64)

In [6]: ser.iloc[2] = 2**80

In [7]: ser
Out[7]:
0                            1
1                            2
2    1208925819614629174706176
dtype: object

In [8]: arr = np.array([1, 2, 3], dtype=np.int64)

In [9]: arr
Out[9]: array([1, 2, 3])

In [10]: arr[2] = 1208925819614629174706176
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 arr[2] = 1208925819614629174706176

OverflowError: Python int too large to convert to C long

wence- commented Nov 7, 2022

2. "Bend over backwards to accommodate the values" examples

Unless we want to implement bignums in libcudf, I think this is not something we can really contemplate.

wence- commented Nov 7, 2022

I think generally from the pandas point of view (with respect to __setitem__), I would say the general philosophy is "the resulting dtype should be able to store the existing values & new value(s) without losing numeric precision".

This is a reasonable philosophy, though it does rather depend on what you mean by "losing numeric precision".

1 . "Don't truncate so upcast dtype" example

In [6]: ser = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [7]: ser.iloc[2] = 10.5

In [8]: ser
Out[8]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

This seems like a nice thing to do, except that it changes behaviour of the series:

In [22]: ser = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [23]: ser == ser.iloc[0]
Out[23]: 
0     True
1    False
2    False
dtype: bool

In [24]: ser.iloc[2] = 10.5

In [25]: ser == ser.iloc[0]
Out[25]: 
0     True
1     True
2    False
dtype: bool

I would argue that this choice did lose numeric precision, and particularly in existing values that weren't touched by the indexing.

seberg commented Nov 7, 2022

So pandas (and cudf) use common DType to promote a column when necessary. That leads to two things:

  • The special rule for going to float64 does lose precision for int64/uint64.
  • Pandas will even allow going to object dtype.

The first issue is independent of NEP 50 and I am not sure there is much to do about it. You could consider disallowing it. It probably means making a clearer distinction between "common DType" (for storing into a single column) and "promotion" (for operations).
That is, the promotion in float_arr + int64_arr -> float_arr (i.e. adding two arrays) seems fine, but maybe storing the two into one container is not?
I make the distinction in NumPy, but it is not convenient to apply it broadly. My gut feeling is that it is probably best to live with the way it is: yes, it loses precision, but it is usually OK.

For the second thing, NEP 50 might nudge towards making it an error in pandas: If you apply NEP 50 promotion rules, the column should keep whatever int dtype it has, and if the Python integer cannot be assigned it should just error. Auto-promoting to object dtype would then never work (which is also what will happen for np.int64(1) + 2**500).

@mroeschke

I would argue that this choice did lose numeric precision, and particularly in existing values that weren't touched by the indexing.

That's fair. I'm not 100% sure of the origin of promoting to float64 in pandas for this case, but I would venture to guess that the numpy behaviour of keeping int64 and truncating the 10.5 to 10 was a more surprising and more significant loss of precision for pandas users.

For the second thing, NEP 50 might nudge towards making it an error in pandas:

This is also fair. Maybe this is a case of pandas being too accommodating, but we can always nudge users to astype(object) in these cases.

wence- commented Nov 8, 2022

Not 100% sure of the origin of promoting to float64 in pandas for this case,

If you're going to do integral-to-floating type promotion and want to maintain accuracy in the largest number of cases (and don't care too much about memory footprint), then promoting to float64 is a reasonable choice.
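(A quick check of that, following the 53-bit-significand point earlier in the thread:)

```python
import numpy as np

x = np.int32(2**31 - 1)
print(np.int64(np.float64(x)) == x)   # True: every int32 survives a float64 round trip
y = np.int64(2**53 + 1)
print(np.int64(np.float64(y)) == y)   # False: rounds to 2**53
```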

@GregoryKimball added the 2 - In Progress label and removed the Needs Triage label on Nov 14, 2022
@GregoryKimball added the Python label on Nov 21, 2022
@vyasr removed the pandas label on May 16, 2024
@vyasr added this to cuDF Python on Nov 5, 2024