[ENH/QST]: Behaviour of type promotion in __setitem__ #12039

Open
6 tasks done
Tracked by #12793
wence- opened this issue Nov 1, 2022 · 13 comments
Labels: 2 - In Progress, feature request, improvement, proposal, Python, question

Comments

wence- commented Nov 1, 2022

Summary

CUDF is not consistent with Pandas (under a bunch of circumstances) in
its behaviour when upcasting during __setitem__. In some cases, we
might want to mimic pandas behaviour (though they are very keen to use
value-based type promotion). In others, where we have more structured
dtypes than pandas, we need to decide what to do (current behaviour is
internally inconsistent and buggy in a bunch of cases).

I summarise what I think the current state is (by way of experiment),
and then discuss some options. Opinions welcome!

cc: @vyasr, @mroeschke, @shwina

Pandas behaviour

Pandas version 1.5.1, macOS (Apple Silicon)

Edit: updated code for generating more tables.

I should note that these tables are for single index __setitem__ (s.iloc[i] = value). I should check if the same behaviour also occurs for:

  • slice-based __setitem__ with single value s.iloc[:1] = [value]
  • slice-based __setitem__ with list of values s.iloc[:2] = [value for _ in range(2)]
  • mask-based __setitem__ with singleton value s.iloc[[True, False]] = [value]
  • mask-based __setitem__ with multiple values s.iloc[[True, False, True]] = [value, value]
  • index-based __setitem__ with single value s.iloc[[1]] = value
  • index-based __setitem__ with multiple values s.iloc[[1, 2]] = [value, value]
Code to generate tables
from __future__ import annotations

import os
from enum import Enum, IntEnum, auto
from itertools import filterfalse, repeat
from operator import not_
from pathlib import Path

import numpy as np
import pandas as pd
import typer

try:
    import cudf
    import cupy

    class Backend(str, Enum):
        PANDAS = "pandas"
        CUDF = "cudf"

except ImportError:

    class Backend(str, Enum):
        PANDAS = "pandas"


def numeric_series(values, dtype, *, pandas):
    if pandas:
        return pd.Series(values, dtype=dtype)
    else:
        return cudf.Series(values, dtype=dtype)


def format_val(v):
    try:
        dt = v.dtype
        return f"np.{dt.type.__name__}({v})"
    except AttributeError:
        return f"{v}"


class IndexType(IntEnum):
    SINGLE_INT = auto()
    SINGLETON_SLICE = auto()
    CONTIG_SLICE = auto()
    STRIDED_SLICE = auto()
    SINGLETON_MASK = auto()
    GENERAL_MASK = auto()
    SINGLETON_SCATTER = auto()
    GENERAL_SCATTER = auto()


def indexing(index_type: IndexType, n: int) -> tuple[int | slice | list, slice | list]:
    assert n >= 3
    if index_type == IndexType.SINGLE_INT:
        return n - 1, slice(0, n - 1, None)
    elif index_type == IndexType.SINGLETON_SLICE:
        return slice(1, 2, 1), [0, *range(2, n)]
    elif index_type == IndexType.CONTIG_SLICE:
        return slice(1, n - 2, 1), [0, *range(n - 2, n)]
    elif index_type == IndexType.STRIDED_SLICE:
        return slice(0, n, 2), slice(1, n, 2)
    elif index_type == IndexType.SINGLETON_MASK:
        yes = [False, True, *repeat(False, n - 2)]
        no = list(map(not_, yes))
        return yes, no
    elif index_type == IndexType.GENERAL_MASK:
        yes = [True, False, True, *repeat(False, n - 3)]
        no = list(map(not_, yes))
        return yes, no
    elif index_type == IndexType.SINGLETON_SCATTER:
        yes = [1]
        # Oh for Haskell-esque sections
        no = list(filterfalse(yes.__contains__, range(n)))
        return yes, no
    elif index_type == IndexType.GENERAL_SCATTER:
        yes = [0, 2]
        no = list(filterfalse(yes.__contains__, range(n)))
        return yes, no
    else:
        raise ValueError("Unhandled case")


def generate_table(f, initial_values, values_to_try, dtype, *, index_type, pandas):
    initial_values = np.asarray(initial_values, dtype=object)
    f.write("| Initial dtype | New value | Final dtype | Lossy? |\n")
    f.write("|---------------|-----------|-------------|--------|\n")

    yes, no = indexing(index_type, len(initial_values))
    for value in values_to_try:
        s = numeric_series(initial_values, dtype=dtype, pandas=pandas)
        otype = f"np.{type(s.dtype).__name__}"
        try:
            if index_type == IndexType.SINGLETON_SLICE:
                # Wrap the value in a length-1 array for the slice assignment;
                # use numpy for the pandas backend and cupy for the cudf one.
                value = (np.asarray if pandas else cupy.asarray)([value])
            s.iloc[yes] = value
        except BaseException as e:
            f.write(f"| `{otype}` | `{format_val(value)}` | N/A | {e} |\n")
            continue
        ntype = f"np.{type(s.dtype).__name__}"
        expect = (np.asarray if pandas else cupy.asarray)(
            initial_values[no], dtype=dtype
        )
        original_lost_info = (s.iloc[no].astype(dtype) != expect).any()
        try:
            new_vals = s.iloc[yes].astype(value.dtype)
        except AttributeError:
            if pandas:
                new_vals = np.asarray(s.iloc[yes])
            else:
                new_vals = cupy.asarray(s.iloc[yes])
        new_lost_info = (new_vals != value).any()
        lossy = "Yes" if original_lost_info or new_lost_info else "No"
        f.write(f"| `{otype}` | `{format_val(value)}` | `{ntype}` | {lossy} |\n")


def generate_tables(output_directory: Path, backend: Backend, index_type: IndexType):
    integer_column_values_to_try = [
        10,
        np.int64(10),
        2**40,
        np.int64(2**40),
        2**80,
        10.5,
        np.float64(10),
        np.float64(10.5),
        np.float32(10),
        np.float32(10.5),
    ]
    float_column_values_to_try = [
        10,
        np.int64(10),
        2**40,
        np.int64(2**40),
        np.int32(2**31 - 100),
        np.int64(2**63 - 100),
        2**80 - 100,
        10.5,
        np.float64(10),
        np.float64(10.5),
        np.float64(np.finfo(np.float32).max.astype(np.float64) * 10),
        np.float32(10),
        np.float32(10.5),
    ]

    pandas = backend == Backend.PANDAS
    filename = f"{backend}-setitem-{index_type.name}.md"
    with open(output_directory / filename, "w") as f:
        if pandas:
            f.write(f"Pandas {pd.__version__} behaviour for {index_type!r}\n\n")
        else:
            f.write(f"CUDF {cudf.__version__} behaviour for {index_type!r}\n\n")

        generate_table(
            f,
            [2**31 - 10, 2**31 - 100, 3, 4, 5],
            integer_column_values_to_try,
            np.int32,
            index_type=index_type,
            pandas=pandas,
        )
        f.write("\n")
        generate_table(
            f,
            [2**63 - 10, 2**63 - 100, 3, 4, 5],
            integer_column_values_to_try,
            np.int64,
            index_type=index_type,
            pandas=pandas,
        )
        f.write("\n")
        generate_table(
            f,
            [np.finfo(np.float32).max, np.float32(np.inf), 3, 4, 5],
            float_column_values_to_try,
            np.float32,
            index_type=index_type,
            pandas=pandas,
        )
        f.write("\n")
        generate_table(
            f,
            [np.finfo(np.float64).max, np.float64(np.inf), 3, 4, 5],
            float_column_values_to_try,
            np.float64,
            index_type=index_type,
            pandas=pandas,
        )


def main(
    output_directory: Path = typer.Argument(Path("."), help="Output directory for results"),
    backend: Backend = typer.Option("pandas", help="Dataframe backend to test"),
):
    os.makedirs(output_directory, exist_ok=True)
    for index_type in IndexType.__members__.values():
        generate_tables(output_directory, backend, index_type)


if __name__ == "__main__":
    typer.run(main)

Numeric columns

Integer column dtypes

dtype width < max integer width

Initial values [2**31 - 10, 2**31 - 100, 3]. np.int32 is
representative of any integer type that is smaller than the max width.

| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[int32]` | `10` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `np.int64(10)` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `1099511627776` | `np.dtype[longlong]` | No [^2] |
| `np.dtype[int32]` | `np.int64(1099511627776)` | `np.dtype[longlong]` | No [^2] |
| `np.dtype[int32]` | `1208925819614629174706176` | `np.dtype[object_]` | No [^3] |
| `np.dtype[int32]` | `10.5` | `np.dtype[float64]` | No [^4] |
| `np.dtype[int32]` | `np.float64(10.0)` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `np.float64(10.5)` | `np.dtype[float64]` | No [^2] |
| `np.dtype[int32]` | `np.float32(10.0)` | `np.dtype[int32]` | No [^1] |
| `np.dtype[int32]` | `np.float32(10.5)` | `np.dtype[float64]` | No [^5] |
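For instance, the `np.dtype[longlong]` rows can be reproduced directly (a minimal check, pandas 1.5.x assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([2**31 - 10, 2**31 - 100, 3], dtype=np.int32)
s.iloc[2] = 2**40            # value does not fit in int32
print(s.dtype)               # int64 (np.longlong), per the table above
print(s.iloc[2] == 2**40)    # True: nothing was lost
```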

dtype width == max integer width

Initial values [2 ** 63 - 10, 2 ** 63 - 100, 3]. These provoke edge
cases in upcasting because:

import numpy as np
np.find_common_type([], [np.int64, np.float64])
# => np.float64 Noooooo! Hates it
# Yes, I know this is the same as the integer to float promotion in
# C/C++, I'm allowed to hate that too.
| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[int64]` | `10` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.int64(10)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `1099511627776` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.int64(1099511627776)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `1208925819614629174706176` | `np.dtype[object_]` | No [^3] |
| `np.dtype[int64]` | `10.5` | `np.dtype[float64]` | Yes [^6] |
| `np.dtype[int64]` | `np.float64(10.0)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.float64(10.5)` | `np.dtype[float64]` | Yes [^6] |
| `np.dtype[int64]` | `np.float32(10.0)` | `np.dtype[int64]` | No [^1] |
| `np.dtype[int64]` | `np.float32(10.5)` | `np.dtype[float64]` | Yes [^6] |
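Footnote 6 is the interesting case here; a quick way to see the loss (a minimal check, pandas 1.5.x assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)
s.iloc[2] = 10.5
print(s.dtype)                        # float64
# The untouched values no longer round-trip back to their original ints:
print(int(s.iloc[0]) == 2**63 - 10)   # False
```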

Float column dtypes

dtype width < max float width

Initial values [np.finfo(np.float32).max, np.float32(np.inf), 3]

| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[float32]` | `10` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.int64(10)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `1099511627776` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.int64(1099511627776)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.int32(2147483548)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float32]` | `np.int64(9223372036854775708)` | `np.dtype[float32]` | Yes [^7] |
| `np.dtype[float32]` | `1208925819614629174706076` | `np.dtype[object_]` | No [^3] |
| `np.dtype[float32]` | `10.5` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float64(10.0)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float64(10.5)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` | No [^2] |
| `np.dtype[float32]` | `np.float32(10.0)` | `np.dtype[float32]` | No [^1] |
| `np.dtype[float32]` | `np.float32(10.5)` | `np.dtype[float32]` | No [^1] |

dtype width == max float width

Initial values [np.finfo(np.float64).max, np.float64(np.inf), 3]

| Initial dtype | New value | Final dtype | Lossy? |
|---------------|-----------|-------------|--------|
| `np.dtype[float64]` | `10` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int64(10)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `1099511627776` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int64(1099511627776)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int32(2147483548)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.int64(9223372036854775708)` | `np.dtype[float64]` | Yes [^6] |
| `np.dtype[float64]` | `1208925819614629174706076` | `np.dtype[object_]` | No [^3] |
| `np.dtype[float64]` | `10.5` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float64(10.0)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float64(10.5)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float32(10.0)` | `np.dtype[float64]` | No [^1] |
| `np.dtype[float64]` | `np.float32(10.5)` | `np.dtype[float64]` | No [^1] |

Everything else

Basically, you can put anything in a column and you get an object column out;
numpy types are converted to object first.
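For example (illustrative, pandas 1.5.x assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype=np.int64)
s.iloc[2] = "not a number"   # any Python object is accepted
print(s.dtype)               # object
```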

CUDF behaviour

CUDF trunk, and state in #11904.

Numeric columns

Integer column dtypes

dtype width < max integer width

Initial values [2**31 - 10, 2**31 - 100, 3]. np.int32 is
representative of any integer type that is smaller than the max width.

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[int32]` | `10` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int32]` | `np.int64(10)` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int32]` | `1099511627776` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | Yes | No |
| `np.dtype[int32]` | `np.int64(1099511627776)` | `np.dtype[int32]` [^8] | `np.dtype[int64]` [^9] | Yes | No |
| `np.dtype[int32]` | `1208925819614629174706176` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[int32]` | `10.5` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | Yes | No |
| `np.dtype[int32]` | `np.float64(10.0)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[int32]` | `np.float64(10.5)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | Yes | No |
| `np.dtype[int32]` | `np.float32(10.0)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[int32]` | `np.float32(10.5)` | `np.dtype[int32]` [^8] | `np.dtype[float64]` [^9] | Yes | No |
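A concrete repro of the 2**40 row (a sketch; "trunk" here means the behaviour before #11904 is merged):

```python
import cudf

s = cudf.Series([2**31 - 10, 2**31 - 100, 3], dtype="int32")
s.iloc[2] = 2**40
# trunk: dtype stays int32 and the value is silently truncated (lossy);
# with #11904: dtype is promoted to int64 and the value is preserved.
print(s.dtype, s.iloc[2])
```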

dtype width == max integer width

Initial values [2 ** 63 - 10, 2 ** 63 - 100, 3].

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[int64]` | `10` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `np.int64(10)` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `1099511627776` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `np.int64(1099511627776)` | `np.dtype[int64]` [^8] | `np.dtype[int64]` [^9] | No | No |
| `np.dtype[int64]` | `1208925819614629174706176` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[int64]` | `10.5` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | Yes | Yes [^6] |
| `np.dtype[int64]` | `np.float64(10.0)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | No | Yes [^6] |
| `np.dtype[int64]` | `np.float64(10.5)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | Yes | Yes [^6] |
| `np.dtype[int64]` | `np.float32(10.0)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | No | Yes [^6] |
| `np.dtype[int64]` | `np.float32(10.5)` | `np.dtype[int64]` [^8] | `np.dtype[float64]` [^9] | Yes | Yes [^6] |

Float column dtypes

dtype width < max float width

Initial values [np.finfo(np.float32).max, np.float32(np.inf), 3]

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[float32]` | `10` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.int64(10)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `1099511627776` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.int64(1099511627776)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.int32(2147483548)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | Yes [^10] | No |
| `np.dtype[float32]` | `np.int64(9223372036854775708)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | Yes [^10] | Yes [^6] |
| `np.dtype[float32]` | `1208925819614629174706076` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[float32]` | `10.5` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.float64(10.0)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.float64(10.5)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | No | No |
| `np.dtype[float32]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float32]` [^8] | `np.dtype[float64]` [^9] | Yes [^8] | No |
| `np.dtype[float32]` | `np.float32(10.0)` | `np.dtype[float32]` [^8] | `np.dtype[float32]` [^9] | No | No |
| `np.dtype[float32]` | `np.float32(10.5)` | `np.dtype[float32]` [^8] | `np.dtype[float32]` [^9] | No | No |

dtype width == max float width

Initial values [np.finfo(np.float64).max, np.float64(np.inf), 3]

| Initial dtype | New value | Final dtype (trunk) | Final dtype (#11904) | Lossy? (trunk) | Lossy? (#11904) |
|---------------|-----------|---------------------|----------------------|----------------|-----------------|
| `np.dtype[float64]` | `10` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int64(10)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `1099511627776` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int64(1099511627776)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int32(2147483548)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.int64(9223372036854775708)` | `np.dtype[float64]` | `np.dtype[float64]` | Yes [^6] | Yes [^6] |
| `np.dtype[float64]` | `1208925819614629174706076` | OverflowError | OverflowError | N/A | N/A |
| `np.dtype[float64]` | `10.5` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float64(10.0)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float64(10.5)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float64(3.4028234663852886e+39)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float32(10.0)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |
| `np.dtype[float64]` | `np.float32(10.5)` | `np.dtype[float64]` | `np.dtype[float64]` | No | No |

Everything else

This is where it starts to get really messy. This section is a work
in progress. We should decide what we want the semantics to be,
because in most cases pandas doesn't have the same dtypes that CUDF does.

Inserting strings into numerical columns

This "works", for some value of "works" on #11904 if the string value
is parseable as the target dtype.

So

s = cudf.Series([1, 2, 3], dtype=int)
s.iloc[2] = "4" # works
s.iloc[2] = "0xf" # => ValueError: invalid literal for int() with base 10: '0xf'

And similarly for float strings and float dtypes.

This is probably a nice feature.

Inserting things into string columns

Works if the the "thing" is convertible to a string (so numbers work),
but Scalars with list or struct dtypes don't work.
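For example (an illustrative sketch):

```python
import cudf

s = cudf.Series(["a", "b", "c"])
s.iloc[0] = 1      # works: 1 is convertible to a string
print(s)
# List- or struct-typed scalars, by contrast, are rejected.
```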

I would argue that explicit casting from the user here is probably
better.

List columns

The new value must have an identical dtype to that of the target column.
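Illustrating that rule (a sketch with hypothetical values):

```python
import cudf

s = cudf.Series([[1, 2], [3]])   # list<int64> column
s.iloc[0] = [4, 5]               # accepted: same list<int64> dtype
s.iloc[0] = [4.5, 5.5]           # rejected: list<float64> is not list<int64>
```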

Struct columns

The new value must have leaf dtypes that are considered compatible in
some sense, but then the leaves are downcast to the leaf dtypes of the
target column. So this is lossy and likely a bug:

sr = cudf.Series([{"a": 1, "b": 2}])
sr.iloc[0] = {"a": 10.5, "b": 2}
sr[0]  # => {"a": 10, "b": 2} (lost data in "a")

What I think we want (for composite columns)

For composite columns, if the dtype shapes match, I think the casting
rule should be to traverse to the leaf dtypes and promote using the
rules for non-composite columns. If shapes don't match, __setitem__
should not be allowed.

This, to me, exhibits the principle of least surprise.
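A rough sketch of the rule I have in mind (a hypothetical helper; `np.promote_types` stands in for whatever leaf-level promotion rule we settle on):

```python
import numpy as np
import cudf


def promote_composite(col_dtype, val_dtype):
    """Recursively promote matching nested dtypes at their leaves.

    Raises if the dtype "shapes" (list/struct structure) differ, in which
    case __setitem__ would not be allowed.
    """
    if isinstance(col_dtype, cudf.ListDtype) and isinstance(val_dtype, cudf.ListDtype):
        return cudf.ListDtype(
            promote_composite(col_dtype.element_type, val_dtype.element_type)
        )
    if isinstance(col_dtype, cudf.StructDtype) and isinstance(val_dtype, cudf.StructDtype):
        if col_dtype.fields.keys() != val_dtype.fields.keys():
            raise TypeError("struct fields do not match; refuse to set")
        return cudf.StructDtype(
            {
                name: promote_composite(col_dtype.fields[name], val_dtype.fields[name])
                for name in col_dtype.fields
            }
        )
    if isinstance(col_dtype, (cudf.ListDtype, cudf.StructDtype)) or isinstance(
        val_dtype, (cudf.ListDtype, cudf.StructDtype)
    ):
        raise TypeError("dtype shapes do not match; refuse to set")
    # Leaf dtypes: promote with the non-composite rules (placeholder).
    return np.promote_types(col_dtype, val_dtype)
```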

Footnotes

[^1]: value is exact in the initial dtype
[^2]: next largest numpy type that contains the value
[^3]: not representable in a numpy type, so coercion to object column
[^4]: default float type is float64
[^5]: np.int32 is losslessly convertible to np.float64
[^6]: np.int64 is not losslessly convertible to np.float64
[^7]: value is not losslessly representable, but also, expecting np.float64!
[^8]: Bug fixed by "Fix type casting in Series.__setitem__" (#11904)
[^9]: CUDF doesn't inspect values, so type-based promotion (difference from pandas)
[^10]: As for 6, but promotion from np.int32 to np.float32 is also not lossless.

@wence- added the feature request, question, proposal, Needs Triage, pandas, and improvement labels on Nov 1, 2022
shwina commented Nov 2, 2022

What a fantastic analysis.

@mroeschke

Just a comment about the Lossy classification:

One of the factors of Lossy="Yes" is if:

expect = (np.asarray if pandas else cupy.asarray)(
    initial_values[:new_location], dtype=dtype
)
original_lost_info = (s.iloc[:new_location].astype(dtype) != expect).any()

And this example from your table had Lossy="Yes":

In [26]: ser = pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)

In [27]: ser
Out[27]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [29]: ser.iloc[2] = 10.5

In [32]: ser
Out[32]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

I think this is "lossy" because of an int64 -> float64 -> int64 conversion, not necessarily because there's a lossiness in value

In [47]: ser.iloc[0] == 2 ** 63 - 10
Out[47]: True

In [48]: ser.iloc[1] == 2 ** 63 - 100
Out[48]: True

In [49]: ser.iloc[2] == 10.5
Out[49]: True

In [50]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)
Out[50]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [51]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64).astype(np.float64).astype(np.int64)
Out[51]:
0   -9223372036854775808
1   -9223372036854775808
2                      3
dtype: int64

In [52]: np.array([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64).astype(np.float64).astype(np.int64)
Out[52]: array([-9223372036854775808, -9223372036854775808,                    3])

vyasr commented Nov 3, 2022

What a fantastic analysis.

This is easily one of the most detailed issues I have ever seen!

wence- commented Nov 4, 2022

Just a comment about the Lossy classification:

One of the factors of Lossy="Yes" is if:

expect = (np.asarray if pandas else cupy.asarray)(
    initial_values[:new_location], dtype=dtype
)
original_lost_info = (s.iloc[:new_location].astype(dtype) != expect).any()

And this example from your table had Lossy="Yes":

In [26]: ser = pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)

In [27]: ser
Out[27]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [29]: ser.iloc[2] = 10.5

In [32]: ser
Out[32]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

I think this is "lossy" because of an int64 -> float64 -> int64 conversion, not necessarily because there's a lossiness in value

Depends what you mean by lossy. float64 only has enough precision to exactly represent int53 (because the mantissa has 53 bits). Hence, mapping between int64 and float64 is neither surjective (you can't reach nan and inf) nor injective (multiple int64s can map to the same float64), so you can't roundtrip successfully.

In [36]: np.int64(2**63 - 10) == np.int64(2**63 - 100)
Out[36]: False

In [37]: np.int64(2**63 - 10).astype(np.float64) == np.int64(2**63 - 100).astype(np.float64)
Out[37]: True
In [47]: ser.iloc[0] == 2 ** 63 - 10
Out[47]: True

In [48]: ser.iloc[1] == 2 ** 63 - 100
Out[48]: True

In [49]: ser.iloc[2] == 10.5
Out[49]: True

These are all true because the right operand of equality is promoted to float64 before the comparison is performed.

In [50]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64)
Out[50]:
0    9223372036854775798
1    9223372036854775708
2                      3
dtype: int64

In [51]: pd.Series([2 ** 63 - 10, 2 ** 63 - 100, 3], dtype=np.int64).astype(np.float64).astype(np.int64)
Out[51]:
0   -9223372036854775808
1   -9223372036854775808
2                      3
dtype: int64

I characterise this as "lossy" because you didn't round-trip successfully. As noted, it isn't really a pandas issue per-se, but rather numpy (and thence promotion via C rules).

I note that numpy doesn't quite implement C/C++ promotion rules for integer and floating point types, since it widens the floating point type in mixed int/float addition.

C++:

typeof(1 + 1.0f) => float32
typeof(1L + 1.0f) => float32
typeof(1 + 1.0d) => float64
typeof(1L + 1.0d) => float64

These promotions are always in range, since max(float32) > max(int64), but lossy, since the mapping from integers to floating point types of the same width is not bijective.

numpy:

(np.int32(1) + np.float32(1)).dtype => float64
(np.int64(1) + np.float32(1)).dtype => float64
(np.int32(1) + np.float64(1)).dtype => float64
(np.int64(1) + np.float64(1)).dtype => float64

These promotions are injective (if we ignore nan and inf) if the integer type is narrower than the floating point type. (So the int32 promotion can be undone losslessly.)

These kinds of automatic promotions are basically a fight I lost before I was born in old programming languages, though some more modern ones force you to always explicitly cast so that you're aware that you might be doing A Bad Thing.

wence- commented Nov 4, 2022

Updated the code to generate tables for a bunch of different scenarios. You can now run it, and it produces a separate file for each scenario; you can diff the files against each other to see the points of difference.

Mostly things work (with #11904), except #12072 means that equality is busted.

And, slice-based setitem is broken for length-one ranges if setting with an array rather than a scalar (that's #12073).

seberg commented Nov 4, 2022

Sorry for the long post; I hope it doesn't derail things or go off topic. I will mainly try to explain the state of NumPy and what NEP 50 does. Happy to re-read again without that in mind!
(I will always make time to talk about these things, because just talking about it, even tangentially, helps me clear up things and move forward.)

N.B. please avoid np.find_common_type; it needs to be deprecated. np.result_type (for array inputs, or if you need the value-based or future NEP 50 behaviour) and np.promote_types are the weapons of choice.
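I.e., roughly:

```python
import numpy as np

# Purely type-based promotion:
print(np.promote_types(np.int64, np.float64))   # float64
# result_type additionally handles array/scalar inputs (and, in future, NEP 50):
print(np.result_type(np.int64, np.float32))     # float64
```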

First, there are three things that come together to understand the NumPy behavior:

  1. Integers are converted to long -> long long -> unsigned long long -> object by NumPy based on value. This is what happens if you do np.asarray(integer) and in many other contexts.
  2. Storing values may or may not go via np.asarray(), if it does not, NumPy will be more strict sometimes.
    In fact, I just deprecated storing out-of-bounds integers, e.g. to disallow np.asarray([-1], dtype=np.uint8). This will be part of the next NumPy release, unless someone complains (which could happen).
  3. Promotion/Common DType: i.e. what happens if you mix two dtypes
    • NumPy promotion rules are wonky: they try to be "safe", but make an exception by promoting int64 + float64 -> float64 (and worse, int64 + uint64 -> float64). The reason for this is that they fall out of the "safe casting" concept. I disagree with this nowadays; while related, I feel it is better to see promotion/common-dtype and safe/non-lossy casting as distinct. For numerical dtypes, I do agree that round-tripping correctly implies that promotion is acceptable. (NumPy is implemented that way now after NEP 42.)
      The C promotion rules, etc. have the advantage of being e.g. strictly associative (EDIT) while ours are not: (int16 + uint16) + float32 == int32 + float32 == float64 != float32 == (float32 + int16) + uint16 (I/NumPy has a heuristic so that np.result_type finds float32 if it is a single operation, it works always, but you can construct failures if you add new dtypes in the mix.)
    • Current promotion inspects values (of "scalars", including 0-D arrays, but not when all operands are scalars). This allows an integer to match any dtype that will fit its value, and floats also to be "down-promoted" so long as they are not approximately out of range.
      NEP 50 is the proposal to remove this. Instead, NEP 50 proposes to insert the Python integer, float, complex as undefined/lowest precision. For integers this means that np.uint8(1) + 300 will error, because we try to do uint8 + uint8 -> uint8 and that doesn't work (see the snippet after this list).
      NEP 50 tries to describe this; the JAX diagram also helps. This makes the Python scalars "weak".
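A small illustration of that last point (the printed result is current NumPy 1.x behaviour; the NEP 50 outcome is as described above):

```python
import numpy as np

# Value-based (legacy) promotion: the Python int 300 is inspected and the
# result is widened so that it fits.
print((np.uint8(1) + 300).dtype)   # int16 on NumPy 1.x

# Under NEP 50 the Python int is "weak": the operation is attempted as
# uint8 + uint8, and since 300 does not fit in uint8 it raises instead.
```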

There is discussion about lossy conversion of Python scalars, which touches something that I am currently struggling with in pushing NEP 50 forward. So I am bringing it up, because if you clearly want the notion of "lossy", it might inform how I do this for NEP 50 (it is not central to NEP 50, and doesn't have to be; but I need to settle on something).

For NumPy dtypes we have that defined through the "casting safety" (which can be "no", "equiv", "safe", "same kind", or "unsafe"). For values we don't really define it, and I don't think the categories quite work.
int64 -> int8 is same-kind, but 10000 -> int8 seems "unsafe", or is it "safe" but fails when you try it? Whether or not I should try to introduce the notion of "safety" for scalar assignments in NumPy is one of the things I am struggling with, because the notion of cast-safety doesn't seem to cleanly apply (but NumPy exposes it a lot).

seberg commented Nov 4, 2022

NEP 50 itself does not touch the annoying fact that np.array([2**63, 2**61]) returns a float array. But, since NEP 50 is likely a NumPy 2.0 change, I think it does make sense to stop doing that and only ever go to the default integer, otherwise raising an error.

@mroeschke

I think generally from the pandas point of view (with respect to __setitem__), I would say the general philosophy is "the resulting dtype should be able to store the existing values & new value(s) without losing numeric precision".

Here are some examples (related to your original table) where we differ from numpy due to the above:

1 . "Don't truncate so upcast dtype" example

In [6]: ser = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [7]: ser.iloc[2] = 10.5

In [8]: ser
Out[8]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

In [9]: np_arr = np.array([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [10]: np_arr[2] = 10.5

In [11]: np_arr
Out[11]: array([9223372036854775798, 9223372036854775708,                  10])

In [14]: print(np.__version__) ; print(pd.__version__)
1.23.1
2.0.0.dev0+577.g7e9ca6e8af
  1. "Bend over backwards to accommodate the values" examples
In [5]: ser = pd.Series([1, 2, 3], dtype=np.int64)

In [6]: ser.iloc[2] = 2**80

In [7]: ser
Out[7]:
0                            1
1                            2
2    1208925819614629174706176
dtype: object

In [8]: arr = np.array([1, 2, 3], dtype=np.int64)

In [9]: arr
Out[9]: array([1, 2, 3])

In [10]: arr[2] = 1208925819614629174706176
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 arr[2] = 1208925819614629174706176

OverflowError: Python int too large to convert to C long

wence- commented Nov 7, 2022

2. "Bend over backwards to accommodate the values" examples

Unless we want to implement bignums in libcudf, I think this is not something we can really contemplate.

wence- commented Nov 7, 2022

I think generally from the pandas point of view (with respect to __setitem__), I would say the general philosophy is "the resulting dtype should be able to store the existing values & new value(s) without losing numeric precision".

This is a reasonable philosophy, though it does rather depend on what you mean by "losing numeric precision".

1 . "Don't truncate so upcast dtype" example

In [6]: ser = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [7]: ser.iloc[2] = 10.5

In [8]: ser
Out[8]:
0    9.223372e+18
1    9.223372e+18
2    1.050000e+01
dtype: float64

This seems like a nice thing to do, except that it changes behaviour of the series:

In [22]: ser = pd.Series([2**63 - 10, 2**63 - 100, 3], dtype=np.int64)

In [23]: ser == ser.iloc[0]
Out[23]: 
0     True
1    False
2    False
dtype: bool

In [24]: ser.iloc[2] = 10.5

In [25]: ser == ser.iloc[0]
Out[25]: 
0     True
1     True
2    False
dtype: bool

I would argue that this choice did lose numeric precision, and particularly in existing values that weren't touched by the indexing.

seberg commented Nov 7, 2022

So pandas (and cudf) use common DType to promote a column when necessary. That leads to two things:

  • The special rule for going to float64 does lose precision for int64/uint64.
  • Pandas will even allow going to object dtype.

The first issue is independent of NEP 50 and I am not sure there is much to do about it. You could consider disallowing it. It probably means making a clearer distinction between "common DType" (for storing into a single column) and "promotion" (for operations).
That is, the promotion in float_arr + int64_arr -> float_arr (i.e. adding two arrays) seems fine, but maybe storing the two into one container is not?
I make the distinction in NumPy, but it is not convenient to apply it broadly. My gut feeling is that it is probably best to live with the way it is: yes, it loses precision, but it is usually OK.

For the second thing, NEP 50 might nudge towards making it an error in pandas: If you apply NEP 50 promotion rules, the column should keep whatever int dtype it has, and if the Python integer cannot be assigned it should just error. Auto-promoting to object dtype would then never work (which is also what will happen for np.int64(1) + 2**500).

@mroeschke

I would argue that this choice did lose numeric precision, and particularly in existing values that weren't touched by the indexing.

That's fair. I'm not 100% sure of the origin of promoting to float64 in pandas for this case, but I would venture to guess that the numpy behaviour of keeping int64 and truncating the 10.5 to 10 was a more surprising and more significant loss of precision for pandas users.

For the second thing, NEP 50 might nudge towards making it an error in pandas:

This is also fair. Maybe this is a case of pandas being too accommodating, but we can always nudge users to astype(object) in these cases.

wence- commented Nov 8, 2022

Not 100% sure of the origin of promoting to float64 in pandas for this case,

If you're going to do integral-to-floating type promotion and want to maintain accuracy in the largest number of cases (and don't care too much about memory footprint), then promoting to float64 is a reasonable choice.
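(A quick check of that, following the 53-bit-significand point earlier in the thread:)

```python
import numpy as np

x = np.int32(2**31 - 1)
print(np.int64(np.float64(x)) == x)   # True: every int32 survives a float64 round trip
y = np.int64(2**53 + 1)
print(np.int64(np.float64(y)) == y)   # False: rounds to 2**53
```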

@GregoryKimball added the 2 - In Progress label and removed the Needs Triage label on Nov 14, 2022
@GregoryKimball added the Python label on Nov 21, 2022
@vyasr removed the pandas label on May 16, 2024
@vyasr added this to cuDF Python on Nov 5, 2024