API: handling of missing values in Index.__contains__ #59765

jorisvandenbossche · 2024-09-09T18:58:55Z

The table below gives an overview of the result value for:

missing_value in idx

i.e. how Index.__contains__ handles various missing value sentinels as input for the different data types.

dtype	None	nan	<NA>	NaT
object-none	True	False	False	False
object-nan	False	True	False	False
object-NA	False	False	True	False
datetime	True	True	True	True
period	True	True	True	True
timedelta	True	True	True	True
float64	False	True	False	False
categorical	True	True	True	True
interval	True	True	True	False
nullable_int	False	False	True	False
nullable_float	False	False	True	False
string-python	False	False	False	False
string-pyarrow	False	False	False	False
str-python	False	False	False	False

The last three rows, with not a single True, are especially problematic; this seems to be a bug in StringDtype.

More generally, though, the behavior is quite inconsistent:

For object dtype, we require exact match
For datetimelike and categorical, we match any missing-like
For interval, we match any missing-like except NaT (not even for a datetimelike interval dtype)
For float we only match NaN
For nullable dtypes (int/float), we only match NA
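
When the question is simply whether an index holds any missing entry at all, one way to sidestep the sentinel-specific behavior listed above is to ask the index itself rather than probing with a particular sentinel. A minimal sketch, assuming the public `Index.hasnans` property is acceptable for that purpose; `has_any_missing` and the `examples` dict are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd


def has_any_missing(idx: pd.Index) -> bool:
    """Hypothetical helper: True if the index holds any missing-like entry."""
    # Index.hasnans is based on isna(), so it does not depend on which
    # sentinel (None, nan, <NA>, NaT) happens to be stored.
    return bool(idx.hasnans)


examples = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "no-missing": pd.Index([1, 2, 3]),
}

summary = {name: has_any_missing(idx) for name, idx in examples.items()}
print(summary)
```

This only answers whether some missing value is present, not which sentinel is stored, so it is not a drop-in replacement for `in`.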

The code to generate the table above:

import numpy as np
import pandas as pd

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan))
}

results = []

for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT]:
        res = val in data
        results.append((dtype, str(val), res))
        
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(columns=df["val"].unique(), index=df["dtype"].unique())

print(df_overview.astype(str).to_markdown())

cc @jbrockmendel I would have expected us to already have issues about this, but I didn't directly find anything


jbrockmendel · 2024-09-09T20:19:18Z

I'm not aware of a dedicated issue for this either. I think at one point I made a PR trying to make more of the EA subclasses use is_valid_na_for, but that got tabled pending the nan-vs-na topic.

For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. np.timedelta64("NaT") in my_datetimeindex should always be False). Also Decimal("NaN") should be handled correctly.

jorisvandenbossche · 2024-09-10T07:41:08Z

> For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. np.timedelta64("NaT") in my_datetimeindex should always be False).

Indeed, the np.timedelta64("NaT") and np.datetime64("NaT") only give True for timedelta/datetime index, respectively, and all other index dtypes return False for those, with one exception: categorical.
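
The matched versus mismatched NaT cases described here can be checked directly. A small sketch using the same index constructors as the script above; the variable names are illustrative only, and the results noted in the comments are those reported in this thread, which may differ across pandas versions:

```python
import numpy as np
import pandas as pd

dti = pd.DatetimeIndex(["2024-01-01", "NaT"])
tdi = pd.TimedeltaIndex(["1 days", "NaT"])

dt_nat = np.datetime64("NaT")
td_nat = np.timedelta64("NaT")

# NaT of the matching kind: reported as True in this thread.
matched = (dt_nat in dti, td_nat in tdi)

# NaT of the other kind: reported as False in this thread.
mismatched = (td_nat in dti, dt_nat in tdi)

print(matched, mismatched)
```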

> Also Decimal("NaN") should be handled correctly.

In the sense that it is not matched in general (again, except for categorical ..). But it also seems not to be matched for an object-dtype index that contains such a decimal: Decimal("NaN") in pd.Index([Decimal("2.0"), Decimal("NaN")], dtype=object) gives False.
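
A possible contributing factor for the object-dtype result is that two separate `Decimal("NaN")` objects never compare equal in plain Python, so an equality-based lookup has nothing to match. That explanation is an assumption and is not confirmed in this thread. The sketch below shows the comparison and, as one possible workaround, uses `pd.isna` to detect the entry instead (variable names are illustrative):

```python
from decimal import Decimal

import pandas as pd

idx = pd.Index([Decimal("2.0"), Decimal("NaN")], dtype=object)

# Two distinct Decimal NaN objects do not compare equal to each other.
nans_equal = Decimal("NaN") == Decimal("NaN")

# pd.isna flags missing-like entries elementwise, independently of `in`.
missing_mask = pd.isna(idx)
has_missing = bool(missing_mask.any())

print(nans_equal, has_missing)
```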

Expanded table:

dtype	None	nan	<NA>	NaT	np.datetime64('NaT')	np.timedelta64('NaT')	Decimal('NaN')
object-none	True	False	False	False	False	False	False
object-nan	False	True	False	False	False	False	False
object-NA	False	False	True	False	False	False	False
object-decimal-NaN	False	False	False	False	False	False	False
datetime	True	True	True	True	True	False	False
period	True	True	True	True	False	False	False
timedelta	True	True	True	True	False	True	False
float64	False	True	False	False	False	False	False
categorical	True	True	True	True	True	True	True
interval	True	True	True	False	False	False	False
nullable_int	False	False	True	False	False	False	False
nullable_float	False	False	True	False	False	False	False
string-python	False	False	False	False	False	False	False
string-pyarrow	False	False	False	False	False	False	False
str-python	False	False	False	False	False	False	False

import numpy as np
import pandas as pd
from decimal import Decimal

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "object-decimal-NaN": pd.Index(["a", Decimal("NaN")], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan))
}

results = []

for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT, np.datetime64("NaT"), np.timedelta64("NaT"), Decimal("NaN")]:
        res = val in data
        results.append((dtype, repr(val), res))
        
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(columns=df["val"].unique(), index=df["dtype"].unique())

print(df_overview.astype(str).to_markdown())

jorisvandenbossche added the Missing-data, Index, and API - Consistency labels Sep 9, 2024

jorisvandenbossche mentioned this issue Sep 23, 2024

BUG (string dtype): looking up missing value in the Index fails #59879

Closed

jorisvandenbossche mentioned this issue Nov 18, 2024

String dtype: use ObjectEngine for indexing for now correctness over performance #60329

Merged
