Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Series.__contains__ #1480

Merged
merged 10 commits into from
Dec 3, 2024
Merged

feat: add Series.__contains__ #1480

merged 10 commits into from
Dec 3, 2024

Conversation

FBruzzesi
Copy link
Member

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@github-actions github-actions bot added the enhancement New feature or request label Dec 1, 2024
def __contains__(self: Self, other: Any) -> bool:
if other is None:
return self._native_series.isna().any() # type: ignore[no-any-return]
return other in self._native_series
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fell for the example contained in our why section!

There are two ways of doing this:

  • return other in self.to_numpy(): yet I don't think this is a good idea for cudf
  • return self._native_series.isin({other}).any()

Copy link
Contributor

@AlessandroMiola AlessandroMiola Dec 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask a general, possibly silly, question, just for the sake of better understanding? 😇

I see the point in implementing __contains__ for pandas-like and arrow and also what you mention here. Instead, I haven't properly got the details behind #1443 (comment), as I don't see func(1, pa.chunked_array([data])) returning False.

Should we implement __contains__ at narwhals/series.py to ensure the above behaviour is explicitly triggered via __contains__ (provided that what I'm stating above is correct)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I don't see func(1, pa.chunked_array([data])) returning False.

Damn that's true... but now I don't know why, and I wonder what is being called. The following returns false:

1 in pa.chunked_array([[1,2,3]])
False

Should we implement contains at narwhals/series.py to ensure the above behaviour is explicitly triggered via contains?

I am addressing it in the PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn that's true... but now I don't know why, and I wonder what is being called.

Me neither, had issues in trying to figure it out

I am addressing it in the PR

yeah, I see it; was just asking to confirm my understanding as I couldn't indeed see where this was coming from, without the explicit reference to __contains__ 😃

Copy link
Member Author

@FBruzzesi FBruzzesi Dec 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, since we implement __iter__ the check works. Example:

class A:
    pass

class B:
    def __iter__(self):
        yield from [1,2,3]

0 in B(), 1 in B()
> (False, True)

1 in A()
> TypeError: argument of type 'A' is not iterable

When I wrote the comment, the result was false, and now it returns true due to the changes in #1471, lines. In fact before such change, iteration was returning arrow scalars

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks for the clarification!:)

Comment on lines 1025 to 1026
return maybe_extract_py_scalar( # type: ignore[no-any-return]
pc.is_in(pa.scalar(other, type=native_series.type), native_series),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if other is of a type not convertible to native_series.type, will this raise? I'm wondering if we should return False in such a case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like polars would also raise here:

In [2]: 'foo' in pl.Series([1,2,3])
---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[2], line 1
----> 1 'foo' in pl.Series([1,2,3])

File ~/scratch/.venv/lib/python3.12/site-packages/polars/series/series.py:1233, in Series.__contains__(self, item)
   1231 if item is None:
   1232     return self.has_nulls()
-> 1233 return self.implode().list.contains(item).item()

File ~/scratch/.venv/lib/python3.12/site-packages/polars/series/utils.py:106, in call_expr.<locals>.wrapper(self, *args, **kwargs)
    104     expr = getattr(expr, namespace)
    105 f = getattr(expr, func.__name__)
--> 106 return s.to_frame().select_seq(f(*args, **kwargs)).to_series()

File ~/scratch/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9138, in DataFrame.select_seq(self, *exprs, **named_exprs)
   9115 def select_seq(
   9116     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   9117 ) -> DataFrame:
   9118     """
   9119     Select columns from this DataFrame.
   9120
   (...)
   9136     select
   9137     """
-> 9138     return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)

File ~/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2029, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2027 # Only for testing purposes
   2028 callback = _kwargs.get("post_opt_callback", callback)
-> 2029 return wrap_df(ldf.collect(callback))

InvalidOperationError: is_in operation not supported for dtypes `str` and `list[i64]`

probably ok to just follow them here then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still, probably good to have a test to check it raises? we can leave the expression unification for a separate topic

Copy link
Member Author

@FBruzzesi FBruzzesi Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a test and make sure that same exception is raised 👌

Copy link
Member Author

@FBruzzesi FBruzzesi Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this is definitly wrong behaviour for arrow as the casting would convert from float to int, leading to the following:

import pyarrow as pa
import narwhals as nw

data = [100, 200, None]
s = nw.from_native(pa.table({"a": data}), eager_only=True)["a"]

100.3 in s
True

On the Exception side, I would not be sure how to catch the error for pandas as:

(pd.Series([1,2,3]) == "foo").any()
np.False_

In a way, I find polars behaviour more surprising

Copy link
Member

@MarcoGorelli MarcoGorelli Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh we could just dispatch to (s._compliant_series == other).any() for all backends 😄 that would still be better than the status quo

def __contains__(self: Self, other: Any) -> bool:
if other is None:
return self._native_series.isna().any() # type: ignore[no-any-return]
return self._native_series.isin({other}).any() # type: ignore[no-any-return]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this over return (self._native_series == other).any()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong feelings 😂 I guess I went for series.isin(...) as I initially fell for the other in series

Copy link
Member

@MarcoGorelli MarcoGorelli Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"fell for" 😄 love that

from a quick test, it looks a lot faster

In [13]: s = pd.Series(rng.integers(0, 10, size=1_000_000))

In [14]: %timeit s.isin({0}).any()
6.58 ms ± 687 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: %timeit (s == 0).any()
326 μs ± 45.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

from narwhals.expr import all_ as all # noqa: A004
from narwhals.expr import all_ as all
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok something must have gone wrong with the pre-commit sync :|

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 i'm so confused 😄

pc.is_in(other_, native_series),
return_py_scalar=True,
)
except (ArrowInvalid, ArrowNotImplementedError, ArrowTypeError) as exc:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be bare exception?

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @FBruzzesi ! and @AlessandroMiola for review

@MarcoGorelli MarcoGorelli merged commit e02685b into main Dec 3, 2024
24 checks passed
@FBruzzesi
Copy link
Member Author

Ok then 😁 thanks @MarcoGorelli! I though you were not so happy with it 🙈

@FBruzzesi FBruzzesi deleted the feat/series-contains branch December 3, 2024 12:10
@MarcoGorelli
Copy link
Member

ah sorry for giving that impression 😳

@MarcoGorelli
Copy link
Member

ah perhaps this comment i left didn't help

that would still be better than the status quo

by 'status quo' i meant the main branch, not the previous commit in this pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

enh: implement __contains__ for Series
3 participants