-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Series.__contains__
#1480
Conversation
narwhals/_pandas_like/series.py
Outdated
def __contains__(self: Self, other: Any) -> bool: | ||
if other is None: | ||
return self._native_series.isna().any() # type: ignore[no-any-return] | ||
return other in self._native_series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fell for the example contained in our why section!
There are two ways of doing this:
return other in self.to_numpy()
: yet I don't think this is a good idea for cudfreturn self._native_series.isin({other}).any()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I ask a general, possibly silly, question, just for the sake of better understanding? 😇
I see the point in implementing __contains__
for pandas-like and arrow and also what you mention here. Instead, I haven't properly got the details behind #1443 (comment), as I don't see func(1, pa.chunked_array([data]))
returning False
.
Should we implement __contains__
at narwhals/series.py
to ensure the above behaviour is explicitly triggered via __contains__
(provided that what I'm stating above is correct)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as I don't see func(1, pa.chunked_array([data])) returning False.
Damn that's true... but now I don't know why, and I wonder what is being called. The following returns false:
1 in pa.chunked_array([[1,2,3]])
False
Should we implement contains at narwhals/series.py to ensure the above behaviour is explicitly triggered via contains?
I am addressing it in the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn that's true... but now I don't know why, and I wonder what is being called.
Me neither, had issues in trying to figure it out
I am addressing it in the PR
yeah, I see it; was just asking to confirm my understanding as I couldn't indeed see where this was coming from, without the explicit reference to __contains__
😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, since we implement __iter__
the check works. Example:
class A:
pass
class B:
def __iter__(self):
yield from [1,2,3]
0 in B(), 1 in B()
> (False, True)
1 in A()
> TypeError: argument of type 'A' is not iterable
When I wrote the comment, the result was false, and now it returns true due to the changes in #1471, lines. In fact before such change, iteration was returning arrow scalars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks for the clarification!:)
narwhals/_arrow/series.py
Outdated
return maybe_extract_py_scalar( # type: ignore[no-any-return] | ||
pc.is_in(pa.scalar(other, type=native_series.type), native_series), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if other
is of a type not convertible to native_series.type
, will this raise? I'm wondering if we should return False
in such a case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like polars would also raise here:
In [2]: 'foo' in pl.Series([1,2,3])
---------------------------------------------------------------------------
InvalidOperationError Traceback (most recent call last)
Cell In[2], line 1
----> 1 'foo' in pl.Series([1,2,3])
File ~/scratch/.venv/lib/python3.12/site-packages/polars/series/series.py:1233, in Series.__contains__(self, item)
1231 if item is None:
1232 return self.has_nulls()
-> 1233 return self.implode().list.contains(item).item()
File ~/scratch/.venv/lib/python3.12/site-packages/polars/series/utils.py:106, in call_expr.<locals>.wrapper(self, *args, **kwargs)
104 expr = getattr(expr, namespace)
105 f = getattr(expr, func.__name__)
--> 106 return s.to_frame().select_seq(f(*args, **kwargs)).to_series()
File ~/scratch/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9138, in DataFrame.select_seq(self, *exprs, **named_exprs)
9115 def select_seq(
9116 self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
9117 ) -> DataFrame:
9118 """
9119 Select columns from this DataFrame.
9120
(...)
9136 select
9137 """
-> 9138 return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)
File ~/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2029, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
2027 # Only for testing purposes
2028 callback = _kwargs.get("post_opt_callback", callback)
-> 2029 return wrap_df(ldf.collect(callback))
InvalidOperationError: is_in operation not supported for dtypes `str` and `list[i64]`
probably ok to just follow them here then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still, probably good to have a test to check it raises? we can leave the expression unification for a separate topic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a test and make sure that same exception is raised 👌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this is definitly wrong behaviour for arrow as the casting would convert from float to int, leading to the following:
import pyarrow as pa
import narwhals as nw
data = [100, 200, None]
s = nw.from_native(pa.table({"a": data}), eager_only=True)["a"]
100.3 in s
True
On the Exception side, I would not be sure how to catch the error for pandas as:
(pd.Series([1,2,3]) == "foo").any()
np.False_
In a way, I find polars behaviour more surprising
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh we could just dispatch to (s._compliant_series == other).any()
for all backends 😄 that would still be better than the status quo
narwhals/_pandas_like/series.py
Outdated
def __contains__(self: Self, other: Any) -> bool: | ||
if other is None: | ||
return self._native_series.isna().any() # type: ignore[no-any-return] | ||
return self._native_series.isin({other}).any() # type: ignore[no-any-return] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this over return (self._native_series == other).any()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong feelings 😂 I guess I went for series.isin(...)
as I initially fell for the other in series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"fell for" 😄 love that
from a quick test, it looks a lot faster
In [13]: s = pd.Series(rng.integers(0, 10, size=1_000_000))
In [14]: %timeit s.isin({0}).any()
6.58 ms ± 687 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: %timeit (s == 0).any()
326 μs ± 45.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
from narwhals.expr import all_ as all # noqa: A004 | ||
from narwhals.expr import all_ as all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok something must have gone wrong with the pre-commit sync :|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 i'm so confused 😄
pc.is_in(other_, native_series), | ||
return_py_scalar=True, | ||
) | ||
except (ArrowInvalid, ArrowNotImplementedError, ArrowTypeError) as exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be bare exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @FBruzzesi ! and @AlessandroMiola for review
Ok then 😁 thanks @MarcoGorelli! I though you were not so happy with it 🙈 |
ah sorry for giving that impression 😳 |
ah perhaps this comment i left didn't help
by 'status quo' i meant the main branch, not the previous commit in this pr |
What type of PR is this? (check all applicable)
Related issues
__contains__
for Series #1469Checklist
If you have comments or can explain your changes, please do so below