Question: idiomatic way of elegantly retrieving the underlying DataFrame type #1443

elephaint · 2024-11-25T08:45:49Z

Currently I often have the following code:

df_nw = nw.from_native(df)
is_pandas = nw.dependencies.is_pandas_dataframe(df)
.....
if is_pandas:
     do something different
......

What is difficult about this, is that I need to keep track of is_pandas variables throughout the code, send them in subfunctions, etc. If I have multiple DataFrames, I have multiple such is_pandas variables. Ideally, I'd be able to do something such as:

is_pandas = df_nw.is_native_pandas

i.e., having whether the underlying dataframe is pandas or not simply as a boolean attribute of the Narwhals DataFrame. That would allow me to use df_nw everywhere without requiring the auxiliary variables everywhere or first converting to native.

Of course, I know I can also do this everywhere: nw.dependencies.is_pandas_dataframe(df_nw.to_native())) but that feels convoluted.

What is the cleanest way to do this?

The text was updated successfully, but these errors were encountered:

FBruzzesi · 2024-11-25T09:09:04Z

Hey @elephaint , thanks for your request. This can certainly be a pain point for other libraries trying to adopt narwhals.

I would say that the answer is it depends.

We have a set of functionalities, namely maybe_align_index, maybe_get_index , maybe_set_index, maybe_reset_index and maybe_convert_dtypes, which are meant to help working with pandas objects without having to manually check all the times.

If that's not enough, nw.dependencies.is_pandas_dataframe(df_nw.to_native())) is an option, another one would be df_nw._compliant_frame._implementation is Implementation.PANDAS but it doesn't look less convoluted to me.

In plotly express, I had to do something similar, by adding a flag is_pd_like on the first encounter of the dataframe object, and passing that to various functions to branch out the logic.

MarcoGorelli · 2024-11-25T09:16:39Z

Thanks for the request!

I think currently the two documented way would be:

if nw.get_native_namespace(df) is nw.dependencies.get_pandas()
if nw.dependencies.is_pandas_dataframe(df.to_native())

I can see that it would be convenient to have something more ergonomic... 🤔 will think about this one. Thanks for having highlighted this

another one would be df_nw._compliant_frame._implementation is Implementation.PANDAS but it doesn't look less convoluted to me.

wait, this would be highly risky as it involves using private methods which may change at any time 😉 Better to stick with the public API, which we make some stability guarantees about

elephaint · 2024-11-25T12:43:30Z

Thanks for the discussion!

I think for now I'll go with nw.dependencies.is_pandas_dataframe(df.to_native())

Just to be clear - this is really a 'nice to have' but by no means very important to me, so don't make something crazy complex over this 😛

FBruzzesi · 2024-11-25T13:02:54Z

Just out of curiosity for now, could you point to such example that requires branching a specific path for pandas?

dangotbanned · 2024-11-30T11:22:28Z

Just out of curiosity for now, could you point to such example that requires branching a specific path for pandas?

@FBruzzesi I haven't had to branch for pandas yet, but I have two examples for pyarrow.

They both relate to extracting scalar(s):

Although for these, I'd be happy with an API like nw.to_py_scalar but for nw.Series (and maybe nw.DataFrame)

MarcoGorelli · 2024-11-30T11:27:34Z

thanks for sharing

Although for these, I'd be happy with an API like nw.to_py_scalar but for nw.Series (and maybe nw.DataFrame)

sorry not sure i follow, could you clarify please?

FBruzzesi · 2024-11-30T11:38:45Z

+1 on Marco question, I am not sure I get the point.

but I have two examples for pyarrow

I share the pain of working with pyarrow objects.

For your specific case here, I think we should implement __contains__ operator using native functionalities and without having to convert to python lists. Currently:

import narwhals as nw
import pandas as pd
import polars as pl
import pyarrow as pa

def func(value, series):
    return value in nw.from_native(series, series_only=True)

data = [1,2,3]
func(1, pd.Series(data))  # True
func(1, pl.Series(data))  # True
func(1, pa.chunked_array([data]))  # False

dangotbanned · 2024-11-30T11:44:43Z

thanks for sharing

Although for these, I'd be happy with an API like nw.to_py_scalar but for nw.Series (and maybe nw.DataFrame)

sorry not sure i follow, could you clarify please?

Sure @MarcoGorelli

So for the pandas and polars paths in these examples:

# 1
nw.DataFrame.item(0, ...) -> PythonDataType
# 2
nw.Series.__iter__() -> Iterator[PythonDataType]

But with pyarrow:

# 1 needs an extra nw.to_py_scalar
# 2 instead uses nw.Series.to_list, which is an eager version of what I'm after

MarcoGorelli · 2024-11-30T13:08:12Z

thanks!

for #1 - yup, we could return Python scalars for all PyArrow aggregations (like Polars) does, it would just require a bit of a refactor

for #2, I think you'd be better off using to_list for all backends. Polars example:

In [20]: s
Out[20]:
shape: (10_000_000,)
Series: '' [i64]
[
        4
        5
        7
        9
        0
        …
        8
        8
        2
        9
        4
]

In [21]: %timeit _ = set(s)
327 ms ± 40.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [22]: %timeit _ = set(s.to_list())
164 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Though both solutions are eager so I'm not sure I understand "which is an eager version of what I'm after"

MarcoGorelli · 2024-11-30T13:30:39Z

yup, we could return Python scalars for all PyArrow aggregations (like Polars) does, it would just require a bit of a refactor

actually, it's not as bad as I was expecting...gonna make a PR 😎

dangotbanned · 2024-11-30T14:36:35Z

for #2, I think you'd be better off using to_list for all backends. Polars example:

In [21]: %timeit _ = set(s)
327 ms ± 40.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [22]: %timeit _ = set(s.to_list())
164 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Though both solutions are eager so I'm not sure I understand "which is an eager version of what I'm after"

Thanks @MarcoGorelli, good call on the to_list suggestion!

Seems I forgot about the performance impact 🤦‍♂️ - as I knew this back in vega/altair#3501 (comment)

Always fun to learn a refactor can be simpler and more performant 😄

MarcoGorelli · 2024-11-30T16:15:35Z

actually, it's not as bad as I was expecting...gonna make a PR 😎

done, available in Narwhals 1.15.0 🚀 #1471

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

MarcoGorelli · 2024-12-06T17:03:44Z

Discussed today - we could expose DataFrame.implementation / Series.implementation / LazyFrame.implementation, and nw.Implementation, so that users can do

if df.implementation is nw.Implementation.PANDAS:
    ...

We could also implement Implementation.is_pandas_like so that users could do:

if df.implementation.is_pandas_like():

🤔 at which point, we could even have Implementation.is_pandas / Implementation.is_polars / etc...

I think we should do this, both to make code more readable, and because the current approach is error-prone #1437

MarcoGorelli added enhancement New feature or request needs discussion labels Nov 25, 2024

MarcoGorelli mentioned this issue Nov 30, 2024

enh: implement __contains__ for Series #1469

Closed

AlessandroMiola mentioned this issue Dec 1, 2024

feat: add Series.__contains__ #1480

Merged

10 tasks

dangotbanned added a commit to vega/altair that referenced this issue Dec 2, 2024

perf: Use to_list() for all backends

d20e9c1

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

MarcoGorelli added high priority and removed needs discussion labels Dec 6, 2024

MarcoGorelli self-assigned this Dec 7, 2024

MarcoGorelli mentioned this issue Dec 7, 2024

feat: Expose nw.Implementation, along with Implementation.is_pandas, Implementation.is_pandas_like, and more #1531

Merged

10 tasks

MarcoGorelli closed this as completed in #1531 Dec 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: idiomatic way of elegantly retrieving the underlying DataFrame type #1443

Question: idiomatic way of elegantly retrieving the underlying DataFrame type #1443

elephaint commented Nov 25, 2024 •

edited

Loading

FBruzzesi commented Nov 25, 2024

MarcoGorelli commented Nov 25, 2024

elephaint commented Nov 25, 2024

FBruzzesi commented Nov 25, 2024

dangotbanned commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024

FBruzzesi commented Nov 30, 2024 •

edited

Loading

dangotbanned commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024

dangotbanned commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024 •

edited

Loading

MarcoGorelli commented Dec 6, 2024 •

edited

Loading

Question: idiomatic way of elegantly retrieving the underlying DataFrame type #1443

Question: idiomatic way of elegantly retrieving the underlying DataFrame type #1443

Comments

elephaint commented Nov 25, 2024 • edited Loading

FBruzzesi commented Nov 25, 2024

MarcoGorelli commented Nov 25, 2024

elephaint commented Nov 25, 2024

FBruzzesi commented Nov 25, 2024

dangotbanned commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024

FBruzzesi commented Nov 30, 2024 • edited Loading

dangotbanned commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024

dangotbanned commented Nov 30, 2024

MarcoGorelli commented Nov 30, 2024 • edited Loading

MarcoGorelli commented Dec 6, 2024 • edited Loading

elephaint commented Nov 25, 2024 •

edited

Loading

FBruzzesi commented Nov 30, 2024 •

edited

Loading

MarcoGorelli commented Nov 30, 2024 •

edited

Loading

MarcoGorelli commented Dec 6, 2024 •

edited

Loading