
PDEP-9: Allow third-party projects to register pandas connectors with a standard API #51799

Merged
merged 21 commits into pandas-dev:main on Jun 13, 2023

Conversation

datapythonista
Member

@pandas-dev/pandas-core another PDEP. There is a short summary at the beginning of the proposal for people who don't immediately have the time to read the whole document.

@datapythonista added the PDEP pandas enhancement proposal label on Mar 5, 2023
make testing of them much more reliable and resistant to changes in pandas, and
allow connectors to be reused by other projects of the ecosystem.

In case a `from_` method returned something different than a PyArrow table,
Member Author

@ritchie46 it would be great to get your feedback on the whole proposal, but in particular on this part. Assuming Polars wants to benefit from generic readers and writers that can work with any dataframe project, what do you think about a PyArrow object? Since Polars uses Arrow2 instead, would it make sense to have the readers/writers just expose the information needed to access the lower-level structs that the Arrow standard provides to exchange data? Not sure what the best option would be here.


Pyarrow objects are perfect. We adhere to the Arrow memory specification; the actual implementation of that specification doesn't really matter.

So yeah, pyarrow.table is good. 👍

Member Author

Thanks @ritchie46, that sounds great. Just to be clear, my concern wasn't that you couldn't read Arrow objects, but that you may not want to have PyArrow as a dependency to be able to use this API, and I think returning a PyArrow table forces that. PyArrow has more than 30 dependencies (at least in conda-forge), some of them not so small, like libarrow itself, or libgoogle-cloud.

I guess PyArrow may be the best to start, but I wonder if maybe a minimal Python Arrow library that can hold the metadata to access the Arrow data, and then be converted to the PyArrow or Arrow2/Polars structure would be better for the long term.


Just a small plug for 'nanoarrow', which just had a 0.1 release, has R bindings, and could have Python bindings (e.g., apache/arrow-nanoarrow#117).

Member Author

This is amazing, thanks for sharing @paleolimbot. I think this is exactly what we need, way better than using PyArrow and requiring it as a dependency for this connector subsystem! Just in time, let me know if at some point you need feedback or some testing of the Python bindings when they're ready. :)

from_duckdb = "pandas_duckdb:from_duckdb"
```

On import of the pandas module, it would read the entrypoint registry for the
Member Author

This would make import pandas slower, and I'm not sure it can be done in a lazy way. For people using autocomplete when coding, for example, I think DataFrame.io should be populated before __getattr__ is called. But maybe it can just be populated when either __dir__ or __getattr__ is first called, and we get the best of both worlds? Or is that too complex?
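Something like this rough sketch is what I have in mind (the class and names are hypothetical, just to illustrate the lazy idea; only the `dataframe.io` entrypoint group comes from the proposal):

```python
# Rough sketch only: defer the entrypoint scan until the io namespace is first used.
from functools import cached_property
from importlib.metadata import entry_points


class IONamespace:
    def __init__(self, df):
        self._df = df

    @cached_property
    def _connectors(self):
        # The entrypoint registry is only read here, not at "import pandas".
        return {ep.name: ep for ep in entry_points(group="dataframe.io")}

    def __dir__(self):
        # Autocomplete triggers the one-off scan and sees the connector names.
        return sorted(self._connectors)

    def __getattr__(self, name):
        if name in self._connectors:
            # The connector (and its dependencies) is only imported on first use.
            return self._connectors[name].load()
        raise AttributeError(name)
```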

Member

any idea how big this slowdown would be?

Member Author

I guess it can be minimal if the user has just a couple of connectors with lightweight dependencies installed. But there could be a situation where dozens of connectors exist, and some of them are slow to load (maybe they have slow imports in their own dependencies, or other reasons). Unless connectors implement lazy imports, loading all connectors, including their dependencies, would happen at import pandas.

@rhshadrach (Member) left a comment

The current implementation allows us some control over these methods so that there is a more consistent user experience between different engines. By having third parties solely responsible for the connector, I believe it is then completely out of our hands, which could make for a worse user experience.

This is making me think that for some important methods (to_excel in particular), the long term plan should be to leave the implementation within pandas (but moved to .io).

Comment on lines 165 to 168
`dataframe.io` group, and would dynamically create methods in the `DataFrame.io`
namespace for them. Method names would only be allowed to start with `from_` or
`to_`, and any other prefix would make pandas raise an exception. This would
guarantee a reasonably consistent API among third-party I/O connectors.
Member

How would conflicts work if there are two packages that want to register to_widget?

Member Author

Very good point. A conflict would only happen if a user has both packages installed, so I guess in most cases it wouldn't be a problem, even if two authors decide to use the same name.

I guess entrypoints don't require names to be unique, so I think it's up to us to decide. I'd personally raise an exception and let the user know that they installed two packages providing the same method, and that they should uninstall one to disambiguate.
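A rough sketch of the kind of check I mean when scanning the entrypoint group (illustrative only, the function name is made up):

```python
from collections import Counter
from importlib.metadata import entry_points


def _load_io_connectors(group="dataframe.io"):
    eps = list(entry_points(group=group))
    duplicated = [name for name, count in Counter(ep.name for ep in eps).items() if count > 1]
    if duplicated:
        # Two installed packages claim the same method name; ask the user to
        # uninstall one of them to disambiguate.
        raise ImportError(f"conflicting pandas I/O connectors registered for: {duplicated}")
    return {ep.name: ep.load() for ep in eps}
```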

Comment on lines 252 to 254
The scope of the current proposal is limited to the addition of the
`DataFrame.io` namespace, and the automatic registration of functions defined
by third-party projects, if an entrypoint is defined.
Member

I agree on making this out of scope for the proposal itself, but I think it's also natural to discuss what the next steps may be when reasoning about this change. So I hope it's okay to still discuss the steps beyond this here.

Member Author

Absolutely. I think it makes more sense to make a final decision on possible migrations at a later point, when we have more information about this new system. But I fully agree that it's worth discussing, and having an idea of other people's expectations regarding this.

To me, some of the current functions, such as read_sas, read_spss, read_gbq, to_xarray and maybe others, could be moved to this API as soon as the new system is in place, making the current API raise a FutureWarning to let users know about the change.

For the more commonly used ones, I don't have a strong opinion. Maybe make them available via the new API fairly soon too, for users who want a more consistent API, but wait before adding any FutureWarning?

@datapythonista
Member Author

The current implementation allows us some control over these methods so that there is a more consistent user experience between different engines. By having third parties solely responsible for the connector, I believe it is then completely out of our hands, which could make for a worse user experience.

That's an interesting point. I personally think that the current proposal already guarantees consistent behavior, leaving only the parameters up to the developers of the connectors. This would prevent things like using the builder/fluent pattern, which, if I'm not wrong, PySpark is using, and which afaik users seem to like. Talking about something like .from_csv(fname).set_delimiter(',').ignore_header(True).... I'm not sure it's a good thing to prevent developers from implementing this kind of API. Maybe I'd actually let them return a lazy builder too.
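For illustration, a minimal sketch of what such a lazy builder could look like (the class and method names are hypothetical, not part of the proposal):

```python
import pandas as pd


class LazyCSVReader:
    """Hypothetical fluent reader: nothing is read until load() is called."""

    def __init__(self, fname):
        self._fname = fname
        self._delimiter = ","
        self._ignore_header = False

    def set_delimiter(self, delimiter):
        self._delimiter = delimiter
        return self  # returning self is what enables chaining

    def ignore_header(self, flag=True):
        self._ignore_header = flag
        return self

    def load(self):
        # Only here does any I/O actually happen.
        return pd.read_csv(
            self._fname,
            sep=self._delimiter,
            header=None if self._ignore_header else "infer",
        )


# df = LazyCSVReader("data.csv").set_delimiter(";").ignore_header().load()
```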

Do you have a particular example that can show the kind of inconsistencies you're worried about? I find it difficult to understand the problem without an example.

This is making me think that for some important methods (to_excel in particular), the long term plan should be to leave the implementation within pandas (but moved to .io).

Not a strong opinion, but in my mind, in the long term I'd only have in pandas the connectors provided by PyArrow. That doesn't mean we can't make an exception for Excel. I don't consider it so important, but of course that depends on the domain of each one. But the main reason I'd move it out is because there are 4 different engines, and I think each of them should be a separate project, so users could simply install whichever is more convenient for them. I don't think keeping 4 or more Excel connectors in pandas is desirable, and I wouldn't make a choice on which is the "good" one.

As said, this is for the long term; I wouldn't make any change to Excel for now. Does it make sense to you? Or what would be your preferred option for the long term?

`to_`, and any other prefix would make pandas raise an exception. This would
guarantee a reasonably consistent API among third-party I/O connectors.

Connectors would use Apache Arrow as the only interface to load data from and
Member

i get that you're excited about pyarrow support, but i am not on board with getting rid of numpy support

Member Author

I didn't phrase this properly. I'll rephrase, but let me elaborate here for now. The idea is that third-party connectors produce an Arrow object, so we don't need to worry about them using internal pandas APIs to build their data directly as pandas dataframes, APIs that we may later change and break their code. But the fact that they communicate with us providing data as Arrow doesn't mean that when we then create the dataframe from it, we need to use Arrow as the backend. I think for now it makes more sense to build the dataframes with the stable types as we've been doing (unless the user sets the option dtype_backend to pyarrow, as discussed in #51433). Sorry this isn't clear in the proposal. Does this make sense?
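To make that concrete, a small sketch of the boundary I have in mind, assuming the connector hands pandas a pyarrow.Table (the connector function is a made-up stand-in; the conversion calls are standard PyArrow/pandas API):

```python
import pandas as pd
import pyarrow as pa


def from_myformat(path) -> pa.Table:
    # Stand-in for a third-party connector: it only needs to produce Arrow data,
    # it never touches pandas internals.
    return pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})


table = from_myformat("data.myformat")

# pandas keeps building the DataFrame with the usual numpy-backed dtypes...
df_default = table.to_pandas()

# ...unless the user opts into dtype_backend="pyarrow", in which case the
# Arrow memory can be kept via ArrowDtype-backed columns.
df_arrow = table.to_pandas(types_mapper=pd.ArrowDtype)
```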

@jorisvandenbossche
Member

This is making me think that for some important methods (to_excel in particular), the long term plan should be to leave the implementation within pandas (but moved to .io).

... But the main reason I'd move it out is because there are 4 different engines, and I think each of them should be a separate project, so users could simply install whichever is more convenient for them. I don't think keeping 4 connectors or more for Excel in pandas is desirable, and I wouldn't make a choice on which is the "good" one.

Excel is actually an interesting example: first, AFAIK those 4 engines are not about "choice" (for reading at least), but each of them implements different file formats, so which engine gets used depends on the type of file you are reading.

But that brings up the issue that if each of them were implemented in a separate project, they couldn't all use the name "excel" (for OpenDocument spreadsheets it probably makes sense to use another name, but for .xls vs .xlsx vs .xlsb files less so).
Of course, the ones that would want to use the same name could also be combined in a single external project.

(Sidenote: according to our user survey, at the time in 2019, excel was the second most used IO method after CSV. I suppose that since then other formats like Parquet will have become more popular, but I assume excel is still one of the most used methods)


In some cases, the pandas API provides `DataFrame.to_*` methods that are not
used to export the data to a disk or memory object, but instead to transform
the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.
Member

related: I've recently been thinking that DataFrame/Series methods that only operate on the index/columns might make sense to put in an accessor/namespace

live under that namespace.

Third-party packages would implement a
[setuptools entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins)
Member

Small nit: entrypoints are not (or no longer) setuptools specific, a possible general reference: https://packaging.python.org/en/latest/specifications/entry-points/ or https://docs.python.org/3/library/importlib.metadata.html

Member Author

Good point, I'll refer to it simply as entrypoint or Python entrypoint.

@datapythonista
Member Author

But so that brings up the issue that if each of them would be implemented in a separate project, they can't all use the name "excel" (for OpenDocument spreadsheets it probably make sense to use another name, but for .xls vs .xlsx vs .xlsb files less so). Of course, the ones that would want to use the same name could also be combined in a single external project.

I don't know much about the Excel connectors to be honest, but very good point regardless of the example. We could in theory solve it the same way we've solved it until now, which is having two different levels: the format and the engine. The problem with that (which this approach would make way more visible) is that all engines of a format must have exactly the same parameters. Which we've been doing so far, but of course if anyone can implement connectors, we can't enforce that anymore (unless we implement every format signature, which doesn't seem a good idea at all). The best idea I've got is that connectors find different names. Excel actually seems to be an easy case, using from_xls, from_xlsx,... even more so if the implementations being used are different. But for csv we already have 3 engines for the same format, so something like from_csv_pyarrow, from_csv_c, from_csv_python or whatever would have to be used instead (of course we could use from_csv for one of them). Clearly not the most beautiful, but having the name explicit is good in some ways (users are forced to know there are different options, the code is clear on what is being used...).

Surely happy to hear other ideas.

(Sidenote: according to our user survey, at the time in 2019, excel was the second most used IO method after CSV. I suppose that since then other formats like Parquet will have become more popular, but I assume excel is still one of the most used methods)

Yes, very good point indeed. Not important for me, not even sure if I ever used it, but obviously important in general.

@jbrockmendel
Member

Unless we want to move pd.read_foo to become DataFrame methods (IIRC there used to exist aliases there and we deprecated them), I think it would make sense for read_foo methods to get pinned in the pd.io namespace rather than DataFrame.io. Not a strong opinion though.

@datapythonista
Member Author

datapythonista commented Mar 6, 2023

Unless we want to move pd.read_foo to become DataFrame methods (IIRC there used to exist aliases there and we deprecated them) i think it would make sense for read_foo methods to get pinned in the pd.io namespace rather than DataFrame.io. Not a strong opinion though.

To me a single namespace seems simpler, and having import functions as DataFrame methods feels natural (also having them as pandas module functions). But I'm open to this, and also open to discussing other possible APIs. Not sure if it's crazy, but we could consider something like:

import pandas

df = pandas.read('csv').engine('pyarrow').params(fname='dataset.csv', delimiter=';')

df.write('parquet').engine('fastparquet').params(fname='dataset.parquet')

This would have some advantages, like not needing an accessor and avoiding naming conflicts when more than one engine exists. It's clearly more verbose and more difficult to get used to coming from the current API, but worth considering this API too, along with any others people can think of.

UPDATE: I initially used pandas.from() and df.to in the example, but from is a reserved word in Python, so this should be something else, like the read/write I use now.

UPDATE 2: PySpark uses something similar to this; maybe it's nicer to use the format as the function name (we would still need to figure out how to set the engine in case disambiguation is needed):

spark_session.read.parquet("/tmp/output/people.parquet")

@jreback
Contributor

jreback commented Mar 6, 2023

these builder patterns are way too verbose; -1 on anything but read_xxx and to_xxx

we don't need to disrupt the world - changing apis should receive very careful consideration.

@Dr-Irv (Contributor) left a comment

There is a downside here from the perspective of providing typing support for users. Once we migrate the pandas standard connectors to the DataFrame.io namespace, we could provide typing stubs for those methods. But for any third-party library, you can't inject their stubs into the DataFrame.io namespace. (We currently have the same issue if people want to create custom accessors.)

A possible way around this is to define a base class PandasIO that defines the protocol for reading/writing data, and instead of doing DataFrame.io.read_foo(), you write a class called FooIO with a base class of PandasIO and then have something like:

df = FooIO.read(filename, other_args)
FooIO.write(df, filename, other_args)

The advantage of this from a typing perspective is that the library owners of FooIO can define the stubs for read() and write() and users get the advantage of typing support.

This also avoids issues related to adding methods to the DataFrame.io namespace, and possible naming conflicts. It also allows competing implementations for the same type to be implemented. Finally, there would be no penalty for import pandas. You'd only import the specific library for the I/O parts you needed.

## PDEP Summary

This document proposes that third-party projects implementing I/O or memory
connectors, can register them using Python's entrypoint system, and make them
Contributor

remove comma

Member Author

Thanks for reporting all the typos/grammar corrections. Very helpful, I fixed them all!


pandas supports importing and exporting data from different formats using
I/O connectors, currently implemented in `pandas/io`, as well as connectors
to in-memory structure, like Python structures or other library formats.
Contributor

"structures" not "structure"

[2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html).
Those less popular formats include SPSS, SAS, Google BigQuery and
Stata. Note that only I/O formats (and not memory formats like records or xarray)
where included in the survey.
Contributor

"were included", not "where included"

to be used, or not well supported.
- There is no standard API for I/O connectors, and users of them need to learn
each of them individually.
- Method chaining, is not possible with third-party I/O connectors to export
Contributor

remove comma


Connectors would use Apache Arrow as the only interface to load data from and
to pandas. This would prevent that changes to the pandas API affecting
connectors in any way. This would simplify the development of connectors,
Contributor

I don't understand this sentence: "This would prevent that changes to the pandas API affecting
connectors in any way. "

Member Author

I rephrased the whole paragraph, thanks for letting me know.

`DataFrame.io` namespace, and the automatic registration of functions defined
by third-party projects, if an entrypoint is defined.

Any changes to the current I/O of pandas are out of scope for this proposal,
Contributor

Maybe be more explicit here that it is not just "current I/O", but includes methods like to_json() and from_json(), etc. So you could say that migrations of the current API to the DataFrame.io namespace are not part of this PDEP.

Member Author

Rephrased, I think it should be clearer now.

@WillAyd
Member

WillAyd commented Mar 6, 2023

Instead of adding the io accessor, could we not maintain a function registry for read / write functions? That is how pyarrow / arrow, for instance, allows you to bring your own functions. I think something like:

import foolib
import pandas as pd

pd.register_read_function("foo", foolib.read_dataframe)
pd.read_foo("afile.foo")

maintains the read_ / write_ interfaces Jeff is looking for.
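For reference, a minimal sketch of how such a registry could be wired up (illustrative only; register_read_function is not an existing pandas API, and the module-level __getattr__ is just one way to expose the registered readers):

```python
# Hypothetical registry, conceptually living in the pandas package itself.
from typing import Any, Callable

_read_registry: dict[str, Callable[..., Any]] = {}


def register_read_function(name: str, func: Callable[..., Any]) -> None:
    """Register a third-party reader, later exposed as read_<name>."""
    if name in _read_registry:
        raise ValueError(f"a reader named {name!r} is already registered")
    _read_registry[name] = func


def __getattr__(attr: str) -> Callable[..., Any]:
    # PEP 562 module-level __getattr__: pd.read_foo resolves to the registered
    # function without defining a method per format up front.
    prefix = "read_"
    if attr.startswith(prefix) and attr[len(prefix):] in _read_registry:
        return _read_registry[attr[len(prefix):]]
    raise AttributeError(f"module has no attribute {attr!r}")
```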

I am -1 on the import registry proposal described here - I think that pattern is too much of a blackbox to work reliably

@datapythonista
Member Author

Instead of adding the io accessor could we not maintain a function registry for read / write functions? That is how pyarrow / arrow for instance allow you to bring your own functions.

Interesting point. Personally I think the custom registration of methods is a big barrier for many users. It's probably not needed, since you could have just import connector_package, and it could do that registration for you. That's how projects implementing custom accessors already do it. I totally understand your point about the magic of entrypoints, I felt the exact same way when I first discovered it. But good idea or not, it's the standard Python API, and I think it's not magic if you know that it exists and how it works. So, while I totally understand you, I think the "magic" entrypoint API is the correct way to do things, and the other options are the hacks, even if many people may feel the other way round initially. Probably providing clear documentation about how it works, and maybe functionality to see what's been installed (e.g. DataFrame.io.list_connectors()), would make that magic become something natural and predictable. What do you think?

@WillAyd
Member

WillAyd commented Mar 7, 2023

Do you know of any other packages that currently leverage the entry points like you describe?

@datapythonista
Member Author

Do you know of any other packages that currently leverage the entry points like you describe?

Pytest, Ibis and Pylons that I know of. There are lots using them to create "binaries", such as Jupyter, Flake8, pytest... But that use case is quite different.

@datapythonista
Member Author

A possible way around this is to define a base class PandasIO that defines the protocol for reading/writing data, and instead of doing DataFrame.io.read_foo(), you write a class called FooIO with a base class of PandasIO and then have something like:

@Dr-Irv That sounds like a good idea. I don't think the class should be in pandas, since the connectors could potentially be used by other projects, but a base class, and maybe a generic test suite that lets connector developers know in a simpler way if they are breaking the API or a convention, could be something useful. And if that helps with typing, then even better.

@simonjayhawkins
Member

I see that we basically have four API proposals at the moment

  1. entry_points to register .to_foo DataFrame methods and .from_foo methods and/or top level functions
  2. a way to register import functions using say pandas.read('csv').engine('pyarrow')
  3. maintain a function registry for read / write functions, say pd.register_read_function("foo", foolib.read_dataframe), and then do pd.read_foo("afile.foo")
  4. define a base class PandasIO that defines the protocol for reading/writing data

I think that we could maybe have a rejected ideas section in the PDEP outlining the ones that are not chosen.

At the risk of introducing a fifth option...

I see we have an issue open for entry points for the SQL engine #41728. Now in this PDEP we mention DuckDB, and the solution would be a from_duckdb method.

Now, personally, I've never used pandas.read_sql but see that we also have pandas.read_sql_query and read_sql_table methods.

I also see that @rhshadrach brought up the issue of a consistent API when it comes to function parameters.

So my question is...

Could there maybe be a plug-in structure, where 3rd parties can add engines to any of our existing IO routines instead of directly into the .io namespace?

@Dr-Irv
Contributor

Dr-Irv commented Mar 7, 2023

Could there maybe be a plug-in structure, where 3rd parties can add engines to any of our existing IO routines instead of directly into the .io namespace?

I think my proposal of a PandasIO base class does something similar. Each implemented subclass could live as a separate package that people install, and they directly reference the subclass to handle read/write operations.

It's similar in concept to how python handles I/O. Base class IOBase, with 3 ABC's on top of that, and then implementations like FileIO, BytesIO, etc. on top of that.

From a transition point of view, we could leave in the existing read_xxx() and to_yyy() routines, and switch their implementations to the new class structure, and decide at a later date whether to deprecate and eventually remove read_xxx() and to_yyy(). We could add a parameter to things like read_sql() that indicates the class to use, which would allow people to substitute in their own implementations rather than the one we provide.

@datapythonista
Member Author

Thanks @simonjayhawkins for the summary and the proposal. Adding code samples for the different options showing what they mean for users makes it easier to see the differences and helps the discussion. Let me know if I missed any option (the exact naming from/to, read/write can probably be discussed separately once we decide the API option):

# option 1 (original and current proposal in the PR)
df = pandas.DataFrame.io.from_myformat(...)
df.io.to_myformat(...)


# option 2 (same, but separate namespaces for from/to, proposed by Brock)
df = pandas.io.from_myformat(...)
df.io.to_myformat(...)


# option 3 (single methods using a builder pattern)
df = pandas.read('myformat').engine('myengine').params(...)
df.write('myformat').engine('myengine').params(...)


# option 4 (register into the existing `pandas` and `DataFrame` namespaces)
df = pandas.from_myformat(...)
df.to_myformat(...)


# option 5 (register engines, not formats. This is different from the rest, 
# since signatures for all formats should be defined in pandas,
# and new formats can't be added by third-parties, proposed by Simon)
df = pandas.from_knownformat(engine='third-party-engine', ...)
df.to_knownformat(engine='third-party-engine', ...)


# option 6 (make connectors be subclasses of a pandas defined class, and use them directly;
# this can already be done, we would just standardize things. Proposed by Irv)
import MyFormatConnector

df = MyFormatConnector.read(...)
MyFormatConnector.write(df)


# option 7 (register connectors directly. I see this as the same as 4,
# except that instead of using entrypoints the import would be used,
# since the registration is likely to happen inside the modules.
# This can be done already monkeypatching, the only addition would be to
# provide the register methods to make it official we're happy with it. Proposed by Will)
import myconnector

pandas.register_read_function("myformat", myconnector.read_dataframe)
pandas.register_write_function("myformat", myconnector.write_dataframe)

df = pandas.read_myformat(...)
df.to_myformat(...)

My opinion:

Some variation of 1, 2 or 3 seem reasonable to me.

4 and 7 seem a bit messy. Maybe in the long term, if we remove many formats, users will only have a small number of from/to methods, and my only concern is the magic of mixing core methods with third-party dynamic methods in the same namespace. But the transition period with the status quo plus new formats seems chaotic.

I think 5 doesn't solve the problem of users being able to use domain-specific formats in a standard way, and keeps the maintenance cost high (discussing which formats are supported...)

I think 6 is not a very consistent or friendly pandas API, and it also doesn't support method chaining.

@Dr-Irv
Contributor

Dr-Irv commented May 1, 2023

Do you mind expanding on how users will be affected exactly? I guess the main issue is that the type of the object returned by pandas.read_lance(...) will be unknown, and mypy won't be able to warn if I use the object in an incorrect way? Maybe autocomplete on the returned dataframe won't work in Jupyter? Or is that implemented using dir(obj). Thanks!

It's three things:

  1. Type checkers would see that pd.read_lance() and DataFrame.write_lance() are not valid methods.
  2. Because they are not valid methods, then the types of those methods would be unknown
  3. Whatever arguments were allowed for those 2 methods would not be type checked at all. So a user could write df = pd.read_lance(10) and the type checker wouldn't complain about it, assuming the first argument was supposed to be a string.
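Concretely, with nothing added to the stubs, a checker sees something like this (read_lance/write_lance being the hypothetical injected methods from the example above):

```python
import pandas as pd

# (1) the injected method is unknown to the stubs, so the attribute itself is flagged
df = pd.read_lance("dataset.lance")  # type checker: unknown attribute on the pandas module

# (2)+(3) because the signature is unknown, the arguments are never checked either,
# so passing an int where a path string is expected goes unnoticed
df = pd.read_lance(10)

df.write_lance("out.lance")  # same story for the injected DataFrame method
```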

@datapythonista
Member Author

Thanks for the info @Dr-Irv, that makes sense, I didn't think the injected methods themselves would error.

Does that mean that our accessors and anything in Python making use of __getattr__ is also broken in mypy? I guess users may be used to adding exceptions then, no?

@Dr-Irv
Contributor

Dr-Irv commented May 2, 2023

Does that mean that our accessors and anything in Python making use of __getattr__ is also broken in mypy? I guess users may be used to adding exceptions then, no?

No. For any accessors that we document, we put in explicit type definitions in pandas-stubs. But if you do pd.foobar(), the type checker will say that is invalid.

We can have the stubs include things that are implemented dynamically, because we know the definitions are always there (e.g. Series.str, Series.dt). What we can't do is have the stubs include things a user would hook into pandas in a dynamic fashion.

@Dr-Irv
Contributor

Dr-Irv commented May 10, 2023

A couple of notes based on the dev meeting discussion on 5/10/23:

@datapythonista
Member Author

@pandas-dev/pandas-core I updated this PDEP to use the dataframe interchange API as the way connectors will share data with pandas. I think this should address the concerns about Arrow, and it simplifies the design and implementation. I also added to the proposal that the plugin system will be opt-in (this can be reconsidered in the future). Previously I changed the proposal to use the existing pandas API, and not to change the method names or the namespaces.

You can see a proof of concept of what the implications of this PDEP are for the pandas code base.

Please let me know if there is any other concern, otherwise feel free to approve. Thanks!

@Dr-Irv (Contributor) left a comment

I think you should mention that using these connectors would not provide any support for doing type checking.

I also have to wonder if we were starting from scratch (no legacy pd.read_* and DataFrame.to_* methods), what API we would actually design that would not do any method insertion into our existing classes.

@datapythonista
Member Author

I think you should mention that using these connectors would not provide any support for doing type checking.

I also have to wonder if we were starting from scratch (no legacy pd.read_* and DataFrame.to_* methods), what API we would actually design that would not do any method insertion into our existing classes.

Thanks @Dr-Irv, good feedback. I added a limitations section to discuss the lack of type checking and a couple of other points probably worth mentioning.

Personally, if starting from scratch I'd unify the I/O connector methods in the DataFrame.io namespace, so it's isolated from the rest of the API. This was discussed in the early stages of this PDEP, but the preference was to keep the current API, and I think it's a good idea to keep this proposal independent from API changes of that magnitude. I also added a note about this in the limitations.

@Dr-Irv
Contributor

Dr-Irv commented Jun 11, 2023

I would like to propose an alternative design that supports typing (if a user wants to use it), requires no registrations, and handles the conflict issue.

The idea is that we define an interface for reading and writing:

from abc import abstractmethod
import pandas as pd


class PandasDataIOInterface:
    """
    Abstract interface for reading and writing
    """

    @classmethod
    @abstractmethod
    def read(cls, *args, **kwargs) -> pd.DataFrame:
        ...

    @classmethod
    @abstractmethod
    def write(cls, df: pd.DataFrame, *args, **kwargs) -> str | None:
        ...

Users then create a subclass of PandasDataIOInterface that implements class methods for read() and write().

We define pandas.read_pandas_data_io(), which takes as a parameter either the class, followed by all the parameters of the read() function (which provides no type checking), OR the result of calling the read() function of the implemented class (which does have type checking).

We define DataFrame.to_pandas_data_io(), which takes as a parameter either the class (no type checking), followed by all the parameters of the write() function, OR a callable that calls the write() function.

This is illustrated in this code here. I could do a PR if you like. (Expand the "Details" to see the code):

from abc import abstractmethod
import inspect
from typing import Callable, ParamSpec, Concatenate, cast

import pandas as pd

P = ParamSpec("P")


class PandasDataIOInterface:
    """
    Abstract interface for reading and writing
    """

    @classmethod
    @abstractmethod
    def read(cls, *args, **kwargs) -> pd.DataFrame:
        ...

    @classmethod
    @abstractmethod
    def write(cls, df: pd.DataFrame, *args, **kwargs) -> str | None:
        ...


def to_pandas_data_io(
    self,
    interface: type[PandasDataIOInterface]
    | Callable[Concatenate[pd.DataFrame, P], str | None],
    *args,
    **kwargs
) -> str | None:
    """
    Public method of `pd.DataFrame` that will output the DataFrame
    to a writer

    Parameters
    ----------
    interface : type[PandasDataIOInterface] | Callable[Concatenate[pd.DataFrame, P], str  |  None]
        Can pass in the type of class that does reading and writing, or a callable
        that will take a `DataFrame` and any number of arguments, which are then
        passed to the `write()` method.

    Returns
    -------
    str | None
        Based on arguments of the writer
    """
    if callable(interface) and not inspect.isclass(interface):
        func = cast(Callable[..., str | None], interface)
        return func(self, *args, **kwargs)
    elif issubclass(interface, PandasDataIOInterface):
        return interface.write(self, *args, **kwargs)


def read_pandas_data_io(
    interface: type[PandasDataIOInterface] | pd.DataFrame, *args, **kwargs
) -> pd.DataFrame:
    """
    Public `pandas.read_pandas_data_io()` method for reading in data

    Parameters
    ----------
    interface : type[PandasDataIOInterface] | pd.DataFrame
       Pass the class that handles the reading, or a function call that returns
       a DataFrame

    Returns
    -------
    pd.DataFrame
        DataFrame that is read in
    """
    if isinstance(interface, pd.DataFrame):
        return interface
    elif issubclass(interface, PandasDataIOInterface):
        return interface.read(*args, **kwargs)


class CSVCover(PandasDataIOInterface):
    """
    Example read and write interface - using CSV files, just uses the filename

    """

    @classmethod
    def read(cls, file: str) -> pd.DataFrame:
        return pd.read_csv(file)

    @classmethod
    def write(cls, df: pd.DataFrame, file: str) -> str | None:
        return df.to_csv(file)


# Patch in the `DataFrame.to_pandas_data_io` method (for illustration)
pd.DataFrame.to_pandas_data_io = to_pandas_data_io  # type: ignore

# Patch in the `pandas.read_pandas_data_io` method (for illustration)
pd.read_pandas_data_io = read_pandas_data_io  # type: ignore


# Note the `#type: ignore` statements are needed here because we haven't
# patched in things the right way.  If the methods were defined in the pandas stubs, you wouldn't need that.

# First example would not provide typing support, but might be easier to use
df = pd.read_pandas_data_io(CSVCover, "junk.csv")  # type: ignore
df.to_pandas_data_io(CSVCover, "junk2.csv")  # type: ignore


# Second example does provide typing support for the `read()` and `write()` methods
# of the implemented class
df2 = pd.read_pandas_data_io(CSVCover.read("junk.csv"))  # type: ignore
df2.to_pandas_data_io(lambda df: CSVCover.write(df, "junk3.csv"))  # type:ignore

Advantages of this way of doing things are the following, IMHO:

  1. No registration is needed. Users just import the class that does the reading and writing
  2. The class is never instantiated - it's all done through class methods
  3. Conflicts are resolved because the class names are unique. Or if there were two libraries that implemented the same class name, you could choose which package to import it from
  4. If you want to get type checking, it's available, and the syntax isn't too clumsy. If you don't want to get type checking, the syntax is still clean - you just specify the class and the arguments.
  5. It doesn't pollute the pandas or DataFrame method namespace.

In the future, we could then deprecate things like read_sas()/to_sas() and just create the class SASPandasIO that implements read() and write(), and it moves the maintenance of this out of our code base pretty easily.

@jbrockmendel
Member

It doesn't pollute the pandas or DataFrame method namespace.

I like this part!

```python
import pandas

pandas.load_io_plugins()
```
Member

Can we make this explicit (require the user to specify which io plugin to load)?

Member Author

We can, but I don't see how this is much different from using import pandas_myformat, which we can already do without any change in pandas. We could still try to enforce a standard API, but we lose the ease of use, in my opinion.

@datapythonista
Member Author

I would like to propose an alternative design that supports typing (if a user wants to use it), requires no registrations, and handles the conflict issue.

df = pd.read_pandas_data_io(CSVCover, "junk.csv")

I personally don't see the advantage of this approach. Maybe it's only a personal opinion, but the API provided by this PDEP feels very intuitive to me, as it's the same for new plugins as for existing core connectors. If we want to change the existing API to something new, why not simply do this, which doesn't need any change in pandas at all:

import pandas_myformat

df = pandas_myformat.read()
df.pipe(pandas_myformat.write, fname='foo.myformat')

@datapythonista
Member Author

This PDEP failed to get any support after several iterations, even after making it a minimal version of the original proposal with no API changes and no Arrow interface. I marked it as rejected, and I'll merge on green. If there is any objection to merging it as rejected, I'm happy to simply close the PR too.

If anyone has interest in something similar to this, feel free to open a new PDEP.

@Dr-Irv
Contributor

Dr-Irv commented Jun 13, 2023

I personally don't see the advantage of this approach.

I listed the advantages at the end of this comment: #51799 (comment)

Maybe it's only a personal opinion, but the API provided by this PDEP feels very intuitive to me, as it's the same for new plugins as for existing core connectors. If we want to change the existing API to something new, why not simply do this, and we don't need to change anything in pandas at all:

import pandas_myformat

df = pandas_myformat.read()
df.pipe(pandas_myformat.write, fname='foo.myformat')

The disadvantage of the above is lack of typing support when writing.

What I proposed addresses a number of the issues discussed above.

@datapythonista
Member Author

I listed the advantages at the end of this comment: #51799 (comment)

Sorry, I wasn't very clear. I meant the advantages compared to my example later. If I'm not wrong, all the points in your list are also true if we just use a regular function for the reader, and pipe with a regular function for the writer. Is that right?

I'd still prefer a simpler/nicer API than either of these two options, but I'll merge this as rejected and leave this topic to someone else; it doesn't seem to be making any progress, so we'd better focus on something more productive.

@datapythonista datapythonista merged commit 027adc5 into pandas-dev:main Jun 13, 2023
@Dr-Irv
Contributor

Dr-Irv commented Jun 13, 2023

Sorry, I wasn't very clear. I meant the advantages compared to my example later. If I'm not wrong, all the points in your list are also true if we just use a regular function for the reader, and pipe with a regular function for the writer. Is that right?

No, because you don't get typing support with using pipe with the writer.

canthonyscott pushed a commit to canthonyscott/pandas-anthony that referenced this pull request Jun 23, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023