PDEP-9: Allow third-party projects to register pandas connectors with a standard API #51799
Conversation
make testing of them much more reliable and resistant to changes in pandas, and
allow connectors to be reused by other projects of the ecosystem.

In case a `from_` method returned something different than a PyArrow table,
@ritchie46 would be great to get your feedback on the whole proposal, but in particular on this part. Assuming Polars wants to benefit from generic readers and writers that can work with any dataframe project, what do you think about a PyArrow object? Since Polars is Arrow2 instead, would it make sense to have the readers/writers just expose the info to access the lower-level structs that the Arrow standard provides to exchange information? Not sure what the best option would be here.
Pyarrow objects are perfect. We adhere to the arrow memory specification, the actual implementation of that specification doesn't really matter.
So yeah, pyarrow.table is good. 👍
Thanks @ritchie46, that sounds great. Just to be clear, my concern wasn't that you couldn't read Arrow objects, but that you may not want to have PyArrow as a dependency to be able to use this API, and I think returning a PyArrow table forces that. PyArrow has more than 30 dependencies (at least in conda-forge), some of them not so small, like libarrow itself, or libgoogle-cloud.
I guess PyArrow may be the best way to start, but I wonder if, for the long term, a minimal Python Arrow library that can hold the metadata needed to access the Arrow data, and then be converted to the PyArrow or Arrow2/Polars structure, would be better.
Just a small plug for 'nanoarrow', which just had a 0.1 release, has R bindings, and could have Python bindings (e.g., apache/arrow-nanoarrow#117 ).
This is amazing, thanks for sharing @paleolimbot. I think this is exactly what we need, way better than using PyArrow and requiring it as a dependency for this connector subsystem! Just in time, let me know if at some point you need feedback or some testing of the Python bindings when they're ready. :)
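To make the data flow being discussed concrete, here is a minimal sketch with a hypothetical `from_myformat` connector (the name and the fabricated data are illustrative, not part of the PDEP): the connector returns a `pyarrow.Table`, and the consumer converts it to a pandas DataFrame without the connector ever touching pandas internals.

```python
import pyarrow as pa


def from_myformat(path: str) -> pa.Table:
    # A real connector would parse `path`; here we just fabricate a table.
    return pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})


# What pandas (or any other consumer of the connector) could do with the result:
table = from_myformat("data.myformat")
df = table.to_pandas()  # pandas DataFrame built from the Arrow table
```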
from_duckdb = "pandas_duckdb:from_duckdb"
```

On import of the pandas module, it would read the entrypoint registry for the
This would make `import pandas` slower, not sure if it can be done in a lazy way. For people using autocomplete when coding, for example, I think `DataFrame.io` should be populated before `__getattr__` is called. But maybe it can just be populated when any of `__dir__` or `__getattr__` is called, and we get the best of both worlds? Or too complex?
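A rough sketch of that lazy variant, assuming Python 3.10+'s `importlib.metadata.entry_points(group=...)` and the `dataframe.io` entrypoint group named in the proposal; the accessor class and helper names are hypothetical. Connectors are only loaded the first time an attribute is looked up or `dir()` is called, so `import pandas` itself stays fast (binding the `DataFrame` instance to `to_*` methods is left out for brevity):

```python
from functools import lru_cache
from importlib.metadata import entry_points


@lru_cache(maxsize=1)
def _io_methods() -> dict:
    # Deferred: only evaluated the first time the accessor is actually used.
    return {ep.name: ep.load() for ep in entry_points(group="dataframe.io")}


class IOAccessor:
    """Hypothetical ``DataFrame.io`` accessor with lazy entrypoint loading."""

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        try:
            return _io_methods()[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Populating on __dir__ keeps autocompletion working without eager loading.
        return sorted(set(super().__dir__()) | set(_io_methods()))
```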
any idea how big this slowdown would be?
I guess it can be minimal if the user has just a couple of connectors with lightweight dependencies installed. But there could be a situation where dozens of connectors exist, and some of them are slow to load (maybe they have imports of their own dependencies that are also slow, or other reasons). Unless connectors implement lazy imports, the loading of all connectors, including their dependencies, would happen at `import pandas`.
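For the connector side, the usual mitigation is the lazy-import pattern: the heavy dependency is imported inside the reader rather than at module import time, so registering the entrypoint costs almost nothing. A sketch using the `from_duckdb` example from the proposal (the signature is illustrative):

```python
import pandas as pd


def from_duckdb(query: str, database: str = ":memory:") -> pd.DataFrame:
    import duckdb  # deferred: only paid for when the connector is actually called

    con = duckdb.connect(database)
    return con.execute(query).df()
```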
The current implementation allows us some control over these methods so that there is a more consistent user experience between different engines. By having third parties solely responsible for the connector, I believe it is then completely out of our hands, which could make for a worse user experience.

This is making me think that for some important methods (`to_excel` in particular), the long term plan should be to leave the implementation within pandas (but moved to `.io`).
`dataframe.io` group, and would dynamically create methods in the `DataFrame.io`
namespace for them. Method names would only be allowed to start with `from_` or
`to_`, and any other prefix would make pandas raise an exception. This would
guarantee a reasonably consistent API among third-party I/O connectors.
How would conflicts work if there are two packages that want to register `to_widget`?
Very good point. A conflict would only happen if a user has both packages installed, so I guess in most cases it wouldn't be a problem, even if two authors decide to use the same name.

I guess entrypoints don't require names to be unique, so I think it's up to us to decide. I'd personally raise an exception and let the user know that they installed two packages providing the same method, and that they should uninstall one to disambiguate.
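A sketch of what that check could look like, given that entrypoint names within a group are not required to be unique; the helper name, exception type and message are illustrative, and `importlib.metadata.entry_points(group=...)` requires Python 3.10+:

```python
from collections import Counter
from importlib.metadata import entry_points


def _check_io_entrypoint_conflicts(group: str = "dataframe.io") -> None:
    names = [ep.name for ep in entry_points(group=group)]
    duplicated = [name for name, count in Counter(names).items() if count > 1]
    if duplicated:
        raise RuntimeError(
            "Multiple installed packages register the same pandas I/O "
            f"method(s): {duplicated}; uninstall one of them to disambiguate."
        )
```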
The scope of the current proposal is limited to the addition of the
`DataFrame.io` namespace, and the automatic registration of functions defined
by third-party projects, if an entrypoint is defined.
I agree on making this out of scope for the proposal itself; however, I think it's also natural to discuss what the next steps may be when reasoning about this change. So I hope it's okay to still discuss the steps beyond this here.
Absolutely. I think it makes more sense to make a final decision on possible migrations at a later point, when we have more information about this new system. But fully agree that it's worth discussing, and having an idea of other people's expectations regarding this.

To me, some of the current functions, such as read_sas, read_spss, read_gbq, to_xarray and maybe others, could be moved to this API as soon as the new system is in place, with the current API raising a FutureWarning to let users know about the change.

For the more commonly used ones, I don't have a strong opinion. Maybe make them available via the new API fairly soon too, for users who want a more consistent API, but wait before adding any FutureWarning or anything?
That's an interesting point. I personally think that the current proposal already guarantees a consistent behavior, leaving only the parameters up to the developers of the connectors. This would prevent things like using the builder/fluid pattern, which, if I'm not wrong, PySpark is using, and which AFAIK users seem to like. Do you have a particular example that can show the kind of inconsistencies you're worried about? I find it difficult to understand the problem without an example.
Not a strong opinion, but in my mind, in the long term I'd only keep in pandas the connectors provided by PyArrow. That doesn't mean we can't make an exception for Excel. I don't consider it so important, but of course that depends on the domain each of us works in. The main reason I'd move it out is that there are 4 different engines, and I think each of them should be a separate project, so users could simply install whichever is more convenient for them. I don't think keeping 4 or more Excel connectors in pandas is desirable, and I wouldn't make a choice on which is the "good" one. As said, this is for the long term; I wouldn't make any change to Excel for now. Does it make sense to you? Or what would be your preferred option for the long term?
`to_`, and any other prefix would make pandas raise an exception. This would
guarantee a reasonably consistent API among third-party I/O connectors.

Connectors would use Apache Arrow as the only interface to load data from and
I get that you're excited about PyArrow support, but I am not on board with getting rid of NumPy support.
I didn't phrase this properly. I'll rephrase, but let me elaborate here for now. The idea is that third-party connectors produce an Arrow object, so we don't need to worry about whether they are using internal pandas APIs to build their data directly as pandas dataframes, which we may later change and thus break their code. But the fact that they communicate with us by providing data as Arrow doesn't mean that, when we then create the dataframe with it, we need to use Arrow as the backend. I think for now it makes more sense to build the dataframes with the stable types as we've been doing (unless the user sets the option `dtype_backend` to `pyarrow`, as discussed in #51433). Sorry this isn't clear in the proposal. Does this make sense?
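A small sketch of that distinction, using pyarrow's standard conversion APIs and pandas 2.0's `ArrowDtype` (the data is made up): the same Arrow table can be turned into a numpy-backed DataFrame by default, or into an Arrow-backed one, which is roughly what `dtype_backend="pyarrow"` implies.

```python
import pandas as pd
import pyarrow as pa

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", None]})

# Default conversion: numpy/object-backed columns, the stable dtypes mentioned above.
df_default = table.to_pandas()

# Arrow-backed conversion, keeping the data behind pandas' ArrowDtype.
df_arrow = table.to_pandas(types_mapper=pd.ArrowDtype)
```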
Excel is actually an interesting example: first, AFAIK those 4 engines are not about "choice" (for reading at least), but each of them implements different file formats, and so which engine gets used depends on the type of file you are reading. But that brings up the issue that if each of them were implemented in a separate project, they couldn't all use the name "excel" (for OpenDocument spreadsheets it probably makes sense to use another name, but for .xls vs .xlsx vs .xlsb files less so). (Sidenote: according to our user survey, at the time in 2019, Excel was the second most used IO method after CSV. I suppose that since then other formats like Parquet will have become more popular, but I assume Excel is still one of the most used methods.)
In some cases, the pandas API provides `DataFrame.to_*` methods that are not
used to export the data to a disk or memory object, but instead to transform
the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.
Related: I've recently been thinking that DataFrame/Series methods that only operate on the index/columns might make sense to put in an accessor/namespace.
live under that namespace.

Third-party packages would implement a
[setuptools entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins)
Small nit: entrypoints are not (or no longer) setuptools-specific; a possible general reference: https://packaging.python.org/en/latest/specifications/entry-points/ or https://docs.python.org/3/library/importlib.metadata.html
Good point, I'll refer to it simply as entrypoint or Python entrypoint.
I don't know much about the Excel connectors to be honest, but very good point regardless of the example. We could in theory solve it in the same way we have solved it until now, that is, having two different levels: the format and the engine. The problem with that (which this approach would make way more visible) is that all engines of a format must have the exact same parameters. Which we've been doing so far, but of course if anyone can implement connectors, we can't enforce that anymore (unless we implement every format signature, which doesn't seem like a good idea at all). The best idea I've got is that connectors find different names; Excel actually seems to be an easy case for that. Surely happy to hear other ideas.
Yes, very good point indeed. Not important for me, not even sure if I've ever used it, but obviously important in general.
Unless we want to move pd.read_foo to become DataFrame methods (IIRC there used to exist aliases there and we deprecated them), I think it would make sense for read_foo methods to get pinned in the pd.io namespace rather than DataFrame.io. Not a strong opinion though.
To me a single namespace seems simpler, and having import functions as DataFrame methods feels natural (also having them as pandas module functions). But open to this, and also open to discussing other possible APIs. Not sure if it's crazy, but we could consider something like:

```python
import pandas

df = pandas.read('csv').engine('pyarrow').params(fname='dataset.csv', delimiter=';')
df.write('parquet').engine('fastparquet').params(fname='dataset.parquet')
```

This would have some advantages, like not needing an accessor and avoiding the naming conflicts when more than one engine exists. Clearly more verbose, and more difficult to get used to coming from the current API, but worth considering this API too, and others that people can think of.

UPDATE: I initially used

UPDATE 2: PySpark uses something similar to this; maybe it's nicer to use the format as the function name (we would still need to figure out how to set the engine in case disambiguation is needed):

```python
spark_session.read.parquet("/tmp/output/people.parquet")
```
These builder patterns are way too verbose; -1 on anything but read_xxx and to_xxx. We don't need to disrupt the world - changing APIs should receive very careful consideration.
There is a downside here from the perspective of providing typing support for users. Once we migrate the pandas standard connectors to the `DataFrame.io` namespace, we could provide typing stubs for those methods. But for any third-party library, you can't inject their stubs into the `DataFrame.io` namespace. (We currently have the same issue if people want to create custom accessors.)

A possible way around this is to define a base class `PandasIO` that defines the protocol for reading/writing data, and instead of doing `DataFrame.io.read_foo()`, you write a class called `FooIO` with a base class of `PandasIO` and then have something like:

```python
df = FooIO.read(filename, other_args)
FooIO.write(df, filename, other_args)
```

The advantage of this from a typing perspective is that the library owners of `FooIO` can define the stubs for `read()` and `write()`, and users get the advantage of typing support.

This also avoids issues related to adding methods to the `DataFrame.io` namespace, and possible naming conflicts. It also allows competing implementations for the same type to be implemented. Finally, there would be no penalty for `import pandas`. You'd only import the specific library for the I/O parts you needed.
## PDEP Summary

This document proposes that third-party projects implementing I/O or memory
connectors, can register them using Python's entrypoint system, and make them
remove comma
Thanks for reporting all the typos/grammar corrections. Very helpful, I fixed them all!
pandas supports importing and exporting data from different formats using
I/O connectors, currently implemented in `pandas/io`, as well as connectors
to in-memory structure, like Python structures or other library formats.
"structures" not "structure"
[2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html).
Those less popular formats include SPSS, SAS, Google BigQuery and
Stata. Note that only I/O formats (and not memory formats like records or xarray)
where included in the survey.
"were included", not "where included"
to be used, or not well supported.
- There is no standard API for I/O connectors, and users of them need to learn
  each of them individually.
- Method chaining, is not possible with third-party I/O connectors to export
remove comma
Connectors would use Apache Arrow as the only interface to load data from and
to pandas. This would prevent that changes to the pandas API affecting
connectors in any way. This would simplify the development of connectors,
I don't understand this sentence: "This would prevent that changes to the pandas API affecting connectors in any way."
I rephrased the whole paragraph, thanks for letting me know.
`DataFrame.io` namespace, and the automatic registration of functions defined
by third-party projects, if an entrypoint is defined.

Any changes to the current I/O of pandas are out of scope for this proposal,
Maybe be more explicit here that it is not just "current I/O", but includes methods like `to_json()` and `from_json()`, etc. So you could say that migrations of the current API to the `DataFrame.io` namespace are not part of this PDEP.
Rephrased, I think it should be more clear now.
Instead of adding the io accessor, could we not maintain a function registry for read/write functions? That is how pyarrow / arrow, for instance, allow you to bring your own functions. I think something like the registry sketched below maintains the read_ / write_ interfaces Jeff is looking for. I am -1 on the import registry proposal described here - I think that pattern is too much of a blackbox to work reliably.
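A minimal sketch of what such a registry could look like; the `register_read_function`/`read` names follow "option 7" summarised later in this thread and are hypothetical, not an existing pandas API:

```python
from typing import Callable

import pandas as pd

_READERS: dict[str, Callable[..., pd.DataFrame]] = {}


def register_read_function(fmt: str, func: Callable[..., pd.DataFrame]) -> None:
    # Third-party connectors call this once, e.g. at import time.
    if fmt in _READERS:
        raise ValueError(f"a reader for {fmt!r} is already registered")
    _READERS[fmt] = func


def read(fmt: str, *args, **kwargs) -> pd.DataFrame:
    # pandas would dispatch to whichever connector registered the format.
    return _READERS[fmt](*args, **kwargs)
```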
Interesting point. Personally I think the custom registration of methods is a big barrier for many users. Probably not needed, since you could have just

Do you know of any other packages that currently leverage the entry points like you describe?

Pytest, Ibis and Pylons, that I know of. There are lots of projects using them to create "binaries", such as Jupyter, Flake8, pytest... But that use case is quite different.

@Dr-Irv That sounds like a good idea. I don't think the class should be in pandas, since the connectors could potentially be used by other projects, but a base class, and maybe a generic test suite that lets connector developers know in a simpler way if they are breaking the API or a convention, could be something useful. And if that helps with typing, then even better.
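A sketch of what such a generic test suite could look like (everything here — the class, the attribute names, and the `from_myformat` convention — is hypothetical, just to illustrate connector authors inheriting a set of conformance tests):

```python
import pandas as pd


class BaseConnectorTests:
    """Connector packages would subclass this in their own pytest suite."""

    # Subclasses provide the connector module/object and sample reader arguments.
    connector = None
    reader_name = "from_myformat"
    sample_kwargs: dict = {}

    def test_reader_returns_dataframe(self):
        reader = getattr(self.connector, self.reader_name)
        result = reader(**self.sample_kwargs)
        assert isinstance(result, pd.DataFrame)

    def test_method_names_follow_convention(self):
        # PDEP-9 only allows connector methods starting with "from_" or "to_".
        exported = getattr(self.connector, "__all__", [])
        assert all(name.startswith(("from_", "to_")) for name in exported)
```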
I see that we basically have four API proposals at the moment.

I think that we could maybe have a rejected ideas section in the PDEP outlining the ones that are not chosen. At the risk of introducing a fifth option... I see we have an issue open for entry points for the SQL engine #41728. Now in this PDEP we mention DuckDB, and the solution would be a

Now, personally, I've never used

I also see that @rhshadrach brought up the issue of a consistent API when it comes to function parameters. So my question is... Could there maybe be a plug-in structure, where 3rd parties can add engines to any of our existing IO routines instead of directly into the
I think my proposal of a

It's similar in concept to how Python handles I/O. Base class

From a transition point of view, we could leave in the existing
Thanks @simonjayhawkins for the summary and the proposal. Adding the same code samples for the different options to show what they mean for users; I think it makes it easier to see the differences and helps the discussion. Let me know if I missed any option (the exact naming from/to, read/write can probably be discussed separately once we decide on the API option):

```python
# option 1 (original and current proposal in the PR)
df = pandas.DataFrame.io.from_myformat(...)
df.io.to_myformat(...)

# option 2 (same, but separate namespaces for from/to, proposed by Brock)
df = pandas.io.from_myformat(...)
df.io.to_myformat(...)

# option 3 (single methods using a builder pattern)
df = pandas.read('myformat').engine('myengine').params(...)
df.write('myformat').engine('myengine').params(...)

# option 4 (register into the existing `pandas` and `DataFrame` namespaces)
df = pandas.from_myformat(...)
df.to_myformat(...)

# option 5 (register engines, not formats. This is different from the rest,
# since signatures for all formats should be defined in pandas,
# and new formats can't be added by third-parties, proposed by Simon)
df = pandas.from_knownformat(engine='third-party-engine', ...)
df.to_knownformat(engine='third-party-engine', ...)

# option 6 (make connectors be subclasses of a pandas-defined class, and use
# them directly; this can already be done, we would just standardize things,
# proposed by Irv)
import MyFormatConnector
df = MyFormatConnector.read(...)
MyFormatConnector.write(df)

# option 7 (register connectors directly. I see this as the same as 4,
# except that instead of using entrypoints the import would be used,
# since the registration is likely to happen inside the modules.
# This can already be done by monkeypatching, the only addition would be to
# provide the register methods to make it official we're happy with it.
# Proposed by Will)
import myconnector
pandas.register_read_function("myformat", myconnector.read_dataframe)
pandas.register_write_function("myformat", myconnector.write_dataframe)
df = pandas.read_myformat(...)
df.to_myformat(...)
```

My opinion: Some variation of 1, 2 or 3 seems reasonable to me. 4 and 7 seem a bit messy. Maybe in the long term, if we remove many formats, users will only have a small number of from/to methods, and my only concern is the magic of mixing core methods with third-party dynamic methods in the same namespace. But the transition period with the status quo plus new formats seems chaotic. I think 5 doesn't solve the problem of users being able to use domain-specific formats in a standard way, and keeps the maintenance cost high (discussing which formats are supported...). I think 6 is not a very consistent or pandas-friendly API, and it also doesn't support method chaining.
It's three things:
Thanks for the info @Dr-Irv, that makes sense, I didn't think the injected methods themselves would error. Does that mean that our accessors and anything in Python making use of
No. For any accessors that we document, we put in explicit type definitions in

We can have the stubs include things that are implemented dynamically, because we know the definitions are always there (e.g.
A couple of notes based on the dev meeting discussion on 5/10/23:
@pandas-dev/pandas-core I updated this PDEP to use the dataframe interchange API as the way connectors will share data with pandas. I think this should address the concerns about Arrow, and it simplifies the design and implementation. I also added to the proposal that the plugin system will be opt-in (this can be reconsidered in the future). Previously I changed the proposal to use the existing pandas API, and not change the method names or the namespaces. You can see a proof of concept of the implications of this PDEP for the pandas code base. Please let me know if there is any other concern, otherwise feel free to approve. Thanks!
I think you should mention that using these connectors would not provide any support for doing type checking.

I also have to wonder, if we were starting from scratch (no legacy `pd.read_*` and `DataFrame.to_*` methods), what API we would actually design that would not do any method insertion into our existing classes.
Co-authored-by: Irv Lustig <[email protected]>
Thanks @Dr-Irv, good feedback. I added a limitations section to discuss the lack of type checking and a couple of other points probably worth mentioning. Personally, if starting from scratch I'd unify the I/O connector methods in the
I would like to propose an alternative design that supports typing (if a user wants to use it), requires no registrations, and handles the conflict issue. The idea is that we define an interface for reading and writing:

```python
from abc import abstractmethod

import pandas as pd


class PandasDataIOInterface:
    """
    Abstract interface for reading and writing
    """

    @classmethod
    @abstractmethod
    def read(cls, *args, **kwargs) -> pd.DataFrame:
        ...

    @classmethod
    @abstractmethod
    def write(cls, df: pd.DataFrame, *args, **kwargs) -> str | None:
        ...
```

Users then create a subclass of `PandasDataIOInterface`.

We define `pandas.read_pandas_data_io()`, which takes either such a class (whose `read()` is called) or an already-constructed `DataFrame`.

We define `DataFrame.to_pandas_data_io()`, which takes either such a class (whose `write()` is called) or a callable that writes the `DataFrame`.

This is illustrated in this code here. I could do a PR if you like. (Expand the "Details" to see the code):

```python
from abc import abstractmethod
import inspect
from typing import Callable, ParamSpec, Concatenate, cast

import pandas as pd

P = ParamSpec("P")


class PandasDataIOInterface:
    """
    Abstract interface for reading and writing
    """

    @classmethod
    @abstractmethod
    def read(cls, *args, **kwargs) -> pd.DataFrame:
        ...

    @classmethod
    @abstractmethod
    def write(cls, df: pd.DataFrame, *args, **kwargs) -> str | None:
        ...


def to_pandas_data_io(
    self,
    interface: type[PandasDataIOInterface]
    | Callable[Concatenate[pd.DataFrame, P], str | None],
    *args,
    **kwargs
) -> str | None:
    """
    Public method of `pd.DataFrame` that will output the DataFrame
    to a writer

    Parameters
    ----------
    interface : type[PandasDataIOInterface] | Callable[Concatenate[pd.DataFrame, P], str | None]
        Can pass in the type of class that does reading and writing, or a callable
        that will take a `DataFrame` and any number of arguments, which are then
        passed to the `write()` method.

    Returns
    -------
    str | None
        Based on arguments of the writer
    """
    if callable(interface) and not inspect.isclass(interface):
        func = cast(Callable[..., str | None], interface)
        return func(self, *args, **kwargs)
    elif issubclass(interface, PandasDataIOInterface):
        return interface.write(self, *args, **kwargs)


def read_pandas_data_io(
    interface: type[PandasDataIOInterface] | pd.DataFrame, *args, **kwargs
) -> pd.DataFrame:
    """
    Public `pandas.read_pandas_data_io()` method for reading in data

    Parameters
    ----------
    interface : type[PandasDataIOInterface] | pd.DataFrame
        Pass the class that handles the reading, or a function call that returns
        a DataFrame

    Returns
    -------
    pd.DataFrame
        DataFrame that is read in
    """
    if isinstance(interface, pd.DataFrame):
        return interface
    elif issubclass(interface, PandasDataIOInterface):
        return interface.read(*args, **kwargs)


class CSVCover(PandasDataIOInterface):
    """
    Example read and write interface - using CSV files, just uses the filename
    """

    @classmethod
    def read(cls, file: str) -> pd.DataFrame:
        return pd.read_csv(file)

    @classmethod
    def write(cls, df: pd.DataFrame, file: str) -> str | None:
        return df.to_csv(file)


# Patch in the `DataFrame.to_pandas_data_io` method (for illustration)
pd.DataFrame.to_pandas_data_io = to_pandas_data_io  # type: ignore
# Patch in the `pandas.read_pandas_data_io` method (for illustration)
pd.read_pandas_data_io = read_pandas_data_io  # type: ignore
# Note the `# type: ignore` statements are needed here because we haven't
# patched in things the right way. If the methods were defined in the pandas
# stubs, you wouldn't need that.

# First example would not provide typing support, but might be easier to use
df = pd.read_pandas_data_io(CSVCover, "junk.csv")  # type: ignore
df.to_pandas_data_io(CSVCover, "junk2.csv")  # type: ignore

# Second example does provide typing support for the `read()` and `write()`
# methods of the implemented class
df2 = pd.read_pandas_data_io(CSVCover.read("junk.csv"))  # type: ignore
df2.to_pandas_data_io(lambda df: CSVCover.write(df, "junk3.csv"))  # type: ignore
```

Advantages of this way of doing things are the following, IMHO:
In the future, we could then deprecate things like
I like this part!
```python
import pandas

pandas.load_io_plugins()
```
Can we make this explicit (require the user to specify which io plugin to load)?
We can, but I don't see how this is much different than using `import pandas_myformat`, which we can already do without the need of any change in pandas. We could still try to enforce a standard API, but we lose the ease of use, in my opinion.

I personally don't see the advantage of this approach. Maybe it's only a personal opinion, but the API provided by this PDEP feels very intuitive to me, as it's the same for new plugins as for existing core connectors. If we want to change the existing API to something new, why not simply do this, and we don't need to change anything in pandas at all:

```python
import pandas_myformat

df = pandas_myformat.read()
df.pipe(pandas_myformat.write, fname='foo.myformat')
```
This PDEP failed to get any support after several iterations, even after making it a minimal version of the original proposal with no API changes and no Arrow interface. I marked it as rejected, and I'll merge on green. If there is any objection to merging it as rejected, happy to simply close the PR too. If anyone has interest in something similar to this, feel free to open a new PDEP.
I listed the advantages at the end of this comment: #51799 (comment)
The disadvantage of the above is lack of typing support when writing. What I proposed addresses a number of the issues discussed above.
Sorry, I wasn't very clear. I meant the advantages compared to my example later. If I'm not wrong, all the points in your list are also true if we just use a regular function for the reader. I'd still prefer a simpler/nicer API than either of these two options, but I'll merge this as rejected and leave this topic to someone else; it doesn't seem to be making any progress, so we'd better focus on something more productive.
No, because you don't get typing support when using pipe with the writer.
@pandas-dev/pandas-core another PDEP. There is a short summary at the beginning of the proposal for people who don't immediately have the time to read the whole document.