PDEP-9: Allow third-party projects to register pandas connectors with a standard API #51799
@@ -0,0 +1,395 @@
# PDEP-9: Allow third-party projects to register pandas connectors with a standard API

- Created: 5 March 2023
- Status: Under discussion
- Discussion: [#51799](https://github.com/pandas-dev/pandas/pull/51799)
- Author: [Marc Garcia](https://github.com/datapythonista)
- Revision: 1

## PDEP Summary

This document proposes that third-party projects implementing I/O or memory
connectors can register them using Python's entrypoint system, and make them
available to pandas users with the existing I/O interface. For example:

```python
import pandas

df = pandas.read_duckdb("SELECT * FROM 'my_dataset.parquet';")

df.to_deltalake('/delta/my_dataset')
```

This would make it easy to extend the existing set of connectors, adding
support for new formats and database engines, data lake technologies,
out-of-core connectors, and the new ADBC interface, while at the same time
reducing the maintenance cost of the pandas core.

## Current state

pandas supports importing and exporting data from different formats using
I/O connectors, currently implemented in `pandas/io`, as well as connectors
to in-memory structures, like Python structures or other library formats.
In many cases, those connectors wrap an existing Python library, while in
some others, pandas implements the logic to read and write to a particular
format.

In some cases, different engines exist for the same format. The API to use
those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to
import data, and `DataFrame.to_<format>(engine='<engine-name>', ...)` to
export data.

For objects exported to memory (like a Python dict) the API is the same as
for I/O, `DataFrame.to_<format>(...)`. For formats imported from objects in
memory, the API is different, using the `from_` prefix instead of `read_`:
`DataFrame.from_<format>(...)`.

In some cases, the pandas API provides `DataFrame.to_*` methods that are not
used to export the data to a disk or memory object, but instead to transform
the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.

> Review comment: related: I've recently been thinking that DataFrame/Series methods that only operate on the index/columns might make sense to put in an accessor/namespace

Dependencies of the connectors are not loaded by default, and will be
imported when the connector is used. If the dependencies are not installed,
an `ImportError` is raised.

```python
>>> pandas.read_gbq(query)
Traceback (most recent call last):
...
ImportError: Missing optional dependency 'pandas-gbq'.
pandas-gbq is required to load data from Google BigQuery.
See the docs: https://pandas-gbq.readthedocs.io.
Use pip or conda to install pandas-gbq.
```
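
The deferred-import behavior described above can be sketched as follows. This is a minimal illustration and not the pandas implementation: `load_optional_dependency` and `read_myformat` are hypothetical names, and `myformat_backend` stands in for whatever library a connector wraps.

```python
import importlib


def load_optional_dependency(name, purpose):
    """Import `name` on first use, with an actionable error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"Missing optional dependency '{name}'. "
            f"{name} is required to {purpose}. "
            f"Use pip or conda to install {name}."
        ) from exc


def read_myformat(path):
    # Hypothetical connector: the backend is imported only when the reader is
    # called, so importing the module that defines the reader stays cheap.
    backend = load_optional_dependency("myformat_backend", "read myformat files")
    return backend.load(path)
```

Because the import happens inside the reader, a user who never calls `read_myformat` never pays for, or needs, the backend.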

### Supported formats

The list of formats can be found in the
[IO guide](https://pandas.pydata.org/docs/dev/user_guide/io.html).
A more detailed table, including in-memory objects and I/O connectors in the
DataFrame styler, is presented next:

| Format    | Reader | Writer | Engines                                                                           |
|-----------|--------|--------|-----------------------------------------------------------------------------------|
| CSV       | X      | X      | `c`, `python`, `pyarrow`                                                          |
| FWF       | X      |        | `c`, `python`, `pyarrow`                                                          |
| JSON      | X      | X      | `ujson`, `pyarrow`                                                                |
| HTML      | X      | X      | `lxml`, `bs4/html5lib` (parameter `flavor`)                                       |
| LaTeX     |        | X      |                                                                                   |
| XML       | X      | X      | `lxml`, `etree` (parameter `parser`)                                              |
| Clipboard | X      | X      |                                                                                   |
| Excel     | X      | X      | `xlrd`, `openpyxl`, `odf`, `pyxlsb` (each engine supports different file formats) |
| HDF5      | X      | X      |                                                                                   |
| Feather   | X      | X      |                                                                                   |
| Parquet   | X      | X      | `pyarrow`, `fastparquet`                                                          |
| ORC       | X      | X      |                                                                                   |
| Stata     | X      | X      |                                                                                   |
| SAS       | X      |        |                                                                                   |
| SPSS      | X      |        |                                                                                   |
| Pickle    | X      | X      |                                                                                   |
| SQL       | X      | X      | `sqlalchemy`, `dbapi2` (inferred from the type of the `con` parameter)            |
| BigQuery  | X      | X      |                                                                                   |
| dict      | X      | X      |                                                                                   |
| records   | X      | X      |                                                                                   |
| string    |        | X      |                                                                                   |
| markdown  |        | X      |                                                                                   |
| xarray    |        | X      |                                                                                   |

At the time of writing this document, the `io/` module contains
close to 100,000 lines of Python, C and Cython code.

There are no objective criteria for when a format is included
in pandas, and the list above is mostly the result of a developer
being interested in implementing the connectors for a certain
format in pandas.

The number of existing formats available for data that can be processed with
pandas is constantly increasing, and it's difficult for pandas to keep up to
date even with popular formats. It could possibly make sense to have connectors
to PyArrow, PySpark, Iceberg, DuckDB, Hive, Polars, and many others.

At the same time, some of the formats are not frequently used, as shown in the
[2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html).
Those less popular formats include SPSS, SAS, Google BigQuery and
Stata. Note that only I/O formats (and not memory formats like records or xarray)
were included in the survey.

The maintenance cost of supporting all formats is not only in maintaining the
code and reviewing pull requests; there is also a significant cost in time
spent on CI systems installing dependencies, compiling code, running tests, etc.

In some cases, the main maintainers of some of the connectors are not part of
the pandas core development team, but people specialized in one of the formats
without commit rights.

## Proposal

While the current pandas approach has worked reasonably well, it is difficult
to find a stable solution where the maintenance burden on pandas is not
too large, while at the same time users can interact with all the different
formats and representations they are interested in, in an easy and intuitive way.

Third-party packages are already able to implement connectors to pandas, but
there are some limitations to it:

- Given the large number of formats supported by pandas itself, third-party
  connectors are likely seen as second class citizens, not important enough
  to be used, or not well supported.
- There is no standard API for external I/O connectors, and users need
  to learn each of them individually.

  > Review comment: IMHO the current proposal does not adequately address this and the other two limitations are perhaps less important.

- Method chaining is not possible with third-party I/O connectors to export
  data, unless authors monkey patch the `DataFrame` class, which should not
  be encouraged.

This document proposes to open the development of pandas I/O connectors to
third-party libraries in a standard way that overcomes those limitations.

### Proposal implementation

Implementing this proposal would not require major changes to pandas, and
the API defined next would be used.

#### User API

Users will be able to install third-party packages implementing pandas
connectors using the standard packaging tools (pip, conda, etc.). These
connectors should implement an entrypoint that pandas will use to
automatically create the corresponding methods `pandas.read_*` and
`pandas.DataFrame.to_*`. Arbitrary function or method names will not
be created by this interface, only the `read_*` and `to_*` pattern will
be allowed. By simply installing the appropriate packages users will
be able to use code like this:

```python
import pandas

df = pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")

df.to_hive(hive_conn, "hive_table")
```

> Review comment: 3rd party connectors for formats imported from objects in

This API allows for method chaining:

```python
(pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")
       .to_hive(hive_conn, "hive_table"))
```

The total number of I/O functions and methods is expected to be small, as users
in general use only a small subset of formats. The number could actually be
reduced from the current state if the less popular formats (such as SAS, SPSS,
BigQuery, etc.) are removed from the pandas core into third-party packages.
Moving these connectors is not part of this proposal, and will be discussed
later in a separate proposal.

> Review comment: maybe we could have the interface as experimental, move some or all of the pandas own connectors to the new interface and then finalize the interface thereafter. (Thinking here about EA where we have a published interface and yet still special case our own internal EAs)

#### Plugin registration

Third-party packages would implement an
[entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins)
to define the connectors that they implement, under a group `dataframe.io`.

> Review comment: Maybe
>
> Review comment: I was intentionally not making specific to pandas. :) One of the nice things of this interface is that connectors could be reused by any software. From other dataframe libraries, to databases, to plotting libraries, to your own system that works directly with Arrow/pandas... Imagine a connector is implemented for SAS files, and the connector returns an Arrow table. This could be reused by even a spreadsheet that wants to support users loading data from SAS.
>
> Review comment: what you're describing is an extension API for pyarrow

For example, a hypothetical project `pandas_duckdb` implementing a `read_duckdb`
function could use `pyproject.toml` to define the following entry point:

```toml
[project.entry-points."dataframe.io"]
read_duckdb = "pandas_duckdb:read_duckdb"
```

On import of the pandas module, it would read the entrypoint registry for the
`dataframe.io` group, and would dynamically create methods in the `pandas` and
`pandas.DataFrame` namespaces for them. Entry points whose names do not start
with `read_` or `to_` would make pandas raise an exception. This would guarantee
a reasonably consistent API among third-party connectors.

> Review comment: This would make
>
> Review comment: any idea how big this slowdown would be?
>
> Review comment: I guess it can be minimal if the user has just couple of connectors with lightweight dependencies installed. But there could possibly be a situation where dozens of connectors exist, and some of them are slow to load (maybe they have imports on their dependencies that are also slow, or other reasons). Unless connectors implement lazy imports, the loading of all connectors including their dependencies would happen at

> Review comment: I'm open to include it here, but I'd personally leave As said, I'm surely open to add it if there is interest and there are actual use cases, but I wouldn't include it in this proposal.
>
> Review comment: Ok, sorry, I just realized now that we have Do you think it can make sense that the connectors don't know anything about
>
> Review comment: we should probably mention the Series behavior in the proposal. I wonder whether our current Series methods would roundtrip if we

> Review comment: The method names are IMHO the trivial part of api consistency. The parameters are more important when addressing "users need to learn each of them individually" and getting the expected behavior consistent with other methods when specifying parameters. I don't think this along with a set of guidelines can "guarantee" this.

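The registration step described in this section could look roughly like the sketch below. It is an illustration under stated assumptions, not pandas code: `register_connectors` is a hypothetical helper, and `Frame` and `namespace` are stand-ins for `pandas.DataFrame` and the `pandas` module so that the example runs without pandas or any connector installed. In a real installation the entry points would come from the package metadata, for example via `importlib.metadata.entry_points(group="dataframe.io")` on Python 3.10 or later.

```python
import types
from importlib.metadata import EntryPoint

GROUP = "dataframe.io"
ALLOWED_PREFIXES = ("read_", "to_")


def register_connectors(entry_points, module_namespace, frame_class):
    """Attach `read_*` connectors to the namespace and `to_*` ones to the class."""
    for ep in entry_points:
        if not ep.name.startswith(ALLOWED_PREFIXES):
            raise ValueError(
                f"connector name {ep.name!r} must start with 'read_' or 'to_'"
            )
        func = ep.load()  # imports the module that implements the connector
        if ep.name.startswith("read_"):
            setattr(module_namespace, ep.name, func)
        else:
            # A plain function stored on a class becomes a method, so the
            # frame instance is passed as the connector's first argument.
            setattr(frame_class, ep.name, func)


# Stand-ins for the pandas module and pandas.DataFrame (assumptions).
class Frame(dict):
    pass


namespace = types.SimpleNamespace()

# Entry points built by hand, pointing at stdlib functions for illustration.
example_entry_points = [
    EntryPoint("read_jsontext", "json:loads", GROUP),
    EntryPoint("to_jsontext", "json:dumps", GROUP),
]
register_connectors(example_entry_points, namespace, Frame)
```

After this call, `namespace.read_jsontext(...)` and `Frame(...).to_jsontext()` are available, while an entry point named `from_x` would raise `ValueError`.
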
#### Internal API

Connectors would use one of two different interface options: a pandas `DataFrame`
or an Apache Arrow table.

> Review comment: if u insist on enumerating the advantages / disadvantages (dubious but ok) ; you are listing data frames first then you proceed to talk about arrow talk about them in the order listed

The Apache Arrow format would allow connectors to create the data without using
pandas, making them more robust and less likely to break when changes to
pandas internals happen. It would also allow other possible consumers of the
connectors to not have pandas as a dependency. Testing also becomes simpler by
using Apache Arrow, since connectors can be tested independently, and pandas does
not need to be tested for each connector. If the Apache Arrow specification is
respected on both sides, the communication between connectors and pandas is
guaranteed to work. If pandas eventually has a hard dependency on an Apache
Arrow implementation, this should be the preferred interface.

> Review comment: this is pretty dubious - a connector should not be using pandas internals at all (yes there are exceptions but not generally)
>
> Review comment: xref #52419
>
> Review comment: Thanks for the feedback, this made me realize of some poor wording. Let me use a particular example, I'll clarify the text later. A third-party developer implements a connector for a database, and in the connector he decides to create numpy arrays for each column fetched, and then call In the example the connector will break and will need to be updated. If we used Arrow as the interface between the connectors and pandas, the changes to the constructor won't affect the connector. Any change in pandas won't affect the connector. If we end up in a situation with tens of connectors, that seems to be a bigger issue. Probably I should have called this That's why I think Arrow is a better and more robust interface.
>
> Review comment: This shifts the problem to Arrow, if Arrow would decide to change the mechanism how to create a Table, the connector would face the same problem. If we think this is really necessary, we could relatively easy define a schema in which we'd expect the NumPy arrays and keep this stable, e.g. the dict proposal from above.

Allowing connectors to use pandas dataframes directly means users do not have to
depend on PyArrow for connectors that do not use an Apache Arrow object. It also
helps move existing connectors to this new API, since they are using pandas
dataframes as an exchange object now. It has the disadvantages stated in the
previous paragraph, and a future proposal may be created to discuss deprecating
pandas dataframes as a possible interface for connectors.

> Review comment: not sure why this would ever be true
>
> Review comment: I left this open as a possibility. As said in the other comment, Arrow seems more robust. I can delete this comment, but I thought it could be worth noting that this can be discussed in the future.

In case a `read` method returned something other than a pandas `DataFrame` or a
PyArrow table, pandas would raise an exception. pandas would expect all `to_`
methods to have `table: pyarrow.Table | pandas.DataFrame` as the first parameter,
and it would raise an exception otherwise. The `table` parameter would be exposed
as the `self` parameter of the `to_*` method in pandas.

> Review comment: why do the

> Review comment: pandas.Series should also have to_ methods too? In which case the pyarrow object should be a pyarrow.Array?
>
> Review comment: it would be very weird for a pd.read_foo to return anything other than a pandas object
>
> Review comment: Thanks for the comment, this is helpful. I don't think I've been very clear with something. I fully agree with what you say, and The idea for connectors that return an Arrow table is that pandas will call them, get the table, convert it to the dataframe, and then return the dataframe to the user. Let's say we'll probably have a decorator to decorate any connector function that will validate that we're happy with it (naming conventions...) and will handle the transformation of Arrow to pandas for the connectors that use Arrow. It could possibly add generic parameters I guess too, or whatever we want. Does this make more sense?
>
> Review comment: Why do we need to say anything about what's going on inside a connector?
>
> Review comment: This is not guaranteed. I believe when reading in chunks (w/ chunksize), read_csv returns a TextFileReader (basically an iterator) instead.
>
> Review comment: Third-parties may also be interested in doing this.
>
> Review comment: The proposal doesn't, does it? And the decorator I described would be in pandas, not in the connector. Very good point. Surely worth considering eventually, but feels like we're close to rejecting this proposal, not sure if a good time to add more things to it. ;) And even if we move forward with it, I'd start by defining a proposal to load all data at once, which is being challenging enough, and would leave this for a follow up.
>
> Review comment: Is the decorator described/mentioned in the actual PDEP? As I understand it there is a ton of discussion of arrow in the PDEP that is only relevant to the implementations of connectors. This PDEP would be better if you stripped out the arrow stuff. Then if/when it is implemented, you can implement this decorator.
>
> Review comment: I don't see this proposal adding any value if it doesn't even implement an optional Arrow interface. I can already implement a third-party connector that returns a pandas DataFrame and injects it into the A key idea of this proposal is that the connectors are also usable by other projects of the ecosystem. And if there is not even agreement in supporting Arrow optionally (I still have a strong preference by using only Arrow, even if I changed the proposal), I don't think this is going anywhere.

In case the Apache Arrow interface is used, metadata not supported by Apache
Arrow may be provided by users, for example the column to use for row indices
or the data type backend to use in the object being created.

> Review comment: This seems quite a restriction over some of the other options such as for example the dataframe interchange protocol where these issues have already been considered and a solution proposed. If the Arrow standard does not meet our requirements, does this not make our api proposal another protocol?

This would be managed independently from the connectors. Given the previous
example, a new argument `index_col` could be added directly into pandas to the
function or method automatically generated from the entrypoint. Since this would
apply to all functions and methods automatically generated, it would also improve
the consistency of pandas connectors. For example:

```python
def read_myformat(index_col=None, *args, **kwargs):
    # The third-party connector doesn't need to know about functionality
    # specific to pandas like the row index
    arrow_table = third_party_connector.from_myformat(*args, **kwargs)

    df = convert_arrow_table_to_pandas_dataframe(arrow_table)

    # Transformations to the dataframe with custom parameters are possible
    if index_col is not None:
        df = df.set_index(index_col)

    return df
```

> Review comment (on the function signature): This could be expanded to include many other parameters to improve consistency? (does *args work with index_col being optional?)

> Review comment (on the `index_col` check): Should the connector have control (if it wanted to) of this if a DataFrame is the exchange object instead of an Arrow Table?
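
A runnable variant of this wrapper, along the lines of the decorator mentioned in the review discussion, might look like the sketch below. It is an assumption-laden illustration rather than proposed pandas code: `make_pandas_reader` is a hypothetical name, `index_col` is made keyword-only so that positional arguments are forwarded to the connector untouched, and the result is treated as an Arrow table whenever it exposes `to_pandas()` (as `pyarrow.Table` does).

```python
import functools


def make_pandas_reader(connector):
    """Wrap a connector so pandas-specific options are handled outside it."""

    @functools.wraps(connector)
    def reader(*args, index_col=None, **kwargs):
        result = connector(*args, **kwargs)
        # Arrow tables expose `to_pandas()`; a DataFrame is passed through.
        frame = result.to_pandas() if hasattr(result, "to_pandas") else result
        if index_col is not None:
            frame = frame.set_index(index_col)
        return frame

    return reader
```

With this shape the connector never sees `index_col`, which matches the idea that functionality specific to pandas stays on the pandas side.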

#### Connector guidelines

In order to provide a better and more consistent experience to users, guidelines
will be created to unify terminology and behavior. Some of the topics to unify are
defined next.

**Existence and naming of columns**, since many connectors are likely to provide
similar features, like loading only a subset of columns in the data, or dealing
with file names. Examples of recommendations to connector developers:

- `columns`: Use this argument to let the user load a subset of columns. Allow a
  list or tuple.

  > Review comment: we should have specific requirements here rather than just guidelines otherwise this is going to be a mess

- `path`: Use this argument if the dataset is a file on disk. Allow a string,
  a `pathlib.Path` object, or a file descriptor. For a string object, allow URLs that
  will be automatically downloaded, compressed files that will be automatically
  decompressed, etc. A library can be provided to deal with those in an easier and
  more consistent way.
- `schema`: For datasets that don't have a schema (e.g. `csv`), allow providing an
  Apache Arrow schema instance, and automatically infer types if not provided.

Note that the above are only examples of guidelines for illustration, and not
a proposal of the guidelines.
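
As an illustration of the kind of shared helper the `path` recommendation hints at, the sketch below normalizes a few input types into a binary file object. Everything here is an assumption made for the example: `open_source` is a hypothetical name, compression is guessed from the file suffix only, and URLs and integer file descriptors are deliberately left out.

```python
import bz2
import gzip
import io
import os
import pathlib

# Suffix-based choice of opener; anything else is opened as a plain file.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_source(path):
    """Return a binary file object for a path, or pass an open file through."""
    if isinstance(path, io.IOBase):
        # Already an open file object: hand it back unchanged.
        return path
    path = pathlib.Path(os.fspath(path))
    opener = _OPENERS.get(path.suffix.lower(), open)
    return opener(path, "rb")
```

A connector could then call `open_source(path)` once and work with the returned file object regardless of what the user passed in.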
|
||||||||||||||
**Guidelines to avoid name conflicts**. Since it is expected that more than one | ||||||||||||||
implementation exists for certain formats, as it already happens, guidelines on | ||||||||||||||
how to name connectors would be created. The easiest approach is probably to use | ||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible for pandas to raise if it detects that 2 entrypoints are trying to register the same named There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, fully agree, I thought it was in the proposal that pandas would raise if more than one connector with the same name exists, but seems like it's not there, I can add it. But I do think some naming guidelines could still be helpful. Like recommending to use something like |
||||||||||||||
the format `from_<format>_<implementation-id>` / `to_<format>_<implementation-id>`. | ||||||||||||||
|
||||||||||||||
For example, a `csv` loader based on PyArrow could be named `from_csv_pyarrow`, | ||||||||||||||
As we currently use e.g. would we have For backwards compatibility, I expect we would retain the If that is the case, I think we should encourage the use of |
||||||||||||||
and an implementation that does not infer types and raises an exception in case | ||||||||||||||
of mistake or ambiguity could be named `from_csv_strict`. Exact guidelines | ||||||||||||||
would be developed independently from this proposal. | ||||||||||||||
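
A minimal sketch of how connector names could be checked when they are registered, assuming connectors are discovered as entry-point-like objects that expose a `name` attribute and a `load()` method (as `importlib.metadata.EntryPoint` objects do). The `register_connectors` function and the `ConnectorNameConflict` exception are hypothetical names used only for illustration; they are not part of this proposal.

```python
from typing import Callable, Dict, Iterable


class ConnectorNameConflict(Exception):
    """Hypothetical error raised when two connectors share a name."""


def register_connectors(entry_points: Iterable) -> Dict[str, Callable]:
    """Build a name -> callable registry, refusing duplicate names.

    Each item in ``entry_points`` is assumed to have a ``name`` attribute
    and a ``load()`` method returning the connector function.
    """
    registry: Dict[str, Callable] = {}
    for entry_point in entry_points:
        if entry_point.name in registry:
            # Two installed packages claim the same connector name.
            raise ConnectorNameConflict(
                f"More than one connector is registered as {entry_point.name!r}"
            )
        registry[entry_point.name] = entry_point.load()
    return registry
```

With a naming scheme such as `from_<format>_<implementation-id>`, two `csv` loaders would register as, say, `from_csv_pyarrow` and `from_csv_strict`, and would not trigger the error.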
I think this proposal needs to address the format/engine issue and not defer it. |
||||||||||||||
|
||||||||||||||
**Connector registry and documentation**. To simplify the discovery of connectors | ||||||||||||||
and their documentation, connector developers can be encouraged to register their | ||||||||||||||
projects in a central location, and to use a standard structure for documentation. | ||||||||||||||
This would allow the creation of a unified website to find the available | ||||||||||||||
connectors, and their documentation. It would also allow customizing the | ||||||||||||||
documentation for specific implementations, so that it includes their final API | ||||||||||||||
and any arguments specific to the implementation. In the case of pandas, it | ||||||||||||||
would allow adding arguments such as `index_col` to all loader methods, and | ||||||||||||||
potentially building the API reference of certain third-party connectors as part | ||||||||||||||
of pandas' own documentation. That may or may not be a good idea, but | ||||||||||||||
standardizing the documentation of connectors would allow it. | ||||||||||||||
hmmm. A big part of the argument for this proposal is the maintenance burden on the core project. I'm not sure what is being proposed here? Would the pandas team manage this "central location"? |
||||||||||||||
|
||||||||||||||
### Connector examples | ||||||||||||||
|
||||||||||||||
This section lists specific examples of connectors that could immediately | ||||||||||||||
benefit from this proposal. | ||||||||||||||
|
||||||||||||||
**PyArrow** currently provides `Table.from_pandas` and `Table.to_pandas`. | ||||||||||||||
need to clarify an earlier comment about allowing DataFrame as the interchange object both ways for this to make sense. |
||||||||||||||
With the new interface, it could also register `DataFrame.from_pyarrow` | ||||||||||||||
and `DataFrame.to_pyarrow`, so pandas users can use the converters with | ||||||||||||||
the interface they are used to, when PyArrow is installed in the environment. | ||||||||||||||
Better integration with PyArrow tables was discussed in | ||||||||||||||
[#51760](https://github.com/pandas-dev/pandas/issues/51760). | ||||||||||||||
|
||||||||||||||
_Current API_: | ||||||||||||||
|
||||||||||||||
```python | ||||||||||||||
pyarrow.Table.from_pandas(table.to_pandas() | ||||||||||||||
.query('my_col > 0')) | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
_Proposed API_: | ||||||||||||||
|
||||||||||||||
```python | ||||||||||||||
(pandas.DataFrame.io.from_pyarrow(table) | ||||||||||||||
.query('my_col > 0') | ||||||||||||||
.io.to_pyarrow()) | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
**Polars**, **Vaex** and other dataframe frameworks could benefit from | ||||||||||||||
third-party projects that make the interoperability with pandas use a | ||||||||||||||
more explicit API. Integration with Polars was requested in | ||||||||||||||
[#47368](https://github.com/pandas-dev/pandas/issues/47368). | ||||||||||||||
|
||||||||||||||
_Current API_: | ||||||||||||||
|
||||||||||||||
```python | ||||||||||||||
polars.DataFrame(df.to_pandas() | ||||||||||||||
This example is so similar to the above that I'm not sure it adds value. Can they be combined? |
||||||||||||||
.query('my_col > 0')) | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
_Proposed API_: | ||||||||||||||
|
||||||||||||||
```python | ||||||||||||||
(pandas.DataFrame.io.from_polars(df) | ||||||||||||||
.query('my_col > 0') | ||||||||||||||
.io.to_polars()) | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
**DuckDB** provides an out-of-core engine able to push predicates before | ||||||||||||||
the data is loaded, making much better use of memory and significantly | ||||||||||||||
decreasing loading time. pandas, because of its eager nature, is not able | ||||||||||||||
to easily implement this itself, but could benefit from a DuckDB loader. | ||||||||||||||
The loader can already be implemented inside pandas (it has already been | ||||||||||||||
proposed in [#45678](https://github.com/pandas-dev/pandas/issues/45678)), | ||||||||||||||
or as a third-party extension with an arbitrary API. But this proposal would | ||||||||||||||
allow the creation of a third-party extension with a standard and intuitive API: | ||||||||||||||
no guarantee? from comments in that and related issues it does not appear there are any objections to using entrypoints for the read_sql method. This would create a much more consistent interface. I don't think DuckDB should be used as any justification for this proposal at this time. (similarly the Polars issue has been resolved? So again does not help justify this proposal IMHO) |
||||||||||||||
|
||||||||||||||
```python | ||||||||||||||
pandas.DataFrame.io.from_duckdb("SELECT * " | ||||||||||||||
                                "FROM 'dataset.parquet' " | ||||||||||||||
                                "WHERE my_col > 0") | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
**Out-of-core algorithms** push some operations like filtering or grouping | ||||||||||||||
to the loading of the data. While this is not currently possible, connectors | ||||||||||||||
implementing out-of-core algorithms could be developed using this interface. | ||||||||||||||
|
||||||||||||||
**Big data** systems such as Hive, Iceberg, Presto, etc. could benefit | ||||||||||||||
from a standard way to load data to pandas. Also, regular **SQL databases** | ||||||||||||||
that can return their query results as Arrow would benefit from better | ||||||||||||||
and faster connectors than the existing ones based on SQLAlchemy and | ||||||||||||||
Python structures. | ||||||||||||||
|
||||||||||||||
Any other format, including **domain-specific formats**, could easily | ||||||||||||||
implement pandas connectors with a clear and intuitive API. | ||||||||||||||
|
||||||||||||||
## Proposal extensions | ||||||||||||||
|
||||||||||||||
The scope of the current proposal is limited to the registration of functions | ||||||||||||||
defined by third-party projects, if an entrypoint is defined. | ||||||||||||||
|
||||||||||||||
Any changes to the current connectors of pandas (e.g. `read_csv`, | ||||||||||||||
`from_records`, etc.) or their migration to the new system are out of scope for | ||||||||||||||
this proposal, but the following tasks can be considered for future work and proposals: | ||||||||||||||
|
||||||||||||||
- Move some of the existing I/O connectors out of the pandas repository and | ||||||||||||||
into their own third-party projects. This would require a transition period | ||||||||||||||
to let users know that future versions of pandas will require a dependency | ||||||||||||||
installed for a particular connector to exist. | ||||||||||||||
- Implement connectors, using the new interface, for some of the data | ||||||||||||||
structures that the `DataFrame` constructor accepts. | ||||||||||||||
|
||||||||||||||
## PDEP-9 History | ||||||||||||||
|
||||||||||||||
- 5 March 2023: Initial version |