PDEP-9: Allow third-party projects to register pandas connectors with a standard API #51799

Merged
merged 21 commits into from
Jun 13, 2023
Changes from 7 commits
5dbdde9
PDEP-9: pandas I/O connectors as extensions
datapythonista Feb 27, 2023
730df18
Merge remote-tracking branch 'upstream/main' into pdep9
datapythonista Mar 5, 2023
23b934f
Final draft to be proposed
datapythonista Mar 5, 2023
de3a17b
Merge remote-tracking branch 'upstream/main' into pdep9
datapythonista Mar 7, 2023
da784ec
Address comments from code reviews, mostly by extending the proposal …
datapythonista Mar 7, 2023
f475350
Merge remote-tracking branch 'upstream/main' into pdep9
datapythonista Apr 6, 2023
4a8ba96
Keep current I/O API and allow pandas as an interface
datapythonista Apr 6, 2023
6ad6a9d
Merge remote-tracking branch 'upstream/main' into pdep9
datapythonista Apr 7, 2023
5cb47d9
Rejecting
datapythonista Apr 7, 2023
68ca3de
Reorder interfaces
datapythonista Apr 7, 2023
150d1d1
Update web/pandas/pdeps/0009-io-extensions.md
datapythonista Apr 29, 2023
6eea8a8
Use dataframe interchange protocol
datapythonista May 30, 2023
5665dc7
Merge branch 'pdep9' of github.com:datapythonista/pandas into pdep9
datapythonista May 30, 2023
40ebacc
typo
datapythonista May 30, 2023
aed569f
Merge branch 'main' into pdep9
datapythonista May 30, 2023
eb7c6f0
Make users load modules explicitly
datapythonista May 30, 2023
14a2f4a
Merge branch 'pdep9' of github.com:datapythonista/pandas into pdep9
datapythonista May 30, 2023
8050853
Update web/pandas/pdeps/0009-io-extensions.md
datapythonista Jun 7, 2023
5cb23dd
Add limitations section
datapythonista Jun 7, 2023
2af8577
Merge remote-tracking branch 'upstream/main' into pdep9
datapythonista Jun 13, 2023
ccb9674
Rejecting PDEP
datapythonista Jun 13, 2023
395 changes: 395 additions & 0 deletions web/pandas/pdeps/0009-io-extensions.md
@@ -0,0 +1,395 @@
# PDEP-9: Allow third-party projects to register pandas connectors with a standard API

- Created: 5 March 2023
- Status: Under discussion
- Discussion: [#51799](https://github.com/pandas-dev/pandas/pull/51799)
- Author: [Marc Garcia](https://github.com/datapythonista)
- Revision: 1

## PDEP Summary

This document proposes that third-party projects implementing I/O or memory
connectors can register them using Python's entrypoint system, and make them
available to pandas users with the existing I/O interface. For example:

```python
import pandas

df = pandas.read_duckdb("SELECT * FROM 'my_dataset.parquet';")

df.to_deltalake('/delta/my_dataset')
```

This would make it easy to extend the existing set of connectors, adding
support for new formats and database engines, data lake technologies,
out-of-core connectors, the new ADBC interface and at the same time reduce the
maintenance cost of the pandas core.
Member

Suggested change
- maintenance cost of the pandas core.
+ maintenance cost of the pandas core codebase.


## Current state

pandas supports importing and exporting data from different formats using
I/O connectors, currently implemented in `pandas/io`, as well as connectors
to in-memory structures, like Python structures or other library formats.
In many cases, those connectors wrap an existing Python library, while in
some others, pandas implements the logic to read and write to a particular
format.

In some cases, different engines exist for the same format. The API to use
those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to
import data, and `DataFrame.to_<format>(engine='<engine-name>', ...)` to
export data.

For objects exported to memory (like a Python dict) the API is the same as
for I/O, `DataFrame.to_<format>(...)`. For formats imported from objects in
memory, the API is different using the `from_` prefix instead of `read_`,
`DataFrame.from_<format>(...)`.

In some cases, the pandas API provides `DataFrame.to_*` methods that are not
used to export the data to a disk or memory object, but instead to transform
the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.
Member

related: I've recently been thinking that DataFrame/Series methods that only operate on the index/columns might make sense to put in an accessor/namespace


Dependencies of the connectors are not loaded by default, and will be
imported when the connector is used. If the dependencies are not installed,
an `ImportError` is raised.

```python
>>> pandas.read_gbq(query)
Traceback (most recent call last):
...
ImportError: Missing optional dependency 'pandas-gbq'.
pandas-gbq is required to load data from Google BigQuery.
See the docs: https://pandas-gbq.readthedocs.io.
Use pip or conda to install pandas-gbq.
```
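
The lazy-import behaviour described above can be sketched with the standard library alone. This is an illustration and not pandas' actual implementation; the helper name `import_optional` and its `purpose` parameter are assumptions made for the example.

```python
import importlib


def import_optional(module_name, purpose):
    """Import a module only when it is needed (hypothetical helper).

    Raises ImportError with an actionable message if the module is missing.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Missing optional dependency '{module_name}'. "
            f"{module_name} is required to {purpose}. "
            f"Use pip or conda to install {module_name}."
        ) from exc


# The dependency is imported at call time, not when this file is imported.
json_module = import_optional("json", "parse JSON documents")
print(json_module.loads("[1, 2, 3]"))
```

Note that the import name and the name of the package to install can differ in practice, so a real helper would likely need to map one to the other.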

### Supported formats

The list of formats can be found in the
[IO guide](https://pandas.pydata.org/docs/dev/user_guide/io.html).
A more detailed table, including in-memory objects and I/O connectors in the
DataFrame styler, is presented next:

| Format | Reader | Writer | Engines |
|--------------|--------|--------|-----------------------------------------------------------------------------------|
| CSV | X | X | `c`, `python`, `pyarrow` |
| FWF | X | | `c`, `python`, `pyarrow` |
| JSON | X | X | `ujson`, `pyarrow` |
| HTML | X | X | `lxml`, `bs4/html5lib` (parameter `flavor`) |
| LaTeX | | X | |
| XML | X | X | `lxml`, `etree` (parameter `parser`) |
| Clipboard | X | X | |
| Excel | X | X | `xlrd`, `openpyxl`, `odf`, `pyxlsb` (each engine supports different file formats) |
| HDF5 | X | X | |
| Feather | X | X | |
| Parquet | X | X | `pyarrow`, `fastparquet` |
| ORC | X | X | |
| Stata | X | X | |
| SAS | X | | |
| SPSS | X | | |
| Pickle | X | X | |
| SQL | X | X | `sqlalchemy`, `dbapi2` (inferred from the type of the `con` parameter) |
| BigQuery | X | X | |
| dict | X | X | |
| records | X | X | |
| string | | X | |
| markdown | | X | |
| xarray | | X | |

At the time of writing this document, the `io/` module contains
close to 100,000 lines of Python, C and Cython code.

There are no objective criteria for when a format is included
in pandas, and the list above is mostly the result of a developer
being interested in implementing the connectors for a certain
format in pandas.

The number of existing formats available for data that can be processed with
pandas is constantly increasing, and it's difficult for pandas to keep up to
date even with popular formats. It could possibly make sense to have connectors
to PyArrow, PySpark, Iceberg, DuckDB, Hive, Polars, and many others.

At the same time, some of the formats are not frequently used as shown in the
[2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html).
Those less popular formats include SPSS, SAS, Google BigQuery and
Stata. Note that only I/O formats (and not memory formats like records or xarray)
were included in the survey.

The maintenance cost of supporting all formats lies not only in maintaining the
code and reviewing pull requests, but also in the significant time
spent on CI systems installing dependencies, compiling code, running tests, etc.

In some cases, the main maintainers of some of the connectors are not part of
the pandas core development team, but people specialized in one of the formats
without commit rights.

## Proposal

While the current pandas approach has worked reasonably well, it is difficult
to find a stable solution where the maintenance incurred in pandas is not
too big, while at the same time users can interact with all different formats
and representations they are interested in, in an easy and intuitive way.

Third-party packages are already able to implement connectors to pandas, but
there are some limitations to it:

- Given the large number of formats supported by pandas itself, third-party
connectors are likely seen as second-class citizens, not important enough
to be used, or not well supported.
- There is no standard API for external I/O connectors, and users need
to learn each of them individually.
Member

IMHO the current proposal does not adequately address this and the other two limitations are perhaps less important.

- Method chaining is not possible with third-party I/O connectors to export
data, unless authors monkey patch the `DataFrame` class, which should not
be encouraged.

This document proposes to open the development of pandas I/O connectors to
third-party libraries in a standard way that overcomes those limitations.

### Proposal implementation

Implementing this proposal would not require major changes to pandas, and
the API defined next would be used.

#### User API

Users will be able to install third-party packages implementing pandas
connectors using the standard packaging tools (pip, conda, etc.). These
connectors should implement an entrypoint that pandas will use to
automatically create the corresponding methods `pandas.read_*` and
`pandas.DataFrame.to_*`. Arbitrary function or method names will not
be created by this interface, only the `read_*` and `to_*` pattern will
be allowed. By simply installing the appropriate packages users will
Member

3rd party connectors for formats imported from objects in
memory should be able to register the from_ prefix instead of read_ to be consistent with pandas own connectors?

be able to use code like this:

```python
import pandas

df = pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")

df.to_hive(hive_conn, "hive_table")
```

This API allows for method chaining:

```python
(pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")
.to_hive(hive_conn, "hive_table"))
```

The total number of I/O functions and methods is expected to be small, as users
in general use only a small subset of formats. The number could actually be
reduced from the current state if the less popular formats (such as SAS, SPSS,
BigQuery, etc.) are removed from the pandas core into third-party packages.
Moving these connectors is not part of this proposal, and will be discussed
later in a separate proposal.
Member

maybe we could have the interface as experimental, move some or all of the pandas own connectors to the new interface and then finalize the interface thereafter.

(Thinking here about EA where we have a published interface and yet still special case our own internal EAs)


#### Plugin registration

Third-party packages would implement an
[entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins)
to define the connectors that they implement, under a group `dataframe.io`.
Member

Maybe pandas.dataframe.io so it's scoped specially to pandas?

Member Author

I was intentionally not making specific to pandas. :)

One of the nice things of this interface is that connectors could be reused by any software. From other dataframe libraries, to databases, to plotting libraries, to your own system that works directly with Arrow/pandas...

Imagine a connector is implemented for SAS files, and the connector returns an Arrow table. This could be reused by even a spreadsheet that wants to support users loading data from SAS.

Member

what you're describing is an extension API for pyarrow


For example, a hypothetical project `pandas_duckdb` implementing a `read_duckdb`
function, could use `pyproject.toml` to define the next entry point:

```toml
[project.entry-points."dataframe.io"]
read_duckdb = "pandas_duckdb:read_duckdb"
```

On import of the pandas module, it would read the entrypoint registry for the
Member Author

This would make import pandas slower, not sure if it can be done in a lazy way. For people using autocomplete when coding for example, I think the DataFrame.io should be populated before __getattr__ is called. But maybe it can just be populated when any of __dir__ or __getattr__ is called, and we get the best of both worlds? Or too complex?

Member

any idea how big this slowdown would be?

Member Author

I guess it can be minimal if the user has just a couple of connectors with lightweight dependencies installed. But there could possibly be a situation where dozens of connectors exist, and some of them are slow to load (maybe they have imports on their dependencies that are also slow, or other reasons). Unless connectors implement lazy imports, the loading of all connectors including their dependencies would happen at import pandas.
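
The idea raised in this thread, populating the connectors only when `__dir__` or `__getattr__` is first called, could look roughly like the sketch below. It is an illustration under stated assumptions and not part of the PDEP: `LazyConnectors` and the `discover` callable are hypothetical names, and a plain object stands in for the `pandas` namespace.

```python
class LazyConnectors:
    """Hypothetical namespace that discovers connectors on first use."""

    def __init__(self, discover):
        # `discover` is any callable returning a {name: function} mapping.
        self._discover = discover
        self._registry = None

    def _load(self):
        # Discovery runs at most once, and only when something asks for it.
        if self._registry is None:
            self._registry = dict(self._discover())
        return self._registry

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._load()[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Autocomplete sees the connectors, which triggers discovery.
        return sorted(set(super().__dir__()) | set(self._load()))


calls = []


def discover():
    calls.append("discovered")
    return {"read_demo": lambda: "loaded"}


namespace = LazyConnectors(discover)
print(calls)                  # discovery has not run yet
print(namespace.read_demo())  # first access triggers discovery
print(calls)
```

This only defers the cost to first use. If `discover` imports every connector and its dependencies, that cost is still paid in full at that point.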

`dataframe.io` group, and would dynamically create methods in the `pandas` and
`pandas.DataFrame` namespaces for them. Methods not starting with `read_` or `to_`
Member

pandas.Series should also have to_ methods too?

Member Author

I'm open to include it here, but I'd personally leave Series out of this proposal. If later we see it's useful, we can surely review this proposal or create a new one for it. But personally, I don't see too many use cases, the current Series.to_ methods are mostly internal (to_period, to_timestamp, besides to_numpy and to_list), and it can be confusing adding third-party things to those in my opinion.

As said, I'm surely open to add it if there is interest and there are actual use cases, but I wouldn't include it in this proposal.

Member Author

Ok, sorry, I just realized now that we have to_csv, to_xarray and most of them also for Series, never used one of those. I got confused because in the API documentation the internal ones and the rest are not together...

Do you think it can make sense that the connectors don't know anything about Series, and our Series methods simply call to_frame() before calling the connector? Or do you see important cases where there should be an important distinction between a Series and a one-column DataFrame? I can think of JSON as a case where there would be a subtle difference, but not much more.

Member

we should probably mention the Series behavior in the proposal. I wonder whether our current Series methods would roundtrip if we to_frame them.

would make pandas raise an exception. This would guarantee a reasonably
consistent API among third-party connectors.
Member

The method names are IMHO the trivial part of api consistency. The parameters are more important when addressing "users need to learn each of them individually" and getting the expected behavior consistent with other methods when specifying parameters. I don't think this along with a set of guidelines can "guarantee" this.
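
The registration step described in the proposal can be sketched as follows. This is a minimal illustration, not the pandas implementation: `build_registry` is a hypothetical name, the duplicate-name check comes from the review discussion rather than the PDEP text, and the entry point is built by hand and pointed at `json:loads` so the example runs without any connector installed. In a real setting the entry points would presumably come from `importlib.metadata.entry_points(group="dataframe.io")` (Python 3.10+).

```python
from importlib.metadata import EntryPoint

ALLOWED_PREFIXES = ("read_", "to_")


def build_registry(entry_points):
    """Validate entry point names and load the functions they point to."""
    registry = {}
    for entry_point in entry_points:
        if not entry_point.name.startswith(ALLOWED_PREFIXES):
            raise ValueError(
                f"connector name {entry_point.name!r} must start with "
                "'read_' or 'to_'"
            )
        if entry_point.name in registry:
            raise ValueError(
                f"connector {entry_point.name!r} is registered more than once"
            )
        # load() imports the module and returns the referenced object.
        registry[entry_point.name] = entry_point.load()
    return registry


# Hand-built entry point standing in for an installed connector.
demo = EntryPoint("read_demo", "json:loads", "dataframe.io")
registry = build_registry([demo])
print(registry["read_demo"]('{"a": [1, 2]}'))
```

As the comment above points out, checks like these only cover naming. They say nothing about whether parameters behave consistently across connectors.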


#### Internal API

Connectors would use one of two different interface options: a pandas `DataFrame`
or an Apache Arrow table.
Contributor

if you insist on enumerating the advantages / disadvantages (dubious but ok) ; you are listing data frames first then you proceed to talk about arrow

talk about them in the order listed


The Apache Arrow format would mean that connectors do not need to use pandas to
create the data, making them more robust and less likely to break when changes to
pandas internals happen. It would also allow other possible consumers of the
Contributor

this is pretty dubious - a connector should not be using pandas internals at all (yes there are exceptions but not generally)

Member

xref #52419

Member Author

Thanks for the feedback, this made me realize there is some poor wording. Let me use a particular example, I'll clarify the text later.

A third-party developer implements a connector for a database, and in the connector he decides to create numpy arrays for each column fetched, and then call DataFrame({col_name_1: data_col_1, ...}). And some months later we decide that allowing so many kinds of inputs for the DataFrame constructor is a bad practice, and we stop allowing them in favor of a different API (e.g. pandas.read_dict_of_numpy_arrays()).

In the example the connector will break and will need to be updated. If we used Arrow as the interface between the connectors and pandas, the changes to the constructor won't affect the connector. Any change in pandas won't affect the connector. If we end up in a situation with tens of connectors, that seems to be a bigger issue.

Probably I should have called this changes to the pandas APIs more than pandas internals, but I had in mind that connector developers could possibly use private APIs too.

That's why I think Arrow is a better and more robust interface.

Member

This shifts the problem to Arrow, if Arrow would decide to change the mechanism how to create a Table, the connector would face the same problem.

If we think this is really necessary, we could relatively easily define a schema in which we'd expect the NumPy arrays and keep this stable, e.g. the dict proposal from above.

connectors to not have pandas as a dependency. Testing also becomes simpler by
using Apache Arrow, since connectors can be tested independently, and pandas does
not need to be tested for each connector. If the Apache Arrow specification is
respected on both sides, the communication between connectors and pandas is
guaranteed to work. If pandas eventually has a hard dependency on an Apache
Arrow implementation, this should be the preferred interface.

Allowing connectors to use pandas dataframes directly makes users not have to
depend on PyArrow for connectors that do not use an Apache Arrow object. It also
helps move existing connectors to this new API, since they are using pandas
dataframes as an exchange object now. It has the disadvantages stated in the
previous paragraph, and a future proposal may be created to discuss deprecating
pandas dataframes as a possible interface for connectors.
Contributor

not sure why this would ever be true

Member Author

I left this open as a possibility. As said in the other comment, Arrow seems more robust. I can delete this comment, but I thought it could be worth noting that this can be discussed in the future.


In case a `read` method returned something different than a PyArrow table,
pandas would raise an exception. pandas would expect all `to_` methods to have
Member

why do the read methods not also allow a pandas.DataFrame?

`table: pyarrow.Table | pandas.DataFrame` as the first parameter, and it would
Member

pandas.Series should also have to_ methods too? In which case the pyarrow object should be a pyarrow.Array?

Member

it would be very weird for a pd.read_foo to return anything other than a pandas object

Member Author

Thanks for the comment, this is helpful. I don't think I've been very clear with something.

I fully agree with what you say, and pandas.read_* should always return a pandas DataFrame. This refers to connector.read_* which is an internal connector function that users won't use directly.

The idea for connectors that return an Arrow table is that pandas will call them, get the table, convert it to the dataframe, and then return the dataframe to the user. Let's say we'll probably have a decorator to decorate any connector function that will validate that we're happy with it (naming conventions...) and will handle the transformation of Arrow to pandas for the connectors that use Arrow. It could possibly add generic parameters I guess too, or whatever we want.

Does this make more sense?

Member

Why do we need to say anything about whats going on inside a connector?

Member

> I fully agree with what you say, and pandas.read_* should always return a pandas DataFrame. This refers to connector.read_* which is an internal connector function that users won't use directly.

This is not guaranteed. I believe when reading in chunks (with `chunksize`), `read_csv` returns a `TextFileReader` (basically an iterator) instead.

Member

Third-parties may also be interested in doing this.

Member Author

> Why do we need to say anything about whats going on inside a connector?

The proposal doesn't, does it? And the decorator I described would be in pandas, not in the connector.

> Third-parties may also be interested in doing this.

Very good point. Surely worth considering eventually, but feels like we're close to rejecting this proposal, not sure if it's a good time to add more things to it. ;) And even if we move forward with it, I'd start by defining a proposal to load all data at once, which is challenging enough, and would leave this for a follow up.

Member

> The proposal doesn't, does it? And the decorator I described would be in pandas, not in the connector.

Is the decorator described/mentioned in the actual PDEP? As I understand it there is a ton of discussion of arrow in the PDEP that is only relevant to the implementations of connectors.

This PDEP would be better if you stripped out the arrow stuff. Then if/when it is implemented, you can implement this decorator.

Member Author

I don't see this proposal adding any value if it doesn't even implement an optional Arrow interface.

I can already implement a third-party connector that returns a pandas DataFrame and injects it into the pandas module or the DataFrame class. The only difference would be that it'd save an import.

A key idea of this proposal is that the connectors are also usable by other projects of the ecosystem. And if there is not even agreement in supporting Arrow optionally (I still have a strong preference for using only Arrow, even if I changed the proposal), I don't think this is going anywhere.

raise an exception otherwise. The `table` parameter would be exposed as the
`self` parameter of the `to_*` method in pandas.
Member

self is the object the method is attached to. So this to me is confusing. Perhaps we need to also pad out the marshalling from the connector function to the DataFrame.to_* methods just like you've done for read_myformat below.
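
One way to read the marshalling asked about here is sketched below: the connector's first parameter (`table` in the proposal) receives the object that the generated `to_*` method is called on. Everything in the sketch is an assumption for illustration: `Frame` is a stand-in for `pandas.DataFrame`, and `attach_writer` and `to_demo` are hypothetical names. For an Arrow-based connector, pandas would presumably convert the dataframe to an Arrow table before forwarding it, which the sketch does not show.

```python
import functools


class Frame:
    """Stand-in for pandas.DataFrame so the sketch runs without pandas."""

    def __init__(self, data):
        self.data = data


def attach_writer(cls, name, connector_func):
    """Expose connector_func(table, ...) as the method cls.<name>(...)."""
    if not name.startswith("to_"):
        raise ValueError(f"writer name must start with 'to_': {name!r}")
    if hasattr(cls, name):
        raise ValueError(f"{cls.__name__}.{name} already exists")

    @functools.wraps(connector_func)
    def method(self, *args, **kwargs):
        # The instance the method is called on is forwarded as the
        # connector's first parameter (named `table` in the proposal).
        return connector_func(self, *args, **kwargs)

    method.__name__ = name
    setattr(cls, name, method)


def to_demo(table, destination):
    """Hypothetical connector: 'writes' the table by appending to a list."""
    destination.append(table.data)


attach_writer(Frame, "to_demo", to_demo)

sink = []
Frame({"a": [1, 2]}).to_demo(sink)
print(sink)
```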


In case the Apache Arrow interface is used, metadata not supported by Apache
Arrow may be provided by users. For example the column to use for row indices
Member

This seems quite a restriction over some of the other options such as for example the dataframe interchange protocol where these issues have already been considered and a solution proposed.

If the Arrow standard does not meet our requirements, does this not make our api proposal another protocol?

or the data type backend to use in the object being created. This would be
managed independently from the connectors. Given the previous example, a new
argument `index_col` could be added directly into pandas to the function or
method automatically generated from the entrypoint. Since this would apply to
all functions and methods automatically generated, it would also improve the
consistency of pandas connectors. For example:

```python
def read_myformat(index_col=None, *args, **kwargs):
Member

This could be expanded to include many other parameters to improve consistency?

(does *args work with index_col being optional?)

    # The third-party connector doesn't need to know about functionality
    # specific to pandas like the row index
    arrow_table = third_party_connector.from_myformat(*args, **kwargs)

    df = convert_arrow_table_to_pandas_dataframe(arrow_table)

    # Transformations to the dataframe with custom parameters are possible
    if index_col is not None:
Member

Should the connector have control (if it wanted to) of this if a DataFrame is the exchange object instead of an Arrow Table?

        df = df.set_index(index_col)

    return df
```

#### Connector guidelines

In order to provide a better and more consistent experience to users, guidelines
will be created to unify terminology and behavior. Some of the topics to unify are
defined next.

**Existence and naming of columns**, since many connectors are likely to provide
similar features, like loading only a subset of columns in the data, or dealing
with file names. Examples of recommendations to connector developers:

- `columns`: Use this argument to let the user load a subset of columns. Allow a
Contributor

we should have specific requirements here rather than just guidelines

otherwise this is going to be a mess

list or tuple.
- `path`: Use this argument if the dataset is a file on disk. Allow a string,
a `pathlib.Path` object, or a file descriptor. For a string object, allow URLs that
will be automatically downloaded, compressed files that will be automatically
uncompressed, etc. A library can be provided to deal with those in an easier and
more consistent way.
- `schema`: For datasets that don't have a schema (e.g. `csv`), allow providing an
Apache Arrow schema instance, and automatically infer types if not provided.

Note that the above are only examples of guidelines for illustration, and not
a proposal of the guidelines.
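
To make the example guidelines concrete, here is a rough sketch of what a connector following the `path` and `columns` recommendations might look like. It is only an illustration: `read_simplecsv` is a hypothetical name, a dict of lists stands in for the Arrow table or dataframe a real connector would return, and URL handling, decompression and `schema` are left out.

```python
import csv
import io
from pathlib import Path


def _load(handle, columns):
    reader = csv.DictReader(handle)
    rows = list(reader)
    available = list(reader.fieldnames or [])
    names = list(columns) if columns is not None else available
    missing = [name for name in names if name not in available]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Values are kept as strings; no type inference is attempted.
    return {name: [row[name] for row in rows] for name in names}


def read_simplecsv(path, columns=None):
    """Hypothetical connector: `path` is a str, a Path or an open text file;
    `columns` is an optional list or tuple of column names to load."""
    if isinstance(path, (str, Path)):
        with open(path, newline="") as handle:
            return _load(handle, columns)
    return _load(path, columns)


data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
print(read_simplecsv(data, columns=("a", "c")))
```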

**Guidelines to avoid name conflicts**. Since it is expected that more than one
implementation exists for certain formats, as it already happens, guidelines on
how to name connectors would be created. The easiest approach is probably to use
Member

Is it possible for pandas to raise if it detects that 2 entrypoints are trying to register the same named read_ and to_ function? If so, I am not sure if it's necessary to enforce naming conventions necessarily

Member Author

Yes, fully agree, I thought it was in the proposal that pandas would raise if more than one connector with the same name exists, but seems like it's not there, I can add it.

But I do think some naming guidelines could still be helpful. Like recommending to use something like read_csv_strict instead of read_strict_csv... Or whatever we want to recommend that helps consistency.

the format `from_<format>_<implementation-id>` / `to_<format>_<implementation-id>`.

For example a `csv` loader based on PyArrow could be named as `from_csv_pyarrow`,
Member

As we currently use engine or similar for the existing connectors, what is the thinking if we do move our connectors to the new interface?

e.g. would we have read_xml_lxml and read_xml_etree?

For backwards compatibility, I expect we would retain the parser parameter instead?

If that is the case, I think we should encourage the use of engine and parser keywords instead for 3rd parties and therefore may make sense to consider entrypoints to our existing connectors as well.

and an implementation that does not infer types and raises an exception in case
of mistake or ambiguity could be named `from_csv_strict`. Exact guidelines
would be developed independently from this proposal.
Copy link
Member

I think this proposal needs to address the format/engine issue and not defer it.
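
A minimal sketch of the two ideas in this thread: building names that follow the `from_<format>_<implementation-id>` / `to_<format>_<implementation-id>` pattern, and raising when two packages try to register the same name. The helper names `connector_name` and `build_registry`, the package names, and the choice of `ValueError` are hypothetical; the proposal does not define them.

```python
def connector_name(direction, fmt, implementation=None):
    """Build a name such as 'from_csv_pyarrow' or 'to_parquet'."""
    if direction not in ("from", "to"):
        raise ValueError(f"direction must be 'from' or 'to', got {direction!r}")
    parts = [direction, fmt]
    if implementation:
        parts.append(implementation)
    return "_".join(parts)


def build_registry(candidates):
    """Map connector names to providers, refusing duplicate names.

    `candidates` is an iterable of (connector_name, provider) pairs.
    """
    registry = {}
    for name, provider in candidates:
        if name in registry:
            raise ValueError(
                f"connector {name!r} is provided by both "
                f"{registry[name]!r} and {provider!r}"
            )
        registry[name] = provider
    return registry


# Two hypothetical packages providing different CSV loaders: no conflict.
registry = build_registry([
    (connector_name("from", "csv", "pyarrow"), "package_a"),
    (connector_name("from", "csv", "strict"), "package_b"),
])
```

With a check like this in place, a naming guideline is only needed for consistency, since a genuine name clash would fail loudly instead of silently overriding a connector.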


**Connector registry and documentation**. To simplify the discovery of connectors
and its documentation, connector developers can be encourage to register their
Member

Suggested change
and its documentation, connector developers can be encourage to register their
and its documentation, connector developers can be encouraged to register their

projects in a central location, and to use a standard structure for documentation.
This would allow the creation of a unified website to find the available
connectors and their documentation. It would also make it possible to customize
the documentation for specific implementations, including their final API and
any arguments specific to the implementation. In the case of pandas, it
would make it possible to add arguments such as `index_col` to all loader
methods, and to potentially build the API reference of certain third-party
connectors as part of pandas' own documentation. That may or may not be a good
idea, but standardizing the documentation of connectors would allow it.
Member

hmmm. A big part of the argument for this proposal is the maintenance burden on the core project. I'm not sure what is being proposed here? Would the pandas team manage this "central location"?


### Connector examples

This section lists specific examples of connectors that could immediately
benefit from this proposal.

**PyArrow** currently provides `Table.from_pandas` and `Table.to_pandas`.
Member

Need to clarify an earlier comment about allowing DataFrame as the interchange object both ways for this to make sense.

With the new interface, it could also register `DataFrame.from_pyarrow`
and `DataFrame.to_pyarrow`, so pandas users can use the converters with
the interface they are used to, when PyArrow is installed in the environment.
Better integration with PyArrow tables was discussed in
[#51760](https://github.com/pandas-dev/pandas/issues/51760).

_Current API_:

```python
pyarrow.Table.from_pandas(table.to_pandas()
                               .query('my_col > 0'))
```

_Proposed API_:

```python
(pandas.DataFrame.io.from_pyarrow(table)
       .query('my_col > 0')
       .io.to_pyarrow())
```

Member

@simonjayhawkins simonjayhawkins Apr 12, 2023

Suggested change
(pandas.DataFrame.io.from_pyarrow(table)
.query('my_col > 0')
.io.to_pyarrow())
(pandas.from_pyarrow(table)
.query('my_col > 0')
.to_pyarrow())

**Polars**, **Vaex** and other dataframe frameworks could benefit from
third-party projects that make interoperability with pandas use a
more explicit API. Integration with Polars was requested in
[#47368](https://github.com/pandas-dev/pandas/issues/47368).

_Current API_:

```python
polars.DataFrame(df.to_pandas()
                   .query('my_col > 0'))
```

Member

This example is so similar to the above that I'm not sure it adds value. Can they be combined?

_Proposed API_:

```python
(pandas.DataFrame.io.from_polars(df)
       .query('my_col > 0')
       .io.to_polars())
```

Member

@simonjayhawkins simonjayhawkins Apr 12, 2023

Suggested change
(pandas.DataFrame.io.from_polars(df)
.query('my_col > 0')
.io.to_polars())
(pandas.from_polars(df)
.query('my_col > 0')
.to_polars())

**DuckDB** provides an out-of-core engine able to push predicates before
the data is loaded, making much better use of memory and significantly
decreasing loading time. pandas, because of its eager nature, is not able
to easily implement this itself, but could benefit from a DuckDB loader.
The loader can already be implemented inside pandas (it has already been
proposed in [#45678](https://github.com/pandas-dev/pandas/issues/45678)),
or as a third-party extension with an arbitrary API. But this proposal would
allow the creation of a third-party extension with a standard and intuitive API:
Member

no guarantee? from comments in that and related issues it does not appear there are any objections to using entrypoints for the read_sql method. This would create a much more consistent interface.

I don't think DuckDB should be used as any justification for this proposal at this time. (similarly the Polars issue has been resolved? So again does not help justify this proposal IMHO)


```python
pandas.DataFrame.io.from_duckdb("""SELECT *
                                   FROM 'dataset.parquet'
                                   WHERE my_col > 0""")
```

**Out-of-core algorithms** push some operations, such as filtering or grouping,
down to the point where the data is loaded. While this is not currently possible,
connectors implementing out-of-core algorithms could be developed using this interface.
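
To make the idea concrete, the sketch below applies a filter while rows are being read, so rows that fail the condition are never kept in memory. It uses only the standard library `csv` module on in-memory text; the function `load_filtered` and the sample data are hypothetical, and a real out-of-core engine would do far more (for example, skipping whole chunks of a file).

```python
import csv
import io


def load_filtered(lines, predicate):
    """Stream CSV rows and keep only those for which `predicate` is true."""
    reader = csv.DictReader(lines)
    # Rows are read one at a time; rejected rows are discarded immediately.
    return [row for row in reader if predicate(row)]


# Hypothetical sample data standing in for a file on disk.
data = io.StringIO("my_col,other\n-1,a\n2,b\n5,c\n")
rows = load_filtered(data, lambda row: int(row["my_col"]) > 0)
```

The contrast is with an eager loader, which would first materialize all three rows and only then drop the one that does not match.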

**Big data** systems such as Hive, Iceberg, Presto, etc. could benefit
from a standard way to load data into pandas. Also, regular **SQL databases**
that can return their query results as Arrow would benefit from better
and faster connectors than the existing ones based on SQLAlchemy and
Python structures.

Any other format, including **domain-specific formats**, could easily
implement pandas connectors with a clear and intuitive API.

## Proposal extensions

The scope of the current proposal is limited to the registration of functions
defined by third-party projects, if an entrypoint is defined.
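
As an illustration of what registration through an entrypoint could look like, the sketch below discovers entry points with the standard library's `importlib.metadata` and attaches the loaded functions to a namespace. The group name, the helpers `discover_connectors` and `attach`, and the use of a plain namespace object in place of pandas are assumptions made for this example, not part of the proposal text in this section.

```python
from importlib.metadata import entry_points
from types import SimpleNamespace

# Hypothetical entry point group; the real name would be set by the proposal.
GROUP = "example.dataframe_io"


def discover_connectors(group=GROUP):
    """Return {name: loaded object} for every entry point in `group`."""
    eps = entry_points()
    # Python 3.10+ offers .select(); older versions return a dict of lists.
    if hasattr(eps, "select"):
        selected = eps.select(group=group)
    else:
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


def attach(namespace, connectors):
    """Expose each connector as an attribute of `namespace`."""
    for name, func in connectors.items():
        setattr(namespace, name, func)
    return namespace


# No installed package is expected to declare the hypothetical group,
# so in a typical environment nothing is attached here.
io_namespace = attach(SimpleNamespace(), discover_connectors())
```

A third-party package would opt in by declaring an entry point in that group in its packaging metadata; packages that declare nothing are simply never loaded.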

Any changes to the current connectors of pandas (e.g. `read_csv`,
`from_records`, etc.) or their migration to the new system are out of scope for
this proposal, but the following tasks could be considered for future work and proposals:

- Move some of the existing I/O connectors out of the pandas repository and
  into their own third-party projects. This would require a transition period
  to let users know that future versions of pandas will require a dependency
  installed for a particular connector to exist.
- Implement with the new interface some of the data structures that the
  `DataFrame` constructor accepts.

## PDEP-9 History

- 5 March 2023: Initial version