From 5dbdde9fb38c56e87c1e576eb555845c82a90754 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Mon, 27 Feb 2023 02:04:35 +0100
Subject: [PATCH 01/13] PDEP-9: pandas I/O connectors as extensions

---
 web/pandas/pdeps/0009-io-extensions.md | 223 +++++++++++++++++++++
 1 file changed, 223 insertions(+)
 create mode 100644 web/pandas/pdeps/0009-io-extensions.md

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
new file mode 100644
index 0000000000000..2b66c46fcd376
--- /dev/null
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -0,0 +1,223 @@
+# PDEP-9: Implement pandas I/O connectors as extensions
+
+- Created: 26 February 2023
+- Status: Draft
+- Discussion: [#XXXX](https://github.com/pandas-dev/pandas/pull/XXXX)
+- Author: [Marc Garcia](https://github.com/datapythonista)
+- Revision: 1
+
+## Introduction
+
+pandas supports importing and exporting data from different formats using
+connectors, currently implemented in `pandas/io`. In many cases, those
+connectors wrap an existing Python library, while in some others, pandas
+implements the format logic.
+
+In some cases, different engines exist for the same format. The API to use
+those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to
+import data, and `DataFrame.to_<format>(engine='<engine-name>', ...)` to
+export data.
+
+For objects exported to memory (like a Python dict) the API is the same as
+for I/O, `DataFrame.to_<format>(...)`. For formats imported from objects in
+memory, the API is different, `DataFrame.from_<format>(...)`.
+
+In some cases, the pandas API provides `DataFrame.to_*` methods that are not
+used to export the data to a disk or memory object, but instead to transform
+the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.
+
+Dependencies of the I/O connectors are not loaded by default, and will be
+imported when the connector is used. If the dependencies are not installed,
+an `ImportError` is raised.
+
+```python
+>>> pandas.read_gbq(query)
+Traceback (most recent call last):
+  ...
+ImportError: Missing optional dependency 'pandas-gbq'.
+pandas-gbq is required to load data from Google BigQuery.
+See the docs: https://pandas-gbq.readthedocs.io.
+Use pip or conda to install pandas-gbq.
+```
+
+### Supported formats
+
+The list of formats can be found in the [IO guide](https://pandas.pydata.org/docs/dev/user_guide/io.html).
+A more detailed table, including in memory objects, with
+the engines and dependencies is presented next.
+
+| Format       | Reader | Writer | Engines                  | Dependencies |
+|--------------|--------|--------|--------------------------|--------------|
+| CSV          | X      | X      | `c`, `python`, `pyarrow` | `pyarrow`    |
+| FWF          | X      |        |                          |              |
+| JSON         | X      | X      |
+| HTML         | X      | X      |
+| LaTeX        |        | X      |
+| XML          | X      | X      |
+| Clipboard    | X      | X      |
+| Excel        | X      | X      |
+| HDF5         | X      | X      |
+| Feather      | X      | X      |
+| Parquet      | X      | X      |
+| ORC          | X      | X      |
+| Stata        | X      | X      |
+| SAS          | X      |        |
+| SPSS         | X      |        |
+| Pickle       | X      | X      |        |        |
+| SQL          | X      | X      |
+| BigQuery     |        |        |
+| dict         | X      | X      |
+| records      | X      | X      |
+| string       |        | X      |
+| markdown     |        | X      |
+| xarray       |        | X      |
+
+### Inclusion criteria
+
+There are no objective criteria for when a format is included
+in pandas, and the list above is mostly the result of developers
+being interested in implementing the connectors for a certain
+format in pandas.
+
+The number of existing formats is constantly increasing, and it's
+difficult for pandas to keep up to date even with popular formats.
+It could probably make sense to have connectors to pyarrow,
+pyspark, Iceberg, DuckDB, Polars, and others.
+
+At the same time, some of the formats are not frequently used as
+shown in the [2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html).
+Those less popular formats include SPSS, SAS, Google BigQuery and
+Stata. Note that only I/O formats (and not memory formats like
+records or xarray) where included in the survey.
+
+## Proposal
+
+The main proposal in this PDEP is to open the development of pandas
+connectors to third-parties. This would not only allow the development
+of new connectors in a faster and easier way, without the intervention of
+the pandas team, but also remove from the pandas code base a number of the
+existing connectors, simplifying the code, the CI and the builds.
+While a limited set of core connectors could live in the pandas code base,
+most of the existing connectors would be moved to third-party projects.
+
+The user experience would remain similar to the existing one, but making
+better use of namespaces, and adding consistency. Any pandas connector
+(regardless of being implemented as a third-party module or not) would define
+a Python entrypoint specifying the format they connect to, the operations
+they support (read and/or write) and the name of the engine to be used.
+On load, pandas would access this registry of connectors, and would create
+the corresponding import and export methods.
+
+To use the connectors for the format, users would install the third-party
+connector package, instead of installing the required dependencies as they
+need to do now.
+
+### Python API
+
+The Python API can be improved from the current one to make better use
+of namespaces, and avoid inconsistencies. The proposed API is:
+
+```python
+import pandas
+
+df = pandas.DataFrame.io.read_<format>(engine='<engine-name>', ...)
+
+df.io.write_<format>(engine='<engine-name>', ...)
+```
+The `engine` parameter would only be required when more than one engine
+is available for a format. This is similar to the current API, which
+would use the default engine if not specified.
+
+For example:
+
+```python
+import pandas
+
+df = pandas.DataFrame.io.read_hdf5('input.hdf5')
+
+df.io.write_parquet('output.parquet')
+```
+
+All the I/O connectors would be accessed via `DataFrame.io`, significantly
+reducing the number of items in the namespace of the `pandas` module, and
+the `DataFrame` class. Introspection would make it fast and simple to
+list the existing connectors `dir(pandas.DataFrame.io)`.
+
+The API is more intuitive than the current one, as it would be used for
+both in memory formats and disk formats, and does not mix read/to (users
+in general would expect read/write, from/to, import/export, input/output,
+and not a mix of those pairs).
+
+### Ecosystem of connectors
+
+In the same way Python can be extended with third-party modules, pandas
+would be extendable with I/O plugins. This has some advantages:
+
+- **Suppression of the pandas maintainers bottleneck.** Everybody would be
+  able to develop and promote their own I/O connectors, without the
+  approval or intervention of pandas maintainers.
+- **Lower the entry barrier to pandas code.** Since pandas is a huge and
+  mature project, writing code in pandas itself is complex. Several
+  linters and autoformatters are required, policies like adding release
+  notes need to be followed. Proper testing must be implemented.
+  CI is slow and takes hours to complete. pandas needs to be compiled
+  due to its C extensions.
All those would not be necessary, and + creating new I/O connectors would be faster and simpler. +- **CI and packaging simplification.** pandas has currently around 20 + dependencies required by connectors. And a significant number of + tests, some of them requiring a high level of customization (such as + an available database server to test `read_sql`, or a virtual + clipboard to test `read_clipboard`). Moving connectors out of + pandas would make the CI faster, and the number of problems caused + by updates in dependencies smaller. +- **Competition and alternatives for I/O operations.** Some of the + supported formats allow for different approaches in terms of + implementation. For example, `csv` connectors can be optimized + for performance and reliability, or for easiness of use. When + building a production pipeline, users would often appreciate a + loader that requires an expected schema, loads faster because of + it, and fails if the file contains errors. While Jupyter users + may prefer inference and magic that helps them write code faster. +- **Reusability with other projects.** In some cases, it can make + sense to load a format into for example Apache Arrow, and then + convert it to a pandas `DataFrame` in the connector. It could + also be quite simple when that is implemented to return a Vaex + or a Polars object. Having connectors as third-party packages + would allow to implement this, as opposed as our current + connectors. This reusability would not only benefit other + dataframe projects, but it would also have better maintained + connectors, as they will be shared by a larger ecosystem. + +## Disadvantages + +The main disadvantages to implementing this PDEP are: + +- **Backward compatibility**. +- **More verbose API.** +- **Fragmented documentation.** + +## Transition period + +This proposal involves some important changes regarding user +facing code. + +The implementation of connectors as third-party packages is quite +small for users, who would just need to install `pandas-xarray` +instead of `xarray` to be able to use `DataFrame.to_xarray`. Also, +the `ImportError` message users would get in case it was not +properly installed, can provide the required information for users +to install the right package without issues. + +The part that requires more careful management and a long transition +period is the change to the Python API proposed here. The +new API does not overlap with the old one (everything would be in +the new `DataFrame.io` accessor). This allows to easily implement +both the new and old API in parallel, raising `FutureWarning` +warnings in the old API, so users can slowly adapt their code, +and get used to the new API. Since the changes affect all pandas +users, keeping the old behavior until at least pandas 4.0 seems +a reasonable transition period. 
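+
+As an illustration only (the helper name and exact mechanics here are
+assumptions, not part of this proposal), the old functions could be kept
+working during the transition with a thin wrapper that emits the warning:
+
+```python
+import functools
+import warnings
+
+def deprecated_io(old_name, new_name, func):
+    # Keep the legacy entry point working, but point users to the new API
+    @functools.wraps(func)
+    def wrapper(*args, **kwargs):
+        warnings.warn(
+            f"{old_name} is deprecated, use {new_name} instead",
+            FutureWarning,
+            stacklevel=2,
+        )
+        return func(*args, **kwargs)
+    return wrapper
+```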
+
+## PDEP-9 History
+
+- 26 February 2023: Initial version

From 23b934f0fbd5b6d8833ce6345d192e973bba2506 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Sun, 5 Mar 2023 16:18:06 +0000
Subject: [PATCH 02/13] Final draft to be proposed

---
 web/pandas/pdeps/0009-io-extensions.md | 319 ++++++++++++++-----------
 1 file changed, 183 insertions(+), 136 deletions(-)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index 2b66c46fcd376..83efc1ef86292 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -1,17 +1,34 @@
-# PDEP-9: Implement pandas I/O connectors as extensions
+# PDEP-9: Allow third-party projects to register pandas connectors with a standard API

-- Created: 26 February 2023
+- Created: 5 March 2023
 - Status: Draft
 - Discussion: [#XXXX](https://github.com/pandas-dev/pandas/pull/XXXX)
 - Author: [Marc Garcia](https://github.com/datapythonista)
 - Revision: 1

-## Introduction
+## PDEP Summary
+
+This document proposes that third-party projects implementing I/O or memory
+connectors, can register them using Python's entrypoint system, and make them
+available to pandas users with a standard interface in a dedicated namespace
+`DataFrame.io`. For example:
+
+```python
+import pandas
+
+df = pandas.DataFrame.io.from_duckdb("SELECT * FROM 'dataset.parquet';")
+
+df.io.to_hive(hive_conn, "hive_table")
+```
+
+## Current state

 pandas supports importing and exporting data from different formats using
-connectors, currently implemented in `pandas/io`. In many cases, those
-connectors wrap an existing Python library, while in some others, pandas
-implements the format logic.
+I/O connectors, currently implemented in `pandas/io`, as well as connectors
+to in-memory structure, like Python structures or other library formats.
+In many cases, those connectors wrap an existing Python library, while in
+some others, pandas implements the logic to read and write to a particular
+format.

 In some cases, different engines exist for the same format. The API to use
 those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to
 import data, and `DataFrame.to_<format>(engine='<engine-name>', ...)` to
 export data.

 For objects exported to memory (like a Python dict) the API is the same as
 for I/O, `DataFrame.to_<format>(...)`. For formats imported from objects in
-memory, the API is different, `DataFrame.from_<format>(...)`.
+memory, the API is different using the `from_` prefix instead of `read_`,
+`DataFrame.from_<format>(...)`.

 In some cases, the pandas API provides `DataFrame.to_*` methods that are not
 used to export the data to a disk or memory object, but instead to transform
 the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.

-Dependencies of the I/O connectors are not loaded by default, and will be
+Dependencies of the connectors are not loaded by default, and will be
 imported when the connector is used. If the dependencies are not installed,
 an `ImportError` is raised.

```python
>>> pandas.read_gbq(query)
Traceback (most recent call last):
  ...
ImportError: Missing optional dependency 'pandas-gbq'.
pandas-gbq is required to load data from Google BigQuery.
See the docs: https://pandas-gbq.readthedocs.io.
Use pip or conda to install pandas-gbq.
```

 ### Supported formats

-The list of formats can be found in the [IO guide](https://pandas.pydata.org/docs/dev/user_guide/io.html).
-A more detailed table, including in memory objects, with
-the engines and dependencies is presented next.
+The list of formats can be found in the
+[IO guide](https://pandas.pydata.org/docs/dev/user_guide/io.html).
+A more detailed table, including in memory objects, and I/O connectors in the +DataFrame styler is presented next: -| Format | Reader | Writer | Engines | Dependencies | -|--------------|--------|--------|--------------------------|--------------| -| CSV | X | X | `c`, `python`, `pyarrow` | `pyarrow` | -| FWF | X | | | | +| Format | Reader | Writer | +|--------------|--------|--------| +| CSV | X | X | +| FWF | X | | | JSON | X | X | | HTML | X | X | | LaTeX | | X | @@ -63,7 +82,7 @@ the engines and dependencies is presented next. | Stata | X | X | | SAS | X | | | SPSS | X | | -| Pickle | X | X | | | +| Pickle | X | X | | SQL | X | X | | BigQuery | | | | dict | X | X | @@ -72,152 +91,180 @@ the engines and dependencies is presented next. | markdown | | X | | xarray | | X | -### Inclusion criteria +At the time of writing this document, the `io/` module contains +close to 100,000 lines of Python, C and Cython code. There is no objective criteria for when a format is included -in pandas, and the list above is mostly the result of developers +in pandas, and the list above is mostly the result of a developer being interested in implementing the connectors for a certain format in pandas. -The number of existing formats is constantly increasing, and its -difficult for pandas to keep up to date even with popular formats. -It could probably make sense to have connectors to pyarrow, -pyspark, Iceberg, DuckDB, Polars, and others. +The number of existing formats available for data that can be processed with +pandas is constantly increasing, and its difficult for pandas to keep up to +date even with popular formats. It could possibly make sense to have connectors +to PyArrow, PySpark, Iceberg, DuckDB, Hive, Polars, and many others. -At the same time, some of the formats are not frequently used as -shown in the [2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html). +At the same time, some of the formats are not frequently used as shown in the +[2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html). Those less popular formats include SPSS, SAS, Google BigQuery and -Stata. Note that only I/O formats (and not memory formats like -records or xarray) where included in the survey. +Stata. Note that only I/O formats (and not memory formats like records or xarray) +where included in the survey. + +The maintenance cost of supporting all formats is not only in maintaining the +code and reviewing pull requests, but also it has a significant cost in time +spent on CI systems installing dependencies, compiling code, running tests, etc. + +In some cases, the main maintainers of some of the connectors are not part of +the pandas core development team, but people specialized in one of the formats +without commit rights. ## Proposal -The main proposal in this PDEP is to open the development of pandas -connectors to third-parties. This would not only allow the development -of new connectors in a faster and easier way, without the intervention of -the pandas team, but also remove from the pandas code base a number of the -existing connectors, simplifying the code, the CI and the builds. -While a limited set of core connectors could live in the pandas code base, -most of the existing connectors would be moved to third-party projects. 
+While the current pandas approach has worked reasonably well, it is difficult +to find a stable solution where the maintenance incurred in pandas is not +too big, while at the same time users can interact with all different formats +and representations they are interested in, in an easy and intuitive way. + +Third-party packages are already able to implement connectors to pandas, but +there are some limitations to it: + +- Given the large number of formats supported by pandas itself, third-party + connectors are likely seen as second class citizens, not important enough + to be used, or not well supported. +- There is no standard API for I/O connectors, and users of them need to learn + each of them individually. +- Method chaining, is not possible with third-party I/O connectors to export + data, unless authors monkey patch the `DataFrame` class, which should not + be encouraged. + +This document proposes to open the development of pandas I/O connectors to +third-party libraries in a standard way that overcomes those limitations. + +### Proposal implementation + +Implementing this proposal would not require major changes to pandas, and +the API defined next would be used. + +A new `.io` accessor would be created for the `DataFrame` class, where all +I/O connector methods from third-parties would be loaded. Nothing else would +live under that namespace. + +Third-party packages would implement a +[setuptools entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins) +to define the connectors that they implement, under a group `dataframe.io`. + +For example, a hypothetical project `pandas_duckdb` implementing a `from_duckdb` +function, could use `pyproject.toml` to define the next entry point: + +```toml +[project.entry-points."dataframe.io"] +from_duckdb = "pandas_duckdb:from_duckdb" +``` -The user experience would remain similar to the existing one, but making -better use of namespaces, and adding consistency. Any pandas connector -(regardless of being implemented as a third-party module or not) would define -a Python entrypoint specifying the format they connect to, the operations -they support (read and/or write) and the name of the engine to be used. -On load, pandas would access this registry of connectors, and would create -the corresponding import and export methods. +On import of the pandas module, it would read the entrypoint registry for the +`dataframe.io` group, and would dynamically create methods in the `DataFrame.io` +namespace for them. Method names would only be allowed to start with `from_` or +`to_`, and any other prefix would make pandas raise an exception. This would +guarantee a reasonably consistent API among third-party I/O connectors. -To use the connectors for the format, users would install the third-party -connector package, instead of installing the required dependencies as they -need to do now. +Connectors would use Apache Arrow as the only interface to load data from and +to pandas. This would prevent that changes to the pandas API affecting +connectors in any way. This would simplify the development of connectors, +make testing of them much more reliable and resistant to changes in pandas, and +allow connectors to be reused by other projects of the ecosystem. -### Python API +In case a `from_` method returned something different than a PyArrow table, +pandas would raise an exception. 
pandas would expect all `to_` methods to have +`table: pyarrow.Table` as the first parameter, and it would raise an exception +otherwise. The `table` parameter would be exposed as the `self` parameter in +pandas, when the original function is registered as a method of the `.io` +accessor. -The Python API can be improved from the current one to make better use -of namespaces, and avoid inconsistencies. The proposed API is: +### Connector examples + +This section lists specific examples of connectors that could immediately +benefit from this proposal. + +**PyArrow** currently provides `Table.from_pandas` and `Table.to_pandas`. +With the new interface, it could also register `DataFrame.from_pyarrow` +and `DataFrame.to_pyarrow`, so pandas users can use the converters with +the interface they are used to, when PyArrow is installed in the environment. + +_Current API_: ```python -import pandas +pyarrow.Table.from_pandas(table.to_pandas() + .query('my_col > 0')) +``` -df = pandas.DataFrame.io.read_(engine='', ...) +_Proposed API_: -df.io.write_(engine='', ...) +```python +(pandas.DataFrame.io.from_pyarrow(table) + .query('my_col > 0') + .io.to_pyarrow()) ``` -The `engine` parameter would only be required when more than an engine -is available for a format. This is similar to the the current API, that -would use the default engine if not specified. -For example: +**Polars**, **Vaex** and other dataframe frameworks could benefit from +third-party projects that make the interoperability with pandas use a +more explicitly API. + +_Current API_: ```python -import pandas +polars.DataFrame(df.to_pandas() + .query('my_col > 0')) +``` -df = pandas.DataFrame.io.read_hdf5('input.hdf5') +_Proposed API_: -df.io.write_parquet('output.parquet') +```python +(pandas.DataFrame.io.from_polars(df) + .query('my_col > 0') + .io.to_polars()) ``` -All the I/O connectors would be accessed via `DataFrame.io`, significantly -reducing the number of items in the namespace of the `pandas` module, and -the `DataFrame` class. Introspection would make it fast and simple to -list the existing connectors `dir(pandas.DataFrame.io)`. - -The API is more intuitive than the current one, as it would be used for -both in memory formats and disk formats, and does not mix read/to (users -in general would expect read/write, from/to, import/export, input/output, -and not a mix of those pairs). - -### Ecosystem of connectors - -In the same way Python can be extended with third-party modules, pandas -would be extendable with I/O plugins. This has some advantages: - -- **Supression of the pandas maintainers bottleneck.** Everybody would be - able to develop and promote their own I/O connectors, without the - approval or intervention of pandas maintainers. -- **Lower the entry barrier to pandas code.** Since pandas is a huge and - mature project, writing code in pandas itself is complex. Several - linters and autoformatters are required, policies like adding release - notes need to be followed. Proper testing must be implemented. - CI is slow and takes hours to complete. pandas needs to be compiled - due to its C extensions. All those would not be necessary, and - creating new I/O connectors would be faster and simpler. -- **CI and packaging simplification.** pandas has currently around 20 - dependencies required by connectors. And a significant number of - tests, some of them requiring a high level of customization (such as - an available database server to test `read_sql`, or a virtual - clipboard to test `read_clipboard`). 
Moving connectors out of - pandas would make the CI faster, and the number of problems caused - by updates in dependencies smaller. -- **Competition and alternatives for I/O operations.** Some of the - supported formats allow for different approaches in terms of - implementation. For example, `csv` connectors can be optimized - for performance and reliability, or for easiness of use. When - building a production pipeline, users would often appreciate a - loader that requires an expected schema, loads faster because of - it, and fails if the file contains errors. While Jupyter users - may prefer inference and magic that helps them write code faster. -- **Reusability with other projects.** In some cases, it can make - sense to load a format into for example Apache Arrow, and then - convert it to a pandas `DataFrame` in the connector. It could - also be quite simple when that is implemented to return a Vaex - or a Polars object. Having connectors as third-party packages - would allow to implement this, as opposed as our current - connectors. This reusability would not only benefit other - dataframe projects, but it would also have better maintained - connectors, as they will be shared by a larger ecosystem. - -## Disadvantages - -The main disadvantages to implementing this PDEP are: - -- **Backward compatibility**. -- **More verbose API.** -- **Fragmented documentation.** - -## Transition period - -This proposal involves some important changes regarding user -facing code. - -The implementation of connectors as third-party packages is quite -small for users, who would just need to install `pandas-xarray` -instead of `xarray` to be able to use `DataFrame.to_xarray`. Also, -the `ImportError` message users would get in case it was not -properly installed, can provide the required information for users -to install the right package without issues. - -The part that requires more careful management and a long transition -period is the change to the Python API proposed here. The -new API does not overlap with the old one (everything would be in -the new `DataFrame.io` accessor). This allows to easily implement -both the new and old API in parallel, raising `FutureWarning` -warnings in the old API, so users can slowly adapt their code, -and get used to the new API. Since the changes affect all pandas -users, keeping the old behavior until at least pandas 4.0 seems -a reasonable transition period. +**DuckDB** provides an out-of-core engine able to push predicates before +the data is loaded, making much better use of memory and significantly +decreasing loading time. pandas, because of its eager nature is not able +to easily implement this itself, but could benefit from a DuckDB loader. +The loader can already be implemented inside pandas, or as a third-party +extension with an arbitrary API. But this proposal would let the creation +of a third-party extension with a standard and intuitive API: + +```python +pandas.DataFrame.io.from_duckdb("SELECT * + FROM 'dataset.parquet' + WHERE my_col > 0") +``` + +**Big data** systems such as Hive, Iceberg, Presto, etc. could benefit +from a standard way to load data to pandas. Also regular **SQL databases** +that can return their query results as Arrow, would benefit from better +and faster connectors than the existing ones based on SQL Alchemy and +Python structures. + +Any other format, including **domain-specific formats** could easily +implement pandas connectors with a clear an intuitive API. 
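+
+As a sketch of the contract described above (the names are hypothetical, and
+a real connector would parse and serialize the actual format), a minimal
+third-party connector could look like this:
+
+```python
+import pyarrow
+
+def from_myformat(path):
+    # Under this proposal, readers must return a pyarrow.Table
+    data = {"my_col": [1, 2, 3]}  # a real reader would parse `path` here
+    return pyarrow.table(data)
+
+def to_myformat(table: pyarrow.Table, path):
+    # Writers receive the data as a pyarrow.Table in the first parameter
+    with open(path, "w") as f:
+        f.write(str(table))  # stand-in for real serialization
+```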
+ +## Proposal extensions + +The scope of the current proposal is limited to the addition of the +`DataFrame.io` namespace, and the automatic registration of functions defined +by third-party projects, if an entrypoint is defined. + +Any changes to the current I/O of pandas are out of scope for this proposal, +but the next tasks can be considered for future work and proposals: + +- Migrate I/O connectors currently implemented in pandas to the new interface. + This would require a transition period where users would be warned that + existing `DataFrame.read_*` may have been moved to `DataFrame.io.from_*`, + and that the old API will stop working in a future version. +- Move out of the pandas repository and into their own third-party projects + some of the existing I/O connectors. +- Implement with the new interface some of the data structures that the + `DataFrame` constructor accepts. ## PDEP-9 History -- 26 February 2023: Initial version +- 5 March 2023: Initial version From da784ececb04aa36992c6d7f3d4e357e008e2cfe Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Tue, 7 Mar 2023 15:06:56 +0000 Subject: [PATCH 03/13] Address comments from code reviews, mostly by extending the proposal implementation section --- web/pandas/pdeps/0009-io-extensions.md | 234 +++++++++++++++++++------ 1 file changed, 185 insertions(+), 49 deletions(-) diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md index 83efc1ef86292..797649f81a2d3 100644 --- a/web/pandas/pdeps/0009-io-extensions.md +++ b/web/pandas/pdeps/0009-io-extensions.md @@ -1,15 +1,15 @@ # PDEP-9: Allow third-party projects to register pandas connectors with a standard API - Created: 5 March 2023 -- Status: Draft -- Discussion: [#XXXX](https://github.com/pandas-dev/pandas/pull/XXXX) +- Status: Under discussion +- Discussion: [#51799](https://github.com/pandas-dev/pandas/pull/51799) - Author: [Marc Garcia](https://github.com/datapythonista) - Revision: 1 ## PDEP Summary This document proposes that third-party projects implementing I/O or memory -connectors, can register them using Python's entrypoint system, and make them +connectors can register them using Python's entrypoint system, and make them available to pandas users with a standard interface in a dedicated namespace `DataFrame.io`. For example: @@ -25,7 +25,7 @@ df.io.to_hive(hive_conn, "hive_table") pandas supports importing and exporting data from different formats using I/O connectors, currently implemented in `pandas/io`, as well as connectors -to in-memory structure, like Python structures or other library formats. +to in-memory structures, like Python structures or other library formats. In many cases, those connectors wrap an existing Python library, while in some others, pandas implements the logic to read and write to a particular format. 
@@ -65,31 +65,31 @@ The list of formats can be found in the A more detailed table, including in memory objects, and I/O connectors in the DataFrame styler is presented next: -| Format | Reader | Writer | -|--------------|--------|--------| -| CSV | X | X | -| FWF | X | | -| JSON | X | X | -| HTML | X | X | -| LaTeX | | X | -| XML | X | X | -| Clipboard | X | X | -| Excel | X | X | -| HDF5 | X | X | -| Feather | X | X | -| Parquet | X | X | -| ORC | X | X | -| Stata | X | X | -| SAS | X | | -| SPSS | X | | -| Pickle | X | X | -| SQL | X | X | -| BigQuery | | | -| dict | X | X | -| records | X | X | -| string | | X | -| markdown | | X | -| xarray | | X | +| Format | Reader | Writer | Engines | +|--------------|--------|--------|-----------------------------------------------------------------------------------| +| CSV | X | X | `c`, `python`, `pyarrow` | +| FWF | X | | `c`, `python`, `pyarrow` | +| JSON | X | X | `ujson`, `pyarrow` | +| HTML | X | X | `lxml`, `bs4/html5lib` (parameter `flavor`) | +| LaTeX | | X | | +| XML | X | X | `lxml`, `etree` (parameter `parser`) | +| Clipboard | X | X | | +| Excel | X | X | `xlrd`, `openpyxl`, `odf`, `pyxlsb` (each engine supports different file formats) | +| HDF5 | X | X | | +| Feather | X | X | | +| Parquet | X | X | `pyarrow`, `fastparquet` | +| ORC | X | X | | +| Stata | X | X | | +| SAS | X | | | +| SPSS | X | | | +| Pickle | X | X | | +| SQL | X | X | `sqlalchemy`, `dbapi2` (inferred from the type of the `con` parameter) | +| BigQuery | X | X | | +| dict | X | X | | +| records | X | X | | +| string | | X | | +| markdown | | X | | +| xarray | | X | | At the time of writing this document, the `io/` module contains close to 100,000 lines of Python, C and Cython code. @@ -108,7 +108,7 @@ At the same time, some of the formats are not frequently used as shown in the [2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html). Those less popular formats include SPSS, SAS, Google BigQuery and Stata. Note that only I/O formats (and not memory formats like records or xarray) -where included in the survey. +were included in the survey. The maintenance cost of supporting all formats is not only in maintaining the code and reviewing pull requests, but also it has a significant cost in time @@ -133,7 +133,7 @@ there are some limitations to it: to be used, or not well supported. - There is no standard API for I/O connectors, and users of them need to learn each of them individually. -- Method chaining, is not possible with third-party I/O connectors to export +- Method chaining is not possible with third-party I/O connectors to export data, unless authors monkey patch the `DataFrame` class, which should not be encouraged. @@ -145,12 +145,38 @@ third-party libraries in a standard way that overcomes those limitations. Implementing this proposal would not require major changes to pandas, and the API defined next would be used. +#### User API + A new `.io` accessor would be created for the `DataFrame` class, where all I/O connector methods from third-parties would be loaded. Nothing else would -live under that namespace. +live under that namespace. All methods in the `DataFrame.io` namespace would +be consistently named `from_*` to import data from other sources to pandas +and `to_*` to export data from pandas to other formats. 
For example: + +```python +import pandas + +df = pandas.DataFrame.io.from_duckdb("SELECT * FROM 'dataset.parquet';") + +df.io.to_hive(hive_conn, "hive_table") +``` + +This API allows for method chaining: + +```python +(pandas.DataFrame.io.from_duckdb("SELECT * FROM 'dataset.parquet';") + .io.to_hive(hive_conn, "hive_table")) +``` + +By using a dedicated `.io` namespace, the amount of functions and methods in +the already big namespaces of the `pandas` module and the `pandas.DataFrame` +class will not be increased, and core pandas functionality would not be mixed +with functionality registered via third-party plugins. -Third-party packages would implement a -[setuptools entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins) +#### Plugin registration + +Third-party packages would implement an +[entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins) to define the connectors that they implement, under a group `dataframe.io`. For example, a hypothetical project `pandas_duckdb` implementing a `from_duckdb` @@ -163,15 +189,48 @@ from_duckdb = "pandas_duckdb:from_duckdb" On import of the pandas module, it would read the entrypoint registry for the `dataframe.io` group, and would dynamically create methods in the `DataFrame.io` -namespace for them. Method names would only be allowed to start with `from_` or -`to_`, and any other prefix would make pandas raise an exception. This would -guarantee a reasonably consistent API among third-party I/O connectors. +namespace for them. Method not starting with `from_` or `to_` would make pandas +raise an exception. This would guarantee a reasonably consistent API among +third-party connectors. + +#### Internal API -Connectors would use Apache Arrow as the only interface to load data from and -to pandas. This would prevent that changes to the pandas API affecting -connectors in any way. This would simplify the development of connectors, -make testing of them much more reliable and resistant to changes in pandas, and -allow connectors to be reused by other projects of the ecosystem. +Connectors would use Apache Arrow as the only interface to interact with pandas +or any other project using them. In case an internal representation other than +Arrow is used, like pandas using NumPy-backed data, the data will be converted to it +while creating the dataframe object in pandas, not in the connector. Consider the +next example (only for illustration, it does not use the exact proposed API, +which would be created dynamically): + +```python +class DataFrame: + class io: + @classmethod + def from_myformat(cls, *args, **kwargs): + arrow_table = third_party_connector.from_myformat(*args, **kwargs) + + # Final objects do not necessarily need to use Arrow + return convert_arrow_table_to_pandas_dataframe(arrow_table) +``` + +By standardizing the exchange format to Apache Arrow, there are some advantages: + +- The connectors can be reused by a wider variety of projects. Other dataframe + libraries could use them (e.g. Polars, Vaex, etc.), or other projects in the ecosystem + that could consume data as Arrow objects (e.g. matplotlib, scikit-learn, etc.) +- Connectors don't need to use internal pandas APIs (like extension array objects). + If they used them changes to those internal APIs would cause connectors to break. 
By using Arrow as the exchange format, connectors shouldn't be affected by any
+  change to pandas, and pandas development can be executed without worrying about
+  any impact to the connectors ecosystem.
+- The previous point would also simplify testing of both pandas and the connectors.
+  When implementations are very coupled, it's easy that changes to one project impact
+  the other, and it's common that a project tests another downstream project in its
+  test suite. This is currently the case in pandas with the xarray and Google
+  BigQuery connectors, as well as other downstream libraries such as Dask or
+  scikit-learn. By using Arrow and fully decoupling the implementations,
+  testing of connectors would not be needed in pandas, and connectors would not
+  need to test compatibility with pandas or other consumers either.

 In case a `from_` method returned something different than a PyArrow table,
 pandas would raise an exception. pandas would expect all `to_` methods to have
 `table: pyarrow.Table` as the first parameter, and it would raise an exception
 otherwise. The `table` parameter would be exposed as the `self` parameter in
 pandas, when the original function is registered as a method of the `.io`
 accessor.

+Metadata not supported by Apache Arrow may be provided by users. For example, the
+column to use for row indices or the data type backend to use in the object being
+created. This would be managed independently from the connectors. Given the previous
+example, a new argument `index_col` could be added directly into pandas:
+
+```python
+class DataFrame:
+    class io:
+        @classmethod
+        def from_myformat(cls, index_col=None, *args, **kwargs):
+            # The third-party connector doesn't need to know about functionality
+            # specific to pandas like the row index
+            arrow_table = third_party_connector.from_myformat(*args, **kwargs)
+
+            df = convert_arrow_table_to_pandas_dataframe(arrow_table)
+
+            # Transformations to the dataframe with custom parameters are possible
+            if index_col is not None:
+                df = df.set_index(index_col)
+
+            return df
+```
+
+Since the methods of `pandas.DataFrame` would be dynamically created, those custom
+arguments would be generic and added to all `from_` and/or `to_` connectors at once,
+avoiding code duplication and creating a more consistent API.
+
+#### Connector guidelines
+
+In order to provide a better and more consistent experience to users, guidelines
+will be created to unify terminology and behavior. Some of the topics to unify are
+defined next.
+
+**Existence and naming of arguments**, since many connectors are likely to provide
+similar features, like loading only a subset of columns in the data, or dealing
+with file names. Examples of recommendations to connector developers:
+
+- `columns`: Use this argument to let the user load a subset of columns. Allow a
+  list or tuple.
+- `path`: Use this argument if the dataset is a file on disk. Allow a string,
+  a `pathlib.Path` object, or a file descriptor. For a string object, allow URLs that
+  will be automatically downloaded, compressed files that will be automatically
+  uncompressed, etc. A library can be provided to deal with those in an easier and
+  more consistent way.
+- `schema`: For datasets that don't have a schema (e.g. `csv`), allow providing an
+  Apache Arrow schema instance, and automatically infer types if not provided.
+
+Note that the above are only examples of guidelines for illustration, and not
+a proposal of the guidelines.
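+
+As an illustration of these guidelines (a hypothetical connector, with names
+that are assumptions rather than part of the proposal), a reader following
+them could expose a signature like this:
+
+```python
+import pathlib
+
+import pyarrow
+
+def read_myformat(
+    path: str | pathlib.Path,
+    columns: list[str] | None = None,
+    schema: pyarrow.Schema | None = None,
+) -> pyarrow.Table:
+    # A real implementation would parse the file, select the requested
+    # columns, and validate or infer the schema
+    ...
+```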
+**Guidelines to avoid name conflicts**. Since it is expected that more than one
+implementation exists for certain formats, as it already happens, guidelines on
+how to name connectors would be created. The easiest approach is probably to use
+the format `from_<format>_<implementation>` / `to_<format>_<implementation>`.
+
+For example, a `csv` loader based on PyArrow could be named `from_csv_pyarrow`,
+and an implementation that does not infer types and raises an exception in case
+of mistake or ambiguity could be named `from_csv_strict`. Exact guidelines
+would be developed independently from this proposal.
+
+**Connector registry and documentation**. To simplify the discovery of connectors
+and their documentation, connector developers can be encouraged to register their
+projects in a central location, and to use a standard structure for documentation.
+This would allow the creation of a unified website to find the available
+connectors, and their documentation. It would also allow customizing the
+documentation for specific implementations, including their final API and
+arguments specific to the implementation. In the case of pandas, it
+would allow adding arguments such as `index_col` to all loader methods, and
+to potentially build the API reference of certain third-party connectors as
+part of pandas' own documentation. That may or may not be a good idea, but
+standardizing the documentation of connectors would allow it.
+
 ### Connector examples

 This section lists specific examples of connectors that could immediately
 benefit from this proposal.

 **PyArrow** currently provides `Table.from_pandas` and `Table.to_pandas`.
 With the new interface, it could also register `DataFrame.from_pyarrow`
 and `DataFrame.to_pyarrow`, so pandas users can use the converters with
 the interface they are used to, when PyArrow is installed in the environment.
+Better integration with PyArrow tables was discussed in
+[#51760](https://github.com/pandas-dev/pandas/issues/51760).

 _Current API_:

```python
pyarrow.Table.from_pandas(table.to_pandas()
                          .query('my_col > 0'))
```

 _Proposed API_:

```python
(pandas.DataFrame.io.from_pyarrow(table)
                 .query('my_col > 0')
                 .io.to_pyarrow())
```

 **Polars**, **Vaex** and other dataframe frameworks could benefit from
 third-party projects that make the interoperability with pandas use a
-more explicitly API.
+more explicit API. Integration with Polars was requested in
+[#47368](https://github.com/pandas-dev/pandas/issues/47368).

 _Current API_:

```python
polars.DataFrame(df.to_pandas()
                 .query('my_col > 0'))
```

 _Proposed API_:

```python
(pandas.DataFrame.io.from_polars(df)
                 .query('my_col > 0')
                 .io.to_polars())
```

 **DuckDB** provides an out-of-core engine able to push predicates before
 the data is loaded, making much better use of memory and significantly
 decreasing loading time. pandas, because of its eager nature is not able
 to easily implement this itself, but could benefit from a DuckDB loader.
-The loader can already be implemented inside pandas, or as a third-party
-extension with an arbitrary API. But this proposal would let the creation
-of a third-party extension with a standard and intuitive API:
+The loader can already be implemented inside pandas (it has already been
+proposed in [#45678](https://github.com/pandas-dev/pandas/issues/45678)),
+or as a third-party extension with an arbitrary API. But this proposal would
+allow the creation of a third-party extension with a standard and intuitive API:

```python
pandas.DataFrame.io.from_duckdb("SELECT *
                                 FROM 'dataset.parquet'
                                 WHERE my_col > 0")
```

 **Big data** systems such as Hive, Iceberg, Presto, etc. could benefit
 from a standard way to load data to pandas. Also regular **SQL databases**
 that can return their query results as Arrow, would benefit from better
 and faster connectors than the existing ones based on SQL Alchemy and
 Python structures.

 Any other format, including **domain-specific formats** could easily
 implement pandas connectors with a clear an intuitive API.

 ## Proposal extensions

 The scope of the current proposal is limited to the addition of the
 `DataFrame.io` namespace, and the automatic registration of functions defined
 by third-party projects, if an entrypoint is defined.

-Any changes to the current I/O of pandas are out of scope for this proposal,
-but the next tasks can be considered for future work and proposals:
+Any changes to the current connectors of pandas (e.g. `read_csv`,
+`from_records`, etc.) or their migration to the new system are out of scope for
+this proposal, but the next tasks can be considered for future work and proposals:

-- Migrate I/O connectors currently implemented in pandas to the new interface.
+- Migrate connectors currently implemented in pandas to the new interface.
   This would require a transition period where users would be warned that
   existing `DataFrame.read_*` may have been moved to `DataFrame.io.from_*`,
   and that the old API will stop working in a future version.
 - Move out of the pandas repository and into their own third-party projects
   some of the existing I/O connectors.
 - Implement with the new interface some of the data structures that the
   `DataFrame` constructor accepts.

 ## PDEP-9 History

-- 26 February 2023: Initial version
+- 5 March 2023: Initial version

From 4a8ba96c766af50a9e69f8190cf1b6571ff1b682 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Thu, 6 Apr 2023 12:05:31 +0400
Subject: [PATCH 04/13] Keep current I/O API and allow pandas as an interface

---
 web/pandas/pdeps/0009-io-extensions.md | 181 ++++++++++++------------
 1 file changed, 85 insertions(+), 96 deletions(-)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index 797649f81a2d3..3a5a1226694c1 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -10,17 +10,21 @@

 This document proposes that third-party projects implementing I/O or memory
 connectors can register them using Python's entrypoint system, and make them
-available to pandas users with a standard interface in a dedicated namespace
-`DataFrame.io`. For example:
+available to pandas users with the existing I/O interface. For example:

```python
import pandas

-df = pandas.DataFrame.io.from_duckdb("SELECT * FROM 'dataset.parquet';")
+df = pandas.DataFrame.read_duckdb("SELECT * FROM 'my_dataset.parquet';")

-df.io.to_hive(hive_conn, "hive_table")
+df.to_deltalake('/delta/my_dataset')
```

+This would make it easy to extend the existing number of connectors, adding
+support to new formats and database engines, data lake technologies,
+out-of-core connectors, the new ADBC interface and at the same time reduce the
+maintenance cost of the pandas core.
+
 ## Current state

 pandas supports importing and exporting data from different formats using
 I/O connectors, currently implemented in `pandas/io`, as well as connectors
-to in-memory structures, like Python structures or other library formats.
+to in-memory structures like Python structures or other library formats.
 In many cases, those connectors wrap an existing Python library, while in
 some others, pandas implements the logic to read and write to a particular
 format.

 In some cases, different engines exist for the same format. The API to use
 those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to

 there are some limitations to it:

- Given the large number of formats supported by pandas itself, third-party
  connectors are likely seen as second class citizens, not important enough
  to be used, or not well supported.
-- There is no standard API for I/O connectors, and users of them need to learn
-  each of them individually.
+- There is no standard API for external I/O connectors, and users need
+  to learn each of them individually.
- Method chaining is not possible with third-party I/O connectors to export
  data, unless authors monkey patch the `DataFrame` class, which should not
  be encouraged.

 ### Proposal implementation

 Implementing this proposal would not require major changes to pandas, and
 the API defined next would be used.

 #### User API

+Users will be able to install third-party packages implementing pandas
+connectors using the standard packaging tools (pip, conda, etc.). These
+connectors should implement an entrypoint that pandas will use to
+automatically create the corresponding methods `pandas.read_*` and
+`pandas.DataFrame.to_*`. Arbitrary function or method names will not
+be created by this interface; only the `read_*` and `to_*` pattern will
+be allowed.
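+
+As a minimal sketch of how this registration could work (using the standard
+`importlib.metadata` API, Python 3.10+; the loader function itself is
+hypothetical, not part of this proposal):
+
+```python
+from importlib.metadata import entry_points
+
+import pandas
+
+def _load_io_plugins():
+    # Create the pandas.read_* functions and DataFrame.to_* methods from
+    # the connectors registered in the "dataframe.io" entrypoint group
+    for ep in entry_points(group="dataframe.io"):
+        if ep.name.startswith("read_"):
+            setattr(pandas, ep.name, ep.load())
+        elif ep.name.startswith("to_"):
+            setattr(pandas.DataFrame, ep.name, ep.load())
+        else:
+            raise RuntimeError(
+                f"Connector {ep.name!r} must start with 'read_' or 'to_'"
+            )
+```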
+By simply installing the appropriate packages users will
+be able to use code like this:

```python
import pandas

-df = pandas.DataFrame.io.from_duckdb("SELECT * FROM 'dataset.parquet';")
+df = pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")

-df.io.to_hive(hive_conn, "hive_table")
+df.to_hive(hive_conn, "hive_table")
```

 This API allows for method chaining:

```python
-(pandas.DataFrame.io.from_duckdb("SELECT * FROM 'dataset.parquet';")
-    .io.to_hive(hive_conn, "hive_table"))
+(pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")
+    .to_hive(hive_conn, "hive_table"))
```

-By using a dedicated `.io` namespace, the amount of functions and methods in
-the already big namespaces of the `pandas` module and the `pandas.DataFrame`
-class will not be increased, and core pandas functionality would not be mixed
-with functionality registered via third-party plugins.
+The total number of I/O functions and methods is expected to be small, as users
+in general use only a small subset of formats. The number could actually be
+reduced from the current state if the less popular formats (such as SAS, SPSS,
+BigQuery, etc.) are removed from the pandas core into third-party packages.
+Moving these connectors is not part of this proposal, and will be discussed
+later in a separate proposal.

 #### Plugin registration

 Third-party packages would implement an
 [entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins)
 to define the connectors that they implement, under a group `dataframe.io`.

-For example, a hypothetical project `pandas_duckdb` implementing a `from_duckdb`
+For example, a hypothetical project `pandas_duckdb` implementing a `read_duckdb`
 function, could use `pyproject.toml` to define the next entry point:

```toml
[project.entry-points."dataframe.io"]
-from_duckdb = "pandas_duckdb:from_duckdb"
+from_duckdb = "pandas_duckdb:read_duckdb"
```

 On import of the pandas module, it would read the entrypoint registry for the
-`dataframe.io` group, and would dynamically create methods in the `DataFrame.io`
-namespace for them. Method not starting with `from_` or `to_` would make pandas
-raise an exception. This would guarantee a reasonably consistent API among
-third-party connectors.
+`dataframe.io` group, and would dynamically create methods in the `pandas` and
+`pandas.DataFrame` namespace for them. Methods not starting with `read_` or `to_`
+would make pandas raise an exception. This would guarantee a reasonably
+consistent API among third-party connectors.

 #### Internal API

-Connectors would use Apache Arrow as the only interface to interact with pandas
-or any other project using them. In case an internal representation other than
-Arrow is used, like pandas using NumPy-backed data, the data will be converted to it
-while creating the dataframe object in pandas, not in the connector. Consider the
-next example (only for illustration, it does not use the exact proposed API,
-which would be created dynamically):
+Connectors would use one of two different interface options: a pandas `DataFrame`
+or an Apache Arrow table.
+
+The Apache Arrow format would mean that connectors do not need to use pandas to
+create the data, making them more robust and less likely to break when changes to
+pandas internals happen. It would also allow other possible consumers of the
+connectors to not have pandas as a dependency. Testing also becomes simpler by
+using Apache Arrow, since connectors can be tested independently, and pandas does
+not need to be tested for each connector. If the Apache Arrow specification is
+respected on both sides, the communication between connectors and pandas is
+guaranteed to work. If pandas eventually has a hard dependency on an Apache
+Arrow implementation, this should be the preferred interface.
+
+Allowing connectors to use pandas dataframes directly means users do not have to
+depend on PyArrow for connectors that do not use an Apache Arrow object. It also
+helps move existing connectors to this new API, since they are using pandas
+dataframes as an exchange object now. It has the disadvantages stated in the
+previous paragraph, and a future proposal may be created to discuss deprecating
+pandas dataframes as a possible interface for connectors.

-```python
-class DataFrame:
-    class io:
-        @classmethod
-        def from_myformat(cls, *args, **kwargs):
-            arrow_table = third_party_connector.from_myformat(*args, **kwargs)
-
-            # Final objects do not necessarily need to use Arrow
-            return convert_arrow_table_to_pandas_dataframe(arrow_table)
-```
-
-By standardizing the exchange format to Apache Arrow, there are some advantages:
-
-- The connectors can be reused by a wider variety of projects. Other dataframe
-  libraries could use them (e.g.
Polars, Vaex, etc.), or other projects in the ecosystem - that could consume data as Arrow objects (e.g. matplotlib, scikit-learn, etc.) -- Connectors don't need to use internal pandas APIs (like extension array objects). - If they used them changes to those internal APIs would cause connectors to break. - By using Arrow as the exchange format, connectors shouldn't be affected by any - change to pandas, and pandas development can be executed without worrying about - any impact to the connectors ecosystem. -- The previous point would also simplify testing of both pandas and the connectors. - When implementations are very coupled, it's easy that changes to one project impact - the other, and it's common that a project tests another downstream project in its - test suite. This is the currently the case in pandas with the xarray, and Google - Big Query connectors, as well as other downstream libraries such as Dask or - scikit-learn. By using Arrow and fully decoupling the implementations, - testing of connectors would not be needed in pandas, and connectors wouldn not - need to test compatibility with pandas or other consumers either. - -In case a `from_` method returned something different than a PyArrow table, +Connectors would use one of two different interface options: a pandas `DataFrame` +or an Apache Arrow table. + +The Apache Arrow format would allow that connectors do not need to use pandas to +create the data, making them more robust and less likely to break when changes to +pandas internals happen. It would also allow other possible consumers of the +connectors to not have pandas as a dependency. Testing also becomes simpler by +using Apache Arrow, since connectors can be tested independently, and pandas does +not need to be tested for each connector. If the Apache Arrow specification is +respected in both sides, the communication between connectors and pandas is +guaranteed to work. If pandas eventually has a hard dependency on an Apache +Arrow implementation, this should be the preferred interface. + +Allowing connectors to use pandas dataframes directly makes users not have to +depend on PyArrow for connectors that do not use an Apache Arrow object. It also +helps move existing connectors to this new API, since they are using pandas +dataframes as an exchange object now. It has the disadvantages stated in the +previous paragraph, and a future proposal may be created to discuss deprecating +pandas dataframes as a possible interface for connectors. + +In case a `read` method returned something different than a PyArrow table, pandas would raise an exception. pandas would expect all `to_` methods to have -`table: pyarrow.Table` as the first parameter, and it would raise an exception -otherwise. The `table` parameter would be exposed as the `self` parameter in -pandas, when the original function is registered as a method of the `.io` -accessor. - -Metadata not supported by Apache Arrow may be provided by users. For example the -column to use for row indices or the data type backend to use in the object being -created. This would be managed independently from the connectors. Given the previous -example, a new argument `index_col` could be added directly into pandas: +`table: pyarrow.Table | pandas.DataFrame` as the first parameter, and it would +raise an exception otherwise. The `table` parameter would be exposed as the +`self` parameter of the `to_*` method in pandas. + +In case the Apache Arrow interface is used, metadata not supported by Apache +Arrow may be provided by users. 
For example the column to use for row indices +or the data type backend to use in the object being created. This would be +managed independently from the connectors. Given the previous example, a new +argument `index_col` could be added directly into pandas to the function or +method automatically generated from the entrypoint. Since this would apply to +all functions and methods automatically generated, it would also improve the +consistency of pandas connectors. For example: ```python -class DataFrame: - class io: - @classmethod - def from_myformat(cls, index_col=None, *args, **kwargs): - # The third-party connector doesn't need to know about functionality - # specific to pandas like the row index - arrow_table = third_party_connector.from_myformat(*args, **kwargs) +def read_myformat(index_col=None, *args, **kwargs): + # The third-party connector doesn't need to know about functionality + # specific to pandas like the row index + arrow_table = third_party_connector.from_myformat(*args, **kwargs) - df = convert_arrow_table_to_pandas_dataframe(arrow_table) + df = convert_arrow_table_to_pandas_dataframe(arrow_table) - # Transformations to the dataframe with custom parameters is possible - if index_col is not None: - df = df.set_index(index_col) + # Transformations to the dataframe with custom parameters is possible + if index_col is not None: + df = df.set_index(index_col) - return df + return df ``` -Since the methods of `pandas.DataFrame` would be dynamically created, those custom -arguments would be generic and added to all `from_` and/or `to_` connectors at once, -avoiding code duplication and creating a more consistent API. - #### Connector guidelines In order to provide a better and more consistent experience to users, guidelines @@ -373,6 +361,10 @@ pandas.DataFrame.io.from_duckdb("SELECT * WHERE my_col > 0") ``` +**Out-of-core algorithms** push some operations like filtering or grouping +to the loading of the data. While this is not currently possible, connectors +implementing out-of-core algorithms could be developed using this interface. + **Big data** systems such as Hive, Iceberg, Presto, etc. could benefit from a standard way to load data to pandas. Also regular **SQL databases** that can return their query results as Arrow, would benefit from better @@ -384,20 +376,17 @@ implement pandas connectors with a clear an intuitive API. ## Proposal extensions -The scope of the current proposal is limited to the addition of the -`DataFrame.io` namespace, and the automatic registration of functions defined -by third-party projects, if an entrypoint is defined. +The scope of the current proposal is limited to the registration of functions +defined by third-party projects, if an entrypoint is defined. Any changes to the current connectors of pandas (e.g. `read_csv`, `from_records`, etc.) or their migration to the new system are out of scope for this proposal, but the next tasks can be considered for future work and proposals: -- Migrate connectors currently implemented in pandas to the new interface. - This would require a transition period where users would be warned that - existing `DataFrame.read_*` may have been moved to `DataFrame.io.from_*`, - and that the old API will stop working in a future version. - Move out of the pandas repository and into their own third-party projects - some of the existing I/O connectors. + some of the existing I/O connectors. 
This would require a transition period + to let users know that future versions of pandas will require a dependency + installed for a particular connector to exist. - Implement with the new interface some of the data structures that the `DataFrame` constructor accepts. From 5cb47d938b2c672fbecf94664caca6666735d7f8 Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Fri, 7 Apr 2023 12:10:28 +0400 Subject: [PATCH 05/13] Rejecting --- web/pandas/pdeps/0009-io-extensions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md index 3a5a1226694c1..404bdcc6975eb 100644 --- a/web/pandas/pdeps/0009-io-extensions.md +++ b/web/pandas/pdeps/0009-io-extensions.md @@ -1,7 +1,7 @@ # PDEP-9: Allow third-party projects to register pandas connectors with a standard API - Created: 5 March 2023 -- Status: Under discussion +- Status: Rejected - Discussion: [#51799](https://github.com/pandas-dev/pandas/pull/51799) - Author: [Marc Garcia](https://github.com/datapythonista) - Revision: 1 From 68ca3de3c85a71d360c2d97716cdc74c92e204ca Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Fri, 7 Apr 2023 12:11:58 +0400 Subject: [PATCH 06/13] Reorder interfaces --- web/pandas/pdeps/0009-io-extensions.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md index 404bdcc6975eb..c47b6af386c78 100644 --- a/web/pandas/pdeps/0009-io-extensions.md +++ b/web/pandas/pdeps/0009-io-extensions.md @@ -204,8 +204,8 @@ consistent API among third-party connectors. #### Internal API -Connectors would use one of two different interface options: a pandas `DataFrame` -or an Apache Arrow table. +Connectors would use one of two different interface options: an Apache Arrow table +or a pandas `DataFrame`. The Apache Arrow format would allow that connectors do not need to use pandas to create the data, making them more robust and less likely to break when changes to From 150d1d1ca2bfaf1e4578ee109a4d9442f5e35965 Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Sat, 29 Apr 2023 17:48:07 +0100 Subject: [PATCH 07/13] Update web/pandas/pdeps/0009-io-extensions.md Co-authored-by: Simon Hawkins --- web/pandas/pdeps/0009-io-extensions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md index c47b6af386c78..63d0ebed06c14 100644 --- a/web/pandas/pdeps/0009-io-extensions.md +++ b/web/pandas/pdeps/0009-io-extensions.md @@ -372,7 +372,7 @@ and faster connectors than the existing ones based on SQL Alchemy and Python structures. Any other format, including **domain-specific formats** could easily -implement pandas connectors with a clear an intuitive API. +implement pandas connectors with a clear and intuitive API. 
## Proposal extensions

From 6eea8a8d2a6456c77402c0a600f5265e170c70b0 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Tue, 30 May 2023 14:10:07 +0400
Subject: [PATCH 08/13] Use dataframe interchange protocol

---
 web/pandas/pdeps/0009-io-extensions.md | 211 +++++++++++--------------
 1 file changed, 94 insertions(+), 117 deletions(-)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index c47b6af386c78..dbaed0f1131ed 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -1,16 +1,21 @@
 # PDEP-9: Allow third-party projects to register pandas connectors with a standard API

 - Created: 5 March 2023
-- Status: Rejected
+- Status: Under discussion
 - Discussion: [#51799](https://github.com/pandas-dev/pandas/pull/51799)
+  [#53005](https://github.com/pandas-dev/pandas/pull/53005)
 - Author: [Marc Garcia](https://github.com/datapythonista)
 - Revision: 1

 ## PDEP Summary

 This document proposes that third-party projects implementing I/O or memory
-connectors can register them using Python's entrypoint system, and make them
-available to pandas users with the existing I/O interface. For example:
+connectors to pandas can register them using Python's entrypoint system,
+and make them available to pandas users with the usual pandas I/O interface.
+Packages independent from pandas could, for instance, implement readers from
+DuckDB and writers to Delta Lake, and when installed in the user environment,
+users would be able to use them as if they were implemented in pandas.
+For example:

 ```python
 import pandas
@@ -22,14 +27,14 @@ df.to_deltalake('/delta/my_dataset')

 This would make it easy to extend the existing set of connectors, adding
 support to new formats and database engines, data lake technologies,
-out-of-core connectors, the new ADBC interface and at the same time reduce the
-maintenance cost of the pandas core.
+out-of-core connectors, the new ADBC interface, and others, and at the
+same time reduce the maintenance cost of the pandas code base.

 ## Current state

 pandas supports importing and exporting data from different formats using
 I/O connectors, currently implemented in `pandas/io`, as well as connectors
-to in-memory structures, like Python structures or other library formats.
+to in-memory structures like Python structures or other library formats.
 In many cases, those connectors wrap an existing Python library, while in
 some others, pandas implements the logic to read and write to a particular
 format.
@@ -48,8 +53,8 @@ In some cases, the pandas API provides `DataFrame.to_*` methods that are not
 used to export the data to a disk or memory object, but instead to transform
 the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.

-Dependencies of the connectors are not loaded by default, and will be
-imported when the connector is used. If the dependencies are not installed,
+Dependencies of the connectors are not loaded by default, and are
+imported when the connector is used. If the dependencies are not installed,
 an `ImportError` is raised.

 ```python
@@ -105,7 +110,7 @@ format in pandas.

 The number of existing formats available for data that can be processed with
 pandas is constantly increasing, and it's difficult for pandas to keep up to
-date even with popular formats. It could possibly make sense to have connectors
+date even with popular formats. It possibly makes sense to have connectors
 to PyArrow, PySpark, Iceberg, DuckDB, Hive, Polars, and many others.
At the same time, some of the formats are not frequently used as shown in the @@ -119,8 +124,7 @@ code and reviewing pull requests, but also it has a significant cost in time spent on CI systems installing dependencies, compiling code, running tests, etc. In some cases, the main maintainers of some of the connectors are not part of -the pandas core development team, but people specialized in one of the formats -without commit rights. +the pandas core development team, but people specialized in one of the formats. ## Proposal @@ -136,7 +140,11 @@ there are some limitations to it: connectors are likely seen as second class citizens, not important enough to be used, or not well supported. - There is no standard API for external I/O connectors, and users need - to learn each of them individually. + to learn each of them individually. Since the pandas I/O API is inconsistent + by using read/to instead of read/write or from/to, developers in many cases + ignore the convention. Also, even if developers follow the pandas convention + the namespaces would be different, since developers of connectors will rarely + monkeypatch their functions into the `pandas` or `DataFrame` namespace. - Method chaining is not possible with third-party I/O connectors to export data, unless authors monkey patch the `DataFrame` class, which should not be encouraged. @@ -153,12 +161,14 @@ the API defined next would be used. Users will be able to install third-party packages implementing pandas connectors using the standard packaging tools (pip, conda, etc.). These -connectors should implement an entrypoint that pandas will use to -automatically create the corresponding methods `pandas.read_*` and -`pandas.DataFrame.to_*`. Arbitrary function or method names will not -be created by this interface, only the `read_*` and `to_*` pattern will -be allowed. By simply installing the appropriate packages users will -be able to use code like this: +connectors should implement entrypoints that pandas will use to +automatically create the corresponding methods `pandas.read_*`, +`pandas.DataFrame.to_*` and `pandas.Series.to_*`. Arbitrary function or +method names will not be created by this interface, only the `read_*` +and `to_*` pattern will be allowed. + +By simply installing the appropriate packages users will be able to use +code like this: ```python import pandas @@ -179,13 +189,13 @@ The total number of I/O functions and methods is expected to be small, as users in general use only a small subset of formats. The number could actually be reduced from the current state if the less popular formats (such as SAS, SPSS, BigQuery, etc.) are removed from the pandas core into third-party packages. -Moving these connectors is not part of this proposal, and will be discussed +Moving these connectors is not part of this proposal, and could be discussed later in a separate proposal. #### Plugin registration -Third-party packages would implement an -[entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins) +Third-party packages would implement +[entrypoints](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins) to define the connectors that they implement, under a group `dataframe.io`. 
For example, a hypothetical project `pandas_duckdb` implementing a `read_duckdb`
+function could use `pyproject.toml` to define the following entry point:

 ```toml
 [project.entry-points."dataframe.io"]
-from_duckdb = "pandas_duckdb:read_duckdb"
+reader_duckdb = "pandas_duckdb:read_duckdb"
 ```

 On import of the pandas module, it would read the entrypoint registry for the
-`dataframe.io` group, and would dynamically create methods in the `pandas` and
-`pandas.DataFrame` namespace for them. Method not starting with `read_` or `to_`
-would make pandas raise an exception. This would guarantee a reasonably
-consistent API among third-party connectors.
+`dataframe.io` group, and would dynamically create methods in the `pandas`,
+`pandas.DataFrame` and `pandas.Series` namespaces for them. Only entrypoints with
+names starting with `reader_` or `writer_` would be processed by pandas, and the functions
+registered in the entrypoint would be made available to pandas users in the corresponsing
+pandas namespaces. The text after the keywords `reader_` and `writer_` would be used
+for the name of the function. In the example above, the entrypoint name `reader_duckdb`
+would create `pandas.read_duckdb`. An entrypoint with name `writer_hive` would create
+the methods `DataFrame.to_hive` and `Series.to_hive`.
+
+Entrypoints not starting with `reader_` or `writer_` would be ignored by this interface,
+but would not raise an exception, since they can be used for future extensions of this
+API, or other related dataframe I/O interfaces.

 #### Internal API

-Connectors would use one of two different interface options: an Apache Arrow table
-or a pandas `DataFrame`.
-
-The Apache Arrow format would allow that connectors do not need to use pandas to
-create the data, making them more robust and less likely to break when changes to
-pandas internals happen. It would also allow other possible consumers of the
-connectors to not have pandas as a dependency. Testing also becomes simpler by
-using Apache Arrow, since connectors can be tested independently, and pandas does
-not need to be tested for each connector. If the Apache Arrow specification is
-respected on both sides, the communication between connectors and pandas is
-guaranteed to work. If pandas eventually has a hard dependency on an Apache
-Arrow implementation, this should be the preferred interface.
-
-Allowing connectors to use pandas dataframes directly means users do not have to
-depend on PyArrow for connectors that do not use an Apache Arrow object. It also
-helps move existing connectors to this new API, since they already use pandas
-dataframes as an exchange object. It has the disadvantages stated in the
-previous paragraph, and a future proposal may be created to discuss deprecating
-pandas dataframes as a possible interface for connectors.
-
-In case a `read` method returned something other than a PyArrow table or a pandas `DataFrame`,
pandas would raise an exception. pandas would expect all `to_` methods to have
-`table: pyarrow.Table | pandas.DataFrame` as the first parameter, and it would
-raise an exception otherwise. The `table` parameter would be exposed as the
-`self` parameter of the `to_*` method in pandas.
-
-In case the Apache Arrow interface is used, metadata not supported by Apache
-Arrow may be provided by users. For example the column to use for row indices
-or the data type backend to use in the object being created. This would be
-managed independently from the connectors. Given the previous example, a new
-argument `index_col` could be added directly into pandas to the function or
-method automatically generated from the entrypoint. Since this would apply to
-all functions and methods automatically generated, it would also improve the
-consistency of pandas connectors. For example:

```python
-def read_myformat(index_col=None, *args, **kwargs):
-    # The third-party connector doesn't need to know about functionality
-    # specific to pandas like the row index
-    arrow_table = third_party_connector.from_myformat(*args, **kwargs)
-
-    df = convert_arrow_table_to_pandas_dataframe(arrow_table)
-
-    # Transformations to the dataframe with custom parameters is possible
-    if index_col is not None:
-        df = df.set_index(index_col)
-
-    return df
-```
+Connectors will use the dataframe interchange API to provide data to pandas. When
+data is read from a connector, and before returning it to the user as a response
+to `pandas.read_<format>`, data will be parsed from the data interchange interface
+and converted to a pandas DataFrame. In practice, connectors are likely to return
+a pandas DataFrame or a PyArrow Table, but the interface will support any object
+implementing the dataframe interchange API.
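+
+As an illustration only, and not part of the normative proposal: assuming the
+`pandas.api.interchange.from_dataframe` function available since pandas 1.5,
+the conversion step could look roughly like the following sketch, where the
+`_wrap_reader` helper name is hypothetical:
+
+```python
+import pandas
+from pandas.api.interchange import from_dataframe
+
+
+def _wrap_reader(connector_function):
+    """Wrap a registered reader so users always get a pandas DataFrame back."""
+    def reader(*args, **kwargs):
+        result = connector_function(*args, **kwargs)
+        if isinstance(result, pandas.DataFrame):
+            # Fast path: the connector already produced a pandas object
+            return result
+        # Any object implementing __dataframe__ (the interchange API) is accepted
+        return from_dataframe(result)
+    return reader
+```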
 #### Connector guidelines

 In order to provide a better and more consistent experience to users, guidelines
 will be created to unify terminology and behavior. Some of the topics to unify
 are defined next.

-**Existence and naming of columns**, since many connectors are likely to provide
+**Guidelines to avoid name conflicts**. Since it is expected that more than one
+implementation exists for certain formats, as it already happens, guidelines on
+how to name connectors would be created. The easiest approach is probably to use
+as the format a string of the type `to_<format>_<implementation>` when it is
+expected that more than one connector can exist. For example, for LanceDB it is likely
+that only one connector exists, and the name `lance` can be used (which would create
+`pandas.read_lance` or `DataFrame.to_lance`). But if a new `csv` reader based on the
+Arrow2 Rust implementation is created, the guidelines can recommend using `csv_arrow2` to
+create `pandas.read_csv_arrow2`, etc.
+
+**Existence and naming of parameters**, since many connectors are likely to provide
 similar features, like loading only a subset of columns in the data, or dealing
-with file names. Examples of recommendations to connector developers:
+with paths. Examples of recommendations to connector developers could be:

 - `columns`: Use this argument to let the user load a subset of columns. Allow a
   list or tuple.
 - `path`: Use this argument if the dataset is a file on disk. Allow a string, a
   `pathlib.Path` object, or a file descriptor. For a string object, allow URLs that
   will be automatically downloaded, compressed files that will be automatically
-  uncompressed, etc. A library can be provided to deal with those in an easier and
-  more consistent way.
+  uncompressed, etc. Specific libraries can be recommended to deal with those in an
+  easier and more consistent way.
 - `schema`: For datasets that don't have a schema (e.g. `csv`), allow providing an
   Apache Arrow schema instance, and automatically infer types if not provided.

 Note that the above are only examples of guidelines for illustration, and not
-a proposal of the guidelines.
-
-**Guidelines to avoid name conflicts**. Since it is expected that more than one
-implementation exists for certain formats, as it already happens, guidelines on
-how to name connectors would be created. The easiest approach is probably to use
-the format `from_<format>_<implementation>` / `to_<format>_<implementation>`.
-
-For example a `csv` loader based on PyArrow could be named as `from_csv_pyarrow`,
-and an implementation that does not infer types and raises an exception in case
-of mistake or ambiguity it could be named `from_csv_strict`. Exact guidelines
-would be developed independently from this proposal.
+a proposal of the guidelines, which would be developed independently after this
+PDEP is approved.

 **Connector registry and documentation**. To simplify the discovery of connectors
 and their documentation, connector developers can be encouraged to register their
 projects in a central location, and to use a standard structure for documentation.
 This would allow the creation of a unified website to find the available
 connectors, and their documentation. It would also allow to customize the
-documentation for specific implementations, and include their final API, and
-include arguments specific to the implementation. In the case of pandas, it
-would allow to add arguments such as `index_col` to all loader methods, and
-to potentially build the API reference of certain third-party connectors as part
-as the pandas own documentation. That may or may not be a good idea, but
-standardizing the documentation of connectors would allow it.
+documentation for specific implementations, and include their final API.

 ### Connector examples
@@ -321,9 +292,9 @@ pyarrow.Table.from_pandas(table.to_pandas()

 _Proposed API_:

 ```python
-(pandas.DataFrame.io.from_pyarrow(table)
-    .query('my_col > 0')
-    .io.to_pyarrow())
+(pandas.read_pyarrow(table)
+    .query('my_col > 0')
+    .to_pyarrow())
 ```

 **Polars**, **Vaex** and other dataframe frameworks could benefit from
@@ -341,9 +312,9 @@ polars.DataFrame(df.to_pandas()

 _Proposed API_:

 ```python
-(pandas.DataFrame.io.from_polars(df)
-    .query('my_col > 0')
-    .io.to_polars())
+(pandas.read_polars(df)
+    .query('my_col > 0')
+    .to_polars())
 ```

 **DuckDB** provides an out-of-core engine able to push predicates before
@@ -356,9 +327,9 @@ or as a third-party extension with an arbitrary API. But this proposal
 would allow the creation of a third-party extension with a standard and intuitive API:

 ```python
-pandas.DataFrame.io.from_duckdb("SELECT *
-                                 FROM 'dataset.parquet'
-                                 WHERE my_col > 0")
+pandas.read_duckdb("SELECT *
+                    FROM 'dataset.parquet'
+                    WHERE my_col > 0")
 ```

 **Out-of-core algorithms** push some operations like filtering or grouping
 to the loading of the data. While this is not currently possible, connectors
 implementing out-of-core algorithms could be developed using this interface.
@@ -374,22 +345,28 @@ Python structures.

 Any other format, including **domain-specific formats** could easily
 implement pandas connectors with a clear an intuitive API.

-## Proposal extensions
+## Future plans
+
+This PDEP is exclusively to support a better API for existing or future
+connectors. It is out of scope for this PDEP to implement changes to any
+connectors existing in the pandas code base.
+
+Some ideas for the future related to this PDED include:

-The scope of the current proposal is limited to the registration of functions
-defined by third-party projects, if an entrypoint is defined.
+- Removing from the pandas code base some of the least frequently used connectors,
+such as SAS, SPSS or Google BigQuery, and moving them to third-party connectors
+registered with this interface.

-Any changes to the current connectors of pandas (e.g. `read_csv`,
-`from_records`, etc.) or their migration to the new system are out of scope for
-this proposal, but the next tasks can be considered for future work and
-proposals:
+- Discussing a better API for pandas connectors. For example, using `read_*`
+methods instead of `from_*` methods, renaming `to_*` methods not used as I/O
+connectors, using a consistent terminology like from/to, read/write, load/dump, etc.
+or using a dedicated namespace for connectors (e.g. `pandas.io` instead of the
+general `pandas` namespace).

-- Move out of the pandas repository and into their own third-party projects
-  some of the existing I/O connectors. This would require a transition period
-  to let users know that future versions of pandas will require a dependency
-  installed for a particular connector to exist.
-- Implement with the new interface some of the data structures that the
-  `DataFrame` constructor accepts.
+- Implement as I/O connectors some of the formats supported by the `DataFrame`
+constructor.

 ## PDEP-9 History

 - 5 March 2023: Initial version
+- 30 May 2023: Major refactoring to use the pandas existing API and the dataframe interchange API

From 40ebacca5db863c6df97142b4b1da9bf484fdda9 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Tue, 30 May 2023 14:32:04 +0400
Subject: [PATCH 09/13] typo

---
 web/pandas/pdeps/0009-io-extensions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index 5aebddc7e73d1..aeda62441e5cf 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -210,7 +210,7 @@ On import of the pandas module, it would read the entrypoint registry for the
 `dataframe.io` group, and would dynamically create methods in the `pandas`,
 `pandas.DataFrame` and `pandas.Series` namespaces for them. Only entrypoints with
 names starting with `reader_` or `writer_` would be processed by pandas, and the functions
-registered in the entrypoint would be made available to pandas users in the corresponsing
+registered in the entrypoint would be made available to pandas users in the corresponding
 pandas namespaces. The text after the keywords `reader_` and `writer_` would be used
 for the name of the function. In the example above, the entrypoint name `reader_duckdb`
 would create `pandas.read_duckdb`. An entrypoint with name `writer_hive` would create
 the methods `DataFrame.to_hive` and `Series.to_hive`.

From eb7c6f06424d6d88135112172b7179b319814317 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Tue, 30 May 2023 15:07:01 +0400
Subject: [PATCH 10/13] Make users load modules explicitly

---
 web/pandas/pdeps/0009-io-extensions.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index aeda62441e5cf..3896d0caa4451 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -20,6 +20,8 @@ For example:
 ```python
 import pandas

+pandas.load_io_plugins()
+
 df = pandas.DataFrame.read_duckdb("SELECT * FROM 'my_dataset.parquet';")

 df.to_deltalake('/delta/my_dataset')
@@ -167,12 +169,14 @@ automatically create the corresponding methods `pandas.read_*`,
 method names will not be created by this interface, only the `read_*`
 and `to_*` pattern will be allowed.
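+
+To illustrate the mechanism, the following is a minimal sketch of what the
+plugin loading could look like, assuming the `importlib.metadata.entry_points(group=...)`
+selection available since Python 3.10. It is illustrative only, not the proposed
+implementation, and omits validation and signature adaptation:
+
+```python
+import importlib.metadata
+
+import pandas
+
+
+def load_io_plugins():
+    # Discover connectors registered by third-party packages under the
+    # "dataframe.io" entrypoint group
+    for entrypoint in importlib.metadata.entry_points(group="dataframe.io"):
+        kind, _, format_name = entrypoint.name.partition("_")
+        function = entrypoint.load()
+        if kind == "reader":
+            # e.g. "reader_duckdb" becomes pandas.read_duckdb
+            setattr(pandas, f"read_{format_name}", function)
+        elif kind == "writer":
+            # e.g. "writer_hive" becomes DataFrame.to_hive and Series.to_hive
+            setattr(pandas.DataFrame, f"to_{format_name}", function)
+            setattr(pandas.Series, f"to_{format_name}", function)
+        # other entrypoint names are ignored, reserved for future extensions
+```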
By simply installing the appropriate packages and calling the function
+`pandas.load_io_plugins()`, users will be able to use code like this:

 ```python
 import pandas

+pandas.load_io_plugins()
+
 df = pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")

 df.to_hive(hive_conn, "hive_table")
@@ -206,7 +210,7 @@ function could use `pyproject.toml` to define the following entry point:
 reader_duckdb = "pandas_duckdb:read_duckdb"
 ```

-On import of the pandas module, it would read the entrypoint registry for the
+When the user calls `pandas.load_io_plugins()`, it would read the entrypoint registry for the
 `dataframe.io` group, and would dynamically create methods in the `pandas`,
 `pandas.DataFrame` and `pandas.Series` namespaces for them. Only entrypoints with
@@ -351,7 +355,9 @@ This PDEP is exclusively to support a better API for existing or future
 connectors. It is out of scope for this PDEP to implement changes to any
 connectors existing in the pandas code base.

-Some ideas for the future related to this PDED include:
+Some ideas for future discussion related to this PDED include:
+
+- Automatic loading of I/O plugins when pandas is imported.

 - Removing from the pandas code base some of the least frequently used connectors,
 such as SAS, SPSS or Google BigQuery, and moving them to third-party connectors
@@ -369,4 +375,6 @@ constructor.

 ## PDEP-9 History

 - 5 March 2023: Initial version
-- 30 May 2023: Major refactoring to use the pandas existing API and the dataframe interchange API
+- 30 May 2023: Major refactoring to use the pandas existing API,
+  the dataframe interchange API, and to make users load the plugins explicitly

From 805085346a69d80a0781d846e980ab7e5d63bebb Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Wed, 7 Jun 2023 10:15:18 +0400
Subject: [PATCH 11/13] Update web/pandas/pdeps/0009-io-extensions.md

Co-authored-by: Irv Lustig

---
 web/pandas/pdeps/0009-io-extensions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index 3896d0caa4451..5317296eb15c7 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -355,7 +355,7 @@ This PDEP is exclusively to support a better API for existing or future
 connectors. It is out of scope for this PDEP to implement changes to any
 connectors existing in the pandas code base.

-Some ideas for future discussion related to this PDED include:
+Some ideas for future discussion related to this PDEP include:

From 5cb23ddb47fdde80455a8cecbf808612e62fab50 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Wed, 7 Jun 2023 11:17:00 +0400
Subject: [PATCH 12/13] Add limitations section

---
 web/pandas/pdeps/0009-io-extensions.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index 5317296eb15c7..db60b98edbaa1 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -349,6 +349,30 @@ Python structures.

 Any other format, including **domain-specific formats** could easily
 implement pandas connectors with a clear and intuitive API.
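+
+To make the examples above concrete, a complete minimal connector could be as
+small as the following sketch. The `pandas_myformat` package and the
+`parse_myformat_file` helper are hypothetical, standing in for format-specific
+parsing:
+
+```python
+# Hypothetical package ``pandas_myformat``, registered in pyproject.toml as
+# ``reader_myformat = "pandas_myformat:read_myformat"``.
+import pandas
+
+
+def read_myformat(path, columns=None):
+    # ``parse_myformat_file`` is an assumed, format-specific parser returning
+    # a list of records; any object implementing the dataframe interchange
+    # API could be returned instead of a pandas DataFrame.
+    records = parse_myformat_file(path)
+    df = pandas.DataFrame(records)
+    if columns is not None:
+        # Honor the ``columns`` argument recommended by the guidelines
+        df = df[list(columns)]
+    return df
+```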
### Limitations
+
+The implementation of this proposal has some limitations discussed here:
+
+- **Lack of support for multiple engines.** The current pandas I/O API
+  supports multiple engines for the same format (for the same function or
+  method name). For example, `read_csv(engine='pyarrow', ...)`. Supporting
+  engines requires that all engines for a particular format use the same
+  signature (the same parameters), which is not ideal. Different connectors
+  are likely to have different parameters, and using `*args` and `**kwargs`
+  provides users with a more complex and difficult experience. For this
+  reason this proposal prefers that function and method names be unique
+  instead of supporting an option for engines.
+- **Lack of support for type checking of connectors.** This PDEP proposes
+  creating functions and methods dynamically, and those are not supported
+  for type checking using stubs. This is already the case for other
+  dynamically created components of pandas, such as custom accessors.
+- **No improvements to the current I/O API.** In the discussions of this
+  proposal, improving the current pandas I/O API has been considered, to
+  fix the inconsistency of using `read` / `to` (instead of, for example,
+  `read` / `write`), to avoid using `to_` prefixed methods for non-I/O
+  operations, or to use a dedicated namespace (e.g. `DataFrame.io`) for
+  the connectors. All of these changes are out of scope for this PDEP.
+
 ## Future plans

 This PDEP is exclusively to support a better API for existing or future
 connectors. It is out of scope for this PDEP to implement changes to any
 connectors existing in the pandas code base.

From ccb9674bff276cbdddfbfa1cc59f1a42d208cc55 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Tue, 13 Jun 2023 16:21:02 +0400
Subject: [PATCH 13/13] Rejecting PDEP

---
 web/pandas/pdeps/0009-io-extensions.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0009-io-extensions.md b/web/pandas/pdeps/0009-io-extensions.md
index db60b98edbaa1..aeda990cea7df 100644
--- a/web/pandas/pdeps/0009-io-extensions.md
+++ b/web/pandas/pdeps/0009-io-extensions.md
@@ -1,7 +1,7 @@
 # PDEP-9: Allow third-party projects to register pandas connectors with a standard API

 - Created: 5 March 2023
-- Status: Under discussion
+- Status: Rejected
 - Discussion: [#51799](https://github.com/pandas-dev/pandas/pull/51799)
   [#53005](https://github.com/pandas-dev/pandas/pull/53005)
 - Author: [Marc Garcia](https://github.com/datapythonista)
 - Revision: 1
@@ -402,3 +402,5 @@ constructor.

 ## PDEP-9 History

 - 5 March 2023: Initial version
 - 30 May 2023: Major refactoring to use the pandas existing API,
   the dataframe interchange API, and to make users load the plugins explicitly
+- 13 June 2023: The PDEP did not get any support after several iterations,
+  and it has been closed as rejected by the author