ENH: expose to_pandas_kwargs in read_parquet with pyarrow backend #59654

Merged: 16 commits, Nov 21, 2024
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v2.3.0.rst
@@ -127,7 +127,7 @@ MultiIndex

I/O
^^^
--
+- ``pyarrow`` engine for :func:`read_parquet` accepts ``to_pandas_kwargs`` which are forwarded to :meth:`pyarrow.Table.to_pandas`. This enables passing in ``maps_as_pydicts`` to read parquet map datatypes as python dictionaries. (:issue:`56842`)
-

Period
5 changes: 3 additions & 2 deletions pandas/io/parquet.py
@@ -245,18 +245,19 @@ def read(
         dtype_backend: DtypeBackend | lib.NoDefault = lib.no_default,
         storage_options: StorageOptions | None = None,
         filesystem=None,
+        to_pandas_kwargs: dict[str, Any] | None = None,
         **kwargs,
     ) -> DataFrame:
         kwargs["use_pandas_metadata"] = True

-        to_pandas_kwargs = {}
+        to_pandas_kwargs = {} if to_pandas_kwargs is None else to_pandas_kwargs
         if dtype_backend == "numpy_nullable":
             from pandas.io._util import _arrow_dtype_mapping

             mapping = _arrow_dtype_mapping()
             to_pandas_kwargs["types_mapper"] = mapping.get
         elif dtype_backend == "pyarrow":
-            to_pandas_kwargs["types_mapper"] = pd.ArrowDtype  # type: ignore[assignment]
+            to_pandas_kwargs["types_mapper"] = pd.ArrowDtype
         elif using_string_dtype():
             to_pandas_kwargs["types_mapper"] = arrow_string_types_mapper()
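
Two consequences of the defaulting pattern in this hunk may be worth noting: the caller's dict is reused rather than copied, and a types_mapper chosen by dtype_backend overwrites any types_mapper the caller supplied. The following is a standalone sketch of that pattern in plain Python, not the pandas implementation. All names in it (resolve_to_pandas_kwargs, resolve_with_copy, user_mapper, backend_mapper) are hypothetical stand-ins.

```python
from __future__ import annotations

from typing import Any, Callable


def resolve_to_pandas_kwargs(
    to_pandas_kwargs: dict[str, Any] | None,
    backend_mapper: Callable[..., Any] | None,
) -> dict[str, Any]:
    """Mirror the pattern in the diff: default to {}, then set types_mapper."""
    resolved = {} if to_pandas_kwargs is None else to_pandas_kwargs
    if backend_mapper is not None:
        # Overwrites a caller-supplied types_mapper, and because `resolved`
        # is the caller's own dict, the caller sees the change too.
        resolved["types_mapper"] = backend_mapper
    return resolved


def resolve_with_copy(
    to_pandas_kwargs: dict[str, Any] | None,
    backend_mapper: Callable[..., Any] | None,
) -> dict[str, Any]:
    """Defensive variant (an assumption, not in the PR): copy before mutating."""
    resolved = dict(to_pandas_kwargs or {})
    if backend_mapper is not None:
        resolved["types_mapper"] = backend_mapper
    return resolved


def user_mapper(arrow_type):
    """Hypothetical mapper a caller might pass."""
    return None


def backend_mapper(arrow_type):
    """Hypothetical mapper standing in for the dtype_backend choice."""
    return None


user_kwargs = {"maps_as_pydicts": "strict", "types_mapper": user_mapper}
resolved = resolve_to_pandas_kwargs(user_kwargs, backend_mapper)

copied_input = {"types_mapper": user_mapper}
copied = resolve_with_copy(copied_input, backend_mapper)
```

Whether the in-place mutation matters depends on whether callers reuse the same dict across calls; the copying variant is shown only to make the difference concrete.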

14 changes: 14 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -1173,6 +1173,20 @@ def test_non_nanosecond_timestamps(self, temp_file):
             )
         tm.assert_frame_equal(result, expected)

+    def test_maps_as_pydicts(self, pa):
+        pyarrow = pytest.importorskip("pyarrow", "13.0.0")
+
+        schema = pyarrow.schema(
+            [("foo", pyarrow.map_(pyarrow.string(), pyarrow.int64()))]
+        )
+        df = pd.DataFrame([{"foo": {"A": 1}}, {"foo": {"B": 2}}])
+        check_round_trip(
+            df,
+            pa,
+            write_kwargs={"schema": schema},
+            read_kwargs={"to_pandas_kwargs": {"maps_as_pydicts": "strict"}},
+        )
+

 class TestParquetFastParquet(Base):
     @pytest.mark.xfail(reason="datetime_with_nat gets incorrect values")