Pandas 2.0 Support #94

Merged
41 commits
a43c8ed
Avoid pandas 2.1.0 due to timestamp bug
IzerOnadimQC Sep 8, 2023
f2ea716
Coerce timestamps to nanoseconds when converting to pandas
IzerOnadimQC Sep 14, 2023
fa15a47
Prevent dask from converting objects to strings
IzerOnadimQC Sep 18, 2023
c182ffb
Avoid dask 2023.9.2 due to failing tests
IzerOnadimQC Sep 18, 2023
8b1e0af
Cast metadata bytes to object to get around pandas bug
IzerOnadimQC Sep 18, 2023
d35b3e0
Generate arrow-compat reference data for 13.0.0
IzerOnadimQC Sep 19, 2023
5715758
Change package version in docs to match environment.yml
IzerOnadimQC Sep 19, 2023
1dbe229
Only use coerce timestamps arg on pyarrow>=13
IzerOnadimQC Sep 19, 2023
062f283
Remove pyarrow<8 tests from ci.yml
IzerOnadimQC Sep 20, 2023
f909ffe
Generate arrow-compat reference data for 12.0.0
IzerOnadimQC Sep 20, 2023
f0f0662
Avoid pandas 2.1.0 in numfocus nightly ci test
IzerOnadimQC Sep 20, 2023
3c56634
Avoid pandas 2.1.0.* in numfocus_nightly pip install
IzerOnadimQC Sep 20, 2023
60079ee
Add changelog entry and update setup.cfg
IzerOnadimQC Sep 20, 2023
2ccb2fe
Shrink PR
IzerOnadimQC Sep 20, 2023
d52f76c
Add dask tests for lines marked as uncovered by codecov
IzerOnadimQC Sep 20, 2023
fae951d
Check conda env before verbose import
IzerOnadimQC Sep 21, 2023
91ecebf
Use micromamba instead of mamba
IzerOnadimQC Sep 21, 2023
6248192
Pin pandas<2.1.0 due to bug in 2.1.0 and 2.1.1
IzerOnadimQC Sep 21, 2023
807650c
Check if adding pyarrow 13 tests improves coverage
IzerOnadimQC Sep 25, 2023
d45d160
Fix yaml error
IzerOnadimQC Sep 25, 2023
fdcecc3
Fix pyarrow install command and re-add removed tests
IzerOnadimQC Sep 25, 2023
0414b78
Experiment with tests for backwards compatibility
IzerOnadimQC Sep 25, 2023
1c2352d
Allow install of specific pandas version
IzerOnadimQC Sep 26, 2023
84cdbdb
Fix yaml error
IzerOnadimQC Sep 26, 2023
c14edda
Update changelog and re-add pyarrow 4.0.1 to ci.yml
IzerOnadimQC Sep 26, 2023
2c80c10
Remove test for pyarrow==3.0.0 as incompatible with pandas 2
IzerOnadimQC Sep 26, 2023
657704a
asv no longer supports dev, use run instead
IzerOnadimQC Sep 26, 2023
1f97207
Add environment arg to asv run
IzerOnadimQC Sep 26, 2023
c0218c3
Use astype when setting Series type due to change in pandas behaviour
IzerOnadimQC Sep 26, 2023
efb827f
Pin dask<2023.9.2
IzerOnadimQC Sep 26, 2023
61feb74
Add no-py-pin to pandas downgrade step
IzerOnadimQC Sep 26, 2023
ff4793f
Return to !=2023.9.2 due to broken CI
IzerOnadimQC Sep 26, 2023
f2d4a64
Switch CI operation order
IzerOnadimQC Sep 26, 2023
d088f3a
Test whether <2023.9.2 breaks CI
IzerOnadimQC Sep 26, 2023
db65092
Pin asv<0.6 due to API change
IzerOnadimQC Sep 27, 2023
7a7d1b3
Pin pyarrow>=4 due to pandas 2 incompatibility
IzerOnadimQC Sep 27, 2023
9b627d2
Pin asv during micromamba install
IzerOnadimQC Sep 27, 2023
5c1b49f
Remove square bracket notation from environment-docs.yml
IzerOnadimQC Sep 28, 2023
66bae06
Remove square bracket notation for dask in setup.cfg
IzerOnadimQC Sep 28, 2023
d27900d
Refactor PYARROW_LT_13 condition to remove repeated code
IzerOnadimQC Sep 28, 2023
41509b2
Add square brackets back to setup.cfg
IzerOnadimQC Sep 28, 2023
50 changes: 43 additions & 7 deletions .github/workflows/ci.yml
@@ -25,43 +25,73 @@ jobs:
matrix:
numfocus_nightly: [false]
os: ["ubuntu-latest"]
pyarrow: ["3.0.0", "4.0.1", "nightly"]
pandas: [""]
pyarrow: ["4.0.1", "nightly"]
python: ["3.8"]
include:
- numfocus_nightly: true
os: "ubuntu-latest"
pandas: ""
pyarrow: "4.0.1"
python: "3.10"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: "1.5.3"
pyarrow: "4.0.1"
python: "3.11"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: "1.5.3"
pyarrow: "13.0.0"
python: "3.11"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "5.0.0"
python: "3.9"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "6.0.1"
python: "3.9"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "7.0.0"
python: "3.10"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "8.0.1"
python: "3.10"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "9.0.0"
python: "3.10"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "10.0.1"
python: "3.11"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "11.0.0"
python: "3.11"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "12.0.0"
python: "3.11"
- numfocus_nightly: false
os: "ubuntu-latest"
pandas: ""
pyarrow: "13.0.0"
python: "3.11"
- numfocus_nightly: false
os: "macos-latest"
pandas: ""
pyarrow: "4.0.1"
python: "3.8"
continue-on-error: ${{ matrix.numfocus_nightly || matrix.pyarrow == 'nightly' }}
@@ -89,22 +119,28 @@ jobs:
cache-env: true
extra-specs: |
python=${{ matrix.PYTHON_VERSION }}
- name: Install repository
run: python -m pip install --no-build-isolation --no-deps --disable-pip-version-check -e .
- name: Install Pyarrow (non-nightly)
run: micromamba install pyarrow==${{ matrix.pyarrow }}
if: matrix.pyarrow != 'nightly'
# Don't pin python as older versions of pyarrow require older versions of python
# Pin asv so it doesn't get updated before the benchmarks are run
run: micromamba install -y --no-py-pin pyarrow==${{ matrix.pyarrow }} "pandas<2.1.0" "asv<0.6"
if: matrix.pyarrow != 'nightly' && matrix.pandas == ''
- name: Install Pyarrow (nightly)
# Install both arrow-cpp and pyarrow to make sure that we have the
# latest nightly of both packages. It is sadly not guaranteed that the
# nightlies and the latest release would otherwise work together.
run: micromamba update -c arrow-nightlies -c conda-forge arrow-cpp pyarrow
if: matrix.pyarrow == 'nightly'
- name: Pip Instal NumFOCUS nightly
- name: Install Pyarrow (downgrade pandas)
run: micromamba install -y --no-py-pin pyarrow==${{ matrix.pyarrow }} pandas==${{ matrix.pandas }}
if: matrix.pyarrow != 'nightly' && matrix.pandas != ''
- name: Pip Install NumFOCUS nightly
# NumFOCUS nightly wheels, contains numpy and pandas
# TODO(gh-45): Re-add numpy
run: python -m pip install --pre --upgrade --timeout=60 --extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple pandas
# TODO: Remove pandas version pin once https://github.com/pandas-dev/pandas/issues/55014 is fixed
run: python -m pip install --pre --upgrade --timeout=60 --extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple "pandas<2.1.0"
if: matrix.numfocus_nightly
- name: Install repository
run: python -m pip install --no-build-isolation --no-deps --disable-pip-version-check -e .
- name: Test import
run: |
python -c "import plateau"
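The `if:` conditions on the three pyarrow install steps are mutually exclusive, so exactly one of them runs for each matrix entry. A minimal Python sketch of that selection, together with the `continue-on-error` expression, could look like the following. The function names `select_install_step` and `allowed_to_fail` are hypothetical; they only model the workflow expressions and are not part of plateau or GitHub Actions.

```python
def select_install_step(pyarrow: str, pandas: str) -> str:
    """Return the name of the pyarrow install step whose `if:` condition holds."""
    if pyarrow == "nightly":
        return "Install Pyarrow (nightly)"
    if pandas == "":
        # No explicit pandas version: install with the "pandas<2.1.0" pin.
        return "Install Pyarrow (non-nightly)"
    # An explicit pandas version (e.g. "1.5.3") was requested for this job.
    return "Install Pyarrow (downgrade pandas)"


def allowed_to_fail(numfocus_nightly: bool, pyarrow: str) -> bool:
    """Mirror `matrix.numfocus_nightly || matrix.pyarrow == 'nightly'`."""
    return numfocus_nightly or pyarrow == "nightly"


if __name__ == "__main__":
    for pyarrow, pandas in [("4.0.1", ""), ("13.0.0", "1.5.3"), ("nightly", "")]:
        print(pyarrow, repr(pandas), "->", select_install_step(pyarrow, pandas))
```

Under this model, only the two `pandas: "1.5.3"` entries reach the downgrade step; every other non-nightly job picks up the `pandas<2.1.0` pin.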
8 changes: 8 additions & 0 deletions CHANGES.rst
@@ -2,6 +2,14 @@
Changelog
=========

Plateau 4.2.0 (unreleased)
==========================

* Support pandas 2
* Test pyarrow 12 and 13
* Prevent dask from casting all object dtypes to strings
* Remove tests for pyarrow<=3 as they fail with pandas>=2

Plateau 4.1.5 (2023-03-14)
==========================

6 changes: 3 additions & 3 deletions docs/environment-docs.yml
@@ -3,13 +3,13 @@ channels:
- conda-forge
dependencies:
- python>=3.8
- dask[dataframe]
- dask<2023.9.2
- decorator
- msgpack-python>=0.5.2
# Currently dask and numpy==1.16.0 clash
- numpy!=1.15.0,!=1.16.0
- pandas>=0.23.0, !=1.0.0
- pyarrow>=0.17.1,!=1.0.0
- pandas>=0.23.0,!=1.0.0,<2.1.0
- pyarrow>=4
- simplejson
- minimalkv
- toolz
9 changes: 5 additions & 4 deletions environment.yml
@@ -3,14 +3,15 @@ channels:
- conda-forge
- nodefaults
dependencies:
- dask!=2021.5.1,!=2021.6.0 # gh475 - 2021.5.1 and 2021.6.0 broke ci, omit those versions
# TODO: Investigate issue with dask 2023.9.2
- dask!=2021.5.1,!=2021.6.0,<2023.9.2 # gh475 - 2021.5.1 and 2021.6.0 broke ci, omit those versions
- decorator
- msgpack-python>=0.5.2
# Currently dask and numpy==1.16.0 clash
# TODO: add support for numpy>=1.23
- numpy!=1.15.0,!=1.16.0
- pandas>=0.23.0,!=1.0.0
- pyarrow>=0.17.1,!=1.0.0
- pandas>=0.23.0,!=1.0.0,<2.1.0
- pyarrow>=4
- simplejson
- minimalkv>=1.4.2
- toolz
@@ -36,6 +37,6 @@ dependencies:
# CLI
- ipython
# ASV // Benchmark
- asv
- asv<0.6
# Packaging infrastructure
- python-build
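The new pins combine a lower bound, exclusions and an upper bound in one comma-separated spec. As a rough illustration of how such a spec filters versions, here is a deliberately simplified checker. The `satisfies` helper is hypothetical and handles only plain numeric releases; real conda and pip matching also covers pre-releases, wildcards and build strings.

```python
import operator

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def _parse(version: str, width: int = 4) -> tuple:
    """Turn "2.1.0" into (2, 1, 0, 0); short versions are padded with zeros."""
    parts = tuple(int(part) for part in version.split("."))
    return parts + (0,) * (width - len(parts))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a comma-separated spec such as ">=0.23.0,<2.1.0"."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so ">=" is not read as ">".
        for symbol in (">=", "<=", "==", "!=", ">", "<"):
            if clause.startswith(symbol):
                bound = _parse(clause[len(symbol):])
                if not _OPS[symbol](_parse(version), bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


PANDAS_SPEC = ">=0.23.0,!=1.0.0,<2.1.0"
DASK_SPEC = "!=2021.5.1,!=2021.6.0,<2023.9.2"

if __name__ == "__main__":
    for candidate in ["1.5.3", "2.0.3", "2.1.0"]:
        print("pandas", candidate, satisfies(candidate, PANDAS_SPEC))
```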
24 changes: 21 additions & 3 deletions plateau/core/common_metadata.py
@@ -10,12 +10,14 @@
import pyarrow.parquet as pq
import simplejson
from minimalkv import KeyValueStore
from packaging import version

from plateau.core import naming
from plateau.core._compat import load_json
from plateau.core.naming import SINGLE_TABLE
from plateau.core.utils import ensure_string_type
from plateau.serialization._parquet import PARQUET_VERSION
from plateau.serialization._util import schema_metadata_bytes_to_object

_logger = logging.getLogger()

@@ -28,6 +30,8 @@
"normalize_column_order",
)

PYARROW_LT_13 = version.parse(pa.__version__) < version.parse("13")


class SchemaWrapper:
"""Wrapper object for pyarrow.Schema to handle forwards and backwards
@@ -736,7 +740,9 @@ def _dict_to_binary(dct):
return simplejson.dumps(dct, sort_keys=True).encode("utf8")


def empty_dataframe_from_schema(schema, columns=None, date_as_object=False):
def empty_dataframe_from_schema(
schema, columns=None, date_as_object=False, coerce_temporal_nanoseconds=True
):
"""Create an empty DataFrame from provided schema.

Parameters
@@ -746,14 +752,26 @@ def empty_dataframe_from_schema(schema, columns=None, date_as_object=False):
columns: Union[None, List[str]]
Optional list of columns that should be part of the resulting DataFrame. All columns in that list must also be
part of the provided schema.
date_as_object: bool
Cast dates to objects.
coerce_temporal_nanoseconds: bool
Coerce date32, date64, duration and timestamp units to nanoseconds to retain behaviour of pandas 1.x.
Only applicable to pandas version >= 2.0 and PyArrow version >= 13.0.0.

Returns
-------
DataFrame
Empty DataFrame with requested columns and types.
"""

df = schema.internal().empty_table().to_pandas(date_as_object=date_as_object)
# HACK: Cast bytes to object in metadata until Pandas bug is fixed: https://github.com/pandas-dev/pandas/issues/50127
schema = schema_metadata_bytes_to_object(schema.internal())

# Prior to pyarrow 13.0.0 coerce_temporal_nanoseconds didn't exist
# as it was introduced for backwards compatibility with pandas 1.x
_coerce = {}
if not PYARROW_LT_13:
_coerce["coerce_temporal_nanoseconds"] = coerce_temporal_nanoseconds
df = schema.empty_table().to_pandas(date_as_object=date_as_object, **_coerce)

df.columns = df.columns.map(ensure_string_type)
if columns is not None:
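The pattern in this hunk is to build the optional keyword arguments in a dict and pass `coerce_temporal_nanoseconds` only when the installed pyarrow accepts it. A minimal standalone sketch of that gating is below. The helper `to_pandas_kwargs` is hypothetical, and it takes the version as a string so it runs without pyarrow installed; the PR itself compares `pa.__version__` using `packaging.version`.

```python
def to_pandas_kwargs(
    pyarrow_version: str,
    date_as_object: bool = False,
    coerce_temporal_nanoseconds: bool = True,
) -> dict:
    """Build keyword arguments for a `to_pandas` call, dropping unsupported ones."""
    kwargs = {"date_as_object": date_as_object}
    # Compare only the major version; this is enough for a ">= 13" check.
    major = int(pyarrow_version.split(".")[0])
    if major >= 13:
        kwargs["coerce_temporal_nanoseconds"] = coerce_temporal_nanoseconds
    return kwargs


if __name__ == "__main__":
    print(to_pandas_kwargs("12.0.0"))
    print(to_pandas_kwargs("13.0.0"))
```

With pyarrow available, the result would be used as `table.to_pandas(**to_pandas_kwargs(pa.__version__))`. Older pyarrow versions never see the unknown argument.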
27 changes: 23 additions & 4 deletions plateau/core/index.py
@@ -6,6 +6,7 @@
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from packaging import version
from toolz.itertoolz import partition_all

import plateau.core._time
@@ -37,6 +38,8 @@
"PartitionIndex",
)

PYARROW_LT_13 = version.parse(pa.__version__) < version.parse("13")


class IndexBase(CopyMixin):
"""Initialize an IndexBase.
@@ -136,11 +139,21 @@ def __repr__(self) -> str:
class_=type(self).__name__, attrs=", ".join(repr_str)
)

def observed_values(self, date_as_object=True) -> np.ndarray:
def observed_values(
self, date_as_object=True, coerce_temporal_nanoseconds=True
) -> np.ndarray:
"""Return an array of all observed values."""
keys = np.array(list(self.index_dct.keys()))
labeled_array = pa.array(keys, type=self.dtype)
return np.array(labeled_array.to_pandas(date_as_object=date_as_object))

# Prior to pyarrow 13.0.0 coerce_temporal_nanoseconds didn't exist
# as it was introduced for backwards compatibility with pandas 1.x
_coerce = {}
if not PYARROW_LT_13:
_coerce["coerce_temporal_nanoseconds"] = coerce_temporal_nanoseconds
return np.array(
labeled_array.to_pandas(date_as_object=date_as_object, **_coerce)
)

@staticmethod
def normalize_value(dtype: pa.DataType, value: Any) -> Any:
@@ -476,7 +489,10 @@ def as_flat_series(
table = _index_dct_to_table(
self.index_dct, column=self.column, dtype=self.dtype
)
df = table.to_pandas(date_as_object=date_as_object)
# Prior to pyarrow 13.0.0 coerce_temporal_nanoseconds didn't exist
# as it was introduced for backwards compatibility with pandas 1.x
_coerce = {} if PYARROW_LT_13 else {"coerce_temporal_nanoseconds": True}
df = table.to_pandas(date_as_object=date_as_object, **_coerce)

if predicates is not None:
# If there is a conjunction without any reference to the index
@@ -862,7 +878,10 @@ def _parquet_bytes_to_dict(column: str, index_buffer: bytes):
if column_type == pa.timestamp("us"):
column_type = pa.timestamp("ns")

df = table.to_pandas()
# Prior to pyarrow 13.0.0 coerce_temporal_nanoseconds didn't exist
# as it was introduced for backwards compatibility with pandas 1.x
_coerce = {} if PYARROW_LT_13 else {"coerce_temporal_nanoseconds": True}
df = table.to_pandas(**_coerce)

index_dct = dict(
zip(df[column].values, (list(x) for x in df[_PARTITION_COLUMN_NAME].values))
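The index dictionary built here takes its keys straight from the converted column, so whatever unit `to_pandas` produces for timestamps ends up in the keys. The construction itself can be sketched with plain lists. The `values` and `partitions` variables below are illustrative stand-ins for `df[column].values` and `df[_PARTITION_COLUMN_NAME].values`, not real plateau data.

```python
# Stand-ins for the two DataFrame columns read from the index parquet file.
values = ["2023-09-14", "2023-09-18"]
partitions = [("part_1", "part_2"), ("part_3",)]

# Map each observed value to the list of partition labels that contain it.
index_dct = dict(zip(values, (list(x) for x in partitions)))

if __name__ == "__main__":
    for value, labels in index_dct.items():
        print(value, "->", labels)
```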
11 changes: 7 additions & 4 deletions plateau/io/dask/compression.py
@@ -2,6 +2,7 @@
from functools import partial
from typing import List, Union

import dask
import dask.dataframe as dd
import pandas as pd

@@ -109,7 +110,8 @@ def pack_payload(df: dd.DataFrame, group_key: Union[List[str], str]) -> dd.DataF

_pack_payload = partial(pack_payload_pandas, group_key=group_key)

return df.map_partitions(_pack_payload, meta=packed_meta)
with dask.config.set({"dataframe.convert-string": False}):
return df.map_partitions(_pack_payload, meta=packed_meta)


def unpack_payload_pandas(
@@ -154,6 +156,7 @@ def unpack_payload(df: dd.DataFrame, unpack_meta: pd.DataFrame) -> dd.DataFrame:
)
return df

return df.map_partitions(
unpack_payload_pandas, unpack_meta=unpack_meta, meta=unpack_meta
)
with dask.config.set({"dataframe.convert-string": False}):
return df.map_partitions(
unpack_payload_pandas, unpack_meta=unpack_meta, meta=unpack_meta
)
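Both changes in this file wrap `map_partitions` in `dask.config.set` used as a context manager, so the override applies only while the graph is being built and is undone afterwards. A small sketch of that scoping, assuming dask is installed, is below. It uses only `dask.config.get` and `dask.config.set`; the `default=None` fallback covers dask versions where the key is not defined.

```python
import dask

KEY = "dataframe.convert-string"

# Value before entering the block (None if this dask version lacks the key).
before = dask.config.get(KEY, default=None)

with dask.config.set({KEY: False}):
    # Inside the block the override is active.
    inside = dask.config.get(KEY)

# After the block the previous state is restored.
after = dask.config.get(KEY, default=None)

if __name__ == "__main__":
    print("before:", before, "inside:", inside, "after:", after)
```

Scoping the override this way leaves the user's global dask configuration untouched.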