[DO NOT MERGE] cudf-polars chunked parquet reader #16789

Closed
46 commits
7742b8b
Avoid GPU initialisation during import
wence- Jul 26, 2024
ef0b49f
Require polars >= 1.3
wence- Jul 29, 2024
e9fd96d
Adapt to IR changes
wence- Jul 25, 2024
9d69621
Use new IR versioning to report if we don't support an IR version
wence- Jul 26, 2024
918a40e
Use new GPUEngine config object to set things up
wence- Jul 22, 2024
f8f2d0d
Plausibly provide useful error message if driver is too old
wence- Jul 26, 2024
bcedb6b
Update overview docs
wence- Jul 24, 2024
6f2d406
Support right join
wence- Jul 29, 2024
f3bbd3f
Test that invalid GPUEngine config raises
wence- Jul 29, 2024
1d4c30c
More coverage for gpuengine config
wence- Jul 29, 2024
abcf22b
Versioned handling of PythonScan translation
wence- Jul 30, 2024
62a5dbd
Merge pull request #16347 from wence-/wence/fea/polars-engine-config
wence- Aug 2, 2024
7d0c7ad
Adapt to IR changes in polars 1.4 (#16494)
lithomas1 Aug 5, 2024
5de29b3
Implement polars string Replace and ReplaceMany (#16039)
lithomas1 Aug 6, 2024
7f6b00f
Use a key column rather than a placeholder for count agg
wence- Aug 19, 2024
822e7d0
Backport: Remove cuDF dependency from pylibcudf column from_device te…
lithomas1 Aug 20, 2024
152111b
Implement scan-based whole-frame aggregations for cudf-polars (#16509)
lithomas1 Aug 20, 2024
13a1493
Merge pull request #16599 from wence/fix/remove-placeholder-column
wence- Aug 21, 2024
7cf3289
Implement order preserving groupby in cudf-polars (#16555)
lithomas1 Aug 22, 2024
f6c938f
Fix integer overflow in indexalator pointer logic
davidwendt Aug 22, 2024
4ded370
use std::ptrdiff_t
davidwendt Aug 23, 2024
edabb67
Correctly export empty column names in DataFrame.to_polars (#16596)
wence- Aug 27, 2024
a4c35e9
Forward-merge 24.08
wence- Aug 27, 2024
0a95b2c
Add more `cudf-polars` unaryops (#16579)
brandon-b-miller Aug 27, 2024
cc892fc
Merge pull request #16667 from wence-/wence/merge-2408
wence- Aug 27, 2024
41a3a95
Add `pylibcudf`/`cudf-polars` string `strip` (#16504)
brandon-b-miller Aug 27, 2024
0bf68d4
`cudf-polars`/`pylibcudf` string -> date parsing (#16306)
brandon-b-miller Aug 28, 2024
40d33cb
Support quantile in cudf_polars (#16093)
lithomas1 Aug 29, 2024
95da2c5
Implement handlers for first/last in groupby (#16688)
wence- Aug 30, 2024
434afab
Ensure IR validation always checks for empty columns
wence- Aug 30, 2024
385ae98
Need to check for nulls in nested dtypes
wence- Aug 30, 2024
1cf1146
Add test reading nested Null column
wence- Aug 30, 2024
de445a3
Move creation of regex program to initialisation
wence- Aug 30, 2024
f39713e
Merge pull request #16703 from wence-/wence/fea/polars-reject-invalid…
wence- Aug 30, 2024
ad364c6
Include failing node in error message
wence- Aug 30, 2024
d158b22
Merge pull request #16702 from wence-/wence/fea/polars-no-empty-columns
wence- Sep 2, 2024
b550645
Partially reject dynamic groupby (#16720)
wence- Sep 3, 2024
eb2a23e
Implement Kleene logic handling for Any/All and bitwise Or/And (#16476)
wence- Sep 4, 2024
ebc3bbe
Some fixes for unary functions (#16719)
wence- Sep 4, 2024
5d262df
Implement unpivot in cudf-polars (#16689)
wence- Sep 4, 2024
c76e90b
Small scan-handler fixes (#16721)
wence- Sep 4, 2024
ccb8061
Implement cudf-polars datetime extraction methods (#16500)
lithomas1 Sep 5, 2024
feb2e63
Polars 1.7 will change a minor thing in the IR, adapt to that (#16755)
wence- Sep 6, 2024
6d2e455
Run polars test suite (defaulting to GPU) in CI (#16710)
wence- Sep 6, 2024
24f9516
access and config chunked parquet reader
brandon-b-miller Sep 10, 2024
4bbbdc2
do not early return df
brandon-b-miller Sep 16, 2024
12 changes: 12 additions & 0 deletions .github/workflows/pr.yaml
@@ -27,6 +27,7 @@ jobs:
- wheel-tests-cudf
- wheel-build-cudf-polars
- wheel-tests-cudf-polars
- cudf-polars-polars-tests
- wheel-build-dask-cudf
- wheel-tests-dask-cudf
- devcontainer
@@ -154,6 +155,17 @@ jobs:
# This always runs, but only fails if this PR touches code in
# pylibcudf or cudf_polars
script: "ci/test_wheel_cudf_polars.sh"
cudf-polars-polars-tests:
needs: wheel-build-cudf-polars
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
# This selects "ARCH=amd64 + the latest supported Python + CUDA".
matrix_filter: map(select(.ARCH == "amd64")) | group_by(.CUDA_VER|split(".")|map(tonumber)|.[0]) | map(max_by([(.PY_VER|split(".")|map(tonumber)), (.CUDA_VER|split(".")|map(tonumber))]))
build_type: pull-request
# This always runs, but only fails if this PR touches code in
# pylibcudf or cudf_polars
script: "ci/test_cudf_polars_polars_tests.sh"
wheel-build-dask-cudf:
needs: wheel-build-cudf
secrets: inherit
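For readers unfamiliar with jq, the `matrix_filter` expression in the new job can be approximated in Python. This is a rough sketch, not the shared-workflows implementation: it assumes each matrix entry is a mapping with string-valued `ARCH`, `PY_VER`, and `CUDA_VER` keys, and the sample entries are invented for illustration. Note that the filter keeps one entry per CUDA major version rather than a single entry overall.

```python
from itertools import groupby

# Hypothetical matrix entries, made up for this example.
SAMPLE_MATRIX = [
    {"ARCH": "amd64", "PY_VER": "3.10", "CUDA_VER": "11.8.0"},
    {"ARCH": "amd64", "PY_VER": "3.11", "CUDA_VER": "11.8.0"},
    {"ARCH": "amd64", "PY_VER": "3.10", "CUDA_VER": "12.5.1"},
    {"ARCH": "amd64", "PY_VER": "3.11", "CUDA_VER": "12.2.2"},
    {"ARCH": "arm64", "PY_VER": "3.12", "CUDA_VER": "12.5.1"},
]


def version_key(version):
    """Turn "3.10" into [3, 10] so that versions compare numerically."""
    return [int(part) for part in version.split(".")]


def cuda_major(entry):
    return version_key(entry["CUDA_VER"])[0]


def filter_matrix(entries):
    # map(select(.ARCH == "amd64"))
    amd64 = [entry for entry in entries if entry["ARCH"] == "amd64"]
    # group_by(CUDA major version); groupby needs sorted input, and jq's
    # group_by also returns its groups sorted by key.
    amd64.sort(key=cuda_major)
    selected = []
    for _, group in groupby(amd64, key=cuda_major):
        # max_by([PY_VER, CUDA_VER]): newest Python first, then newest CUDA.
        best = max(
            group,
            key=lambda e: (version_key(e["PY_VER"]), version_key(e["CUDA_VER"])),
        )
        selected.append(best)
    return selected


if __name__ == "__main__":
    for entry in filter_matrix(SAMPLE_MATRIX):
        print(entry)
```

Splitting the version strings into integers matters here: compared as plain strings, "3.9" would sort after "3.10".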
27 changes: 27 additions & 0 deletions ci/run_cudf_polars_polars_tests.sh
@@ -0,0 +1,27 @@
#!/bin/bash
# Copyright (c) 2024, NVIDIA CORPORATION.

set -euo pipefail

# Support invoking run_cudf_polars_polars_tests.sh outside the script directory
# Assumption: polars has been cloned in the root of the repo.
cd "$(dirname "$(realpath "${BASH_SOURCE[0]}")")"/../polars/

DESELECTED_TESTS=(
"tests/unit/test_polars_import.py::test_polars_import" # relies on a polars built in place
"tests/unit/streaming/test_streaming_sort.py::test_streaming_sort[True]" # relies on polars built in debug mode
"tests/unit/test_cpu_check.py::test_check_cpu_flags_skipped_no_flags" # Mock library error
"tests/docs/test_user_guide.py" # No dot binary in CI image
)

DESELECTED_TESTS=$(printf -- " --deselect %s" "${DESELECTED_TESTS[@]}")
python -m pytest \
--import-mode=importlib \
--cache-clear \
-m "" \
-p cudf_polars.testing.plugin \
-v \
--tb=short \
${DESELECTED_TESTS} \
"$@" \
py-polars/tests
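The script above joins the deselected test IDs into one string and relies on shell word splitting to turn them back into arguments. As an illustration of the same argument construction without that step, here is a minimal Python sketch. The function name `build_pytest_args` is hypothetical and the test list is shortened; the flags are copied from the script.

```python
import shlex

# Shortened copy of the deselection list from the script.
DESELECTED_TESTS = [
    "tests/unit/test_polars_import.py::test_polars_import",
    "tests/unit/streaming/test_streaming_sort.py::test_streaming_sort[True]",
    "tests/docs/test_user_guide.py",
]


def build_pytest_args(deselected, extra_args=()):
    """Build the pytest argument list, one list element per argument."""
    args = [
        "--import-mode=importlib",
        "--cache-clear",
        "-m", "",
        "-p", "cudf_polars.testing.plugin",
        "-v",
        "--tb=short",
    ]
    for test_id in deselected:
        # Each ID stays a single element, so characters such as "[" and "]"
        # are never subject to word splitting or globbing.
        args += ["--deselect", test_id]
    args += list(extra_args)
    args.append("py-polars/tests")
    return args


if __name__ == "__main__":
    # Print a shell-quoted form of the command for inspection.
    print(shlex.join(["python", "-m", "pytest", *build_pytest_args(DESELECTED_TESTS)]))
```

The resulting list could be passed to `subprocess.run` together with the interpreter and `-m pytest`; that call is left out so the sketch runs without a polars checkout.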
68 changes: 68 additions & 0 deletions ci/test_cudf_polars_polars_tests.sh
@@ -0,0 +1,68 @@
#!/bin/bash
# Copyright (c) 2024, NVIDIA CORPORATION.

set -eou pipefail

# We will only fail these tests if the PR touches code in pylibcudf
# or cudf_polars itself.
# Note, the three dots mean we are doing diff between the merge-base
# of upstream and HEAD. So this is asking, "does _this branch_ touch
# files in cudf_polars/pylibcudf", rather than "are there changes
# between upstream and this branch which touch cudf_polars/pylibcudf"
# TODO: is the target branch exposed anywhere in an environment variable?
if [ -n "$(git diff --name-only origin/branch-24.08...HEAD -- python/cudf_polars/ python/cudf/cudf/_lib/pylibcudf/)" ];
then
HAS_CHANGES=1
rapids-logger "PR has changes in cudf-polars/pylibcudf, test fails treated as failure"
else
HAS_CHANGES=0
rapids-logger "PR does not have changes in cudf-polars/pylibcudf, test fails NOT treated as failure"
fi

rapids-logger "Download wheels"

RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"
RAPIDS_PY_WHEEL_NAME="cudf_polars_${RAPIDS_PY_CUDA_SUFFIX}" RAPIDS_PY_WHEEL_PURE="1" rapids-download-wheels-from-s3 ./dist

# Download the cudf built in the previous step
RAPIDS_PY_WHEEL_NAME="cudf_${RAPIDS_PY_CUDA_SUFFIX}" rapids-download-wheels-from-s3 ./local-cudf-dep

rapids-logger "Install cudf"
python -m pip install ./local-cudf-dep/cudf*.whl

rapids-logger "Install cudf_polars"
python -m pip install $(echo ./dist/cudf_polars*.whl)

TAG=$(python -c 'import polars; print(f"py-{polars.__version__}")')
rapids-logger "Clone polars to ${TAG}"
git clone https://github.com/pola-rs/polars.git --branch ${TAG} --depth 1

# Install requirements for running polars tests
rapids-logger "Install polars test requirements"
python -m pip install -r polars/py-polars/requirements-dev.txt -r polars/py-polars/requirements-ci.txt

function set_exitcode()
{
EXITCODE=$?
}
EXITCODE=0
trap set_exitcode ERR
set +e

rapids-logger "Run polars tests"
./ci/run_cudf_polars_polars_tests.sh

trap ERR
set -e

if [ ${EXITCODE} != 0 ]; then
rapids-logger "Running polars test suite FAILED: exitcode ${EXITCODE}"
else
rapids-logger "Running polars test suite PASSED"
fi

if [ ${HAS_CHANGES} == 1 ]; then
exit ${EXITCODE}
else
exit 0
fi
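The gating logic in this script (always run the tests, but only propagate a failure when the PR touches cudf-polars or pylibcudf) can be summarised in a few lines of Python. This is a sketch under assumptions: `changed_files` stands in for the output of `git diff --name-only origin/branch-24.08...HEAD`, and the function names are hypothetical.

```python
# Path prefixes taken from the git diff invocation in the script.
WATCHED_PREFIXES = (
    "python/cudf_polars/",
    "python/cudf/cudf/_lib/pylibcudf/",
)


def has_relevant_changes(changed_files):
    """True if any changed path lies under one of the watched directories."""
    # str.startswith accepts a tuple of prefixes.
    return any(path.startswith(WATCHED_PREFIXES) for path in changed_files)


def final_exit_code(changed_files, test_exit_code):
    """Propagate the test result only when the PR touches watched code."""
    if has_relevant_changes(changed_files):
        return test_exit_code
    return 0
```

With this shape, a failing polars test run on a PR that only changes, say, C++ sources is logged but does not fail the job.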
6 changes: 6 additions & 0 deletions ci/test_wheel_cudf_polars.sh
@@ -13,15 +13,21 @@ set -eou pipefail
if [ -n "$(git diff --name-only origin/branch-24.08...HEAD -- python/cudf_polars/ python/cudf/cudf/_lib/pylibcudf/)" ];
then
HAS_CHANGES=1
rapids-logger "PR has changes in cudf-polars/pylibcudf, test fails treated as failure"
else
HAS_CHANGES=0
rapids-logger "PR does not have changes in cudf-polars/pylibcudf, test fails NOT treated as failure"
fi

rapids-logger "Download wheels"

RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"
RAPIDS_PY_WHEEL_NAME="cudf_polars_${RAPIDS_PY_CUDA_SUFFIX}" RAPIDS_PY_WHEEL_PURE="1" rapids-download-wheels-from-s3 ./dist

# Download the cudf built in the previous step
RAPIDS_PY_WHEEL_NAME="cudf_${RAPIDS_PY_CUDA_SUFFIX}" rapids-download-wheels-from-s3 ./local-cudf-dep

rapids-logger "Install cudf"
python -m pip install ./local-cudf-dep/cudf*.whl

rapids-logger "Install cudf_polars"
6 changes: 3 additions & 3 deletions cpp/include/cudf/detail/indexalator.cuh
@@ -93,7 +93,7 @@ struct input_indexalator : base_normalator<input_indexalator, cudf::size_type> {
*/
__device__ inline cudf::size_type operator[](size_type idx) const
{
-void const* tp = p_ + (idx * this->width_);
+void const* tp = p_ + (static_cast<std::ptrdiff_t>(idx) * this->width_);
return type_dispatcher(this->dtype_, normalize_type{}, tp);
}

@@ -109,7 +109,7 @@ struct input_indexalator : base_normalator<input_indexalator, cudf::size_type> {
CUDF_HOST_DEVICE input_indexalator(void const* data, data_type dtype, cudf::size_type offset = 0)
: base_normalator<input_indexalator, cudf::size_type>(dtype), p_{static_cast<char const*>(data)}
{
-p_ += offset * this->width_;
+p_ += static_cast<std::ptrdiff_t>(offset) * this->width_;
}

protected:
@@ -165,7 +165,7 @@ struct output_indexalator : base_normalator<output_indexalator, cudf::size_type>
__device__ inline output_indexalator const operator[](size_type idx) const
{
output_indexalator tmp{*this};
-tmp.p_ += (idx * this->width_);
+tmp.p_ += static_cast<std::ptrdiff_t>(idx) * this->width_;
return tmp;
}

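To see why the cast to `std::ptrdiff_t` matters, consider the byte offset `idx * width_` when both operands are 32-bit integers. The Python sketch below emulates two's-complement wraparound to show the kind of value a 32-bit multiply can produce. It rests on assumptions: that `width_` is a 32-bit integer like `cudf::size_type`, and that the overflow wraps. Signed overflow is formally undefined behaviour in C++, so wrapping is only the commonly observed outcome. The index and width values are made up.

```python
def wrap_int32(value):
    """Reduce an arbitrary Python int to the signed 32-bit range by wrapping."""
    return (value + 2**31) % 2**32 - 2**31


def byte_offset_32bit(idx, width):
    # Emulates multiplying two 32-bit ints and keeping a 32-bit result.
    return wrap_int32(idx * width)


def byte_offset_wide(idx, width):
    # Emulates widening idx first (as static_cast<std::ptrdiff_t> does),
    # so the product is computed in 64 bits. Python ints never overflow.
    return idx * width


if __name__ == "__main__":
    idx, width = 300_000_000, 8  # hypothetical row index and element width
    print("32-bit product:", byte_offset_32bit(idx, width))
    print("widened product:", byte_offset_wide(idx, width))
```

Any index for which `idx * width` exceeds 2**31 - 1 is affected, which for 8-byte elements is any index of 2**28 or above.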
2 changes: 1 addition & 1 deletion dependencies.yaml
@@ -631,7 +631,7 @@ dependencies:
common:
- output_types: [conda, requirements, pyproject]
packages:
-  - polars>=1.0,<1.3
+  - polars>=1.6
run_dask_cudf:
common:
- output_types: [conda, requirements, pyproject]
Original file line number Diff line number Diff line change
@@ -7,3 +7,4 @@ strings
contains
replace
slice
strip
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
=====
strip
=====

.. automodule:: cudf._lib.pylibcudf.strings.strip
:members:
42 changes: 5 additions & 37 deletions python/cudf/cudf/_lib/datetime.pyx
@@ -16,6 +16,8 @@ from cudf._lib.pylibcudf.libcudf.scalar.scalar cimport scalar
from cudf._lib.pylibcudf.libcudf.types cimport size_type
from cudf._lib.scalar cimport DeviceScalar

import cudf._lib.pylibcudf as plc


@acquire_spill_lock()
def add_months(Column col, Column months):
@@ -37,43 +39,9 @@ def add_months(Column col, Column months):

@acquire_spill_lock()
def extract_datetime_component(Column col, object field):

-    cdef unique_ptr[column] c_result
-    cdef column_view col_view = col.view()
-
-    with nogil:
-        if field == "year":
-            c_result = move(libcudf_datetime.extract_year(col_view))
-        elif field == "month":
-            c_result = move(libcudf_datetime.extract_month(col_view))
-        elif field == "day":
-            c_result = move(libcudf_datetime.extract_day(col_view))
-        elif field == "weekday":
-            c_result = move(libcudf_datetime.extract_weekday(col_view))
-        elif field == "hour":
-            c_result = move(libcudf_datetime.extract_hour(col_view))
-        elif field == "minute":
-            c_result = move(libcudf_datetime.extract_minute(col_view))
-        elif field == "second":
-            c_result = move(libcudf_datetime.extract_second(col_view))
-        elif field == "millisecond":
-            c_result = move(
-                libcudf_datetime.extract_millisecond_fraction(col_view)
-            )
-        elif field == "microsecond":
-            c_result = move(
-                libcudf_datetime.extract_microsecond_fraction(col_view)
-            )
-        elif field == "nanosecond":
-            c_result = move(
-                libcudf_datetime.extract_nanosecond_fraction(col_view)
-            )
-        elif field == "day_of_year":
-            c_result = move(libcudf_datetime.day_of_year(col_view))
-        else:
-            raise ValueError(f"Invalid datetime field: '{field}'")
-
-    result = Column.from_unique_ptr(move(c_result))
+    result = Column.from_pylibcudf(
+        plc.datetime.extract_datetime_component(col.to_pylibcudf(mode="read"), field)
+    )

if field == "weekday":
# Pandas counts Monday-Sunday as 0-6
9 changes: 4 additions & 5 deletions python/cudf/cudf/_lib/pylibcudf/column.pyx
@@ -15,13 +15,11 @@ from cudf._lib.pylibcudf.libcudf.types cimport size_type

from .gpumemoryview cimport gpumemoryview
from .scalar cimport Scalar
-from .types cimport DataType, type_id
+from .types cimport DataType, size_of, type_id
from .utils cimport int_to_bitmask_ptr, int_to_void_ptr

import functools

-import numpy as np


cdef class Column:
"""A container of nullable device data as a column of elements.
@@ -303,14 +301,15 @@ cdef class Column:
raise ValueError("mask not yet supported.")

typestr = iface['typestr'][1:]
+data_type = _datatype_from_dtype_desc(typestr)
+
if not is_c_contiguous(
iface['shape'],
iface['strides'],
-np.dtype(typestr).itemsize
+size_of(data_type)
):
raise ValueError("Data must be C-contiguous")

-data_type = _datatype_from_dtype_desc(typestr)
size = iface['shape'][0]
return Column(
data_type,
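The change above swaps `np.dtype(typestr).itemsize` for `size_of(data_type)`, so the contiguity check only needs an item size in bytes and no longer depends on NumPy. For context, a C-contiguity test over a shape, byte strides, and an item size can look like the sketch below. It is an illustration only, not the `is_c_contiguous` helper used in pylibcudf, and the name `is_c_contiguous_sketch` is hypothetical.

```python
from typing import Optional, Sequence


def is_c_contiguous_sketch(
    shape: Sequence[int],
    strides: Optional[Sequence[int]],
    itemsize: int,
) -> bool:
    """Return True if the byte strides describe a row-major, gap-free layout."""
    # In the CUDA Array Interface, strides of None mean C-contiguous.
    if strides is None:
        return True
    # An array with a zero-length dimension holds no data, so any strides do.
    if any(dim == 0 for dim in shape):
        return True
    expected = itemsize
    # Walk from the innermost dimension outwards.
    for dim, stride in zip(reversed(shape), reversed(strides)):
        # A dimension of length 1 is never stepped over, so its stride
        # does not affect the layout.
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True
```

For a one-dimensional column of 8-byte values, this accepts a stride of 8 (or `None`) and rejects a stride of 16, which would describe every other element of a larger buffer.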
49 changes: 49 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/datetime.pyx
@@ -4,6 +4,16 @@ from libcpp.utility cimport move

from cudf._lib.pylibcudf.libcudf.column.column cimport column
from cudf._lib.pylibcudf.libcudf.datetime cimport (
day_of_year as cpp_day_of_year,
extract_day as cpp_extract_day,
extract_hour as cpp_extract_hour,
extract_microsecond_fraction as cpp_extract_microsecond_fraction,
extract_millisecond_fraction as cpp_extract_millisecond_fraction,
extract_minute as cpp_extract_minute,
extract_month as cpp_extract_month,
extract_nanosecond_fraction as cpp_extract_nanosecond_fraction,
extract_second as cpp_extract_second,
extract_weekday as cpp_extract_weekday,
extract_year as cpp_extract_year,
)

@@ -31,3 +41,42 @@ cpdef Column extract_year(
with nogil:
result = move(cpp_extract_year(values.view()))
return Column.from_libcudf(move(result))


def extract_datetime_component(Column col, str field):

    cdef unique_ptr[column] c_result

    with nogil:
        if field == "year":
            c_result = move(cpp_extract_year(col.view()))
        elif field == "month":
            c_result = move(cpp_extract_month(col.view()))
        elif field == "day":
            c_result = move(cpp_extract_day(col.view()))
        elif field == "weekday":
            c_result = move(cpp_extract_weekday(col.view()))
        elif field == "hour":
            c_result = move(cpp_extract_hour(col.view()))
        elif field == "minute":
            c_result = move(cpp_extract_minute(col.view()))
        elif field == "second":
            c_result = move(cpp_extract_second(col.view()))
        elif field == "millisecond":
            c_result = move(
                cpp_extract_millisecond_fraction(col.view())
            )
        elif field == "microsecond":
            c_result = move(
                cpp_extract_microsecond_fraction(col.view())
            )
        elif field == "nanosecond":
            c_result = move(
                cpp_extract_nanosecond_fraction(col.view())
            )
        elif field == "day_of_year":
            c_result = move(cpp_day_of_year(col.view()))
        else:
            raise ValueError(f"Invalid datetime field: '{field}'")

    return Column.from_libcudf(move(c_result))
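The function above dispatches on a field name and raises `ValueError` for anything it does not recognise. The same contract can be illustrated in plain Python with a lookup table over standard-library `datetime` objects. This is an analogy for the dispatch pattern only: it does not use pylibcudf, it covers just a subset of the fields, and the names `_EXTRACTORS` and `extract_component` are hypothetical.

```python
from datetime import datetime

# Map each supported field name to a function that extracts it.
_EXTRACTORS = {
    "year": lambda ts: ts.year,
    "month": lambda ts: ts.month,
    "day": lambda ts: ts.day,
    "hour": lambda ts: ts.hour,
    "minute": lambda ts: ts.minute,
    "second": lambda ts: ts.second,
    "day_of_year": lambda ts: ts.timetuple().tm_yday,
}


def extract_component(ts: datetime, field: str) -> int:
    """Return one component of ts, mirroring the error message used above."""
    try:
        extractor = _EXTRACTORS[field]
    except KeyError:
        raise ValueError(f"Invalid datetime field: '{field}'") from None
    return extractor(ts)
```

A table keeps the set of supported fields in one place, at the cost of an indirect call; whether that suits the Cython code, where the branches run under `nogil`, is a separate question.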
Original file line number Diff line number Diff line change
@@ -12,7 +12,7 @@
# the License.
# =============================================================================

-set(cython_sources char_types.pyx regex_flags.pyx)
+set(cython_sources char_types.pyx regex_flags.pyx side_type.pyx)

set(linked_libraries cudf::cudf)

Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
-# Copyright (c) 2022, NVIDIA CORPORATION.
+# Copyright (c) 2022-2024, NVIDIA CORPORATION.
from libc.stdint cimport int32_t


cdef extern from "cudf/strings/side_type.hpp" namespace "cudf::strings" nogil:

-ctypedef enum side_type:
+cpdef enum class side_type(int32_t):
LEFT 'cudf::strings::side_type::LEFT'
RIGHT 'cudf::strings::side_type::RIGHT'
BOTH 'cudf::strings::side_type::BOTH'
Empty file.
2 changes: 2 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/libcudf/types.pxd
@@ -98,3 +98,5 @@ cdef extern from "cudf/types.hpp" namespace "cudf" nogil:
HIGHER
MIDPOINT
NEAREST

cdef size_type size_of(data_type t) except +
4 changes: 3 additions & 1 deletion python/cudf/cudf/_lib/pylibcudf/strings/CMakeLists.txt
@@ -13,7 +13,7 @@
# =============================================================================

set(cython_sources capitalize.pyx case.pyx char_types.pyx contains.pyx find.pyx regex_flags.pyx
-regex_program.pyx replace.pyx slice.pyx
+regex_program.pyx replace.pyx side_type.pyx slice.pyx strip.pyx
)

set(linked_libraries cudf::cudf)
@@ -22,3 +22,5 @@ rapids_cython_create_modules(
SOURCE_FILES "${cython_sources}"
LINKED_LIBRARIES "${linked_libraries}" MODULE_PREFIX pylibcudf_strings_ ASSOCIATED_TARGETS cudf
)

add_subdirectory(convert)
3 changes: 3 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/strings/__init__.pxd
@@ -5,9 +5,12 @@ from . cimport (
case,
char_types,
contains,
convert,
find,
regex_flags,
regex_program,
replace,
slice,
strip,
)
from .side_type cimport side_type