v23.12.00
🚨 Breaking Changes
- Raise error in reindex when index is not unique (#14400) @galipremsagar
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
🐛 Bug Fixes
- Update actions/labeler to v4 (#14562) @raydouglass
- Fix data corruption when skipping rows (#14557) @etseidl
- Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
- Fix intermediate type checking in expression parsing (#14445) @vyasr
- Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
- Remove needs: wheel-build-cudf. (#14427) @bdice
- Fix dask dependency in custreamz (#14420) @vyasr
- Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
- Support java AST String literal with desired encoding (#14402) @winningsix
- Raise error in reindex when index is not unique (#14400) @galipremsagar
- Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
- Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
- Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
- cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
- Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
- Add the new manylinux builds to the build job (#14351) @vyasr
- cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
- Fix overflow check in cudf::merge (#14345) @divyegala
- Add cramjam (#14344) @vyasr
- Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
- Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
- Fix host buffer access from device function in the Parquet reader (#14328) @vuule
- Run IO tests for Dask-cuDF (#14327) @rjzamora
- Fix logical type issues in the Parquet writer (#14322) @vuule
- Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
- test is_valid before reading column data (#14318) @etseidl
- Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
- Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
- Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
- fixing thread index overflow issue (#14290) @hyperbolic2346
- Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
- Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
- Handle empty string correctly in Parquet statistics (#14257) @etseidl
- Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
- cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
- Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
- Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
- Fixing parquet list of struct interpretation (#13715) @hyperbolic2346
📖 Documentation
- Fix io reference in docs. (#14452) @bdice
- Update README (#14374) @shwina
- Example code for blog on new row comparators (#13795) @divyegala
🚀 New Features
- Expose streams in public unary APIs (#14342) @vyasr
- Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
- Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
- Expose streams in public null mask APIs (#14263) @vyasr
- Expose streams in binaryop APIs (#14187) @vyasr
- Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
- Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
- Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
- Add BytePairEncoder class to cuDF (#13891) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
- Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller
🛠️ Improvements
- Build concurrency for nightly and merge triggers (#14441) @bdice
- Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
- Update to Arrow 14.0.1. (#14387) @bdice
- Remove Cython libcpp wrappers (#14382) @vyasr
- Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
- Upgrade to arrow 14 (#14371) @galipremsagar
- Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
- Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
- Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
- Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
- Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
- Added streams to CSV reader and writer api (#14340) @shrshi
- Upgrade wheels to use arrow 13 (#14339) @vyasr
- Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
- Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
- Upgrade arrow to 13 (#14330) @galipremsagar
- Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
- Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
- Avoid pyarrow.fs import for local storage (#14321) @rjzamora
- Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
- Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
- Added streams to JSON reader and writer api (#14313) @shrshi
- Minor improvements in source_info (#14308) @vuule
- Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
- Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
- Expose stream parameter in public strings filter APIs (#14293) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Update shared-action-workflows references (#14289) @AyodeAwe
- Register partd encode dispatch in dask_cudf (#14287) @rjzamora
- Update versioning strategy (#14285) @vyasr
- Move and rename byte-pair-encoding source files (#14284) @davidwendt
- Expose stream parameter in public strings combine APIs (#14281) @davidwendt
- Expose stream parameter in public strings contains APIs (#14280) @davidwendt
- Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
- Use branch-23.12 workflows. (#14271) @bdice
- Refactor LogicalType for Parquet (#14264) @etseidl
- Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
- Expose stream parameter in public strings replace APIs (#14261) @davidwendt
- Expose stream parameter in public strings APIs (#14260) @davidwendt
- Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
- Make parquet schema index type consistent (#14256) @hyperbolic2346
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Add in java bindings for DataSource (#14254) @revans2
- Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
- Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
- Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
- Improve contains_column by invoking contains_table (#14238) @PointKernel
- Detect and report errors in Parquet header parsing (#14237) @etseidl
- Normalizing offsets iterator (#14234) @davidwendt
- Forward merge 23.10 into 23.12 (#14231) @galipremsagar
- Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
- Enable indexalator for device code (#14206) @davidwendt
- Marginally reduce memory footprint of joins (#14197) @wence-
- Add nvtx annotations to spilling-based data movement (#14196) @wence-
- Optimize ORC writer for decimal columns (#14190) @vuule
- Remove the use of volatile in ORC (#14175) @vuule
- Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
- Add bytes_per_second to transpose benchmark (#14170) @Blonck
- cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
- Add bytes_per_second to shift benchmark (#13950) @Blonck
- Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia