v23.06.00
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
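
Several of the breaking changes above (#13372, #13341, #13311, #13205) remove the UNKNOWN_NULL_COUNT sentinel, which suggests that a concrete null count now has to be supplied or computed. As a rough illustration of what that count represents, the pure-Python sketch below counts nulls from an Arrow-style validity bitmask (bit set means the row is valid, least-significant bit first). The `null_count` helper and the byte-oriented layout are assumptions made for illustration; this is not the libcudf API.

```python
def null_count(validity: bytes, size: int) -> int:
    """Count null rows described by an Arrow-style validity bitmask.

    Hypothetical helper for illustration only (not a libcudf function).
    Bit i, counted least-significant bit first within each byte, is 1
    when row i is valid and 0 when it is null.
    """
    if size < 0:
        raise ValueError("size must be non-negative")
    if len(validity) * 8 < size:
        raise ValueError("bitmask is too short for the requested size")
    valid = 0
    for i in range(size):
        # Select the byte holding row i, then test that row's bit.
        if (validity[i // 8] >> (i % 8)) & 1:
            valid += 1
    return size - valid


# Rows 0, 2 and 3 are valid; rows 1 and 4 are null. Bits 5-7 are padding.
mask = bytes([0b00001101])
print(null_count(mask, 5))
```

Padding bits past `size` are ignored, so the result depends only on the rows that actually exist.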
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case of AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
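
The slice_strings fix above (#13178) describes returning empty strings when the stop index is smaller than the start index. Ordinary Python slicing follows the same convention, so a plain-Python sketch can show the expected result. The `slice_string` helper is hypothetical and only models non-negative indices; it is not the cuDF or libcudf API.

```python
def slice_string(s: str, start: int, stop: int) -> str:
    """Return s[start:stop]; Python yields '' whenever stop < start.

    Hypothetical helper for illustration; only non-negative indices
    are considered here.
    """
    return s[start:stop]


words = ["hello", "world", ""]
# stop (1) is smaller than start (3), so every result is an empty string.
print([slice_string(w, 3, 1) for w in words])
```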
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span creatable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
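
One of the features above adds support for the case=False argument to str.contains (#13290). As a minimal sketch of what case-insensitive matching means, the following uses only Python's `re` module. The `contains` helper and its handling of missing values (passed through as `None`) are assumptions for illustration, not the cuDF implementation.

```python
import re


def contains(values, pattern, case=True):
    """Return, per element, whether `pattern` matches anywhere in it.

    Hypothetical stand-in for a string-column contains operation.
    Missing values (None) are passed through unchanged, which is an
    assumption of this sketch.
    """
    flags = 0 if case else re.IGNORECASE
    return [
        None if v is None else re.search(pattern, v, flags) is not None
        for v in values
    ]


data = ["Apple", "banana", None, "GRAPE"]
print(contains(data, "ap"))              # case-sensitive: no element matches
print(contains(data, "ap", case=False))  # "Apple" and "GRAPE" now match
```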
🛠️ Improvements
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move `meta` calculation in `dask_cudf.read_parquet` (#13327) @rjzamora
- Changes to support Numpy >= 1.24 (#13325) @shwina
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Clean up `distinct_count` benchmark (#13321) @PointKernel
- Fix gtest pinning to 1.13.0. (#13319) @bdice
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Address feedback from 13289 (#13306) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- First check for `BaseDtype` when inferring the data type of an arbitrary object (#13295) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Support CUDA 12.0 for pip wheels (#13289) @divyegala
- Refactor `transform_lists_of_structs` in `row_operators.cu` (#13288) @ttnghia
- Branch 23.06 merge 23.04 (#13286) @vyasr
- Update cupy dependency (#13284) @vyasr
- Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
- Fix unused variables and functions (#13275) @karthikeyann
- Fix integer overflow in `partition` `scatter_map` construction (#13272) @wence-
- Numba 0.57 compatibility fixes (#13271) @gmarkall
- Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
- Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
- Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
- Build wheels using new single image workflow (#13249) @vyasr
- Enable sccache hits from local builds (#13248) @AyodeAwe
- Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
- Introduce `pandas_compatible` option in `cudf` (#13241) @galipremsagar
- Add metadata_builder helper class (#13232) @abellina
- Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
- Add chunked reader benchmark (#13223) @SrikarVanavasam
- Set the null count in output columns in the CSV reader (#13221) @vuule
- Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
- Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
- Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
- Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Optimization to decoding of parquet level streams (#13203) @nvdbaranec
- Clean up and simplify `gpuDecideCompression` (#13202) @vuule
- Use std::array for a statically sized vector in `create_serialized_trie` (#13201) @vuule
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
- Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
- Split up unique_count.cu to improve build time (#13169) @davidwendt
- Use nvtx3 includes in string examples. (#13165) @bdice
- Change some .cu gtest files to .cpp (#13155) @davidwendt
- Remove wheel pytest verbosity (#13151) @sevagh
- Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
- Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
- Optimize JSON writer (#13144) @karthikeyann
- Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
- [REVIEW] Deprecate `pad` and `backfill` methods (#13140) @galipremsagar
- Use CTAD instead of functions in ProtobufReader (#13135) @vuule
- Remove more instances of `UNKNOWN_NULL_COUNT` (#13134) @vyasr
- Update clang-format to 16.0.1. (#13133) @bdice
- Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
- Branch 23.06 merge 23.04 (#13131) @vyasr
- Compute null-count in cudf::detail::slice (#13124) @davidwendt
- Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
- Set null-count in linked_column_view conversion operator (#13121) @davidwendt
- Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
- Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
- Remove uses-setup-env-vars (#13105) @vyasr
- Explicitly compute null count in concatenate APIs (#13104) @vyasr
- Replace unnecessary uses of `UNKNOWN_NULL_COUNT` (#13102) @vyasr
- Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
- Use `.element()` instead of `.data()` for window range calculations (#13095) @mythrocks
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
- Cleanup ORC chunked writer (#13091) @ttnghia
- Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
- Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
- Resolved automerger from `branch-23.04` to `branch-23.06` (#13080) @galipremsagar
- Assert for non-empty nulls (#13071) @razajafri
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- Refactor `cudf::detail::sorted_order` (#13062) @ttnghia
- Improve performance of slice_strings for long strings (#13057) @davidwendt
- Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
- [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
- Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
- Remove console output from some libcudf gtests (#13027) @davidwendt
- Remove underscore in build string. (#13025) @bdice
- Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
- Fix auto merger from `branch-23.04` to `branch-23.06` (#13009) @galipremsagar
- Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
- Add nvtx annotations to groupby methods (#12941) @wence-
- Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
- Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
- Optimize set-like operations (#12769) @ttnghia
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Add empty test files for test reorganization (#12288) @shwina
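
The `observed=` change listed above (#13296) concerns grouping on categorical keys. In pandas terminology, `observed=True` reports only the categories that actually occur in the data, while `observed=False` also reports categories with no rows. The pure-Python sketch below illustrates that difference for a sum aggregation. The `grouped_sum` helper is hypothetical, and the choice of 0 as the fill value for unobserved categories is an assumption of this sketch.

```python
def grouped_sum(keys, values, categories, observed=True):
    """Sum `values` per key, reporting results in category order.

    Hypothetical helper for illustration (not the cuDF groupby API).
    With observed=True only categories present in `keys` are returned;
    with observed=False every category appears, using 0 when it has
    no rows.
    """
    totals = {}
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0) + value
    if observed:
        return {c: totals[c] for c in categories if c in totals}
    return {c: totals.get(c, 0) for c in categories}


keys = ["a", "c", "a"]
values = [1, 2, 3]
categories = ["a", "b", "c"]  # "b" is a declared category with no rows

print(grouped_sum(keys, values, categories))                  # "b" is omitted
print(grouped_sum(keys, values, categories, observed=False))  # "b" appears with 0
```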