Release [NIGHTLY] v23.06.00 · rapidsai/cudf

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
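
To illustrate the column size limit behind #13323: libcudf counts elements with `cudf::size_type`, a 32-bit signed integer, so an output that would need more than 2^31 - 1 elements cannot fit in a single column. The sketch below shows that kind of bounds check in plain Python. It is not libcudf code: `SIZE_TYPE_MAX` and `checked_output_size` are hypothetical names chosen for this example, and Python's `OverflowError` stands in for `std::overflow_error`.

```python
# Illustration only: a pure-Python stand-in for a column size check.
# SIZE_TYPE_MAX and checked_output_size are hypothetical names, not cuDF APIs.
SIZE_TYPE_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def checked_output_size(part_sizes):
    """Return the combined size of all parts, raising if it exceeds the limit."""
    total = 0
    for size in part_sizes:
        if size < 0:
            raise ValueError("sizes must be non-negative")
        total += size
        if total > SIZE_TYPE_MAX:
            # Fail early with a specific error type instead of wrapping around.
            raise OverflowError(
                f"output size {total} exceeds the column size limit {SIZE_TYPE_MAX}"
            )
    return total


if __name__ == "__main__":
    print(checked_output_size([10, 20, 30]))
    try:
        checked_output_size([SIZE_TYPE_MAX, 1])
    except OverflowError as err:
        print("caught:", err)
```

Raising a dedicated overflow error lets callers tell "the result is too large for one column" apart from other failures, for example to retry the operation on smaller batches.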

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case of AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Add hostdevice_span, a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
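
As a rough picture of what "stable" means for the `drop_duplicates` entry above (#11656): a stable distinct keeps the surviving rows in their original input order. The following is a CPU-only sketch in plain Python, not the libcudf implementation. The function name mirrors `cudf::stable_distinct` for readability only, and the `keep` values `"first"`, `"last"`, and `"none"` are assumptions chosen for this example.

```python
# Illustration only: order-preserving de-duplication on plain Python tuples.
# This is not how libcudf implements it; names and options are hypothetical.
def stable_distinct(rows, keep="first"):
    """Drop duplicate rows while preserving the input order of the survivors.

    rows: a list of hashable rows (for example, tuples).
    keep: "first" keeps the first occurrence of each row, "last" keeps the
          last occurrence, "none" drops every row that occurs more than once.
    """
    if keep not in ("first", "last", "none"):
        raise ValueError(f"unsupported keep option: {keep!r}")

    counts = {}
    first_index = {}
    last_index = {}
    for i, row in enumerate(rows):
        counts[row] = counts.get(row, 0) + 1
        first_index.setdefault(row, i)  # remember only the first position
        last_index[row] = i             # overwritten until the last position

    if keep == "first":
        chosen = first_index.values()
    elif keep == "last":
        chosen = last_index.values()
    else:
        chosen = [i for row, i in first_index.items() if counts[row] == 1]

    # Sorting the chosen positions restores the original input order.
    return [rows[i] for i in sorted(chosen)]


if __name__ == "__main__":
    data = [("b", 1), ("a", 2), ("b", 1), ("c", 3), ("a", 2)]
    print(stable_distinct(data, keep="first"))
    print(stable_distinct(data, keep="last"))
    print(stable_distinct(data, keep="none"))
```

Because the output order follows the input order, results are deterministic from run to run, which is what users of pandas-style `drop_duplicates` generally expect.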

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
Run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina
