Release v22.12.00 · rapidsai/cudf

🚨 Breaking Changes

Add JNI for substring without 'end' parameter. (#12113) @firestarman
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Fix type promotion edge cases in numerical binops (#12074) @wence-
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Rollback of DeviceBufferLike (#12009) @madsbk
Remove unused managed_allocator (#12005) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Remove validation that requires introspection (#11938) @vyasr
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
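
The two "Default to equal NaNs" entries above (#11952, #11621) change how NaN values are deduplicated in set-style aggregations. The snippet below is a rough, pure-Python illustration of that semantic difference only. It does not call libcudf: `collect_set` and its `nans_equal` flag are hypothetical names invented for this sketch, and it assumes "equal NaNs" means all NaN values collapse into a single set element.

```python
import math


def collect_set(values, nans_equal=True):
    """Return distinct values in first-seen order.

    Hypothetical helper for illustration; not the libcudf API.
    With nans_equal=True, all NaNs are treated as one value.
    With nans_equal=False, every NaN is kept as a distinct element.
    """
    result = []
    seen = set()
    seen_nan = False
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            # NaN != NaN under IEEE 754, so it needs explicit handling.
            if nans_equal and seen_nan:
                continue
            seen_nan = True
            result.append(value)
        elif value not in seen:
            seen.add(value)
            result.append(value)
    return result


data = [1.0, float("nan"), float("nan"), 1.0]
equal_result = collect_set(data, nans_equal=True)
unequal_result = collect_set(data, nans_equal=False)
print(len(equal_result), len(unequal_result))
```

Code that relied on the previous default and expected each NaN to survive as its own element would presumably need to request unequal-NaN behavior explicitly after upgrading; see the linked PRs for the actual parameter names.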

🐛 Bug Fixes

Fix include line for IO Cython modules (#12250) @vyasr
Make dask pinning looser (#12231) @vyasr
Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
Fix compression in ORC writer (#12194) @vuule
Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
Fix decimal binary operations (#12142) @galipremsagar
Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
Fix/disable jitify lto (#12122) @robertmaynard
Fix conditional_full_join benchmark (#12121) @GregoryKimball
Fix regex working-memory-size refactor error (#12119) @davidwendt
Add in negative size checks for columns (#12118) @revans2
Add JNI for substring without 'end' parameter. (#12113) @firestarman
Fix reading of CSV files with blank second row (#12098) @vuule
Fix an error in IO with GzipFile type (#12085) @galipremsagar
Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
Fix alignment of compressed blocks in ORC writer (#12077) @vuule
Fix singleton-range __setitem__ edge case (#12075) @wence-
Fix type promotion edge cases in numerical binops (#12074) @wence-
Force using old fmt in nvbench. (#12067) @vyasr
Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
Force black exclusions for pre-commit. (#12036) @bdice
Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
Fix maximum page size estimate in Parquet writer (#11962) @vuule
Fix local offset handling in bgzip reader (#11918) @upsj
Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
Fix type casting in Series.setitem (#11904) @wence-
Fix memcheck error in get_dremel_data (#11903) @davidwendt
Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
Fix writing of Parquet files with many fragments (#11869) @etseidl
Fix RangeIndex unary operators. (#11868) @vyasr
JNI Avoid NPE for reading host binary data (#11865) @revans2
Fix decimal benchmark input data generation (#11863) @karthikeyann
Fix pre-commit copyright check (#11860) @galipremsagar
Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
Add V2 page header support to parquet reader (#11778) @etseidl
Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
Add symlinks to notebooks. (#12128) @bdice
Add truncate API to python doc pages (#12109) @galipremsagar
Update Numba docs links. (#12107) @bdice
Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
Add pivot_table and crosstab to docs. (#12014) @bdice
Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
Rename libcudf++ to libcudf. (#11953) @bdice
Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
Support + in strings_udf (#12117) @brandon-b-miller
Support upper and lower in strings_udf (#12099) @brandon-b-miller
Add wheel builds (#12096) @vyasr
Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
Mark nvcomp zstd compression stable (#12059) @jbrennan333
Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
Enable building against the libarrow contained in pyarrow (#12034) @vyasr
Add strings like jni and native method (#12032) @cindyyuanjiang
Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
byte_range support for JSON Lines format (#12017) @karthikeyann
Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
Implement JNI for chunked Parquet reader (#11961) @ttnghia
Add method argument to DataFrame.quantile (#11957) @rjzamora
Add gpu memory watermark apis to JNI (#11950) @abellina
Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Enable CEC for strings_udf (#11884) @brandon-b-miller
ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
Implement chunked Parquet reader (#11867) @ttnghia
Add read_orc_metadata to libcudf (#11815) @vuule
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

Reduce number of tests marked spilling (#12197) @madsbk
Pin dask and distributed for release (#12165) @galipremsagar
Don't rely on GNU find in headers_test.sh (#12164) @wence-
Update cp.clip call (#12148) @quasiben
Enable automatic column projection in groupby().agg (#12124) @rjzamora
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Spilling to host memory (#12106) @madsbk
First pass of pd.read_orc changes in tests (#12103) @galipremsagar
Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
Remove CUDA 10 compatibility code. (#12088) @bdice
Move and update dask nightly install in CI (#12082) @galipremsagar
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Remove macros that inspect the contents of exceptions (#12076) @vyasr
Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
Remove overflow error during decimal binops (#12063) @galipremsagar
Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
Add support for DataFrame.from_dict/to_dict and Series.to_dict (#12048) @galipremsagar
Refactor Parquet reader (#12046) @ttnghia
Forward merge 22.10 into 22.12 (#12045) @vyasr
Standardize newlines at ends of files. (#12042) @bdice
Trim trailing whitespace from all files. (#12041) @bdice
Use nosync policy in gather and scatter implementations. (#12038) @bdice
Remove smart quotes from all docstrings. (#12035) @bdice
Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
Add cython-lint to pre-commit checks. (#12020) @bdice
Use pragma once (#12019) @bdice
New GHA to add issues/prs to project board (#12016) @jarmak-nv
Add DataFrame.pivot_table. (#12015) @bdice
Rollback of DeviceBufferLike (#12009) @madsbk
Remove default parameters for nvtext::detail functions (#12007) @davidwendt
Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
Remove unused managed_allocator (#12005) @vyasr
Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
Ignore python docs build artifacts (#12000) @galipremsagar
Use rapids-cmake for google benchmark. (#11997) @vyasr
Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
Remove stale labeler (#11995) @raydouglass
Move protobuf compilation to CMake (#11986) @vyasr
Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
Feature/remove default streams (#11967) @vyasr
Add pool memory resource to libcudf basic example (#11966) @davidwendt
Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Add deprecation warning for set_allocator. (#11958) @vyasr
Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Add strip_delimiters option to read_text (#11946) @upsj
Refactor multibyte_split output_builder (#11945) @upsj
Remove validation that requires introspection (#11938) @vyasr
Add .str.find_multiple API (#11928) @galipremsagar
Add regex_program class for use with all regex APIs (#11927) @davidwendt
Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
Performance improvement in JSON Tree traversal (#11919) @karthikeyann
Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
Pin mimesis version in setup.py. (#11906) @bdice
Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
Relax codecov threshold diff (#11899) @galipremsagar
Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
Add coverage for string UDF tests. (#11891) @vyasr
Provide data_chunk_source wrapper for datasource (#11886) @upsj
Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
Add ngroup (#11871) @shwina
Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
Unpin dask and distributed for development (#11859) @galipremsagar
Remove unused includes for table/row_operators (#11857) @GregoryKimball
Use conda-forge's pyorc (#11855) @jakirkham
Add libcudf strings examples (#11849) @davidwendt
Remove cudf_io namespace alias (#11827) @vuule
Test/remove thrust vector usage (#11813) @vyasr
Add BGZIP reader to python read_text (#11802) @upsj
Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
Add BGZIP multibyte_split benchmark (#11723) @upsj
Bifurcate Dependency Lists (#11674) @bdice
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
Make all nvcc warnings into errors (#8916) @trxcllnt
