Release v22.04.00 · rapidsai/cudf

🚨 Breaking Changes

Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
Refactor stream compaction APIs (#10370) @PointKernel
Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
Rewrites sample API (#10262) @isVoid
Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
Remove deprecated code (#10124) @vyasr
Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
Optimize compaction operations (#10030) @PointKernel
Remove deprecated method Series.set_index. (#9945) @bdice
Add cudf::strings::findall_record API (#9911) @davidwendt
Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar

🐛 Bug Fixes

Fix an issue with tdigest merge aggregations. (#10506) @nvdbaranec
Batch of fixes for index overflows in grid stride loops. (#10448) @nvdbaranec
Update dask_cudf imports to be compatible with latest dask (#10442) @rlratzel
Fix for integer overflow in contiguous-split (#10437) @jbrennan333
Fix has_null predicate for drop_list_duplicates on nested structs (#10436) @sperlingxx
Fix empty reduce with List output and non-List input (#10435) @sperlingxx
Fix list and struct meta generation issue in dask-cudf (#10434) @galipremsagar
Fix error in cudf.to_numeric when a bool input is passed (#10431) @galipremsagar
Support cupy array in quantile input (#10429) @galipremsagar
Fix benchmarks to work with new aggregation types (#10428) @davidwendt
Fix cudf::shift to handle offset greater than column size (#10414) @davidwendt
Fix lifespan of the temporary directory that holds cuFile configuration file (#10403) @vuule
Fix error thrown in compiled-binaryop benchmark (#10398) @davidwendt
Limiting async allocator using alignment of 512 (#10395) @rongou
Include <optional> in multibyte split. (#10385) @bdice
Fix issue with column and scalar re-assignment (#10377) @galipremsagar
Fix floating point data generation in benchmarks (#10372) @vuule
Avoid overflow in fused_concatenate_kernel output_index (#10344) @abellina
Remove is_relationally_comparable for table device views (#10342) @davidwendt
Fix debug compile error in device_span to column_view conversion (#10331) @davidwendt
Add Pascal support to JCUDF transcode (row_conversion) (#10329) @mythrocks
Fix std::bad_alloc exception due to JIT reserving a huge buffer (#10317) @ttnghia
Fixes up the overflowed fixed-point round on nullable column (#10316) @sperlingxx
Fix DataFrame slicing issues for empty cases (#10310) @brandon-b-miller
Fix documentation issues (#10307) @ajschmidt8
Allow Java bindings to use default decimal precisions when writing columns (#10276) @sperlingxx
Fix incorrect slicing of GDS read/write calls (#10274) @vuule
Fix out-of-memory error in compiled-binaryop benchmark (#10269) @davidwendt
Add tests of reflected ufuncs and fix behavior of logical reflected ufuncs (#10261) @vyasr
Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
Fix out-of-memory error in UrlDecode benchmark (#10258) @davidwendt
Fix groupby reductions that perform operations on source type instead of target type (#10250) @ttnghia
Fix small leak in explode (#10245) @revans2
Yet another small JNI memory leak (#10238) @revans2
Fix regex octal parsing to limit to 3 characters (#10233) @davidwendt
Fix string to decimal128 conversion handling large exponents (#10231) @davidwendt
Fix JNI leak on copy to device (#10229) @revans2
Fix the data generator element size for decimal types (#10225) @vuule
Fix decimal metadata in parquet writer (#10224) @galipremsagar
Fix strings handling of hex in regex pattern (#10220) @davidwendt
Fix docs builds (#10216) @ajschmidt8
Fix a leftover _has_nulls change from Nullate (#10211) @devavret
Fix bitmask of the output for JNI of lists::drop_list_duplicates (#10210) @ttnghia
Fix compile error in binaryop/compiled/util.cpp (#10209) @ttnghia
Skip ORC and Parquet readers' benchmark cases that are not currently supported (#10194) @vuule
Fix JNI leak of a cudf::column_view native class. (#10171) @revans2
Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
Convert Column Name to String Before Using Struct Column Factory (#10156) @isVoid
Preserve the correct ListDtype while creating an identical empty column (#10151) @galipremsagar
benchmark fixture - static object pointer fix (#10145) @karthikeyann
Fix UDF Caching (#10133) @brandon-b-miller
Raise duplicate column error in DataFrame.rename (#10120) @galipremsagar
Fix flaky memory usage test by guaranteeing array size. (#10114) @vyasr
Encode values from python callback for C++ (#10103) @jdye64
Add check for regex instructions causing an infinite-loop (#10095) @davidwendt
Remove metadata singleton from nvtext normalizer (#10090) @davidwendt
Column equality testing fixes (#10011) @brandon-b-miller
Pin libcudf runtime dependency for cudf / libcudf-kafka nightlies (#9847) @charlesbluca

📖 Documentation

Fix documentation for DataFrame.corr and Series.corr. (#10493) @bdice
Add cut to API docs (#10479) @shwina
Remove documentation for methods removed in #10124. (#10366) @bdice
Fix documentation issues (#10306) @ajschmidt8
Fix fixed_point binary operation documentation (#10198) @codereport
Remove cleaned up methods from docs (#10189) @galipremsagar
Update developer guide to recommend no default stream parameter. (#10136) @bdice
Update benchmarking guide to use NVBench. (#10093) @bdice

🚀 New Features

Add StringIO support to read_text (#10465) @cwharris
Add support for tdigest and merge_tdigest aggregations through cudf::reduce (#10433) @nvdbaranec
JNI support for Collect Ops in Reduction (#10427) @sperlingxx
Enable read_text with dask_cudf using byte_range (#10407) @ChrisJar
Add cudf::stable_sort_by_key (#10387) @PointKernel
Implement maps_column_view abstraction over LIST<STRUCT<K,V>> (#10380) @mythrocks
Support Java bindings for Avro reader (#10373) @HaoYang670
Refactor stream compaction APIs (#10370) @PointKernel
Support collect aggregations in reduction (#10353) @sperlingxx
Refactor array_ufunc for Index and unify across all classes (#10346) @vyasr
Add JNI for extract_list_element with index column (#10341) @firestarman
Support min and max operations for structs in rolling window (#10332) @ttnghia
Add device create_sequence_table for benchmarks (#10300) @karthikeyann
Enable numpy ufuncs for DataFrame (#10287) @vyasr
move input generation for json benchmark to device (#10281) @karthikeyann
move input generation for type dispatcher benchmark to device (#10280) @karthikeyann
move input generation for copy benchmark to device (#10279) @karthikeyann
generate url decode benchmark input in device (#10278) @karthikeyann
device input generation in join bench (#10277) @karthikeyann
Add nvtext::byte_pair_encoding API (#10270) @davidwendt
Prevent internal usage of expensive APIs (#10263) @vyasr
Column to JCUDF row for tables with strings (#10235) @hyperbolic2346
Support percent_rank() aggregation (#10227) @mythrocks
Refactor Series.array_ufunc (#10217) @vyasr
Reduce pytest runtime (#10203) @brandon-b-miller
Add regex flags parameter to python cudf strings split (#10185) @davidwendt
Support for MOD, PMOD and PYMOD for decimal32/64/128 (#10179) @codereport
Adding string row size iterator for row to column and column to row conversion (#10157) @hyperbolic2346
Add file size counter to cuIO benchmarks (#10154) @vuule
byte_range support for multibyte_split/read_text (#10150) @cwharris
Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
Add maxSplit parameter to Java binding for strings:split (#10137) @ttnghia
Add libcudf strings split API that accepts regex pattern (#10128) @davidwendt
generate benchmark input in device (#10109) @karthikeyann
Avoid nan_as_null op if nan_count is 0 (#10082) @galipremsagar
Add Dataframe and Index nunique (#10077) @martinfalisse
Support nanosecond timestamps in parquet (#10063) @PointKernel
Java bindings for mixed semi and anti joins (#10040) @jlowe
Implement mixed equality/conditional semi/anti joins (#10037) @vyasr
Optimize compaction operations (#10030) @PointKernel
Support args= in Series.apply (#9982) @brandon-b-miller
Add cudf::strings::findall_record API (#9911) @davidwendt
Add covariance for sort groupby (python) (#9889) @mayankanand007
Implement DataFrame diff() (#9817) @skirui-source
Implement DataFrame pct_change (#9805) @skirui-source
Support segmented reductions and null mask reductions (#9621) @isVoid
Add 'spearman' correlation method for dataframe.corr and series.corr (#7141) @dominicshanshan

🛠️ Improvements

Add scipy skip for a test (#10502) @galipremsagar
Temporarily disable new ops-bot functionality (#10496) @ajschmidt8
Include <cstddef> to fix compilation of parquet reader on GCC 11. (#10483) @bdice
Pin dask and distributed (#10481) @galipremsagar
MD5 refactoring. (#10445) @bdice
Remove or split up Frame methods that use the index (#10439) @vyasr
Centralization of tdigest aggregation code. (#10422) @nvdbaranec
Simplify column binary operations (#10421) @vyasr
Add .github/ops-bot.yaml config file (#10420) @ajschmidt8
Use list of columns for methods in Groupby.pyx (#10419) @isVoid
Remove warnings in test_timedelta.py (#10418) @galipremsagar
Fix some warnings in test_parquet.py (#10416) @galipremsagar
JNI support for segmented reduce (#10413) @revans2
Clean up null mask after purging null entries (#10412) @sperlingxx
Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
Use str instead of builtins.str. (#10410) @bdice
Fix warnings in test_rolling (#10405) @bdice
Enable codecov github-check in CI (#10404) @galipremsagar
Fix warnings in test_cuda_apply, test_numerical, test_pickling, test_unaops. (#10402) @bdice
Set column names in _from_columns_like_self factory (#10400) @isVoid
Refactor nvtx annotations in cudf & dask-cudf (#10396) @galipremsagar
Consolidate .cov and .corr for sort groupby (#10386) @skirui-source
Consolidate some Frame APIs (#10381) @vyasr
Refactor hash functions and hash_combine (#10379) @bdice
Add nvtx annotations for Series and Index (#10374) @galipremsagar
Refactor filling.repeat API (#10371) @isVoid
Move standalone UTF8 functions from string_view.hpp to utf8.hpp (#10369) @davidwendt
Remove doc for deprecated function one_hot_encoding (#10367) @isVoid
Refactor array function (#10364) @vyasr
Fix warnings in test_csv.py. (#10362) @bdice
Implement a mixin for binops (#10360) @vyasr
Refactor cython interface: copying.pyx (#10359) @isVoid
Implement a mixin for scans (#10358) @vyasr
Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
Add cleanup of python artifacts (#10355) @galipremsagar
Fix warnings in test_categorical.py. (#10354) @bdice
Create a dispatcher for invoking regex kernel functions (#10349) @davidwendt
Fix codecov in CI (#10347) @galipremsagar
Enable caching for memory_usage calculation in Column (#10345) @galipremsagar
C++17 cleanup: traits replace std::enable_if<>::type with std::enable_if_t (#10343) @karthikeyann
JNI: Support appending DECIMAL128 into ColumnBuilder in terms of byte array (#10338) @sperlingxx
multibyte_split test improvements (#10328) @vuule
Fix warnings in test_binops.py. (#10327) @bdice
Fix warnings from pandas in test_array_ufunc.py. (#10324) @bdice
Update upload script (#10321) @ajschmidt8
Move hash type declarations to hashing.hpp (#10320) @davidwendt
C++17 cleanup: traits replace ::value with _v (#10319) @karthikeyann
Remove internal columns usage (#10315) @vyasr
Remove extraneous build.sh parameter (#10313) @ajschmidt8
Add const qualifier to MurmurHash3_32::hash_combine (#10311) @davidwendt
Remove TODO in libcudf_kafka recipe (#10309) @ajschmidt8
Add conversions between column_view and device_span<T const>. (#10302) @bdice
Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
Deprecate DataFrame.iteritems and introduce .items (#10298) @galipremsagar
Explicitly request CMake use gnu++17 over c++17 (#10297) @robertmaynard
Add copyright check as pre-commit hook. (#10290) @vyasr
DataFrame insert and creation optimizations (#10285) @galipremsagar
Improve hash join detail functions (#10273) @PointKernel
Replace custom cached_property implementation with functools (#10272) @shwina
Rewrites sample API (#10262) @isVoid
Bump hadoop-common from 3.1.0 to 3.1.4 in /java (#10259) @dependabot[bot]
Remove making redundant copy across code-base (#10257) @galipremsagar
Add more nvtx annotations (#10256) @galipremsagar
Add copyright check in cudf (#10253) @galipremsagar
Remove redundant copies in fillna to improve performance (#10241) @galipremsagar
Remove std::numeric_limit specializations for timestamp & durations (#10239) @codereport
Optimize DataFrame creation across code-base (#10236) @galipremsagar
Change pytest distribution algorithm and increase parallelism in CI (#10232) @galipremsagar
Add environment variables for I/O thread pool and slice sizes (#10218) @vuule
Add regex flags to strings findall functions (#10208) @davidwendt
Update dask-cudf parquet tests to reflect upstream bugfixes to _metadata (#10206) @charlesbluca
Remove unnecessary nunique function in Series. (#10205) @martinfalisse
Refactor DataFrame tests. (#10204) @bdice
Rewrites column.__setitem__, Use boolean_mask_scatter (#10202) @isVoid
Java utilities to aid in accelerating aggregations on 128-bit types (#10201) @jlowe
Fix docstrings alignment in Frame methods (#10199) @galipremsagar
Fix cuco pair issue in hash join (#10195) @PointKernel
Replace dask groupby .index usages with .by (#10193) @galipremsagar
Add regex flags to strings extract function (#10192) @davidwendt
Forward-merge branch-22.02 to branch-22.04 (#10191) @bdice
Add CMake install rule for tests (#10190) @ajschmidt8
Unpin dask & distributed (#10182) @galipremsagar
Add comments to explain test validation (#10176) @galipremsagar
Reduce warnings in pytest output (#10168) @bdice
Some consolidation of indexed frame methods (#10167) @vyasr
Refactor isin implementations (#10165) @vyasr
Faster struct row comparator (#10164) @devavret
Refactor groupby::get_groups. (#10161) @bdice
Deprecate decimal_cols_as_float in ORC reader (C++ layer) (#10152) @vuule
Replace ccache with sccache (#10146) @ajschmidt8
Murmur3 hash kernel cleanup (#10143) @rwlee
Deprecate decimal_cols_as_float in ORC reader (#10142) @galipremsagar
Run pyupgrade 2.31.0. (#10141) @bdice
Remove drop_nan from internal IndexedFrame._drop_na_rows. (#10140) @bdice
Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
Update cmake-format script for branch 22.04. (#10132) @bdice
Accept r-value references in convert_table_for_return(): (#10131) @mythrocks
Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
Remove deprecated code (#10124) @vyasr
Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
Remove benchmarks suffix (#10112) @bdice
Update cudf java binding version to 22.04.0-SNAPSHOT (#10084) @pxLi
Remove unnecessary docker files. (#10069) @vyasr
Limit benchmark iterations using environment variable (#10060) @karthikeyann
Add timing chart for libcudf build metrics report page (#10038) @davidwendt
JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder (#10025) @sperlingxx
Reduce redundant code in CUDF JNI (#10019) @mythrocks
Make snappy decompress check more efficient (#9995) @cheinger
Remove deprecated method Series.set_index. (#9945) @bdice
Implement a mixin for reductions (#9925) @vyasr
JNI: Push back decimal utils from spark-rapids (#9907) @sperlingxx
Add assert_column_memory_* (#9882) @isVoid
Add CUDF_UNREACHABLE macro. (#9727) @bdice
Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar
