v22.04.00
🚨 Breaking Changes
- Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
- Refactor stream compaction APIs (#10370) @PointKernel
- Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
- Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
- Rewrites sample API (#10262) @isVoid
- Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
- Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
- Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
- Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
- Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
- Remove deprecated code (#10124) @vyasr
- Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
- Optimize compaction operations (#10030) @PointKernel
- Remove deprecated method Series.set_index. (#9945) @bdice
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar
🐛 Bug Fixes
- Fix an issue with tdigest merge aggregations. (#10506) @nvdbaranec
- Batch of fixes for index overflows in grid stride loops. (#10448) @nvdbaranec
- Update dask_cudf imports to be compatible with latest dask (#10442) @rlratzel
- Fix for integer overflow in contiguous-split (#10437) @jbrennan333
- Fix has_null predicate for drop_list_duplicates on nested structs (#10436) @sperlingxx
- Fix empty reduce with List output and non-List input (#10435) @sperlingxx
- Fix list and struct meta generation issue in dask-cudf (#10434) @galipremsagar
- Fix error in cudf.to_numeric when a bool input is passed (#10431) @galipremsagar
- Support cupy array in quantile input (#10429) @galipremsagar
- Fix benchmarks to work with new aggregation types (#10428) @davidwendt
- Fix cudf::shift to handle offset greater than column size (#10414) @davidwendt
- Fix lifespan of the temporary directory that holds cuFile configuration file (#10403) @vuule
- Fix error thrown in compiled-binaryop benchmark (#10398) @davidwendt
- Limiting async allocator using alignment of 512 (#10395) @rongou
- Include <optional> in multibyte split. (#10385) @bdice
- Fix issue with column and scalar re-assignment (#10377) @galipremsagar
- Fix floating point data generation in benchmarks (#10372) @vuule
- Avoid overflow in fused_concatenate_kernel output_index (#10344) @abellina
- Remove is_relationally_comparable for table device views (#10342) @davidwendt
- Fix debug compile error in device_span to column_view conversion (#10331) @davidwendt
- Add Pascal support to JCUDF transcode (row_conversion) (#10329) @mythrocks
- Fix std::bad_alloc exception due to JIT reserving a huge buffer (#10317) @ttnghia
- Fixes up the overflowed fixed-point round on nullable column (#10316) @sperlingxx
- Fix DataFrame slicing issues for empty cases (#10310) @brandon-b-miller
- Fix documentation issues (#10307) @ajschmidt8
- Allow Java bindings to use default decimal precisions when writing columns (#10276) @sperlingxx
- Fix incorrect slicing of GDS read/write calls (#10274) @vuule
- Fix out-of-memory error in compiled-binaryop benchmark (#10269) @davidwendt
- Add tests of reflected ufuncs and fix behavior of logical reflected ufuncs (#10261) @vyasr
- Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
- Fix out-of-memory error in UrlDecode benchmark (#10258) @davidwendt
- Fix groupby reductions that perform operations on source type instead of target type (#10250) @ttnghia
- Fix small leak in explode (#10245) @revans2
- Yet another small JNI memory leak (#10238) @revans2
- Fix regex octal parsing to limit to 3 characters (#10233) @davidwendt
- Fix string to decimal128 conversion handling large exponents (#10231) @davidwendt
- Fix JNI leak on copy to device (#10229) @revans2
- Fix the data generator element size for decimal types (#10225) @vuule
- Fix decimal metadata in parquet writer (#10224) @galipremsagar
- Fix strings handling of hex in regex pattern (#10220) @davidwendt
- Fix docs builds (#10216) @ajschmidt8
- Fix a leftover _has_nulls change from Nullate (#10211) @devavret
- Fix bitmask of the output for JNI of lists::drop_list_duplicates (#10210) @ttnghia
- Fix compile error in binaryop/compiled/util.cpp (#10209) @ttnghia
- Skip ORC and Parquet readers' benchmark cases that are not currently supported (#10194) @vuule
- Fix JNI leak of a cudf::column_view native class. (#10171) @revans2
- Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
- Convert Column Name to String Before Using Struct Column Factory (#10156) @isVoid
- Preserve the correct ListDtype while creating an identical empty column (#10151) @galipremsagar
- benchmark fixture - static object pointer fix (#10145) @karthikeyann
- Fix UDF Caching (#10133) @brandon-b-miller
- Raise duplicate column error in DataFrame.rename (#10120) @galipremsagar
- Fix flaky memory usage test by guaranteeing array size. (#10114) @vyasr
- Encode values from python callback for C++ (#10103) @jdye64
- Add check for regex instructions causing an infinite-loop (#10095) @davidwendt
- Remove metadata singleton from nvtext normalizer (#10090) @davidwendt
- Column equality testing fixes (#10011) @brandon-b-miller
- Pin libcudf runtime dependency for cudf / libcudf-kafka nightlies (#9847) @charlesbluca
📖 Documentation
- Fix documentation for DataFrame.corr and Series.corr. (#10493) @bdice
- Add cut to API docs (#10479) @shwina
- Remove documentation for methods removed in #10124. (#10366) @bdice
- Fix documentation issues (#10306) @ajschmidt8
- Fix fixed_point binary operation documentation (#10198) @codereport
- Remove cleaned up methods from docs (#10189) @galipremsagar
- Update developer guide to recommend no default stream parameter. (#10136) @bdice
- Update benchmarking guide to use NVBench. (#10093) @bdice
🚀 New Features
- Add StringIO support to read_text (#10465) @cwharris
- Add support for tdigest and merge_tdigest aggregations through cudf::reduce (#10433) @nvdbaranec
- JNI support for Collect Ops in Reduction (#10427) @sperlingxx
- Enable read_text with dask_cudf using byte_range (#10407) @ChrisJar
- Add cudf::stable_sort_by_key (#10387) @PointKernel
- Implement maps_column_view abstraction over LIST<STRUCT<K,V>> (#10380) @mythrocks
- Support Java bindings for Avro reader (#10373) @HaoYang670
- Refactor stream compaction APIs (#10370) @PointKernel
- Support collect aggregations in reduction (#10353) @sperlingxx
- Refactor array_ufunc for Index and unify across all classes (#10346) @vyasr
- Add JNI for extract_list_element with index column (#10341) @firestarman
- Support min and max operations for structs in rolling window (#10332) @ttnghia
- Add device create_sequence_table for benchmarks (#10300) @karthikeyann
- Enable numpy ufuncs for DataFrame (#10287) @vyasr
- move input generation for json benchmark to device (#10281) @karthikeyann
- move input generation for type dispatcher benchmark to device (#10280) @karthikeyann
- move input generation for copy benchmark to device (#10279) @karthikeyann
- generate url decode benchmark input in device (#10278) @karthikeyann
- device input generation in join bench (#10277) @karthikeyann
- Add nvtext::byte_pair_encoding API (#10270) @davidwendt
- Prevent internal usage of expensive APIs (#10263) @vyasr
- Column to JCUDF row for tables with strings (#10235) @hyperbolic2346
- Support percent_rank() aggregation (#10227) @mythrocks
- Refactor Series.array_ufunc (#10217) @vyasr
- Reduce pytest runtime (#10203) @brandon-b-miller
- Add regex flags parameter to python cudf strings split (#10185) @davidwendt
- Support for MOD, PMOD and PYMOD for decimal32/64/128 (#10179) @codereport
- Adding string row size iterator for row to column and column to row conversion (#10157) @hyperbolic2346
- Add file size counter to cuIO benchmarks (#10154) @vuule
- byte_range support for multibyte_split/read_text (#10150) @cwharris
- Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
- Add maxSplit parameter to Java binding for strings:split (#10137) @ttnghia
- Add libcudf strings split API that accepts regex pattern (#10128) @davidwendt
- generate benchmark input in device (#10109) @karthikeyann
- Avoid nan_as_null op if nan_count is 0 (#10082) @galipremsagar
- Add Dataframe and Index nunique (#10077) @martinfalisse
- Support nanosecond timestamps in parquet (#10063) @PointKernel
- Java bindings for mixed semi and anti joins (#10040) @jlowe
- Implement mixed equality/conditional semi/anti joins (#10037) @vyasr
- Optimize compaction operations (#10030) @PointKernel
- Support args= in Series.apply (#9982) @brandon-b-miller
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Add covariance for sort groupby (python) (#9889) @mayankanand007
- Implement DataFrame diff() (#9817) @skirui-source
- Implement DataFrame pct_change (#9805) @skirui-source
- Support segmented reductions and null mask reductions (#9621) @isVoid
- Add 'spearman' correlation method for dataframe.corr and series.corr (#7141) @dominicshanshan
🛠️ Improvements
- Add scipy skip for a test (#10502) @galipremsagar
- Temporarily disable new ops-bot functionality (#10496) @ajschmidt8
- Include <cstddef> to fix compilation of parquet reader on GCC 11. (#10483) @bdice
- Pin dask and distributed (#10481) @galipremsagar
- MD5 refactoring. (#10445) @bdice
- Remove or split up Frame methods that use the index (#10439) @vyasr
- Centralization of tdigest aggregation code. (#10422) @nvdbaranec
- Simplify column binary operations (#10421) @vyasr
- Add .github/ops-bot.yaml config file (#10420) @ajschmidt8
- Use list of columns for methods in Groupby.pyx (#10419) @isVoid
- Remove warnings in test_timedelta.py (#10418) @galipremsagar
- Fix some warnings in test_parquet.py (#10416) @galipremsagar
- JNI support for segmented reduce (#10413) @revans2
- Clean up null mask after purging null entries (#10412) @sperlingxx
- Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
- Use str instead of builtins.str. (#10410) @bdice
- Fix warnings in test_rolling (#10405) @bdice
- Enable codecov github-check in CI (#10404) @galipremsagar
- Fix warnings in test_cuda_apply, test_numerical, test_pickling, test_unaops. (#10402) @bdice
- Set column names in _from_columns_like_self factory (#10400) @isVoid
- Refactor nvtx annotations in cudf & dask-cudf (#10396) @galipremsagar
- Consolidate .cov and .corr for sort groupby (#10386) @skirui-source
- Consolidate some Frame APIs (#10381) @vyasr
- Refactor hash functions and hash_combine (#10379) @bdice
- Add nvtx annotations for Series and Index (#10374) @galipremsagar
- Refactor filling.repeat API (#10371) @isVoid
- Move standalone UTF8 functions from string_view.hpp to utf8.hpp (#10369) @davidwendt
- Remove doc for deprecated function one_hot_encoding (#10367) @isVoid
- Refactor array function (#10364) @vyasr
- Fix warnings in test_csv.py. (#10362) @bdice
- Implement a mixin for binops (#10360) @vyasr
- Refactor cython interface: copying.pyx (#10359) @isVoid
- Implement a mixin for scans (#10358) @vyasr
- Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
- Add cleanup of python artifacts (#10355) @galipremsagar
- Fix warnings in test_categorical.py. (#10354) @bdice
- Create a dispatcher for invoking regex kernel functions (#10349) @davidwendt
- Fix codecov in CI (#10347) @galipremsagar
- Enable caching for memory_usage calculation in Column (#10345) @galipremsagar
- C++17 cleanup: traits replace std::enable_if<>::type with std::enable_if_t (#10343) @karthikeyann
- JNI: Support appending DECIMAL128 into ColumnBuilder in terms of byte array (#10338) @sperlingxx
- multibyte_split test improvements (#10328) @vuule
- Fix warnings in test_binops.py. (#10327) @bdice
- Fix warnings from pandas in test_array_ufunc.py. (#10324) @bdice
- Update upload script (#10321) @ajschmidt8
- Move hash type declarations to hashing.hpp (#10320) @davidwendt
- C++17 cleanup: traits replace ::value with _v (#10319) @karthikeyann
- Remove internal columns usage (#10315) @vyasr
- Remove extraneous build.sh parameter (#10313) @ajschmidt8
- Add const qualifier to MurmurHash3_32::hash_combine (#10311) @davidwendt
- Remove TODO in libcudf_kafka recipe (#10309) @ajschmidt8
- Add conversions between column_view and device_span<T const>. (#10302) @bdice
- Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
- Deprecate DataFrame.iteritems and introduce .items (#10298) @galipremsagar
- Explicitly request CMake use gnu++17 over c++17 (#10297) @robertmaynard
- Add copyright check as pre-commit hook. (#10290) @vyasr
- DataFrame insert and creation optimizations (#10285) @galipremsagar
- Improve hash join detail functions (#10273) @PointKernel
- Replace custom cached_property implementation with functools (#10272) @shwina
- Rewrites sample API (#10262) @isVoid
- Bump hadoop-common from 3.1.0 to 3.1.4 in /java (#10259) @dependabot[bot]
- Remove making redundant copy across code-base (#10257) @galipremsagar
- Add more nvtx annotations (#10256) @galipremsagar
- Add copyright check in cudf (#10253) @galipremsagar
- Remove redundant copies in fillna to improve performance (#10241) @galipremsagar
- Remove std::numeric_limit specializations for timestamp & durations (#10239) @codereport
- Optimize DataFrame creation across code-base (#10236) @galipremsagar
- Change pytest distribution algorithm and increase parallelism in CI (#10232) @galipremsagar
- Add environment variables for I/O thread pool and slice sizes (#10218) @vuule
- Add regex flags to strings findall functions (#10208) @davidwendt
- Update dask-cudf parquet tests to reflect upstream bugfixes to _metadata (#10206) @charlesbluca
- Remove unnecessary nunique function in Series. (#10205) @martinfalisse
- Refactor DataFrame tests. (#10204) @bdice
- Rewrites column.__setitem__, Use boolean_mask_scatter (#10202) @isVoid
- Java utilities to aid in accelerating aggregations on 128-bit types (#10201) @jlowe
- Fix docstrings alignment in Frame methods (#10199) @galipremsagar
- Fix cuco pair issue in hash join (#10195) @PointKernel
- Replace dask groupby .index usages with .by (#10193) @galipremsagar
- Add regex flags to strings extract function (#10192) @davidwendt
- Forward-merge branch-22.02 to branch-22.04 (#10191) @bdice
- Add CMake install rule for tests (#10190) @ajschmidt8
- Unpin dask & distributed (#10182) @galipremsagar
- Add comments to explain test validation (#10176) @galipremsagar
- Reduce warnings in pytest output (#10168) @bdice
- Some consolidation of indexed frame methods (#10167) @vyasr
- Refactor isin implementations (#10165) @vyasr
- Faster struct row comparator (#10164) @devavret
- Refactor groupby::get_groups. (#10161) @bdice
- Deprecate decimal_cols_as_float in ORC reader (C++ layer) (#10152) @vuule
- Replace ccache with sccache (#10146) @ajschmidt8
- Murmur3 hash kernel cleanup (#10143) @rwlee
- Deprecate decimal_cols_as_float in ORC reader (#10142) @galipremsagar
- Run pyupgrade 2.31.0. (#10141) @bdice
- Remove drop_nan from internal IndexedFrame._drop_na_rows. (#10140) @bdice
- Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
- Update cmake-format script for branch 22.04. (#10132) @bdice
- Accept r-value references in convert_table_for_return(): (#10131) @mythrocks
- Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
- Remove deprecated code (#10124) @vyasr
- Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
- Remove benchmarks suffix (#10112) @bdice
- Update cudf java binding version to 22.04.0-SNAPSHOT (#10084) @pxLi
- Remove unnecessary docker files. (#10069) @vyasr
- Limit benchmark iterations using environment variable (#10060) @karthikeyann
- Add timing chart for libcudf build metrics report page (#10038) @davidwendt
- JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder (#10025) @sperlingxx
- Reduce redundant code in CUDF JNI (#10019) @mythrocks
- Make snappy decompress check more efficient (#9995) @cheinger
- Remove deprecated method Series.set_index. (#9945) @bdice
- Implement a mixin for reductions (#9925) @vyasr
- JNI: Push back decimal utils from spark-rapids (#9907) @sperlingxx
- Add assert_column_memory_* (#9882) @isVoid
- Add CUDF_UNREACHABLE macro. (#9727) @bdice
- Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar