Release v22.02.00 · rapidsai/cudf

🚨 Breaking Changes

ORC writer API changes for granular statistics (#10058) @mythrocks
decimal128 Support for to/from_arrow (#9986) @codereport
Remove deprecated method one_hot_encoding (#9977) @isVoid
Remove str.subword_tokenize (#9968) @VibhuJawa
Remove deprecated method parameter from merge and join. (#9944) @bdice
Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
Remove deprecated method Series.hash_encode. (#9942) @bdice
Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
Introduce nan_as_null parameter for cudf.Index (#9893) @galipremsagar
Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
Break tie for top categorical columns in Series.describe (#9867) @isVoid
Add partitioning support in parquet writer (#9810) @devavret
Move drop_duplicates, drop_na, _gather, take to IndexedFrame and create their _base_index counterparts (#9807) @isVoid
Raise temporary error for decimal128 types in parquet reader (#9804) @galipremsagar
Change default dtype of all nulls column from float to object (#9803) @galipremsagar
Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
Add decimal128 support to Parquet reader and writer (#9765) @vuule
Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
Match pandas scalar result types in reductions (#9717) @brandon-b-miller
Add parameters to control row group size in Parquet writer (#9677) @vuule
Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
Add support for decimal128 in cudf python (#9533) @galipremsagar
Implement lists::index_of() to find positions in list rows (#9510) @mythrocks
Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
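
Several entries above concern decimal128 and choosing a decimal width by precision (#9775). As a rough illustration only, and not the ORC reader's actual selection logic, the idea of "smallest decimal type with required precision" can be sketched in plain Python. The function name and the type-name strings are hypothetical; the digit limits (9, 18, 38) are the largest decimal precisions that fit in 32-, 64-, and 128-bit integer storage.

```python
# Largest number of decimal digits each fixed-width decimal type can store.
# Ordered narrowest first; dicts preserve insertion order.
MAX_PRECISION = {"decimal32": 9, "decimal64": 18, "decimal128": 38}


def smallest_decimal_type(precision: int) -> str:
    """Return the narrowest type name able to hold `precision` digits.

    Hypothetical helper for illustration; it is not part of cudf.
    """
    if precision < 1:
        raise ValueError("precision must be a positive integer")
    for name, max_digits in MAX_PRECISION.items():
        if precision <= max_digits:
            return name
    raise ValueError(f"precision {precision} exceeds decimal128 (38 digits)")
```

Under these assumptions, a column declared with precision 12 would map to the 64-bit type, and anything above 18 digits would need the 128-bit type.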

🐛 Bug Fixes

Add check for negative stripe index in ORC reader (#10074) @vuule
Update Java tests to expect DECIMAL128 from Arrow (#10073) @jlowe
Avoid index materialization when DataFrame is created with un-named Series objects (#10071) @galipremsagar
Fix GCC 11 compilation errors (#10067) @rongou
Fix columns ordering issue in parquet reader (#10066) @galipremsagar
Fix dataframe setitem with ndarray types (#10056) @galipremsagar
Remove implicit copy due to conversion from cudf::size_type and size_t (#10045) @robertmaynard
Include <optional> in headers that use std::optional (#10044) @robertmaynard
Fix repr and concat of StructColumn (#10042) @galipremsagar
Include row group level stats when writing ORC files (#10041) @vuule
build.sh respects the --build_metrics and --incl_cache_stats flags (#10035) @robertmaynard
Fix memory leaks in JNI native code. (#10029) @mythrocks
Update JNI to use new arena mr constructor (#10027) @rongou
Fix null check when comparing structs in arg_min operation of reduction/groupby (#10026) @ttnghia
Wrap CI script shell variables in quotes to fix local testing. (#10018) @bdice
cudftestutil no longer propagates compiler flags to external users (#10017) @robertmaynard
Remove CUDA_DEVICE_CALLABLE macro usage (#10015) @hyperbolic2346
Add missing list filling header in meta.yaml (#10007) @devavret
Fix conda recipes for custreamz & cudf_kafka (#10003) @ajschmidt8
Fix matching regex word-boundary (\b) in strings replace (#9997) @davidwendt
Fix null check when comparing structs in min and max reduction/groupby operations (#9994) @ttnghia
Fix octal pattern matching in regex string (#9993) @davidwendt
decimal128 Support for to/from_arrow (#9986) @codereport
Fix groupby shift/diff/fill after selecting from a GroupBy (#9984) @shwina
Fix the overflow problem of decimal rescale (#9966) @sperlingxx
Use default value for decimal precision in parquet writer when not specified (#9963) @devavret
Fix cudf java build error. (#9958) @firestarman
Use gpuci_mamba_retry to install local artifacts. (#9951) @bdice
Fix regression HostColumnVectorCore requiring native libs (#9948) @jlowe
Rename aggregate_metadata in writer to fix name collision (#9938) @devavret
Fix issue with percentile_approx where output tdigests could have uninitialized data at the end. (#9931) @nvdbaranec
Resolve racecheck errors in ORC kernels (#9916) @vuule
Fix the java build after parquet partitioning support (#9908) @revans2
Fix compilation of benchmark for parquet writer. (#9905) @bdice
Fix a memcheck error in ORC writer (#9896) @vuule
Introduce nan_as_null parameter for cudf.Index (#9893) @galipremsagar
Fix fallback to sort aggregation for grouping only hash aggregate (#9891) @abellina
Add zlib to cudfjni link when using static libcudf library dependency (#9890) @jlowe
Fix AttributeError raised by the TimedeltaIndex constructor. (#9884) @skirui-source
Fix cudf.Scalar string datetime construction (#9875) @brandon-b-miller
Load libcufile.so with RTLD_NODELETE flag (#9872) @vuule
Break tie for top categorical columns in Series.describe (#9867) @isVoid
Fix null handling for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan (#9864) @ttnghia
Add one-level list encoding support in parquet reader (#9848) @PointKernel
Fix an out-of-bounds read in validity copying in contiguous_split. (#9842) @nvdbaranec
Fix join of MultiIndex to Index with one column and overlapping name. (#9830) @vyasr
Fix caching in Series.applymap (#9821) @brandon-b-miller
Enforce boolean ascending for dask-cudf sort_values (#9814) @charlesbluca
Fix ORC writer crash with empty input columns (#9808) @vuule
Change default dtype of all nulls column from float to object (#9803) @galipremsagar
Load native dependencies when Java ColumnView is loaded (#9800) @jlowe
Fix dtype-argument bug in dask_cudf read_csv (#9796) @rjzamora
Fix overflow for min calculation in strings::from_timestamps (#9793) @revans2
Fix memory error due to lambda return type deduction limitation (#9778) @karthikeyann
Revert regex $/EOL end-of-string new-line special case handling (#9774) @davidwendt
Fix missing streams (#9767) @karthikeyann
Fix make_empty_scalar_like on list_type (#9759) @sperlingxx
Update cmake and conda to 22.02 (#9746) @devavret
Fix out-of-bounds memory write in decimal128-to-string conversion (#9740) @davidwendt
Match pandas scalar result types in reductions (#9717) @brandon-b-miller
Fix regex non-multiline EOL/$ matching strings ending with a new-line (#9715) @davidwendt
Fixed build by adding more checks for int8, int16 (#9707) @razajafri
Fix null handling when boolean dtype is passed (#9691) @galipremsagar
Fix stream usage in segmented_gather() (#9679) @mythrocks
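
To make the word-boundary fix (#9997) concrete, here is what `\b` in a replace pattern is expected to do, shown with Python's standard `re` module as a stand-in. This is an analogy under the assumption that the strings regex engine follows the usual `\b` semantics; it does not call cudf, and the sample text is invented.

```python
import re

# Invented sample text: "cat" appears as a whole word and inside "concat".
text = "cat concat cat"

# \b matches only at a transition between a word character and a non-word
# character (or the string edge), so the "cat" inside "concat" is skipped.
whole_words_replaced = re.sub(r"\bcat\b", "dog", text)

# Without the boundaries, every occurrence is replaced, including the
# one embedded in "concat".
all_replaced = re.sub(r"cat", "dog", text)
```

With the boundaries in place only the two standalone words change; without them the substring inside the longer word is rewritten as well.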

📖 Documentation

Update decimal dtypes related docs entries (#10072) @galipremsagar
Fix regex doc describing hexadecimal escape characters (#10009) @davidwendt
Fix cudf compilation instructions. (#9956) @esoha-nvidia
Fix see also links for IO APIs (#9895) @galipremsagar
Fix build instructions for libcudf doxygen (#9837) @davidwendt
Fix some doxygen warnings and add missing documentation (#9770) @karthikeyann
Update CUDA version in local build (#9736) @karthikeyann
Fix doxygen for enum types in libcudf (#9724) @davidwendt
Spell check fixes (#9682) @karthikeyann
Fix links in C++ Developer Guide. (#9675) @bdice

🚀 New Features

Remove libcudacxx patch needed for nvcc 11.4 (#10057) @robertmaynard
Allow CuPy 10 (#10048) @jakirkham
Add in support for NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) @revans2
Add groupby.transform (only support for aggregations) (#10005) @shwina
Add partitioning support to Parquet chunked writer (#10000) @devavret
Add jni for sequences (#9972) @wbo4958
Java bindings for mixed left, inner, and full joins (#9941) @jlowe
Java bindings for JSON reader support (#9940) @wbo4958
Enable transpose for string columns in cudf python (#9937) @galipremsagar
Support structs for cudf::contains with column/scalar input (#9929) @ttnghia
Implement mixed equality/conditional joins (#9917) @vyasr
Add cudf::strings::extract_all API (#9909) @davidwendt
Implement JNI for cudf::scatter APIs (#9903) @ttnghia
JNI: Function to copy and set validity from bool column. (#9901) @mythrocks
Add dictionary support to cudf::copy_if_else (#9887) @davidwendt
Add run_benchmarks target for running benchmarks with JSON output (#9879) @karthikeyann
Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
Add add_suffix and add_prefix for DataFrames and Series (#9846) @mayankanand007
Add JNI for cudf::drop_duplicates (#9841) @ttnghia
Implement per-list sequence (#9839) @ttnghia
Add Series.transpose (#9835) @mayankanand007
Adding support for Series.autocorr (#9833) @mayankanand007
Support round operation on datetime64 datatypes (#9820) @mayankanand007
Add partitioning support in parquet writer (#9810) @devavret
Raise temporary error for decimal128 types in parquet reader (#9804) @galipremsagar
Add decimal128 support to Parquet reader and writer (#9765) @vuule
Optimize groupby::scan (#9754) @PointKernel
Add sample JNI API (#9728) @res-life
Support min and max in inclusive scan for structs (#9725) @ttnghia
Add first and last method to IndexedFrame (#9710) @isVoid
Support min and max reduction for structs (#9697) @ttnghia
Add parameters to control row group size in Parquet writer (#9677) @vuule
Run compute-sanitizer in nightly build (#9641) @karthikeyann
Implement Series.datetime.floor (#9571) @skirui-source
Add ceil/floor for DatetimeIndex (#9554) @mayankanand007
Add support for decimal128 in cudf python (#9533) @galipremsagar
Implement lists::index_of() to find positions in list rows (#9510) @mythrocks
custreamz OAuth callback for Kafka (librdkafka) (#9486) @jdye64
Add Pearson correlation for sort groupby (python) (#9166) @skirui-source
Interchange dataframe protocol (#9071) @iskode
Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
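
The NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) are only named in the list above. Assuming they follow SQL-style three-valued logic as used by Spark, where a definite value can decide the result even when the other operand is null, their truth tables can be sketched in plain Python with `None` standing in for null. The function names below are hypothetical and do not call cudf.

```python
from typing import Optional


def null_logical_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: a definite False wins over null."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None  # outcome depends on the unknown operand
    return True


def null_logical_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: a definite True wins over null."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None  # outcome depends on the unknown operand
    return False
```

The difference from an ordinary null-propagating binop is the short-circuit case: null AND False yields False, and null OR True yields True, rather than null.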

🛠️ Improvements

Prepare upload scripts for Python 3.7 removal (#10092) @Ethyling
Simplify custreamz and cudf_kafka recipes files (#10065) @Ethyling
ORC writer API changes for granular statistics (#10058) @mythrocks
Remove Python constraints in custreamz and cudf_kafka recipes (#10052) @Ethyling
Unpin dask and distributed in CI (#10028) @galipremsagar
Add _from_column_like_self factory (#10022) @isVoid
Replace custom CUDA bindings previously provided by RMM with official CUDA Python bindings (#10008) @shwina
Use cuda::std::is_arithmetic in cudf::is_numeric trait. (#9996) @bdice
Clean up CUDA stream use in cuIO (#9991) @vuule
Use address-ordered first fit for the pinned memory pool (#9989) @rongou
Add strings tests to transpose_test.cpp (#9985) @davidwendt
Use gpuci_mamba_retry on Java CI. (#9983) @bdice
Remove deprecated method one_hot_encoding (#9977) @isVoid
Minor cleanup of unused Python functions (#9974) @vyasr
Use new efficient partitioned parquet writing in cuDF (#9971) @devavret
Remove str.subword_tokenize (#9968) @VibhuJawa
Forward-merge branch-21.12 to branch-22.02 (#9947) @bdice
Remove deprecated method parameter from merge and join. (#9944) @bdice
Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
Remove deprecated method Series.hash_encode. (#9942) @bdice
Use Ninja in Java CI build (#9933) @rongou
Add build-time publish step to cpu build script (#9927) @davidwendt
Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
Remove various unused functions (#9922) @vyasr
Raise in query if dtype is not supported (#9921) @brandon-b-miller
Add missing imports tests (#9920) @Ethyling
Spark Decimal128 hashing (#9919) @rwlee
Replace thrust/std::get with structured bindings (#9915) @codereport
Upgrade thrust version to 1.15 (#9912) @robertmaynard
Remove conda envs for CUDA 11.0 and 11.2. (#9910) @bdice
Return count of set bits from inplace_bitmask_and. (#9904) @bdice
Use dynamic nullate for join hasher and equality comparator (#9902) @davidwendt
Update ucx-py version on release using rvc (#9897) @Ethyling
Remove IncludeCategories from .clang-format (#9876) @codereport
Support statically linking CUDA runtime for Java bindings (#9873) @jlowe
Add clang-tidy to libcudf (#9860) @codereport
Remove deprecated methods from Java Table class (#9853) @jlowe
Add test for map column metadata handling in ORC writer (#9852) @vuule
Use pandas to_offset to parse frequency string in date_range (#9843) @isVoid
Add templated benchmark with fixture (#9838) @karthikeyann
Use list of column inputs for apply_boolean_mask (#9832) @isVoid
Added a few more tests for Decimal to String cast (#9818) @razajafri
Run doctests. (#9815) @bdice
Avoid overflow for fixed_point round (#9809) @sperlingxx
Move drop_duplicates, drop_na, _gather, take to IndexedFrame and create their _base_index counterparts (#9807) @isVoid
Use vector factories for host-device copies. (#9806) @bdice
Refactor host device macros (#9797) @vyasr
Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
Allow custom sort functions for dask-cudf sort_values (#9789) @charlesbluca
Improve build time of libcudf iterator tests (#9788) @davidwendt
Copy Java native dependencies directly into classpath (#9787) @jlowe
Add decimal types to cuIO benchmarks (#9776) @vuule
Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
Avoid overflow for fixed_point cudf::cast and performance optimization (#9772) @codereport
Use CTAD with Thrust function objects (#9768) @codereport
Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
Use Java classloader to find test resources (#9760) @jlowe
Allow cast decimal128 to string and add tests (#9756) @razajafri
Load balance optimization for contiguous_split (#9755) @nvdbaranec
Consolidate and improve reset_index (#9750) @isVoid
Update to UCX-Py 0.24 (#9748) @pentschev
Skip cufile tests in JNI build script (#9744) @pxLi
Enable string to decimal128 cast (#9742) @razajafri
Use stop instead of stop_. (#9735) @bdice
Forward-merge branch-21.12 to branch-22.02 (#9730) @bdice
Improve cmake format script (#9723) @vyasr
Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
Add directory-partitioned data support to cudf.read_parquet (#9720) @rjzamora
Use stream allocator adaptor for hash join table (#9704) @PointKernel
Update check for inf/nan strings in libcudf float conversion to ignore case (#9694) @davidwendt
Update cudf JNI to 22.02.0-SNAPSHOT (#9681) @pxLi
Replace cudf's concurrent_ordered_map with cuco::static_map in semi/anti joins (#9666) @vyasr
Some improvements to parse_decimal function and bindings for is_fixed_point (#9658) @razajafri
Add utility to format ninja-log build times (#9631) @davidwendt
Allow runtime has_nulls parameter for row operators (#9623) @davidwendt
Use fsspec.parquet for improved read_parquet performance from remote storage (#9589) @rjzamora
Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
Use List of Columns as Input for drop_nulls, gather and drop_duplicates (#9558) @isVoid
Simplify merge internals and reduce overhead (#9516) @vyasr
Add struct generation support in datagenerator & fuzz tests (#9180) @galipremsagar
Simplify write_csv by removing unnecessary writer/impl classes (#9089) @cwharris
