v22.02.00
🚨 Breaking Changes
- ORC writer API changes for granular statistics (#10058) @mythrocks
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Remove deprecated method `one_hot_encoding` (#9977) @isVoid
- Remove str.subword_tokenize (#9968) @VibhuJawa
- Remove deprecated `method` parameter from `merge` and `join`. (#9944) @bdice
- Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
- Remove deprecated method Series.hash_encode. (#9942) @bdice
- Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Add partitioning support in parquet writer (#9810) @devavret
- Move `drop_duplicates`, `drop_na`, `_gather`, `take` to IndexFrame and create their `_base_index` counterparts (#9807) @isVoid
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
- Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
- Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- Implement `lists::index_of()` to find positions in list rows (#9510) @mythrocks
- Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
🐛 Bug Fixes
- Add check for negative stripe index in ORC reader (#10074) @vuule
- Update Java tests to expect DECIMAL128 from Arrow (#10073) @jlowe
- Avoid index materialization when `DataFrame` is created with un-named `Series` objects (#10071) @galipremsagar
- fix gcc 11 compilation errors (#10067) @rongou
- Fix `columns` ordering issue in parquet reader (#10066) @galipremsagar
- Fix dataframe setitem with `ndarray` types (#10056) @galipremsagar
- Remove implicit copy due to conversion from cudf::size_type and size_t (#10045) @robertmaynard
- Include <optional> in headers that use std::optional (#10044) @robertmaynard
- Fix repr and concat of `StructColumn` (#10042) @galipremsagar
- Include row group level stats when writing ORC files (#10041) @vuule
- build.sh respects the `--build_metrics` and `--incl_cache_stats` flags (#10035) @robertmaynard
- Fix memory leaks in JNI native code. (#10029) @mythrocks
- Update JNI to use new arena mr constructor (#10027) @rongou
- Fix null check when comparing structs in `arg_min` operation of reduction/groupby (#10026) @ttnghia
- Wrap CI script shell variables in quotes to fix local testing. (#10018) @bdice
- cudftestutil no longer propagates compiler flags to external users (#10017) @robertmaynard
- Remove `CUDA_DEVICE_CALLABLE` macro usage (#10015) @hyperbolic2346
- Add missing list filling header in meta.yaml (#10007) @devavret
- Fix `conda` recipes for `custreamz` & `cudf_kafka` (#10003) @ajschmidt8
- Fix matching regex word-boundary (\b) in strings replace (#9997) @davidwendt
- Fix null check when comparing structs in `min` and `max` reduction/groupby operations (#9994) @ttnghia
- Fix octal pattern matching in regex string (#9993) @davidwendt
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Fix groupby shift/diff/fill after selecting from a `GroupBy` (#9984) @shwina
- Fix the overflow problem of decimal rescale (#9966) @sperlingxx
- Use default value for decimal precision in parquet writer when not specified (#9963) @devavret
- Fix cudf java build error. (#9958) @firestarman
- Use gpuci_mamba_retry to install local artifacts. (#9951) @bdice
- Fix regression HostColumnVectorCore requiring native libs (#9948) @jlowe
- Rename aggregate_metadata in writer to fix name collision (#9938) @devavret
- Fixed issue with percentile_approx where output tdigests could have uninitialized data at the end. (#9931) @nvdbaranec
- Resolve racecheck errors in ORC kernels (#9916) @vuule
- Fix the java build after parquet partitioning support (#9908) @revans2
- Fix compilation of benchmark for parquet writer. (#9905) @bdice
- Fix a memcheck error in ORC writer (#9896) @vuule
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Fix fallback to sort aggregation for grouping only hash aggregate (#9891) @abellina
- Add zlib to cudfjni link when using static libcudf library dependency (#9890) @jlowe
- TimedeltaIndex constructor raises an AttributeError. (#9884) @skirui-source
- Fix cudf.Scalar string datetime construction (#9875) @brandon-b-miller
- Load libcufile.so with RTLD_NODELETE flag (#9872) @vuule
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Fix null handling for structs `min` and `arg_min` in groupby, groupby scan, reduction, and inclusive_scan (#9864) @ttnghia
- Add one-level list encoding support in parquet reader (#9848) @PointKernel
- Fix an out-of-bounds read in validity copying in contiguous_split. (#9842) @nvdbaranec
- Fix join of MultiIndex to Index with one column and overlapping name. (#9830) @vyasr
- Fix caching in `Series.applymap` (#9821) @brandon-b-miller
- Enforce boolean `ascending` for dask-cudf `sort_values` (#9814) @charlesbluca
- Fix ORC writer crash with empty input columns (#9808) @vuule
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Load native dependencies when Java ColumnView is loaded (#9800) @jlowe
- Fix dtype-argument bug in dask_cudf read_csv (#9796) @rjzamora
- Fix overflow for min calculation in strings::from_timestamps (#9793) @revans2
- Fix memory error due to lambda return type deduction limitation (#9778) @karthikeyann
- Revert regex $/EOL end-of-string new-line special case handling (#9774) @davidwendt
- Fix missing streams (#9767) @karthikeyann
- Fix make_empty_scalar_like on list_type (#9759) @sperlingxx
- Update cmake and conda to 22.02 (#9746) @devavret
- Fix out-of-bounds memory write in decimal128-to-string conversion (#9740) @davidwendt
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Fix regex non-multiline EOL/$ matching strings ending with a new-line (#9715) @davidwendt
- Fixed build by adding more checks for int8, int16 (#9707) @razajafri
- Fix `null` handling when `boolean` dtype is passed (#9691) @galipremsagar
- Fix stream usage in `segmented_gather()` (#9679) @mythrocks
📖 Documentation
- Update `decimal` dtypes related docs entries (#10072) @galipremsagar
- Fix regex doc describing hexadecimal escape characters (#10009) @davidwendt
- Fix cudf compilation instructions. (#9956) @esoha-nvidia
- Fix see also links for IO APIs (#9895) @galipremsagar
- Fix build instructions for libcudf doxygen (#9837) @davidwendt
- Fix some doxygen warnings and add missing documentation (#9770) @karthikeyann
- update cuda version in local build (#9736) @karthikeyann
- Fix doxygen for enum types in libcudf (#9724) @davidwendt
- Spell check fixes (#9682) @karthikeyann
- Fix links in C++ Developer Guide. (#9675) @bdice
🚀 New Features
- Remove libcudacxx patch needed for nvcc 11.4 (#10057) @robertmaynard
- Allow CuPy 10 (#10048) @jakirkham
- Add in support for NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) @revans2
- Add `groupby.transform` (only support for aggregations) (#10005) @shwina
- Add partitioning support to Parquet chunked writer (#10000) @devavret
- Add jni for sequences (#9972) @wbo4958
- Java bindings for mixed left, inner, and full joins (#9941) @jlowe
- Java bindings for JSON reader support (#9940) @wbo4958
- Enable transpose for string columns in cudf python (#9937) @galipremsagar
- Support structs for `cudf::contains` with column/scalar input (#9929) @ttnghia
- Implement mixed equality/conditional joins (#9917) @vyasr
- Add cudf::strings::extract_all API (#9909) @davidwendt
- Implement JNI for `cudf::scatter` APIs (#9903) @ttnghia
- JNI: Function to copy and set validity from bool column. (#9901) @mythrocks
- Add dictionary support to cudf::copy_if_else (#9887) @davidwendt
- add run_benchmarks target for running benchmarks with json output (#9879) @karthikeyann
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Add `add_suffix` and `add_prefix` for DataFrames and Series (#9846) @mayankanand007
- Add JNI for `cudf::drop_duplicates` (#9841) @ttnghia
- Implement per-list sequence (#9839) @ttnghia
- Adding `series.transpose` (#9835) @mayankanand007
- Adding support for `Series.autocorr` (#9833) @mayankanand007
- Support round operation on datetime64 datatypes (#9820) @mayankanand007
- Add partitioning support in parquet writer (#9810) @devavret
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Optimize `groupby::scan` (#9754) @PointKernel
- Add sample JNI API (#9728) @res-life
- Support `min` and `max` in inclusive scan for structs (#9725) @ttnghia
- Add `first` and `last` method to `IndexedFrame` (#9710) @isVoid
- Support `min` and `max` reduction for structs (#9697) @ttnghia
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Run compute-sanitizer in nightly build (#9641) @karthikeyann
- Implement Series.datetime.floor (#9571) @skirui-source
- ceil/floor for `DatetimeIndex` (#9554) @mayankanand007
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- Implement `lists::index_of()` to find positions in list rows (#9510) @mythrocks
- custreamz oauth callback for kafka (librdkafka) (#9486) @jdye64
- Add Pearson correlation for sort groupby (python) (#9166) @skirui-source
- Interchange dataframe protocol (#9071) @iskode
- Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
🛠️ Improvements
- Prepare upload scripts for Python 3.7 removal (#10092) @Ethyling
- Simplify custreamz and cudf_kafka recipes files (#10065) @Ethyling
- ORC writer API changes for granular statistics (#10058) @mythrocks
- Remove python constraints in custreamz and cudf_kafka recipes (#10052) @Ethyling
- Unpin `dask` and `distributed` in CI (#10028) @galipremsagar
- Add `_from_column_like_self` factory (#10022) @isVoid
- Replace custom CUDA bindings previously provided by RMM with official CUDA Python bindings (#10008) @shwina
- Use `cuda::std::is_arithmetic` in `cudf::is_numeric` trait. (#9996) @bdice
- Clean up CUDA stream use in cuIO (#9991) @vuule
- Use address-ordered first fit for the pinned memory pool (#9989) @rongou
- Add strings tests to transpose_test.cpp (#9985) @davidwendt
- Use gpuci_mamba_retry on Java CI. (#9983) @bdice
- Remove deprecated method `one_hot_encoding` (#9977) @isVoid
- Minor cleanup of unused Python functions (#9974) @vyasr
- Use new efficient partitioned parquet writing in cuDF (#9971) @devavret
- Remove str.subword_tokenize (#9968) @VibhuJawa
- Forward-merge branch-21.12 to branch-22.02 (#9947) @bdice
- Remove deprecated `method` parameter from `merge` and `join`. (#9944) @bdice
- Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
- Remove deprecated method Series.hash_encode. (#9942) @bdice
- use ninja in java ci build (#9933) @rongou
- Add build-time publish step to cpu build script (#9927) @davidwendt
- Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
- Remove various unused functions (#9922) @vyasr
- Raise in `query` if dtype is not supported (#9921) @brandon-b-miller
- Add missing imports tests (#9920) @Ethyling
- Spark Decimal128 hashing (#9919) @rwlee
- Replace `thrust/std::get` with structured bindings (#9915) @codereport
- Upgrade thrust version to 1.15 (#9912) @robertmaynard
- Remove conda envs for CUDA 11.0 and 11.2. (#9910) @bdice
- Return count of set bits from inplace_bitmask_and. (#9904) @bdice
- Use dynamic nullate for join hasher and equality comparator (#9902) @davidwendt
- Update ucx-py version on release using rvc (#9897) @Ethyling
- Remove `IncludeCategories` from `.clang-format` (#9876) @codereport
- Support statically linking CUDA runtime for Java bindings (#9873) @jlowe
- Add `clang-tidy` to libcudf (#9860) @codereport
- Remove deprecated methods from Java Table class (#9853) @jlowe
- Add test for map column metadata handling in ORC writer (#9852) @vuule
- Use pandas `to_offset` to parse frequency string in `date_range` (#9843) @isVoid
- Add templated benchmark with fixture (#9838) @karthikeyann
- Use list of column inputs for `apply_boolean_mask` (#9832) @isVoid
- Added a few more tests for Decimal to String cast (#9818) @razajafri
- Run doctests. (#9815) @bdice
- Avoid overflow for fixed_point round (#9809) @sperlingxx
- Move `drop_duplicates`, `drop_na`, `_gather`, `take` to IndexFrame and create their `_base_index` counterparts (#9807) @isVoid
- Use vector factories for host-device copies. (#9806) @bdice
- Refactor host device macros (#9797) @vyasr
- Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
- Allow custom sort functions for dask-cudf `sort_values` (#9789) @charlesbluca
- Improve build time of libcudf iterator tests (#9788) @davidwendt
- Copy Java native dependencies directly into classpath (#9787) @jlowe
- Add decimal types to cuIO benchmarks (#9776) @vuule
- Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
- Avoid overflow for `fixed_point` `cudf::cast` and performance optimization (#9772) @codereport
- Use CTAD with Thrust function objects (#9768) @codereport
- Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
- Use Java classloader to find test resources (#9760) @jlowe
- Allow cast decimal128 to string and add tests (#9756) @razajafri
- Load balance optimization for contiguous_split (#9755) @nvdbaranec
- Consolidate and improve `reset_index` (#9750) @isVoid
- Update to UCX-Py 0.24 (#9748) @pentschev
- Skip cufile tests in JNI build script (#9744) @pxLi
- Enable string to decimal 128 cast (#9742) @razajafri
- Use stop instead of stop_. (#9735) @bdice
- Forward-merge branch-21.12 to branch-22.02 (#9730) @bdice
- Improve cmake format script (#9723) @vyasr
- Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
- Add directory-partitioned data support to cudf.read_parquet (#9720) @rjzamora
- Use stream allocator adaptor for hash join table (#9704) @PointKernel
- Update check for inf/nan strings in libcudf float conversion to ignore case (#9694) @davidwendt
- Update cudf JNI to 22.02.0-SNAPSHOT (#9681) @pxLi
- Replace cudf's concurrent_ordered_map with cuco::static_map in semi/anti joins (#9666) @vyasr
- Some improvements to `parse_decimal` function and bindings for `is_fixed_point` (#9658) @razajafri
- Add utility to format ninja-log build times (#9631) @davidwendt
- Allow runtime has_nulls parameter for row operators (#9623) @davidwendt
- Use fsspec.parquet for improved read_parquet performance from remote storage (#9589) @rjzamora
- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
- Use List of Columns as Input for `drop_nulls`, `gather` and `drop_duplicates` (#9558) @isVoid
- Simplify merge internals and reduce overhead (#9516) @vyasr
- Add `struct` generation support in datagenerator & fuzz tests (#9180) @galipremsagar
- Simplify write_csv by removing unnecessary writer/impl classes (#9089) @cwharris