v23.04.01
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Make string methods return a Series with a useful Index (#12814) @shwina
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
🐛 Bug Fixes
- Pin curand version (#13127) @vyasr
- Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
- Fix `DataFrame` constructor to broadcast scalar inputs properly (#12997) @galipremsagar
- Drop `force_nullable_schema` from chunked parquet writer (#12996) @galipremsagar
- Fix gtest column utility comparator diff reporting (#12995) @davidwendt
- Handle index names while performing `groupby` (#12992) @galipremsagar
- Fix `__setitem__` on string columns when the scalar value ends in a null byte (#12991) @wence-
- Fix `sort_values` when column is all empty strings (#12988) @eriknw
- Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
- Pre-emptive fix for upstream `dask.dataframe.read_parquet` changes (#12983) @rjzamora
- Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
- Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
- cudftestutil supports static gtest dependencies (#12957) @robertmaynard
- Include gtest in build environment. (#12956) @vyasr
- Correctly handle scalar indices in `Index.__getitem__` (#12955) @wence-
- Avoid building cython twice (#12945) @galipremsagar
- Fix set index error for Series rolling window operations (#12942) @galipremsagar
- Fix calculation of null counts for Parquet statistics (#12938) @etseidl
- Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
- Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
- Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
- Fix conda recipe post-link.sh typo (#12916) @pentschev
- min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
- Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
- Use python -m pytest for nightly wheel tests (#12871) @bdice
- Parquet writer column_size() should return a size_t (#12870) @etseidl
- Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
- Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
- Remove tokenizers pre-install pinning. (#12854) @vyasr
- Fix parquet `RangeIndex` bug (#12838) @rjzamora
- Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
- Make string methods return a Series with a useful Index (#12814) @shwina
- Tell cudf_kafka to use header-only fmt (#12796) @vyasr
- Add `GroupBy.dtypes` (#12783) @galipremsagar
- Fix a leak in a test and clarify some test names (#12781) @revans2
- Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
- Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
- Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
- Fix a bug with `num_keys` in `_scatter_by_slice` (#12749) @thomcom
- Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
- Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
- Add `always_nullable` flag to Dremel encoding (#12727) @divyegala
- Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
- Fix faulty conditional logic in JIT `GroupBy.apply` (#12706) @brandon-b-miller
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Handle parquet list data corner case (#12698) @nvdbaranec
- Fix missing trailing comma in json writer (#12688) @karthikeyann
- Remove child from newCudaAsyncMemoryResource (#12681) @abellina
- Handle bool types in `round` API (#12670) @galipremsagar
- Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
- Fix `from_arrow` to load a sliced arrow table (#12665) @galipremsagar
- Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
- Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
- Fix `find_common_dtype` and `values` to handle complex dtypes (#12537) @galipremsagar
- Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
- Fix `Series` comparison vs scalars (#12519) @brandon-b-miller
- Allow casting from `UDFString` back to `StringView` to call methods in `strings_udf` (#12363) @brandon-b-miller
📖 Documentation
- Fix `GroupBy.apply` doc examples rendering (#12994) @brandon-b-miller
- add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
- Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
- Add README symlink for dask-cudf. (#12946) @bdice
- Remove return type from @return doxygen tags (#12908) @davidwendt
- Fix docs build to be `pydata-sphinx-theme=0.13.0` compatible (#12874) @galipremsagar
- Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
- Enable doctests for GroupBy methods (#12658) @brandon-b-miller
- Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt
🚀 New Features
- Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
- Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
- Refactor orc chunked writer (#12949) @ttnghia
- Make Parquet writer `nullable` option application to single table writes (#12933) @vuule
- Refactor `io::orc::ProtobufWriter` (#12877) @ttnghia
- Make timezone table independent from ORC (#12805) @vuule
- Cache JIT `GroupBy.apply` functions (#12802) @brandon-b-miller
- Implement initial support for avro logical types (#6482) (#12788) @tpn
- Update `tests/column_utilities` to use `experimental::equality` row comparator (#12777) @divyegala
- Update `distinct/unique_count` to `experimental::row` hasher/comparator (#12776) @divyegala
- Update `hash_partition` to use `experimental::row::row_hasher` (#12761) @divyegala
- Update `is_sorted` to use `experimental::row::lexicographic` (#12752) @divyegala
- Update default data source in cuio reader benchmarks (#12740) @PointKernel
- Reenable stream identification library in CI (#12714) @vyasr
- Add `regex_program` strings splitting java APIs and tests (#12713) @cindyyuanjiang
- Add `regex_program` strings replacing java APIs and tests (#12701) @cindyyuanjiang
- Add `regex_program` strings extract java APIs and tests (#12699) @cindyyuanjiang
- Variable fragment sizes for Parquet writer (#12685) @etseidl
- Add segmented reduction support for fixed-point types (#12680) @davidwendt
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Add `regex_program` searching APIs and related java classes (#12666) @cindyyuanjiang
- Add logging to libcudf (#12637) @vuule
- Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
- Convert `rank` to use experimental row comparators (#12481) @divyegala
- Use rapids-cmake parallel testing feature (#12451) @robertmaynard
- Enable detection of undesired stream usage (#12089) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Pin cupy in wheel tests to supported versions (#13041) @vyasr
- Pin numba version (#13001) @vyasr
- Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
- Stop setting package version attribute in wheels (#12977) @vyasr
- Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
- Remove default detail mrs: part7 (#12970) @vyasr
- Remove default detail mrs: part6 (#12969) @vyasr
- Remove default detail mrs: part5 (#12968) @vyasr
- Remove default detail mrs: part4 (#12967) @vyasr
- Remove default detail mrs: part3 (#12966) @vyasr
- Remove default detail mrs: part2 (#12965) @vyasr
- Remove default detail mrs: part1 (#12964) @vyasr
- Add `force_nullable_schema` parameter to Parquet writer. (#12952) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Remove remaining default stream parameters (#12943) @vyasr
- Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
- Implement `groupby.head` and `groupby.tail` (#12939) @wence-
- Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
- Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
- Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
- Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
- Pass `SCCACHE_S3_USE_SSL` to conda builds (#12910) @ajschmidt8
- Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
- Generate pyproject dependencies using dfg (#12906) @vyasr
- Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
- Fix `moto` env vars & pass `AWS_SESSION_TOKEN` to conda builds (#12902) @ajschmidt8
- Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
- Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
- Deprecate `line_terminator` in favor of `lineterminator` in `to_csv` (#12896) @wence-
- Add `stream` and `mr` parameters for `structs::detail::flatten_nested_columns` (#12892) @ttnghia
- Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
- Remove default parameters from detail headers in include (#12888) @vyasr
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Implement `groupby.sample` (#12882) @wence-
- Update JNI build ENV default to gcc 11 (#12881) @pxLi
- Change return type of `cudf::structs::detail::flatten_nested_columns` to smart pointer (#12878) @ttnghia
- Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
- Remove manual artifact upload step in CI (#12869) @ajschmidt8
- Update to GCC 11 (#12868) @bdice
- Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
- Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
- Update RMM allocators (#12861) @pentschev
- Improve performance for replace-multi for long strings (#12858) @davidwendt
- Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
- Migrate as much as possible to pyproject.toml (#12850) @vyasr
- Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
- Setting a threshold for KvikIO IO (#12841) @madsbk
- Update datasets download URL (#12840) @jjacobelli
- Make docs builds less verbose (#12836) @AyodeAwe
- Consolidate linter configs into pyproject.toml (#12834) @vyasr
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `inplace` parameters in categorical methods (#12824) @galipremsagar
- Add optional text file support to ninja-log utility (#12823) @davidwendt
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Add dfg as a pre-commit hook (#12819) @vyasr
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
- Fixing parquet coalescing of reads (#12808) @hyperbolic2346
- CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
- Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
- Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
- Expose seed argument to hash_values (#12795) @ayushdg
- Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
- Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
- Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
- Stop force pulling fmt in nvbench. (#12768) @vyasr
- Remove now redundant cuda initialization (#12758) @vyasr
- Adds JSON reader, writer io benchmark (#12753) @karthikeyann
- Use test paths relative to package directory. (#12751) @bdice
- Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
- Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
- Stop using versioneer to manage versions (#12741) @vyasr
- Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
- Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
- Update shared workflow branches (#12733) @ajschmidt8
- JNI switches to nested JSON reader (#12732) @res-life
- Changing `cudf::io::source_info` to use `cudf::host_span<std::byte>` in a non-breaking form (#12730) @hyperbolic2346
- Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
- Split C++ and Python build dependencies into separate lists. (#12724) @bdice
- Add build dependencies to Java tests. (#12723) @bdice
- Allow setting the seed argument for hash partition (#12715) @firestarman
- Remove gpuCI scripts. (#12712) @bdice
- Unpin `dask` and `distributed` for development (#12710) @galipremsagar
- `partition_by_hash()`: use `_split()` (#12704) @madsbk
- Remove DataFrame.quantiles from docs. (#12684) @bdice
- Fast path for `experimental::row::equality` (#12676) @divyegala
- Move date to build string in `conda` recipe (#12661) @ajschmidt8
- Refactor reduction logic for fixed-point types (#12652) @davidwendt
- Pay off some JNI RMM API tech debt (#12632) @revans2
- Merge `copy-on-write` feature branch into `branch-23.04` (#12619) @galipremsagar
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Pin cuda-nvrtc. (#12606) @bdice
- Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
- Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
- Add performance benchmarks to user facing docs (#12595) @galipremsagar
- Add docs build job (#12592) @AyodeAwe
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
- Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora