v24.04.01
🚨 Breaking Changes
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Upgrade to arrow-14.0.2 (#15108) @galipremsagar
- Add support for pandas-2.2 in cudf (#15100) @galipremsagar
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Add future_stack to DataFrame.stack (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Add pandas-2.x support in cudf (#14916) @galipremsagar
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
🐛 Bug Fixes
- Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
- Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
- [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
- Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
- Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
- Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
- Fix OOB read in inflate_kernel (#15309) @vuule
- Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
- Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
- Fix Doxygen check (#15289) @KyleFromNVIDIA
- Reintroduce PANDAS_GE_220 import (#15287) @wence-
- Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
- Fix Parquet decimal64 stats (#15281) @etseidl
- Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
- Work around compute-sanitizer memcheck bug (#15259) @davidwendt
- Clean up hostdevice_vector and add more APIs (#15252) @ttnghia
- Fix number of rows in randomly generated lists columns (#15248) @vuule
- Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
- Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
- Fix accessing .columns by an external API (#15212) @galipremsagar
- [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
- Update labeler and codeowner configs for CMake files (#15208) @PointKernel
- Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
- Fix memcheck error in distinct inner join (#15164) @PointKernel
- Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
- Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
- Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
- Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
- Remove const from range_window_bounds::_extent. (#15138) @mythrocks
- DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
- Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
- Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
- Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
- Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
- Add support for arrow large_string in cudf (#15093) @galipremsagar
- Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
- Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
- Fix bugs in handling of delta encodings (#15075) @etseidl
- Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
- Eliminate duplicate allocation of nested string columns (#15061) @vuule
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add future_stack to DataFrame.stack (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix DataFrame.sort_index to respect ignore_index on all axes (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
- Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
- Unset CUDF_SPILL after a pytest (#14958) @galipremsagar
- Fix null literals being parsed as strings when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
- Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
- Fix reading offset for data stream in ORC reader (#14911) @ttnghia
- Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
- Ensure slow private attrs are maybe proxies (#14380) @mroeschke
📖 Documentation
- Ignore DLManagedTensor in the docs build (#15392) @davidwendt
- Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
- Temporarily disable docs errors. (#15265) @bdice
- Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
- Fix broken link for developer guide (#15025) @sanjana098
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Document how cuDF is pronounced (#14753) @pentschev
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
- Use JNI pinned pool resource with cuIO (#15255) @abellina
- Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
- Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
- [JNI] rmm based pinned pool (#15219) @abellina
- Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
- Enable creation of columns from scalar (#15181) @vyasr
- Use NVTX from GitHub. (#15178) @bdice
- Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
- Implement search using pylibcudf (#15166) @vyasr
- Add distinct left join (#15149) @PointKernel
- Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
- Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
- Automate include grouping order in .clang-format (#15063) @harrism
- Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
- API for JSON unquoted whitespace normalization (#15033) @shrshi
- Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
- Implement replace in pylibcudf (#15005) @vyasr
- Add distinct key inner join (#14990) @PointKernel
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- Support casting of Map type to string in JSON reader (#14936) @karthikeyann
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Support for LZ4 compression in ORC and Parquet (#14906) @vuule
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Backport: Relax protobuf lower bound to 3.20. (#15506) (#15610) @bdice
- Use conda env create --yes instead of --force (#15403) @bdice
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Enable branch testing for cudf.pandas (#15316) @galipremsagar
- Replace black with ruff-format (#15312) @mroeschke
- Fix an NPE when reading empty JSON data by adding a new API for missing information (#15307) @revans2
- Address poor performance of Parquet string decoding (#15304) @etseidl
- Update script input name (#15301) @AyodeAwe
- Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
- Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
- Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
- Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
- Implement grouped product scan (#15254) @wence-
- Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
- Implement DataFrame|Series.squeeze (#15244) @mroeschke
- Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
- Remove create_chars_child_column utility (#15241) @davidwendt
- Update dlpack to version 0.8 (#15237) @dantegd
- Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
- Remove row conversion code from libcudf (#15234) @ttnghia
- Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
- Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
- Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
- Clean up usage of CUDA_ARCH and other macros. (#15218) @bdice
- DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
- Rewrite conversion in terms of column (#15213) @vyasr
- Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
- Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
- Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
- Tune up row size estimation in the data generator (#15202) @vuule
- Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Fix includes for row_operators.cuh (#15194) @davidwendt
- Generalize GHA selectors for pure Python testing (#15191) @bdice
- Improvements for __cuda_array_interface__ tests (#15188) @bdice
- Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
- Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
- Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
- [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
- Change make_strings_children to return uvector (#15171) @davidwendt
- Don't override to_pandas for Datelike columns (#15167) @mroeschke
- Drop python-snappy from dependencies. (#15161) @bdice
- Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
- Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
- Java bindings for left outer distinct join (#15154) @jlowe
- Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
- Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
- Add java option to keep quotes for JSON reads (#15146) @revans2
- Change cross-pandas-version testing in cudf (#15145) @galipremsagar
- Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
- Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
- Simplify some to_pandas implementations (#15123) @mroeschke
- Java: Add leak tracking for Scalar instances (#15121) @jlowe
- Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
- Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
- Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
- Upgrade to arrow-14.0.2 (#15108) @galipremsagar
- Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
- Add support for pandas-2.2 in cudf (#15100) @galipremsagar
- Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
- Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
- Validate types in pylibcudf Column/Table constructors (#15088) @wence-
- xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
- Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
- Adjust test_binops for pandas 2.2 (#15078) @mroeschke
- Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
- Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
- Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
- xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
- target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
- Implement stable version of cudf::sort (#15066) @wence-
- Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
- Adjust test_joining for pandas 2.2 (#15060) @mroeschke
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
- Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
- Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
- Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
- Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
- Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
- Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
- Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
- Clean up nvtx macros (#15038) @PointKernel
- Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
- Expose libcudf filter expression in read_parquet (#15028) @wence-
- Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
- Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
- Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
- JNI bindings for distinct_hash_join (#15019) @jlowe
- Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
- Improve performance of copy_if_else for long strings (#15017) @davidwendt
- Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
- Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
- Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
- Align integral types in ORC to specs (#15008) @vuule
- Clean up detail sequence header inclusion (#15007) @PointKernel
- Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
- Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
- Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
- Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
- Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
- Deprecate groupby fillna (#15000) @mroeschke
- Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
- Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
- Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
- Filter all DeprecationWarnings by ArrowTable.to_pandas() (#14989) @galipremsagar
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Ensure that ctest is called with --no-tests=error. (#14983) @bdice
- Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
- Update ops-bot.yaml (#14974) @AyodeAwe
- Use page statistics in Parquet reader (#14973) @etseidl
- Use fused types for overloaded function signatures (#14969) @vyasr
- Deprecate certain frequency strings (#14967) @galipremsagar
- Update copyrights for 24.04. (#14964) @bdice
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
- JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
- Make codecov only informational (always pass). (#14952) @bdice
- Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
- Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
- Update tests for pandas 2. (#14941) @bdice
- Use more public pandas APIs (#14929) @mroeschke
- Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
- Add pandas-2.x support in cudf (#14916) @galipremsagar
- Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
- De-DOS line-endings (#14880) @wence-
- Add detail cuco_allocator (#14877) @PointKernel
- Move all core types to using enum class in Cython (#14876) @vyasr
- Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
- Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
- Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
- Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
- Update cudf for compatibility with the latest cuco (#14849) @PointKernel
- Remove deprecated strings functions (#14848) @davidwendt
- Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
- Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
- Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
- Fix calls to deprecated strings factory API in examples. (#14838) @bdice
- Update pre-commit hooks (#14837) @bdice
- Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
- Remove get_mem_info functions from custom memory resources (#14832) @harrism
- Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
- Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
- Branch 24.04 merge branch 24.02 (#14809) @vyasr
- Branch 24.04 merge branch 24.02 (#14806) @vyasr
- Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
- Remove build_struct|list_column (#14786) @mroeschke
- Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
- Reduce execution time of Python ORC tests (#14776) @vuule
- Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
- Use offsetalator in cudf::strings::findall (#14745) @davidwendt
- Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
- Use get_offset_value utility in strings shift function (#14743) @davidwendt
- Use as_column instead of full (#14698) @mroeschke
- List all notable breaking changes (#13535) @galipremsagar