v22.12.00
🚨 Breaking Changes
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove unused `managed_allocator` (#12005) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Remove validation that requires introspection (#11938) @vyasr
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
🐛 Bug Fixes
- Fix include line for IO Cython modules (#12250) @vyasr
- Make dask pinning looser (#12231) @vyasr
- Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
- Fix `from_dict` backend dispatch to match upstream `dask` (#12203) @galipremsagar
- Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
- Fix compression in ORC writer (#12194) @vuule
- Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
- Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
- Fix decimal binary operations (#12142) @galipremsagar
- Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
- Safely allocate `udf_string` pointers in `strings_udf` (#12138) @brandon-b-miller
- Fix/disable jitify lto (#12122) @robertmaynard
- Fix conditional_full_join benchmark (#12121) @GregoryKimball
- Fix regex working-memory-size refactor error (#12119) @davidwendt
- Add in negative size checks for columns (#12118) @revans2
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Fix reading of CSV files with blank second row (#12098) @vuule
- Fix an error in IO with `GzipFile` type (#12085) @galipremsagar
- Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
- Fix alignment of compressed blocks in ORC writer (#12077) @vuule
- Fix singleton-range `__setitem__` edge case (#12075) @wence-
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Force using old fmt in nvbench. (#12067) @vyasr
- Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
- Allow falling back to `shim_60.ptx` by default in `strings_udf` (#12056) @brandon-b-miller
- Force black exclusions for pre-commit. (#12036) @bdice
- Add `memory_usage` & `items` implementation for `Struct` column & dtype (#12033) @galipremsagar
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
- Fix issues when both `usecols` and `names` options are used in `read_csv` (#12018) @vuule
- Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
- Revert "Replace most of preprocessor usage in nvcomp adapter with `constexpr`" (#11999) @vuule
- Fix bug where `df.loc` resulting in single row could give wrong index (#11998) @eriknw
- Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
- Fix maximum page size estimate in Parquet writer (#11962) @vuule
- Fix local offset handling in bgzip reader (#11918) @upsj
- Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
- Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
- Fix type casting in Series.setitem (#11904) @wence-
- Fix memcheck error in get_dremel_data (#11903) @davidwendt
- Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
- Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
- Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
- Fix writing of Parquet files with many fragments (#11869) @etseidl
- Fix RangeIndex unary operators. (#11868) @vyasr
- JNI Avoid NPE for reading host binary data (#11865) @revans2
- Fix decimal benchmark input data generation (#11863) @karthikeyann
- Fix pre-commit copyright check (#11860) @galipremsagar
- Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
- Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
- Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
- Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
- Add V2 page header support to Parquet reader (#11778) @etseidl
- Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
- Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice
📖 Documentation
- Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
- Add symlinks to notebooks. (#12128) @bdice
- Add `truncate` API to python doc pages (#12109) @galipremsagar
- Update Numba docs links. (#12107) @bdice
- Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
- Fix link to c++ developer guide from `CONTRIBUTING.md` (#12084) @brandon-b-miller
- Add pivot_table and crosstab to docs. (#12014) @bdice
- Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
- Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
- Add dtype docs pages and docstrings for `cudf` specific dtypes (#11974) @galipremsagar
- Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
- Rename libcudf++ to libcudf. (#11953) @bdice
- Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
- Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
- Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
- Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
- Add developer docs for writing tests (#11199) @vyasr
🚀 New Features
- Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
- Support `+` in `strings_udf` (#12117) @brandon-b-miller
- Support `upper` and `lower` in `strings_udf` (#12099) @brandon-b-miller
- Add wheel builds (#12096) @vyasr
- Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
- Support `strip`, `lstrip`, and `rstrip` in `strings_udf` (#12091) @brandon-b-miller
- Mark nvcomp zstd compression stable (#12059) @jbrennan333
- Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
- Enable building against the libarrow contained in pyarrow (#12034) @vyasr
- Add strings `like` jni and native method (#12032) @cindyyuanjiang
- Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
- byte_range support for JSON Lines format (#12017) @karthikeyann
- Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
- Add inplace arithmetic operators to `MaskedType` (#11987) @brandon-b-miller
- Implement JNI for chunked Parquet reader (#11961) @ttnghia
- Add method argument to DataFrame.quantile (#11957) @rjzamora
- Add gpu memory watermark apis to JNI (#11950) @abellina
- Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
- Enable returning string data from UDFs used through `apply` (#11933) @brandon-b-miller
- Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
- Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Enable CEC for `strings_udf` (#11884) @brandon-b-miller
- ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
- Implement chunked Parquet reader (#11867) @ttnghia
- Add `read_orc_metadata` to libcudf (#11815) @vuule
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
🛠️ Improvements
- Reduce number of tests marked `spilling` (#12197) @madsbk
- Pin `dask` and `distributed` for release (#12165) @galipremsagar
- Don't rely on GNU find in headers_test.sh (#12164) @wence-
- Update cp.clip call (#12148) @quasiben
- Enable automatic column projection in groupby().agg (#12124) @rjzamora
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Spilling to host memory (#12106) @madsbk
- First pass of `pd.read_orc` changes in tests (#12103) @galipremsagar
- Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
- Remove CUDA 10 compatibility code. (#12088) @bdice
- Move and update `dask` nightly install in CI (#12082) @galipremsagar
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Remove macros that inspect the contents of exceptions (#12076) @vyasr
- Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
- Remove overflow error during decimal binops (#12063) @galipremsagar
- Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
- Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
- Add support for `DataFrame.from_dict`, `to_dict` and `Series.to_dict` (#12048) @galipremsagar
- Refactor Parquet reader (#12046) @ttnghia
- Forward merge 22.10 into 22.12 (#12045) @vyasr
- Standardize newlines at ends of files. (#12042) @bdice
- Trim trailing whitespace from all files. (#12041) @bdice
- Use nosync policy in gather and scatter implementations. (#12038) @bdice
- Remove smart quotes from all docstrings. (#12035) @bdice
- Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
- Add cython-lint to pre-commit checks. (#12020) @bdice
- Use pragma once (#12019) @bdice
- New GHA to add issues/prs to project board (#12016) @jarmak-nv
- Add DataFrame.pivot_table. (#12015) @bdice
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove default parameters for nvtext::detail functions (#12007) @davidwendt
- Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
- Remove unused `managed_allocator` (#12005) @vyasr
- Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
- Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
- Ignore python docs build artifacts (#12000) @galipremsagar
- Use rapids-cmake for google benchmark. (#11997) @vyasr
- Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
- Remove stale labeler (#11995) @raydouglass
- Move protobuf compilation to CMake (#11986) @vyasr
- Replace most of preprocessor usage in nvcomp adapter with `constexpr` (#11980) @vuule
- Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
- Feature/remove default streams (#11967) @vyasr
- Add pool memory resource to libcudf basic example (#11966) @davidwendt
- Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Add deprecation warning for set_allocator. (#11958) @vyasr
- Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
- Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
- Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Add `strip_delimiters` option to `read_text` (#11946) @upsj
- Refactor multibyte_split `output_builder` (#11945) @upsj
- Remove validation that requires introspection (#11938) @vyasr
- Add `.str.find_multiple` API (#11928) @galipremsagar
- Add regex_program class for use with all regex APIs (#11927) @davidwendt
- Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
- Performance improvement in JSON Tree traversal (#11919) @karthikeyann
- Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
- Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
- Add `nanosecond` & `microsecond` to `DatetimeProperties` (#11911) @galipremsagar
- Pin mimesis version in setup.py. (#11906) @bdice
- Error on `ListColumn` or any new unsupported column in `cudf.Index` (#11902) @galipremsagar
- Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
- Relax `codecov` threshold diff (#11899) @galipremsagar
- Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
- Add coverage for string UDF tests. (#11891) @vyasr
- Provide `data_chunk_source` wrapper for `datasource` (#11886) @upsj
- Handle `multibyte_split` byte_range out-of-bounds offsets on host (#11885) @upsj
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
- Add ngroup (#11871) @shwina
- Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
- Unpin `dask` and `distributed` for development (#11859) @galipremsagar
- Remove unused includes for table/row_operators (#11857) @GregoryKimball
- Use conda-forge's `pyorc` (#11855) @jakirkham
- Add libcudf strings examples (#11849) @davidwendt
- Remove `cudf_io` namespace alias (#11827) @vuule
- Test/remove thrust vector usage (#11813) @vyasr
- Add BGZIP reader to python `read_text` (#11802) @upsj
- Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
- Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
- Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
- Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
- Add BGZIP multibyte_split benchmark (#11723) @upsj
- Bifurcate Dependency Lists (#11674) @bdice
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
- Make all `nvcc` warnings into errors (#8916) @trxcllnt