Release v21.10.00 · rapidsai/cudf

🚨 Breaking Changes

Remove Cython APIs for table view generation (#9199) @vyasr
Upgrade pandas version in cudf (#9147) @galipremsagar
Make AST operators nullable (#9096) @vyasr
Remove the option to pass data types as strings to read_csv and read_json (#9079) @vuule
Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
Support additional format specifiers in from_timestamps (#9047) @davidwendt
Expose expression base class publicly and simplify public AST API (#9045) @vyasr
Add support for struct type in ORC writer (#9025) @vuule
Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
Java bindings for conditional join output sizes (#9002) @jlowe
Move compute_column API out of ast namespace (#8957) @vyasr
cudf.dtype function (#8949) @shwina
Refactor Frame reductions (#8944) @vyasr
Add nested column selection to parquet reader (#8933) @devavret
JNI Aggregation Type Changes (#8919) @revans2
Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
Expand CSV and JSON reader APIs to accept dtypes as a vector or map of data_type objects (#8856) @vuule
Change cudf docs theme to pydata theme (#8746) @galipremsagar
Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
Make groupby transform-like op order match original data order (#8720) @isVoid

🐛 Bug Fixes

fixed_point cudf::groupby for mean aggregation (#9296) @codereport
Fix interleave_columns when the input string lists column has an empty child column (#9292) @ttnghia
Update nvcomp to include fixes for installation of headers (#9276) @devavret
Fix Java column leak in testParquetWriteMap (#9271) @jlowe
Fix call to thrust::reduce_by_key in argmin/argmax libcudf groupby (#9263) @davidwendt
Fixing empty input to getMapValue crashing (#9262) @hyperbolic2346
Fix duplicate names issue in MultiIndex.deserialize (#9258) @galipremsagar
Dataframe.sort_index optimizations (#9238) @galipremsagar
Temporarily disabling problematic test in parquet writer (#9230) @devavret
Explicitly disable groupby on unsupported key types. (#9227) @mythrocks
Fix gather for sliced input structs column (#9218) @ttnghia
Fix JNI code for left semi and anti joins (#9207) @jlowe
Only install thrust when using a non 'system' version (#9206) @robertmaynard
Remove zlib from libcudf public CMake dependencies (#9204) @robertmaynard
Fix out-of-bounds memory read in orc gpuEncodeOrcColumnData (#9196) @davidwendt
Fix gather() for STRUCT inputs with no nulls in members. (#9194) @mythrocks
get_cucollections properly uses rapids_cpm_find (#9189) @robertmaynard
rapids-export correctly reference build code block and doc strings (#9186) @robertmaynard
Fix logic while parsing the sum statistic for numerical orc columns (#9183) @ayushdg
Add handling for nulls in dask_cudf.sorting.quantile_divisions (#9171) @charlesbluca
Approximate overflow detection in ORC statistics (#9163) @vuule
Use decimal precision metadata when reading from parquet files (#9162) @shwina
Fix variable name in Java build script (#9161) @jlowe
Import rapids-cmake modules using the correct cmake variable. (#9149) @robertmaynard
Fix conditional joins with empty left table (#9146) @vyasr
Fix joining on indexes with duplicate level names (#9137) @shwina
Fixes missing child column name in dtype while reading ORC file. (#9134) @rgsl888prabhu
Apply type metadata after column is slice-copied (#9131) @isVoid
Fix a bug: inner_join_size returns zero if build table is empty (#9128) @PointKernel
Fix multi hive-partition parquet reading in dask-cudf (#9122) @rjzamora
Support null literals in expressions (#9117) @vyasr
Fix cudf::hash_join output size for struct joins (#9107) @jlowe
Import fix (#9104) @shwina
Fix cudf::strings::is_fixed_point checking of overflow for decimal32 (#9093) @davidwendt
Fix branch_stack calculation in row_bit_count() (#9076) @mythrocks
Fetch rapids-cmake to work around cuCollection cmake issue (#9075) @jlowe
Fix compilation errors in groupby benchmarks. (#9072) @nvdbaranec
Preserve float16 upscaling (#9069) @galipremsagar
Fix memcheck read error in libcudf contiguous_split (#9067) @davidwendt
Add support for reading ORC file with no row group index (#9060) @rgsl888prabhu
Various multiindex related fixes (#9036) @shwina
Avoid rebuilding cython in build.sh (#9034) @brandon-b-miller
Add support for percentile dispatch in dask_cudf (#9031) @galipremsagar
cudf resolve nvcc 11.0 compiler crashes during codegen (#9028) @robertmaynard
Fetch correct grouping keys agg of dask groupby (#9022) @galipremsagar
Allow where() to work with a Series and other=cudf.NA (#9019) @sarahyurick
Use correct index when returning Series from GroupBy.apply() (#9016) @charlesbluca
Fix Dataframe indexer setitem when array is passed (#9006) @galipremsagar
Fix ORC reading of files with struct columns that have null values (#9005) @vuule
Ensure JNI native libraries load when CompiledExpression loads (#8997) @jlowe
Fix memory read error in get_dremel_data in page_enc.cu (#8995) @davidwendt
Fix memory write error in get_list_child_to_list_row_mapping utility (#8994) @davidwendt
Fix debug compile error for csv_test.cpp (#8981) @davidwendt
Fix memory read/write error in concatenate_lists_ignore_null (#8978) @davidwendt
Fix concatenation of cudf.RangeIndex (#8970) @galipremsagar
Java conditional joins should not require matching column counts (#8955) @jlowe
Fix concatenate empty structs (#8947) @sperlingxx
Fix cuda-memcheck errors for some libcudf functions (#8941) @davidwendt
Apply series name to result of SeriesGroupby.apply() (#8939) @charlesbluca
cdef packed_columns as cppclass instead of struct (#8936) @charlesbluca
Inserting a cudf.NA into a DataFrame (#8923) @sarahyurick
Support casting with Pandas dtype aliases (#8920) @sarahyurick
Allow sort_values to accept same kind values as Pandas (#8912) @sarahyurick
Enable casting to pandas nullable dtypes (#8889) @brandon-b-miller
Fix libcudf memory errors (#8884) @karthikeyann
Throw KeyError when accessing field from struct with nonexistent key (#8880) @NV-jpt
replace auto with auto& ref for cast<&> (#8866) @karthikeyann
Add missing include<optional> in binops (#8864) @karthikeyann
Fix select_dtypes to work when non-class dtypes present in dataframe (#8849) @sarahyurick
Re-enable JSON tests (#8843) @vuule
Support header with embedded delimiter in csv writer (#8798) @davidwendt

📖 Documentation

Add IO docs page in cudf documentation (#9145) @galipremsagar
use correct namespace in cuio code examples (#9037) @cwharris
Restructuring Contributing doc (#9026) @iskode
Update stable version in readme (#9008) @galipremsagar
Add spans and more include guidelines to libcudf developer guide (#8931) @harrism
Update Java build instructions to mention Arrow S3 and Docker (#8867) @jlowe
List GDS-enabled formats in the docs (#8805) @vuule
Change cudf docs theme to pydata theme (#8746) @galipremsagar

🚀 New Features

Revert "Add shallow hash function and shallow equality comparison for column_view (#9185)" (#9283) @karthikeyann
Align DataFrame.apply signature with pandas (#9275) @brandon-b-miller
Add struct type support for drop_list_duplicates (#9202) @ttnghia
support CUDA async memory resource in JNI (#9201) @rongou
Add shallow hash function and shallow equality comparison for column_view (#9185) @karthikeyann
Superimpose null masks for STRUCT columns. (#9144) @mythrocks
Implemented bindings for ceil timestamp operation (#9141) @shaneding
Adding MAP type support for ORC Reader (#9132) @rgsl888prabhu
Implement interleave_columns for lists with arbitrary nested type (#9130) @ttnghia
Add python bindings to fixed-size window and groupby rolling.var, rolling.std (#9097) @isVoid
Make AST operators nullable (#9096) @vyasr
Java bindings for approx_percentile (#9094) @andygrove
Add dseries.struct.explode (#9086) @isVoid
Add support for BaseIndexer in Rolling APIs (#9085) @galipremsagar
Remove the option to pass data types as strings to read_csv and read_json (#9079) @vuule
Add handling for nested dicts in dask-cudf groupby (#9054) @charlesbluca
Added Series.dt.is_quarter_start and Series.dt.is_quarter_end (#9046) @TravisHester
Support nested types for nth_element reduction (#9043) @sperlingxx
Update sort groupby to use non-atomic operation (#9035) @karthikeyann
Add support for struct type in ORC writer (#9025) @vuule
Implement interleave_columns for structs columns (#9012) @ttnghia
Add groupby first and last aggregations (#9004) @shwina
Add DecimalBaseColumn and move as_decimal_column (#9001) @isVoid
Python/Cython bindings for multibyte_split (#8998) @jdye64
Support scalar months in add_calendrical_months, extends API to INT32 support (#8991) @isVoid
Added Series.dt.is_month_end (#8989) @TravisHester
Support for using tdigests to compute approximate percentiles. (#8983) @nvdbaranec
Support "unflatten" of columns flattened via flatten_nested_columns(): (#8956) @mythrocks
Implement timestamp ceil (#8942) @shaneding
Add nested column selection to parquet reader (#8933) @devavret
Expose conditional join size calculation (#8928) @vyasr
Support Nulls in Timeseries Generator (#8925) @isVoid
Avoid index equality check in _CPackedColumns.from_py_table() (#8917) @charlesbluca
Add dot product binary op (#8909) @charlesbluca
Expose days_in_month function in libcudf and add python bindings (#8892) @isVoid
Series string repeat (#8882) @sarahyurick
Python binding for quarters (#8862) @shaneding
Expand CSV and JSON reader APIs to accept dtypes as a vector or map of data_type objects (#8856) @vuule
Add Java bindings for AST transform (#8846) @jlowe
Series datetime is_month_start (#8844) @sarahyurick
Support bracket syntax for cudf::strings::replace_with_backrefs group index values (#8841) @davidwendt
Support VARIANCE and STD aggregation in rolling op (#8809) @isVoid
Add quarters to libcudf datetime (#8779) @shaneding
Linear Interpolation of nans via cupy (#8767) @brandon-b-miller
Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
Make groupby transform-like op order match original data order (#8720) @isVoid
multibyte_split (#8702) @cwharris
Implement JNI for strings:repeat_strings that repeats each string separately by different numbers of times (#8572) @ttnghia

🛠️ Improvements

Pin max dask and distributed versions to 2021.09.1 (#9286) @galipremsagar
Optimized fsspec data transfer for remote file-systems (#9265) @rjzamora
Skip dask-cudf tests on arm64 (#9252) @Ethyling
Use nvcomp's snappy compressor in ORC writer (#9242) @devavret
Only run imports tests on x86_64 (#9241) @Ethyling
Remove unnecessary call to device_uvector::release() (#9237) @harrism
Use nvcomp's snappy decompression in ORC reader (#9235) @devavret
Add grouped_rolling test with STRUCT groupby keys. (#9228) @mythrocks
Optimize cudf.concat for axis=0 (#9222) @galipremsagar
Fix some libcudf calls not passing the stream parameter (#9220) @davidwendt
Add min and max bounds for random dataframe generator numeric types (#9211) @galipremsagar
Improve performance of expression evaluation (#9210) @vyasr
Misc optimizations in cudf (#9203) @galipremsagar
Remove Cython APIs for table view generation (#9199) @vyasr
Add JNI support for drop_list_duplicates (#9198) @revans2
Update pandas versions in conda recipes and requirements.txt files (#9197) @galipremsagar
Minor C++17 cleanup of groupby.cu: structured bindings, more concise lambda, etc (#9193) @codereport
Be explicit about bitwidth difference between cudf boolean and arrow boolean (#9192) @isVoid
Remove _source_index from MultiIndex (#9191) @vyasr
Fix typo in the name of cudf-testing-targets.cmake (#9190) @trxcllnt
Add support for single-digits in cudf::to_timestamps (#9173) @davidwendt
Fix cufilejni build include path (#9168) @pxLi
dask_cudf dispatch registering cleanup (#9160) @galipremsagar
Remove unneeded stream/mr from a cudf::make_strings_column (#9148) @davidwendt
Upgrade pandas version in cudf (#9147) @galipremsagar
make data chunk reader return unique_ptr (#9129) @cwharris
Add backend for percentile_lookup dispatch (#9118) @galipremsagar
Refactor implementation of column setitem (#9110) @vyasr
Fix compile warnings found using nvcc 11.4 (#9101) @davidwendt
Update to UCX-Py 0.22 (#9099) @pentschev
Simplify read_avro by removing unnecessary writer/impl classes (#9090) @cwharris
Allowing %f in format to return nanoseconds (#9081) @marlenezw
Java bindings for cudf::hash_join (#9080) @jlowe
Remove stale code in ColumnBase._fill (#9078) @isVoid
Add support for get_group in GroupBy (#9070) @galipremsagar
Remove remaining "support" methods from DataFrame (#9068) @vyasr
Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
Added method to remove null_masks if the column has no nulls (#9061) @razajafri
Consolidate Several Series and Dataframe Methods (#9059) @isVoid
Remove usage of string based set_dtypes for csv & json readers (#9049) @galipremsagar
Remove some debug print statements from gtests (#9048) @davidwendt
Support additional format specifiers in from_timestamps (#9047) @davidwendt
Expose expression base class publicly and simplify public AST API (#9045) @vyasr
move filepath and mmap logic out of json/csv up to functions.cpp (#9040) @cwharris
Refactor Index hierarchy (#9039) @vyasr
cudf now leverages rapids-cmake to reduce CMake boilerplate (#9030) @robertmaynard
Add support for STRUCT input to groupby (#9024) @mythrocks
Refactor Frame scans (#9021) @vyasr
Remove duplicate set_categories code (#9018) @isVoid
Map support for ParquetWriter (#9013) @razajafri
Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
Java bindings for conditional join output sizes (#9002) @jlowe
Remove _copy_construct factory (#8999) @vyasr
ENH Allow arbitrary CMake config options in build.sh (#8996) @dillon-cullinan
A small optimization for JNI copy column view to column vector (#8985) @revans2
Fix nvcc warnings in ORC writer (#8975) @devavret
Support nested structs in rank and dense rank (#8962) @rwlee
Move compute_column API out of ast namespace (#8957) @vyasr
Series datetime is_year_end and is_year_start (#8954) @marlenezw
Make Java AstNode public (#8953) @jlowe
Replace allocate with device_uvector for subword_tokenize internal tables (#8952) @davidwendt
cudf.dtype function (#8949) @shwina
Refactor Frame reductions (#8944) @vyasr
Add deprecation warning for Series.set_mask API (#8943) @galipremsagar
Move AST evaluator into a separate header (#8930) @vyasr
JNI Aggregation Type Changes (#8919) @revans2
Move template parameter to function parameter in cudf::detail::left_semi_anti_join (#8914) @davidwendt
Upgrade arrow & pyarrow to 5.0.0 (#8908) @galipremsagar
Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
Move structs_column_tests.cu to .cpp. (#8902) @mythrocks
Add stream and memory-resource parameters to struct-scalar copy ctor (#8901) @davidwendt
Combine linearizer and ast_plan (#8900) @vyasr
Add Java bindings for conditional join gather maps (#8888) @jlowe
Remove max version pin for dask & distributed on development branch (#8881) @galipremsagar
fix cufilejni build w/ c++17 (#8877) @pxLi
Add struct accessor to dask-cudf (#8874) @NV-jpt
Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine (#8871) @rjzamora
Add JNI for extract_quarter, add_calendrical_months, and is_leap_year (#8863) @revans2
Change cudf::scalar copy and move constructors to protected (#8857) @davidwendt
Replace is_same<>::value with is_same_v<> (#8852) @codereport
Add min pytorch version to importorskip in pytest (#8851) @galipremsagar
Java bindings for regex replace (#8847) @jlowe
Remove make strings children with null mask (#8830) @davidwendt
Refactor conditional joins (#8815) @vyasr
Small cleanup (unused headers / commented code removals) (#8799) @codereport
ENH Replace gpuci_conda_retry with gpuci_mamba_retry (#8770) @dillon-cullinan
Update cudf java bindings to 21.10.0-SNAPSHOT (#8765) @pxLi
Refactor and improve join benchmarks with nvbench (#8734) @PointKernel
Refactor Python factories and remove usage of Table for libcudf output handling (#8687) @vyasr
Optimize URL Decoding (#8622) @gaohao95
Parquet writer dictionary encoding refactor (#8476) @devavret
Use nvcomp's snappy decompression in parquet reader (#8252) @devavret
Use nvcomp's snappy compressor in parquet writer (#8229) @devavret
