v21.10.00
🚨 Breaking Changes
- Remove Cython APIs for table view generation (#9199) @vyasr
- Upgrade `pandas` version in `cudf` (#9147) @galipremsagar
- Make AST operators nullable (#9096) @vyasr
- Remove the option to pass data types as strings to `read_csv` and `read_json` (#9079) @vuule
- Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
- Support additional format specifiers in from_timestamps (#9047) @davidwendt
- Expose expression base class publicly and simplify public AST API (#9045) @vyasr
- Add support for struct type in ORC writer (#9025) @vuule
- Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
- Java bindings for conditional join output sizes (#9002) @jlowe
- Move compute_column API out of ast namespace (#8957) @vyasr
- `cudf.dtype` function (#8949) @shwina
- Refactor Frame reductions (#8944) @vyasr
- Add nested column selection to parquet reader (#8933) @devavret
- JNI Aggregation Type Changes (#8919) @revans2
- Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
- Expand CSV and JSON reader APIs to accept `dtypes` as a vector or map of `data_type` objects (#8856) @vuule
- Change cudf docs theme to pydata theme (#8746) @galipremsagar
- Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
- Make groupby transform-like op order match original data order (#8720) @isVoid
🐛 Bug Fixes
- `fixed_point` `cudf::groupby` for `mean` aggregation (#9296) @codereport
- Fix `interleave_columns` when the input string lists column having empty child column (#9292) @ttnghia
- Update nvcomp to include fixes for installation of headers (#9276) @devavret
- Fix Java column leak in testParquetWriteMap (#9271) @jlowe
- Fix call to thrust::reduce_by_key in argmin/argmax libcudf groupby (#9263) @davidwendt
- Fixing empty input to getMapValue crashing (#9262) @hyperbolic2346
- Fix duplicate names issue in `MultiIndex.deserialize` (#9258) @galipremsagar
- `Dataframe.sort_index` optimizations (#9238) @galipremsagar
- Temporarily disabling problematic test in parquet writer (#9230) @devavret
- Explicitly disable groupby on unsupported key types. (#9227) @mythrocks
- Fix `gather` for sliced input structs column (#9218) @ttnghia
- Fix JNI code for left semi and anti joins (#9207) @jlowe
- Only install thrust when using a non 'system' version (#9206) @robertmaynard
- Remove zlib from libcudf public CMake dependencies (#9204) @robertmaynard
- Fix out-of-bounds memory read in orc gpuEncodeOrcColumnData (#9196) @davidwendt
- Fix `gather()` for `STRUCT` inputs with no nulls in members. (#9194) @mythrocks
- get_cucollections properly uses rapids_cpm_find (#9189) @robertmaynard
- rapids-export correctly reference build code block and doc strings (#9186) @robertmaynard
- Fix logic while parsing the sum statistic for numerical orc columns (#9183) @ayushdg
- Add handling for nulls in `dask_cudf.sorting.quantile_divisions` (#9171) @charlesbluca
- Approximate overflow detection in ORC statistics (#9163) @vuule
- Use decimal precision metadata when reading from parquet files (#9162) @shwina
- Fix variable name in Java build script (#9161) @jlowe
- Import rapids-cmake modules using the correct cmake variable. (#9149) @robertmaynard
- Fix conditional joins with empty left table (#9146) @vyasr
- Fix joining on indexes with duplicate level names (#9137) @shwina
- Fixes missing child column name in dtype while reading ORC file. (#9134) @rgsl888prabhu
- Apply type metadata after column is slice-copied (#9131) @isVoid
- Fix a bug: inner_join_size return zero if build table is empty (#9128) @PointKernel
- Fix multi hive-partition parquet reading in dask-cudf (#9122) @rjzamora
- Support null literals in expressions (#9117) @vyasr
- Fix cudf::hash_join output size for struct joins (#9107) @jlowe
- Import fix (#9104) @shwina
- Fix cudf::strings::is_fixed_point checking of overflow for decimal32 (#9093) @davidwendt
- Fix branch_stack calculation in `row_bit_count()` (#9076) @mythrocks
- Fetch rapids-cmake to work around cuCollection cmake issue (#9075) @jlowe
- Fix compilation errors in groupby benchmarks. (#9072) @nvdbaranec
- Preserve float16 upscaling (#9069) @galipremsagar
- Fix memcheck read error in libcudf contiguous_split (#9067) @davidwendt
- Add support for reading ORC file with no row group index (#9060) @rgsl888prabhu
- Various multiindex related fixes (#9036) @shwina
- Avoid rebuilding cython in build.sh (#9034) @brandon-b-miller
- Add support for percentile dispatch in `dask_cudf` (#9031) @galipremsagar
- cudf resolve nvcc 11.0 compiler crashes during codegen (#9028) @robertmaynard
- Fetch correct grouping keys `agg` of dask groupby (#9022) @galipremsagar
- Allow `where()` to work with a Series and `other=cudf.NA` (#9019) @sarahyurick
- Use correct index when returning Series from `GroupBy.apply()` (#9016) @charlesbluca
- Fix `Dataframe` indexer setitem when array is passed (#9006) @galipremsagar
- Fix ORC reading of files with struct columns that have null values (#9005) @vuule
- Ensure JNI native libraries load when CompiledExpression loads (#8997) @jlowe
- Fix memory read error in get_dremel_data in page_enc.cu (#8995) @davidwendt
- Fix memory write error in get_list_child_to_list_row_mapping utility (#8994) @davidwendt
- Fix debug compile error for csv_test.cpp (#8981) @davidwendt
- Fix memory read/write error in concatenate_lists_ignore_null (#8978) @davidwendt
- Fix concatenation of `cudf.RangeIndex` (#8970) @galipremsagar
- Java conditional joins should not require matching column counts (#8955) @jlowe
- Fix concatenate empty structs (#8947) @sperlingxx
- Fix cuda-memcheck errors for some libcudf functions (#8941) @davidwendt
- Apply series name to result of `SeriesGroupby.apply()` (#8939) @charlesbluca
- `cdef packed_columns` as `cppclass` instead of `struct` (#8936) @charlesbluca
- Inserting a `cudf.NA` into a DataFrame (#8923) @sarahyurick
- Support casting with Pandas dtype aliases (#8920) @sarahyurick
- Allow `sort_values` to accept same `kind` values as Pandas (#8912) @sarahyurick
- Enable casting to pandas nullable dtypes (#8889) @brandon-b-miller
- Fix libcudf memory errors (#8884) @karthikeyann
- Throw KeyError when accessing field from struct with nonexistent key (#8880) @NV-jpt
- replace auto with auto& ref for cast<&> (#8866) @karthikeyann
- Add missing include<optional> in binops (#8864) @karthikeyann
- Fix `select_dtypes` to work when non-class dtypes present in dataframe (#8849) @sarahyurick
- Re-enable JSON tests (#8843) @vuule
- Support header with embedded delimiter in csv writer (#8798) @davidwendt
📖 Documentation
- Add IO docs page in `cudf` documentation (#9145) @galipremsagar
- use correct namespace in cuio code examples (#9037) @cwharris
- Restructuring `Contributing doc` (#9026) @iskode
- Update stable version in readme (#9008) @galipremsagar
- Add spans and more include guidelines to libcudf developer guide (#8931) @harrism
- Update Java build instructions to mention Arrow S3 and Docker (#8867) @jlowe
- List GDS-enabled formats in the docs (#8805) @vuule
- Change cudf docs theme to pydata theme (#8746) @galipremsagar
🚀 New Features
- Revert "Add shallow hash function and shallow equality comparison for column_view (#9185)" (#9283) @karthikeyann
- Align `DataFrame.apply` signature with pandas (#9275) @brandon-b-miller
- Add struct type support for `drop_list_duplicates` (#9202) @ttnghia
- support CUDA async memory resource in JNI (#9201) @rongou
- Add shallow hash function and shallow equality comparison for column_view (#9185) @karthikeyann
- Superimpose null masks for STRUCT columns. (#9144) @mythrocks
- Implemented bindings for `ceil` timestamp operation (#9141) @shaneding
- Adding MAP type support for ORC Reader (#9132) @rgsl888prabhu
- Implement `interleave_columns` for lists with arbitrary nested type (#9130) @ttnghia
- Add python bindings to fixed-size window and groupby `rolling.var`, `rolling.std` (#9097) @isVoid
- Make AST operators nullable (#9096) @vyasr
- Java bindings for approx_percentile (#9094) @andygrove
- Add `dseries.struct.explode` (#9086) @isVoid
- Add support for BaseIndexer in Rolling APIs (#9085) @galipremsagar
- Remove the option to pass data types as strings to `read_csv` and `read_json` (#9079) @vuule
- Add handling for nested dicts in dask-cudf groupby (#9054) @charlesbluca
- Added Series.dt.is_quarter_start and Series.dt.is_quarter_end (#9046) @TravisHester
- Support nested types for nth_element reduction (#9043) @sperlingxx
- Update sort groupby to use non-atomic operation (#9035) @karthikeyann
- Add support for struct type in ORC writer (#9025) @vuule
- Implement `interleave_columns` for structs columns (#9012) @ttnghia
- Add groupby first and last aggregations (#9004) @shwina
- Add `DecimalBaseColumn` and move `as_decimal_column` (#9001) @isVoid
- Python/Cython bindings for multibyte_split (#8998) @jdye64
- Support scalar `months` in `add_calendrical_months`, extends API to INT32 support (#8991) @isVoid
- Added Series.dt.is_month_end (#8989) @TravisHester
- Support for using tdigests to compute approximate percentiles. (#8983) @nvdbaranec
- Support "unflatten" of columns flattened via `flatten_nested_columns()`: (#8956) @mythrocks
- Implement timestamp ceil (#8942) @shaneding
- Add nested column selection to parquet reader (#8933) @devavret
- Expose conditional join size calculation (#8928) @vyasr
- Support Nulls in Timeseries Generator (#8925) @isVoid
- Avoid index equality check in `_CPackedColumns.from_py_table()` (#8917) @charlesbluca
- Add dot product binary op (#8909) @charlesbluca
- Expose `days_in_month` function in libcudf and add python bindings (#8892) @isVoid
- Series string repeat (#8882) @sarahyurick
- Python binding for quarters (#8862) @shaneding
- Expand CSV and JSON reader APIs to accept `dtypes` as a vector or map of `data_type` objects (#8856) @vuule
- Add Java bindings for AST transform (#8846) @jlowe
- Series datetime is_month_start (#8844) @sarahyurick
- Support bracket syntax for cudf::strings::replace_with_backrefs group index values (#8841) @davidwendt
- Support `VARIANCE` and `STD` aggregation in rolling op (#8809) @isVoid
- Add quarters to libcudf datetime (#8779) @shaneding
- Linear Interpolation of `nan`s via `cupy` (#8767) @brandon-b-miller
- Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
- Make groupby transform-like op order match original data order (#8720) @isVoid
- multibyte_split (#8702) @cwharris
- Implement JNI for `strings:repeat_strings` that repeats each string separately by different numbers of times (#8572) @ttnghia
🛠️ Improvements
- Pin max `dask` and `distributed` versions to `2021.09.1` (#9286) @galipremsagar
- Optimized fsspec data transfer for remote file-systems (#9265) @rjzamora
- Skip dask-cudf tests on arm64 (#9252) @Ethyling
- Use nvcomp's snappy compressor in ORC writer (#9242) @devavret
- Only run imports tests on x86_64 (#9241) @Ethyling
- Remove unnecessary call to device_uvector::release() (#9237) @harrism
- Use nvcomp's snappy decompression in ORC reader (#9235) @devavret
- Add grouped_rolling test with STRUCT groupby keys. (#9228) @mythrocks
- Optimize `cudf.concat` for `axis=0` (#9222) @galipremsagar
- Fix some libcudf calls not passing the stream parameter (#9220) @davidwendt
- Add min and max bounds for random dataframe generator numeric types (#9211) @galipremsagar
- Improve performance of expression evaluation (#9210) @vyasr
- Misc optimizations in `cudf` (#9203) @galipremsagar
- Remove Cython APIs for table view generation (#9199) @vyasr
- Add JNI support for drop_list_duplicates (#9198) @revans2
- Update pandas versions in conda recipes and requirements.txt files (#9197) @galipremsagar
- Minor C++17 cleanup of `groupby.cu`: structured bindings, more concise lambda, etc (#9193) @codereport
- Explicit about bitwidth difference between cudf boolean and arrow boolean (#9192) @isVoid
- Remove _source_index from MultiIndex (#9191) @vyasr
- Fix typo in the name of `cudf-testing-targets.cmake` (#9190) @trxcllnt
- Add support for single-digits in cudf::to_timestamps (#9173) @davidwendt
- Fix cufilejni build include path (#9168) @pxLi
- `dask_cudf` dispatch registering cleanup (#9160) @galipremsagar
- Remove unneeded stream/mr from a cudf::make_strings_column (#9148) @davidwendt
- Upgrade `pandas` version in `cudf` (#9147) @galipremsagar
- make data chunk reader return unique_ptr (#9129) @cwharris
- Add backend for `percentile_lookup` dispatch (#9118) @galipremsagar
- Refactor implementation of column setitem (#9110) @vyasr
- Fix compile warnings found using nvcc 11.4 (#9101) @davidwendt
- Update to UCX-Py 0.22 (#9099) @pentschev
- Simplify read_avro by removing unnecessary writer/impl classes (#9090) @cwharris
- Allowing %f in format to return nanoseconds (#9081) @marlenezw
- Java bindings for cudf::hash_join (#9080) @jlowe
- Remove stale code in `ColumnBase._fill` (#9078) @isVoid
- Add support for `get_group` in GroupBy (#9070) @galipremsagar
- Remove remaining "support" methods from DataFrame (#9068) @vyasr
- Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
- Added method to remove null_masks if the column has no nulls (#9061) @razajafri
- Consolidate Several Series and Dataframe Methods (#9059) @isVoid
- Remove usage of string based `set_dtypes` for `csv` & `json` readers (#9049) @galipremsagar
- Remove some debug print statements from gtests (#9048) @davidwendt
- Support additional format specifiers in from_timestamps (#9047) @davidwendt
- Expose expression base class publicly and simplify public AST API (#9045) @vyasr
- move filepath and mmap logic out of json/csv up to functions.cpp (#9040) @cwharris
- Refactor Index hierarchy (#9039) @vyasr
- cudf now leverages rapids-cmake to reduce CMake boilerplate (#9030) @robertmaynard
- Add support for `STRUCT` input to `groupby` (#9024) @mythrocks
- Refactor Frame scans (#9021) @vyasr
- Remove duplicate `set_categories` code (#9018) @isVoid
- Map support for ParquetWriter (#9013) @razajafri
- Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
- Java bindings for conditional join output sizes (#9002) @jlowe
- Remove _copy_construct factory (#8999) @vyasr
- ENH Allow arbitrary CMake config options in build.sh (#8996) @dillon-cullinan
- A small optimization for JNI copy column view to column vector (#8985) @revans2
- Fix nvcc warnings in ORC writer (#8975) @devavret
- Support nested structs in rank and dense rank (#8962) @rwlee
- Move compute_column API out of ast namespace (#8957) @vyasr
- Series datetime is_year_end and is_year_start (#8954) @marlenezw
- Make Java AstNode public (#8953) @jlowe
- Replace allocate with device_uvector for subword_tokenize internal tables (#8952) @davidwendt
- `cudf.dtype` function (#8949) @shwina
- Refactor Frame reductions (#8944) @vyasr
- Add deprecation warning for `Series.set_mask` API (#8943) @galipremsagar
- Move AST evaluator into a separate header (#8930) @vyasr
- JNI Aggregation Type Changes (#8919) @revans2
- Move template parameter to function parameter in cudf::detail::left_semi_anti_join (#8914) @davidwendt
- Upgrade `arrow` & `pyarrow` to `5.0.0` (#8908) @galipremsagar
- Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
- Move `structs_column_tests.cu` to `.cpp`. (#8902) @mythrocks
- Add stream and memory-resource parameters to struct-scalar copy ctor (#8901) @davidwendt
- Combine linearizer and ast_plan (#8900) @vyasr
- Add Java bindings for conditional join gather maps (#8888) @jlowe
- Remove max version pin for `dask` & `distributed` on development branch (#8881) @galipremsagar
- fix cufilejni build w/ c++17 (#8877) @pxLi
- Add struct accessor to dask-cudf (#8874) @NV-jpt
- Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine (#8871) @rjzamora
- Add JNI for extract_quarter, add_calendrical_months, and is_leap_year (#8863) @revans2
- Change cudf::scalar copy and move constructors to protected (#8857) @davidwendt
- Replace `is_same<>::value` with `is_same_v<>` (#8852) @codereport
- Add min `pytorch` version to `importorskip` in pytest (#8851) @galipremsagar
- Java bindings for regex replace (#8847) @jlowe
- Remove make strings children with null mask (#8830) @davidwendt
- Refactor conditional joins (#8815) @vyasr
- Small cleanup (unused headers / commented code removals) (#8799) @codereport
- ENH Replace gpuci_conda_retry with gpuci_mamba_retry (#8770) @dillon-cullinan
- Update cudf java bindings to 21.10.0-SNAPSHOT (#8765) @pxLi
- Refactor and improve join benchmarks with nvbench (#8734) @PointKernel
- Refactor Python factories and remove usage of Table for libcudf output handling (#8687) @vyasr
- Optimize URL Decoding (#8622) @gaohao95
- Parquet writer dictionary encoding refactor (#8476) @devavret
- Use nvcomp's snappy decompression in parquet reader (#8252) @devavret
- Use nvcomp's snappy compressor in parquet writer (#8229) @devavret