diff --git a/CHANGELOG.md b/CHANGELOG.md index 3ccc1ccbc8b..68ff9abc9ea 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,9 +2,232 @@ Please see https://github.com/rapidsai/cudf/releases/tag/v22.02.00a for the latest changes to this development branch. -# cuDF 21.12.00 (Date TBD) +# cuDF 21.12.00 (9 Dec 2021) -Please see https://github.com/rapidsai/cudf/releases/tag/v21.12.00a for the latest changes to this development branch. +## 🚨 Breaking Changes + +- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits ([#9616](https://github.com/rapidsai/cudf/pull/9616)) [@PointKernel](https://github.com/PointKernel) +- Remove sizeof and standardize on memory_usage ([#9544](https://github.com/rapidsai/cudf/pull/9544)) [@vyasr](https://github.com/vyasr) +- Add support for single-line regex anchors ^/$ in contains_re ([#9482](https://github.com/rapidsai/cudf/pull/9482)) [@davidwendt](https://github.com/davidwendt) +- Refactor sorting APIs ([#9464](https://github.com/rapidsai/cudf/pull/9464)) [@vyasr](https://github.com/vyasr) +- Update Java nvcomp JNI bindings to nvcomp 2.x API ([#9384](https://github.com/rapidsai/cudf/pull/9384)) [@jbrennan333](https://github.com/jbrennan333) +- Support Python UDFs written in terms of rows ([#9343](https://github.com/rapidsai/cudf/pull/9343)) [@brandon-b-miller](https://github.com/brandon-b-miller) +- JNI: Support nested types in ORC writer ([#9334](https://github.com/rapidsai/cudf/pull/9334)) [@firestarman](https://github.com/firestarman) +- Optionally nullify out-of-bounds indices in segmented_gather(). 
([#9318](https://github.com/rapidsai/cudf/pull/9318)) [@mythrocks](https://github.com/mythrocks) +- Refactor cuIO timestamp processing with `cuda::std::chrono` ([#9278](https://github.com/rapidsai/cudf/pull/9278)) [@PointKernel](https://github.com/PointKernel) +- Various internal MultiIndex improvements ([#9243](https://github.com/rapidsai/cudf/pull/9243)) [@vyasr](https://github.com/vyasr) + +## 🐛 Bug Fixes + +- Fix read_parquet bug for bytes input ([#9669](https://github.com/rapidsai/cudf/pull/9669)) [@rjzamora](https://github.com/rjzamora) +- Use `_gather` internal for `sort_*` ([#9668](https://github.com/rapidsai/cudf/pull/9668)) [@isVoid](https://github.com/isVoid) +- Fix behavior of equals for non-DataFrame Frames and add tests. ([#9653](https://github.com/rapidsai/cudf/pull/9653)) [@vyasr](https://github.com/vyasr) +- Dont recompute output size if it is already available ([#9649](https://github.com/rapidsai/cudf/pull/9649)) [@abellina](https://github.com/abellina) +- Fix read_parquet bug for extended dtypes from remote storage ([#9638](https://github.com/rapidsai/cudf/pull/9638)) [@rjzamora](https://github.com/rjzamora) +- add const when getting data from a JNI data wrapper ([#9637](https://github.com/rapidsai/cudf/pull/9637)) [@wjxiz1992](https://github.com/wjxiz1992) +- Fix debrotli issue on CUDA 11.5 ([#9632](https://github.com/rapidsai/cudf/pull/9632)) [@vuule](https://github.com/vuule) +- Use std::size_t when computing join output size ([#9626](https://github.com/rapidsai/cudf/pull/9626)) [@jlowe](https://github.com/jlowe) +- Fix `usecols` parameter handling in `dask_cudf.read_csv` ([#9618](https://github.com/rapidsai/cudf/pull/9618)) [@galipremsagar](https://github.com/galipremsagar) +- Add support for string `'nan', 'inf' & '-inf'` values while type-casting to `float` ([#9613](https://github.com/rapidsai/cudf/pull/9613)) [@galipremsagar](https://github.com/galipremsagar) +- Avoid passing NativeFileDatasource to pyarrow in read_parquet 
([#9608](https://github.com/rapidsai/cudf/pull/9608)) [@rjzamora](https://github.com/rjzamora) +- Fix test failure with cuda 11.5 in row_bit_count tests. ([#9581](https://github.com/rapidsai/cudf/pull/9581)) [@nvdbaranec](https://github.com/nvdbaranec) +- Correct _LIBCUDACXX_CUDACC_VER value computation ([#9579](https://github.com/rapidsai/cudf/pull/9579)) [@robertmaynard](https://github.com/robertmaynard) +- Increase max RLE stream size estimate to avoid potential overflows ([#9568](https://github.com/rapidsai/cudf/pull/9568)) [@vuule](https://github.com/vuule) +- Fix edge case in tdigest scalar generation for groups containing all nulls. ([#9551](https://github.com/rapidsai/cudf/pull/9551)) [@nvdbaranec](https://github.com/nvdbaranec) +- Fix pytests failing in `cuda-11.5` environment ([#9547](https://github.com/rapidsai/cudf/pull/9547)) [@galipremsagar](https://github.com/galipremsagar) +- compile libnvcomp with PTDS if requested ([#9540](https://github.com/rapidsai/cudf/pull/9540)) [@jbrennan333](https://github.com/jbrennan333) +- Fix `segmented_gather()` for null LIST rows ([#9537](https://github.com/rapidsai/cudf/pull/9537)) [@mythrocks](https://github.com/mythrocks) +- Deprecate DataFrame.label_encoding, use private _label_encoding method internally. ([#9535](https://github.com/rapidsai/cudf/pull/9535)) [@bdice](https://github.com/bdice) +- Fix several test and benchmark issues related to bitmask allocations. ([#9521](https://github.com/rapidsai/cudf/pull/9521)) [@nvdbaranec](https://github.com/nvdbaranec) +- Fix for inserting duplicates in groupby result cache ([#9508](https://github.com/rapidsai/cudf/pull/9508)) [@karthikeyann](https://github.com/karthikeyann) +- Fix mismatched types error in clip() when using non int64 numeric types ([#9498](https://github.com/rapidsai/cudf/pull/9498)) [@davidwendt](https://github.com/davidwendt) +- Match conda pinnings for style checks (revert part of #9412, #9433). 
([#9490](https://github.com/rapidsai/cudf/pull/9490)) [@bdice](https://github.com/bdice) +- Make sure all dask-cudf supported aggs are handled in `_tree_node_agg` ([#9487](https://github.com/rapidsai/cudf/pull/9487)) [@charlesbluca](https://github.com/charlesbluca) +- Resolve `hash_columns` `FutureWarning` in `dask_cudf` ([#9481](https://github.com/rapidsai/cudf/pull/9481)) [@pentschev](https://github.com/pentschev) +- Add fixed point to AllTypes in libcudf unit tests ([#9472](https://github.com/rapidsai/cudf/pull/9472)) [@karthikeyann](https://github.com/karthikeyann) +- Fix regex handling of embedded null characters ([#9470](https://github.com/rapidsai/cudf/pull/9470)) [@davidwendt](https://github.com/davidwendt) +- Fix memcheck error in copy-if-else ([#9467](https://github.com/rapidsai/cudf/pull/9467)) [@davidwendt](https://github.com/davidwendt) +- Fix bug in dask_cudf.read_parquet for index=False ([#9453](https://github.com/rapidsai/cudf/pull/9453)) [@rjzamora](https://github.com/rjzamora) +- Preserve the decimal scale when creating a default scalar ([#9449](https://github.com/rapidsai/cudf/pull/9449)) [@revans2](https://github.com/revans2) +- Push down parent nulls when flattening nested columns. 
([#9443](https://github.com/rapidsai/cudf/pull/9443)) [@mythrocks](https://github.com/mythrocks) +- Fix memcheck error in gtest SegmentedGatherTest/GatherSliced ([#9442](https://github.com/rapidsai/cudf/pull/9442)) [@davidwendt](https://github.com/davidwendt) +- Revert "Fix quantile division / partition handling for dask-cudf sort… ([#9438](https://github.com/rapidsai/cudf/pull/9438)) [@charlesbluca](https://github.com/charlesbluca) +- Allow int-like objects for the `decimals` argument in `round` ([#9428](https://github.com/rapidsai/cudf/pull/9428)) [@shwina](https://github.com/shwina) +- Fix stream compaction's `drop_duplicates` API to use stable sort ([#9417](https://github.com/rapidsai/cudf/pull/9417)) [@ttnghia](https://github.com/ttnghia) +- Skip Comparing Uniform Window Results in Var/std Tests ([#9416](https://github.com/rapidsai/cudf/pull/9416)) [@isVoid](https://github.com/isVoid) +- Fix `StructColumn.to_pandas` type handling issues ([#9388](https://github.com/rapidsai/cudf/pull/9388)) [@galipremsagar](https://github.com/galipremsagar) +- Correct issues in the build dir cudf-config.cmake ([#9386](https://github.com/rapidsai/cudf/pull/9386)) [@robertmaynard](https://github.com/robertmaynard) +- Fix Java table partition test to account for non-deterministic ordering ([#9385](https://github.com/rapidsai/cudf/pull/9385)) [@jlowe](https://github.com/jlowe) +- Fix timestamp truncation/overflow bugs in orc/parquet ([#9382](https://github.com/rapidsai/cudf/pull/9382)) [@PointKernel](https://github.com/PointKernel) +- Fix the crash in stats code ([#9368](https://github.com/rapidsai/cudf/pull/9368)) [@devavret](https://github.com/devavret) +- Make Series.hash_encode results reproducible. 
([#9366](https://github.com/rapidsai/cudf/pull/9366)) [@bdice](https://github.com/bdice) +- Fix libcudf compile warnings on debug 11.4 build ([#9360](https://github.com/rapidsai/cudf/pull/9360)) [@davidwendt](https://github.com/davidwendt) +- Fail gracefully when compiling python UDFs that attempt to access columns with unsupported dtypes ([#9359](https://github.com/rapidsai/cudf/pull/9359)) [@brandon-b-miller](https://github.com/brandon-b-miller) +- Set pass_filenames: false in mypy pre-commit configuration. ([#9349](https://github.com/rapidsai/cudf/pull/9349)) [@bdice](https://github.com/bdice) +- Fix cudf_assert in cudf::io::orc::gpu::gpuDecodeOrcColumnData ([#9348](https://github.com/rapidsai/cudf/pull/9348)) [@davidwendt](https://github.com/davidwendt) +- Fix memcheck error in groupby-tdigest get_scalar_minmax ([#9339](https://github.com/rapidsai/cudf/pull/9339)) [@davidwendt](https://github.com/davidwendt) +- Optimizations for `cudf.concat` when `axis=1` ([#9333](https://github.com/rapidsai/cudf/pull/9333)) [@galipremsagar](https://github.com/galipremsagar) +- Use f-string in join helper warning message. 
([#9325](https://github.com/rapidsai/cudf/pull/9325)) [@bdice](https://github.com/bdice) +- Avoid casting to list or struct dtypes in dask_cudf.read_parquet ([#9314](https://github.com/rapidsai/cudf/pull/9314)) [@rjzamora](https://github.com/rjzamora) +- Fix null count in statistics for parquet ([#9303](https://github.com/rapidsai/cudf/pull/9303)) [@devavret](https://github.com/devavret) +- Potential overflow of `decimal32` when casting to `int64_t` ([#9287](https://github.com/rapidsai/cudf/pull/9287)) [@codereport](https://github.com/codereport) +- Fix quantile division / partition handling for dask-cudf sort on null dataframes ([#9259](https://github.com/rapidsai/cudf/pull/9259)) [@charlesbluca](https://github.com/charlesbluca) +- Updating cudf version also updates rapids cmake branch ([#9249](https://github.com/rapidsai/cudf/pull/9249)) [@robertmaynard](https://github.com/robertmaynard) +- Implement `one_hot_encoding` in libcudf and bind to python ([#9229](https://github.com/rapidsai/cudf/pull/9229)) [@isVoid](https://github.com/isVoid) +- BUG FIX: CSV Writer ignores the header parameter when no metadata is provided ([#8740](https://github.com/rapidsai/cudf/pull/8740)) [@skirui-source](https://github.com/skirui-source) + +## 📖 Documentation + +- Update Documentation to use `TYPED_TEST_SUITE` ([#9654](https://github.com/rapidsai/cudf/pull/9654)) [@codereport](https://github.com/codereport) +- Add dedicated page for `StringHandling` in python docs ([#9624](https://github.com/rapidsai/cudf/pull/9624)) [@galipremsagar](https://github.com/galipremsagar) +- Update docstring of `DataFrame.merge` ([#9572](https://github.com/rapidsai/cudf/pull/9572)) [@galipremsagar](https://github.com/galipremsagar) +- Use raw strings to avoid SyntaxErrors in parsed docstrings. 
([#9526](https://github.com/rapidsai/cudf/pull/9526)) [@bdice](https://github.com/bdice) +- Add example to docstrings in `rolling.apply` ([#9522](https://github.com/rapidsai/cudf/pull/9522)) [@isVoid](https://github.com/isVoid) +- Update help message to escape quotes in ./build.sh --cmake-args. ([#9494](https://github.com/rapidsai/cudf/pull/9494)) [@bdice](https://github.com/bdice) +- Improve Python docstring formatting. ([#9493](https://github.com/rapidsai/cudf/pull/9493)) [@bdice](https://github.com/bdice) +- Update table of I/O supported types ([#9476](https://github.com/rapidsai/cudf/pull/9476)) [@vuule](https://github.com/vuule) +- Document invalid regex patterns as undefined behavior ([#9473](https://github.com/rapidsai/cudf/pull/9473)) [@davidwendt](https://github.com/davidwendt) +- Miscellaneous documentation fixes to `cudf` ([#9471](https://github.com/rapidsai/cudf/pull/9471)) [@galipremsagar](https://github.com/galipremsagar) +- Fix many documentation errors in libcudf. ([#9355](https://github.com/rapidsai/cudf/pull/9355)) [@karthikeyann](https://github.com/karthikeyann) +- Fixing SubwordTokenizer docs issue ([#9354](https://github.com/rapidsai/cudf/pull/9354)) [@mayankanand007](https://github.com/mayankanand007) +- Improved deprecation warnings. ([#9347](https://github.com/rapidsai/cudf/pull/9347)) [@bdice](https://github.com/bdice) +- doc reorder mr, stream to stream, mr ([#9308](https://github.com/rapidsai/cudf/pull/9308)) [@karthikeyann](https://github.com/karthikeyann) +- Deprecate method parameters to DataFrame.join, DataFrame.merge. 
([#9291](https://github.com/rapidsai/cudf/pull/9291)) [@bdice](https://github.com/bdice) +- Added deprecation warning for `.label_encoding()` ([#9289](https://github.com/rapidsai/cudf/pull/9289)) [@mayankanand007](https://github.com/mayankanand007) + +## 🚀 New Features + +- Enable Series.divide and DataFrame.divide ([#9630](https://github.com/rapidsai/cudf/pull/9630)) [@vyasr](https://github.com/vyasr) +- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits ([#9616](https://github.com/rapidsai/cudf/pull/9616)) [@PointKernel](https://github.com/PointKernel) +- Add handling of mixed numeric types in `to_dlpack` ([#9585](https://github.com/rapidsai/cudf/pull/9585)) [@galipremsagar](https://github.com/galipremsagar) +- Support re.Pattern object for pat arg in str.replace ([#9573](https://github.com/rapidsai/cudf/pull/9573)) [@davidwendt](https://github.com/davidwendt) +- Add JNI for `lists::drop_list_duplicates` with keys-values input column ([#9553](https://github.com/rapidsai/cudf/pull/9553)) [@ttnghia](https://github.com/ttnghia) +- Support structs column in `min`, `max`, `argmin` and `argmax` groupby aggregate() and scan() ([#9545](https://github.com/rapidsai/cudf/pull/9545)) [@ttnghia](https://github.com/ttnghia) +- Move libcudacxx to use `rapids_cpm` and use newer versions ([#9539](https://github.com/rapidsai/cudf/pull/9539)) [@robertmaynard](https://github.com/robertmaynard) +- Add scan min/max support for chrono types to libcudf reduction-scan (not groupby scan) ([#9518](https://github.com/rapidsai/cudf/pull/9518)) [@davidwendt](https://github.com/davidwendt) +- Support `args=` in `apply` ([#9514](https://github.com/rapidsai/cudf/pull/9514)) [@brandon-b-miller](https://github.com/brandon-b-miller) +- Add groupby scan min/max support for strings values ([#9502](https://github.com/rapidsai/cudf/pull/9502)) [@davidwendt](https://github.com/davidwendt) +- Add list output option to character_ngrams() function 
([#9499](https://github.com/rapidsai/cudf/pull/9499)) [@davidwendt](https://github.com/davidwendt) +- More granular column selection in ORC reader ([#9496](https://github.com/rapidsai/cudf/pull/9496)) [@vuule](https://github.com/vuule) +- add min_periods, ddof to groupby covariance, & correlation aggregation ([#9492](https://github.com/rapidsai/cudf/pull/9492)) [@karthikeyann](https://github.com/karthikeyann) +- Implement Series.datetime.floor ([#9488](https://github.com/rapidsai/cudf/pull/9488)) [@skirui-source](https://github.com/skirui-source) +- Enable linting of CMake files using pre-commit ([#9484](https://github.com/rapidsai/cudf/pull/9484)) [@vyasr](https://github.com/vyasr) +- Add support for single-line regex anchors ^/$ in contains_re ([#9482](https://github.com/rapidsai/cudf/pull/9482)) [@davidwendt](https://github.com/davidwendt) +- Augment `order_by` to Accept a List of `null_precedence` ([#9455](https://github.com/rapidsai/cudf/pull/9455)) [@isVoid](https://github.com/isVoid) +- Add format API for list column of strings ([#9454](https://github.com/rapidsai/cudf/pull/9454)) [@davidwendt](https://github.com/davidwendt) +- Enable Datetime/Timedelta dtypes in Masked UDFs ([#9451](https://github.com/rapidsai/cudf/pull/9451)) [@brandon-b-miller](https://github.com/brandon-b-miller) +- Add cudf python groupby.diff ([#9446](https://github.com/rapidsai/cudf/pull/9446)) [@karthikeyann](https://github.com/karthikeyann) +- Implement `lists::stable_sort_lists` for stable sorting of elements within each row of lists column ([#9425](https://github.com/rapidsai/cudf/pull/9425)) [@ttnghia](https://github.com/ttnghia) +- add ctest memcheck using cuda-sanitizer ([#9414](https://github.com/rapidsai/cudf/pull/9414)) [@karthikeyann](https://github.com/karthikeyann) +- Support Unary Operations in Masked UDF ([#9409](https://github.com/rapidsai/cudf/pull/9409)) [@isVoid](https://github.com/isVoid) +- Move Several Series Function to Frame 
([#9394](https://github.com/rapidsai/cudf/pull/9394)) [@isVoid](https://github.com/isVoid) +- MD5 Python hash API ([#9390](https://github.com/rapidsai/cudf/pull/9390)) [@bdice](https://github.com/bdice) +- Add cudf strings is_title API ([#9380](https://github.com/rapidsai/cudf/pull/9380)) [@davidwendt](https://github.com/davidwendt) +- Enable casting to int64, uint64, and double in AST code. ([#9379](https://github.com/rapidsai/cudf/pull/9379)) [@vyasr](https://github.com/vyasr) +- Add support for writing ORC with map columns ([#9369](https://github.com/rapidsai/cudf/pull/9369)) [@vuule](https://github.com/vuule) +- extract_list_elements() with column_view indices ([#9367](https://github.com/rapidsai/cudf/pull/9367)) [@mythrocks](https://github.com/mythrocks) +- Reimplement `lists::drop_list_duplicates` for keys-values lists columns ([#9345](https://github.com/rapidsai/cudf/pull/9345)) [@ttnghia](https://github.com/ttnghia) +- Support Python UDFs written in terms of rows ([#9343](https://github.com/rapidsai/cudf/pull/9343)) [@brandon-b-miller](https://github.com/brandon-b-miller) +- JNI: Support nested types in ORC writer ([#9334](https://github.com/rapidsai/cudf/pull/9334)) [@firestarman](https://github.com/firestarman) +- Optionally nullify out-of-bounds indices in segmented_gather(). 
([#9318](https://github.com/rapidsai/cudf/pull/9318)) [@mythrocks](https://github.com/mythrocks) +- Add shallow hash function and shallow equality comparison for column_view ([#9312](https://github.com/rapidsai/cudf/pull/9312)) [@karthikeyann](https://github.com/karthikeyann) +- Add CudaMemoryBuffer for cudaMalloc memory using RMM cuda_memory_resource ([#9311](https://github.com/rapidsai/cudf/pull/9311)) [@rongou](https://github.com/rongou) +- Add parameters to control row index stride and stripe size in ORC writer ([#9310](https://github.com/rapidsai/cudf/pull/9310)) [@vuule](https://github.com/vuule) +- Add `na_position` param to dask-cudf `sort_values` ([#9264](https://github.com/rapidsai/cudf/pull/9264)) [@charlesbluca](https://github.com/charlesbluca) +- Add `ascending` parameter for dask-cudf `sort_values` ([#9250](https://github.com/rapidsai/cudf/pull/9250)) [@charlesbluca](https://github.com/charlesbluca) +- New array conversion methods ([#9236](https://github.com/rapidsai/cudf/pull/9236)) [@vyasr](https://github.com/vyasr) +- Series `apply` method backed by masked UDFs ([#9217](https://github.com/rapidsai/cudf/pull/9217)) [@brandon-b-miller](https://github.com/brandon-b-miller) +- Grouping by frequency and resampling ([#9178](https://github.com/rapidsai/cudf/pull/9178)) [@shwina](https://github.com/shwina) +- Pure-python masked UDFs ([#9174](https://github.com/rapidsai/cudf/pull/9174)) [@brandon-b-miller](https://github.com/brandon-b-miller) +- Add Covariance, Pearson correlation for sort groupby (libcudf) ([#9154](https://github.com/rapidsai/cudf/pull/9154)) [@karthikeyann](https://github.com/karthikeyann) +- Add `calendrical_month_sequence` in c++ and `date_range` in python ([#8886](https://github.com/rapidsai/cudf/pull/8886)) [@shwina](https://github.com/shwina) + +## 🛠️ Improvements + +- Followup to PR 9088 comments ([#9659](https://github.com/rapidsai/cudf/pull/9659)) [@cwharris](https://github.com/cwharris) +- Update cuCollections to version 
that supports installed libcudacxx ([#9633](https://github.com/rapidsai/cudf/pull/9633)) [@robertmaynard](https://github.com/robertmaynard) +- Add `11.5` dev.yml to `cudf` ([#9617](https://github.com/rapidsai/cudf/pull/9617)) [@galipremsagar](https://github.com/galipremsagar) +- Add `xfail` for parquet reader `11.5` issue ([#9612](https://github.com/rapidsai/cudf/pull/9612)) [@galipremsagar](https://github.com/galipremsagar) +- remove deprecated Rmm.initialize method ([#9607](https://github.com/rapidsai/cudf/pull/9607)) [@rongou](https://github.com/rongou) +- Use HostColumnVectorCore for child columns in JCudfSerialization.unpackHostColumnVectors ([#9596](https://github.com/rapidsai/cudf/pull/9596)) [@sperlingxx](https://github.com/sperlingxx) +- Set RMM pool to a fixed size in JNI ([#9583](https://github.com/rapidsai/cudf/pull/9583)) [@rongou](https://github.com/rongou) +- Use nvCOMP for Snappy compression/decompression ([#9582](https://github.com/rapidsai/cudf/pull/9582)) [@vuule](https://github.com/vuule) +- Build CUDA version agnostic packages for dask-cudf ([#9578](https://github.com/rapidsai/cudf/pull/9578)) [@Ethyling](https://github.com/Ethyling) +- Fixed tests warning: "TYPED_TEST_CASE is deprecated, please use TYPED_TEST_SUITE" ([#9574](https://github.com/rapidsai/cudf/pull/9574)) [@ttnghia](https://github.com/ttnghia) +- Enable CMake format in CI and fix style ([#9570](https://github.com/rapidsai/cudf/pull/9570)) [@vyasr](https://github.com/vyasr) +- Add NVTX Start/End Ranges to JNI ([#9563](https://github.com/rapidsai/cudf/pull/9563)) [@abellina](https://github.com/abellina) +- Add librdkafka and python-confluent-kafka to dev conda environments s… ([#9562](https://github.com/rapidsai/cudf/pull/9562)) [@jdye64](https://github.com/jdye64) +- Add offsets_begin/end() to strings_column_view ([#9559](https://github.com/rapidsai/cudf/pull/9559)) [@davidwendt](https://github.com/davidwendt) +- remove alignment options for RMM jni 
([#9550](https://github.com/rapidsai/cudf/pull/9550)) [@rongou](https://github.com/rongou) +- Add axis parameter passthrough to `DataFrame` and `Series` take for pandas API compatibility ([#9549](https://github.com/rapidsai/cudf/pull/9549)) [@dantegd](https://github.com/dantegd) +- Remove sizeof and standardize on memory_usage ([#9544](https://github.com/rapidsai/cudf/pull/9544)) [@vyasr](https://github.com/vyasr) +- Adds cudaProfilerStart/cudaProfilerStop in JNI api ([#9543](https://github.com/rapidsai/cudf/pull/9543)) [@abellina](https://github.com/abellina) +- Generalize comparison binary operations ([#9542](https://github.com/rapidsai/cudf/pull/9542)) [@vyasr](https://github.com/vyasr) +- Expose APIs to wrap CUDA or RMM allocations with a Java device buffer instance ([#9538](https://github.com/rapidsai/cudf/pull/9538)) [@jlowe](https://github.com/jlowe) +- Add scan sum support for duration types to libcudf ([#9536](https://github.com/rapidsai/cudf/pull/9536)) [@davidwendt](https://github.com/davidwendt) +- Force inlining to improve AST performance ([#9530](https://github.com/rapidsai/cudf/pull/9530)) [@vyasr](https://github.com/vyasr) +- Generalize some more indexed frame methods ([#9529](https://github.com/rapidsai/cudf/pull/9529)) [@vyasr](https://github.com/vyasr) +- Add Java bindings for rolling window stddev aggregation ([#9527](https://github.com/rapidsai/cudf/pull/9527)) [@razajafri](https://github.com/razajafri) +- catch rmm::out_of_memory exceptions in jni ([#9525](https://github.com/rapidsai/cudf/pull/9525)) [@rongou](https://github.com/rongou) +- Add an overload of `make_empty_column` with `type_id` parameter ([#9524](https://github.com/rapidsai/cudf/pull/9524)) [@ttnghia](https://github.com/ttnghia) +- Accelerate conditional inner joins with larger right tables ([#9523](https://github.com/rapidsai/cudf/pull/9523)) [@vyasr](https://github.com/vyasr) +- Initial pass of generalizing `decimal` support in `cudf` python layer 
([#9517](https://github.com/rapidsai/cudf/pull/9517)) [@galipremsagar](https://github.com/galipremsagar) +- Cleanup for flattening nested columns ([#9509](https://github.com/rapidsai/cudf/pull/9509)) [@rwlee](https://github.com/rwlee) +- Enable running tests using RMM arena and async memory resources ([#9506](https://github.com/rapidsai/cudf/pull/9506)) [@rongou](https://github.com/rongou) +- Remove dependency on six. ([#9495](https://github.com/rapidsai/cudf/pull/9495)) [@bdice](https://github.com/bdice) +- Cleanup some libcudf strings gtests ([#9489](https://github.com/rapidsai/cudf/pull/9489)) [@davidwendt](https://github.com/davidwendt) +- Rename strings/array_tests.cu to strings/array_tests.cpp ([#9480](https://github.com/rapidsai/cudf/pull/9480)) [@davidwendt](https://github.com/davidwendt) +- Refactor sorting APIs ([#9464](https://github.com/rapidsai/cudf/pull/9464)) [@vyasr](https://github.com/vyasr) +- Implement DataFrame.hash_values, deprecate DataFrame.hash_columns. ([#9458](https://github.com/rapidsai/cudf/pull/9458)) [@bdice](https://github.com/bdice) +- Deprecate Series.hash_encode. ([#9457](https://github.com/rapidsai/cudf/pull/9457)) [@bdice](https://github.com/bdice) +- Update `conda` recipes for Enhanced Compatibility effort ([#9456](https://github.com/rapidsai/cudf/pull/9456)) [@ajschmidt8](https://github.com/ajschmidt8) +- Small clean up to simplify column selection code in ORC reader ([#9444](https://github.com/rapidsai/cudf/pull/9444)) [@vuule](https://github.com/vuule) +- add missing stream to scalar.is_valid() wherever stream is available ([#9436](https://github.com/rapidsai/cudf/pull/9436)) [@karthikeyann](https://github.com/karthikeyann) +- Adds Deprecation Warnings to `one_hot_encoding` and Implement `get_dummies` with Cython API ([#9435](https://github.com/rapidsai/cudf/pull/9435)) [@isVoid](https://github.com/isVoid) +- Update pre-commit hook URLs. 
([#9433](https://github.com/rapidsai/cudf/pull/9433)) [@bdice](https://github.com/bdice) +- Remove pyarrow import in `dask_cudf.io.parquet` ([#9429](https://github.com/rapidsai/cudf/pull/9429)) [@charlesbluca](https://github.com/charlesbluca) +- Miscellaneous improvements for UDFs ([#9422](https://github.com/rapidsai/cudf/pull/9422)) [@isVoid](https://github.com/isVoid) +- Use pre-commit for CI ([#9412](https://github.com/rapidsai/cudf/pull/9412)) [@vyasr](https://github.com/vyasr) +- Update to UCX-Py 0.23 ([#9407](https://github.com/rapidsai/cudf/pull/9407)) [@pentschev](https://github.com/pentschev) +- Expose OutOfBoundsPolicy in JNI for Table.gather ([#9406](https://github.com/rapidsai/cudf/pull/9406)) [@abellina](https://github.com/abellina) +- Improvements to tdigest aggregation code. ([#9403](https://github.com/rapidsai/cudf/pull/9403)) [@nvdbaranec](https://github.com/nvdbaranec) +- Add Java API to deserialize a table to host columns ([#9402](https://github.com/rapidsai/cudf/pull/9402)) [@jlowe](https://github.com/jlowe) +- Frame copy to use __class__ instead of type() ([#9397](https://github.com/rapidsai/cudf/pull/9397)) [@madsbk](https://github.com/madsbk) +- Change all DeprecationWarnings to FutureWarning. 
([#9392](https://github.com/rapidsai/cudf/pull/9392)) [@bdice](https://github.com/bdice) +- Update Java nvcomp JNI bindings to nvcomp 2.x API ([#9384](https://github.com/rapidsai/cudf/pull/9384)) [@jbrennan333](https://github.com/jbrennan333) +- Add IndexedFrame class and move SingleColumnFrame to a separate module ([#9378](https://github.com/rapidsai/cudf/pull/9378)) [@vyasr](https://github.com/vyasr) +- Support Arrow NativeFile and PythonFile for remote ORC storage ([#9377](https://github.com/rapidsai/cudf/pull/9377)) [@rjzamora](https://github.com/rjzamora) +- Use Arrow PythonFile for remote CSV storage ([#9376](https://github.com/rapidsai/cudf/pull/9376)) [@rjzamora](https://github.com/rjzamora) +- Add multi-threaded writing to GDS writes ([#9372](https://github.com/rapidsai/cudf/pull/9372)) [@devavret](https://github.com/devavret) +- Miscellaneous column cleanup ([#9370](https://github.com/rapidsai/cudf/pull/9370)) [@vyasr](https://github.com/vyasr) +- Use single kernel to extract all groups in cudf::strings::extract ([#9358](https://github.com/rapidsai/cudf/pull/9358)) [@davidwendt](https://github.com/davidwendt) +- Consolidate binary ops into `Frame` ([#9357](https://github.com/rapidsai/cudf/pull/9357)) [@isVoid](https://github.com/isVoid) +- Move rank scan implementations from scan_inclusive.cu to rank_scan.cu ([#9351](https://github.com/rapidsai/cudf/pull/9351)) [@davidwendt](https://github.com/davidwendt) +- Remove usage of deprecated thrust::host_space_tag. ([#9350](https://github.com/rapidsai/cudf/pull/9350)) [@bdice](https://github.com/bdice) +- Use Default Memory Resource for Temporaries in `reduction.cpp` ([#9344](https://github.com/rapidsai/cudf/pull/9344)) [@isVoid](https://github.com/isVoid) +- Fix Cython compilation warnings. 
([#9327](https://github.com/rapidsai/cudf/pull/9327)) [@bdice](https://github.com/bdice) +- Fix some unused variable warnings in libcudf ([#9326](https://github.com/rapidsai/cudf/pull/9326)) [@davidwendt](https://github.com/davidwendt) +- Use optional-iterator for copy-if-else kernel ([#9324](https://github.com/rapidsai/cudf/pull/9324)) [@davidwendt](https://github.com/davidwendt) +- Remove Table class ([#9315](https://github.com/rapidsai/cudf/pull/9315)) [@vyasr](https://github.com/vyasr) +- Unpin `dask` and `distributed` in CI ([#9307](https://github.com/rapidsai/cudf/pull/9307)) [@galipremsagar](https://github.com/galipremsagar) +- Add optional-iterator support to indexalator ([#9306](https://github.com/rapidsai/cudf/pull/9306)) [@davidwendt](https://github.com/davidwendt) +- Consolidate more methods in Frame ([#9305](https://github.com/rapidsai/cudf/pull/9305)) [@vyasr](https://github.com/vyasr) +- Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf ([#9304](https://github.com/rapidsai/cudf/pull/9304)) [@rjzamora](https://github.com/rjzamora) +- Pin mypy in .pre-commit-config.yaml to match conda environment pinning. 
([#9300](https://github.com/rapidsai/cudf/pull/9300)) [@bdice](https://github.com/bdice) +- Use gather.hpp when gather-map exists in device memory ([#9299](https://github.com/rapidsai/cudf/pull/9299)) [@davidwendt](https://github.com/davidwendt) +- Fix Automerger for `Branch-21.12` from `branch-21.10` ([#9285](https://github.com/rapidsai/cudf/pull/9285)) [@galipremsagar](https://github.com/galipremsagar) +- Refactor cuIO timestamp processing with `cuda::std::chrono` ([#9278](https://github.com/rapidsai/cudf/pull/9278)) [@PointKernel](https://github.com/PointKernel) +- Change strings copy_if_else to use optional-iterator instead of pair-iterator ([#9266](https://github.com/rapidsai/cudf/pull/9266)) [@davidwendt](https://github.com/davidwendt) +- Update cudf java bindings to 21.12.0-SNAPSHOT ([#9248](https://github.com/rapidsai/cudf/pull/9248)) [@pxLi](https://github.com/pxLi) +- Various internal MultiIndex improvements ([#9243](https://github.com/rapidsai/cudf/pull/9243)) [@vyasr](https://github.com/vyasr) +- Add detail interface for `split` and `slice(table_view)`, refactors both function with `host_span` ([#9226](https://github.com/rapidsai/cudf/pull/9226)) [@isVoid](https://github.com/isVoid) +- Refactor MD5 implementation. ([#9212](https://github.com/rapidsai/cudf/pull/9212)) [@bdice](https://github.com/bdice) +- Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. 
([#9195](https://github.com/rapidsai/cudf/pull/9195)) [@karthikeyann](https://github.com/karthikeyann) +- Use nvcomp's snappy decompressor in avro reader ([#9181](https://github.com/rapidsai/cudf/pull/9181)) [@devavret](https://github.com/devavret) +- Add `isocalendar` API support ([#9169](https://github.com/rapidsai/cudf/pull/9169)) [@marlenezw](https://github.com/marlenezw) +- Simplify read_json by removing unnecessary reader/impl classes ([#9088](https://github.com/rapidsai/cudf/pull/9088)) [@cwharris](https://github.com/cwharris) +- Simplify read_csv by removing unnecessary reader/impl classes ([#9041](https://github.com/rapidsai/cudf/pull/9041)) [@cwharris](https://github.com/cwharris) +- Refactor hash join with cuCollections multimap ([#8934](https://github.com/rapidsai/cudf/pull/8934)) [@PointKernel](https://github.com/PointKernel) # cuDF 21.10.00 (7 Oct 2021) @@ -1481,7 +1704,7 @@ Please see https://github.com/rapidsai/cudf/releases/tag/v21.12.00a for the late - PR #6459 Add `map` method to series - PR #6379 Add list hashing functionality to MD5 - PR #6498 Add helper method to ColumnBuilder with some nits -- PR #6336 Add `join` functionality in cudf concat +- PR #6336 Add `join` functionality in cudf concat - PR #6653 Replaced SHFL_XOR calls with cub::WarpReduce - PR #6751 Rework ColumnViewAccess and its usage - PR #6698 Remove macros from ORC reader and writer diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index aae62fbd47c..6d1c0528832 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -86,7 +86,7 @@ git submodule update --init --remote --recursive ```bash # create the conda environment (assuming in base `cudf` directory) # note: RAPIDS currently doesn't support `channel_priority: strict`; use `channel_priority: flexible` instead -conda env create --name cudf_dev --file conda/environments/cudf_dev_cuda11.0.yml +conda env create --name cudf_dev --file conda/environments/cudf_dev_cuda11.5.yml # activate the environment conda activate cudf_dev ``` diff 
--git a/build.sh b/build.sh index d0ccd4821e0..adf6e220744 100755 --- a/build.sh +++ b/build.sh @@ -172,6 +172,12 @@ if buildAll || hasArg libcudf; then echo "Building for *ALL* supported GPU architectures..." fi + # get the current count before the compile starts + FILES_IN_CCACHE="" + if [ -x "$(command -v ccache)" ]; then + FILES_IN_CCACHE=$(ccache -s | grep "files in cache") + fi + cmake -S $REPODIR/cpp -B ${LIB_BUILD_DIR} \ -DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX} \ ${CUDF_CMAKE_CUDA_ARCHITECTURES} \ @@ -185,7 +191,19 @@ if buildAll || hasArg libcudf; then cd ${LIB_BUILD_DIR} + compile_start=$(date +%s) cmake --build . -j${PARALLEL_LEVEL} ${VERBOSE_FLAG} + compile_end=$(date +%s) + compile_total=$(( compile_end - compile_start )) + + # Record build times + if [[ -f "${LIB_BUILD_DIR}/.ninja_log" ]]; then + echo "Formatting build times" + python ${REPODIR}/cpp/scripts/sort_ninja_log.py ${LIB_BUILD_DIR}/.ninja_log --fmt xml > ${LIB_BUILD_DIR}/ninja_log.xml + message="$FILES_IN_CCACHE

$PARALLEL_LEVEL parallel build time is $compile_total seconds" + echo "$message" + python ${REPODIR}/cpp/scripts/sort_ninja_log.py ${LIB_BUILD_DIR}/.ninja_log --fmt html --msg "$message" > ${LIB_BUILD_DIR}/ninja_log.html + fi if [[ ${INSTALL_TARGET} != "" ]]; then cmake --build . -j${PARALLEL_LEVEL} --target install ${VERBOSE_FLAG} diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh index d8b5cc7ba4c..a557a2ef066 100755 --- a/ci/gpu/build.sh +++ b/ci/gpu/build.sh @@ -33,6 +33,9 @@ export MINOR_VERSION=`echo $GIT_DESCRIBE_TAG | grep -o -E '([0-9]+\.[0-9]+)'` # Dask & Distributed git tag export DASK_DISTRIBUTED_GIT_TAG='2021.11.2' +# ucx-py version +export UCX_PY_VERSION='0.24.*' + ################################################################################ # TRAP - Setup trap for removing jitify cache ################################################################################ @@ -83,7 +86,7 @@ gpuci_mamba_retry install -y \ "rapids-notebook-env=$MINOR_VERSION.*" \ "dask-cuda=${MINOR_VERSION}" \ "rmm=$MINOR_VERSION.*" \ - "ucx-py=0.24.*" + "ucx-py=${UCX_PY_VERSION}" # https://docs.rapids.ai/maintainers/depmgmt/ # gpuci_mamba_retry remove --force rapids-build-env rapids-notebook-env @@ -166,16 +169,26 @@ else gpuci_logger "Check GPU usage" nvidia-smi - gpuci_logger "GoogleTests" set -x cd $LIB_BUILD_DIR + gpuci_logger "GoogleTests" + for gt in gtests/* ; do test_name=$(basename ${gt}) echo "Running GoogleTest $test_name" ${gt} --gtest_output=xml:"$WORKSPACE/test-results/" done + # Copy libcudf build time results + echo "Checking for build time log $LIB_BUILD_DIR/ninja_log.html" + if [[ -f "$LIB_BUILD_DIR/ninja_log.html" ]]; then + gpuci_logger "Copying build time results" + cp "$LIB_BUILD_DIR/ninja_log.xml" "$WORKSPACE/test-results/buildtimes-junit.xml" + mkdir -p "$WORKSPACE/build-metrics" + cp "$LIB_BUILD_DIR/ninja_log.html" "$WORKSPACE/build-metrics/BuildMetrics.html" + fi + ################################################################################ # 
MEMCHECK - Run compute-sanitizer on GoogleTest (only in nightly builds) ################################################################################ @@ -206,7 +219,7 @@ else KAFKA_CONDA_FILE=${KAFKA_CONDA_FILE//-/=} #convert to conda install gpuci_logger "Installing $CUDF_CONDA_FILE & $KAFKA_CONDA_FILE" - conda install -c ${CONDA_ARTIFACT_PATH} "$CUDF_CONDA_FILE" "$KAFKA_CONDA_FILE" + gpuci_mamba_retry install -c ${CONDA_ARTIFACT_PATH} "$CUDF_CONDA_FILE" "$KAFKA_CONDA_FILE" install_dask diff --git a/ci/gpu/java.sh b/ci/gpu/java.sh index bada16bd40e..6f7038d21d7 100755 --- a/ci/gpu/java.sh +++ b/ci/gpu/java.sh @@ -30,6 +30,9 @@ export CONDA_ARTIFACT_PATH="$WORKSPACE/ci/artifacts/cudf/cpu/.conda-bld/" export GIT_DESCRIBE_TAG=`git describe --tags` export MINOR_VERSION=`echo $GIT_DESCRIBE_TAG | grep -o -E '([0-9]+\.[0-9]+)'` +# ucx-py version +export UCX_PY_VERSION='0.24.*' + ################################################################################ # TRAP - Setup trap for removing jitify cache ################################################################################ @@ -80,7 +83,7 @@ gpuci_conda_retry install -y \ "rapids-notebook-env=$MINOR_VERSION.*" \ "dask-cuda=${MINOR_VERSION}" \ "rmm=$MINOR_VERSION.*" \ - "ucx-py=0.24.*" \ + "ucx-py=${UCX_PY_VERSION}" \ "openjdk=8.*" \ "maven" diff --git a/ci/release/update-version.sh b/ci/release/update-version.sh index 86432a92128..1105b9c194d 100755 --- a/ci/release/update-version.sh +++ b/ci/release/update-version.sh @@ -21,6 +21,7 @@ CURRENT_SHORT_TAG=${CURRENT_MAJOR}.${CURRENT_MINOR} NEXT_MAJOR=$(echo $NEXT_FULL_TAG | awk '{split($0, a, "."); print a[1]}') NEXT_MINOR=$(echo $NEXT_FULL_TAG | awk '{split($0, a, "."); print a[2]}') NEXT_SHORT_TAG=${NEXT_MAJOR}.${NEXT_MINOR} +NEXT_UCX_PY_VERSION="$(curl -sL https://version.gpuci.io/rapids/${NEXT_SHORT_TAG}).*" echo "Preparing release $CURRENT_TAG => $NEXT_FULL_TAG" @@ -62,3 +63,7 @@ sed_runner "s/cudf=${CURRENT_SHORT_TAG}/cudf=${NEXT_SHORT_TAG}/g" README.md # 
Libcudf examples update sed_runner "s/CUDF_TAG branch-${CURRENT_SHORT_TAG}/CUDF_TAG branch-${NEXT_SHORT_TAG}/" cpp/examples/basic/CMakeLists.txt + +# ucx-py version update +sed_runner "s/export UCX_PY_VERSION=.*/export UCX_PY_VERSION='${NEXT_UCX_PY_VERSION}'/g" ci/gpu/build.sh +sed_runner "s/export UCX_PY_VERSION=.*/export UCX_PY_VERSION='${NEXT_UCX_PY_VERSION}'/g" ci/gpu/java.sh diff --git a/conda/environments/cudf_dev_cuda11.0.yml b/conda/environments/cudf_dev_cuda11.0.yml deleted file mode 100644 index e7b92eddd9e..00000000000 --- a/conda/environments/cudf_dev_cuda11.0.yml +++ /dev/null @@ -1,69 +0,0 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. - -name: cudf_dev -channels: - - rapidsai - - nvidia - - rapidsai-nightly - - conda-forge -dependencies: - - clang=11.1.0 - - clang-tools=11.1.0 - - cupy>=9.5.0,<10.0.0a0 - - rmm=22.02.* - - cmake>=3.20.1 - - cmake_setuptools>=0.1.3 - - python>=3.7,<3.9 - - numba>=0.54 - - numpy - - pandas>=1.0,<1.4.0dev0 - - pyarrow=5.0.0=*cuda - - fastavro>=0.22.9 - - python-snappy>=0.6.0 - - notebook>=0.5.0 - - cython>=0.29,<0.30 - - fsspec>=0.6.0 - - pytest - - pytest-benchmark - - pytest-xdist - - sphinx - - sphinxcontrib-websupport - - nbsphinx - - numpydoc - - ipython - - pandoc=<2.0.0 - - cudatoolkit=11.0 - - pip - - flake8=3.8.3 - - black=19.10 - - isort=5.6.4 - - mypy=0.782 - - pydocstyle=6.1.1 - - typing_extensions - - pre-commit - - dask>=2021.11.1,<=2021.11.2 - - distributed>=2021.11.1,<=2021.11.2 - - streamz - - arrow-cpp=5.0.0 - - dlpack>=0.5,<0.6.0a0 - - arrow-cpp-proc * cuda - - double-conversion - - rapidjson - - hypothesis - - sphinx-markdown-tables - - sphinx-copybutton - - mimesis<4.1 - - packaging - - protobuf - - nvtx>=0.2.1 - - cachetools - - transformers<=4.10.3 - - pydata-sphinx-theme - - librdkafka=1.7.0 - - python-confluent-kafka=1.7.0 - - pip: - - git+https://github.com/dask/dask.git@main - - git+https://github.com/dask/distributed.git@main - - git+https://github.com/python-streamz/streamz.git@master - - 
pyorc - - ptxcompiler # [linux64] diff --git a/conda/environments/cudf_dev_cuda11.2.yml b/conda/environments/cudf_dev_cuda11.2.yml deleted file mode 100644 index 6fe8ed0fafe..00000000000 --- a/conda/environments/cudf_dev_cuda11.2.yml +++ /dev/null @@ -1,69 +0,0 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. - -name: cudf_dev -channels: - - rapidsai - - nvidia - - rapidsai-nightly - - conda-forge -dependencies: - - clang=11.1.0 - - clang-tools=11.1.0 - - cupy>=9.5.0,<10.0.0a0 - - rmm=22.02.* - - cmake>=3.20.1 - - cmake_setuptools>=0.1.3 - - python>=3.7,<3.9 - - numba>=0.54 - - numpy - - pandas>=1.0,<1.4.0dev0 - - pyarrow=5.0.0=*cuda - - fastavro>=0.22.9 - - python-snappy>=0.6.0 - - notebook>=0.5.0 - - cython>=0.29,<0.30 - - fsspec>=0.6.0 - - pytest - - pytest-benchmark - - pytest-xdist - - sphinx - - sphinxcontrib-websupport - - nbsphinx - - numpydoc - - ipython - - pandoc=<2.0.0 - - cudatoolkit=11.2 - - pip - - flake8=3.8.3 - - black=19.10 - - isort=5.6.4 - - mypy=0.782 - - pydocstyle=6.1.1 - - typing_extensions - - pre-commit - - dask>=2021.11.1,<=2021.11.2 - - distributed>=2021.11.1,<=2021.11.2 - - streamz - - arrow-cpp=5.0.0 - - dlpack>=0.5,<0.6.0a0 - - arrow-cpp-proc * cuda - - double-conversion - - rapidjson - - hypothesis - - sphinx-markdown-tables - - sphinx-copybutton - - mimesis<4.1 - - packaging - - protobuf - - nvtx>=0.2.1 - - cachetools - - transformers<=4.10.3 - - pydata-sphinx-theme - - librdkafka=1.7.0 - - python-confluent-kafka=1.7.0 - - pip: - - git+https://github.com/dask/dask.git@main - - git+https://github.com/dask/distributed.git@main - - git+https://github.com/python-streamz/streamz.git@master - - pyorc - - ptxcompiler # [linux64] diff --git a/conda/recipes/cudf/meta.yaml b/conda/recipes/cudf/meta.yaml index 46eefbc825f..2600ab358cc 100644 --- a/conda/recipes/cudf/meta.yaml +++ b/conda/recipes/cudf/meta.yaml @@ -3,7 +3,7 @@ {% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %} {% set 
minor_version = version.split('.')[0] + '.' + version.split('.')[1] %} {% set py_version=environ.get('CONDA_PY', 36) %} -{% set cuda_version='.'.join(environ.get('CUDA', '10.1').split('.')[:2]) %} +{% set cuda_version='.'.join(environ.get('CUDA', '11.5').split('.')[:2]) %} {% set cuda_major=cuda_version.split('.')[0] %} package: diff --git a/conda/recipes/cudf_kafka/meta.yaml b/conda/recipes/cudf_kafka/meta.yaml index af27d888b46..e450d306cbe 100644 --- a/conda/recipes/cudf_kafka/meta.yaml +++ b/conda/recipes/cudf_kafka/meta.yaml @@ -3,7 +3,7 @@ {% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %} {% set minor_version = version.split('.')[0] + '.' + version.split('.')[1] %} {% set py_version=environ.get('CONDA_PY', 36) %} -{% set cuda_version='.'.join(environ.get('CUDA', '10.1').split('.')[:2]) %} +{% set cuda_version='.'.join(environ.get('CUDA', '11.5').split('.')[:2]) %} package: name: cudf_kafka diff --git a/conda/recipes/custreamz/meta.yaml b/conda/recipes/custreamz/meta.yaml index db8af9b0bed..a8b096d4892 100644 --- a/conda/recipes/custreamz/meta.yaml +++ b/conda/recipes/custreamz/meta.yaml @@ -3,7 +3,7 @@ {% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %} {% set minor_version = version.split('.')[0] + '.' 
+ version.split('.')[1] %} {% set py_version=environ.get('CONDA_PY', 36) %} -{% set cuda_version='.'.join(environ.get('CUDA', '10.1').split('.')[:2]) %} +{% set cuda_version='.'.join(environ.get('CUDA', '11.5').split('.')[:2]) %} package: name: custreamz @@ -29,7 +29,7 @@ requirements: - cudf_kafka {{ version }} run: - python - - streamz + - streamz - cudf {{ version }} - dask>=2021.11.1,<=2021.11.2 - distributed>=2021.11.1,<=2021.11.2 diff --git a/conda/recipes/dask-cudf/meta.yaml b/conda/recipes/dask-cudf/meta.yaml index d90de2d628c..da8bcea430a 100644 --- a/conda/recipes/dask-cudf/meta.yaml +++ b/conda/recipes/dask-cudf/meta.yaml @@ -3,7 +3,7 @@ {% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %} {% set minor_version = version.split('.')[0] + '.' + version.split('.')[1] %} {% set py_version=environ.get('CONDA_PY', 36) %} -{% set cuda_version='.'.join(environ.get('CUDA', '10.1').split('.')[:2]) %} +{% set cuda_version='.'.join(environ.get('CUDA', '11.5').split('.')[:2]) %} {% set cuda_major=cuda_version.split('.')[0] %} package: @@ -40,6 +40,8 @@ requirements: test: # [linux64] requires: # [linux64] - cudatoolkit {{ cuda_version }}.* # [linux64] + imports: # [linux64] + - dask_cudf # [linux64] about: diff --git a/conda/recipes/libcudf/meta.yaml b/conda/recipes/libcudf/meta.yaml index e78110f3233..bd9b76e4890 100644 --- a/conda/recipes/libcudf/meta.yaml +++ b/conda/recipes/libcudf/meta.yaml @@ -2,7 +2,7 @@ {% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %} {% set minor_version = version.split('.')[0] + '.' 
+ version.split('.')[1] %} -{% set cuda_version='.'.join(environ.get('CUDA', '10.1').split('.')[:2]) %} +{% set cuda_version='.'.join(environ.get('CUDA', '11.5').split('.')[:2]) %} {% set cuda_major=cuda_version.split('.')[0] %} package: diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 32760168eaf..622cfe29f13 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -346,6 +346,7 @@ add_library( src/lists/lists_column_factories.cu src/lists/lists_column_view.cu src/lists/segmented_sort.cu + src/lists/sequences.cu src/merge/merge.cu src/partitioning/partitioning.cu src/partitioning/round_robin.cu @@ -419,7 +420,8 @@ add_library( src/strings/copying/concatenate.cu src/strings/copying/copying.cu src/strings/copying/shift.cu - src/strings/extract.cu + src/strings/extract/extract.cu + src/strings/extract/extract_all.cu src/strings/filling/fill.cu src/strings/filter_chars.cu src/strings/findall.cu diff --git a/cpp/benchmarks/common/generate_benchmark_input.cpp b/cpp/benchmarks/common/generate_benchmark_input.cpp index 0ec2590bdb5..995cea13c27 100644 --- a/cpp/benchmarks/common/generate_benchmark_input.cpp +++ b/cpp/benchmarks/common/generate_benchmark_input.cpp @@ -161,8 +161,29 @@ struct random_value_fn()>> { */ template struct random_value_fn()>> { - random_value_fn(distribution_params const&) {} - T operator()(std::mt19937& engine) { CUDF_FAIL("Not implemented"); } + using rep = typename T::rep; + rep const lower_bound; + rep const upper_bound; + distribution_fn dist; + std::optional scale; + + random_value_fn(distribution_params const& desc) + : lower_bound{desc.lower_bound}, + upper_bound{desc.upper_bound}, + dist{make_distribution(desc.id, desc.lower_bound, desc.upper_bound)} + { + } + + T operator()(std::mt19937& engine) + { + if (not scale.has_value()) { + int const max_scale = std::numeric_limits::digits10; + auto scale_dist = make_distribution(distribution_id::NORMAL, -max_scale, max_scale); + scale = 
numeric::scale_type{std::max(std::min(scale_dist(engine), max_scale), -max_scale)}; + } + // Clamp the generated random value to the specified range + return T{std::max(std::min(dist(engine), upper_bound), lower_bound), *scale}; + } }; /** diff --git a/cpp/benchmarks/common/generate_benchmark_input.hpp b/cpp/benchmarks/common/generate_benchmark_input.hpp index 6ea57c0a7ad..3dbc6561839 100644 --- a/cpp/benchmarks/common/generate_benchmark_input.hpp +++ b/cpp/benchmarks/common/generate_benchmark_input.hpp @@ -216,6 +216,7 @@ class data_profile { distribution_params string_dist_desc{{distribution_id::NORMAL, 0, 32}}; distribution_params list_dist_desc{ cudf::type_id::INT32, {distribution_id::GEOMETRIC, 0, 100}, 2}; + std::map> decimal_params; double bool_probability = 0.5; double null_frequency = 0.01; @@ -284,9 +285,17 @@ class data_profile { } template ()>* = nullptr> - distribution_params get_distribution_params() const + distribution_params get_distribution_params() const { - CUDF_FAIL("Not implemented"); + using rep = typename T::rep; + auto it = decimal_params.find(cudf::type_to_id()); + if (it == decimal_params.end()) { + auto const range = default_range(); + return distribution_params{default_distribution_id(), range.first, range.second}; + } else { + auto& desc = it->second; + return {desc.id, static_cast(desc.lower_bound), static_cast(desc.upper_bound)}; + } } auto get_bool_probability() const { return bool_probability; } diff --git a/cpp/benchmarks/common/random_distribution_factory.hpp b/cpp/benchmarks/common/random_distribution_factory.hpp index c21fb645573..65dc8b4dd4d 100644 --- a/cpp/benchmarks/common/random_distribution_factory.hpp +++ b/cpp/benchmarks/common/random_distribution_factory.hpp @@ -21,19 +21,24 @@ #include #include +/** + * @brief Generates a normal(binomial) distribution between zero and upper_bound. 
+ */ template ::value, T>* = nullptr> -auto make_normal_dist(T range_start, T range_end) +auto make_normal_dist(T upper_bound) { - using uT = typename std::make_unsigned::type; - uT const range_size = range_end - range_start; - return std::binomial_distribution(range_size, 0.5); + using uT = typename std::make_unsigned::type; + return std::binomial_distribution(upper_bound, 0.5); } +/** + * @brief Generates a normal distribution between zero and upper_bound. + */ template ()>* = nullptr> -auto make_normal_dist(T range_start, T range_end) +auto make_normal_dist(T upper_bound) { - T const mean = range_start / 2 + range_end / 2; - T const stddev = range_end / 6 - range_start / 6; + T const mean = upper_bound / 2; + T const stddev = upper_bound / 6; return std::normal_distribution(mean, stddev); } @@ -82,8 +87,8 @@ distribution_fn make_distribution(distribution_id did, T lower_bound, T upper { switch (did) { case distribution_id::NORMAL: - return [lower_bound, dist = make_normal_dist(lower_bound, upper_bound)]( - std::mt19937& engine) mutable -> T { return dist(engine) - lower_bound; }; + return [lower_bound, dist = make_normal_dist(upper_bound - lower_bound)]( + std::mt19937& engine) mutable -> T { return dist(engine) + lower_bound; }; case distribution_id::UNIFORM: return [dist = make_uniform_dist(lower_bound, upper_bound)]( std::mt19937& engine) mutable -> T { return dist(engine); }; @@ -104,8 +109,8 @@ distribution_fn make_distribution(distribution_id dist_id, T lower_bound, T u { switch (dist_id) { case distribution_id::NORMAL: - return [dist = make_normal_dist(lower_bound, upper_bound)]( - std::mt19937& engine) mutable -> T { return dist(engine); }; + return [lower_bound, dist = make_normal_dist(upper_bound - lower_bound)]( + std::mt19937& engine) mutable -> T { return dist(engine) + lower_bound; }; case distribution_id::UNIFORM: return [dist = make_uniform_dist(lower_bound, upper_bound)]( std::mt19937& engine) mutable -> T { return dist(engine); }; diff --git 
a/cpp/benchmarks/io/csv/csv_reader_benchmark.cpp b/cpp/benchmarks/io/csv/csv_reader_benchmark.cpp index 3f5549a3148..77bf4b03a14 100644 --- a/cpp/benchmarks/io/csv/csv_reader_benchmark.cpp +++ b/cpp/benchmarks/io/csv/csv_reader_benchmark.cpp @@ -70,6 +70,7 @@ void BM_csv_read_varying_options(benchmark::State& state) auto const data_types = dtypes_for_column_selection(get_type_or_group({int32_t(type_group_id::INTEGRAL), int32_t(type_group_id::FLOATING_POINT), + int32_t(type_group_id::FIXED_POINT), int32_t(type_group_id::TIMESTAMP), int32_t(cudf::type_id::STRING)}), col_sel); @@ -143,6 +144,7 @@ void BM_csv_read_varying_options(benchmark::State& state) RD_BENCHMARK_DEFINE_ALL_SOURCES(CSV_RD_BM_INPUTS_DEFINE, integral, type_group_id::INTEGRAL); RD_BENCHMARK_DEFINE_ALL_SOURCES(CSV_RD_BM_INPUTS_DEFINE, floats, type_group_id::FLOATING_POINT); +RD_BENCHMARK_DEFINE_ALL_SOURCES(CSV_RD_BM_INPUTS_DEFINE, decimal, type_group_id::FIXED_POINT); RD_BENCHMARK_DEFINE_ALL_SOURCES(CSV_RD_BM_INPUTS_DEFINE, timestamps, type_group_id::TIMESTAMP); RD_BENCHMARK_DEFINE_ALL_SOURCES(CSV_RD_BM_INPUTS_DEFINE, string, cudf::type_id::STRING); diff --git a/cpp/benchmarks/io/csv/csv_writer_benchmark.cpp b/cpp/benchmarks/io/csv/csv_writer_benchmark.cpp index fdd7c63eece..9baab6b2571 100644 --- a/cpp/benchmarks/io/csv/csv_writer_benchmark.cpp +++ b/cpp/benchmarks/io/csv/csv_writer_benchmark.cpp @@ -63,6 +63,7 @@ void BM_csv_write_varying_options(benchmark::State& state) auto const data_types = get_type_or_group({int32_t(type_group_id::INTEGRAL), int32_t(type_group_id::FLOATING_POINT), + int32_t(type_group_id::FIXED_POINT), int32_t(type_group_id::TIMESTAMP), int32_t(cudf::type_id::STRING)}); @@ -96,6 +97,7 @@ void BM_csv_write_varying_options(benchmark::State& state) WR_BENCHMARK_DEFINE_ALL_SINKS(CSV_WR_BM_INOUTS_DEFINE, integral, type_group_id::INTEGRAL); WR_BENCHMARK_DEFINE_ALL_SINKS(CSV_WR_BM_INOUTS_DEFINE, floats, type_group_id::FLOATING_POINT); 
+WR_BENCHMARK_DEFINE_ALL_SINKS(CSV_WR_BM_INOUTS_DEFINE, decimal, type_group_id::FIXED_POINT); WR_BENCHMARK_DEFINE_ALL_SINKS(CSV_WR_BM_INOUTS_DEFINE, timestamps, type_group_id::TIMESTAMP); WR_BENCHMARK_DEFINE_ALL_SINKS(CSV_WR_BM_INOUTS_DEFINE, string, cudf::type_id::STRING); diff --git a/cpp/benchmarks/io/orc/orc_reader_benchmark.cpp b/cpp/benchmarks/io/orc/orc_reader_benchmark.cpp index f0624e40149..6ab8d8d09c0 100644 --- a/cpp/benchmarks/io/orc/orc_reader_benchmark.cpp +++ b/cpp/benchmarks/io/orc/orc_reader_benchmark.cpp @@ -91,8 +91,10 @@ void BM_orc_read_varying_options(benchmark::State& state) auto const data_types = dtypes_for_column_selection(get_type_or_group({int32_t(type_group_id::INTEGRAL_SIGNED), int32_t(type_group_id::FLOATING_POINT), + int32_t(type_group_id::FIXED_POINT), int32_t(type_group_id::TIMESTAMP), - int32_t(cudf::type_id::STRING)}), + int32_t(cudf::type_id::STRING), + int32_t(cudf::type_id::LIST)}), col_sel); auto const tbl = create_random_table(data_types, data_types.size(), table_size_bytes{data_size}); auto const view = tbl->view(); @@ -158,6 +160,7 @@ void BM_orc_read_varying_options(benchmark::State& state) RD_BENCHMARK_DEFINE_ALL_SOURCES(ORC_RD_BM_INPUTS_DEFINE, integral, type_group_id::INTEGRAL_SIGNED); RD_BENCHMARK_DEFINE_ALL_SOURCES(ORC_RD_BM_INPUTS_DEFINE, floats, type_group_id::FLOATING_POINT); +RD_BENCHMARK_DEFINE_ALL_SOURCES(ORC_RD_BM_INPUTS_DEFINE, decimal, type_group_id::FIXED_POINT); RD_BENCHMARK_DEFINE_ALL_SOURCES(ORC_RD_BM_INPUTS_DEFINE, timestamps, type_group_id::TIMESTAMP); RD_BENCHMARK_DEFINE_ALL_SOURCES(ORC_RD_BM_INPUTS_DEFINE, string, cudf::type_id::STRING); RD_BENCHMARK_DEFINE_ALL_SOURCES(ORC_RD_BM_INPUTS_DEFINE, list, cudf::type_id::LIST); diff --git a/cpp/benchmarks/io/orc/orc_writer_benchmark.cpp b/cpp/benchmarks/io/orc/orc_writer_benchmark.cpp index bfa7d4fc6d9..933b3d02e08 100644 --- a/cpp/benchmarks/io/orc/orc_writer_benchmark.cpp +++ b/cpp/benchmarks/io/orc/orc_writer_benchmark.cpp @@ -70,8 +70,10 @@ void 
BM_orc_write_varying_options(benchmark::State& state) auto const data_types = get_type_or_group({int32_t(type_group_id::INTEGRAL_SIGNED), int32_t(type_group_id::FLOATING_POINT), + int32_t(type_group_id::FIXED_POINT), int32_t(type_group_id::TIMESTAMP), - int32_t(cudf::type_id::STRING)}); + int32_t(cudf::type_id::STRING), + int32_t(cudf::type_id::LIST)}); auto const tbl = create_random_table(data_types, data_types.size(), table_size_bytes{data_size}); auto const view = tbl->view(); @@ -101,6 +103,7 @@ void BM_orc_write_varying_options(benchmark::State& state) WR_BENCHMARK_DEFINE_ALL_SINKS(ORC_WR_BM_INOUTS_DEFINE, integral, type_group_id::INTEGRAL_SIGNED); WR_BENCHMARK_DEFINE_ALL_SINKS(ORC_WR_BM_INOUTS_DEFINE, floats, type_group_id::FLOATING_POINT); +WR_BENCHMARK_DEFINE_ALL_SINKS(ORC_WR_BM_INOUTS_DEFINE, decimal, type_group_id::FIXED_POINT); WR_BENCHMARK_DEFINE_ALL_SINKS(ORC_WR_BM_INOUTS_DEFINE, timestamps, type_group_id::TIMESTAMP); WR_BENCHMARK_DEFINE_ALL_SINKS(ORC_WR_BM_INOUTS_DEFINE, string, cudf::type_id::STRING); WR_BENCHMARK_DEFINE_ALL_SINKS(ORC_WR_BM_INOUTS_DEFINE, list, cudf::type_id::LIST); diff --git a/cpp/benchmarks/io/parquet/parquet_reader_benchmark.cpp b/cpp/benchmarks/io/parquet/parquet_reader_benchmark.cpp index 045aa0e043b..a68ce2bd1a1 100644 --- a/cpp/benchmarks/io/parquet/parquet_reader_benchmark.cpp +++ b/cpp/benchmarks/io/parquet/parquet_reader_benchmark.cpp @@ -92,8 +92,10 @@ void BM_parq_read_varying_options(benchmark::State& state) auto const data_types = dtypes_for_column_selection(get_type_or_group({int32_t(type_group_id::INTEGRAL), int32_t(type_group_id::FLOATING_POINT), + int32_t(type_group_id::FIXED_POINT), int32_t(type_group_id::TIMESTAMP), - int32_t(cudf::type_id::STRING)}), + int32_t(cudf::type_id::STRING), + int32_t(cudf::type_id::LIST)}), col_sel); auto const tbl = create_random_table(data_types, data_types.size(), table_size_bytes{data_size}); auto const view = tbl->view(); @@ -160,6 +162,7 @@ void 
BM_parq_read_varying_options(benchmark::State& state) RD_BENCHMARK_DEFINE_ALL_SOURCES(PARQ_RD_BM_INPUTS_DEFINE, integral, type_group_id::INTEGRAL); RD_BENCHMARK_DEFINE_ALL_SOURCES(PARQ_RD_BM_INPUTS_DEFINE, floats, type_group_id::FLOATING_POINT); +RD_BENCHMARK_DEFINE_ALL_SOURCES(PARQ_RD_BM_INPUTS_DEFINE, decimal, type_group_id::FIXED_POINT); RD_BENCHMARK_DEFINE_ALL_SOURCES(PARQ_RD_BM_INPUTS_DEFINE, timestamps, type_group_id::TIMESTAMP); RD_BENCHMARK_DEFINE_ALL_SOURCES(PARQ_RD_BM_INPUTS_DEFINE, string, cudf::type_id::STRING); RD_BENCHMARK_DEFINE_ALL_SOURCES(PARQ_RD_BM_INPUTS_DEFINE, list, cudf::type_id::LIST); diff --git a/cpp/benchmarks/io/parquet/parquet_writer_benchmark.cpp b/cpp/benchmarks/io/parquet/parquet_writer_benchmark.cpp index b4c11179c35..1af7e206692 100644 --- a/cpp/benchmarks/io/parquet/parquet_writer_benchmark.cpp +++ b/cpp/benchmarks/io/parquet/parquet_writer_benchmark.cpp @@ -71,8 +71,10 @@ void BM_parq_write_varying_options(benchmark::State& state) auto const data_types = get_type_or_group({int32_t(type_group_id::INTEGRAL_SIGNED), int32_t(type_group_id::FLOATING_POINT), + int32_t(type_group_id::FIXED_POINT), int32_t(type_group_id::TIMESTAMP), - int32_t(cudf::type_id::STRING)}); + int32_t(cudf::type_id::STRING), + int32_t(cudf::type_id::LIST)}); auto const tbl = create_random_table(data_types, data_types.size(), table_size_bytes{data_size}); auto const view = tbl->view(); @@ -85,7 +87,7 @@ void BM_parq_write_varying_options(benchmark::State& state) cudf_io::parquet_writer_options::builder(source_sink.make_sink_info(), view) .compression(compression) .stats_level(enable_stats) - .column_chunks_file_path(file_path); + .column_chunks_file_paths({file_path}); cudf_io::write_parquet(options); } @@ -103,6 +105,7 @@ void BM_parq_write_varying_options(benchmark::State& state) WR_BENCHMARK_DEFINE_ALL_SINKS(PARQ_WR_BM_INOUTS_DEFINE, integral, type_group_id::INTEGRAL); WR_BENCHMARK_DEFINE_ALL_SINKS(PARQ_WR_BM_INOUTS_DEFINE, floats, 
type_group_id::FLOATING_POINT); +WR_BENCHMARK_DEFINE_ALL_SINKS(PARQ_WR_BM_INOUTS_DEFINE, decimal, type_group_id::FIXED_POINT); WR_BENCHMARK_DEFINE_ALL_SINKS(PARQ_WR_BM_INOUTS_DEFINE, timestamps, type_group_id::TIMESTAMP); WR_BENCHMARK_DEFINE_ALL_SINKS(PARQ_WR_BM_INOUTS_DEFINE, string, cudf::type_id::STRING); WR_BENCHMARK_DEFINE_ALL_SINKS(PARQ_WR_BM_INOUTS_DEFINE, list, cudf::type_id::LIST); diff --git a/cpp/cmake/thirdparty/get_cucollections.cmake b/cpp/cmake/thirdparty/get_cucollections.cmake index b58bdb55de3..16e7a58b020 100644 --- a/cpp/cmake/thirdparty/get_cucollections.cmake +++ b/cpp/cmake/thirdparty/get_cucollections.cmake @@ -21,7 +21,7 @@ function(find_and_configure_cucollections) cuco 0.0 GLOBAL_TARGETS cuco::cuco CPM_ARGS GITHUB_REPOSITORY NVIDIA/cuCollections - GIT_TAG 6433e8ad7571f14cc5384051b049029c60dd1ce0 + GIT_TAG 193de1aa74f5721717f991ca757dc610c852bb17 OPTIONS "BUILD_TESTS OFF" "BUILD_BENCHMARKS OFF" "BUILD_EXAMPLES OFF" ) diff --git a/cpp/cmake/thirdparty/get_thrust.cmake b/cpp/cmake/thirdparty/get_thrust.cmake index 574bfa26a0c..fcf9f0d73ee 100644 --- a/cpp/cmake/thirdparty/get_thrust.cmake +++ b/cpp/cmake/thirdparty/get_thrust.cmake @@ -80,6 +80,6 @@ function(find_and_configure_thrust VERSION) endif() endfunction() -set(CUDF_MIN_VERSION_Thrust 1.12.0) +set(CUDF_MIN_VERSION_Thrust 1.15.0) find_and_configure_thrust(${CUDF_MIN_VERSION_Thrust}) diff --git a/cpp/include/cudf/datetime.hpp b/cpp/include/cudf/datetime.hpp index 17bea935dfd..117119cd40f 100644 --- a/cpp/include/cudf/datetime.hpp +++ b/cpp/include/cudf/datetime.hpp @@ -285,280 +285,66 @@ std::unique_ptr extract_quarter( cudf::column_view const& column, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); -/** @} */ // end of group - -/** - * @brief Round up to the nearest day - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. 
- * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr ceil_day( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round up to the nearest hour - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr ceil_hour( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round up to the nearest minute - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr ceil_minute( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round up to the nearest second - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. 
- * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr ceil_second( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round up to the nearest millisecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr ceil_millisecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round up to the nearest microsecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr ceil_microsecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round up to the nearest nanosecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. 
- * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr ceil_nanosecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - /** - * @brief Round down to the nearest day - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. + * @brief Fixed frequencies supported by datetime rounding functions ceil, floor, round. * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column */ -std::unique_ptr floor_day( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); +enum class rounding_frequency : int32_t { + DAY, + HOUR, + MINUTE, + SECOND, + MILLISECOND, + MICROSECOND, + NANOSECOND +}; /** - * @brief Round down to the nearest hour + * @brief Round datetimes up to the nearest multiple of the given frequency. * - * @param column cudf::column_view of the input datetime values + * @param column cudf::column_view of the input datetime values. + * @param freq rounding_frequency indicating the frequency to round up to. * @param mr Device memory resource used to allocate device memory of the returned column. * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column + * @throw cudf::logic_error if input column datatype is not TIMESTAMP. + * @return cudf::column of the same datetime resolution as the input column. 
*/ -std::unique_ptr floor_hour( +std::unique_ptr ceil_datetimes( cudf::column_view const& column, + rounding_frequency freq, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); /** - * @brief Round down to the nearest minute + * @brief Round datetimes down to the nearest multiple of the given frequency. * - * @param column cudf::column_view of the input datetime values + * @param column cudf::column_view of the input datetime values. + * @param freq rounding_frequency indicating the frequency to round down to. * @param mr Device memory resource used to allocate device memory of the returned column. * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column + * @throw cudf::logic_error if input column datatype is not TIMESTAMP. + * @return cudf::column of the same datetime resolution as the input column. */ -std::unique_ptr floor_minute( +std::unique_ptr floor_datetimes( cudf::column_view const& column, + rounding_frequency freq, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); /** - * @brief Round down to the nearest second + * @brief Round datetimes to the nearest multiple of the given frequency. * - * @param column cudf::column_view of the input datetime values + * @param column cudf::column_view of the input datetime values. + * @param freq rounding_frequency indicating the frequency to round to. * @param mr Device memory resource used to allocate device memory of the returned column. * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column + * @throw cudf::logic_error if input column datatype is not TIMESTAMP. + * @return cudf::column of the same datetime resolution as the input column. 
*/ -std::unique_ptr floor_second( +std::unique_ptr round_datetimes( cudf::column_view const& column, + rounding_frequency freq, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); -/** - * @brief Round down to the nearest millisecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr floor_millisecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round down to the nearest microsecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr floor_microsecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round down to the nearest nanosecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr floor_nanosecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round to the nearest day - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. 
- * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr round_day( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round to the nearest hour - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr round_hour( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round to the nearest minute - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr round_minute( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round to the nearest second - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. 
- * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr round_second( - cudf::column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round to the nearest millisecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr round_millisecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round to the nearest microsecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. - * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr round_microsecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); - -/** - * @brief Round to the nearest nanosecond - * - * @param column cudf::column_view of the input datetime values - * @param mr Device memory resource used to allocate device memory of the returned column. 
- * - * @throw cudf::logic_error if input column datatype is not TIMESTAMP - * @return cudf::column of the same datetime resolution as the input column - */ -std::unique_ptr round_nanosecond( - column_view const& column, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); +/** @} */ // end of group } // namespace datetime } // namespace cudf diff --git a/cpp/include/cudf/detail/hashing.hpp b/cpp/include/cudf/detail/hashing.hpp index bd5c8a42a51..0fc807593fb 100644 --- a/cpp/include/cudf/detail/hashing.hpp +++ b/cpp/include/cudf/detail/hashing.hpp @@ -32,17 +32,15 @@ namespace detail { */ std::unique_ptr hash( table_view const& input, - hash_id hash_function = hash_id::HASH_MURMUR3, - cudf::host_span initial_hash = {}, - uint32_t seed = 0, - rmm::cuda_stream_view stream = rmm::cuda_stream_default, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + hash_id hash_function = hash_id::HASH_MURMUR3, + uint32_t seed = 0, + rmm::cuda_stream_view stream = rmm::cuda_stream_default, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); std::unique_ptr murmur_hash3_32( table_view const& input, - cudf::host_span initial_hash = {}, - rmm::cuda_stream_view stream = rmm::cuda_stream_default, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + rmm::cuda_stream_view stream = rmm::cuda_stream_default, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); std::unique_ptr md5_hash( table_view const& input, diff --git a/cpp/include/cudf/detail/merge.cuh b/cpp/include/cudf/detail/merge.cuh index f141d9b5d59..ee5cb5c265d 100644 --- a/cpp/include/cudf/detail/merge.cuh +++ b/cpp/include/cudf/detail/merge.cuh @@ -80,14 +80,10 @@ struct tagged_element_relational_comparator { __device__ weak_ordering compare(index_type lhs_tagged_index, index_type rhs_tagged_index) const noexcept { - side const l_side = thrust::get<0>(lhs_tagged_index); - side const 
r_side = thrust::get<0>(rhs_tagged_index); - - cudf::size_type const l_indx = thrust::get<1>(lhs_tagged_index); - cudf::size_type const r_indx = thrust::get<1>(rhs_tagged_index); + auto const [l_side, l_indx] = lhs_tagged_index; + auto const [r_side, r_indx] = rhs_tagged_index; column_device_view const* ptr_left_dview{l_side == side::LEFT ? &lhs : &rhs}; - column_device_view const* ptr_right_dview{r_side == side::LEFT ? &lhs : &rhs}; auto erl_comparator = element_relational_comparator( diff --git a/cpp/include/cudf/dictionary/dictionary_column_view.hpp b/cpp/include/cudf/dictionary/dictionary_column_view.hpp index 1da52e67e06..42f8310040e 100644 --- a/cpp/include/cudf/dictionary/dictionary_column_view.hpp +++ b/cpp/include/cudf/dictionary/dictionary_column_view.hpp @@ -77,6 +77,11 @@ class dictionary_column_view : private column_view { */ column_view keys() const noexcept; + /** + * @brief Returns the `data_type` of the keys child column. + */ + data_type keys_type() const noexcept; + /** * @brief Returns the number of rows in the keys column. */ diff --git a/cpp/include/cudf/filling.hpp b/cpp/include/cudf/filling.hpp index aff0d20a467..905a897eb40 100644 --- a/cpp/include/cudf/filling.hpp +++ b/cpp/include/cudf/filling.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. + * Copyright (c) 2019-2021, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -169,7 +169,7 @@ std::unique_ptr repeat( * @param init First value in the sequence * @param step Increment value * @param mr Device memory resource used to allocate the returned column's device memory - * @return std::unique_ptr The result table containing the sequence + * @return The result column containing the generated sequence */ std::unique_ptr sequence( size_type size, @@ -195,7 +195,7 @@ std::unique_ptr sequence( * @param size Size of the output column * @param init First value in the sequence * @param mr Device memory resource used to allocate the returned column's device memory - * @return std::unique_ptr The result table containing the sequence + * @return The result column containing the generated sequence */ std::unique_ptr sequence( size_type size, @@ -223,7 +223,7 @@ std::unique_ptr sequence( * @param months Months to increment * @param mr Device memory resource used to allocate the returned column's device memory * - * @returns Timestamps column with sequences of months. + * @return Timestamps column with sequences of months. */ std::unique_ptr calendrical_month_sequence( size_type size, diff --git a/cpp/include/cudf/hashing.hpp b/cpp/include/cudf/hashing.hpp index 6b281c3f7f4..cce05042917 100644 --- a/cpp/include/cudf/hashing.hpp +++ b/cpp/include/cudf/hashing.hpp @@ -31,8 +31,6 @@ namespace cudf { * * @param input The table of columns to hash. * @param hash_function The hash function enum to use. - * @param initial_hash Optional host_span of initial hash values for each column. - * If this span is empty then each element will be hashed as-is. * @param seed Optional seed value to use for the hash function. * @param mr Device memory resource used to allocate the returned column's device memory. 
* @@ -40,10 +38,9 @@ namespace cudf { */ std::unique_ptr hash( table_view const& input, - hash_id hash_function = hash_id::HASH_MURMUR3, - cudf::host_span initial_hash = {}, - uint32_t seed = DEFAULT_HASH_SEED, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + hash_id hash_function = hash_id::HASH_MURMUR3, + uint32_t seed = DEFAULT_HASH_SEED, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); /** @} */ // end of group } // namespace cudf diff --git a/cpp/include/cudf/io/data_sink.hpp b/cpp/include/cudf/io/data_sink.hpp index 42421aed716..2c1966ee6ba 100644 --- a/cpp/include/cudf/io/data_sink.hpp +++ b/cpp/include/cudf/io/data_sink.hpp @@ -69,6 +69,22 @@ class data_sink { */ static std::unique_ptr create(cudf::io::data_sink* const user_sink); + /** + * @brief Creates a vector of data sinks, one per element in the input vector. + * + * @param[in] args vector of parameters + */ + template + static std::vector> create(std::vector const& args) + { + std::vector> sinks; + sinks.reserve(args.size()); + std::transform(args.cbegin(), args.cend(), std::back_inserter(sinks), [](auto const& arg) { + return data_sink::create(arg); + }); + return sinks; + } + /** * @brief Base class destructor */ diff --git a/cpp/include/cudf/io/detail/parquet.hpp b/cpp/include/cudf/io/detail/parquet.hpp index a18bd450640..9af2e3f278d 100644 --- a/cpp/include/cudf/io/detail/parquet.hpp +++ b/cpp/include/cudf/io/detail/parquet.hpp @@ -89,13 +89,13 @@ class writer { /** * @brief Constructor for output to a file. 
* - * @param sink The data sink to write the data to + * @param sinks The data sinks to write the data to * @param options Settings for controlling writing behavior * @param mode Option to write at once or in chunks * @param stream CUDA stream used for device memory operations and kernel launches * @param mr Device memory resource to use for device memory allocation */ - explicit writer(std::unique_ptr sink, + explicit writer(std::vector> sinks, parquet_writer_options const& options, SingleWriteMode mode, rmm::cuda_stream_view stream, @@ -104,7 +104,7 @@ class writer { /** * @brief Constructor for writer to handle chunked parquet options. * - * @param sink The data sink to write the data to + * @param sinks The data sinks to write the data to * @param options Settings for controlling writing behavior for chunked writer * @param mode Option to write at once or in chunks * @param stream CUDA stream used for device memory operations and kernel launches @@ -112,7 +112,7 @@ class writer { * * @return A parquet-compatible blob that contains the data for all rowgroups in the list */ - explicit writer(std::unique_ptr sink, + explicit writer(std::vector> sinks, chunked_parquet_writer_options const& options, SingleWriteMode mode, rmm::cuda_stream_view stream, @@ -127,8 +127,10 @@ class writer { * @brief Writes a single subtable as part of a larger parquet file/table write. * * @param[in] table The table information to be written + * @param[in] partitions Optional partitions to divide the table into. If specified, must be same + * size as number of sinks. */ - void write(table_view const& table); + void write(table_view const& table, std::vector const& partitions = {}); /** * @brief Finishes the chunked/streamed write process. @@ -138,7 +140,8 @@ class writer { * @return A parquet-compatible blob that contains the data for all rowgroups in the list only if * `column_chunks_file_path` is provided, else null. 
*/ - std::unique_ptr> close(std::string const& column_chunks_file_path = ""); + std::unique_ptr> close( + std::vector const& column_chunks_file_path = {}); /** * @brief Merges multiple metadata blobs returned by write_all into a single metadata blob diff --git a/cpp/include/cudf/io/orc.hpp b/cpp/include/cudf/io/orc.hpp index 16588185f3d..b3a2f6bcbbb 100644 --- a/cpp/include/cudf/io/orc.hpp +++ b/cpp/include/cudf/io/orc.hpp @@ -454,6 +454,8 @@ class orc_writer_options { table_view _table; // Optional associated metadata const table_input_metadata* _metadata = nullptr; + // Optional footer key_value_metadata + std::map _user_data; friend orc_writer_options_builder; @@ -530,6 +532,11 @@ class orc_writer_options { */ table_input_metadata const* get_metadata() const { return _metadata; } + /** + * @brief Returns Key-Value footer metadata information. + */ + std::map const& get_key_value_metadata() const { return _user_data; } + // Setters /** @@ -591,6 +598,16 @@ class orc_writer_options { * @param meta Associated metadata. */ void set_metadata(table_input_metadata const* meta) { _metadata = meta; } + + /** + * @brief Sets metadata. + * + * @param metadata Key-Value footer metadata + */ + void set_key_value_metadata(std::map metadata) + { + _user_data = std::move(metadata); + } }; class orc_writer_options_builder { @@ -698,6 +715,18 @@ class orc_writer_options_builder { return *this; } + /** + * @brief Sets Key-Value footer metadata. + * + * @param metadata Key-Value footer metadata + * @return this for chaining. + */ + orc_writer_options_builder& key_value_metadata(std::map metadata) + { + options._user_data = std::move(metadata); + return *this; + } + /** * @brief move orc_writer_options member once it's built. 
*/ @@ -753,6 +782,8 @@ class chunked_orc_writer_options { size_type _row_index_stride = default_row_index_stride; // Optional associated metadata const table_input_metadata* _metadata = nullptr; + // Optional footer key_value_metadata + std::map _user_data; friend chunked_orc_writer_options_builder; @@ -819,6 +850,11 @@ class chunked_orc_writer_options { */ table_input_metadata const* get_metadata() const { return _metadata; } + /** + * @brief Returns Key-Value footer metadata information. + */ + std::map const& get_key_value_metadata() const { return _user_data; } + // Setters /** @@ -873,6 +909,16 @@ class chunked_orc_writer_options { * @param meta Associated metadata. */ void metadata(table_input_metadata const* meta) { _metadata = meta; } + + /** + * @brief Sets Key-Value footer metadata. + * + * @param metadata Key-Value footer metadata + */ + void set_key_value_metadata(std::map metadata) + { + _user_data = std::move(metadata); + } }; class chunked_orc_writer_options_builder { @@ -965,6 +1011,19 @@ class chunked_orc_writer_options_builder { return *this; } + /** + * @brief Sets Key-Value footer metadata. + * + * @param metadata Key-Value footer metadata + * @return this for chaining. + */ + chunked_orc_writer_options_builder& key_value_metadata( + std::map metadata) + { + options._user_data = std::move(metadata); + return *this; + } + /** * @brief move chunked_orc_writer_options member once it's built. 
*/ diff --git a/cpp/include/cudf/io/parquet.hpp b/cpp/include/cudf/io/parquet.hpp index 2215f24b550..740f7a8b2db 100644 --- a/cpp/include/cudf/io/parquet.hpp +++ b/cpp/include/cudf/io/parquet.hpp @@ -364,13 +364,17 @@ class parquet_writer_options { statistics_freq _stats_level = statistics_freq::STATISTICS_ROWGROUP; // Sets of columns to output table_view _table; + // Partitions described as {start_row, num_rows} pairs + std::vector _partitions; // Optional associated metadata table_input_metadata const* _metadata = nullptr; + // Optional footer key_value_metadata + std::vector> _user_data; // Parquet writer can write INT96 or TIMESTAMP_MICROS. Defaults to TIMESTAMP_MICROS. // If true then overrides any per-column setting in _metadata. bool _write_timestamps_as_int96 = false; - // Column chunks file path to be set in the raw output metadata - std::string _column_chunks_file_path; + // Column chunks file paths to be set in the raw output metadata. One per output file + std::vector _column_chunks_file_paths; // Maximum size of each row group (unless smaller than a single page) size_t _row_group_size_bytes = default_row_group_size_bytes; // Maximum number of rows in row group (unless smaller than a single page) @@ -434,20 +438,36 @@ class parquet_writer_options { */ table_view get_table() const { return _table; } + /** + * @brief Returns partitions. + */ + std::vector const& get_partitions() const { return _partitions; } + /** * @brief Returns associated metadata. */ table_input_metadata const* get_metadata() const { return _metadata; } + /** + * @brief Returns Key-Value footer metadata information. + */ + std::vector> const& get_key_value_metadata() const + { + return _user_data; + } + /** * @brief Returns `true` if timestamps will be written as INT96 */ bool is_enabled_int96_timestamps() const { return _write_timestamps_as_int96; } /** - * @brief Returns Column chunks file path to be set in the raw output metadata. 
+ * @brief Returns Column chunks file paths to be set in the raw output metadata. */ - std::string get_column_chunks_file_path() const { return _column_chunks_file_path; } + std::vector const& get_column_chunks_file_paths() const + { + return _column_chunks_file_paths; + } /** * @brief Returns maximum row group size, in bytes. @@ -459,6 +479,19 @@ class parquet_writer_options { */ auto get_row_group_size_rows() const { return _row_group_size_rows; } + /** + * @brief Sets partitions. + * + * @param partitions Partitions of input table in {start_row, num_rows} pairs. If specified, must + * be same size as number of sinks in sink_info + */ + void set_partitions(std::vector partitions) + { + CUDF_EXPECTS(partitions.size() == _sink.num_sinks(), + "Mismatch between number of sinks and number of partitions"); + _partitions = std::move(partitions); + } + /** * @brief Sets metadata. * @@ -466,6 +499,18 @@ class parquet_writer_options { */ void set_metadata(table_input_metadata const* metadata) { _metadata = metadata; } + /** + * @brief Sets metadata. + * + * @param metadata Key-Value footer metadata + */ + void set_key_value_metadata(std::vector> metadata) + { + CUDF_EXPECTS(metadata.size() == _sink.num_sinks(), + "Mismatch between number of sinks and number of metadata maps"); + _user_data = std::move(metadata); + } + /** * @brief Sets the level of statistics. * @@ -491,11 +536,14 @@ class parquet_writer_options { /** * @brief Sets column chunks file path to be set in the raw output metadata. * - * @param file_path String which indicates file path. + * @param file_paths Vector of Strings which indicates file path. 
Must be same size as number of + * data sinks in sink info */ - void set_column_chunks_file_path(std::string file_path) + void set_column_chunks_file_paths(std::vector file_paths) { - _column_chunks_file_path.assign(file_path); + CUDF_EXPECTS(file_paths.size() == _sink.num_sinks(), + "Mismatch between number of sinks and number of chunk paths to set"); + _column_chunks_file_paths = std::move(file_paths); } /** @@ -543,6 +591,21 @@ class parquet_writer_options_builder { { } + /** + * @brief Sets partitions in parquet_writer_options. + * + * @param partitions Partitions of input table in {start_row, num_rows} pairs. If specified, must + * be same size as number of sinks in sink_info + * @return this for chaining. + */ + parquet_writer_options_builder& partitions(std::vector partitions) + { + CUDF_EXPECTS(partitions.size() == options._sink.num_sinks(), + "Mismatch between number of sinks and number of partitions"); + options.set_partitions(std::move(partitions)); + return *this; + } + /** * @brief Sets metadata in parquet_writer_options. * @@ -555,6 +618,21 @@ class parquet_writer_options_builder { return *this; } + /** + * @brief Sets Key-Value footer metadata in parquet_writer_options. + * + * @param metadata Key-Value footer metadata + * @return this for chaining. + */ + parquet_writer_options_builder& key_value_metadata( + std::vector> metadata) + { + CUDF_EXPECTS(metadata.size() == options._sink.num_sinks(), + "Mismatch between number of sinks and number of metadata maps"); + options._user_data = std::move(metadata); + return *this; + } + /** * @brief Sets the level of statistics in parquet_writer_options. * @@ -582,12 +660,15 @@ class parquet_writer_options_builder { /** * @brief Sets column chunks file path to be set in the raw output metadata. * - * @param file_path String which indicates file path. + * @param file_paths Vector of Strings which indicates file path. Must be same size as number of + * data sinks * @return this for chaining. 
*/ - parquet_writer_options_builder& column_chunks_file_path(std::string file_path) + parquet_writer_options_builder& column_chunks_file_paths(std::vector file_paths) { - options._column_chunks_file_path.assign(file_path); + CUDF_EXPECTS(file_paths.size() == options._sink.num_sinks(), + "Mismatch between number of sinks and number of chunk paths to set"); + options.set_column_chunks_file_paths(std::move(file_paths)); return *this; } @@ -690,6 +771,8 @@ class chunked_parquet_writer_options { statistics_freq _stats_level = statistics_freq::STATISTICS_ROWGROUP; // Optional associated metadata. table_input_metadata const* _metadata = nullptr; + // Optional footer key_value_metadata + std::vector> _user_data; // Parquet writer can write INT96 or TIMESTAMP_MICROS. Defaults to TIMESTAMP_MICROS. // If true then overrides any per-column setting in _metadata. bool _write_timestamps_as_int96 = false; @@ -735,6 +818,14 @@ class chunked_parquet_writer_options { */ table_input_metadata const* get_metadata() const { return _metadata; } + /** + * @brief Returns Key-Value footer metadata information. + */ + std::vector> const& get_key_value_metadata() const + { + return _user_data; + } + /** * @brief Returns `true` if timestamps will be written as INT96 */ @@ -757,6 +848,18 @@ class chunked_parquet_writer_options { */ void set_metadata(table_input_metadata const* metadata) { _metadata = metadata; } + /** + * @brief Sets Key-Value footer metadata. + * + * @param metadata Key-Value footer metadata + */ + void set_key_value_metadata(std::vector> metadata) + { + CUDF_EXPECTS(metadata.size() == _sink.num_sinks(), + "Mismatch between number of sinks and number of metadata maps"); + _user_data = std::move(metadata); + } + /** * @brief Sets the level of statistics in parquet_writer_options. * @@ -841,6 +944,21 @@ class chunked_parquet_writer_options_builder { return *this; } + /** + * @brief Sets Key-Value footer metadata in parquet_writer_options. 
+ * + * @param metadata Key-Value footer metadata + * @return this for chaining. + */ + chunked_parquet_writer_options_builder& key_value_metadata( + std::vector> metadata) + { + CUDF_EXPECTS(metadata.size() == options._sink.num_sinks(), + "Mismatch between number of sinks and number of metadata maps"); + options.set_key_value_metadata(std::move(metadata)); + return *this; + } + /** * @brief Sets the level of statistics in chunked_parquet_writer_options. * @@ -958,18 +1076,25 @@ class parquet_chunked_writer { * @brief Writes table to output. * * @param[in] table Table that needs to be written + * @param[in] partitions Optional partitions to divide the table into. If specified, must be same + * size as number of sinks. + * + * @throws cudf::logic_error If the number of partitions is not the same as number of sinks * @return returns reference of the class object */ - parquet_chunked_writer& write(table_view const& table); + parquet_chunked_writer& write(table_view const& table, + std::vector const& partitions = {}); /** * @brief Finishes the chunked/streamed write process. * - * @param[in] column_chunks_file_path Column chunks file path to be set in the raw output metadata + * @param[in] column_chunks_file_paths Column chunks file path to be set in the raw output + * metadata * @return A parquet-compatible blob that contains the data for all rowgroups in the list only if - * `column_chunks_file_path` is provided, else null. + * `column_chunks_file_paths` is provided, else null.
*/ - std::unique_ptr> close(std::string const& column_chunks_file_path = ""); + std::unique_ptr> close( + std::vector const& column_chunks_file_paths = {}); // Unique pointer to impl writer class std::unique_ptr writer; diff --git a/cpp/include/cudf/io/types.hpp b/cpp/include/cudf/io/types.hpp index cf6be8a20af..512a90b3249 100644 --- a/cpp/include/cudf/io/types.hpp +++ b/cpp/include/cudf/io/types.hpp @@ -151,61 +151,93 @@ struct host_buffer { * @brief Source information for read interfaces */ struct source_info { - io_type type = io_type::FILEPATH; - std::vector filepaths; - std::vector buffers; - std::vector> files; - std::vector user_sources; + std::vector> _files; source_info() = default; explicit source_info(std::vector const& file_paths) - : type(io_type::FILEPATH), filepaths(file_paths) + : _type(io_type::FILEPATH), _filepaths(file_paths) { } explicit source_info(std::string const& file_path) - : type(io_type::FILEPATH), filepaths({file_path}) + : _type(io_type::FILEPATH), _filepaths({file_path}) { } explicit source_info(std::vector const& host_buffers) - : type(io_type::HOST_BUFFER), buffers(host_buffers) + : _type(io_type::HOST_BUFFER), _buffers(host_buffers) { } explicit source_info(const char* host_data, size_t size) - : type(io_type::HOST_BUFFER), buffers({{host_data, size}}) + : _type(io_type::HOST_BUFFER), _buffers({{host_data, size}}) { } explicit source_info(std::vector const& sources) - : type(io_type::USER_IMPLEMENTED), user_sources(sources) + : _type(io_type::USER_IMPLEMENTED), _user_sources(sources) { } explicit source_info(cudf::io::datasource* source) - : type(io_type::USER_IMPLEMENTED), user_sources({source}) + : _type(io_type::USER_IMPLEMENTED), _user_sources({source}) { } + + auto type() const { return _type; } + auto const& filepaths() const { return _filepaths; } + auto const& buffers() const { return _buffers; } + auto const& files() const { return _files; } + auto const& user_sources() const { return _user_sources; } + + private: + 
io_type _type = io_type::FILEPATH; + std::vector _filepaths; + std::vector _buffers; + std::vector _user_sources; }; /** * @brief Destination information for write interfaces */ struct sink_info { - io_type type = io_type::VOID; - std::string filepath; - std::vector* buffer = nullptr; - cudf::io::data_sink* user_sink = nullptr; - sink_info() = default; + sink_info(size_t num_sinks) : _type(io_type::VOID), _num_sinks(num_sinks) {} - explicit sink_info(const std::string& file_path) : type(io_type::FILEPATH), filepath(file_path) {} + explicit sink_info(std::vector const& file_paths) + : _type(io_type::FILEPATH), _num_sinks(file_paths.size()), _filepaths(file_paths) + { + } + explicit sink_info(std::string const& file_path) + : _type(io_type::FILEPATH), _filepaths({file_path}) + { + } - explicit sink_info(std::vector* buffer) : type(io_type::HOST_BUFFER), buffer(buffer) {} + explicit sink_info(std::vector*> const& buffers) + : _type(io_type::HOST_BUFFER), _num_sinks(buffers.size()), _buffers(buffers) + { + } + explicit sink_info(std::vector* buffer) : _type(io_type::HOST_BUFFER), _buffers({buffer}) {} - explicit sink_info(class cudf::io::data_sink* user_sink_) - : type(io_type::USER_IMPLEMENTED), user_sink(user_sink_) + explicit sink_info(std::vector const& user_sinks) + : _type(io_type::USER_IMPLEMENTED), _num_sinks(user_sinks.size()), _user_sinks(user_sinks) { } + explicit sink_info(class cudf::io::data_sink* user_sink) + : _type(io_type::USER_IMPLEMENTED), _user_sinks({user_sink}) + { + } + + auto type() const { return _type; } + auto num_sinks() const { return _num_sinks; } + auto const& filepaths() const { return _filepaths; } + auto const& buffers() const { return _buffers; } + auto const& user_sinks() const { return _user_sinks; } + + private: + io_type _type = io_type::VOID; + size_t _num_sinks = 1; + std::vector _filepaths; + std::vector*> _buffers; + std::vector _user_sinks; }; class table_input_metadata; @@ -369,12 +401,21 @@ class table_input_metadata { * 
The constructed table_input_metadata has the same structure as the passed table_view * * @param table The table_view to construct metadata for - * @param user_data Optional Additional metadata to encode, as key-value pairs */ - table_input_metadata(table_view const& table, std::map user_data = {}); + table_input_metadata(table_view const& table); std::vector column_metadata; - std::map user_data; //!< Format-dependent metadata as key-values pairs +}; + +/** + * @brief Information used while writing partitioned datasets + * + * This information defines the slice of an input table to write to file. In partitioned dataset + * writing, one partition_info struct defines one partition and corresponds to one output file + */ +struct partition_info { + size_type start_row; + size_type num_rows; }; } // namespace io diff --git a/cpp/include/cudf/lists/contains.hpp b/cpp/include/cudf/lists/contains.hpp index 7cd40bb2f86..d529677d505 100644 --- a/cpp/include/cudf/lists/contains.hpp +++ b/cpp/include/cudf/lists/contains.hpp @@ -27,7 +27,7 @@ namespace lists { */ /** - * @brief Create a column of bool values indicating whether the specified scalar + * @brief Create a column of `bool` values indicating whether the specified scalar * is an element of each row of a list column. * * The output column has as many elements as the input `lists` column. @@ -51,7 +51,7 @@ std::unique_ptr contains( rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); /** - * @brief Create a column of bool values indicating whether the list rows of the first + * @brief Create a column of `bool` values indicating whether the list rows of the first * column contain the corresponding values in the second column * * The output column has as many elements as the input `lists` column. 
@@ -74,6 +74,104 @@ std::unique_ptr contains( cudf::column_view const& search_keys, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); +/** + * @brief Create a column of `bool` values indicating whether each row in the `lists` column + * contains at least one null element. + * + * The output column has as many elements as the input `lists` column. + * Output `column[i]` is set to null if the list row `lists[i]` is null. + * Otherwise, `column[i]` is set to a non-null boolean value, depending on whether that list + * contains a null element. + * (Empty list rows are considered *NOT* to contain a null element.) + * + * @param lists Lists column whose `n` rows are to be searched + * @param mr Device memory resource used to allocate the returned column's device memory. + * @return std::unique_ptr BOOL8 column of `n` rows with the result of the lookup + */ +std::unique_ptr contains_nulls( + cudf::lists_column_view const& lists, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + +/** + * @brief Option to choose whether `index_of()` returns the first or last match + * of a search key in a list row + */ +enum class duplicate_find_option : int32_t { + FIND_FIRST = 0, ///< Finds first instance of a search key in a list row. + FIND_LAST ///< Finds last instance of a search key in a list row. +}; + +/** + * @brief Create a column of `size_type` values indicating the position of a search key + * within each list row in the `lists` column + * + * The output column has as many elements as there are rows in the input `lists` column. + * Output `column[i]` contains a 0-based index indicating the position of the search key + * in each list, counting from the beginning of the list. + * Note: + * 1. If the `search_key` is null, all output rows are set to null. + * 2. If the row `lists[i]` is null, `output[i]` is also null. + * 3. If the row `lists[i]` does not contain the `search_key`, `output[i]` is set to `-1`. + * 4. 
In all other cases, `output[i]` is set to a non-negative `size_type` index. + * + * If the `find_option` is set to `FIND_FIRST`, the position of the first match for + * `search_key` is returned. + * If `find_option == FIND_LAST`, the position of the last match in the list row is + * returned. + * + * @param lists Lists column whose `n` rows are to be searched + * @param search_key The scalar key to be looked up in each list row + * @param find_option Whether to return the position of the first match (`FIND_FIRST`) or + * last (`FIND_LAST`) + * @param mr Device memory resource used to allocate the returned column's device memory. + * @return std::unique_ptr INT32 column of `n` rows with the location of the `search_key` + * + * @throw cudf::logic_error If `search_key` type does not match the element type in `lists` + * @throw cudf::logic_error If `search_key` is of a nested type, or `lists` contains nested + * elements (LIST, STRUCT) + */ +std::unique_ptr index_of( + cudf::lists_column_view const& lists, + cudf::scalar const& search_key, + duplicate_find_option find_option = duplicate_find_option::FIND_FIRST, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + +/** + * @brief Create a column of `size_type` values indicating the position of a search key + * row within the corresponding list row in the `lists` column + * + * The output column has as many elements as there are rows in the input `lists` column. + * Output `column[i]` contains a 0-based index indicating the position of each search key + * row in its corresponding list row, counting from the beginning of the list. + * Note: + * 1. If `search_keys[i]` is null, `output[i]` is also null. + * 2. If the row `lists[i]` is null, `output[i]` is also null. + * 3. If the row `lists[i]` does not contain `search_key[i]`, `output[i]` is set to `-1`. + * 4. In all other cases, `output[i]` is set to a non-negative `size_type` index. 
+ * + * If the `find_option` is set to `FIND_FIRST`, the position of the first match for + * `search_key` is returned. + * If `find_option == FIND_LAST`, the position of the last match in the list row is + * returned. + * + * @param lists Lists column whose `n` rows are to be searched + * @param search_keys A column of search keys to be looked up in each corresponding row of + * `lists` + * @param find_option Whether to return the position of the first match (`FIND_FIRST`) or + * last (`FIND_LAST`) + * @param mr Device memory resource used to allocate the returned column's device memory. + * @return std::unique_ptr INT32 column of `n` rows with the location of the `search_key` + * + * @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows + * @throw cudf::logic_error If `search_keys` type does not match the element type in `lists` + * @throw cudf::logic_error If `lists` or `search_keys` contains nested elements (LIST, STRUCT) + */ +std::unique_ptr index_of( + cudf::lists_column_view const& lists, + cudf::column_view const& search_keys, + duplicate_find_option find_option = duplicate_find_option::FIND_FIRST, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + /** @} */ // end of group } // namespace lists } // namespace cudf diff --git a/cpp/include/cudf/lists/filling.hpp b/cpp/include/cudf/lists/filling.hpp new file mode 100644 index 00000000000..74a4dac1e10 --- /dev/null +++ b/cpp/include/cudf/lists/filling.hpp @@ -0,0 +1,105 @@ +/* + * Copyright (c) 2021, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include + +#include + +namespace cudf::lists { +/** + * @addtogroup lists_filling + * @{ + * @file + * @brief Column APIs for individual list sequence + */ + +/** + * @brief Create a lists column in which each row contains a sequence of values specified by a tuple + * of (`start`, `size`) parameters. + * + * Create a lists column in which each row is a sequence of values starting from a `start` value, + * incrementing by one, and its cardinality is specified by a `size` value. The `start` and `size` + * values used to generate each list are taken from the corresponding row of the input @p starts and + * @p sizes columns. + * + * - @p sizes must be a column of integer types. + * - All the input columns must not have nulls. + * - If any row of the @p sizes column contains a negative value, the output is undefined. + * + * @code{.pseudo} + * starts = [0, 1, 2, 3, 4] + * sizes = [0, 2, 2, 1, 3] + * + * output = [ [], [1, 2], [2, 3], [3], [4, 5, 6] ] + * @endcode + * + * @throws cudf::logic_error if @p sizes column is not of integer types. + * @throws cudf::logic_error if any input column has nulls. + * @throws cudf::logic_error if @p starts and @p sizes columns do not have the same size. + * + * @param starts First values in the result sequences. + * @param sizes Numbers of values in the result sequences. + * @param mr Device memory resource used to allocate the returned column's device memory. + * @return The result column containing generated sequences. 
+ */ +std::unique_ptr sequences( + column_view const& starts, + column_view const& sizes, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + +/** + * @brief Create a lists column in which each row contains a sequence of values specified by a tuple + * of (`start`, `step`, `size`) parameters. + * + * Create a lists column in which each row is a sequence of values starting from a `start` value, + * incrementing by a `step` value, and its cardinality is specified by a `size` value. The values + * `start`, `step`, and `size` used to generate each list are taken from the corresponding row of the + * input @p starts, @p steps, and @p sizes columns. + * + * - @p sizes must be a column of integer types. + * - @p starts and @p steps columns must have the same type. + * - All the input columns must not have nulls. + * - If any row of the @p sizes column contains a negative value, the output is undefined. + * + * @code{.pseudo} + * starts = [0, 1, 2, 3, 4] + * steps = [2, 1, 1, 1, -3] + * sizes = [0, 2, 2, 1, 3] + * + * output = [ [], [1, 2], [2, 3], [3], [4, 1, -2] ] + * @endcode + * + * @throws cudf::logic_error if @p sizes column is not of integer types. + * @throws cudf::logic_error if any input column has nulls. + * @throws cudf::logic_error if @p starts and @p steps columns have different types. + * @throws cudf::logic_error if @p starts, @p steps, and @p sizes columns do not have the same size. + * + * @param starts First values in the result sequences. + * @param steps Increment values for the result sequences. + * @param sizes Numbers of values in the result sequences. + * @param mr Device memory resource used to allocate the returned column's device memory. + * @return The result column containing generated sequences. 
+ */ +std::unique_ptr sequences( + column_view const& starts, + column_view const& steps, + column_view const& sizes, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + +/** @} */ // end of group +} // namespace cudf::lists diff --git a/cpp/include/cudf/strings/detail/merge.cuh b/cpp/include/cudf/strings/detail/merge.cuh index a132d8c7229..dba1c24be93 100644 --- a/cpp/include/cudf/strings/detail/merge.cuh +++ b/cpp/include/cudf/strings/detail/merge.cuh @@ -68,8 +68,7 @@ std::unique_ptr merge(strings_column_view const& lhs, // build offsets column auto offsets_transformer = [d_lhs, d_rhs] __device__(auto index_pair) { - auto side = thrust::get<0>(index_pair); - auto index = thrust::get<1>(index_pair); + auto const [side, index] = index_pair; if (side == side::LEFT ? d_lhs.is_null(index) : d_rhs.is_null(index)) return 0; auto d_str = side == side::LEFT ? d_lhs.element(index) : d_rhs.element(index); @@ -90,9 +89,7 @@ std::unique_ptr merge(strings_column_view const& lhs, thrust::make_counting_iterator(0), strings_count, [d_lhs, d_rhs, begin, d_offsets, d_chars] __device__(size_type idx) { - index_type index_pair = begin[idx]; - auto side = thrust::get<0>(index_pair); - auto index = thrust::get<1>(index_pair); + auto const [side, index] = begin[idx]; if (side == side::LEFT ? d_lhs.is_null(index) : d_rhs.is_null(index)) return; auto d_str = side == side::LEFT ? d_lhs.element(index) : d_rhs.element(index); diff --git a/cpp/include/cudf/strings/detail/strings_column_factories.cuh b/cpp/include/cudf/strings/detail/strings_column_factories.cuh index b35f5df2903..9da3c6b0e91 100644 --- a/cpp/include/cudf/strings/detail/strings_column_factories.cuh +++ b/cpp/include/cudf/strings/detail/strings_column_factories.cuh @@ -33,6 +33,12 @@ namespace cudf { namespace strings { namespace detail { +/** + * @brief Basic type expected for iterators passed to `make_strings_column` that represent string + * data in device memory. 
+ */ +using string_index_pair = thrust::pair; + /** * @brief Average string byte-length threshold for deciding character-level * vs. row-level parallel algorithm. @@ -64,8 +70,6 @@ std::unique_ptr make_strings_column(IndexPairIterator begin, size_type strings_count = thrust::distance(begin, end); if (strings_count == 0) return make_empty_column(type_id::STRING); - using string_index_pair = thrust::pair; - // check total size is not too large for cudf column auto size_checker = [] __device__(string_index_pair const& item) { return (item.first != nullptr) ? item.second : 0; diff --git a/cpp/include/cudf/strings/extract.hpp b/cpp/include/cudf/strings/extract.hpp index 6f5902266b2..466f71aace0 100644 --- a/cpp/include/cudf/strings/extract.hpp +++ b/cpp/include/cudf/strings/extract.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2021, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -27,20 +27,21 @@ namespace strings { */ /** - * @brief Returns a vector of strings columns for each matching group specified in the given regular - * expression pattern. + * @brief Returns a table of strings columns where each column corresponds to the matching + * group specified in the given regular expression pattern. * * All the strings for the first group will go in the first output column; the second group - * go in the second column and so on. Null entries are added if the string does match. + * go in the second column and so on. Null entries are added to the columns in row `i` if + * the string at row `i` does not match. * * Any null string entries return corresponding null output column entries. 
* * @code{.pseudo} * Example: - * s = ["a1","b2","c3"] - * r = extract(s,"([ab])(\\d)") - * r is now [["a","b",null], - * ["1","2",null]] + * s = ["a1", "b2", "c3"] + * r = extract(s, "([ab])(\\d)") + * r is now [ ["a", "b", null], + * ["1", "2", null] ] * @endcode * * See the @ref md_regex "Regex Features" page for details on patterns supported by this API. @@ -55,6 +56,39 @@ std::unique_ptr
extract( std::string const& pattern, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); +/** + * @brief Returns a lists column of strings where each string column row corresponds to the + * matching group specified in the given regular expression pattern. + * + * All the matching groups for the first row will go in the first row output column; the second + * row results will go into the second row output column and so on. + * + * A null output row will result if the corresponding input string row does not match or + * that input row is null. + * + * @code{.pseudo} + * Example: + * s = ["a1 b4", "b2", "c3 a5", "b", null] + * r = extract_all(s,"([ab])(\\d)") + * r is now [ ["a", "1", "b", "4"], + * ["b", "2"], + * ["a", "5"], + * null, + * null ] + * @endcode + * + * See the @ref md_regex "Regex Features" page for details on patterns supported by this API. + * + * @param strings Strings instance for this operation. + * @param pattern The regular expression pattern with group indicators. + * @param mr Device memory resource used to allocate any returned device memory. + * @return Lists column containing strings extracted from the input column. + */ +std::unique_ptr extract_all( + strings_column_view const& strings, + std::string const& pattern, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + /** @} */ // end of doxygen group } // namespace strings } // namespace cudf diff --git a/cpp/include/cudf/strings/replace_re.hpp b/cpp/include/cudf/strings/replace_re.hpp index 087d1a94603..a2c4eba1636 100644 --- a/cpp/include/cudf/strings/replace_re.hpp +++ b/cpp/include/cudf/strings/replace_re.hpp @@ -17,6 +17,7 @@ #include #include +#include #include namespace cudf { @@ -37,22 +38,25 @@ namespace strings { * * @param strings Strings instance for this operation. * @param pattern The regular expression pattern to search within each string. 
- * @param repl The string used to replace the matched sequence in each string. + * @param replacement The string used to replace the matched sequence in each string. * Default is an empty string. - * @param maxrepl The maximum number of times to replace the matched pattern within each string. + * @param max_replace_count The maximum number of times to replace the matched pattern + * within each string. Default replaces every substring that is matched. + * @param flags Regex flags for interpreting special characters in the pattern. * @param mr Device memory resource used to allocate the returned column's device memory. * @return New strings column. */ std::unique_ptr replace_re( strings_column_view const& strings, std::string const& pattern, - string_scalar const& repl = string_scalar(""), - size_type maxrepl = -1, - rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); + string_scalar const& replacement = string_scalar(""), + std::optional max_replace_count = std::nullopt, + regex_flags const flags = regex_flags::DEFAULT, + rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); /** * @brief For each string, replaces any character sequence matching the given patterns - * with the corresponding string in the repls column. + * with the corresponding string in the `replacements` column. * * Any null string entries return corresponding null output column entries. * @@ -60,14 +64,16 @@ std::unique_ptr replace_re( * * @param strings Strings instance for this operation. * @param patterns The regular expression patterns to search within each string. - * @param repls The strings used for replacement. + * @param replacements The strings used for replacement. + * @param flags Regex flags for interpreting special characters in the patterns. * @param mr Device memory resource used to allocate the returned column's device memory. * @return New strings column. 
*/ std::unique_ptr replace_re( strings_column_view const& strings, std::vector const& patterns, - strings_column_view const& repls, + strings_column_view const& replacements, + regex_flags const flags = regex_flags::DEFAULT, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); /** @@ -83,6 +89,7 @@ std::unique_ptr replace_re( * @param strings Strings instance for this operation. * @param pattern The regular expression patterns to search within each string. * @param replacement The replacement template for creating the output string. + * @param flags Regex flags for interpreting special characters in the pattern. * @param mr Device memory resource used to allocate the returned column's device memory. * @return New strings column. */ @@ -90,6 +97,7 @@ std::unique_ptr replace_with_backrefs( strings_column_view const& strings, std::string const& pattern, std::string const& replacement, + regex_flags const flags = regex_flags::DEFAULT, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); } // namespace strings diff --git a/cpp/include/cudf/table/row_operators.cuh b/cpp/include/cudf/table/row_operators.cuh index 0f3ca073380..32ddd1ef49a 100644 --- a/cpp/include/cudf/table/row_operators.cuh +++ b/cpp/include/cudf/table/row_operators.cuh @@ -539,52 +539,4 @@ class row_hasher { uint32_t _seed{DEFAULT_HASH_SEED}; }; -/** - * @brief Computes the hash value of a row in the given table, combined with an - * initial hash value for each column. - * - * @tparam hash_function Hash functor to use for hashing elements. - * @tparam Nullate A cudf::nullate type describing how to check for nulls. - */ -template