[RELEASE] cudf v23.08 #13781

Merged
merged 189 commits into from
Aug 9, 2023

Conversation

raydouglass
Member

❄️ Code freeze for branch-23.08 and v23.08 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-23.08 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-23.08 into main for the release

raydouglass and others added 30 commits May 19, 2023 09:51
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
I originally placed the exception handler into a separate C++ header file that could be included by the Cython header because I figured that reading C++ inlined in Cython would be more confusing to devs. Unfortunately, the current approach complicates the build system due to the need to ensure that the directory containing the C++ header is always in the include path, which becomes problematic depending on where the files including the exception handler are (anywhere outside of `_lib` becomes problematic). Inlining is the simplest solution to this problem.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Bradley Dice (https://github.com/bdice)

URL: #13411
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Bump up JNI version to 23.08.0-SNAPSHOT in branch-23.08

Authors:
  - Peixin (https://github.com/pxLi)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: #13401
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Cleans up source files for nvtext and io-text pytests. The pytests are placed into separate files: `test_io_text.py` for the io-text pytests and `test_nvtext.py` for the nvtext pytests. Also removed the `python/cudf/cudf/tests/text` folder which contained 2 empty `.py` files.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13435
This PR attempts to allow using newer versions of scikit-build again.

cf. #13188

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13424
closes #13412
Remove weak references of cleaned resources when a resource is cleaned.
Cleaned objects are never leaked, so it is safe to remove their weak references.
This reduces memory usage.
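
As a rough illustration of the idea (a plain Python sketch, not the Java code touched by this change; `Resource`, `Cleaner`, and their methods are hypothetical names), a tracker can drop its weak reference as soon as a resource is cleaned, so its bookkeeping only holds entries for resources that are still open:

```python
import weakref


class Resource:
    """Hypothetical stand-in for a native resource that must be closed."""

    def __init__(self, name):
        self.name = name
        self.closed = False


class Cleaner:
    """Hypothetical tracker that holds weak references to open resources."""

    def __init__(self):
        self._tracked = {}  # id(resource) -> weak reference

    def register(self, resource):
        self._tracked[id(resource)] = weakref.ref(resource)

    def clean(self, resource):
        resource.closed = True
        # A cleaned resource can no longer leak, so its weak reference
        # is removed immediately instead of lingering in the tracker.
        self._tracked.pop(id(resource), None)

    def tracked_count(self):
        return len(self._tracked)


cleaner = Cleaner()
resources = [Resource(f"buf{i}") for i in range(3)]
for r in resources:
    cleaner.register(r)
cleaner.clean(resources[0])
```

After the single `clean` call, only the two still-open resources remain tracked.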

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - MithunR (https://github.com/mythrocks)

URL: #13378
Forward-merge branch-23.06 to branch-23.08
Depends on: rapidsai/rapids-cmake#393

Once the above PR is merged, this updated logic ensures that cudf places the custom versions of the cccl packages in the correct locations and can find them once installed.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13235
Remove/update repeated documentation text
Remove declaration repetitions in tdigest.hpp

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #13470
This moves the logic that updates the columns returned from the JSON reader to Java. It also updates the code to handle requested columns that are not in the data. It is not perfect, because it will not work if the input file has no columns in it at all.

```
{}
{}
```

But it fixes issues for a file that has valid columns in it, but none of them are the columns that we requested.

This is a workaround for #13473, but is not perfect.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - MithunR (https://github.com/mythrocks)

URL: #13477
Currently, the chunked Parquet reader benchmark creates the chunked reader object once and reuses it for all iterations.
After the first iteration the source is fully read, so each subsequent iteration returns a single, empty chunk.

This PR fixes the use of the chunked reader object.
The creation of the object is included in the benchmark timing.
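
The failure mode can be reproduced with a small plain-Python sketch. This is not the benchmark code: `ChunkedReader` and `rows_read_per_iteration` are hypothetical stand-ins, and the read loop is assumed to be a do-while style loop that always reads at least one chunk.

```python
class ChunkedReader:
    """Hypothetical stand-in for a chunked reader over an in-memory source."""

    def __init__(self, rows, chunk_size):
        self._rows = rows
        self._chunk_size = chunk_size
        self._pos = 0

    def has_next(self):
        return self._pos < len(self._rows)

    def read_chunk(self):
        chunk = self._rows[self._pos:self._pos + self._chunk_size]
        self._pos += len(chunk)
        return chunk


def rows_read_per_iteration(rows, chunk_size, iterations, reuse_reader):
    """Return how many rows each benchmark iteration actually reads."""
    counts = []
    reader = ChunkedReader(rows, chunk_size) if reuse_reader else None
    for _ in range(iterations):
        if not reuse_reader:
            # Fixed variant: a fresh reader per iteration; its construction
            # would sit inside the timed region.
            reader = ChunkedReader(rows, chunk_size)
        n = 0
        while True:  # always read at least one chunk
            n += len(reader.read_chunk())
            if not reader.has_next():
                break
        counts.append(n)
    return counts


rows = list(range(10))
reused = rows_read_per_iteration(rows, 4, 3, reuse_reader=True)
fresh = rows_read_per_iteration(rows, 4, 3, reuse_reader=False)
```

With a reused reader only the first iteration measures real work; recreating the reader makes every iteration read the full source.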

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13482
Removes the `max_rows_tensor` parameter from the `nvtext::subword_tokenize` API since it is no longer required. The parameter was intended to size the temporary working memory for the internal functions. After some general rework it was no longer used, but it was never removed from the API.
Also updates the Python/Cython calls, which had been hard-coding a default value anyway.

This was found through reference issue #13458.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Divye Gala (https://github.com/divyegala)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #13463
bdice and others added 8 commits July 26, 2023 05:07
In #12922, we missed adding a `cuda{{ cuda_major }}_` to the `custreamz` build tag. This PR fixes that.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: #13754
A fully unbounded window function (i.e. [unbounded_preceding, unbounded_following]) need not go through the window function machinery for execution. For example, consider the following:

```c++
auto grps = { 0, 0, 0, 0, 1, 1, 1, 1, 2, 2 };
auto vals = { 3, 1, 4, 2, 6, 7, 8, 5, 9, 0 };
```

Running the `MIN` window function on the groups, over an `[UNBOUNDED, UNBOUNDED]` window should produce:

```c++
auto res = { 1, 1, 1, 1, 5, 5, 5, 5, 0, 0 };
```

This result could more easily be achieved using a grouped `MIN` aggregation, and replicating each group's result for every entry in the group.

This commit adds logic to detect fully unbounded windows, and use `groupby::aggregate()` (when one or more grouping keys are specified), or `reduce()` (when there are no grouping keys).

Tangentially, this change also adds the following:
1. A new overload of `cudf::groupby::groupby::aggregate()` that takes a `stream` parameter.
2. A `detail` header to declare the (pre-existing) `cudf::reduction::detail::reduce()` function.
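
The equivalence described above can be checked with a minimal plain-Python sketch. It is not libcudf code, and `unbounded_window_min` and `unbounded_window_min_no_keys` are hypothetical helper names. It computes one `MIN` per group, replicates it to every row of that group, and falls back to a plain reduction when there are no grouping keys.

```python
def unbounded_window_min(grps, vals):
    """MIN over an [UNBOUNDED, UNBOUNDED] window via a grouped aggregation."""
    group_min = {}
    for g, v in zip(grps, vals):
        # Step 1: one aggregation result per group (the groupby step).
        group_min[g] = v if g not in group_min else min(group_min[g], v)
    # Step 2: replicate each group's result for every entry in the group.
    return [group_min[g] for g in grps]


def unbounded_window_min_no_keys(vals):
    """Without grouping keys the whole column is one window: a reduction."""
    if not vals:
        return []
    smallest = min(vals)
    return [smallest] * len(vals)


grps = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
vals = [3, 1, 4, 2, 6, 7, 8, 5, 9, 0]
res = unbounded_window_min(grps, vals)
```

Here `res` matches the expected output listed above for the same `grps` and `vals`.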

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Ray Douglass (https://github.com/raydouglass)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

URL: #13727
Fixes a typo in the `test.yaml` workflow. See rapidsai/rmm#1310.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #13763
…ge headers (#13707)

The current parquet reader assumes that repetition or definition level data with a bit length of 0 will have no data encoded in the header.  In the case of V2 headers, this assumption is false. This PR checks the V2 page header data to see if level data needs to be accounted for. Also fixes an error that was present in the RLE data decoder where the encoded length of the RLE data was not skipped properly.

Fixes #13655

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13707
…3769)

In #12922, we missed adding a `cuda{{ cuda_major }}_` to the `cudf-kafka` and `libcudf-example` build strings. This PR fixes that.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - https://github.com/jakirkham

URL: #13769
The plan is to support AST-based filter predicate pushdown in Parquet. This PR adds predicate pushdown for row group filtering.

The statistics of the columns in each row group are loaded into a device column, and an AST filter is applied to the min and max of each column to select the row groups to read. The user-provided AST must first be converted into another AST that can be applied to the min and max values of each column (the 'Statistics AST'). After the row groups are parsed, the user-provided AST is applied to the output columns to filter any remaining rows in those row groups.
A new `column_name_reference` is introduced to help users create ASTs that reference columns by name, since the user may not know the column indices before reading. Because the AST engine accepts only column index references, a transformation is applied to the user-provided AST. Two new AST transformation classes are introduced:
1. `named_to_reference_converter` - Converts column name references to column index references
2. `stats_expression_converter` - Converts the above output table filtering AST to the 'Statistics AST'.

Note: `column_name_reference` is supported only for predicate pushdown filtering; it is not supported for other AST operations such as transforms and joins.
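
The three stages described above (name resolution, row group selection from min/max statistics, then row-level filtering) can be sketched in plain Python. This is only an illustration under simplifying assumptions: a predicate is a single `(op, column_name, literal)` tuple rather than a full AST, row groups are non-empty lists of row tuples, and every function name here is hypothetical rather than part of libcudf.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def resolve_name(expr, column_names):
    """Turn (op, column_name, literal) into (op, column_index, literal)."""
    op, name, literal = expr
    return (op, column_names.index(name), literal)


def may_match(expr, stats):
    """Statistics form of the predicate: could any row in a row group with
    these per-column (min, max) values satisfy the expression?"""
    op, col, literal = expr
    lo, hi = stats[col]
    if op == "<":
        return lo < literal
    if op == ">":
        return hi > literal
    if op == "==":
        return lo <= literal <= hi
    raise ValueError(f"unsupported operator: {op}")


def read_filtered(row_groups, column_names, named_expr):
    """Return (matching rows, number of row groups actually read)."""
    expr = resolve_name(named_expr, column_names)
    op, col, literal = expr
    out = []
    groups_read = 0
    for rows in row_groups:
        # Computed here for brevity; a real file stores these statistics.
        stats = [(min(c), max(c)) for c in zip(*rows)]
        if not may_match(expr, stats):
            continue  # the whole row group is skipped
        groups_read += 1
        # The original predicate removes the remaining non-matching rows.
        out.extend(r for r in rows if OPS[op](r[col], literal))
    return out, groups_read


column_names = ["id", "score"]
row_groups = [
    [(1, 10), (2, 20)],
    [(3, 30), (4, 40)],
    [(5, 50), (6, 60)],
]
rows, groups_read = read_filtered(row_groups, column_names, (">", "score", 35))
```

In this example the first row group is pruned using statistics alone, while the second is read and then filtered row by row.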

- [x] #13472 
- [x] Convert column chunk min, max to cudf type column.
- [x] Add AST filter interface to parquet reader options
- [x] Convert AST to Statistics AST
- [x] Apply statistics AST on Stats values to get row_groups
- [x] Apply AST as filter on output columns.

Depends on #13472

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13348
Closes #11675
Adds `read_parquet_metadata` to libcudf.
The metadata contains the following information:
- schema - (type, name, children)
- num_rows
- num_rowgroups
- key-value string metadata in file footer

To reviewers: for requests to add more information to the metadata, refer to #11214.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Divye Gala (https://github.com/divyegala)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13663
This PR relaxes cudf's protobuf pinnings to help with compatibility issues. `cudf` uses `protobuf` in two places.

The first place `protobuf` is used is at build time, to generate a Python module from a `.proto` file in `python/cudf/cmake/Modules/ProtobufHelpers.cmake`: https://github.com/rapidsai/cudf/blob/f8e5a89e983065e1202f1151dd499bea3102a537/python/cudf/cmake/Modules/ProtobufHelpers.cmake#L16-L17

The second place `protobuf` is used is in the generated file `python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py` which is [imported here](https://github.com/rapidsai/cudf/blob/f8e5a89e983065e1202f1151dd499bea3102a537/python/cudf/cudf/io/orc.py#L14-L16).

The generated Python module used at runtime should be compatible with newer versions of `protobuf` than the version used to build the Python module, from my understanding of https://protobuf.dev/support/cross-version-runtime-guarantee/. Therefore, we only require that the runtime pinning of `protobuf` is of the same major version and an equal-or-greater minor version. That allows us to relax this pinning.
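
The runtime rule stated above (same major version, equal-or-greater minor version) can be written as a small check. This is a sketch only: the function names are hypothetical, the version strings used with it are illustrative rather than cudf's actual pins, and patch components are ignored because the rule as stated only mentions major and minor.

```python
def parse_major_minor(version):
    """Extract (major, minor) as integers from a dotted version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def runtime_is_compatible(build_version, runtime_version):
    """True if the runtime protobuf satisfies the rule described above."""
    build_major, build_minor = parse_major_minor(build_version)
    run_major, run_minor = parse_major_minor(runtime_version)
    return run_major == build_major and run_minor >= build_minor
```

For instance, `runtime_is_compatible("4.21.0", "4.23.4")` is accepted, while a runtime with a different major version or an older minor version is rejected.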

Follow-up to #12864, see that PR for more context.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13770
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue conda Java Affects Java cuDF API. labels Jul 28, 2023
galipremsagar and others added 2 commits August 2, 2023 15:13
This PR pins `dask` and `distributed` to version `2023.7.1` for the `23.08` release.

Authors:
   - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
   - Ray Douglass (https://github.com/raydouglass)
   - Peter Andreas Entschev (https://github.com/pentschev)
@raydouglass raydouglass merged commit d9589b7 into main Aug 9, 2023