[RELEASE] cudf v22.02 #10101

Merged
merged 231 commits into from
Feb 2, 2022

GPUtester
Collaborator

❄️ Code freeze for branch-22.02 and v22.02 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-22.02 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-22.02 into main for the release

ajschmidt8 and others added 30 commits November 4, 2021 10:13
Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
This PR improves the C++ developer guide.

My primary goal was to fix some invalid links.

The diff is a bit large because of some minor changes made to establish a consistent style and improve the reading/editing experience (e.g., replacing a few instances of tabs with spaces, trimming trailing whitespace, wrapping sections that were not wrapped like the rest of the file, and correcting typos that I came across while reading). To save time, I recommend that reviewers use the option in GitHub's review tab that ignores whitespace changes.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)

URL: #9675
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Signed-off-by: Peixin Li <[email protected]>

cudfjni version update.

NOTE: this includes a change to use gpuci/cuda images, since the official CUDA images are not yet available on Docker Hub.

Authors:
  - Peixin (https://github.com/pxLi)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #9681
…xed_point` (#9658)

This PR adds Java bindings for `is_fixed_point`

Authors:
  - Raza Jafri (https://github.com/razajafri)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - David Wendt (https://github.com/davidwendt)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #9658
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Fixes: #9642 

This PR fixes an issue where null values were treated as `False` when a `boolean` dtype was passed to the `Series` constructor.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #9691
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Add additional checks for int8, int16

fixes [#/rapidsai/cudf/4127](NVIDIA/spark-rapids#4127)

Authors:
  - Raza Jafri (https://github.com/razajafri)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

URL: #9707
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Closes #9615

Adds the following API to the Parquet writer:

- Set maximum row group size, in bytes (minimum of 512KB);
- Set maximum row group size, in rows (minimum of 5000).

The API is more limited than its ORC equivalent because of limitations in Parquet page size control/estimation.
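
As a rough illustration of the row-count limit only, the sketch below splits a table into row groups no larger than a requested maximum. The helper `row_group_bounds` is hypothetical and is not part of the cudf API; the 5000-row minimum comes from the description above, but rejecting smaller values with an exception is an assumption, and the byte-size limit is not modeled.

```python
MIN_ROW_GROUP_ROWS = 5000  # minimum stated in the PR description


def row_group_bounds(num_rows, max_rows):
    """Return (start, end) row ranges, each holding at most max_rows rows.

    Hypothetical helper for illustration; raising on values below the
    minimum is an assumption about behavior, not the writer's actual logic.
    """
    if max_rows < MIN_ROW_GROUP_ROWS:
        raise ValueError(f"max_rows must be at least {MIN_ROW_GROUP_ROWS}")
    return [
        (start, min(start + max_rows, num_rows))
        for start in range(0, num_rows, max_rows)
    ]


bounds = row_group_bounds(12_000, 5_000)
print(bounds)
```

With 12,000 rows and a 5,000-row limit, the last row group is simply smaller than the others.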

Other changes: 

- Fix naming in some ORC APIs to be consistent. 
- Change `rowgroup` to `row_group` in APIs, since Parquet specs refer to this as "row group", not "rowgroup". 
- Replace some `uint32_t` use in Parquet writer.
- Remove unused `target_page_size`.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Ashwin Srinath (https://github.com/shwina)

URL: #9677
This PR is a pretty thorough rewrite of the internals of merging. There is a ton of complexity imposed by matching all the different edge cases allowed by the pandas API, but I've tried to unify the logic for different code paths as much as possible. I've also added checks for a number of edge cases that were not previously being handled. I see about a 10% performance improvement for merges on small to medium data sizes from this PR (as expected, there's no change for large data where most time is spent in C++). There's also a substantial reduction in total code that should make it easier to address issues going forward. I'm still not entirely happy with the complexity of the result and I think that further simplification should be possible, but I think this is a sufficiently large step forward to be worth pushing forward in this state, especially if it helps enable other changes to joining.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #9516
This PR is a basic implementation of the [interchange dataframe protocol](https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py) for cudf.
It is well known that there are many dataframe libraries, each with strengths that cover another's weaknesses. To work across these libraries, we currently rely on `pandas`, using methods like `from_pandas` and `to_pandas`.
This is a poor design because each library must maintain an additional dependency on pandas and its peculiarities.
This protocol provides a high-level API that dataframe libraries must implement to communicate with one another.
Thus, we remove the tight coupling with pandas and depend only on the protocol API, leaving each library free to choose its own implementation details.
To illustrate:

- `df_obj =  cudf_dataframe.__dataframe__()`

`df_obj` can be consumed by any library implementing the protocol.
- `df = cudf.from_dataframe(any_supported_dataframe)`

Here we create a cudf dataframe from any dataframe object that supports the protocol.

So far, it supports the following:

-  Column dtypes: `uint8`, `int`, `float`, `bool` and `categorical`.
-  Missing values are handled for all these dtypes.
-  `string` support is on the way.

Additionally, we support dataframes from CPU libraries such as `pandas`, but this is not testable here because pandas has not yet adopted the protocol. We've tested it locally with a monkey-patched pandas implementation of the protocol.
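
The producer/consumer split described above can be sketched with a deliberately simplified toy. All classes here (`ToyInterchangeFrame`, `ProducerFrame`, `ConsumerFrame`) are hypothetical, and columns are exchanged as plain Python lists; the real protocol exchanges typed buffers with validity information and has a much larger API surface. Only the `__dataframe__` entry point and the `from_dataframe` naming follow the description.

```python
class ToyInterchangeFrame:
    """Hypothetical stand-in for the object returned by __dataframe__()."""

    def __init__(self, columns):
        self._columns = columns

    def column_names(self):
        return list(self._columns)

    def get_column_by_name(self, name):
        # Toy simplification: the real protocol returns a column object
        # backed by buffers, not a Python list.
        return self._columns[name]


class ProducerFrame:
    """Hypothetical library A: exposes its data through the protocol."""

    def __init__(self, data):
        self._data = data

    def __dataframe__(self):
        return ToyInterchangeFrame(self._data)


class ConsumerFrame:
    """Hypothetical library B: knows nothing about ProducerFrame."""

    def __init__(self, data):
        self.data = data


def from_dataframe(obj):
    """Build a ConsumerFrame from any object that implements __dataframe__."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the interchange protocol")
    xchg = obj.__dataframe__()
    return ConsumerFrame(
        {name: list(xchg.get_column_by_name(name)) for name in xchg.column_names()}
    )


source = ProducerFrame({"a": [1, 2, None], "b": [0.5, 1.5, 2.5]})
result = from_dataframe(source)
print(result.data)
```

The consumer depends only on the methods of the interchange object, which is the decoupling the protocol is meant to provide.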

Authors:
  - Ismaël Koné (https://github.com/iskode)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9071
Depends on #9040 and (unfortunately) #9041

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #9089
Follow-up to #9571, where we add `ceil` and `floor` support for `Series`.

Here we add `ceil` and `floor` support to `DatetimeIndex` class. This PR is dependent on #9571 getting merged first since it assumes the `libcudf` implementation for `floor` exists.
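
For readers unfamiliar with the semantics, here is a minimal standard-library sketch of what flooring and ceiling a timestamp to a resolution means. The helpers `floor_dt` and `ceil_dt` are hypothetical and only illustrate the rounding rule; they are not how cudf or `libcudf` implement it.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def floor_dt(ts, resolution):
    """Round ts down to a multiple of resolution counted from the epoch."""
    ticks = (ts - EPOCH) // resolution  # timedelta // timedelta yields an int
    return EPOCH + ticks * resolution


def ceil_dt(ts, resolution):
    """Round ts up to a multiple of resolution; exact multiples are unchanged."""
    floored = floor_dt(ts, resolution)
    return floored if floored == ts else floored + resolution


ts = datetime(2021, 11, 4, 10, 13, 27, tzinfo=timezone.utc)
hour = timedelta(hours=1)
floored = floor_dt(ts, hour)
ceiled = ceil_dt(ts, hour)
print(floored.isoformat(), ceiled.isoformat())
```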

Authors:
  - Mayank Anand (https://github.com/mayankanand007)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Ashwin Srinath (https://github.com/shwina)

URL: #9554
This PR continues to address #8974, adding support for structs in `min` and `max` reduction.
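
As an illustration of what a `min`/`max` over struct rows can mean, the sketch below assumes structs are ordered lexicographically, field by field, which is how Python tuples compare. The `Point` type is hypothetical, and null handling, which a real implementation must also define, is not modeled.

```python
from typing import NamedTuple


class Point(NamedTuple):  # hypothetical struct with two fields
    x: int
    y: float


rows = [Point(2, 0.5), Point(1, 9.0), Point(1, 3.5), Point(2, 0.1)]

# Tuples compare on the first field, then break ties on the next one.
smallest = min(rows)
largest = max(rows)
print(smallest, largest)
```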

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - https://github.com/nvdbaranec

URL: #9697
Regular spell check fixes in comments and docs.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #9682
…#9715)

Closes #9620 

Fixes an edge case described in https://docs.python.org/3/library/re.html#re.MULTILINE
where the '$' EOL regex pattern character (without `MULTILINE` set) should match at the very end of a string and also just before the end of the string if the end of that string contains a new-line.
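
The reference behavior can be checked with Python's own `re` module. This sketch only shows the semantics described in the linked documentation; it does not exercise the cudf strings API.

```python
import re

pattern = re.compile(r"abc$")  # note: re.MULTILINE is not set

# '$' matches at the very end of the string.
at_end = pattern.search("abc")

# '$' also matches just before a newline that ends the string.
before_newline = pattern.search("abc\n")

# It does not match before a newline in the middle of the string.
mid_string = pattern.search("abc\ndef")

print(at_end is not None, before_newline is not None, mid_string is not None)
```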

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Sheilah Kirui (https://github.com/skirui-source)

URL: #9715
codereport and others added 5 commits January 20, 2022 16:05
This PR adds clang-tidy to cudf along with an initial set of checks. More checks will be enabled in the future.

Relevant PRs:
* `rmm`: rapidsai/rmm#857
* `cuml`: rapidsai/cuml#1945

To do list:
* [x] Add `.clang-tidy` file
* [x] Add python script
* [x] Apply `modernize-` changes
* [x] Revert `cxxopts` changes
* [x] Fixed Python parquet failures
* [x] Ignore `cxxopts` file
* [x] Ignore the `build/_deps` directories

Splitting out the following into a separate PR so we can get the changes merged for 22.02 (#10064):
* ~~[ ] Disable `clang-diagnostic-errors/warnings`~~
* ~~[ ] Fix include files being skipped~~
* ~~[ ] Set up CI script~~
* ~~[ ] Clean up python script~~

Authors:
  - Conor Hoekstra (https://github.com/codereport)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)
  - Mark Harris (https://github.com/harrism)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9860
The `libcudacxx.patch` was required to fix issues with libcudacxx 1.6 and incorrect detection of the arm nvcc 11.4 compiler. 

As we move to libcudacxx 1.7 this patch is not needed, and should be removed.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Mark Harris (https://github.com/harrism)

URL: #10057
…storage (#9589)

**Important Note**: ~Marking this as WIP until the `fsspec.parquet` module is available in a filesystem_spec release~ (fsspec.parquet module is available)

This PR modifies `cudf.read_parquet` and `dask_cudf.read_parquet` to leverage the new `fsspec.parquet.open_parquet_file` function for optimized data transfer/caching from remote storage. The ~long-term~ goal is to remove the temporary data-transfer optimizations that we currently use in cudf.read_parquet.

**Performance Motivation**:

```python
In [1]: import cudf, dask_cudf
   ...: path = [
   ...:     "gs://my-bucket/criteo-parquet/day_0.parquet",
   ...:     "gs://my-bucket/criteo-parquet/day_1.parquet",
   ...: ]

# cudf BEFORE
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 11.1 s, sys: 11.5 s, total: 22.6 s
Wall time: 24.4 s

# cudf AFTER
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 3.48 s, sys: 722 ms, total: 4.2 s
Wall time: 6.32 s

# (Threaded) Dask-cudf BEFORE
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 27.1 s, sys: 15.5 s, total: 42.6 s
Wall time: 57.6 s

# (Threaded) Dask-cudf AFTER
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 3.43 s, sys: 851 ms, total: 4.28 s
Wall time: 13.1 s
```

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - https://github.com/brandon-b-miller
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #9589
Because Python 3.7 support is being removed, the Python version in the upload scripts needs to be updated.

Authors:
  - Jordan Jacobelli (https://github.com/Ethyling)

Approvers:
  - Sevag Hanssian (https://github.com/sevagh)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #10092
Depends on #10041.

The previous ORC writer API exposed only a binary choice for statistics:
ENABLED/DISABLED.
This commit allows the ORC writer to further choose whether statistics
are collected at the ROW_GROUP or STRIPE level.

This commit also includes the relevant changes to `java/` and `python/`.

Authors:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Christopher Harris (https://github.com/cwharris)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #10058
@codecov

codecov bot commented Jan 21, 2022

Codecov Report

Merging #10101 (cfcb3ac) into main (41a20f6) will decrease coverage by 0.14%.
The diff coverage is n/a.

❗ Current head cfcb3ac differs from pull request most recent head a7d88cd. Consider uploading reports for the commit a7d88cd to get more accurate results

@@            Coverage Diff             @@
##             main   #10101      +/-   ##
==========================================
- Coverage   10.56%   10.42%   -0.15%     
==========================================
  Files         116      119       +3     
  Lines       18677    20606    +1929     
==========================================
+ Hits         1974     2148     +174     
- Misses      16703    18458    +1755     

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/dask_cudf/dask_cudf/backends.py | 83.13% <0.00%> (-2.58%) | ⬇️ |
| python/dask_cudf/dask_cudf/sorting.py | 92.66% <0.00%> (-0.72%) | ⬇️ |
| python/custreamz/custreamz/kafka.py | 29.16% <0.00%> (-0.63%) | ⬇️ |
| python/cudf/cudf/io/csv.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/io/hdf.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/io/orc.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/_typing.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/io/avro.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/__init__.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/_version.py | 0.00% <0.00%> (ø) | |

... and 90 more


@github-actions github-actions bot added the labels CMake (CMake build issue), conda, Java (Affects Java cuDF API.), Python (Affects Python cuDF API.), and libcudf (Affects libcudf (C++/CUDA) code.) on Jan 21, 2022
revans2 and others added 4 commits January 21, 2022 14:24
I know that this is past the freeze date. This is a fix for a P1 bug that we just found when trying to build Scalar values of Lists and Structs that contain Decimal128 values. We might be able to work around it some other way, but it would take a lot of changes to the existing Spark plugin code to do that so I wanted to try this first.

Authors:
   - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
   - Kuhu Shukla (https://github.com/kuhushukla)
   - Niranjan Artal (https://github.com/nartal1)
…n `_drop_na_rows` (#10123)

Previously, when `drop_nan == False`, the variable `data_columns` was never created but was still referenced later in the function. This PR fixes that.
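
The class of bug described here, a local variable assigned in only one branch, can be reproduced with a small standalone sketch. The functions below are hypothetical and operate on plain lists; they only mirror the shape of the problem and are not the cudf implementation of `_drop_na_rows`.

```python
import math


def _nan_to_none(column):
    """Treat float NaN as missing by converting it to None."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in column]


def _filter_rows(columns, data_columns):
    """Keep only the rows in which no value of data_columns is None."""
    keep = [all(v is not None for v in row) for row in zip(*data_columns)]
    return [[v for v, k in zip(col, keep) if k] for col in columns]


def drop_na_rows_buggy(columns, drop_nan):
    if drop_nan:
        data_columns = [_nan_to_none(c) for c in columns]
    # Bug: data_columns is unbound when drop_nan is False.
    return _filter_rows(columns, data_columns)


def drop_na_rows_fixed(columns, drop_nan):
    # Fix: bind data_columns on both branches.
    data_columns = [_nan_to_none(c) for c in columns] if drop_nan else columns
    return _filter_rows(columns, data_columns)


cols = [[1.0, float("nan"), None], [1, 2, 3]]
kept_without_nan_drop = drop_na_rows_fixed(cols, drop_nan=False)
kept_with_nan_drop = drop_na_rows_fixed(cols, drop_nan=True)
print(kept_with_nan_drop)
```

Calling `drop_na_rows_buggy(cols, drop_nan=False)` raises `UnboundLocalError`, which is the failure mode the fix removes.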

Authors:
   - Michael Wang (https://github.com/isVoid)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - Bradley Dice (https://github.com/bdice)
Always upload all cudf packages

Authors:
   - Ray Douglass (https://github.com/raydouglass)

Approvers:
   - AJ Schmidt (https://github.com/ajschmidt8)
   - Jordan Jacobelli (https://github.com/Ethyling)
@raydouglass raydouglass merged commit f39f559 into main Feb 2, 2022
Labels
CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.