[RELEASE] cudf v22.02 #10101
Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
This PR improves the C++ developer guide. My primary goal was to fix some invalid links. The diff is a bit large because of some minor changes made to establish a consistent style and improve the reading/editing experience (e.g., replacing a few instances of tabs with spaces, trimming trailing whitespace, wrapping sections that were not wrapped like the rest of the file, and correcting typos that I came across while reading). To save time, I recommend that reviewers use the option in GitHub's review tab that ignores whitespace changes.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)

URL: #9675
Signed-off-by: Peixin Li <[email protected]>

cudfjni version update. NOTE: this includes a change to use gpuci/cuda images, since the official CUDA images are not yet available on Docker Hub.

Authors:
- Peixin (https://github.com/pxLi)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #9681
…xed_point` (#9658)

This PR adds Java bindings for `is_fixed_point`.

Authors:
- Raza Jafri (https://github.com/razajafri)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Robert (Bobby) Evans (https://github.com/revans2)
- David Wendt (https://github.com/davidwendt)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #9658
Fixes: #9642

This PR fixes an issue where null values were treated as `False` when a `boolean` dtype was passed to the `Series` constructor.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #9691
Fixes: #7102
Replaces: [#9488](https://github.com/rapidsai/cudf/pull/9488/files)

Authors:
- Sheilah Kirui (https://github.com/skirui-source)
- Mayank Anand (https://github.com/mayankanand007)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Michael Wang (https://github.com/isVoid)
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9571
Adds additional checks for int8 and int16. Fixes [#/rapidsai/cudf/4127](NVIDIA/spark-rapids#4127).

Authors:
- Raza Jafri (https://github.com/razajafri)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Nghia Truong (https://github.com/ttnghia)

URL: #9707
Closes #9615

Adds the following API to the Parquet writer:
- Set maximum row group size, in bytes (minimum of 512KB);
- Set maximum row group size, in rows (minimum of 5000).

The API is more limited than its ORC equivalent because of limitations in Parquet page size control/estimation.

Other changes:
- Fix naming in some ORC APIs to be consistent.
- Change `rowgroup` to `row_group` in APIs, since the Parquet specs refer to this as "row group", not "rowgroup".
- Replace some `uint32_t` use in the Parquet writer.
- Remove unused `target_page_size`.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
- Ashwin Srinath (https://github.com/shwina)

URL: #9677
This PR is a pretty thorough rewrite of the internals of merging. There is a ton of complexity imposed by matching all the different edge cases allowed by the pandas API, but I've tried to unify the logic for different code paths as much as possible. I've also added checks for a number of edge cases that were not previously being handled.

I see about a 10% performance improvement for merges on small to medium data sizes from this PR (as expected, there's no change for large data, where most time is spent in C++). There's also a substantial reduction in total code that should make it easier to address issues going forward.

I'm still not entirely happy with the complexity of the result and I think that further simplification should be possible, but I think this is a sufficiently large step forward to be worth merging in this state, especially if it helps enable other changes to joining.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #9516
This PR is a basic implementation of the [interchange dataframe protocol](https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py) for cudf.

As is well known, there are many dataframe libraries out there, where one library's weakness is handled by another. To work across these libraries, we rely on `pandas` with methods such as `from_pandas` and `to_pandas`. This is a bad design, as libraries must maintain an additional dependency on pandas and its peculiarities.

This protocol provides a high-level API that must be implemented by dataframe libraries to allow communication between them. Thus, we get rid of the tight coupling with pandas and depend only on the protocol API, where each library has the freedom of its implementation details.

To illustrate:
- `df_obj = cudf_dataframe.__dataframe__()`: `df_obj` can be consumed by any library implementing the protocol.
- `df = cudf.from_dataframe(any_supported_dataframe)`: here we create a cudf dataframe from any dataframe object supporting the protocol.

So far, it supports the following:
- Column dtypes: `uint8`, `int`, `float`, `bool` and `categorical`.
- Missing values are handled for all these dtypes.
- `string` support is on the way.

Additionally, we support dataframes from a CPU device, such as `pandas`. But this is not testable here, as pandas has not yet adopted the protocol. We've tested it locally with a monkey-patched pandas implementation of the protocol.

Authors:
- Ismaël Koné (https://github.com/iskode)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9071
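The decoupling idea described above can be sketched in pure Python. This is a toy illustration only, not cudf's implementation: `ToyInterchangeFrame`, `ProducerFrame`, `ConsumerFrame`, and this standalone `from_dataframe` are hypothetical names, and the interchange object here returns plain lists where the real protocol returns column objects backed by buffers. Only the `__dataframe__` entry point and the general producer/consumer shape are taken from the description.

```python
class ToyInterchangeFrame:
    """Hypothetical, simplified stand-in for the object returned by __dataframe__()."""

    def __init__(self, data):
        self._data = data

    def column_names(self):
        return list(self._data)

    def get_column_by_name(self, name):
        # Simplification: hand back a plain list instead of a column object.
        return list(self._data[name])


class ProducerFrame:
    """Hypothetical library A: exposes its data only through the protocol."""

    def __init__(self, data):
        self._data = dict(data)

    def __dataframe__(self):
        return ToyInterchangeFrame(self._data)


class ConsumerFrame:
    """Hypothetical library B: knows nothing about ProducerFrame internals."""

    def __init__(self, columns):
        self.columns = columns


def from_dataframe(obj):
    """Build a ConsumerFrame from any object that implements __dataframe__."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the interchange protocol")
    xchg = obj.__dataframe__()
    return ConsumerFrame(
        {name: xchg.get_column_by_name(name) for name in xchg.column_names()}
    )


source = ProducerFrame({"a": [1, 2], "b": [True, None]})
converted = from_dataframe(source)
print(converted.columns)
```

The consumer depends only on the methods of the interchange object, so either library can change its internal storage without breaking the other.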
Depends on #9040 and (unfortunately) #9041

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #9089
Follow-up to #9571, where we add `ceil` and `floor` support for `Series`. Here we add `ceil` and `floor` support to the `DatetimeIndex` class. This PR depends on #9571 being merged first, since it assumes the `libcudf` implementation for `floor` exists.

Authors:
- Mayank Anand (https://github.com/mayankanand007)

Approvers:
- Michael Wang (https://github.com/isVoid)
- Ashwin Srinath (https://github.com/shwina)

URL: #9554
This PR continues to address #8974, adding support for structs in `min` and `max` reductions.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Mark Harris (https://github.com/harrism)
- https://github.com/nvdbaranec

URL: #9697
Regular spell check fixes in comments and docs.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #9682
…#9715)

Closes #9620

Fixes an edge case described in https://docs.python.org/3/library/re.html#re.MULTILINE: the '$' EOL regex pattern character (without `MULTILINE` set) should match at the very end of a string and also just before the end of the string if the string ends with a newline.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Christopher Harris (https://github.com/cwharris)
- Vukasin Milovanovic (https://github.com/vuule)
- Sheilah Kirui (https://github.com/skirui-source)

URL: #9715
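For reference, the behavior being matched can be reproduced with Python's standard `re` module. This sketch shows the CPU-side semantics from the linked documentation only; it does not exercise cudf, and the sample strings are made up for illustration.

```python
import re

# Without re.MULTILINE, '$' matches at the very end of the string and also
# just before a single newline that terminates the string.
pattern = re.compile(r"abc$")

ends_plain = pattern.search("abc") is not None
ends_with_newline = pattern.search("abc\n") is not None
ends_with_two_newlines = pattern.search("abc\n\n") is not None
before_inner_newline = pattern.search("abc\ndef") is not None

# With re.MULTILINE, '$' additionally matches before every newline.
before_inner_newline_multiline = (
    re.search(r"abc$", "abc\ndef", re.MULTILINE) is not None
)

print(ends_plain, ends_with_newline, ends_with_two_newlines,
      before_inner_newline, before_inner_newline_multiline)
```

The second case (`"abc\n"`) is the edge case the fix addresses: the match succeeds even though the string does not literally end with `abc`.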
This PR adds clang-tidy to cudf along with an initial set of checks. More checks will be enabled in the future.

Relevant PRs:
* `rmm`: rapidsai/rmm#857
* `cuml`: rapidsai/cuml#1945

To do list:
* [x] Add `.clang-tidy` file
* [x] Add python script
* [x] Apply `modernize-` changes
* [x] Revert `cxxopts` changes
* [x] Fixed Python parquet failures
* [x] Ignore `cxxopts` file
* [x] Ignore the `build/_deps` directories

Splitting out the following into a separate PR so we can get the changes merged for 22.02 (#10064):
* ~~[ ] Disable `clang-diagnostic-errors/warnings`~~
* ~~[ ] Fix include files being skipped~~
* ~~[ ] Set up CI script~~
* ~~[ ] Clean up python script~~

Authors:
- Conor Hoekstra (https://github.com/codereport)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)
- David Wendt (https://github.com/davidwendt)
- Mark Harris (https://github.com/harrism)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9860
The `libcudacxx.patch` was required to fix issues with libcudacxx 1.6 and incorrect detection of the ARM nvcc 11.4 compiler. As we move to libcudacxx 1.7, this patch is no longer needed and should be removed.

Authors:
- Robert Maynard (https://github.com/robertmaynard)

Approvers:
- Mark Harris (https://github.com/harrism)

URL: #10057
…storage (#9589)

**Important Note**: ~Marking this as WIP until the `fsspec.parquet` module is available in a filesystem_spec release~ (fsspec.parquet module is available)

This PR modifies `cudf.read_parquet` and `dask_cudf.read_parquet` to leverage the new `fsspec.parquet.open_parquet_file` function for optimized data transfer/caching from remote storage. The ~long-term~ goal is to remove the temporary data-transfer optimizations that we currently use in `cudf.read_parquet`.

**Performance Motivation**:

```python
In [1]: import cudf, dask_cudf
   ...: path = [
   ...:     "gs://my-bucket/criteo-parquet/day_0.parquet",
   ...:     "gs://my-bucket/criteo-parquet/day_1.parquet",
   ...: ]

# cudf BEFORE
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 11.1 s, sys: 11.5 s, total: 22.6 s
Wall time: 24.4 s

# cudf AFTER
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 3.48 s, sys: 722 ms, total: 4.2 s
Wall time: 6.32 s

# (Threaded) Dask-cudf BEFORE
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 27.1 s, sys: 15.5 s, total: 42.6 s
Wall time: 57.6 s

# (Threaded) Dask-cudf AFTER
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 3.43 s, sys: 851 ms, total: 4.28 s
Wall time: 13.1 s
```

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- https://github.com/brandon-b-miller
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #9589
As we will remove Python 3.7, we need to update the Python version in the upload scripts.

Authors:
- Jordan Jacobelli (https://github.com/Ethyling)

Approvers:
- Sevag Hanssian (https://github.com/sevagh)
- AJ Schmidt (https://github.com/ajschmidt8)

URL: #10092
Depends on #10041.

The previous ORC writer API exposed only a binary choice for statistics: ENABLED/DISABLED. This commit allows the ORC writer to further choose whether statistics are collected at the ROW_GROUP or STRIPE level.

This commit also includes the relevant changes to `java/` and `python/`.

Authors:
- MithunR (https://github.com/mythrocks)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Jason Lowe (https://github.com/jlowe)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Christopher Harris (https://github.com/cwharris)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #10058
Codecov Report
@@ Coverage Diff @@
## main #10101 +/- ##
==========================================
- Coverage 10.56% 10.42% -0.15%
==========================================
Files 116 119 +3
Lines 18677 20606 +1929
==========================================
+ Hits 1974 2148 +174
- Misses 16703 18458 +1755
I know that this is past the freeze date. This is a fix for a P1 bug that we just found when trying to build Scalar values of Lists and Structs that contain Decimal128 values. We might be able to work around it some other way, but it would take a lot of changes to the existing Spark plugin code to do that, so I wanted to try this first.

Authors:
- Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
- Kuhu Shukla (https://github.com/kuhushukla)
- Niranjan Artal (https://github.com/nartal1)
…n `_drop_na_rows` (#10123)

Currently, when `drop_nan == False`, the variable `data_columns` is never created but is still referenced later in the function. This PR fixes that.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Bradley Dice (https://github.com/bdice)
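The class of bug described here, a variable assigned in only one branch and then used unconditionally, can be sketched in plain Python. This is not cudf's code: `drop_rows_buggy`, `drop_rows_fixed`, and `_drop_null_rows` are hypothetical helpers operating on lists of column values, written only to show the failure mode and one plausible shape of the fix.

```python
def _drop_null_rows(columns):
    """Hypothetical helper: keep only rows in which no column holds None."""
    kept = [row for row in zip(*columns) if all(v is not None for v in row)]
    return [list(col) for col in zip(*kept)] if kept else [[] for _ in columns]


def drop_rows_buggy(columns, drop_nan):
    if drop_nan:
        # NaN is the only value that is not equal to itself; map it to None.
        data_columns = [[None if v != v else v for v in col] for col in columns]
    # Bug: data_columns was never assigned when drop_nan is False.
    return _drop_null_rows(data_columns)


def drop_rows_fixed(columns, drop_nan):
    if drop_nan:
        data_columns = [[None if v != v else v for v in col] for col in columns]
    else:
        # Fix: always bind data_columns, whichever branch is taken.
        data_columns = columns
    return _drop_null_rows(data_columns)


cols = [[1.0, float("nan"), None]]

try:
    drop_rows_buggy(cols, drop_nan=False)
    buggy_error = None
except UnboundLocalError as exc:
    buggy_error = exc

nulls_only = drop_rows_fixed(cols, drop_nan=False)   # NaN row is kept
nulls_and_nans = drop_rows_fixed(cols, drop_nan=True)  # NaN row is dropped
```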
Always upload all cudf packages

Authors:
- Ray Douglass (https://github.com/raydouglass)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Jordan Jacobelli (https://github.com/Ethyling)
❄️ Code freeze for `branch-22.02` and `v22.02` release

What does this mean?
Only critical/hotfix level issues should be merged into `branch-22.02` until release (merging of this PR).

What is the purpose of this PR?
`branch-22.02` into `main` for the release