Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gpuCI] Forward-merge branch-21.10 to branch-21.12 [skip gpuci] #9274

Merged
merged 25 commits into from
Sep 23, 2021

Conversation

GPUtester
Copy link
Collaborator

Forward-merge triggered by push to branch-21.10 that creates a PR to keep branch-21.12 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.

trxcllnt and others added 19 commits August 25, 2021 14:15
Removes `-g` from the compile commands generated by distutils to compile Cython files.

This will make our container images, conda packages, and python wheels smaller.
Signed-off-by: Jordan Jacobelli <[email protected]>
Fixes: #9234

- [x] This PR introduces optimizations to `sort_index` when there is an already sorted `Index` object and avoids sorting them and performing a `take` operation on them. This **alleviates** a lot of **memory pressure** and has **a 3x to 6x speed up.**

On a T4 GPU:

`This PR`:
```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100000000, 'b':['a', 'b', 'c']*100000000, 'c':[0.0, 0.12, 10.12]*100000000})

In [3]: %timeit df.sort_index()
174 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

`branch-21.10`:

Won't fit into memory and will error :( on T4 as it tries to perform argsort on an already sorted index.


`THIS PR`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
69.1 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
15.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
72.6 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
24.1 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```


`branch-21.10`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
71.6 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
71.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
69.1 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
69 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

- [x] Also expands params to `Series.sort_index` and refactored the common implementation to `Frame._sort_index`.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #9238
This PR fixes the `gather` API for structs columns when the input is a sliced column. Previously, `gather` calls `child_begin()` and `child_end()` to access the children column so if the input structs column is sliced then the output is incorrect.

This closes #9213, and is blocked by #9194 due to conflict work.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Mark Harris (https://github.com/harrism)

URL: #9218
When #9030 was merged it incorrectly resolved `get_cucollections.cmake` to use features of `rapids_cpm_find` but still call `CPMFindPackage`. This corrects the issues by properly calling `rapids_cpm_find`.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Mark Harris (https://github.com/harrism)

URL: #9189
libcudf doesn't expose zlib in the public facing API, and therefore C++ consumers don't need to also link / include zlib.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Keith Kraus (https://github.com/kkraus14)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Mark Harris (https://github.com/harrism)

URL: #9204
Only run imports tests on x86_64
Provides the Python/Cython bindings for #8702 multibyte_split. This PR depends on #8702 being merged first.

Closes #8557

Authors:
  - Jeremy Dyer (https://github.com/jdye64)
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8998
Temporary workaround for `arm64`
Importing cudf on arm64 CPU only nodes is currently not working due to a difference in reported gpu devices between arm64 and amd64

Authors:
  - Jordan Jacobelli (https://github.com/Ethyling)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: #9252
Fixes #8905.

Attempting groupby aggregations with `LIST` keys leads to silent
failures and bad results.
For instance, attempting hash-based `groupby` aggregations with `LIST`
keys only fails on DEBUG builds, thus:
```
/home/myth/dev/cudf/2/cpp/include/cudf/table/row_operators.cuh:447: unsigned int cudf:
:element_hasher_with_seed<hash_function, has_nulls>::operator()(cudf::column_device_view, signed in
t) const [with T = cudf::list_view; void *<anonymous> = (void *)nullptr; hash_function = default_ha
sh; __nv_bool has_nulls = false]: block: [0,0,0], thread: [0,0,0] Assertion `false && "Unsupported
type in hash."` failed.
```
In RELEASE builds, a copy of the input `LIST` column is returned, causing
each output row to be interpreted as its own group.

This commit adds an explicit failure for unsupported groupby key types,
i.e. those that don't support equality comparisons (like `LIST`).

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9227
Fixes: #9254 

This PR fixes `deserialize` in `cudf.MultiIndex` so that there is no data-corruption happening when there are duplicate names.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9258
This PR add support for struct type into the existing `drop_list_duplicates` API. This is the first time a nested type is supported in this function. Some more code cleanup has also been done.

To be clear: Only structs of basic types and structs of structs are supported. Structs of lists are not, due to their complex nature.

Closes #8972.
Blocked by #9218 (it is merged).

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/nvdbaranec
  - Mark Harris (https://github.com/harrism)

URL: #9202
…9263)

Closes #9156

This PR simplifies the parameters when calling thrust::reduce_by_key for the argmin/argmax aggregations in cudf::groupby. The illegalMemoryAccess found in #9156 was due to invalid data being passed from thrust::reduce_by_key through to the BinaryPredicate function as documented in NVIDIA/thrust#1525

The invalid data being passed is only a real issue for strings columns where the device pointer was neither nullptr nor a valid address. The new logic provides only size_type values to thrust::reduce_by_key so invalid values can only be out-of-bounds for the input column which is easily checked before retrieving the string_view objects within the ArgMin and ArgMax operators.

This the same as #9244 but based on 21.10

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #9263
…lean (#9192)

Currently, we map boolean type to `pa.int8` because the bitwidth of cudf boolean mismatches that in arrow. However the implication of this mapping is subtle and may cause unwanted result such as:

```python
>>> cudf.StructDtype({
    "a": np.bool_,
    "b": np.int8,
})
StructDtype({'a': dtype('int8'), 'b': dtype('int8')})
```

This PR changes the mapping back to `pa.bool_`, and use explicit type handling when we are dealing with type conversion to arrow.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - https://github.com/brandon-b-miller
  - H. Thomson Comer (https://github.com/thomcom)

URL: #9192
Fixes a Java column vector leak in TableTest#testParquetWriteMap.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #9271
Forward-merge `branch-21.08` into `branch-21.10`
This changes the calls in java/cudf to check for an empty input and return an empty result instead of crashing.

Fixes #9253

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #9262
@GPUtester GPUtester requested review from a team as code owners September 22, 2021 19:00
@GPUtester
Copy link
Collaborator Author

FAILURE - Unable to forward-merge due to conflicts, manual merge is necessary. Do not use the Resolve conflicts option in this PR, follow these instructions https://docs.rapids.ai/maintainers/gpuci/#forward-mergers
IMPORTANT: Before merging and approving this PR, be sure to change the merging strategy to Create a Merge Commit. Otherwise, history will be lost and the branches become incompatible.

@github-actions github-actions bot added CMake CMake build issue conda labels Sep 22, 2021
@github-actions github-actions bot added Java Affects Java cuDF API. Python Affects Python cuDF API. gpuCI libcudf Affects libcudf (C++/CUDA) code. labels Sep 22, 2021
isVoid and others added 6 commits September 22, 2021 19:05
Closes #8660 

Per discussions in thread #8872 , this PR adds a struct-accessor member function to provide a lateral view to a struct type series.

Example: 
```python
>>> import cudf, dask_cudf as dgd
>>> ds = dgd.from_cudf(cudf.Series(
...     [{'a': 42, 'b': 'str1', 'c': [-1]},
...      {'a': 0,  'b': 'str2', 'c': [400, 500]},
...      {'a': 7,  'b': '',     'c': []}]), npartitions=2)
>>> ds.struct.explode().compute()
    a     b           c
0  42  str1        [-1]
1   0  str2  [400, 500]
2   7                []
```

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9086
…view (#9185)

Fixes #9140 
Added `shallow_hash(column_view)`
Added unit tests

It computes hash values based on the shallow states of `column_view`:
type, size, data pointer, null_mask pointer,  offset, and the hash value of the children. 
`null_count` is not used since it is a cached value and it may vary based on contents of `null_mask`, and may be pre-computed or not.

Fixes #9139
Added `is_shallow_equivalent(column_view, column_view)` ~shallow_equal~
Added unit tests

It compares two column_views based on the shallow states of column_view:
type, size, data pointer, null_mask pointer, offset, and the column_view of the children.
null_count is not used since it is a cached value and it may vary based on contents of null_mask, and may be pre-computed or not.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)

URL: #9185
This PR strips the pyarrow-NativeFile component out of #9225 (since those changes are not yet stable).  I feel that it is reasonable to start by merging these fsspec-specific optimizations for 21.10, because they are stable and already result in a significant performance boost over the existing approach to remote storage. I still think it is very important that we eventually plumb NativeFile support into python (cudf and dask_cudf), but we will likely need to target 21.12 for that improvement.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #9265
Fixes #7830, #8443

Features:
- Use the new table metadata type that matches the table hierarchy, `table_input_metadata`.
- Support struct columns in the writer.

Changes:
- Null masks are encoded as aligned rowgroups to avoid invalid bits when the number of encoded rows is not divisible by 8 (except for the last rowgroup in each stripe). This also affects list columns. The issue is equivalent to #6763 (boolean columns only).
- Added pushdown masks that are used to determine which child elements should not be encoded, including null mask bits.
- Use pushdown masks for rowgroup alignment, null mask encoding and value encoding.
- Separated the null mask encoding from value encoding - can be further moved to a separate kernel call.

Breaking because the table metadata type has changed.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Devavret Makkar (https://github.com/devavret)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

URL: #9025
Aligns the function signature for `cudf.DataFrame.apply` with that of `pandas.DataFrame.apply`. This is needed so that dask can build on a common `apply` interface between backends among other reasons.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9275
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.