Fix Automerger for Branch-21.12 from branch-21.10 #9285

Merged

Conversation

galipremsagar
Contributor

This PR resolves conflicts to allow auto-merger from branch-21.10 to branch-21.12: #9274

trxcllnt and others added 26 commits August 25, 2021 14:15
Removes `-g` from the compile commands generated by distutils to compile Cython files.

This will make our container images, conda packages, and python wheels smaller.
Signed-off-by: Jordan Jacobelli <[email protected]>
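
As a rough illustration of the flag change described above, the sketch below filters a standalone `-g` out of a compiler flag string. The helper name `strip_debug_flag` is hypothetical, and this is not necessarily how the commit patches the distutils commands; it only shows the idea.

```python
import shlex


def strip_debug_flag(flags):
    """Return the flag string with every standalone `-g` token removed.

    Hypothetical helper for illustration; variants such as `-g3` are kept.
    """
    kept = [token for token in shlex.split(flags) if token != "-g"]
    return shlex.join(kept)


print(strip_debug_flag("-O2 -g -fPIC"))
```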
Fixes: rapidsai#9234

- [x] This PR optimizes `sort_index` for the case where the `Index` is already sorted in the requested order: it skips both the sort and the subsequent `take` operation. This **alleviates** a lot of **memory pressure** and gives **a 3x to 6x speedup.**

On a T4 GPU:

`This PR`:
```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100000000, 'b':['a', 'b', 'c']*100000000, 'c':[0.0, 0.12, 10.12]*100000000})

In [3]: %timeit df.sort_index()
174 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

`branch-21.10`:

This does not fit into memory and errors out on a T4, because it tries to perform an argsort on an already sorted index.


`This PR`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
69.1 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
15.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
72.6 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
24.1 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```


`branch-21.10`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
71.6 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
71.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
69.1 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
69 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

- [x] Also expands the parameters of `Series.sort_index` and refactors the common implementation into `Frame._sort_index`.
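
The fast path described above can be sketched in plain Python. This is a minimal model under stated assumptions, not the cuDF implementation: `is_sorted` and `sort_by_index` are hypothetical names, and lists stand in for GPU columns. The shortcut applies only when the index is already in the requested order, which matches the timings above where the other combinations still pay for the argsort.

```python
def is_sorted(values, ascending=True):
    """Check whether values are already monotonic in the requested direction."""
    pairs = zip(values, values[1:])
    if ascending:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


def sort_by_index(index, rows, ascending=True):
    """Sort rows by index, skipping the argsort and take when already ordered."""
    if is_sorted(index, ascending):
        # Fast path: no argsort, no gather of the rows.
        return list(index), list(rows)
    # Slow path: argsort the index, then "take" both index and rows.
    order = sorted(range(len(index)), key=index.__getitem__, reverse=not ascending)
    return [index[i] for i in order], [rows[i] for i in order]
```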

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: rapidsai#9238
This PR fixes the `gather` API for struct columns when the input is a sliced column. Previously, `gather` called `child_begin()` and `child_end()` to access the child columns, so if the input struct column was sliced the output was incorrect.

This closes rapidsai#9213, and is blocked by rapidsai#9194 due to conflicting work.
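
To see why reading the children directly goes wrong on a sliced input, here is a toy model in Python. It is only a sketch: `StructColumn`, `raw_children`, and `sliced_children` are hypothetical names, and the assumption is that a slice is a view that records an offset and size over shared child buffers.

```python
class StructColumn:
    """Toy struct column: a view (offset, size) over shared child lists."""

    def __init__(self, children, offset=0, size=None):
        self.children = children
        self.offset = offset
        self.size = len(children[0]) - offset if size is None else size

    def slice(self, start, stop):
        # A slice shares the children and only adjusts offset and size.
        return StructColumn(self.children, self.offset + start, stop - start)

    def raw_children(self):
        # Ignores the slice: this models the buggy access pattern.
        return self.children

    def sliced_children(self):
        # Applies the parent's offset and size to every child.
        end = self.offset + self.size
        return [child[self.offset:end] for child in self.children]


def gather(column, gather_map, respect_slice=True):
    """Gather struct rows (as tuples) at the positions in gather_map."""
    children = column.sliced_children() if respect_slice else column.raw_children()
    return [tuple(child[i] for child in children) for i in gather_map]
```

With `col = StructColumn([[10, 20, 30, 40], ["a", "b", "c", "d"]])` and `view = col.slice(2, 4)`, gathering rows `[0, 1]` from `view` should yield the third and fourth structs; with `respect_slice=False` it returns the first two instead.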

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#9218
When rapidsai#9030 was merged, it incorrectly resolved `get_cucollections.cmake` to use features of `rapids_cpm_find` while still calling `CPMFindPackage`. This corrects the issue by properly calling `rapids_cpm_find`.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#9189
libcudf doesn't expose zlib in its public-facing API, so C++ consumers don't need to link against or include zlib themselves.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Keith Kraus (https://github.com/kkraus14)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#9204
Only run imports tests on x86_64
Provides the Python/Cython bindings for rapidsai#8702 multibyte_split. This PR depends on rapidsai#8702 being merged first.

Closes rapidsai#8557

Authors:
  - Jeremy Dyer (https://github.com/jdye64)
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#8998
Temporary workaround for `arm64`
Importing cudf on arm64 CPU-only nodes currently fails because arm64 and amd64 report GPU devices differently.

Authors:
  - Jordan Jacobelli (https://github.com/Ethyling)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#9252
Fixes rapidsai#8905.

Attempting groupby aggregations with `LIST` keys leads to silent
failures and bad results.
For instance, attempting hash-based `groupby` aggregations with `LIST`
keys only fails on DEBUG builds, thus:
```
/home/myth/dev/cudf/2/cpp/include/cudf/table/row_operators.cuh:447: unsigned int cudf::element_hasher_with_seed<hash_function, has_nulls>::operator()(cudf::column_device_view, signed int) const [with T = cudf::list_view; void *<anonymous> = (void *)nullptr; hash_function = default_hash; __nv_bool has_nulls = false]: block: [0,0,0], thread: [0,0,0] Assertion `false && "Unsupported type in hash."` failed.
```
In RELEASE builds, a copy of the input `LIST` column is returned, causing
each output row to be interpreted as its own group.

This commit adds an explicit failure for unsupported groupby key types,
i.e. those that don't support equality comparisons (like `LIST`).
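
The idea of failing up front instead of returning bad results can be sketched as follows. This is an illustrative Python model, not the libcudf change: the registry `EQUALITY_COMPARABLE_TYPES` and both function names are assumptions made for the example.

```python
# Assumption: a stand-in registry of key types that support equality comparison.
EQUALITY_COMPARABLE_TYPES = {"INT32", "FLOAT64", "STRING"}


def validate_groupby_keys(key_types):
    """Raise before any aggregation runs if a key type cannot be compared."""
    unsupported = [t for t in key_types if t not in EQUALITY_COMPARABLE_TYPES]
    if unsupported:
        raise TypeError("Unsupported groupby key type(s): " + ", ".join(unsupported))


def groupby_count(keys, key_type):
    """Count rows per key, rejecting unsupported key types explicitly."""
    validate_groupby_keys([key_type])
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts
```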

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: rapidsai#9227
Fixes: rapidsai#9254 

This PR fixes `deserialize` in `cudf.MultiIndex` so that data is no longer corrupted when there are duplicate names.
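
One way duplicate names can corrupt data during a rebuild is sketched below. Whether this is the exact cause fixed here is an assumption; the two helper functions are hypothetical and use plain lists in place of columns.

```python
def rebuild_with_dict(names, columns):
    """Key columns by name: duplicates collapse and the last column wins."""
    by_name = dict(zip(names, columns))
    return [(name, by_name[name]) for name in names]


def rebuild_with_pairs(names, columns):
    """Keep names and columns paired by position, so duplicates survive."""
    return list(zip(names, columns))


names = ["x", "x"]
columns = [[1, 2], [3, 4]]
print(rebuild_with_dict(names, columns))
print(rebuild_with_pairs(names, columns))
```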

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#9258
This PR adds support for the struct type to the existing `drop_list_duplicates` API. This is the first time a nested type is supported in this function. Some code cleanup has also been done.

To be clear: Only structs of basic types and structs of structs are supported. Structs of lists are not, due to their complex nature.
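
The semantics described above can be modeled in a few lines of Python. This is a sketch, not the libcudf algorithm: structs are represented as tuples (nested tuples for structs of structs), the function name `drop_duplicates_per_row` is hypothetical, and keeping the first occurrence in its original position is an assumption about ordering.

```python
def drop_duplicates_per_row(rows):
    """For each list row, keep only the first occurrence of each element."""
    result = []
    for row in rows:
        seen = set()
        unique = []
        for element in row:
            if element not in seen:
                seen.add(element)
                unique.append(element)
        result.append(unique)
    return result


rows = [
    [(1, "a"), (1, "a"), (2, "b")],    # list of structs of basic types
    [((1, 2), "x"), ((1, 2), "x")],    # list of structs of structs
    [],
]
print(drop_duplicates_per_row(rows))
```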

Closes rapidsai#8972.
Blocked by rapidsai#9218 (it is merged).

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/nvdbaranec
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#9202
…apidsai#9263)

Closes rapidsai#9156

This PR simplifies the parameters passed to `thrust::reduce_by_key` for the argmin/argmax aggregations in `cudf::groupby`. The illegal memory access found in rapidsai#9156 was caused by invalid data being passed from `thrust::reduce_by_key` through to the BinaryPredicate function, as documented in NVIDIA/thrust#1525.

The invalid data is only a real issue for strings columns, where the device pointer was neither `nullptr` nor a valid address. The new logic provides only `size_type` values to `thrust::reduce_by_key`, so an invalid value can only be out of bounds for the input column, which is easily checked before retrieving the `string_view` objects within the ArgMin and ArgMax operators.

This is the same as rapidsai#9244 but based on 21.10.
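
The index-only reduction can be illustrated with a CPU analogue in Python. This is a sketch under assumptions: `argmin_by_key` is a hypothetical name, `itertools.groupby` stands in for reducing runs of consecutive equal keys, and the bounds check mirrors the idea that an invalid index is detectable before the value is read.

```python
from functools import reduce
from itertools import groupby


def argmin_by_key(keys, values):
    """Return {key: index of the smallest value} for runs of equal keys.

    Only row indices flow through the reduction; values are looked up
    after a bounds check, so an invalid index is never dereferenced.
    """
    def pick(i, j):
        if not 0 <= i < len(values):
            return j
        if not 0 <= j < len(values):
            return i
        return j if values[j] < values[i] else i

    result = {}
    for key, run in groupby(range(len(keys)), key=keys.__getitem__):
        result[key] = reduce(pick, run)
    return result
```

For example, with keys `["a", "a", "b", "b", "b"]` and string values `["pear", "apple", "kiwi", "fig", "grape"]`, the operator compares strings only through valid indices.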

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#9263
…lean (rapidsai#9192)

Currently, we map the boolean type to `pa.int8` because the bit width of a cudf boolean does not match that of an Arrow boolean. However, the implications of this mapping are subtle and may cause unwanted results such as:

```python
>>> cudf.StructDtype({
    "a": np.bool_,
    "b": np.int8,
})
StructDtype({'a': dtype('int8'), 'b': dtype('int8')})
```

This PR changes the mapping back to `pa.bool_` and uses explicit type handling when converting types to Arrow.
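
Why the old mapping loses information can be shown with a toy round trip. The tables below use plain type-name strings and are assumptions made for illustration; no pyarrow or numpy calls are involved, and `round_trip` is a hypothetical helper.

```python
# Toy type-name tables only; these are not the real cudf or pyarrow mappings.
OLD_TO_ARROW = {"bool": "int8", "int8": "int8"}
NEW_TO_ARROW = {"bool": "bool", "int8": "int8"}
FROM_ARROW = {"bool": "bool", "int8": "int8"}


def round_trip(fields, to_arrow):
    """Map each field dtype to its Arrow name and back again."""
    return {name: FROM_ARROW[to_arrow[dtype]] for name, dtype in fields.items()}


fields = {"a": "bool", "b": "int8"}
print(round_trip(fields, OLD_TO_ARROW))  # field "a" comes back as int8
print(round_trip(fields, NEW_TO_ARROW))  # field "a" stays bool
```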

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - https://github.com/brandon-b-miller
  - H. Thomson Comer (https://github.com/thomcom)

URL: rapidsai#9192
Fixes a Java column vector leak in TableTest#testParquetWriteMap.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: rapidsai#9271
Forward-merge `branch-21.08` into `branch-21.10`
This changes the calls in java/cudf to check for an empty input and return an empty result instead of crashing.

Fixes rapidsai#9253

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#9262
Closes rapidsai#8660 

Per the discussion in rapidsai#8872, this PR adds a struct-accessor member function that provides a lateral view of a struct-type series.

Example: 
```python
>>> import cudf, dask_cudf as dgd
>>> ds = dgd.from_cudf(cudf.Series(
...     [{'a': 42, 'b': 'str1', 'c': [-1]},
...      {'a': 0,  'b': 'str2', 'c': [400, 500]},
...      {'a': 7,  'b': '',     'c': []}]), npartitions=2)
>>> ds.struct.explode().compute()
    a     b           c
0  42  str1        [-1]
1   0  str2  [400, 500]
2   7                []
```

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: rapidsai#9086
…view (rapidsai#9185)

Fixes rapidsai#9140 
Added `shallow_hash(column_view)`
Added unit tests

It computes hash values based on the shallow states of `column_view`:
type, size, data pointer, null_mask pointer,  offset, and the hash value of the children. 
`null_count` is not used since it is a cached value and it may vary based on contents of `null_mask`, and may be pre-computed or not.

Fixes rapidsai#9139
Added `is_shallow_equivalent(column_view, column_view)` ~shallow_equal~
Added unit tests

It compares two `column_view`s based on the shallow states of `column_view`:
type, size, data pointer, null_mask pointer, offset, and the `column_view` of the children.
`null_count` is not used since it is a cached value and it may vary based on contents of `null_mask`, and may be pre-computed or not.
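
The two functions can be modeled in Python to show what "shallow state" means. This is a sketch rather than the C++ implementation: the `ColumnView` dataclass and its integer "pointers" are stand-ins invented for the example, and Python's built-in `hash` replaces whatever hash combination libcudf uses.

```python
from dataclasses import dataclass


@dataclass
class ColumnView:
    """Toy column view; pointers are modeled as plain integers."""
    dtype: str
    size: int
    data_ptr: int
    null_mask_ptr: int
    offset: int = 0
    children: tuple = ()
    null_count: int = -1  # cached value, deliberately ignored below


def _shallow_state(view):
    # Everything except null_count and the children.
    return (view.dtype, view.size, view.data_ptr, view.null_mask_ptr, view.offset)


def shallow_hash(view):
    """Hash the shallow state plus the shallow hashes of the children."""
    child_hashes = tuple(shallow_hash(child) for child in view.children)
    return hash(_shallow_state(view) + (child_hashes,))


def is_shallow_equivalent(lhs, rhs):
    """Compare shallow state recursively, without reading column contents."""
    return (
        _shallow_state(lhs) == _shallow_state(rhs)
        and len(lhs.children) == len(rhs.children)
        and all(is_shallow_equivalent(a, b) for a, b in zip(lhs.children, rhs.children))
    )
```

Two views over the same buffers that differ only in their cached `null_count` hash identically and compare as equivalent, while changing the offset breaks both.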

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)

URL: rapidsai#9185
This PR strips the pyarrow-NativeFile component out of rapidsai#9225 (since those changes are not yet stable). I feel that it is reasonable to start by merging these fsspec-specific optimizations for 21.10, because they are stable and already result in a significant performance boost over the existing approach to remote storage. I still think it is very important that we eventually plumb NativeFile support into Python (cudf and dask_cudf), but we will likely need to target 21.12 for that improvement.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: rapidsai#9265
Fixes rapidsai#7830, rapidsai#8443

Features:
- Use the new table metadata type that matches the table hierarchy, `table_input_metadata`.
- Support struct columns in the writer.

Changes:
- Null masks are encoded as aligned rowgroups to avoid invalid bits when the number of encoded rows is not divisible by 8 (except for the last rowgroup in each stripe). This also affects list columns. The issue is equivalent to rapidsai#6763 (boolean columns only).
- Added pushdown masks that are used to determine which child elements should not be encoded, including null mask bits.
- Use pushdown masks for rowgroup alignment, null mask encoding and value encoding.
- Separated the null mask encoding from value encoding - can be further moved to a separate kernel call.

Breaking because the table metadata type has changed.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Devavret Makkar (https://github.com/devavret)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

URL: rapidsai#9025
Aligns the function signature for `cudf.DataFrame.apply` with that of `pandas.DataFrame.apply`. Among other reasons, this is needed so that dask can build on a common `apply` interface between backends.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#9275
@galipremsagar galipremsagar requested review from a team as code owners September 23, 2021 16:02
@github-actions github-actions bot added the CMake, conda, Java, Python, and libcudf labels Sep 23, 2021
@galipremsagar galipremsagar added the non-breaking and improvement labels Sep 23, 2021
codecov bot commented Sep 23, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.12@a4771b3). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head ab4bfaa differs from pull request most recent head 3ed97af. Consider uploading reports for the commit 3ed97af to get more accurate results

@@               Coverage Diff               @@
##             branch-21.12    #9285   +/-   ##
===============================================
  Coverage                ?   10.79%           
===============================================
  Files                   ?      116           
  Lines                   ?    18869           
  Branches                ?        0           
===============================================
  Hits                    ?     2036           
  Misses                  ?    16833           
  Partials                ?        0           


@ajschmidt8 ajschmidt8 merged commit e0cf38b into rapidsai:branch-21.12 Sep 23, 2021