Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge rapidsai/branch 22.10 #1

Merged
merged 8 commits into from
Jul 28, 2022

Conversation

etseidl
Copy link
Owner

@etseidl etseidl commented Jul 28, 2022

No description provided.

GPUtester and others added 8 commits July 27, 2022 16:45
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
Update JNI version to 22.10.0-SNAPSHOT

jni cmake change has been covered in df24808#diff-85b071fcb8436e7b5eedefcb44bd66949e9d8062d160f7317e19fc50ace7c887

Authors:
  - Peixin (https://github.com/pxLi)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #11338
This PR fixes timeouts like the following that are happening across all our s3 tests in code-base:
```python
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/lib/python3.9/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/..pyenv/versions/3.9.6/lib/python3.9/http/client.py", line 1349, in getresponse
    response.begin()
  File "/..pyenv/versions/3.9.6/lib/python3.9/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/..pyenv/versions/3.9.6/lib/python3.9/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/..pyenv/versions/3.9.6/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
```
This seems to be an issue with `werkzeug` package which was updated 1 day ago. So until there is an upstream fix for it, we will temporarily need to pin this package.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #11369
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
I wanted to get this up in a PR of its own so we could get some discussion going if necessary. This adds a `byte_array_view` which is almost identical to a `string_view`. The goal is to be able to get these for list columns of bytes, so `list<uint8>` and `list<int8>`. I didn't template it on the type, but instead selected `uint8_t` because `std::byte` is a `uint8_t`. My PR for writing byte arrays in parquet will use this to get the rows of byte data for statistics and writing. That PR is forthcoming. I left this code down in cuio statistics due to the usage and the previous discussions regarding `.element`. I needed to wrap the `device_span` because I need comparison operators for the cub reduce.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11322
This PR is needed for PR #11160 and another PR for writing list<int8>'s as byte arrays as well.

This PR will be the basis for the statistics written by #11160, but it should be noted that it will work due to an implementation detail. The union of the statistics types for strings and byte arrays is the same layout. This means the string statistics that are stored from the string column can be read as byte array statistics later when writing out the file and it will work simply because they can alias to each other.

Alternatives considered included changing the incoming table type in the case of strings being written as a byte array to be a byte array, which would only work because of the implementation detail that the columns are identical data layouts. Also considered was templating out multiple paths in [column_statistics.cuh:107](https://github.com/rapidsai/cudf/blob/e98feab966f6d1b9eba83323e851c955a1691865/cpp/src/io/statistics/column_statistics.cuh#L107) that we could use to branch based on the data type.

The latter seems more the correct way, but comes with a performance and complexity issue. The method used in the PR was mainly chosen due to simplicity and the existing reliance on the data formats being identical throughout the code. It seems unlikely these will ever decouple.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - MithunR (https://github.com/mythrocks)
  - https://github.com/nvdbaranec

URL: #11303
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
@etseidl etseidl merged commit 0b882ae into etseidl:feature/trunc_stats Jul 28, 2022
etseidl pushed a commit that referenced this pull request Jun 9, 2023
This implements stacktrace and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing the downstream application to trace back with much less effort.

Closes rapidsai#12422.

### Example:
```
#0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446
#1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113
#2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e
#3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88
#4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9
#5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027
#6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f
rapidsai#7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6
rapidsai#8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195
rapidsai#9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109
rapidsai#10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f
rapidsai#11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87
rapidsai#12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95
rapidsai#13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c
rapidsai#14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90
rapidsai#15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80
rapidsai#16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5
```

### Usage

In order to retrieve a stacktrace with fully human-readable symbols, some compiling options must be adjusted. To make such adjustment convenient and effortless, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Just set this option to `ON` before building cudf and it will be ready to use.

For downstream applications, whenever a cudf-type exception is thrown, it can retrieve the stored stacktrace and do whatever it wants with it. For example:
```
try {
  // cudf API calls
} catch (cudf::logic_error const& e) {
  std::cout << e.what() << std::endl;
  std::cout << e.stacktrace() << std::endl;
  throw e;
} 
// similar with catching other exception types
```

### Follow-up work

The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace.


### Note:
 * This feature doesn't require libcudf to be built in Debug mode.
 * The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, libcudf compiled with that flag turned on should be used only when needed, when debugging cudf throwing exceptions.
 * This flag removes the current optimization flag from compiling (such as `-O2` or `-O3`, if in Release mode) and replaces by `-Og` (optimize for debugging).
 * If this option is not set to `ON`, the stacktrace will not be available. This is to avoid expensive stracktrace retrieval if the throwing exception is expected.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#13298
etseidl pushed a commit that referenced this pull request Sep 27, 2023
Pin conda packages to `aws-sdk-cpp<1.11`. The recent upgrade in version `1.11.*` has caused several issues with cleaning up (more details on changes can be read in [this link](https://github.com/aws/aws-sdk-cpp#version-111-is-now-available)), leading to Distributed and Dask-CUDA processes to segfault. The stack for one of those crashes looks like the following:

```
(gdb) bt
#0  0x00007f5125359a0c in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level(aws_logger*, unsigned int) () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so
#1  0x00007f5124968f83 in aws_event_loop_thread () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#2  0x00007f5124ad9359 in thread_fn () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1
#3  0x00007f519958f6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f5198b1361f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```

Such segfaults now manifest frequently in CI, and in some cases are reproducible with a hit rate of ~30%. Given the approaching release time, it's probably the safest option to just pin to an older version of the package while we don't pinpoint the exact cause for the issue and a patched build is released upstream.

The `aws-sdk-cpp` is statically-linked in the `pyarrow` pip package, which prevents us from using the same pinning technique. cuDF is currently pinned to `pyarrow=12.0.1` which seems to be built against `aws-sdk-cpp=1.10.*`, as per [recent build logs](https://github.com/apache/arrow/actions/runs/6276453828/job/17046177335?pr=37792#step:6:1372).

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#14173
etseidl pushed a commit that referenced this pull request Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants