Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] When libcudf throws exception, it should also print out the stack trace #12422

Closed
ttnghia opened this issue Dec 19, 2022 · 10 comments · Fixed by #13298
Closed

[FEA] When libcudf throws exception, it should also print out the stack trace #12422

ttnghia opened this issue Dec 19, 2022 · 10 comments · Fixed by #13298
Assignees
Labels
feature request New feature or request

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Dec 19, 2022

Currently, when an exception happens, it will be thrown with an error message. All we can see from it is just something like this:

RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/include/cudf/utilities/type_dispatcher.hpp:531: Invalid type_id.

We have no idea which APIs have been called and cause that exception. It would be much much better if we can print out the stack trace like this:

 0# void throw_with_trace<std::logic_error>(std::logic_error const&) at ../example/throwing_st.cpp:22
 1# oops(int) at ../example/throwing_st.cpp:38
 2# bar(int) at ../example/throwing_st.cpp:54
 3# foo(int) at ../example/throwing_st.cpp:59
 4# bar(int) at ../example/throwing_st.cpp:49
 5# foo(int) at ../example/throwing_st.cpp:59
 6# main at ../example/throwing_st.cpp:76
 7# 0x00007FAC113BEF45 in /lib/x86_64-linux-gnu/libc.so.6
 8# 0x0000000000402ED9

Ref: https://www.boost.org/doc/libs/develop/doc/html/stacktrace/getting_started.html#stacktrace.getting_started.exceptions_with_stacktrace

We can adopt one of the very well-known opensource stack trace libraries such as https://github.com/boostorg/stacktrace or https://github.com/bombela/backward-cpp or write our own stack trace code. I believe that having the ability to print the call stack would significantly benefit our developers.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Dec 19, 2022
@vyasr
Copy link
Contributor

vyasr commented Dec 19, 2022

I'm -1 on using a Boost solution. We ripped it out intentionally in #7932, it's a large dependency and I wouldn't want to make RAPIDS any harder to build than it already is.

I'm not opposed to providing a stack trace in general, though. A simpler solution for this would be nice. FWIW I adapted a much more hacked together solution in the stream identification library: https://github.com/rapidsai/cudf/blob/branch-23.02/cpp/tests/utilities/identify_stream_usage/identify_stream_usage.cpp#L45.

@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 19, 2022

That's great. Should we extract that code out and make it a common utility function so CUDF_EXPECT can call it? I was afraid that implementing it ourselves would be difficult, didn't realize that we already have it.

@ttnghia ttnghia changed the title [FEA] When libcudf throws exception, also print out the stack trace [FEA] When libcudf throws exception, it should also print out the stack trace Dec 19, 2022
@bdice bdice removed the Needs Triage Need team to review and classify label Dec 20, 2022
@vyasr
Copy link
Contributor

vyasr commented Jan 4, 2023

Sure I'm OK with extracting that code. I have a weak preference for using a simple, well-maintained, header-only library that is better than my solution, but I don't know if it's worth the effort right now so moving my code out into somewhere that can be used more generally in libcudf is fine.

@davidwendt
Copy link
Contributor

There are over 1100 CUDF_EXPECTS calls in libcudf so I have a strong preference that this would not be a header-only solution.

@jlowe
Copy link
Contributor

jlowe commented Jan 4, 2023

The problem with libcudf logging an exception and then throwing it is that this is the log-and-throw anti-pattern. This usually means the exception gets logged at least twice, once by libcudf and another by the application, or worse, the application has some recovery logic and then we get some unwanted prints from libcudf.

This would be very bad for RMM, for example, as our application is performing recovery on many RMM exceptions and retrying. If libcudf printed a stacktrace when an output column is too big (i.e.: row count > maxint), that would be annoying output if the application were to catch, recover, and retry (e.g.: by splitting the input into smaller chunks and iterating).

@vyasr
Copy link
Contributor

vyasr commented Jan 9, 2023

There are over 1100 CUDF_EXPECTS calls in libcudf so I have a strong preference that this would not be a header-only solution.

Good point, I wasn't thinking about this too much. I'm less interested in it being header-only specifically, and more in it being lightweight. A single small header and source file would also be perfectly fine with me.

The problem with libcudf logging an exception and then throwing it is that this is the log-and-throw anti-pattern. This usually means the exception gets logged at least twice, once by libcudf and another by the application, or worse, the application has some recovery logic and then we get some unwanted prints from libcudf.

Perhaps I was misunderstanding @ttnghia's intent, but I don't think what I'm thinking of matches the antipattern you're referring to. We're not suggesting to do a try/catch/log/rethrow. These aren't exceptions coming from somewhere that we can't deal with. These are exceptional cases originating within cudf itself, and the suggestion is to simply provide a more informative exception message. We wouldn't print and then throw, we would concatenate the stack trace into the exception message and do a throw exception_type(reason_and_stack_trace). Does that address your concerns?

@ttnghia
Copy link
Contributor Author

ttnghia commented Apr 20, 2023

Today, again Spark team has to debug a subtle bug that only shows up in Dataproc + T4 GPU env. The bug can't be reproduced anywhere else, and only gives us a hint which doesn't help tracing back because it doesn't have stack trace:

ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-406-cuda11/thirdparty/cudf/cpp/src/io/utilities/hostdevice_vector.hpp:149: 700 cudaErrorIllegalAddress an illegal memory access was encountered
	at ai.rapids.cudf.Table.writeORCChunk(Native Method)
	at ai.rapids.cudf.Table.access$1400(Table.java:41)
	at ai.rapids.cudf.Table$ORCTableWriter.write(Table.java:1477)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatchWithRetry$5(ColumnarOutputWriter.scala:177)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatchWithRetry$5$adapted(ColumnarOutputWriter.scala:174)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)

I'm going to work on this ASAP. Given that cudf has a logging mechanism, we can log the stack trace without concatenating it into the thrown exception which should address the concern above.

@ttnghia ttnghia self-assigned this Apr 20, 2023
@jlowe
Copy link
Contributor

jlowe commented Apr 20, 2023

we can log the stack trace without concatenating it into the thrown exception

This is the log-then-throw anti-pattern I referred to above.

@ttnghia
Copy link
Contributor Author

ttnghia commented Apr 20, 2023

Yes, and I'm trying to think of a way to overcome that. For example, logging using a different log level (such as debug level) which is turned off and doesn't print anything by default. But when we want to debug something, we can turn it back on to chase the issue.

If you have other ideas then please let me know.

@jlowe
Copy link
Contributor

jlowe commented Apr 20, 2023

Putting the stacktrace into the exception is usually the preferred route, either putting it in the exception message or attached to the exception object as a separate field/method that can be accessed for more detail. Then if the exception is not normally recovered by the application, the application has all the information needed to log it (and logged via the way the application wants to log).

rapids-bot bot pushed a commit that referenced this issue Jun 9, 2023
This implements stacktrace and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing the downstream application to trace back with much less effort.

Closes #12422.

### Example:
```
#0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446
#1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113
#2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e
#3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88
#4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9
#5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027
#6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f
#7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6
#8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195
#9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109
#10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f
#11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87
#12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95
#13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c
#14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90
#15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80
#16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5
```

### Usage

In order to retrieve a stacktrace with fully human-readable symbols, some compiling options must be adjusted. To make such adjustment convenient and effortless, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Just set this option to `ON` before building cudf and it will be ready to use.

For downstream applications, whenever a cudf-type exception is thrown, it can retrieve the stored stacktrace and do whatever it wants with it. For example:
```
try {
  // cudf API calls
} catch (cudf::logic_error const& e) {
  std::cout << e.what() << std::endl;
  std::cout << e.stacktrace() << std::endl;
  throw e;
} 
// similar with catching other exception types
```

### Follow-up work

The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace.


### Note:
 * This feature doesn't require libcudf to be built in Debug mode.
 * The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, libcudf compiled with that flag turned on should be used only when needed, when debugging cudf throwing exceptions.
 * This flag removes the current optimization flag from compiling (such as `-O2` or `-O3`, if in Release mode) and replaces by `-Og` (optimize for debugging).
 * If this option is not set to `ON`, the stacktrace will not be available. This is to avoid expensive stracktrace retrieval if the throwing exception is expected.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jason Lowe (https://github.com/jlowe)

URL: #13298
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants