[FEA] When libcudf throws exception, it should also print out the stack trace #12422

ttnghia · 2022-12-19T22:04:15Z

Currently, when an exception happens, it will be thrown with an error message. All we can see from it is just something like this:

RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/include/cudf/utilities/type_dispatcher.hpp:531: Invalid type_id.

We have no idea which APIs have been called and cause that exception. It would be much much better if we can print out the stack trace like this:

 0# void throw_with_trace<std::logic_error>(std::logic_error const&) at ../example/throwing_st.cpp:22
 1# oops(int) at ../example/throwing_st.cpp:38
 2# bar(int) at ../example/throwing_st.cpp:54
 3# foo(int) at ../example/throwing_st.cpp:59
 4# bar(int) at ../example/throwing_st.cpp:49
 5# foo(int) at ../example/throwing_st.cpp:59
 6# main at ../example/throwing_st.cpp:76
 7# 0x00007FAC113BEF45 in /lib/x86_64-linux-gnu/libc.so.6
 8# 0x0000000000402ED9

Ref: https://www.boost.org/doc/libs/develop/doc/html/stacktrace/getting_started.html#stacktrace.getting_started.exceptions_with_stacktrace

We can adopt one of the very well-known opensource stack trace libraries such as https://github.com/boostorg/stacktrace or https://github.com/bombela/backward-cpp or write our own stack trace code. I believe that having the ability to print the call stack would significantly benefit our developers.

The text was updated successfully, but these errors were encountered:

vyasr · 2022-12-19T22:55:52Z

I'm -1 on using a Boost solution. We ripped it out intentionally in #7932, it's a large dependency and I wouldn't want to make RAPIDS any harder to build than it already is.

I'm not opposed to providing a stack trace in general, though. A simpler solution for this would be nice. FWIW I adapted a much more hacked together solution in the stream identification library: https://github.com/rapidsai/cudf/blob/branch-23.02/cpp/tests/utilities/identify_stream_usage/identify_stream_usage.cpp#L45.

ttnghia · 2022-12-19T23:07:30Z

That's great. Should we extract that code out and make it a common utility function so CUDF_EXPECT can call it? I was afraid that implementing it ourselves would be difficult, didn't realize that we already have it.

vyasr · 2023-01-04T19:09:07Z

Sure I'm OK with extracting that code. I have a weak preference for using a simple, well-maintained, header-only library that is better than my solution, but I don't know if it's worth the effort right now so moving my code out into somewhere that can be used more generally in libcudf is fine.

davidwendt · 2023-01-04T20:11:24Z

There are over 1100 CUDF_EXPECTS calls in libcudf so I have a strong preference that this would not be a header-only solution.

jlowe · 2023-01-04T20:30:24Z

The problem with libcudf logging an exception and then throwing it is that this is the log-and-throw anti-pattern. This usually means the exception gets logged at least twice, once by libcudf and another by the application, or worse, the application has some recovery logic and then we get some unwanted prints from libcudf.

This would be very bad for RMM, for example, as our application is performing recovery on many RMM exceptions and retrying. If libcudf printed a stacktrace when an output column is too big (i.e.: row count > maxint), that would be annoying output if the application were to catch, recover, and retry (e.g.: by splitting the input into smaller chunks and iterating).

vyasr · 2023-01-09T19:12:35Z

There are over 1100 CUDF_EXPECTS calls in libcudf so I have a strong preference that this would not be a header-only solution.

Good point, I wasn't thinking about this too much. I'm less interested in it being header-only specifically, and more in it being lightweight. A single small header and source file would also be perfectly fine with me.

The problem with libcudf logging an exception and then throwing it is that this is the log-and-throw anti-pattern. This usually means the exception gets logged at least twice, once by libcudf and another by the application, or worse, the application has some recovery logic and then we get some unwanted prints from libcudf.

Perhaps I was misunderstanding @ttnghia's intent, but I don't think what I'm thinking of matches the antipattern you're referring to. We're not suggesting to do a try/catch/log/rethrow. These aren't exceptions coming from somewhere that we can't deal with. These are exceptional cases originating within cudf itself, and the suggestion is to simply provide a more informative exception message. We wouldn't print and then throw, we would concatenate the stack trace into the exception message and do a throw exception_type(reason_and_stack_trace). Does that address your concerns?

ttnghia · 2023-04-20T19:37:16Z

Today, again Spark team has to debug a subtle bug that only shows up in Dataproc + T4 GPU env. The bug can't be reproduced anywhere else, and only gives us a hint which doesn't help tracing back because it doesn't have stack trace:

ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-406-cuda11/thirdparty/cudf/cpp/src/io/utilities/hostdevice_vector.hpp:149: 700 cudaErrorIllegalAddress an illegal memory access was encountered
	at ai.rapids.cudf.Table.writeORCChunk(Native Method)
	at ai.rapids.cudf.Table.access$1400(Table.java:41)
	at ai.rapids.cudf.Table$ORCTableWriter.write(Table.java:1477)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatchWithRetry$5(ColumnarOutputWriter.scala:177)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatchWithRetry$5$adapted(ColumnarOutputWriter.scala:174)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)

I'm going to work on this ASAP. Given that cudf has a logging mechanism, we can log the stack trace without concatenating it into the thrown exception which should address the concern above.

jlowe · 2023-04-20T19:55:10Z

we can log the stack trace without concatenating it into the thrown exception

This is the log-then-throw anti-pattern I referred to above.

ttnghia · 2023-04-20T19:57:47Z

Yes, and I'm trying to think of a way to overcome that. For example, logging using a different log level (such as debug level) which is turned off and doesn't print anything by default. But when we want to debug something, we can turn it back on to chase the issue.

If you have other ideas then please let me know.

jlowe · 2023-04-20T20:02:08Z

Putting the stacktrace into the exception is usually the preferred route, either putting it in the exception message or attached to the exception object as a separate field/method that can be accessed for more detail. Then if the exception is not normally recovered by the application, the application has all the information needed to log it (and logged via the way the application wants to log).

This implements stacktrace and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing the downstream application to trace back with much less effort. Closes #12422. ### Example: ``` #0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446 #1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113 #2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e #3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88 #4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9 #5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027 #6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f #7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6 #8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195 #9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109 #10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f #11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87 #12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95 #13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c #14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90 #15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80 #16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5 ``` ### Usage In order to retrieve a stacktrace with fully human-readable symbols, some compiling options must be adjusted. To make such adjustment convenient and effortless, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Just set this option to `ON` before building cudf and it will be ready to use. For downstream applications, whenever a cudf-type exception is thrown, it can retrieve the stored stacktrace and do whatever it wants with it. For example: ``` try { // cudf API calls } catch (cudf::logic_error const& e) { std::cout << e.what() << std::endl; std::cout << e.stacktrace() << std::endl; throw e; } // similar with catching other exception types ``` ### Follow-up work The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace. ### Note: * This feature doesn't require libcudf to be built in Debug mode. * The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, libcudf compiled with that flag turned on should be used only when needed, when debugging cudf throwing exceptions. * This flag removes the current optimization flag from compiling (such as `-O2` or `-O3`, if in Release mode) and replaces by `-Og` (optimize for debugging). * If this option is not set to `ON`, the stacktrace will not be available. This is to avoid expensive stracktrace retrieval if the throwing exception is expected. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Robert Maynard (https://github.com/robertmaynard) - Vyas Ramasubramani (https://github.com/vyasr) - Jason Lowe (https://github.com/jlowe) URL: #13298

ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Dec 19, 2022

ttnghia changed the title ~~[FEA] When libcudf throws exception, also print out the stack trace~~ [FEA] When libcudf throws exception, it should also print out the stack trace Dec 19, 2022

bdice removed the Needs Triage Need team to review and classify label Dec 20, 2022

ttnghia self-assigned this Apr 20, 2023

ttnghia mentioned this issue May 6, 2023

Add stacktrace into cudf exception types #13298

Merged

rapids-bot bot closed this as completed in #13298 Jun 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] When libcudf throws exception, it should also print out the stack trace #12422

[FEA] When libcudf throws exception, it should also print out the stack trace #12422

ttnghia commented Dec 19, 2022 •

edited

Loading

vyasr commented Dec 19, 2022

ttnghia commented Dec 19, 2022

vyasr commented Jan 4, 2023

davidwendt commented Jan 4, 2023

jlowe commented Jan 4, 2023

vyasr commented Jan 9, 2023

ttnghia commented Apr 20, 2023 •

edited

Loading

jlowe commented Apr 20, 2023

ttnghia commented Apr 20, 2023 •

edited

Loading

jlowe commented Apr 20, 2023

[FEA] When libcudf throws exception, it should also print out the stack trace #12422

[FEA] When libcudf throws exception, it should also print out the stack trace #12422

Comments

ttnghia commented Dec 19, 2022 • edited Loading

vyasr commented Dec 19, 2022

ttnghia commented Dec 19, 2022

vyasr commented Jan 4, 2023

davidwendt commented Jan 4, 2023

jlowe commented Jan 4, 2023

vyasr commented Jan 9, 2023

ttnghia commented Apr 20, 2023 • edited Loading

jlowe commented Apr 20, 2023

ttnghia commented Apr 20, 2023 • edited Loading

jlowe commented Apr 20, 2023

ttnghia commented Dec 19, 2022 •

edited

Loading

ttnghia commented Apr 20, 2023 •

edited

Loading

ttnghia commented Apr 20, 2023 •

edited

Loading