-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] When libcudf throws exception, it should also print out the stack trace #12422
Comments
I'm -1 on using a Boost solution. We ripped it out intentionally in #7932, it's a large dependency and I wouldn't want to make RAPIDS any harder to build than it already is. I'm not opposed to providing a stack trace in general, though. A simpler solution for this would be nice. FWIW I adapted a much more hacked together solution in the stream identification library: https://github.com/rapidsai/cudf/blob/branch-23.02/cpp/tests/utilities/identify_stream_usage/identify_stream_usage.cpp#L45. |
That's great. Should we extract that code out and make it a common utility function so |
Sure I'm OK with extracting that code. I have a weak preference for using a simple, well-maintained, header-only library that is better than my solution, but I don't know if it's worth the effort right now so moving my code out into somewhere that can be used more generally in libcudf is fine. |
There are over 1100 |
The problem with libcudf logging an exception and then throwing it is that this is the log-and-throw anti-pattern. This usually means the exception gets logged at least twice, once by libcudf and another by the application, or worse, the application has some recovery logic and then we get some unwanted prints from libcudf. This would be very bad for RMM, for example, as our application is performing recovery on many RMM exceptions and retrying. If libcudf printed a stacktrace when an output column is too big (i.e.: row count > maxint), that would be annoying output if the application were to catch, recover, and retry (e.g.: by splitting the input into smaller chunks and iterating). |
Good point, I wasn't thinking about this too much. I'm less interested in it being header-only specifically, and more in it being lightweight. A single small header and source file would also be perfectly fine with me.
Perhaps I was misunderstanding @ttnghia's intent, but I don't think what I'm thinking of matches the antipattern you're referring to. We're not suggesting to do a try/catch/log/rethrow. These aren't exceptions coming from somewhere that we can't deal with. These are exceptional cases originating within cudf itself, and the suggestion is to simply provide a more informative exception message. We wouldn't print and then throw, we would concatenate the stack trace into the exception message and do a |
Today, again Spark team has to debug a subtle bug that only shows up in Dataproc + T4 GPU env. The bug can't be reproduced anywhere else, and only gives us a hint which doesn't help tracing back because it doesn't have stack trace:
I'm going to work on this ASAP. Given that cudf has a logging mechanism, we can log the stack trace without concatenating it into the thrown exception which should address the concern above. |
This is the log-then-throw anti-pattern I referred to above. |
Yes, and I'm trying to think of a way to overcome that. For example, logging using a different log level (such as If you have other ideas then please let me know. |
Putting the stacktrace into the exception is usually the preferred route, either putting it in the exception message or attached to the exception object as a separate field/method that can be accessed for more detail. Then if the exception is not normally recovered by the application, the application has all the information needed to log it (and logged via the way the application wants to log). |
This implements stacktrace and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing the downstream application to trace back with much less effort. Closes #12422. ### Example: ``` #0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446 #1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113 #2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e #3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88 #4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9 #5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027 #6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f #7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6 #8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195 #9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109 #10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f #11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87 #12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95 #13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c #14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90 #15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80 #16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5 ``` ### Usage In order to retrieve a stacktrace with fully human-readable symbols, some compiling options must be adjusted. To make such adjustment convenient and effortless, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Just set this option to `ON` before building cudf and it will be ready to use. For downstream applications, whenever a cudf-type exception is thrown, it can retrieve the stored stacktrace and do whatever it wants with it. For example: ``` try { // cudf API calls } catch (cudf::logic_error const& e) { std::cout << e.what() << std::endl; std::cout << e.stacktrace() << std::endl; throw e; } // similar with catching other exception types ``` ### Follow-up work The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace. ### Note: * This feature doesn't require libcudf to be built in Debug mode. * The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, libcudf compiled with that flag turned on should be used only when needed, when debugging cudf throwing exceptions. * This flag removes the current optimization flag from compiling (such as `-O2` or `-O3`, if in Release mode) and replaces by `-Og` (optimize for debugging). * If this option is not set to `ON`, the stacktrace will not be available. This is to avoid expensive stracktrace retrieval if the throwing exception is expected. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Robert Maynard (https://github.com/robertmaynard) - Vyas Ramasubramani (https://github.com/vyasr) - Jason Lowe (https://github.com/jlowe) URL: #13298
Currently, when an exception happens, it will be thrown with an error message. All we can see from it is just something like this:
We have no idea which APIs have been called and cause that exception. It would be much much better if we can print out the stack trace like this:
Ref: https://www.boost.org/doc/libs/develop/doc/html/stacktrace/getting_started.html#stacktrace.getting_started.exceptions_with_stacktrace
We can adopt one of the very well-known opensource stack trace libraries such as https://github.com/boostorg/stacktrace or https://github.com/bombela/backward-cpp or write our own stack trace code. I believe that having the ability to print the call stack would significantly benefit our developers.
The text was updated successfully, but these errors were encountered: