-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add error check utilities #15
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to put the basic assertions in here. But putting assertions related to curand/cusparse/nccl in the same header file means that we'll introduce dependencies on these even if the including source doesn't use these components. I'd highly prefer if we could separate these out.
cpp/include/raft/error.hpp
Outdated
// FIXME: unnecessary once CUDA 10.1+ becomes the minimum supported version | ||
#define _CUSPARSE_ERR_TO_STR(err) \ | ||
case err: \ | ||
return #err; | ||
inline auto cusparse_error_to_string(cusparseStatus_t err) -> const char* { | ||
#if defined(CUDART_VERSION) && CUDART_VERSION >= 10100 | ||
return cusparseGetErrorString(status); | ||
#else // CUDART_VERSION | ||
switch (err) { | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_SUCCESS); | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_NOT_INITIALIZED); | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_ALLOC_FAILED); | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_INVALID_VALUE); | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_ARCH_MISMATCH); | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_EXECUTION_FAILED); | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_INTERNAL_ERROR); | ||
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED); | ||
default: | ||
return "CUSPARSE_STATUS_UNKNOWN"; | ||
}; | ||
#endif // CUDART_VERSION | ||
} | ||
#undef _CUSPARSE_ERR_TO_STR | ||
|
||
inline void throw_cusparse_error(cusparseStatus_t error, const char* file, | ||
unsigned int line) { | ||
throw raft::cusparse_error( | ||
std::string{"cuSparse error encountered at: " + std::string{file} + ":" + | ||
std::to_string(line) + ": " + std::to_string(error) + " " + | ||
cusparse_error_to_string(error)}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have this check already defined in https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cusolver_wrappers.h, better to update the existing code there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to put the basic assertions in here. But putting assertions related to curand/cusparse/nccl in the same header file means that we'll introduce dependencies on these even if the including source doesn't use these components. I'd highly prefer if we could separate these out.
Yeah... so the pros is to keep the related code in one place so it gets easy to enforce consistency in error handling, but the problem you mentioned is also valid (especially NCCL can be problematic if we just compile for single-GPU).
I'm fine with either approach as long as we are consistent and can maintain consistency.
I will move
CUDA related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/cudart_utils.h
cuSparse related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/sparse/cusparse_wrappers.h
NCCL related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/comms/std_comms.hpp
I should addd CUBLAS_TRY and CUSOLVER_TRY to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cublas_wrappers.h and https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cusolver_wrappers.h
I may drop CURAND related ones for now, but they should be added back when we have curand_wrappers.hpp
And are you guys using XXX_CHECK in RAFT from cuML? If I replace them with XXX_TRY, will this break cuML?
And any thoughts about XXX_TRY vs XXX_CHECK? cuDF is using XXX_TRY, so I am just following cuDF convention but this is undesirable for cuML, I am open to discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for CUBLAS_TRY and CUSOLVER_TRY they will be needed in #12
I'm fine with XXX_TRY for consistency with cuDF. It also captures the underlying exception mechanism a bit better in the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seunghwak sorry for the delayed response.
Yes, we are using the *_CHECK macros everywhere inside cuML. So, replacing those will certainly break our codebase. Moreover, the C-style '%' modifiers provides more readable code than the '<<' style C++ syntax (purely my opinion).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, better keep both *_TRY and *_CHECK in the meantime (better pick one eventually).
There can be a long debate in printf-ish vs cout-ish (e.g. https://stackoverflow.com/questions/2872543/printf-vs-cout-in-c).
My biggest concern was the type safety (in the above link) but it seems like most host compilers generates a warning for this and we are OK unless we ignore warnings. This wasn't a case with nvcc and leads to a weird runtime behavior but this is irrelevant here as this is a host side error checking mechanism. We may better keep the C-ish style (and add C++ style in the future if necessary).
After some more investigation, RAFT's exception throwing mechanism raft/cpp/include/raft/cudart_utils.h Line 84 in f48552e
provides additional stack trace compare to the one I brought from cuDF (https://github.com/rapidsai/raft/pull/15/files#diff-98265352f57cb794e805742f18ff96efR37) and this information can be valuable so I think it is better to preserve this mechanism (but I think it is better to create separate exception classes like logic_error or cuda_error inheriting raft::exception similar to cuDF). I think we should deprecate (and eventually remove) direct use of ASSERT macro (better use RAFT(CUML, or CUGRAPH)_EXPECTS) to be more consistent with the rest of the RAPIDS. And we should also deprecate XXX_CHECK (and eventually remove) and use XXX_TRY instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on board with putting the separate TRY
macros in their respective wrapper headers but I think we should consider the perspective of assuming RAFT prims only know about RAFT and not CUML or CUGRAPH.
OK, makes sense, the initial intention of having separate CUGRAPH_EXPECTS and CUML_EXPECTS was to use them in cuGraph and cuML without the need to redefine error handling macros there (instead of using them inside RAFT, within RAFT, better use RAFT_EXPECTS), but we may better avoid cuGraph & cuML specific stuff to RAFT. |
Hey guys,
I'd love to hear your thoughts about this. |
I believe this PR also addresses #23 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nitpicks.
@teju85 I addressed all your last comments :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM! Thanks @seunghwak
Demangle the error stack trace provided by GCC. Example output: ```bash RAFT failure at file=/workspace/raft/cpp/bench/ann/src/raft/raft_ann_bench_utils.h line=127: Ooops! Obtained 16 stack frames #1 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::logic_error::logic_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5e [0x7fb20acce45e] #2 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::configured_raft_resources::stream_wait(CUstream_st*) const +0x2e3 [0x7fb20acd0ac3] #3 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::RaftIvfPQ<float, long>::search(float const*, int, int, unsigned long*, float*, CUstream_st*) const +0x63e [0x7fb20acd44fe] #4 in ./cpp/build/ANN_BENCH: void raft::bench::ann::bench_search<float>(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective) +0xf76 [0x55853859f586] #5 in ./cpp/build/ANN_BENCH: benchmark::internal::LambdaBenchmark<benchmark::RegisterBenchmark<void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&)::{lambda(benchmark::State&)#1}>::Run(benchmark::State&) +0x84 [0x558538548f14] #6 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const +0x168 [0x5585385d6498] #7 in ./cpp/build/ANN_BENCH(+0x149108) [0x5585385b7108] #8 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoNIterations() +0x34f [0x5585385b8c7f] #9 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoOneRepetition() +0x119 [0x5585385b99b9] #10 in ./cpp/build/ANN_BENCH(+0x13afdd) [0x5585385a8fdd] #11 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) +0x58e [0x5585385aa8fe] #12 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks() +0x6a [0x5585385aaada] #13 in ./cpp/build/ANN_BENCH: raft::bench::ann::run_main(int, char**) +0x11ed [0x5585385980cd] #14 in /lib/x86_64-linux-gnu/libc.so.6(+0x28150) [0x7fb213e28150] #15 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x89 [0x7fb213e28209] #16 in ./cpp/build/ANN_BENCH(+0xbfcef) [0x55853852dcef] ``` Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: #2188
Close #11, #23
This PR copy-and-paste-and-modify cuDF's https://github.com/rapidsai/cudf/blob/branch-0.15/cpp/include/cudf/utilities/error.hpp
Currently supports
RAFT_EXPECTS, RAFT_FAIL, CUML_EXPECTS, CUML_FAIL, CUGRAPH_EXPECTS, CUGRAPH_FAIL, CUDA_TRY, CURAND_TRY, CUSPARSE_TRY, and NCCL_TRY.
Anything else to add?