[REVIEW] Add error check utilities #15

seunghwak · 2020-06-03T15:37:27Z

This PR copies and adapts cuDF's error.hpp: https://github.com/rapidsai/cudf/blob/branch-0.15/cpp/include/cudf/utilities/error.hpp

Currently supports

RAFT_EXPECTS, RAFT_FAIL, CUML_EXPECTS, CUML_FAIL, CUGRAPH_EXPECTS, CUGRAPH_FAIL, CUDA_TRY, CURAND_TRY, CUSPARSE_TRY, and NCCL_TRY.

Anything else to add?

teju85

It makes sense to put the basic assertions in here. But putting assertions related to curand/cusparse/nccl in the same header file means that we'll introduce dependencies on these even if the including source doesn't use these components. I'd highly prefer if we could separate these out.

teju85 · 2020-06-04T02:10:43Z

cpp/include/raft/error.hpp

+// FIXME: unnecessary once CUDA 10.1+ becomes the minimum supported version
+#define _CUSPARSE_ERR_TO_STR(err) \
+  case err:                       \
+    return #err;
+inline auto cusparse_error_to_string(cusparseStatus_t err) -> const char* {
+#if defined(CUDART_VERSION) && CUDART_VERSION >= 10100
+  return cusparseGetErrorString(err);
+#else   // CUDART_VERSION
+  switch (err) {
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_SUCCESS);
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_NOT_INITIALIZED);
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_ALLOC_FAILED);
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_INVALID_VALUE);
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_ARCH_MISMATCH);
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_EXECUTION_FAILED);
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_INTERNAL_ERROR);
+    _CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED);
+    default:
+      return "CUSPARSE_STATUS_UNKNOWN";
+  };
+#endif  // CUDART_VERSION
+}
+#undef _CUSPARSE_ERR_TO_STR
+
+inline void throw_cusparse_error(cusparseStatus_t error, const char* file,
+                                 unsigned int line) {
+  throw raft::cusparse_error(
+    std::string{"cuSparse error encountered at: " + std::string{file} + ":" +
+                std::to_string(line) + ": " + std::to_string(error) + " " +
+                cusparse_error_to_string(error)});
+}


We already have this check defined in https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cusolver_wrappers.h; would it be better to update the existing code there?

It makes sense to put the basic assertions in here. But putting assertions related to curand/cusparse/nccl in the same header file means that we'll introduce dependencies on these even if the including source doesn't use these components. I'd highly prefer if we could separate these out.

Yeah... the advantage is that keeping the related code in one place makes it easy to enforce consistency in error handling, but the problem you mentioned is also valid (NCCL in particular can be problematic if we compile only for single-GPU).

I'm fine with either approach as long as we are consistent and can maintain consistency.

I will move

CUDA related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/cudart_utils.h

cuSparse related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/sparse/cusparse_wrappers.h

NCCL related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/comms/std_comms.hpp

I should add CUBLAS_TRY and CUSOLVER_TRY to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cublas_wrappers.h and https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cusolver_wrappers.h

I may drop CURAND related ones for now, but they should be added back when we have curand_wrappers.hpp
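As a rough illustration of the per-library split described above, here is a minimal sketch of what one wrapper header could contain. Everything in it is hypothetical: `demo_status_t`, `demolib_error`, `DEMOLIB_TRY`, and `demo_alloc` stand in for a real library's status type, exception class, TRY macro, and API call, so that the example compiles without CUDA. It is not the code that was merged.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library status type such as cusparseStatus_t.
enum demo_status_t { DEMO_STATUS_SUCCESS = 0, DEMO_STATUS_ALLOC_FAILED = 2 };

namespace demo {

// Dedicated exception type so callers can catch failures of this one library.
struct demolib_error : public std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Map a status code to its enumerator name via stringification.
#define DEMO_ERR_TO_STR(err) \
  case err:                  \
    return #err;
inline const char* demolib_error_to_string(demo_status_t err) {
  switch (err) {
    DEMO_ERR_TO_STR(DEMO_STATUS_SUCCESS)
    DEMO_ERR_TO_STR(DEMO_STATUS_ALLOC_FAILED)
    default:
      return "DEMO_STATUS_UNKNOWN";
  }
}
#undef DEMO_ERR_TO_STR

// Build the message (location, numeric code, code name) and throw.
inline void throw_demolib_error(demo_status_t error, const char* file, unsigned int line) {
  throw demolib_error(std::string{"demolib error encountered at: "} + file + ":" +
                      std::to_string(line) + ": " +
                      std::to_string(static_cast<int>(error)) + " " +
                      demolib_error_to_string(error));
}

}  // namespace demo

// The TRY macro lives next to the wrappers of the library it checks, so only
// translation units that include this header depend on that library.
#define DEMOLIB_TRY(call)                                     \
  do {                                                        \
    demo_status_t const status_ = (call);                     \
    if (status_ != DEMO_STATUS_SUCCESS) {                     \
      demo::throw_demolib_error(status_, __FILE__, __LINE__); \
    }                                                         \
  } while (0)

// Hypothetical library call used only to exercise the macro.
inline demo_status_t demo_alloc(bool ok) {
  return ok ? DEMO_STATUS_SUCCESS : DEMO_STATUS_ALLOC_FAILED;
}
```

With this layout, `DEMOLIB_TRY(demo_alloc(true));` is a no-op, while `DEMOLIB_TRY(demo_alloc(false));` throws a `demo::demolib_error` whose message names the file, line, and `DEMO_STATUS_ALLOC_FAILED`.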

And are you guys using XXX_CHECK in RAFT from cuML? If I replace them with XXX_TRY, will this break cuML?

And any thoughts on XXX_TRY vs XXX_CHECK? cuDF uses XXX_TRY, so I am just following the cuDF convention, but if this is undesirable for cuML, I am open to discussion.

+1 for CUBLAS_TRY and CUSOLVER_TRY; they will be needed in #12.

I'm fine with XXX_TRY for consistency with cuDF. It also captures the underlying exception mechanism a bit better in the name.

@seunghwak sorry for the delayed response.

Yes, we are using the *_CHECK macros everywhere inside cuML. So, replacing those will certainly break our codebase. Moreover, the C-style '%' modifiers provide more readable code than the '<<' style C++ syntax (purely my opinion).

OK, better to keep both *_TRY and *_CHECK in the meantime (and pick one eventually).

There can be a long debate about printf-ish vs cout-ish style (e.g. https://stackoverflow.com/questions/2872543/printf-vs-cout-in-c).

My biggest concern was type safety (see the link above), but it seems that most host compilers generate a warning for this, so we are OK unless we ignore warnings. This wasn't the case with nvcc, which led to weird runtime behavior, but that is irrelevant here since this is a host-side error checking mechanism. We had better keep the C-ish style (and add a C++ style in the future if necessary).
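The compiler warning mentioned above can be requested explicitly for a custom formatting helper. The sketch below assumes GCC or Clang, whose `format(printf, ...)` attribute makes the compiler check the variadic arguments against the format string; `DEMO_PRINTF_LIKE` and `format_message` are hypothetical names and are not part of RAFT.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// On GCC/Clang, ask the compiler to type-check the variadic arguments
// against the printf-style format string; elsewhere this expands to nothing.
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_PRINTF_LIKE(fmt_idx, first_arg) \
  __attribute__((format(printf, fmt_idx, first_arg)))
#else
#define DEMO_PRINTF_LIKE(fmt_idx, first_arg)
#endif

// Hypothetical helper: formats like printf but returns a std::string.
DEMO_PRINTF_LIKE(1, 2)
inline std::string format_message(const char* fmt, ...) {
  va_list args;
  va_start(args, fmt);

  // First pass on a copy of the argument list: compute the required length.
  va_list args_copy;
  va_copy(args_copy, args);
  int const len = std::vsnprintf(nullptr, 0, fmt, args_copy);
  va_end(args_copy);
  if (len < 0) {
    va_end(args);
    return std::string{};
  }

  // Second pass: write into a buffer with room for the terminating '\0'.
  std::vector<char> buf(static_cast<std::size_t>(len) + 1);
  std::vsnprintf(buf.data(), buf.size(), fmt, args);
  va_end(args);
  return std::string(buf.data(), static_cast<std::size_t>(len));
}
```

With the attribute in place, a mismatched call such as `format_message("%s", 42)` should produce a `-Wformat` diagnostic on GCC and Clang at compile time instead of misbehaving at run time.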

cpp/include/raft/error.hpp

seunghwak · 2020-06-05T18:08:05Z

After some more investigation,

RAFT's exception throwing mechanism

raft/cpp/include/raft/cudart_utils.h

Line 84 in f48552e

#define THROW(fmt, ...)                                                        \

provides an additional stack trace compared to the one I brought from cuDF (https://github.com/rapidsai/raft/pull/15/files#diff-98265352f57cb794e805742f18ff96efR37)

and this information can be valuable, so I think it is better to preserve this mechanism (but I also think it is better to create separate exception classes like logic_error or cuda_error inheriting from raft::exception, similar to cuDF).
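A minimal sketch of the class layout proposed here, assuming a common base class that owns the message and derived classes that only add a type to catch on. The `demo` namespace and the helper functions are hypothetical; the real `raft::exception` additionally collects a stack trace, which this sketch leaves out.

```cpp
#include <exception>
#include <string>
#include <utility>

namespace demo {

// Common base: owns the message. A real implementation could append a stack
// trace to the message here, so every derived error gets it for free.
class exception : public std::exception {
 public:
  explicit exception(std::string msg) : msg_(std::move(msg)) {}
  const char* what() const noexcept override { return msg_.c_str(); }

 private:
  std::string msg_;
};

// Derived types only distinguish the failure category.
struct logic_error : public exception {
  using exception::exception;
};
struct cuda_error : public exception {
  using exception::exception;
};

// Hypothetical failing operations used to exercise the hierarchy.
inline void bad_argument() { throw logic_error("n must be positive"); }
inline void failed_copy() { throw cuda_error("copy failed"); }

// Callers can handle one category specifically and fall back to the base.
inline std::string describe(void (*fn)()) {
  try {
    fn();
  } catch (logic_error const& e) {
    return std::string{"logic: "} + e.what();
  } catch (exception const& e) {
    return std::string{"other: "} + e.what();
  }
  return "ok";
}

}  // namespace demo
```

Here `demo::describe(demo::bad_argument)` takes the `logic_error` branch, while `demo::describe(demo::failed_copy)` falls through to the base-class handler.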

I think we should deprecate (and eventually remove) direct use of the ASSERT macro (better to use RAFT_EXPECTS, CUML_EXPECTS, or CUGRAPH_EXPECTS) to be more consistent with the rest of RAPIDS.

And we should also deprecate (and eventually remove) XXX_CHECK and use XXX_TRY instead.
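For readers unfamiliar with the cuDF-style macros being proposed, the following is a sketch of an EXPECTS/FAIL pair in that style, where the reason must be a string literal so that it can be concatenated with the location at compile time. `DEMO_EXPECTS`, `DEMO_FAIL`, and `checked_divide` are hypothetical names, and `std::logic_error` stands in for a project-specific exception class.

```cpp
#include <stdexcept>
#include <string>

// Two-step stringification so that __LINE__ is expanded before being quoted.
#define DEMO_STRINGIFY_DETAIL(x) #x
#define DEMO_STRINGIFY(x) DEMO_STRINGIFY_DETAIL(x)

// Throw if the condition does not hold. `reason` must be a string literal,
// because it is joined to the location by adjacent-literal concatenation.
#define DEMO_EXPECTS(cond, reason)                                      \
  do {                                                                  \
    if (!(cond)) {                                                      \
      throw std::logic_error("DEMO failure at: " __FILE__               \
                             ":" DEMO_STRINGIFY(__LINE__) ": " reason); \
    }                                                                   \
  } while (0)

// Unconditional failure, e.g. for code paths that should be unreachable.
#define DEMO_FAIL(reason)                                             \
  do {                                                                \
    throw std::logic_error("DEMO failure at: " __FILE__               \
                           ":" DEMO_STRINGIFY(__LINE__) ": " reason); \
  } while (0)

// Hypothetical function showing a precondition check.
inline int checked_divide(int a, int b) {
  DEMO_EXPECTS(b != 0, "divisor must be non-zero");
  return a / b;
}
```

Calling `checked_divide(1, 0)` throws a `std::logic_error` whose message contains the source location followed by "divisor must be non-zero". The string-literal restriction is the main difference from a printf-style macro, which can embed runtime values in the message.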

@teju85 @cjnolet @dantegd @afender What do you guys think?

cjnolet

I'm on board with putting the separate TRY macros in their respective wrapper headers but I think we should consider the perspective of assuming RAFT prims only know about RAFT and not CUML or CUGRAPH.

cpp/include/raft/error.hpp

seunghwak · 2020-06-09T01:55:03Z

I'm on board with putting the separate TRY macros in their respective wrapper headers but I think we should consider the perspective of assuming RAFT prims only know about RAFT and not CUML or CUGRAPH.

OK, makes sense. The initial intention of having separate CUGRAPH_EXPECTS and CUML_EXPECTS was to use them in cuGraph and cuML without needing to redefine error handling macros there (not to use them inside RAFT; within RAFT, it is better to use RAFT_EXPECTS). But it may be better to avoid adding cuGraph- and cuML-specific things to RAFT.

teju85 · 2020-06-09T02:25:11Z

Hey guys,
I'm a bit undecided about the *_CHECK vs *_TRY (or *_EXPECTS) macros. Let me list out my thoughts:

The *_CHECK, THROW and ASSERT macros are used pretty much throughout the cuML codebase. So, removing them will create a lot of breakages and require widespread changes. This is probably just a one-time thing and I'm, maybe, overthinking? Would appreciate @JohnZed's inputs here.
This is an important one, IMO. All these cuML macros provide a nice C-style format-specifier syntax. This helps us customize error messages while keeping the code short and readable. Personally, I'd even go so far as to say that C-style format-string messages are far more readable than the C++-style '<<' syntax.
There are C++ libraries like 'fmt' that provide format specifiers, but using an external library just to format our messages, when a simple C-style approach would work, seems like overkill to me.
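To make the readability comparison concrete, here is the same message built both ways. The function names and the message text are made up for illustration and do not come from cuML or RAFT.

```cpp
#include <cstdio>
#include <iomanip>
#include <sstream>
#include <string>

// C-style: the whole message layout is visible in one format string.
inline std::string message_printf_style(int row, double value) {
  char buf[128];
  std::snprintf(buf, sizeof(buf), "row %d has invalid value %.2f", row, value);
  return std::string{buf};
}

// C++ stream style: type-safe by construction, but the layout is spread
// across the chain of << operators and manipulators.
inline std::string message_stream_style(int row, double value) {
  std::ostringstream os;
  os << "row " << row << " has invalid value " << std::fixed
     << std::setprecision(2) << value;
  return os.str();
}
```

Both functions return the same string for the same inputs, for example "row 7 has invalid value 1.50" for `(7, 1.5)`; the trade-off is between a compact format string and compile-time type safety without relying on compiler warnings.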

I'd love to hear your thoughts about this.

afender · 2020-06-15T16:02:53Z

I believe this PR also addresses #23

teju85

Some minor nitpicks.

cpp/include/raft/error.hpp

cpp/include/raft/comms/std_comms.hpp

seunghwak · 2020-06-16T14:11:46Z

@teju85 I addressed all your last comments :-)

teju85

Changes LGTM! Thanks @seunghwak

Demangle the error stack trace provided by GCC. Example output:

```bash
RAFT failure at file=/workspace/raft/cpp/bench/ann/src/raft/raft_ann_bench_utils.h line=127: Ooops!
Obtained 16 stack frames
#1 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::logic_error::logic_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5e [0x7fb20acce45e]
#2 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::configured_raft_resources::stream_wait(CUstream_st*) const +0x2e3 [0x7fb20acd0ac3]
#3 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::RaftIvfPQ<float, long>::search(float const*, int, int, unsigned long*, float*, CUstream_st*) const +0x63e [0x7fb20acd44fe]
#4 in ./cpp/build/ANN_BENCH: void raft::bench::ann::bench_search<float>(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective) +0xf76 [0x55853859f586]
#5 in ./cpp/build/ANN_BENCH: benchmark::internal::LambdaBenchmark<benchmark::RegisterBenchmark<void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&)::{lambda(benchmark::State&)#1}>::Run(benchmark::State&) +0x84 [0x558538548f14]
#6 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const +0x168 [0x5585385d6498]
#7 in ./cpp/build/ANN_BENCH(+0x149108) [0x5585385b7108]
#8 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoNIterations() +0x34f [0x5585385b8c7f]
#9 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoOneRepetition() +0x119 [0x5585385b99b9]
#10 in ./cpp/build/ANN_BENCH(+0x13afdd) [0x5585385a8fdd]
#11 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) +0x58e [0x5585385aa8fe]
#12 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks() +0x6a [0x5585385aaada]
#13 in ./cpp/build/ANN_BENCH: raft::bench::ann::run_main(int, char**) +0x11ed [0x5585385980cd]
#14 in /lib/x86_64-linux-gnu/libc.so.6(+0x28150) [0x7fb213e28150]
#15 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x89 [0x7fb213e28209]
#16 in ./cpp/build/ANN_BENCH(+0xbfcef) [0x55853852dcef]
```

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: #2188
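Demangling of this kind is typically done through the Itanium C++ ABI helper `abi::__cxa_demangle`, which GCC and Clang expose in `<cxxabi.h>`. The sketch below assumes that toolchain; the `demangle` wrapper is a hypothetical helper and is not taken from the RAFT change referenced above.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <memory>
#include <string>

// Hypothetical helper: returns the human-readable form of a mangled symbol,
// or the input unchanged when it cannot be demangled.
inline std::string demangle(const char* mangled) {
  int status = 0;
  // __cxa_demangle allocates the result with malloc, so release it with free.
  std::unique_ptr<char, void (*)(void*)> result(
    abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
  return (status == 0 && result) ? std::string{result.get()}
                                 : std::string{mangled};
}
```

For example, `demangle("_ZN4raft3barEi")` yields `raft::bar(int)`. A stack-trace printer would apply such a helper to the symbol portion of each frame string before printing it.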

seunghwak added 9 commits June 3, 2020 11:20

copy error.hpp from cuDF, add license statement, and initial update

a928468

add CUML_EXPECTS, CUML_FAIL, CUGRAPH_EXPECTS, and CUGRAPH_FAIL

328462f

add NCCL_TRY

187e12a

fix compile/clang-tidy errors

4ce8f37

fix an error in a comment

086abd3

add CUSPARSE_TRY

4f72257

add CURAND_TRY

a428c6e

address clang-tidy warnings

b9cee2b

update change log

b373267

seunghwak changed the title ~~[WIP][skip-ci] Add error check utilities~~ [REVIEW] Add error check utilities Jun 3, 2020

seunghwak added 3 commits June 3, 2020 13:40

resolve error conflicts

54bdc8e

clang-format fixes

e471f1d

another try to make clang-format happy

035dc00

teju85 requested changes Jun 4, 2020

Merge branch 'branch-0.15' of github.com:rapidsai/raft into fea_ext_error

f8455b9

afender reviewed Jun 5, 2020

cjnolet reviewed Jun 8, 2020

seunghwak added 10 commits June 10, 2020 00:13

move common error handling utilities from cuda_utils.h to error.hpp

2566b24

update raft error classes to inherit raft::exception (instead of std::exception)

3656125

move macros out from the raft namespace

0e62cea

remove CUML(GRAPH)_EXPECTS(FAIL)

55922cb

update RAFT_EXPECTS and RAFT_FAIL

acd5824

compile error fix (namespace)

4a48b57

minor fixes to RAFT_EXPECTS(FAIL)

059f1ec

move error check macros from error.hpp to relevant headers

125911c

clang-format

d3192f4

cosmetic updates

ec0cf97

seunghwak added 3 commits June 11, 2020 17:51

cosmetic updates

f8f8d32

stifle some warnings

c3f153d

clang-format error

85c9b7d

seunghwak requested a review from teju85 June 11, 2020 23:20

afender self-requested a review June 15, 2020 16:00

BradReesWork added this to the 0.15 milestone Jun 15, 2020

BradReesWork added the 3 - Ready for Review label Jun 15, 2020

afender approved these changes Jun 15, 2020

teju85 requested changes Jun 16, 2020

seunghwak added 4 commits June 16, 2020 09:50

fix unused location_prefix in error handling macro

6d9e392

remove NCCL_CHECK (replaced with NCCL_TRY)

4ebc0af

clang-format

851b401

another clang format

07a51a4

seunghwak requested a review from teju85 June 16, 2020 14:10

teju85 approved these changes Jun 16, 2020

BradReesWork merged commit aad0a00 into rapidsai:branch-0.15 Jun 16, 2020

seunghwak mentioned this pull request Jun 16, 2020

[ENH] Use RAFT error handling mechanism rapidsai/cugraph#951

Closed

afender mentioned this pull request Jun 16, 2020

[REVIEW] commSplit Implementation #18

Merged

seunghwak mentioned this pull request Jul 22, 2020

[FEA] also reconcile error_utils from cugraph #8

Closed

seunghwak deleted the fea_ext_error branch October 3, 2020 04:44

kun429973 mentioned this pull request Oct 24, 2024

Raft Cpp Test Error #2476

Open
