Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add error check utilities #15

Merged
merged 30 commits into from
Jun 16, 2020

Conversation

seunghwak
Copy link
Contributor

@seunghwak seunghwak commented Jun 3, 2020

Close #11, #23

This PR copy-and-paste-and-modify cuDF's https://github.com/rapidsai/cudf/blob/branch-0.15/cpp/include/cudf/utilities/error.hpp

Currently supports

RAFT_EXPECTS, RAFT_FAIL, CUML_EXPECTS, CUML_FAIL, CUGRAPH_EXPECTS, CUGRAPH_FAIL, CUDA_TRY, CURAND_TRY, CUSPARSE_TRY, and NCCL_TRY.

Anything else to add?

@seunghwak seunghwak changed the title [WIP][skip-ci] Add error check utilities [REVIEW] Add error check utilities Jun 3, 2020
Copy link
Member

@teju85 teju85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to put the basic assertions in here. But putting assertions related to curand/cusparse/nccl in the same header file means that we'll introduce dependencies on these even if the including source doesn't use these components. I'd highly prefer if we could separate these out.

Comment on lines 217 to 247
// FIXME: unnecessary once CUDA 10.1+ becomes the minimum supported version
#define _CUSPARSE_ERR_TO_STR(err) \
case err: \
return #err;
inline auto cusparse_error_to_string(cusparseStatus_t err) -> const char* {
#if defined(CUDART_VERSION) && CUDART_VERSION >= 10100
return cusparseGetErrorString(status);
#else // CUDART_VERSION
switch (err) {
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_SUCCESS);
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_NOT_INITIALIZED);
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_ALLOC_FAILED);
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_INVALID_VALUE);
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_ARCH_MISMATCH);
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_EXECUTION_FAILED);
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_INTERNAL_ERROR);
_CUSPARSE_ERR_TO_STR(CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED);
default:
return "CUSPARSE_STATUS_UNKNOWN";
};
#endif // CUDART_VERSION
}
#undef _CUSPARSE_ERR_TO_STR

inline void throw_cusparse_error(cusparseStatus_t error, const char* file,
unsigned int line) {
throw raft::cusparse_error(
std::string{"cuSparse error encountered at: " + std::string{file} + ":" +
std::to_string(line) + ": " + std::to_string(error) + " " +
cusparse_error_to_string(error)});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this check already defined in https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cusolver_wrappers.h, better to update the existing code there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to put the basic assertions in here. But putting assertions related to curand/cusparse/nccl in the same header file means that we'll introduce dependencies on these even if the including source doesn't use these components. I'd highly prefer if we could separate these out.

Yeah... so the pros is to keep the related code in one place so it gets easy to enforce consistency in error handling, but the problem you mentioned is also valid (especially NCCL can be problematic if we just compile for single-GPU).

I'm fine with either approach as long as we are consistent and can maintain consistency.

I will move

CUDA related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/cudart_utils.h

cuSparse related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/sparse/cusparse_wrappers.h

NCCL related ones to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/comms/std_comms.hpp

I should addd CUBLAS_TRY and CUSOLVER_TRY to https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cublas_wrappers.h and https://github.com/rapidsai/raft/blob/branch-0.15/cpp/include/raft/linalg/cusolver_wrappers.h

I may drop CURAND related ones for now, but they should be added back when we have curand_wrappers.hpp

And are you guys using XXX_CHECK in RAFT from cuML? If I replace them with XXX_TRY, will this break cuML?

And any thoughts about XXX_TRY vs XXX_CHECK? cuDF is using XXX_TRY, so I am just following cuDF convention but this is undesirable for cuML, I am open to discussion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for CUBLAS_TRY and CUSOLVER_TRY they will be needed in #12

I'm fine with XXX_TRY for consistency with cuDF. It also captures the underlying exception mechanism a bit better in the name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seunghwak sorry for the delayed response.

Yes, we are using the *_CHECK macros everywhere inside cuML. So, replacing those will certainly break our codebase. Moreover, the C-style '%' modifiers provides more readable code than the '<<' style C++ syntax (purely my opinion).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, better keep both *_TRY and *_CHECK in the meantime (better pick one eventually).

There can be a long debate in printf-ish vs cout-ish (e.g. https://stackoverflow.com/questions/2872543/printf-vs-cout-in-c).

My biggest concern was the type safety (in the above link) but it seems like most host compilers generates a warning for this and we are OK unless we ignore warnings. This wasn't a case with nvcc and leads to a weird runtime behavior but this is irrelevant here as this is a host side error checking mechanism. We may better keep the C-ish style (and add C++ style in the future if necessary).

cpp/include/raft/error.hpp Outdated Show resolved Hide resolved
cpp/include/raft/error.hpp Outdated Show resolved Hide resolved
@seunghwak
Copy link
Contributor Author

After some more investigation,

RAFT's exception throwing mechanism

#define THROW(fmt, ...) \

provides additional stack trace compare to the one I brought from cuDF (https://github.com/rapidsai/raft/pull/15/files#diff-98265352f57cb794e805742f18ff96efR37)

and this information can be valuable so I think it is better to preserve this mechanism (but I think it is better to create separate exception classes like logic_error or cuda_error inheriting raft::exception similar to cuDF).

I think we should deprecate (and eventually remove) direct use of ASSERT macro (better use RAFT(CUML, or CUGRAPH)_EXPECTS) to be more consistent with the rest of the RAPIDS.

And we should also deprecate XXX_CHECK (and eventually remove) and use XXX_TRY instead.

@teju85 @cjnolet @dantegd @afender What do you guys think?

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on board with putting the separate TRY macros in their respective wrapper headers but I think we should consider the perspective of assuming RAFT prims only know about RAFT and not CUML or CUGRAPH.

cpp/include/raft/error.hpp Outdated Show resolved Hide resolved
@seunghwak
Copy link
Contributor Author

I'm on board with putting the separate TRY macros in their respective wrapper headers but I think we should consider the perspective of assuming RAFT prims only know about RAFT and not CUML or CUGRAPH.

OK, makes sense, the initial intention of having separate CUGRAPH_EXPECTS and CUML_EXPECTS was to use them in cuGraph and cuML without the need to redefine error handling macros there (instead of using them inside RAFT, within RAFT, better use RAFT_EXPECTS), but we may better avoid cuGraph & cuML specific stuff to RAFT.

@teju85
Copy link
Member

teju85 commented Jun 9, 2020

Hey guys,
I'm a bit undecided about the *_CHECK vs *_TRY (or *_EXPECTS) macros. Let me list out my thoughts:

  1. the *_CHECK, THROW and ASSERT macros being used pretty much in the whole of cuML codebase. So, removing them will create a lot of breakages and require widespread changes. This is probably just a one-time thing and I'm, maybe, overthinking? Would appreciate @JohnZed's inputs here.
  2. This is an important one, IMO. All these cuML macros provide a nice C-style format specifiers syntax. This helps us to customize error messages, while keeping the code nice and short and readable. Personally, I'd even go along to say that C-style format string based messages are far more readable than C++-style '<<' syntax.
  3. There are C++ libraries like 'fmt' to provide format specifiers, but using an external library just for formatting our messages when a simple C-style approach would work, seems like an overkill to me.

I'd love to hear your thoughts about this.

@seunghwak seunghwak requested a review from teju85 June 11, 2020 23:20
@afender afender self-requested a review June 15, 2020 16:00
@afender
Copy link
Member

afender commented Jun 15, 2020

I believe this PR also addresses #23

Copy link
Member

@teju85 teju85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nitpicks.

cpp/include/raft/error.hpp Outdated Show resolved Hide resolved
cpp/include/raft/comms/std_comms.hpp Outdated Show resolved Hide resolved
@seunghwak seunghwak requested a review from teju85 June 16, 2020 14:10
@seunghwak
Copy link
Contributor Author

@teju85 I addressed all your last comments :-)

Copy link
Member

@teju85 teju85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM! Thanks @seunghwak

@BradReesWork BradReesWork merged commit aad0a00 into rapidsai:branch-0.15 Jun 16, 2020
@seunghwak seunghwak deleted the fea_ext_error branch October 3, 2020 04:44
rapids-bot bot pushed a commit that referenced this pull request Feb 15, 2024
Demangle the error stack trace provided by GCC.
Example output:
```bash
RAFT failure at file=/workspace/raft/cpp/bench/ann/src/raft/raft_ann_bench_utils.h line=127: Ooops!
Obtained 16 stack frames
#1 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::logic_error::logic_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5e [0x7fb20acce45e]
#2 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::configured_raft_resources::stream_wait(CUstream_st*) const +0x2e3 [0x7fb20acd0ac3]
#3 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::RaftIvfPQ<float, long>::search(float const*, int, int, unsigned long*, float*, CUstream_st*) const +0x63e [0x7fb20acd44fe]
#4 in ./cpp/build/ANN_BENCH: void raft::bench::ann::bench_search<float>(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective) +0xf76 [0x55853859f586]
#5 in ./cpp/build/ANN_BENCH: benchmark::internal::LambdaBenchmark<benchmark::RegisterBenchmark<void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&)::{lambda(benchmark::State&)#1}>::Run(benchmark::State&) +0x84 [0x558538548f14]
#6 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const +0x168 [0x5585385d6498]
#7 in ./cpp/build/ANN_BENCH(+0x149108) [0x5585385b7108]
#8 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoNIterations() +0x34f [0x5585385b8c7f]
#9 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoOneRepetition() +0x119 [0x5585385b99b9]
#10 in ./cpp/build/ANN_BENCH(+0x13afdd) [0x5585385a8fdd]
#11 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) +0x58e [0x5585385aa8fe]
#12 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks() +0x6a [0x5585385aaada]
#13 in ./cpp/build/ANN_BENCH: raft::bench::ann::run_main(int, char**) +0x11ed [0x5585385980cd]
#14 in /lib/x86_64-linux-gnu/libc.so.6(+0x28150) [0x7fb213e28150]
#15 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x89 [0x7fb213e28209]
#16 in ./cpp/build/ANN_BENCH(+0xbfcef) [0x55853852dcef]


```

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2188
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Exception-based macro and single header file for all error checkers
5 participants