Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] CUDA error checking/debugging #94

Merged
merged 24 commits into from
Aug 31, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .clang-format
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ BasedOnStyle: Google
---
Language: Cpp
Cpp11BracedListStyle: true
Standard: Cpp11
Standard: c++11
DerivePointerAlignment: false
PointerAlignment: Right
---
2 changes: 2 additions & 0 deletions k2/csrc/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
include_directories(${CMAKE_SOURCE_DIR})

# please sort the source files alphabetically
add_library(fsa
arcsort.cc
Expand Down
15 changes: 8 additions & 7 deletions k2/csrc/cuda/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -6,22 +6,23 @@ target_link_libraries(context PUBLIC glog)
function(k2_add_cuda_test name)
add_executable(${name} "${name}.cu")
target_link_libraries(${name}
PRIVATE
PRIVATE
context
gtest
gtest_main
)
)
add_test(NAME "Test.Cuda.${name}"
COMMAND
COMMAND
$<TARGET_FILE:${name}>
)
)
endfunction()

# please sort the source files alphabetically
set(cuda_tests
ops_test
utils_test
)
# ops_test
utils_test
debug_test
)

foreach(name IN LISTS cuda_tests)
k2_add_cuda_test(${name})
Expand Down
1 change: 1 addition & 0 deletions k2/csrc/cuda/context.cu
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
#include <cstdlib>

#include "k2/csrc/cuda/context.h"
#include "k2/csrc/cuda/debug.h"
#include "k2/csrc/cuda/error.h"

static constexpr size_t kAlignment = 64;
Expand Down
226 changes: 226 additions & 0 deletions k2/csrc/cuda/debug.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,226 @@
// k2/csrc/cuda/debug.h

// Copyright (c) 2020, Xiaomi Corporation ( authors: Meixu Song )

// See LICENSE for clarification regarding multiple authors

#ifndef K2_CSRC_CUDA_DEBUG_H_
#define K2_CSRC_CUDA_DEBUG_H_

/**
* Include multiple cuda headers to make host compiler preprocessor happy.
*
* @todo
* Find a way to avoid this and make .h/.cc with cuda code
* could be parsed by host compiler (specially, GNU-gcc).
* (May assgin to nvcc to take control through change cmake
* compiler and options. Then, it need cmake-3.18 `FindCUDAToolkit`
* or other config helpers to make things easy.)
*/
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <device_launch_parameters.h>
#include <driver_types.h>

#include <stdio.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use cstdio?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#include <assert.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this header file from cuda? or from c++ library <cassert>? I know there's an assert function in cuda header file, but I don't know how nvcc decides which one to use.


namespace k2 {

/**
* @brief A static assertion
*
* @param[in] exp the compile-time boolean expression that must be true
* @param[in] msg an error message if exp is false
*
* @note `static_assert` is supported by both of host and device.
*
* @code{.cpp}
* K2_STATIC_ASSERT(DEFINED_SHAPE % DEFINED_X == 0);
* @endcode
*/
#define K2_STATIC_ASSERT(exp, msg) static_assert(exp, msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_assert is part of C++11 standard. What is the reason to add another indirection here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, I guess we can just use static_assert.


/**
* @brief Check if the expression is true.
*
* @details Implemented by `assert`, which is supported by host and device.
*
* @param[in] exp the boolean expression that should be true
*
* @code{.cpp}
* K2_ASSERT(1 == 1);
* @endcode
*/
#define K2_ASSERT(exp) assert(exp)

/**
* @brief Check if two arguments are equal.
*
* @details Implemented by `assert`, which is supported by host and device.
*
* @param[in] a left argument to compare
* @param[in] b right argument to compare
*
* @code{.cpp}
* K2_CHECK_EQ(1, 1);
* @endcode
*/
#define K2_CHECK_EQ(a, b) assert( a == b )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please always add parenthesis to your macro arguments.

assert((a) == (b))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we do not need something like K2_CHECK_EQ(a, b) << "some error message" ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@songmeixu I'd rather have it print out the error message, including the values of a and b, and line numbers, if it fails. This will make debugging much easier.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. also of course there will be other similar macros like K2_CHECK, K2_CHECK_GE, K2_CHECK_LT and so on.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@songmeixu there are some comments here you need to address: (1) using parentheses, (2) printing out the values of a and b and the line numbers if it fails.
I'm thinking that if we want a regular assert that just dies, we can use assert().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that assert in cuda has printed some error messages, e.g.

test.cu:19: void testAssert(): block: [0,0,0], thread: [0,0,0] Assertion `should_be_one` failed.

so maybe we need to avoid the duplicate of line numbers when printing error messages.


/**
* @fn
* __host__ __device__ __forceinline__ cudaError_t
* _K2CudaDebug(cudaError_t error,
* const char *filename,
* int line)
*
* @brief This is an error checking function, with context information.
* It's not designed to called by users, but inner macros.
* It's made used by other macros.
*
* @param[in] error an enum type indicating CUDA errors.
* @param[in] filename the source filename that the error comes from.
* @param[in] line the code line that the error happened.
* @return Pass the input CUDA error.
*
* @code{.cpp}
* _K2CudaDebug(cudaGetLastError(), __FILE__, __LINE__);
* @endcode
*/
__host__ __device__ __forceinline__ cudaError_t _K2CudaDebug(
cudaError_t error,
const char *filename,
int line) {
if (cudaSuccess != error) {
#ifndef __CUDA_ARCH__
fprintf(stderr, "CUDA error ID=%d, NAME=%s, [%s, %d]: %s\n",
error, cudaGetErrorName(error),
filename, line,
cudaGetErrorString(error));
fflush(stderr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need fflush here. stderr is unbuffered by default.

#elif __CUDA_ARCH__ >= 200
printf("CUDA error ID=%d, NAME=%s, "
"[block (%d,%d,%d) thread (%d,%d,%d), %s, %d]: %s\n",
error, cudaGetErrorName(error),
blockIdx.x, blockIdx.y, blockIdx.z,
threadIdx.x, threadIdx.y, threadIdx.z,
filename, line,
cudaGetErrorString(error));
#endif
}
return error;
}

/**
* @def K2_CUDA_CHECK_ERROR(cudaError)
*
* @brief Macro for checking cuda error.
*
* @details
* If error is not cudaSuccess, print the error message and exit with the
* error enum value. Otherwise, it does nothing.
*
* @param[in] e an enum type indicating CUDA errors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the type please, e.g. cudaError_t. There are many such enum's in CUDA and they are not all the same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. I meant, mention the type cudaError_t when you document the variable

*
* @code{.cpp}
* K2_CUDA_CHECK_ERROR(cudaGetLastError());
* @endcode
*/
#define K2_CUDA_CHECK_ERROR(e) \
if (::k2::_K2CudaDebug((cudaError_t)(e), __FILE__, __LINE__)) { \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will evaluate e twice! Not good!

exit(e); \
}

/**
* @def K2_CUDA_SAFE_CALL([cuda_runtime_api|kernel])
*
* @brief Macro for checking "cuda standard runtime api"
* or "kernels" return status.
*
* @details
* - If api return status is not cudaSuccess, print and exit.
* - The `cudaDeviceSynchronize` only happens when `NDEBUG` is not defined.
*
* @note
* Kernel launches do not return any error code, thus checking should after it.
* To wait kernel to finish, cudaDeviceSynchronize is called between.
*
* @code{.cpp}
* K2_CUDA_SAFE_CALL(cudaRuntimeApi());
* K2_CUDA_SAFE_CALL(kernel_func<<<...>>>());
* @endcode
*/
#ifndef NDEBUG
#define K2_CUDA_SAFE_CALL(...) \
do { \
(__VA_ARGS__); \
cudaDeviceSynchronize(); \
K2_CUDA_CHECK_ERROR(cudaGetLastError()); \
} while (0)
#else
#define K2_CUDA_SAFE_CALL(...) \
do { \
(__VA_ARGS__); \
K2_CUDA_CHECK_ERROR(cudaGetLastError()); \
} while (0)
#endif

/**
* @def K2_DLOG
*
* @brief Log macro for printf statements.
*
* @details
* `printf` is supported by both host and device. This Log is for debugging,
* the error msg is printed to the stderr. The log msg always get printed,
* regardless of macro `NDEBUG`. Thus it should only be used for debugging.
*
* @code{.cpp}
* K2_DLOG("Value is %d, string is %s ..", i, str);
* @endcode
*/
#ifndef __CUDA_ARCH__
#define K2_DLOG(format, ...) printf(format, __VA_ARGS__)
#elif __CUDA_ARCH__ >= 200
#define K2_DLOG(format, ...) \
printf("[block (%d,%d,%d), thread (%d,%d,%d)]: " format, blockIdx.x, \
blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, \
__VA_ARGS__)
#endif

/**
* @brief An more expensive asserts only checked if `K2_PARANOID` defined.
*
* @details
* If triggered, these info get shown: "file, line, [blockIdx, threadIdx],
* formated message, and the standard asserts info".
*
* @param[in] exp the expression expected to be true
* @param[in] format an error message if exp is false
* @param[in] \optional ... the optional arguments for printf format.
*
* @code{.cpp}
* K2_PARANOID_ASSERT(a >= b, "a must be greater than b, "
* "but now a = %d, b = %d", a, b);
*
* K2_PARANOID_ASSERT(a >= b, "a must be greater than b");
* @endcode
*/
#ifdef K2_PARANOID
#define K2_PARANOID_ASSERT(exp, format, ...) \
do { \
if (exp) \
(void)0; \
else { \
K2_DLOG(" [%s, %d] " format, __FILE__, __LINE__, ##__VA_ARGS__); \
assert(exp); \
} \
} while (0)
#else
#define K2_PARANOID_ASSERT(exp, format, ...) ((void) 0)
#endif

} // namespace k2

#endif // K2_CSRC_CUDA_DEBUG_H_
Loading