Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rocm/Roctracer will hang and crash when interrupted by a real-time timer #22

Closed
mxz297 opened this issue Oct 15, 2019 · 21 comments
Closed

Comments

@mxz297
Copy link
Contributor

mxz297 commented Oct 15, 2019

We encountered a hang when using Roctracer to collect real-time profiling data on both CPUs and GPUs. HPCToolkit collects real time profiling data by repetitively setting a real-time timer and register a signal handler to record samples when the timer goes off. Our example application (the roctracer example) hangs non-deterministically in ioctl.

I created a reproducer based on the Roctracer example, which contains only the real-timer logics without any other HPCToolkit logic.

If we compile this reproducer and run it, it will hang non-deterministically at the following stack trace:

#0  0x00007f7ad29bd5d7 in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007f7acd0f3f28 in kmtIoctl () from /opt/rocm/lib/libhsakmt.so.1
#2  0x00007f7acd0ee36f in hsaKmtWaitOnMultipleEvents () from /opt/rocm/lib/libhsakmt.so.1
#3  0x00007f7acd0ee929 in hsaKmtWaitOnEvent () from /opt/rocm/lib/libhsakmt.so.1
#4  0x00007f7ad38d694a in core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f7ad38d674a in core::InterruptSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f7ad38c8ad9 in HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00007f7abf3c3b31 in waitComplete () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:4930
#8  0x00007f7abf3c37cd in operator() () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:5070
#9  _M_invoke<> () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1530
#10 operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1520
#11 operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1342
#12 0x00007f7abf3c3762 in _M_invoke () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1856
#13 0x00007f7abf3c36d7 in operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:2267
#14 _M_do_set () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:527
#15 0x00007f7ad47f8827 in __pthread_once_slow (once_control=0x7f7a9d0eb1d8, init_routine=0x7f7ad2f6c830 <__once_proxy>) at pthread_once.c:116
#16 0x00007f7abf3c436b in __gthread_once () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0/bits/gthr-default.h:699
#17 call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *> () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/mutex:738
#18 _M_set_result () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:386
#19 _M_complete_async () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1606
#20 0x00007f7abf3a87ce in wait () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:319
#21 wait () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:656
#22 wait () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:1628
#23 0x00007f7abf3a7ef5 in copy_ext () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:4144
#24 0x00007f7ad3b7fa15 in ihipStream_t::locked_copySync(void*, void const*, unsigned long, unsigned int, bool) () from /opt/rocm/hip/lib/libhip_hcc.so
#25 0x00007f7ad3bb9f0e in hipMemcpy () from /opt/rocm/hip/lib/libhip_hcc.so
#26 0x000000000040622d in main () at MatrixTranspose.cpp:183

If we change the timer interval from 2.5 millisecond to 1 millisecond, the reproduce will deterministically crash at the following stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fc0de253801 in __GI_abort () at abort.c:79
#2  0x00007fc0defd8844 in ?? () from ../../build/libroctracer64.so.1
#3  0x00007fc0df51a367 in api_callbacks_spawner_t<3>::api_callbacks_spawner_t(hip_api_id_t const&, hip_api_data_t&) () from /opt/rocm/hip/lib/libhip_hcc.so
#4  0x00007fc0df519e4f in hipMalloc () from /opt/rocm/hip/lib/libhip_hcc.so
#5  0x00000000004060cf in main () at MatrixTranspose.cpp:154

These stack traces include code from roctracer, and also code from other AMD toolchains such as HIP. I cannot really determine whether the root cause of the issue is in HIP or Roctracer, but since HPCToolkit is a direct user of Roctracer, I will post it here.

Reproducer.zip

@mxz297
Copy link
Contributor Author

mxz297 commented Oct 15, 2019

Replacing real-time timers with a cpu time timer will still hang. This suggests that handling of E_INTR for system calls is not the issue.

@eshcherb
Copy link
Contributor

Could you upload the test with cpu time timer?

@mxz297
Copy link
Contributor Author

mxz297 commented Nov 15, 2019

You can replace the a real-time timer with a cpu timer by changing line 65 of the "MatrixTranpose.cpp".

I have a section of comments before line 65 stating how to do it:

	/*  REALTIME timer can interrupt syscalls, but CPUTIME timer will not.
	 *  Replacing CLOCK_REALTIME with CLOCK_PROCESS_CPUTTIME_ID to measure
	 *  CPU time will still hang
	 */

@mxz297
Copy link
Contributor Author

mxz297 commented Dec 13, 2019

@eshcherb We found that the hang has nothing to do with roctracer. In the example attached below, the code does not link against roctrocer and still hangs in the same way. This means that the bug is in the rocm runtime. Where should I report this bug? Or can you forward my reproducer to the responsible developers?

rocm-hang.tar.gz

@eshcherb
Copy link
Contributor

Thank you for your update, appreciate!
I also tried that and I already created an internal AMD ticket for KFD.
Will keep you updated on the progress.

@eshcherb
Copy link
Contributor

eshcherb commented Jan 10, 2020

While we are looking the issue, could you try a workaround:
$ export HSA_ENABLE_INTERRUPT=0

The setting will force ROCr-runtime/KFD to use polling to wait for signals rather than interrupts.

kodiakhq bot added a commit to espressomd/espresso that referenced this issue Apr 3, 2020
The `ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/` issue has been worked around by properly setting `HCC_PATH` on the CMake side.
The shutdown issue has been worked around by replacing interrupts with polling (suggested at ROCm/roctracer#22 (comment)). Something is wrong with the destruction order in our code, but I cannot easily identify what. It's not the missing `cudaDestoryStream` though.

Fixes #3620 (according to `ctest -R save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 --repeat-until-fail 1000`).
Fixes #3587 (according to `ctest -R ek_charged_plate --repeat-until-fail 100`).

**TODO**
- https://github.com/espressomd/docker/blob/master/docker/rocm-python3/Dockerfile-latest needs to be updated to ROCm 3.3 once this pull request is merged.
@ROCmSupport
Copy link

Hi @mxz297
Can you please validate with the latest ROCm 4.5 code and update the status.
Thank you.

@ROCmSupport
Copy link

Hi @mxz297
I am not able to reproduce with the latest ROCm 4.5.
Request to check with the latest ROCm 4.5 code. Thank you.

@mxz297
Copy link
Contributor Author

mxz297 commented Jan 7, 2022

The reproducer no longer compiles with a recent rocm. I tweaked it to compile with rocm-4.3.1 and it does not seem to have the problem any more.

@mxz297 mxz297 closed this as completed Jan 7, 2022
@jrmadsen
Copy link

jrmadsen commented Feb 2, 2022

@ROCmSupport I was able to reproduce this issue w/ ROCm 4.5.0

@jmellorcrummey
Copy link

@ROCmSupport I observed this issue with ROCm 4.5.2 using hpctoolkit's hpcrun to sample miniqmc. The application deadlocked with a thread stuck in a callstack like that above ending in ioctl (I didn't save the callstack). @jrmadsen Do you have a good reproducer or should we try to build one? I think that any thread that calls HSA::hsa_signal_wait_scacquire is vulnerable.

@jrmadsen
Copy link

jrmadsen commented Feb 8, 2022

@jmellorcrummey I haven't created a reproducer yet. For the time being, I've essentially resorted to just setting HSA_ENABLE_INTERRUPT=0 as the default when the library gets loaded:

namespace
{
int disable_hsa_interrupt_on_load = setenv("HSA_ENABLE_INTERRUPT", "0", 0);
}

It is not ideal and definitely causes the CPU utilization to increase but at least it prevents the deadlock. I did delve a little deeper into it.

Testing

I noticed a possible issue with the implementation in ROCT-Thunk-Interface where errno wasn't being saved in kmtIoctl (which is called from HSA::hsa_signal_wait_scacquire):

int kmtIoctl(int fd, unsigned long request, void *arg)
{
	int ret;

	do {
		ret = ioctl(fd, request, arg);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

	if (ret == -1 && errno == EBADF) {
		/* In case pthread_atfork didn't catch it, this will
		 * make any subsequent hsaKmt calls fail in CHECK_KFD_OPEN.
		 */
		pr_err("KFD file descriptor not valid in this process\n");
		is_forked_child();
	}

	return ret;
}

and thought maybe this was happening when the signal handler was overwriting errno but correcting that and installing a custom ROCR-Runtime built against it didn't fix the problem.

I think that any thread that calls HSA::hsa_signal_wait_scacquire is vulnerable.

It appears it is the ioctl call. Interrupting HSA::hsa_signal_wait_scacquire appears fine (maybe thats what you meant).

Once I re-implemented kmtIoctl to block signals around the ioctl call, removed my env setting, and cranked up SIGALRM and SIGPROF to 200 interrupts per second (which virtually guaranteed a deadlock before this change), the deadlocks disappeared:

#include <signal.h>
#include <pthread.h>

static __thread sigset_t _signal_set;

static void setup_signal_set(void)
{
    static __thread size_t _once = 0;
    if(_once != 0) return;
    _once = 1;

    sigemptyset(&_signal_set);
    sigaddset(&_signal_set, SIGPROF);
    sigaddset(&_signal_set, SIGALRM);
    sigaddset(&_signal_set, SIGVTALRM);
}

/* Call ioctl, restarting if it is interrupted */
int kmtIoctl(int fd, unsigned long request, void *arg)
{
    int ret = 0;
    int err = 0;

    setup_signal_set();
    pthread_sigmask(SIG_BLOCK, &_signal_set, NULL);

    do
    {
        ret = ioctl(fd, request, arg);
        err = errno;
    }
    while(ret == -1 && (err == EINTR || err == EAGAIN));

    if (ret == -1 && err == EBADF) {
		/* In case pthread_atfork didn't catch it, this will
		 * make any subsequent hsaKmt calls fail in CHECK_KFD_OPEN.
		 */
		pr_err("KFD file descriptor not valid in this process\n");
		is_forked_child();
	}

    pthread_sigmask(SIG_UNBLOCK, &_signal_set, NULL);

    return ret;
}

Thus, as far as I can tell, it appears that when the signal is delivered during the ret = ioctl(fd, request, arg), it causes ioctl to never return (e.g. in another experiment, I put a print statement right after this line and never see a print statement when the deadlock happens).

EDIT: as a sanity check, I removed the signal blocking and all my tests resumed deadlocking.

@jrmadsen
Copy link

jrmadsen commented Feb 8, 2022

As you can see from this snippet from one of the outputs, there were several potential cases for the signal handler to interrupt the ioctl call:

|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                             WALL CLOCK TIME (VIA SAMPLING)                                                                                                            |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                          LABEL                                                           | COUNT  | DEPTH  |       METRIC        | UNITS  |   SUM     |   MEAN   |   MIN    |   MAX    |   VAR    | STDDEV   | % SELF |
|--------------------------------------------------------------------------------------------------------------------------|--------|--------|---------------------|--------|-----------|----------|----------|----------|----------|----------|--------|
...
| |0>>> |_hipLaunchKernel                                                                                                  |   2337 |      1 | sampling_wall_clock | sec    | 14.395847 | 0.006160 | 0.001834 | 0.207173 | 0.000041 | 0.006394 |    0.0 |
| |0>>>   |_hip_impl::hipLaunchKernelGGLImpl(unsigned long, dim3 const&, dim3 const&, unsigned int, ihipStream_t*, void**) |   2282 |      2 | sampling_wall_clock | sec    | 13.963639 | 0.006119 | 0.001836 | 0.207173 | 0.000040 | 0.006350 |    0.0 |
| |0>>>     |_hipModuleGetTexRef                                                                                           |   2282 |      3 | sampling_wall_clock | sec    | 13.963639 | 0.006119 | 0.001836 | 0.207173 | 0.000040 | 0.006350 |    0.0 |
| |0>>>       |_hipTexObjectCreate                                                                                         |   2258 |      4 | sampling_wall_clock | sec    | 13.788658 | 0.006107 | 0.001836 | 0.207173 | 0.000041 | 0.006377 |    0.0 |
| |0>>>         |_hipTexObjectCreate                                                                                       |   2255 |      5 | sampling_wall_clock | sec    | 13.774659 | 0.006108 | 0.001836 | 0.207173 | 0.000041 | 0.006381 |    0.1 |
| |0>>>           |___new_sem_wait_slow.constprop.0                                                                        |   2016 |      6 | sampling_wall_clock | sec    | 10.615110 | 0.005265 | 0.001845 | 0.207173 | 0.000023 | 0.004846 |    0.0 |
| |0>>>             |_do_futex_wait.constprop.0                                                                            |   2015 |      7 | sampling_wall_clock | sec    | 10.610925 | 0.005266 | 0.001845 | 0.207173 | 0.000023 | 0.004847 |  100.0 |
| |0>>>           |_hipTexObjectCreate                                                                                     |    236 |      6 | sampling_wall_clock | sec    |  3.147407 | 0.013336 | 0.001836 | 0.090864 | 0.000131 | 0.011427 |    0.4 |
| |0>>>             |_hipTexObjectCreate                                                                                   |    233 |      7 | sampling_wall_clock | sec    |  3.134396 | 0.013452 | 0.001836 | 0.090864 | 0.000131 | 0.011454 |    0.1 |
| |0>>>               |_hipTexObjectCreate                                                                                 |    231 |      8 | sampling_wall_clock | sec    |  3.125229 | 0.013529 | 0.001836 | 0.090864 | 0.000132 | 0.011473 |    0.0 |
| |0>>>                 |_hipTexObjectCreate                                                                               |    229 |      9 | sampling_wall_clock | sec    |  3.085253 | 0.013473 | 0.001836 | 0.090864 | 0.000130 | 0.011421 |    0.0 |
| |0>>>                   |_hipTexObjectCreate                                                                             |    229 |     10 | sampling_wall_clock | sec    |  3.085253 | 0.013473 | 0.001836 | 0.090864 | 0.000130 | 0.011421 |    0.0 |
| |0>>>                     |_rocr::HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned... |    227 |     11 | sampling_wall_clock | sec    |  3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 |    0.0 |
| |0>>>                       |_rocr::core::InterruptSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, h... |    227 |     12 | sampling_wall_clock | sec    |  3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 |    0.0 |
| |0>>>                         |_rocr::core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long,... |    227 |     13 | sampling_wall_clock | sec    |  3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 |   98.9 |
| |0>>>                           |_hsaKmtWaitOnEvent                                                                      |      3 |     14 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |    0.0 |
| |0>>>                             |_hsaKmtWaitOnMultipleEvents                                                           |      3 |     15 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |    0.0 |
| |0>>>                               |_kmtIoctl                                                                           |      3 |     16 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |    0.0 |
| |0>>>                                 |_pthread_sigmask                                                                  |      3 |     17 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |  100.0 |

@jmellorcrummey
Copy link

@jrmadsen We agree with your diagnosis. i meant that any thread calling hsa_signal_wait_sacquire is vulnerable to a deadlock. We also believe that the deadlocks arise due to problems with a signal interrupting the ioctl that it calls. We are aware of the flag to block interrupts. We have been using this for 2.5 years when profiling HIP programs with roctracer. I ran into the problem profiling programs using OpenMP offloading through HSA, which didn’t go through the path that set the environment variable to disable HSA interrupts.

We’re glad that you are looking into this.

@jmellorcrummey
Copy link

Why is this issue marked as closed? ROCm still hangs when interrupted with a Linux REALTIME timer?

@skyreflectedinmirrors
Copy link

@jmellorcrummey -- just as an update, I have been trying to reproduce this with both the examples here, but haven't had much luck. I did find another issue in initializing roctracer that I've reported internally:

void start_tracing();
int main() {
start_tracing();
}
#include <roctracer/roctracer_hip.h>
void api_callback(uint32_t domain, uint32_t cid, const void *callback_data,
void *arg) {
const hip_api_data_t *data =
reinterpret_cast<const hip_api_data_t *>(callback_data);
}
void start_tracing() { roctracer_enable_callback(api_callback, NULL); }

Still working on it.

@jmellorcrummey
Copy link

the way i saw this recently was using hpctoolkit’s hpcrun to profile miniqmc. perhaps you can reproduce it with linux perf sampling with realtime.

otherwise we could write a little preloaded library that does nothing but set up realtime sampling on the main thread and wrap pthread_create to start sampling for every other thread. then preload the library and launch miniqmc.

@skyreflectedinmirrors
Copy link

perhaps you can reproduce it with linux perf sampling with realtime.

That's a good idea -- I had been trying to build Omnitrace and use that, but ran into fun build issues on our cluster. I'll try that next. Have you been triggering this on Crusher? I doubt it depends on the system, but thought I'd check.

@jmellorcrummey
Copy link

i haven’t tried this one on crusher. i was seeing it on our local system with mi50 and rocm 5.1.
@jrmadsen saw this problem on amd systems, though i don’t know what flavor rocm and gpus. i think he was using omnitrace.

@jrmadsen
Copy link

jrmadsen commented Jun 24, 2022

perhaps you can reproduce it with linux perf sampling with realtime.

That's a good idea -- I had been trying to build Omnitrace and use that, but ran into fun build issues on our cluster. I'll try that next. Have you been triggering this on Crusher? I doubt it depends on the system, but thought I'd check.

@arghdos You shouldn't have to build it. You should just be able to use this installer and then:

export HSA_ENABLE_INTERRUPT=1
export OMNITRACE_ENABLE_SAMPLING=ON
export OMNITRACE_SAMPLING_FREQ=500

If you haven't been seeing it with omnitrace it's bc I do setenv("HSA_ENABLE_INTERRUPT", "0", 0) when the library loads but I don't override it bc there's nothing wrong if it is 1 when solely using the binary instrumentation

@skyreflectedinmirrors
Copy link

Didn't look close enough to see you didn't override it if set in the env :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants