test_ucp.c was stuck/hang at ucs_init. #10389

Closed
MoFHeka opened this issue Dec 18, 2024 · 5 comments

MoFHeka commented Dec 18, 2024

How should I debug it?

Here is the code: https://github.com/openucx/ucx/blob/master/examples/cmake/test_ucp.c

/**
* Copyright (c) NVIDIA CORPORATION & AFFILIATES, 2021. ALL RIGHTS RESERVED.
*
* See file LICENSE for terms.
*/

#include <ucp/api/ucp.h>
#include <assert.h>
#include <string.h>

int main(int argc, char **argv)
{
    ucp_params_t ucp_params = {};
    ucp_context_h ucp_context;
    ucs_status_t status;

    ucp_params.field_mask = UCP_PARAM_FIELD_FEATURES;
    ucp_params.features   = UCP_FEATURE_AM;
    status = ucp_init(&ucp_params, NULL, &ucp_context);
    assert(status == UCS_OK);

    ucp_cleanup(ucp_context);
    return 0;
}

Here is the build configuration:

./contrib/configure-release-mt --with-avx --with-march --enable-compiler-opt --enable-optimizations --enable-cma --with-rc --with-ud --with-dc --with-ib-hw-tm --with-dm --enable-tuning --with-cuda --with-java=no --with-go=no --enable-example=yes

Here is the GDB backtrace:
Thread 1

#0  0x00007ffff7eb5b65 in sigaction ()
#1  0x0000555555684df6 in ucs_debug_init ()
#2  0x00007ffff7e977de in ucs_init ()
#3  0x00007ffff7fe0b9a in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffd158, env=env@entry=0x7fffffffd168) at dl-init.c:72
#4  0x00007ffff7fe0ca1 in call_init (env=0x7fffffffd168, argv=0x7fffffffd158, argc=1, l=<optimized out>) at dl-init.c:30
#5  _dl_init (main_map=0x7ffff7ffe190, argc=1, argv=0x7fffffffd158, env=0x7fffffffd168) at dl-init.c:119
#6  0x00007ffff7fd013a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7  0x0000000000000001 in ?? ()
#8  0x00007fffffffd4c8 in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 2

#0  0x00007ffff7936bbf in __GI___poll (fds=0x55555572bbc0, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007ffff5d8027f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff5e447cf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffff5d77f73 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffff7a2a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007ffff7943353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

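A debug build, configured as shown below, crashes with a segmentation fault. The backtrace suggests unbounded recursion between sigaction and ucs_orig_sigaction: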

./contrib/configure-release --with-java=no --with-go=no --enable-debug --enable-debug-data

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7df0bba in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
(gdb) bt
#0  0x00007ffff7df0bba in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
#1  0x000055555571e2b5 in ucs_orig_sigaction ()
#2  0x00007ffff7df0bed in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
#3  0x000055555571e2b5 in ucs_orig_sigaction ()
#4  0x00007ffff7df0bed in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
#5  0x000055555571e2b5 in ucs_orig_sigaction ()
#6  0x00007ffff7df0bed in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
#7  0x000055555571e2b5 in ucs_orig_sigaction ()
#8  0x00007ffff7df0bed in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
#9  0x000055555571e2b5 in ucs_orig_sigaction ()
#10 0x00007ffff7df0bed in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
#11 0x000055555571e2b5 in ucs_orig_sigaction ()
#12 0x00007ffff7df0bed in sigaction ()
   from /data/hejia/.cache/bazel/_bazel_hejia/9ab6dbbd34814379183c8ab0e65d3269/execroot/_main/bazel-out/k8-fastbuild/bin/ucx_examples/cmake/../../_solib_k8/_U@@_Umain~openucx_Udep~ucx_S_S_Cucx___Uexternal_S_Umain~openucx_Udep~ucx_Sucx_Slib/libucs_signal.so.0
#13 0x000055555571e2b5 in ucs_orig_sigaction ()
#14 0x00007ffff7df0bed in sigaction ()
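
The alternating frames above (sigaction in libucs_signal.so, then ucs_orig_sigaction in the executable, and back again) look like an interposed function whose lookup of the "original" implementation resolves to the interposer itself. The following is only a toy model of that pattern, not UCX code. Every name in it is hypothetical, and a plain function pointer stands in for the dynamic symbol lookup. It sketches how a forwarder could detect the self-reference instead of recursing until the stack overflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model only: all names are hypothetical and none of this is UCX code. */
typedef int (*toy_action_fn)(int signum);

/* Result of the simulated symbol lookup (stand-in for a dynamic lookup). */
toy_action_fn toy_lookup_result = NULL;

int toy_interposed_action(int signum);

/* Stand-in for the real libc implementation. */
int toy_libc_action(int signum)
{
    assert(signum > 0); /* signal numbers are positive */
    return 0;
}

/* Forwarder: calls whatever the lookup produced, unless the lookup failed
 * or resolved back to the interposer, which would recurse forever. */
int toy_forward_to_orig(int signum)
{
    if (toy_lookup_result == NULL ||
        toy_lookup_result == toy_interposed_action) {
        fprintf(stderr, "lookup did not reach the real implementation\n");
        return -1;
    }
    return toy_lookup_result(signum);
}

/* Interposer: the symbol that callers actually bind to. */
int toy_interposed_action(int signum)
{
    return toy_forward_to_orig(signum);
}
```

With toy_lookup_result set to toy_libc_action, the call chain terminates normally. With it set to toy_interposed_action, the guard returns -1 where an unguarded forwarder would keep calling itself, which is the shape of the backtrace above. Whether this is what happens inside UCX here is an assumption based only on the frames shown.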

yosefe commented Dec 18, 2024

@MoFHeka can you please post the reproducer code (test_ucp.c)?

MoFHeka commented Dec 18, 2024

MoFHeka changed the title from "test_ucp.c was stuck at ucs_init." to "test_ucp.c was stuck/hang at ucs_init." on Dec 18, 2024

MoFHeka commented Dec 19, 2024

I know why now. The ucp, uct, ucs, and ucm libraries cannot all be linked directly into the binary.

yosefe commented Dec 19, 2024

MoFHeka commented Dec 22, 2024

The problem is Bazel's linking method. The "@ucx//:ucx" output built by configure_make should not be used directly as a single dependency of the binary. Instead, it is better to manually create a hierarchy of cc_library dependencies based on the *.la files.

MoFHeka closed this as completed on Dec 22, 2024