Switching to info screen sometimes hangs #120

Closed
wsakernel opened this issue Oct 11, 2023 · 7 comments · Fixed by #123

Comments

@wsakernel
Contributor

As the title says, pressing '1' for the info screen sometimes locks up the program. The bottom line highlights 'info', but nothing else is printed on the screen. I can reproduce it with current top-of-tree (58fb5a4) by constantly switching between the info screen and the help screen. It might take some seconds, but the issue shows up here on two different machines. I haven't had time to debug it further; my gut feeling is that there is a race condition between the pthreads. The lockup does not happen between the other screens I tried. Let me know if more information is needed.

@grrtrr
Contributor

grrtrr commented Oct 12, 2023

Yes, more information is needed.

@grrtrr
Contributor

grrtrr commented Oct 30, 2023

@wsakernel - how can this be reproduced? Which system are you using: is it Linux with glibc or musl, and is the architecture x64 or ARM? Any other information?

I am not able to reproduce this locally; related concurrency bugs were fixed quite a while ago.

@wsakernel
Contributor Author

I see it on two different Intel x86-64 based off-the-shelf Fujitsu laptops, one running Debian 11 (bullseye) and one running Debian 12 (bookworm). Nothing fancy, standard glibc. I can reproduce it by repeatedly hitting keys '1' and '3' (note that I don't run as root, so the scan screen is just the text about missing CAP_SYS_ADMIN), but '1' and '2' might fail as well. It is always the info screen which hangs. The frequency with which I change screens does not matter. It happens "randomly". Just tried again, and it failed after 3 seconds. The next failure took 20 seconds of switching screens. I guess I need to fire up GDB for really useful info, but sadly I have no time for digging into this right now :(

@wsakernel
Contributor Author

Okay, I may have no time, but I have interest ;)
Main thread is trapped in this loop:

        while (!ready) {
                pthread_mutex_lock(&linkstat_mutex);
                ready = ls_new && !ls_tmp;
                pthread_mutex_unlock(&linkstat_mutex);
        }
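
The loop above spins until the sampling thread publishes data, so it never exits if that thread is blocked. As a minimal sketch only, and not wavemon's actual code, a wait like this could be bounded with a condition variable and a deadline. The names `sample_mutex`, `sample_cond`, `sample_ready`, `publish_sample()` and `wait_for_sample()` are hypothetical stand-ins for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for the shared sampling state; not wavemon's real variables. */
static pthread_mutex_t sample_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  sample_cond  = PTHREAD_COND_INITIALIZER;
static bool sample_ready;

/* Called by the sampling thread once a sample is available. */
void publish_sample(void)
{
	pthread_mutex_lock(&sample_mutex);
	sample_ready = true;
	pthread_cond_signal(&sample_cond);
	pthread_mutex_unlock(&sample_mutex);
}

/*
 * Wait up to timeout_ms for a sample.
 * Returns true if a sample arrived, false if the wait timed out.
 */
bool wait_for_sample(int timeout_ms)
{
	struct timespec deadline;
	bool ready;
	int rc = 0;

	/* pthread_cond_timedwait() takes an absolute deadline (CLOCK_REALTIME by default). */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec  += timeout_ms / 1000;
	deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&sample_mutex);
	/* Re-check the predicate after every wakeup; stop once the deadline passes. */
	while (!sample_ready && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&sample_cond, &sample_mutex, &deadline);
	ready = sample_ready;
	pthread_mutex_unlock(&sample_mutex);
	return ready;
}
```

With this shape the UI thread sleeps instead of spinning, and a `false` return lets it redraw or report the stall rather than hang.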

Sampling thread is stuck here:

38   Thread 0x7feea59ff6c0 (LWP 14122) "wavemon" __recvmsg_syscall (flags=34, msg=0x7feea59fec40, fd=7)
        at ../sysdeps/unix/sysv/linux/recvmsg.c:27

with this backtrace

#0  __recvmsg_syscall (flags=34, msg=0x7feea59fec40, fd=7) at ../sysdeps/unix/sysv/linux/recvmsg.c:27 
#1  __libc_recvmsg (fd=7, msg=0x7feea59fec40, flags=34) at ../sysdeps/unix/sysv/linux/recvmsg.c:41
#2  0x00007feea6016255 in nl_recv () from /lib/x86_64-linux-gnu/libnl-3.so.200
#3  0x00007feea6016bfd in nl_recvmsgs_report () from /lib/x86_64-linux-gnu/libnl-3.so.200
#4  0x00007feea6016e89 in nl_recvmsgs () from /lib/x86_64-linux-gnu/libnl-3.so.200
#5  0x0000557141829058 in handle_cmd (cmd=0x557141835c20 <cmd_survey>) at iw_nl80211.c:100
#6  0x0000557141829105 in handle_interface_cmd (cmd=0x557141835c20 <cmd_survey>) at iw_nl80211.c:125
#7  0x000055714182aa27 in iw_nl80211_get_survey (sd=0x557141dcc500) at iw_nl80211.c:767
#8  0x000055714182a898 in iw_nl80211_get_linkstat (ls=0x557141dcc3e0) at iw_nl80211.c:681

It seems this syscall blocks. Maybe my wifi driver is unable to respond?
"Intel Corporation Dual Band Wireless-AC 8260 [8086:1010]" here, with iwlwifi and kernel 5.18.5.

@wsakernel
Contributor Author

Other progs had a similar problem, too: sonic-net/sonic-swss-common#114

Setting the socket to non-blocking is not enough on its own: the while (ret > 0) loop in handle_cmd() then runs endlessly.

@grrtrr
Contributor

grrtrr commented Oct 30, 2023

Thanks for looking into this. It seems there is no quick fix for this, and I have only limited extra time at the moment.

@wsakernel
Contributor Author

I'll try adding a timeout on top of non-blocking mode. Discarding the broken cmd and starting a new one seems to work. We will see...
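
To illustrate the timeout idea in isolation, here is a rough sketch using plain POSIX sockets rather than libnl. It waits a bounded time for data and reports a timeout, so the caller can discard the stalled request and issue a new one. The helper name `recv_with_timeout()` is hypothetical, and this is not the code that was eventually committed.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read once.
 * Returns the number of bytes read, -ETIMEDOUT if nothing arrived in time,
 * or another negative errno value on failure.
 */
ssize_t recv_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);
	ssize_t n;

	if (rc < 0)
		return -errno;
	if (rc == 0)
		return -ETIMEDOUT;	/* caller can drop this request and start a new one */

	/* MSG_DONTWAIT keeps this single read from blocking even on a blocking socket. */
	n = recv(fd, buf, len, MSG_DONTWAIT);
	return n < 0 ? -errno : n;
}
```

Because the wait happens in `poll()`, the receive loop cannot spin on an empty non-blocking socket, and it cannot block forever either.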

wsakernel added a commit to wsakernel/wavemon that referenced this issue Oct 31, 2023
Info screen waits for the first data to arrive. This stalls sometimes on
my machine because the netlink command does not complete for some
reason. So, set the socket to non-blocking and try again next cycle if
no data is available. This needs a small version bump for libnl from
3.2 to 3.2.22 because only since then the call to nl_recvmsgs() returns
-NLE_AGAIN. Fixes uoaerg#120.
grrtrr pushed a commit that referenced this issue Oct 31, 2023
Info screen waits for the first data to arrive. This stalls sometimes on
my machine because the netlink command does not complete for some
reason. So, set the socket to non-blocking and try again next cycle if
no data is available. This needs a small version bump for libnl from
3.2 to 3.2.22 because only since then the call to nl_recvmsgs() returns
-NLE_AGAIN. Fixes #120.
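
The commit message describes a "try again next cycle" pattern: make one non-blocking receive attempt and let the caller retry later if nothing is queued. The sketch below shows that pattern with plain POSIX sockets as an analogy; it is an assumption-laden illustration, not the committed change. In libnl the corresponding pieces would be `nl_socket_set_nonblocking()` and the `-NLE_AGAIN` return from `nl_recvmsgs()` mentioned above. The helper names `set_nonblocking()` and `try_recv_once()` are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Make exactly one receive attempt.
 * Returns the number of bytes read, -EAGAIN if nothing is queued yet
 * (the caller should retry on its next cycle), or another negative errno.
 */
ssize_t try_recv_once(int fd, void *buf, size_t len)
{
	ssize_t n = recv(fd, buf, len, 0);

	if (n < 0)
		return (errno == EAGAIN || errno == EWOULDBLOCK) ? -EAGAIN : -errno;
	return n;
}
```

A sampling loop built on this would treat `-EAGAIN` as "no new data this cycle" and keep the previous sample, instead of looping until data arrives.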