-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TC358743 - NULL pointer dereference on cable plugging #4128
Comments
@6by9 In ancient times, ambassadors with bad news could be killed, but I hope you will spare me in this time :) I brought another nasty bug related to TC358743. |
__vb2_queue_cancel is a warning, not a crash. See the comment at https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-core.c#L1939 just before it. I will look at it again as we should return all queued buffers, but it won't crash the system. |
Thanks. In this case, I will try to reproduce the NULL pointer dereference. |
Catched NULL pointer dereference:
... Crash, reboot And also this, similar as discribed in the first report but with another trace:
I use a slightly different command: here, jpeg encoding is enabled for omx and h264 encoding for mmal using zerocopy from the video device. But it seems to me that this is not related, because I added oops and stacktrace as above (not warnings) and without these options.
|
@6by9 I continued testing attempts and realized that NULL dereference (described in the previous comment) happens stably for about 10-20 attempts to connect the cable. Next, the kernel runs kdbg, but I can't do anything there since the kernel is rebooted after that (probably Arch settings). The command causing the crash is also in the previous comment. |
Another catch on zero. Crash and warning:
|
@6by9 Is there anything I can do to help solve this bug? I get pecked at the head because of the NULL dereference. |
Sup? |
Sorry, still a way down the priority list. @naushir It looks like there is a path where the unicam driver can return a buffer that isn't queued to it. That's what the videobuf2-core.c:997 warning is about. Is it just that we've forgotten to clear UNICAM_ICTL, so the interrupts are still active? |
@6by9 In this situation, I'm more concerned about the NULL dereference rather than the warning that occurs. Apparently you're right and it's a race, given the diverse range of errors. Most of all, I would like to get rid of crash. Warning is a much lesser evil. |
The only way I can see this problem occurring is if
|
@naushir Well, yes, I have reproduced it stably, as you can see :) Can your patch fix NULL dereference, do you think? |
The patch above should fix a potential NULL deference, but it may or may not be the one that you have seen triggered, it's difficult to tell. Please do give it a try and let me know. |
@naushir A potential fix is much better than no fix :) Thank you! I'll take care of it immediately. |
@naushir Got this:
And NULL dereference after ~30 retries:
|
Ok, so that potential fix did nothing, which sort of reassures me that Need to continue looking into this one in more detail... |
If necessary, I can give you access to the device via the serial console. |
@naushir I'm not one hundred percent sure, but could this problem have something to do with using DMA? At least, NULL dereference was played exactly with H264 encoding enabled. It works like this: ustreamer exports v4l2 buffers, then feeds them to the VC. At least so far I haven't been able to reproduce this without h264. But maybe I'm just unlucky. Hypothesis based on ustreamer behavior:
|
It may be generally unrelated. I'm just informing you of what I'm observing. Perhaps exporting buffers somehow negatively affects the warning that we observe, and with some probability exacerbates the effect. |
I was thinking of Supposedly
That warn is that the buffer isn't in the ACTIVE state, ie it's not with the driver such that it should be returned. |
It's possible that we might be looking at two separate and unrelated problem here, the NULL dereference during encode, and the vb2_buf asserts. |
Yes, that was my thinking, and the patch above should have ensured this cannot happen by testing |
But if it's related, then maybe it's underground knocks, and if you fix the problem with active buffers, it can also disappear. In any case, I am at your disposal for testing. |
@mdevaev can you try the below patch (only slightly different from the earlier one) and send me a log if/when you hit the NULL ptr exception? Could you also enable debug logging on the unicam driver with the below command:
|
@naushir Sure. Can I just ask you to attach it as a file because github messed up the indentation. In the last patch, I had to arrange them manually :) |
Sure thing. Here is a zip of the patch file you can apply with |
Thanks! I'll start building it now. The result will be available in the next hour. |
@naushir Bingo! Got it via UART. |
Thanks! Would be useful to see the videobuf2_common warning messages as well if you can capture them. |
@mdevaev , can I just confirm, you only see the NULL ptr exception if you are doing a H.264 encode? If so, any chance you could try reproducing without encode? |
I tried to reproduce this on Raspbian on RPi4B1.4 with the latest commit ec967eb (32bit build) and I wasn't able to trigger the crash nor the warning. I did change some kconfig values, but these were (I suppose) related to kernel debugging. I was using the Auvidea B101 rev 4 board for this. I will try to test this with the latest official kernel to see if I can reproduce the segfault. I tried multiple ways of reproducing this:
EDIT: I trink I also tried stopping ustreamer before plugging the HDMI cable back a few times. |
Taking back: I reproduced both the warning and NULL dereference successfully on the official 5.10.17 build, but I had to use another RPi as the signal source. |
This is the NULL dereference from a local debug build without the zipped patch applied. Here is a fuller stack trace of it: KDB stacktrace
The last exception stack likely belongs to ustreamer; this is the symbolized version:
KGDB stacktrace from the ISR:
|
(this was triggered by the original reproducer - |
Contents of the affected unicam_device struct
Contents of the unicam buffer that is being deleted (?)
Suspicious lists:
I'd be happy to send a complete memory dump, but I couldn't find anything like that apart from kdump, which is likely not working due to its reliance on kexec. |
After rebuilding the kernel with list debugging enabled, I received a BUG caused by (potential?) list corruption here (this might be related to the warning instead of the NULL though):
The problem seems to be "list_add double add", as the 1305 static void unicam_buffer_queue(struct vb2_buffer *vb)
1306 {
1307 struct unicam_node *node = vb2_get_drv_priv(vb->vb2_queue);
1308 struct unicam_buffer *buf = to_unicam_buffer(vb);
1309 unsigned long flags;
1310
1311 spin_lock_irqsave(&node->dma_queue_lock, flags);
1312 list_add_tail(&buf->list, &node->dma_queue);
1313 spin_unlock_irqrestore(&node->dma_queue_lock, flags);
1314 } state of the variables:
|
@mdevaev and @JakubVanek thank you for the logs and traces you have provided! I'm afraid I still cannot see where this may be going wrong from the code. Would you be able to capture some more logs for me with the attached patch please? It adds a few more log points to help trace buffer usage. I am weary of adding to much logging, as it could hide the problem. Thanks again for helping to debug this! With the below patch, there is no need to do |
Time to build the kernel! |
Warning without null dereference (first on plug, second on unplug): no-crash.txt |
Thank you. Ok, I have started to see something strange with the way interrupts occur. It seems this device does not return Unicam Frame End interrupts? Instead we rely on the packet capture PI0 to indicate end of frame? @6by9 can you confirm that? Additionally, from sequence 2 onward, we only get one isr call per frame, with FS + LC + PI0 which is really not great. After that, our videobuf2 warning occurs. Our buffer handling code is probably not robust enough to handle this sequence, particularly when FE is missing and we rely and PI0.
|
Based on the original firwmare driver FE detection is sometimes dubious, so CMP0/PI0 is set up to also look for FE (VC = 0 and DT/short packet ID = 1 (FE)) |
I can't catch null dereference right now on the same kernel with the new patch, it seems the new debug logs take up kernel time and reduce the chance of a race. |
The ISR is taking FAR to long to run
That looks to be 40ms, which explains why FS + LC + FE/PI0 all occur at the same time. FS+FE/PI0 simultaneously is a condition we can't handle well to determine whether the buffer is actually swapped or not. |
Do you have any experience with this device? It never seems to produce FE interrupts, and often we get only one interrupt per frame with FS + LC + PI0 set together. This is not good as buffer swaps must happen on LC interrupts but only without FE. |
Good catch, I missed that. This is likely due to my patch that adds a spinlock wait in that bit of code. I think the problem starts on the first interrupt where we get FS + PI0. Since LC has not had a chance to swap buffers, cur_buf and next_buf are the same values. We could not do the FE handling if we have a FS + FE simultaneously, but I'm not sure that is the right answer. |
It is Unicam that sometimes fails to trigger on FE, not the CSI2 source failing to produce it. If you want one of these then I have about 6 here. I need to head past the office and can detour slightly to your place. |
I think I have a grasp on the buffer warning now. Hopefully the attached patch should get rid of it. However, I am not convinced that this would also get rid of the NULL deference exception. @mdevaev can you give this a try whenever you are able to please? I've removed the extra spinlocks so that interrupts should also behave better with this patch. EDIT: Noticed a typo in the last patch, please reapply this updated one. |
Building... |
I had to set debug level to 3 to be able to use ctrl+c. Also found some interesting thing about i2c (maybe not related):
Warning and null dereference not yet reproduced |
It seems the problem is fixed. At least I didn't manage to reproduce it for an hour and a half. I suggest to clean the patch and apply it. It doesn't seem to get any worse :) |
Good news! I'll prepare a clean commit for a PR soon. |
@naushir If it's not difficult, plz attach the new patch, I will test the final version. |
No problem. I will cc you on the PR. Once you confirm it working we will submit it. However I may not be able to get it ready until later on in the day. |
It's okay, I'll wait. Solving this problem is a top priority for me. |
Given the presence of the fix, I think that this issue can be closed. |
Describe the bug
If you connect the cable while the ustreamer is running, the driver may crash. In rarer cases, with the same reproducing method - NULL pointer dereference. Perhaps the nature of both of these bugs is the same, so this is one issue. The bug is not related to image encoding, just getting a RAW stream is enough.
To reproduce
At the moment, it can't be reproduced on Raspbian (only on Arch), as it has not yet built the 5.10.13 kernel and cannot be installed with rpi-update. However the approximate method looks like this:
ustreamer --dv-timings --device=/dev/video0 --persistent --format=uyvy --encoder=noop
The error may occur after the second or twentieth iteration, it is non-periodic, but it occurs when the cable is connected. In rarer cases, some kind of NULL pointer dereference may occur, but I have not yet managed to catch it in the log, I only saw it briefly in the UART console three times, after which the Pi rebooted. In even rarer cases, something strange happens like double free or 'munmap_chunk(): invalid pointer'. I suppose it's all one problem.
On older 5.4 kernels, I didn't see this. Interestingly, if you now try to reproduce the problem on Raspbian with the kernel 5.10.11, the first connection of the cable will cause segfault in ustreamer, this was also not observed before.
If you need any more information, I'm ready to help as usual.
System
version 99d9a48302e4553cff3688692bb7e9ac760a03fa (clean) (release) (start) / Jan 27 2021 22:19:57
Linux pikvm 5.10.13-2-ARCH #1 SMP Fri Feb 5 15:47:26 UTC 2021 armv7l GNU/Linux
Logs
Just dmesg: dmesg1.txt
Another crash (UART log):
screenlog1.txt
The text was updated successfully, but these errors were encountered: