RPi 2b & RPi 3b crash after a while with kernel 5.10.31+ #4319

Closed
henkv1 opened this issue May 3, 2021 · 43 comments · Fixed by #4418 or #4431

Comments

@henkv1

henkv1 commented May 3, 2021

Describe the bug
Since the upgrade to linux-raspberrypi version 5.10.31 and also 5.10.32 and 5.10.33, my Raspberry Pi 3b crashes after a while. I do not have this problem with kernel 5.10.27. I've also tried a Raspberry Pi 2b with the same config and it also crashes.
Sometimes the system crashes after an hour, sometimes after a day. It seems that the crashes occur randomly.

System

  • Which model of Raspberry Pi? Pi 2b and Pi 3b
  • Which OS and version: Arch Linux Arm, latest version
  • Which firmware version: version a48d332c35ee1c1c1ab433228e23317f62dcc5fb
  • Which kernel version: 5.10.31 and up

The system is connected to a TV (1920x1080) using HDMI. The TV is switched off most of the time.
The system is used as a wireless access point and sometimes Kodi is started.

Logs
I do not see anything in the logs. Also, when the system crashes, the TV output gives no signal.

@henkv1
Author

henkv1 commented May 5, 2021

The crashes are caused by the new vc4 code.
I've compiled kernel 5.10.33 with the old vc4 code from 5.10.27 and it is running stable now.

@popcornmix
Collaborator

What are the symptoms of the crash? Just display going blank? Does an ssh connection still work? Anything in dmesg?
Are you saying it is this #4302 that causes the issue?

@henkv1
Author

henkv1 commented May 5, 2021

The system stalls completely. The display goes blank and ssh is not working anymore. There is also nothing in the journald logs that relates to the crash.
I will rebuild my kernel with #4302 to see if that causes the crashes.

@mripard
Contributor

mripard commented May 7, 2021

I was seeing something similar due to a patch in #4302 that I fixed in #4313. Maybe it can help?

@henkv1
Author

henkv1 commented May 7, 2021

It looks like #4302 indeed causes the crashes. I've compiled d62dee4 and experienced no crashes. After that I patched the source with #4302 and the system crashed after I turned on the TV (not every time though).
I see that #4313 is merged, so I will try with the latest source.

@henkv1
Author

henkv1 commented May 8, 2021

Unfortunately #4313 did not fix the issue.

@henkv1
Author

henkv1 commented May 10, 2021

I also tried 5.12.1, but it has the same issue. The system crashed when I switched on the TV.

@mripard
Contributor

mripard commented May 10, 2021

I'm sorry it didn't fix it, and thanks for testing. So if I understand correctly, it sits idle with a TV connected to it through HDMI. The TV is off most of the time, and the hang occurs when you turn the TV on?

@henkv1
Author

henkv1 commented May 10, 2021

Yes, the TV is off most of the time while the RPi is on (it is used as an access point as well). The TV is switched on a couple of times a day.
The system crashes sometimes when I turn on the TV, but not always. Normally, the system crashes within a day.
Unfortunately the system is not accessible via SSH after the crash and the screen gives no signal. Is there any other way to debug this?

@mripard
Contributor

mripard commented May 11, 2021

5.12.1 is weird too, since it doesn't have the content of #4302. I tried all afternoon to reproduce it on my TV with a 3B and 3B+ and couldn't reproduce it. Are you sure of the branch you tested with?

@henkv1
Author

henkv1 commented May 11, 2021

I tried the rpi-5.12.y branch, which includes #4302. For instance, the 'drm/vc4: hdmi: Prevent clock unbalance' patch is committed in rpi-5.12.y as 482543f and in rpi-5.10.y as 981adc5.

Is there a specific patch in #4302 that could cause this issue, so I can try to revert it?

@henkv1
Author

henkv1 commented May 13, 2021

I am pretty sure this commit causes the crashes: e259821
I reverted it and haven't seen any crashes for two days now.

@henkv1
Author

henkv1 commented May 18, 2021

I experienced another crash yesterday, so the issue is not fixed. It only occurs less frequently.

@henkv1
Author

henkv1 commented Jun 3, 2021

@mripard , @popcornmix
After some extensive testing I found out that the crashes were caused by this commit: 3cf3d39 (drm/vc4: Rework the encoder retrieval code)
This commit is reverted in cbf1a95, so cbf1a95 does not have this issue.

Unfortunately, the issue is re-introduced in 7416691 (drm/vc4: crtc: Fix vc4_get_crtc_encoder logic)

@mripard
Contributor

mripard commented Jun 4, 2021

755b2c8 fixes a similar crash. Did you have it in your tree when you tested?

@henkv1
Author

henkv1 commented Jun 4, 2021

I've tested it with c2f585a (5.10.38). This contains 755b2c8, but the issue was still present.
After that, I reverted 7416691 from c2f585a and that fixed the issue.

@mripard
Contributor

mripard commented Jun 16, 2021

Thanks for testing. You seem to have a fairly reliable test now, did you change anything?

Anything a bit out of the ordinary in your setup, or is it just a Pi3 connected to a TV?

@henkv1
Author

henkv1 commented Jun 17, 2021

No, nothing special. Just a Pi3 connected to a Samsung LE32B650 TV. I just tested a more recent Git snapshot (1c38342) and the issue is still there.

@mripard
Contributor

mripard commented Jun 17, 2021

You seem to have a test reliable enough to allow you to bisect though. I tried turning my TV off and back on with a Pi3 for about an hour without managing to trigger your bug. Can you share what you're doing exactly to trigger it?

@henkv1
Author

henkv1 commented Jun 17, 2021

It's nothing special. I just switch on the TV a few times a day. The system displays the console. I get a signal most of the time, but sometimes I do not get a signal and the pi crashes. It seems to occur at random.

I have a script that checks the CEC status of the TV and when it's switched on, the script starts Kodi. Kodi switches off the TV again when I shut down Kodi.

Maybe it is the combination of my specific TV and the pi that triggers this bug.

@henkv1
Author

henkv1 commented Jun 18, 2021

It looks like this little patch fixes the issue:

--- a/drivers/gpu/drm/vc4//vc4_crtc.c   2021-06-17 14:27:45.210646017 +0200
+++ b/drivers/gpu/drm/vc4//vc4_crtc.c   2021-06-17 10:06:57.043833267 +0200
@@ -294,7 +294,7 @@
                if (!conn_state)
                        continue;
 
-               if (conn_state->crtc == crtc) {
+               if (connector->state->crtc == crtc) {
                        drm_connector_list_iter_end(&conn_iter);
                        return connector->encoder;
                }

I'm not sure what the difference is between conn_state->crtc and connector->state->crtc.
Maybe you can explain this, so I can understand what's happening here (and hopefully fix it)?
Thank you.

@mripard
Contributor

mripard commented Jun 21, 2021

I had an unrelated bug that made me rework that code today, and I sent a PR with that fix: #4402. It would be great if you could test it and see if it fixes the issue.

@henkv1
Author

henkv1 commented Jun 22, 2021

I tried https://github.com/mripard/rpi-linux/tree/rpi/5.10-core-clock-request-fix, but unfortunately this does not fix the issue.

@henkv1
Author

henkv1 commented Jun 27, 2021

It turns out that the system crashes when vc4_get_crtc_encoder is called from vc4_crtc_atomic_disable.
Removing it fixes it, but it probably breaks other stuff:

--- drivers/gpu/drm/vc4/vc4_crtc.c.orig	2021-06-24 20:00:23.477023243 +0200
+++ drivers/gpu/drm/vc4/vc4_crtc.c	2021-06-26 13:04:27.060557426 +0200
@@ -522,7 +522,7 @@
 	struct drm_crtc_state *old_state = drm_atomic_get_old_crtc_state(state,
 									 crtc);
 	struct vc4_crtc_state *old_vc4_state = to_vc4_crtc_state(old_state);
-	struct drm_encoder *encoder = vc4_get_crtc_encoder(crtc, old_state);
+	struct drm_encoder *encoder;
 	struct drm_device *dev = crtc->dev;
 
 	drm_dbg(dev, "Disabling CRTC %s (%u) connected to Encoder %s (%u)",

@mripard
Contributor

mripard commented Jun 28, 2021

Thanks for digging into this, it's definitely weird.

One can force that function to run by running

echo off > /sys/class/drm/card0-HDMI-A-1/status

And it's running without an issue here with a display connected to HDMI.

Do you have any way to access the logs once it has failed (for example with a UART)? If so, could you add drm.debug=0x1f to the kernel command line in /boot/cmdline.txt and paste the result?
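
As a rough guide to what drm.debug=0x1f turns on: the parameter is a bitmask of debug categories. The sketch below decodes a mask into category names. The bit-to-name table is written down from the drm module's parameter description as an assumption and should be confirmed with `modinfo drm` on the kernel in use; the function name drm_debug_categories is made up for this example.

```shell
#!/bin/sh
# Decode a drm.debug bitmask into category names.
# ASSUMPTION: bit meanings follow the drm module's "debug" parameter
# description (0x01 core, 0x02 driver, 0x04 kms, 0x08 prime,
# 0x10 atomic, 0x20 vbl). Verify with `modinfo drm`.
drm_debug_categories() {
    mask=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    out=""
    for entry in 1:core 2:driver 4:kms 8:prime 16:atomic 32:vbl; do
        bit=${entry%%:*}    # part before the colon
        name=${entry#*:}    # part after the colon
        if [ $(( mask & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}

drm_debug_categories 0x1f   # prints: core driver kms prime atomic
```

Under that assumption, 0x1f enables the core, driver, KMS, PRIME and atomic messages, which is why the log gets very verbose.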

@henkv1
Author

henkv1 commented Jun 28, 2021

It looks like that function does indeed trigger the issue. When I force it, the system crashes and the TV shows 'No Signal'.
I will try to collect some more information.

@henkv1
Author

henkv1 commented Jun 28, 2021

I tried a different monitor and installed the default Arch kernel 5.10.44-4-ARCH. But with this kernel and different monitor, I get the same crash when I force the function with
echo off > /sys/class/drm/card0-HDMI-A-1/status

@mripard
Contributor

mripard commented Jun 28, 2021

What is the new monitor you've been using? Also, what display stack is being run when it crashes? Kodi? Xorg?

@henkv1
Author

henkv1 commented Jun 28, 2021

I finally know what triggers the crash and how to reproduce it: it crashes when cec-client is called after the HDMI is turned off.
To reproduce it:

# echo off > /sys/class/drm/card0-HDMI-A-1/status 
# cec-client

@mripard
Contributor

mripard commented Jun 29, 2021

Ah, yes, that makes a lot of sense. The changes you pointed out earlier fix the encoder retrieval logic that was broken before, and now the HDMI controller will be completely shut down when not active anymore. Running cec-client will try to access its registers, and will stall the CPU.

I'm not entirely sure how libcec uses the CEC controller (it seems to have support for both the kernel CEC API, and the VC4 firmware one), but this is reproducible using the kernel CEC API with:

cec-ctl --tuner -p 1.0.0.0
echo off > /sys/class/drm/card0-HDMI-A-1/status
cec-compliance

I'll look into it, thanks for your debugging

@mripard
Contributor

mripard commented Jun 29, 2021

I just pushed a PR that seems to fix this for me.

Now, one can do

# echo off > /sys/class/drm/card0-HDMI-A-1/status
# cec-ctl --tuner -p 1.0.0.0
# cec-compliance
# echo on > /sys/class/drm/card0-HDMI-A-1/status

With everything working as intended

@henkv1
Author

henkv1 commented Jun 30, 2021

I compiled mripard@126af69
But both my Pi2b and Pi3b crash when the vc4 module is loaded.

@henkv1
Author

henkv1 commented Jul 1, 2021

After I reverted #4418, the module loads fine. So the patch breaks something.

@mripard
Contributor

mripard commented Jul 2, 2021

I just pushed a new version of #4418 that should fix both issues you were seeing

@henkv1
Author

henkv1 commented Jul 3, 2021

@pelwell: please re-open this issue. The new version only makes it worse. The system now crashes as soon as I modprobe the vc4 module.

@pelwell
Contributor

pelwell commented Jul 3, 2021

Sure - the close was automatic because of the Fixes tag in the PR.

pelwell reopened this Jul 3, 2021

@henkv1
Author

henkv1 commented Jul 4, 2021

I found out what causes the crash: I had video=HDMI-A-1:1920x1080@60e in cmdline.txt. With this option, the system crashes as soon as the vc4 module is loaded. Without this option, the system works fine and the CEC issue is also solved.
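
For readers unfamiliar with the option, here is a rough sketch of how a video= entry like the one above breaks down. It assumes the syntax described in the kernel's modedb documentation (connector name, resolution, an optional M for CVT timings, and a trailing e, D or d to force the output on or off) and handles only that subset. The function name describe_video_opt is hypothetical.

```shell
#!/bin/sh
# Split a video= option into connector, resolution, the timing formula
# the kernel would use if it has to generate the mode, and the force flag.
# ASSUMPTION: only the subset <conn>:<W>x<H>[M][@<refresh>][e|D|d] is handled.
describe_video_opt() {
    opt=${1#video=}             # drop the "video=" prefix if present
    conn=${opt%%:*}             # connector name, e.g. HDMI-A-1
    mode=${opt#*:}              # everything after the first colon
    res=${mode%%[!0-9x]*}       # leading WxH part
    rest=${mode#"$res"}         # remaining flags, e.g. "@60e"
    case $rest in
        M*) timing=CVT ;;       # 'M' right after the resolution asks for CVT
        *)  timing=GTF ;;       # otherwise GTF timings are generated
    esac
    case $rest in
        *e) force=force-enable ;;
        *D) force=force-enable-digital ;;
        *d) force=force-disable ;;
        *)  force=none ;;
    esac
    echo "$conn $res $timing $force"
}

describe_video_opt 'video=HDMI-A-1:1920x1080@60e'
# prints: HDMI-A-1 1920x1080 GTF force-enable
```

The trailing e is what forces the output on regardless of what the connector detects.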

@6by9
Contributor

6by9 commented Jul 4, 2021

Does your display not give you an EDID, or is 1920x1080@60 missing from it?

If it isn't there, then adding your video= entry to cmdline.txt will add the GTF mode for 1920x1080@60 which is

Modeline "1920x1080_60.00"  172.80  1920 2040 2248 2576  1080 1081 1084 1118  -HSync +Vsync

The maximum supported pixel clock for Pi0-3 is 162MHz, therefore that mode is invalid and pruned. Slightly annoyingly, should that happen, DRM doesn't discard the command line mode and just fails to set up the display. (It shouldn't actually crash; it just doesn't enable the display.)

If you add the 'M' option to use CVT timings then it still exceeds the limit:

Modeline "1920x1080_60.00"  173.00  1920 2048 2248 2576  1080 1083 1088 1120 -hsync +vsync

You need to select the CEA or DMT timings to support 1920x1080@60 with a pixel clock under 162MHz (CEA/DMT uses 148.5MHz), but AFAIK that can only be done via an EDID and not via the command line.
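
The arithmetic behind those numbers can be checked by hand: the pixel clock is the total horizontal pixels times the total lines times the refresh rate. A small sketch, using the GTF totals from the modeline above and, as an assumption not stated in this thread, the standard CEA-861 totals of 2200x1125 for 1080p60. The function names are illustrative, and the 162MHz limit is the figure quoted above.

```shell
#!/bin/sh
# Pixel clock in kHz = htotal * vtotal * refresh (Hz) / 1000, integer maths.
pixel_clock_khz() {
    echo $(( $1 * $2 * $3 / 1000 ))
}

# Limit quoted above for Pi0-3, expressed in kHz.
MAX_KHZ=162000

check_mode() {
    # $1 = label, $2 = htotal, $3 = vtotal, $4 = refresh in Hz
    khz=$(pixel_clock_khz "$2" "$3" "$4")
    if [ "$khz" -le "$MAX_KHZ" ]; then
        echo "$1: $khz kHz ok"
    else
        echo "$1: $khz kHz exceeds limit"
    fi
}

check_mode GTF 2576 1118 60   # totals taken from the GTF modeline above
check_mode CEA 2200 1125 60   # ASSUMED CEA-861 totals for 1080p60
```

This gives 172798 kHz for the GTF mode (over the limit) and 148500 kHz for the CEA mode, matching the 172.80 and 148.5MHz figures above.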

@henkv1
Author

henkv1 commented Jul 4, 2021

The display gives an EDID, so the video= entry is not necessary. I used it to set the screen offset, but I removed that part some time ago. The entry worked before #4418 was merged and now the system crashes with it, so there must be some regression here.

@mripard
Contributor

mripard commented Jul 5, 2021

The PR #4431 I just sent should address this

@henkv1
Author

henkv1 commented Jul 5, 2021

Thank you, it looks like this fixes the issue

@mripard
Contributor

mripard commented Jul 6, 2021

Thanks for testing, and for your persistence, it's been very helpful :)

@pelwell
Contributor

pelwell commented Jul 6, 2021

Yes, thank you @henkv1. The fixing patch (#4431) is in the latest rpi-update firmware release.
