Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel issue? #2557

Closed
mtausk opened this issue May 17, 2018 · 32 comments
Closed

Kernel issue? #2557

mtausk opened this issue May 17, 2018 · 32 comments
Labels
Waiting for external input Waiting for a comment from the originator of the issue, or a collaborator.

Comments

@mtausk
Copy link

mtausk commented May 17, 2018

Similiar to #1587
Not sure which update caused this.
Linux pi 4.14.17-v7+ #1090 SMP Mon Feb 5 21:02:18 GMT 2018 armv7l armv7l armv7l GNU/Linux
Ubuntu 16.04

May 17 18:48:11 kernel: [ 6141.921740] INFO: task kworker/u8:0:5 blocked for more than 120 seconds.
May 17 18:48:11 kernel: [ 6141.921764]       Tainted: G         C      4.14.17-v7+ #1090
May 17 18:48:11 kernel: [ 6141.921774] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 18:48:11 kernel: [ 6141.921785] kworker/u8:0    D    0     5      2 0x00000000
May 17 18:48:11 kernel: [ 6141.921812] Workqueue: writeback wb_workfn (flush-8:1-fuseblk)
May 17 18:48:11 kernel: [ 6141.921849] [<807762c0>] (__schedule) from [<80776938>] (schedule+0x50/0xa8)
May 17 18:48:11 kernel: [ 6141.921868] [<80776938>] (schedule) from [<8014ac8c>] (io_schedule+0x20/0x40)
May 17 18:48:11 kernel: [ 6141.921887] [<8014ac8c>] (io_schedule) from [<8021c4e4>] (__lock_page+0x100/0x128)
May 17 18:48:11 kernel: [ 6141.921906] [<8021c4e4>] (__lock_page) from [<8022bb78>] (write_cache_pages+0x30c/0x4d4)
May 17 18:48:11 kernel: [ 6141.921985] [<8022bb78>] (write_cache_pages) from [<7f4ba618>] (fuse_writepages+0x80/0xe4 [fuse])
May 17 18:48:11 kernel: [ 6141.922075] [<7f4ba618>] (fuse_writepages [fuse]) from [<8022e284>] (do_writepages+0x30/0x8c)
May 17 18:48:11 kernel: [ 6141.922094] [<8022e284>] (do_writepages) from [<802ba854>] (__writeback_single_inode+0x44/0x430)
May 17 18:48:11 kernel: [ 6141.922111] [<802ba854>] (__writeback_single_inode) from [<802bb148>] (writeback_sb_inodes+0x20c/0x4c4)
May 17 18:48:11 kernel: [ 6141.922126] [<802bb148>] (writeback_sb_inodes) from [<802bb490>] (__writeback_inodes_wb+0x90/0xd0)
May 17 18:48:11 kernel: [ 6141.922141] [<802bb490>] (__writeback_inodes_wb) from [<802bb714>] (wb_writeback+0x244/0x358)
May 17 18:48:11 kernel: [ 6141.922155] [<802bb714>] (wb_writeback) from [<802bc2ac>] (wb_workfn+0x36c/0x4d8)
May 17 18:48:11 kernel: [ 6141.922173] [<802bc2ac>] (wb_workfn) from [<801370d8>] (process_one_work+0x158/0x454)
May 17 18:48:11 kernel: [ 6141.922192] [<801370d8>] (process_one_work) from [<80137438>] (worker_thread+0x64/0x5b8)
May 17 18:48:11 kernel: [ 6141.922209] [<80137438>] (worker_thread) from [<8013d4a8>] (kthread+0x13c/0x16c)
May 17 18:48:11 kernel: [ 6141.922226] [<8013d4a8>] (kthread) from [<8010812c>] (ret_from_fork+0x14/0x28)
May 17 18:48:11 kernel: [ 6141.922313] INFO: task atop:671 blocked for more than 120 seconds.
May 17 18:48:11 kernel: [ 6141.922326]       Tainted: G         C      4.14.17-v7+ #1090
May 17 18:48:11 kernel: [ 6141.922335] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 18:48:11 kernel: [ 6141.922344] atop            D    0   671      1 0x00000000
May 17 18:48:11 kernel: [ 6141.922371] [<807762c0>] (__schedule) from [<80776938>] (schedule+0x50/0xa8)
May 17 18:48:11 kernel: [ 6141.922389] [<80776938>] (schedule) from [<80779a38>] (rwsem_down_read_failed+0x10c/0x15c)
May 17 18:48:11 kernel: [ 6141.922404] [<80779a38>] (rwsem_down_read_failed) from [<807790cc>] (down_read+0x5c/0x60)
May 17 18:48:11 rsyslogd-2007: action 'action 10' suspended, next retry is Thu May 17 18:49:41 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
May 17 18:48:11 kernel: [ 6141.922421] [<807790cc>] (down_read) from [<802f8468>] (proc_pid_cmdline_read+0xd4/0x4fc)
May 17 18:48:11 kernel: [ 6141.922441] [<802f8468>] (proc_pid_cmdline_read) from [<80289d74>] (__vfs_read+0x38/0x130)
May 17 18:48:11 kernel: [ 6141.922458] [<80289d74>] (__vfs_read) from [<80289f08>] (vfs_read+0x9c/0x168)
May 17 18:48:11 kernel: [ 6141.922473] [<80289f08>] (vfs_read) from [<8028a4b8>] (SyS_read+0x54/0xb0)
May 17 18:48:11 kernel: [ 6141.922489] [<8028a4b8>] (SyS_read) from [<80108080>] (ret_fast_syscall+0x0/0x28)
May 17 18:48:11 kernel: [ 6141.922655] INFO: task tmux:5348 blocked for more than 120 seconds.
May 17 18:48:11 kernel: [ 6141.922667]       Tainted: G         C      4.14.17-v7+ #1090
May 17 18:48:11 kernel: [ 6141.922676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 18:48:11 kernel: [ 6141.922685] tmux            D    0  5348      1 0x00000000
May 17 18:48:11 kernel: [ 6141.922711] [<807762c0>] (__schedule) from [<80776938>] (schedule+0x50/0xa8)
May 17 18:48:11 kernel: [ 6141.922727] [<80776938>] (schedule) from [<80779a38>] (rwsem_down_read_failed+0x10c/0x15c)
May 17 18:48:11 kernel: [ 6141.922741] [<80779a38>] (rwsem_down_read_failed) from [<807790cc>] (down_read+0x5c/0x60)
May 17 18:48:11 kernel: [ 6141.922758] [<807790cc>] (down_read) from [<802f8468>] (proc_pid_cmdline_read+0xd4/0x4fc)
May 17 18:48:11 kernel: [ 6141.922775] [<802f8468>] (proc_pid_cmdline_read) from [<80289d74>] (__vfs_read+0x38/0x130)
May 17 18:48:11 kernel: [ 6141.922790] [<80289d74>] (__vfs_read) from [<80289f08>] (vfs_read+0x9c/0x168)
May 17 18:48:11 kernel: [ 6141.922805] [<80289f08>] (vfs_read) from [<8028a4b8>] (SyS_read+0x54/0xb0)
May 17 18:48:11 kernel: [ 6141.922820] [<8028a4b8>] (SyS_read) from [<80108080>] (ret_fast_syscall+0x0/0x28)
May 17 18:48:11 kernel: [ 6141.922835] INFO: task rtorrent main:6557 blocked for more than 120 seconds.
May 17 18:48:11 kernel: [ 6141.922846]       Tainted: G         C      4.14.17-v7+ #1090
May 17 18:48:11 kernel: [ 6141.922855] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 18:48:11 kernel: [ 6141.922864] rtorrent main   D    0  6557   5539 0x00000000
May 17 18:48:11 kernel: [ 6141.922890] [<807762c0>] (__schedule) from [<80776938>] (schedule+0x50/0xa8)
May 17 18:48:11 kernel: [ 6141.922905] [<80776938>] (schedule) from [<8014ac8c>] (io_schedule+0x20/0x40)
May 17 18:48:11 kernel: [ 6141.922922] [<8014ac8c>] (io_schedule) from [<8021c4e4>] (__lock_page+0x100/0x128)
May 17 18:48:11 kernel: [ 6141.922937] [<8021c4e4>] (__lock_page) from [<8022bb78>] (write_cache_pages+0x30c/0x4d4)
May 17 18:48:11 kernel: [ 6141.923006] [<8022bb78>] (write_cache_pages) from [<7f4ba618>] (fuse_writepages+0x80/0xe4 [fuse])
May 17 18:48:11 kernel: [ 6141.923078] [<7f4ba618>] (fuse_writepages [fuse]) from [<8022e284>] (do_writepages+0x30/0x8c)
May 17 18:48:11 kernel: [ 6141.923095] [<8022e284>] (do_writepages) from [<8021e5c8>] (__filemap_fdatawrite_range+0xc0/0xfc)
May 17 18:48:11 kernel: [ 6141.923110] [<8021e5c8>] (__filemap_fdatawrite_range) from [<8021e6cc>] (filemap_write_and_wait+0x48/0x9c)
May 17 18:48:11 kernel: [ 6141.923178] [<8021e6cc>] (filemap_write_and_wait) from [<7f4b66e4>] (fuse_vma_close+0x20/0x24 [fuse])
May 17 18:48:11 kernel: [ 6141.923247] [<7f4b66e4>] (fuse_vma_close [fuse]) from [<8025b32c>] (remove_vma+0x40/0x6c)
May 17 18:48:11 kernel: [ 6141.923266] [<8025b32c>] (remove_vma) from [<8025d6f8>] (do_munmap+0x280/0x378)
May 17 18:48:11 kernel: [ 6141.923283] [<8025d6f8>] (do_munmap) from [<8025de2c>] (vm_munmap+0x68/0x88)
May 17 18:48:11 kernel: [ 6141.923299] [<8025de2c>] (vm_munmap) from [<8025de78>] (SyS_munmap+0x2c/0x30)
May 17 18:48:11 kernel: [ 6141.923315] [<8025de78>] (SyS_munmap) from [<80108080>] (ret_fast_syscall+0x0/0x28)
May 17 18:48:11 kernel: [ 6141.923324] INFO: task rtorrent disk:6560 blocked for more than 120 seconds.
May 17 18:48:11 kernel: [ 6141.923335]       Tainted: G         C      4.14.17-v7+ #1090
May 17 18:48:11 kernel: [ 6141.923344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 18:48:11 kernel: [ 6141.923354] rtorrent disk   D    0  6560   5539 0x00000000
May 17 18:48:11 kernel: [ 6141.923378] [<807762c0>] (__schedule) from [<80776938>] (schedule+0x50/0xa8)
May 17 18:48:11 kernel: [ 6141.923395] [<80776938>] (schedule) from [<80779a38>] (rwsem_down_read_failed+0x10c/0x15c)
May 17 18:48:11 kernel: [ 6141.923409] [<80779a38>] (rwsem_down_read_failed) from [<807790cc>] (down_read+0x5c/0x60)
May 17 18:48:11 kernel: [ 6141.923423] [<807790cc>] (down_read) from [<8077c758>] (do_page_fault+0xfc/0x3a4)
May 17 18:48:11 kernel: [ 6141.923438] [<8077c758>] (do_page_fault) from [<801011e4>] (do_DataAbort+0x48/0xc4)
May 17 18:48:11 kernel: [ 6141.923453] [<801011e4>] (do_DataAbort) from [<8077c1a4>] (__dabt_usr+0x44/0x60)
May 17 18:48:11 kernel: [ 6141.923460] Exception stack(0xaadf9fb0 to 0xaadf9ff8)
May 17 18:48:11 kernel: [ 6141.923470] 9fa0:                                     022f3b90 63f20000 64000000 8948b000
May 17 18:48:11 kernel: [ 6141.923482] 9fc0: 10b3d897 f866b788 2c11b2f3 668aec92 7689aaa0 917f0b46 9eec5b1a 210583e4
May 17 18:48:11 kernel: [ 6141.923492] 9fe0: 75f97cc0 75f97cc0 75f97d00 7689b608 200b0010 ffffffff
May 17 18:48:11 kernel: [ 6141.923511] INFO: task omxplayer.bin:10559 blocked for more than 120 seconds.
May 17 18:48:11 kernel: [ 6141.923522]       Tainted: G         C      4.14.17-v7+ #1090
May 17 18:48:11 kernel: [ 6141.923531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 18:48:11 kernel: [ 6141.923539] omxplayer.bin   D    0 10559  10297 0x00000000
May 17 18:48:11 kernel: [ 6141.923565] [<807762c0>] (__schedule) from [<80776938>] (schedule+0x50/0xa8)
May 17 18:48:11 kernel: [ 6141.923580] [<80776938>] (schedule) from [<8014ac8c>] (io_schedule+0x20/0x40)
May 17 18:48:11 kernel: [ 6141.923593] [<8014ac8c>] (io_schedule) from [<8021c4e4>] (__lock_page+0x100/0x128)
May 17 18:48:11 kernel: [ 6141.923608] [<8021c4e4>] (__lock_page) from [<8027abac>] (move_to_new_page+0x258/0x264)
May 17 18:48:11 kernel: [ 6141.923624] [<8027abac>] (move_to_new_page) from [<8027b764>] (migrate_pages+0x808/0x860)
May 17 18:48:11 kernel: [ 6141.923642] [<8027b764>] (migrate_pages) from [<80229f18>] (alloc_contig_range+0x190/0x364)
May 17 18:48:11 kernel: [ 6141.923658] [<80229f18>] (alloc_contig_range) from [<80284f08>] (cma_alloc+0xfc/0x288)
May 17 18:48:11 kernel: [ 6141.923676] [<80284f08>] (cma_alloc) from [<80534d54>] (dma_alloc_from_contiguous+0x48/0x4c)
May 17 18:48:11 kernel: [ 6141.923695] [<80534d54>] (dma_alloc_from_contiguous) from [<80115598>] (__alloc_from_contiguous+0x40/0xc8)
May 17 18:48:11 kernel: [ 6141.923709] [<80115598>] (__alloc_from_contiguous) from [<80115668>] (cma_allocator_alloc+0x48/0x50)
May 17 18:48:11 kernel: [ 6141.923723] [<80115668>] (cma_allocator_alloc) from [<80115844>] (__dma_alloc+0x1d4/0x32c)
May 17 18:48:11 kernel: [ 6141.923735] [<80115844>] (__dma_alloc) from [<80115a34>] (arm_dma_alloc+0x50/0x58)
May 17 18:48:11 kernel: [ 6141.923752] [<80115a34>] (arm_dma_alloc) from [<80643044>] (vchiq_prepare_bulk_data+0xf8/0x750)
May 17 18:48:11 kernel: [ 6141.923769] [<80643044>] (vchiq_prepare_bulk_data) from [<8063b300>] (vchiq_bulk_transfer+0x2e4/0x568)
May 17 18:48:11 kernel: [ 6141.923784] [<8063b300>] (vchiq_bulk_transfer) from [<8063ff10>] (vchiq_ioctl+0x4e0/0x1a14)
May 17 18:48:11 kernel: [ 6141.923799] [<8063ff10>] (vchiq_ioctl) from [<8029d3ac>] (do_vfs_ioctl+0xac/0x7c4)
May 17 18:48:11 kernel: [ 6141.923812] [<8029d3ac>] (do_vfs_ioctl) from [<8029db08>] (SyS_ioctl+0x44/0x6c)
May 17 18:48:11 kernel: [ 6141.923827] [<8029db08>] (SyS_ioctl) from [<80108080>] (ret_fast_syscall+0x0/0x28)
@pelwell
Copy link
Contributor

pelwell commented May 17, 2018

Please give an indication of how repeatable this issue is, and what you are typically doing when it occurs.

@mtausk
Copy link
Author

mtausk commented May 17, 2018

It occurs whenever I run rtorrent under tmux and when I run omxplayer. As you can see, I also tried to run atop to see what's happening, but it just did not run. The whole system becomes unresponsive and I am unable to kill -9 omxplayer. Simple reboot -f does not work either. I can only do emergency reboot via magic sysrq.

@lategoodbye
Copy link
Contributor

I think it looks more similiar to #2555. On which Raspberry Pi can you reproduce this issue?

@mtausk
Copy link
Author

mtausk commented May 18, 2018

Hardware : BCM2835
Revision : a02082

That's PI 3 B

@JamesH65
Copy link
Contributor

@mtausk Since this is on a 3B not a 3B+, I suspect this is unrelated to #2555, which does appears to be limited to the 3B+ model, which initially had some lockup issues.

From the log above, its possible the VC4 has died in the omxplayer case, since it has failed during a bulk transfer, but not conclusive.

I think it would be worth trying an rpi-update, which will get the very latest kernel and retesting. Exact circumstances of any failure, and the dmesg at the time would be useful.

@JamesH65 JamesH65 added the Waiting for external input Waiting for a comment from the originator of the issue, or a collaborator. label Jun 26, 2018
@mtausk
Copy link
Author

mtausk commented Jun 30, 2018

@JamesH65, thanks for your response.
I've just recreated the issue on new rpi-update. I just need to have rtorrent and omxplayer running together and it freezes within minute. One more note: I have hifiberry onboard.

Attaching dmesg and syslog.

Thanks.

dmesg.txt
syslog.txt
dmesg1.txt

@JamesH65
Copy link
Contributor

@mtausk Just got back on to this - when you say you are using rtorrent and omxplayer at the same time, are you using the Raspbian desktop and running these from separate terminal windows? Or some other way?

Are you actually downloading with rtorrent at the time?

@mtausk
Copy link
Author

mtausk commented Jul 19, 2018

It's run within tmux (terminal multiplexor similiar to screen). And the rtorrent has to be either downloading or doing checksum.

@JamesH65
Copy link
Contributor

I wonder if the system is running out of memory - seems to be dying trying to allocate.

@mtausk
Copy link
Author

mtausk commented Jul 20, 2018 via email

@dimovnike
Copy link

dimovnike commented Nov 6, 2018

I experience very similar issue (almost the same). A QT app runs in background, which plays video thru omxplayer (the PIOmxtextures library) and when i start writing on the sd card - exact same thing happens like in the description. I can reproduce the issue pretty quick, whithin 30 minute or so. The last firmware release that works fine is 1.20171029, all others have this issue. The problem I have are on PI2. Not sure if PI3 is affected. If you need any other info or tests, I can do it, just let me know.

@pelwell
Copy link
Contributor

pelwell commented Nov 6, 2018

The problem does seem to be a stalled CMA allocation. Can you try with "cma=64M" added to /boot/cmdline.txt?

@pelwell
Copy link
Contributor

pelwell commented Nov 6, 2018

Hmm - have you tried with a kernel newer than 4.14.73 (October 3rd)? A patch went in that changed the VCHIQ driver to use a pool of CMA buffers rather than continuously allocating and deallocating.

@dimovnike
Copy link

the last kernel I tried is 4.14.71 from the 1.20180924 release. But I will test the latest soon. I am testing now with "cma=64M" and will let you know soon how it goes.

@pelwell
Copy link
Contributor

pelwell commented Nov 6, 2018

I suspect it will run for longer, but will eventually die due to CMA fragmentation. sudo rpi-update will get you a 4.14.79 kernel.

@dimovnike
Copy link

So cma=64M helps a bit, but doesn't fix it completely. Will let you know how 4.14.79 goes.

@dimovnike
Copy link

dimovnike commented Nov 7, 2018

So 4.14.79 seems to fix the issue so far, but I can not do extensive tests because other issues get in the way:

  1. INFO: rcu_sched detected stalls on CPUs/tasks / rcu_sched kthread starved for 994 jiffies ...
  2. CPUx failed to come online (sometimes PI boots with less than 4 CPU's)

Both issues seem to happen only at boot (to be confirmed).

@pelwell , do you know what is the oldest raspberrypi/firmware commit that contains the fix you mentioned? I hope, may be an older commit will lack the new found bugs.

@dimovnike
Copy link

While doing more tests, i got a case where a PI did not boot and had the following:

  1. CPU 1 failed to come online
  2. task swapper hung
  3. rcu_sched kthread starved for 1002 jiffies

So the problem is not completely gone.

@pelwell
Copy link
Contributor

pelwell commented Nov 7, 2018

Although not listed in the commit message, I'm sure the fix is in the 4.14.74 release (rpi-update release Hexxeh/rpi-firmware@e0a0ba8).

@dimovnike
Copy link

So far the release fbad6408c4596d3d worked fine on these revisions:

Revision        : a02082 (PI3B)
Revision        : a21041 (PI2B)

But failed on this (hung tasks):

Revision : a020d3 (PI3B+)
@pelwell , can you suggest another commit to test ?

@lategoodbye
Copy link
Contributor

@dimovnike Could you please post the output in case of hung task? Is it always the same?

@dimovnike
Copy link

dimovnike commented Nov 7, 2018

attached is an example of PI3B+ output. Please note that some lines might not be in order (this is actually elasticsearch output)

log.txt

as for sameness I don't know yet, as it happened only once until now.

@pelwell
Copy link
Contributor

pelwell commented Nov 9, 2018

I can't think of a reason why 3B+ would behave any differently with respect to CMA allocations. If I'm not mistaken, that looks like a custom kernel you are running - can you try with an "official" firmware release?

@dimovnike
Copy link

I can't think of a reason why 3B+ would behave any differently with respect to CMA allocations. If I'm not mistaken, that looks like a custom kernel you are running - can you try with an "official" firmware release?

yes its a custom recompiled kernel but its almost the same (I will try with the official) however it might be the sdcard problem because I moved that card to a PI2 and now that PI is causing hung tasks. I think more tests are needed so I keep testing.

@pelwell
Copy link
Contributor

pelwell commented Nov 9, 2018

I really think the SD card and driver is an innocent bystander here - collateral damage - but there's no harm in trying another card (except your time).

@lategoodbye
Copy link
Contributor

@dimovnike In case you customized the kernel config, please attach it to this issue.

@dimovnike
Copy link

dimovnike commented Nov 9, 2018

@dimovnike In case you customized the kernel config, please attach it to this issue.

the config is not yet customized, its a standard config, just recompiled kernel (it will be customized later).

@dimovnike
Copy link

So after replacing SD card, hung tasks are gone, tested for 24+ hours on 6 PI's (PI2 and PI3).

Thanks for the help!

@dimovnike
Copy link

Can you confirm that these error messages are harmless? (from different PI's):

[ 12.344143] brcmfmac mmc1:0001:1: Direct firmware load for brcm/brcmfmac43455-sdio.bin failed with error -2
[ 12.572893] brcmfmac mmc1:0001:1: Direct firmware load for brcm/brcmfmac43430-sdio.bin failed with error -2
[ 1.037930] mmc-bcm2835 3f300000.mmc: could not get clk, deferring probe
[ 1.201103] sdhost-bcm2835 3f202000.mmc: could not get clk, deferring probe
[ 0.028322] uart-pl011 3f201000.serial: could not find pctldev for node /soc/gpio@7e200000/uart0_pins, deferring probe

@pelwell
Copy link
Contributor

pelwell commented Nov 10, 2018

Yes - the probe deferrals are an ugly but expected part of unordered device creation, and the firmware load failures are from the newer in-kernel mechanism, where Pis currently use hciattach.

@mtausk
Copy link
Author

mtausk commented Dec 17, 2018

My initial problem has gone away. Although, I don't know what helped, as it works for some time already and I also used cma=64M as suggested above.
I will try to remove cma settings and see if it was the firmware update that helped.
Anyway, it is good to see the device working without interruption.
Thank you!

@JamesH65
Copy link
Contributor

Closing this issue as questions answered/issue resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for external input Waiting for a comment from the originator of the issue, or a collaborator.
Projects
None yet
Development

No branches or pull requests

5 participants