OpenVG / videocore memory corruption #943
I just tried a clean install of the 2015-05-05 Raspbian and this version suffers from the same VC memory corruption bug as described above. Is there anyone more at home with VC debugging than me, willing to share a few pointers? Nature of the bug suggests it is related to core VC memory management. What sources would be candidates to look at in this case? |
Unfortunately the people who wrote the OpenVG server are gone. As OpenVG is not widely used, it's lower priority than, say, OpenGL or video. We do have access to the source code, so if there is a straightforward fix, then it may be possible. If you could find a trivial test app that fails in a short time (e.g. just doing the problematic operation in a tight loop) then it's more likely we could fix it. As far as debugging goes, if you add "start_debug=1" to config.txt, the build that runs will include assert logging. This may generate messages in the debug log (sudo vcdbg log assert) that, if we are lucky, may narrow down the problem. |
Thanks for commenting! Compliance testing for OpenVG is quite strict (I have seen no issues with the API as such), so it may be that the issue is more generic and applies to other APIs as well. As for the user base, this API is supported by Qt (which is reasonably popular), which has the same issue when run on the Pi. This is also the only HW accelerated (and quite sophisticated, if I may) 2D API available and makes for an excellent educational introduction to computer graphics. I believe the bug may have a trivial fix (finding it, however, is not trivial). Some nightly kernel builds (specifically: sudo apt-get 40185a95ac04ffcec406d9e1ef934406d7221939) work with no issues (one theory is an uninitialized variable that incidentally got the right value for this build). The skeleton of a test application would involve creating a number of OpenVG objects (such as font glyphs) during program initialization and then verifying the integrity of these objects against the source after a series of dynamic operations (such as create/destroy path). I am short on ideas, however, when it comes to verifying the objects (as they exist in VC memory) beyond visual inspection of characters as they appear on screen. A more indirect approach is to check for errors (vgGetError) after every API call, as this is likely to trigger once memory is corrupted. Going from an error return to “why” and “where”, however, may be tricky. I notice a number of threads get started when launching OpenVG. Their names are “VCHIQ completion”, “HDispmanx Notif”, “HTV Notify” and “HCEC Notify”. Is there a VC “garbage collector / memory manager” running in user space, or is memory management VC domain only? In short, I am struggling to get a handle on debugging this, and even when errors materialize, it doesn’t help me towards solving the bug without understanding the full architecture and having access to relevant sources. |
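The "check vgGetError after every API call" idea could be sketched roughly as below. This is only an illustration, not code from the thread: `CheckedVG` and `VGError` are hypothetical names, and `lib` is assumed to be any object that exposes the OpenVG entry points as attributes (for example a ctypes binding of the OpenVG library, which is itself an assumption here). The only OpenVG facts relied on are that `vgGetError` exists and that `VG_NO_ERROR` is 0.

```python
VG_NO_ERROR = 0  # value of VG_NO_ERROR in the OpenVG headers


class VGError(RuntimeError):
    """Raised when the wrapped library reports an error (hypothetical helper)."""


class CheckedVG:
    """Proxy that calls vgGetError() after every forwarded API call.

    `lib` is assumed to expose OpenVG functions as attributes; it is not
    part of the original thread.
    """

    def __init__(self, lib):
        self._lib = lib

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself,
        # i.e. for the OpenVG functions being forwarded.
        func = getattr(self._lib, name)

        def wrapper(*args):
            result = func(*args)
            code = self._lib.vgGetError()
            if code != VG_NO_ERROR:
                # Record which call tripped the error, and with what arguments.
                raise VGError(f"{name}{args!r} -> error 0x{code:04X}")
            return result

        return wrapper
```

Calling everything through such a proxy at least pins an error to the first API call that reports it, although, as noted above, that still leaves the "why" open.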
The openvg code on the arm side (/opt/vc/lib) is a pretty thin layer that remotes function calls to the GPU. It's unlikely the bug lies there. Memory allocations occur on the gpu. |
I’ve been running Jessie Lite (2015-11-21) since it was released. The application I’ve been using for longevity testing (run-forever is the goal) has a 10Hz graphics refresh rate with a combination of text and simple graphics (lines, squares, circles, … instrument dials). Unfortunately, the issues are the same as on Wheezy: corrupted glyphs after leaving it running for a few days (or sometimes a few hours). Regular checking with “vcgencmd cache_flush && sudo vcdbg reloc stats” is pretty consistent (fairly high reloc/alloc activity), but there is always plenty of space left (compaction counts remain zero). ARM memory footprint of the application is stable at around 1% of total. Glyphs get created at program start using the “vgSetGlyphToPath” function. They remain static for the lifetime of the running program. Paths get destroyed (vgDestroyPath) right after vgSetGlyphToPath in accordance with OpenVG 1.1 recommendations: “Applications are responsible for destroying path or image objects they have […]” For lack of better ideas, I tried compiling a version without destroying paths and this turned out to make a difference. I’ve not observed glyph corruption with vgDestroyPath (as used for glyphs) commented out, and it comes back when the call is included. So it appears there may be an issue with memory allocation / reference counting related to the vgSetGlyphToPath function. Any ideas on how to debug this? |
Using "vcdbg reloc small", I was able to check individual GPU memory fragments allocated for glyphs and it appears that reference counting is correct. That is, GPU memory buffers get incremented/decremented as expected - so back to start again. There are still no signs of glyph corruption (with the hack described in the previous post enabled), but I've since observed a few GPU dead-locks. My application is multi-threaded, but OpenVG is only used in a single thread, so a GPU dead-lock is not expected. The stack dump shows the following sequence leading up to the dead-lock: eglSwapBuffers(). Looking at other running threads, I can see that ILCS_HOST, HCEC_Notify, HTV Notify and HDispmanx Notif are all waiting in a call to do_futex_wait(). VCHIQ completion is waiting in a call to select(). The commands "vcgencmd cache_flush && sudo vcdbg reloc stats" suggest I have plenty of GPU memory available. I am still at a loss and don't quite know where to go from here. Anyone feeling inspired to chip in with knowledge that can help narrow down this GPU memory corruption/lock issue? |
To get any kind of traction you'll need to provide a test application, and accompanying instructions, that demonstrates the problem in as little time as possible. If we know that running something overnight is almost guaranteed to show the problem then it may get some attention, but bear in mind that it will be competing with issues which affect more people and require less dedication to investigate. |
Anything new here? Looks like other users have random crashes as well and it seems to be the same problem. I use the ajstarks lib as well... |
"Anything new here?" Not much I'm afraid - the error (GPU dead-lock and/or memory corruption) is still there in the latest Raspbian release build (Jessie lite May 27th, 2016). It really is a shame as it severely limits usage of the Pi beyond the academic scene. |
No sign of that, therefore little (if any) investigation will have been done within the firmware. |
I can provide my OSD code that causes the problem. The only issue with that is that sometimes it happens within 30 minutes or so and sometimes only after hours, so there really is no quick way to show it. |
Thought I would share an example image that shows GPU memory corruption. In this example, the 's' character has been corrupted. The program this is taken from creates 256 vector characters at startup and these remain static for the lifetime of the application. A number of other objects/shapes, however, are constantly allocated/destroyed. In OpenVG, vector fonts are created from path segments converted to glyphs. Rendering and memory management are all within the GPU, so an issue like this should not happen unless there is a BUG in the underlying library (possibly code running on the GPU). As explained in posts above, this may happen after leaving the application running for a couple of hours, a few days or even weeks. This makes it very hard to debug as there is no (known) way to create a sample that will fail in a predictable/useful manner. Knowing there are some smart people out there with much more knowledge of the Pi HW than myself – what other options exist for debugging this issue? |
Pictures don't help - there's no way to debug a picture. @SamuelBrucksch if you want to share some simple code with build instructions, then please do. |
You can find it here including installation instructions: |
Thank you. |
People reported that the more graphical elements there are, the higher the chances of a freeze. You can enable more elements in osdconfig.h. I will upload a telemetry file later that you can feed in via stdin so you can actually see something. |
I just checked and there is already a telemetry dump. So what you have to do is select LTM in the lower section of osdconfig.h and then, once you have built the OSD, run this command: while true; do cat raw_dump.txt; done | ./osd |
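For long soak runs it can help to know how many passes of the dump were fed in before the consumer stopped. A rough Python equivalent of the shell loop above, offered as a sketch only: `replay_in_loop` and its `max_passes` parameter are hypothetical and not part of the OSD project.

```python
import subprocess


def replay_in_loop(dump_path, cmd, max_passes=None):
    """Write the dump file to cmd's stdin repeatedly.

    Returns the number of complete passes written. Stops when the
    consumer exits (broken pipe) or after max_passes, if given.
    Note: if the consumer freezes without exiting, the write blocks.
    """
    with open(dump_path, "rb") as f:
        data = f.read()

    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    passes = 0
    try:
        while max_passes is None or passes < max_passes:
            proc.stdin.write(data)
            proc.stdin.flush()
            passes += 1
    except BrokenPipeError:
        pass  # consumer exited or crashed; stop feeding
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass
        proc.wait()
    return passes
```

With the files named in this comment, the call would be `replay_in_loop("raw_dump.txt", ["./osd"])`.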
@gitbf has your issue been resolved? If yes, then please close this issue. |
No, the issue remains unresolved. To the best of my knowledge this VC bug affects:
|
How long does it typically run before getting a problem? |
Can you try this test firmware: |
Certainly! I now have a test setup running with the firmware - what should I expect? |
Well I'm hoping no more corruption and hanging. |
Ok, let's see! @SamuelBrucksch Will you try it as well? |
Sure, but I can't try it within the next two weeks. However, some friends use my OSD and I will tell them that there might be a solution, so I think they will try it. |
See: raspberrypi/linux#1596 kernel: drm/vc4: vc4 loops support See: raspberrypi/linux#1597 kernel: Add cm3 dts file See: raspberrypi/linux#1595 kernel: net: ethernet: enc28j60: add device tree support kernel: enc28j60: Fix race condition in enc28j60 driver See: raspberrypi/linux#1385 firmware: platform: Don't swap audio L&R if using GPIOs 12&13 See: http://github.com/raspberrypi/linux/issues/1473 firmware: hdmi: Increase muting before resolution change firmware: board_info: Add cm3-specific dtb file firmware: Ensure extended part of vg_spath is zeroed See: raspberrypi/linux#943
Looking good so far. A high-load test has been running now for 4+ days without a glitch. This makes a difference. A rig is prepared this weekend that will feature on two exhibitions in September (5 Pi's will display instrumentation on 15 inch monitors and a 6th will display a CCTV LAN feed). @popcornmix thanks for your support! |
Cool. The current fix is not quite in the right place, but we do know what is wrong and where we want to fix it. The test firmware (and latest rpi-update firmware which includes the current fix) should make your code reliable, but it could theoretically occur elsewhere so we'd like to fix it at source. I'll leave the issue open and will ping this issue when there is a final fix. |
I have also tested the firmware some days now with Samuel's OSD code, seems stable. Thanks a lot. However, I noticed something different: Somehow it's possible to "overload" the GPU, when the OSD is running and I start another modified hello_font.bin process, the HDMI output seems to stop completely for a second. Monitor shows "no input signal" then. It seems to be dependent on "load", it doesn't happen anymore when GPU is overclocked. Or alternatively adding a usleep line to the OSD code also helps. |
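The thread does not show the actual usleep change, so the following is only a sketch of the general idea: pace the render loop to a fixed period instead of drawing as fast as possible. `run_paced` and `render_frame` are hypothetical names; the `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time


def run_paced(render_frame, period_s, frames,
              sleep=time.sleep, clock=time.monotonic):
    """Call render_frame() `frames` times, at most once per period_s seconds."""
    deadline = clock()
    for _ in range(frames):
        render_frame()
        deadline += period_s
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)      # give the rest of the period back
        else:
            deadline = clock()    # fell behind; do not try to catch up
```

For a 20 Hz overlay this would be called as `run_paced(draw_osd, 0.05, n)`, where `draw_osd` stands in for whatever function renders one frame.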
Read raspberrypi/firmware#407 When debugging this issue, I was running 4 instances of your
which stabilised things, but overclocking is not guaranteed to work on all Pi's. |
Thanks. After some optimizations (mainly adding usleep() lines to the loops that do the rendering) it seems to work stable now, even with standard clock settings. Were the problems I had mainly from different applications trying to use the GPU at the same time or is it "load" in general? In your testing, you needed 4 OSD processes to make the HDMI output stop. I assume if one would use a single application that draws 4 times as much, the problem wouldn't occur so easily? The command to show the layers in use (from the thread you linked) gives:
Layer -127 seems to be the framebuffer console, 0 is hello_video.bin decoding/displaying a 720p 5mbit h264 stream, 1 is Samuel's OSD, 2 is my modified hello_font.bin. |
Makes no difference if it's a single app or several. Just down to the number and cost of the layers. If you are not using the default framebuffer console then disabling or making it smaller is beneficial. |
This is weird. It seems to happen again now (with standard clock settings). My modified hello_font.bin just displays a line of text fading in and then fading out again:
But this really is "too much"? Samuel's OSD is just painting some lines and characters on the screen, I thought these GPUs are much more capable considering that there are much more complex 3D games being run on them? Or is the hello_video.bin already causing so much load? I also tried disabling the console, I think it helped (but need more testing). |
Nothing to do with the complexity of each layer - just the number and size. |
Layers are composed of pixels, not 3D primitives. Fetching a blank screen
takes as much time as a complex image.
|
Ah, okay, thanks. I guess that means the best thing would be to move the functionality of my hacked hello_font.bin into Samuel's OSD so that there is no additional application that creates an additional layer. Hmm, let's see if my non-existent C skills will allow for that ;) |
Hmm, now I just had a display freeze again. hello_video.bin and Samuel's OSD were running, but this time it was hello_video.bin that was in [defunct] state. Running vcgencmd also doesn't work anymore; it just sits there and ctrl-c does nothing. But the system is still working apart from that. Can I collect any other information you may need? |
OSD crashed again, after just 10 mins or so. But this time only the OSD froze. hello_video.bin is still displaying the video stream. Could it be that it is somehow related to the number of wifi sticks I'm using? USB or interrupt load or something? The last two times it crashed were with 4 and 5 wifi sticks for receiving. |
Now I sometimes see wrong renderings again. Just happens sometimes, not sure how often. Looks like it draws a big red triangle on the screen for a very short amount of time (one rendering probably, it draws every 50ms). Probably coming from the battery status element as that is the only red thing in the OSD. Difference is, that it's not permanent. Before, with the older firmware when the drawings got corrupted it looked very similar to the pic that gitbf posted and was permanent. |
Just booted up the system again, right before it would normally display the video image and OSD, the HDMI signal got interrupted again shortly. I can see, that: 408 tty2 Sl+ 0:00 /usr/bin/vcgencmd get_camera is running (or hanging) permanently. (a line in my script that checks for the cam to determine if the Pi is in the transmitter or the receiver role). Tried running vcgencmd get_camera again manually already expecting that it would just sit and hang there, but it's still working. tty2, the terminal that the OSD would run on shows "vchi_msg_dequeue -> -1(22)" Edit: when I remove that vcgencmd get_camera line from my script it works, tried 20 times in a row. |
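A startup script that must not block on a stuck helper command could wrap the call in a timeout. The sketch below is an assumption-laden illustration, not the script from this thread: `run_with_timeout` is a hypothetical helper, and a process that is stuck inside the kernel may not die even after `kill()`, so the function deliberately does not wait for it.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd; return its stdout as text, or None if it did not finish in time."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, text=True)
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()          # best effort; a process stuck in the kernel may ignore it
        proc.stdout.close()  # do not wait: the child may never exit
        return None
    return out
```

The camera check would then read `run_with_timeout(["vcgencmd", "get_camera"])`, with a `None` result treated as "firmware not responding" rather than hanging the boot sequence.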
I have added "tvservice -o && tvservice -p" to a startup script to make sure it's not that layer thing, but it's still flaky. I just booted up and everything looked good, then after 30 seconds or so the HDMI output suddenly dropped again briefly :( |
…flush/invalidate See: raspberrypi/linux#943 firmware: vchiq_arm: Fix return value of vchiq_initialise See: raspberrypi/userland#331
@gitbf I have now pushed a better fix for the issue. Available with rpi-update. |
Thanks! Just upgraded the test setup which by the way had been running 18+ days without a glitch prior to reboot. |
- kernel: config: Enable SENSORS_LM75 See: #508 - kernel: config: Enable SERIAL_SC16IS7XX See: raspberrypi/linux#1594 - kernel: snd-bcm2835: Don't allow responses from VC to be interrupted by user signals See: raspberrypi/linux#1560 - kernel: Merge many vc4 changes from drm-vc4-next-2016-07-15 See: raspberrypi/linux#1596 - kernel: drm/vc4: vc4 loops support See: raspberrypi/linux#1597 - kernel: Add cm3 dts file See: raspberrypi/linux#1595 - kernel: net: ethernet: enc28j60: add device tree support - kernel: enc28j60: Fix race condition in enc28j60 driver See: raspberrypi/linux#1385 - firmware: platform: Don't swap audio L&R if using GPIOs 12&13 See: http://github.com/raspberrypi/linux/issues/1473 - firmware: hdmi: Increase muting before resolution change - firmware: board_info: Add cm3-specific dtb file - firmware: Ensure extended part of vg_spath is zeroed See: raspberrypi/linux#943 - kernel: config: Enable SERIAL_SC16IS7XX_SPI See: raspberrypi/linux#1594 - kernel: Added Overlay for Microchip MCP23S08/17 SPI gpio expanders See: raspberrypi/linux#1566 - kernel: BCM270X_DT: Add audio_pins to CM dtb - kernel: BCM270X_DT: Don't enable UART0 in CM3 dtb - kernel: overlays: Add audremap overlay - kernel: overlays: Add swap_lr and enable_jack to audremap See: raspberrypi/linux#1473 - firmware: Raspi[Still|Vid]Yuv: Add option for just saving luma See: raspberrypi/userland#170 - firmware: RaspiVidYuv: Add option of saving RGB data - firmware: Only change I2C/GPIO pin functions when needed - firmware: platform: Redo the audio remapping logic See: raspberrypi/linux#1473 - kernel: overlays: added sc16is750 UART over I2C See: raspberrypi/linux#1617 - kernel: config: Add CONFIG_IPVLAN module See: raspberrypi/linux#1612 - kernel: config: Add CONFIG_VXLAN module See: raspberrypi/linux#1614 - firmware: platform: Make the default UART0 clock 48MHz for all Pis See: raspberrypi/linux#1601 See: #643 - firmware: cacheasm: Enable workaround for unwanted sdram write after flush/invalidate 
See: raspberrypi/linux#943 - firmware: vchiq_arm: Fix return value of vchiq_initialise See: raspberrypi/userland#331 - firmware: Revert temp: Ensure extended part of vg_spath is zeroed See: raspberrypi/linux#943 - firmware: MMAL: Support MMAL_ENCODING_xxx_SLICE formats - firmware: arm_display: Add bitmapped icons for warning conditions See: #367 - firmware: deinterlace: Avoid frame doubling with progressive frames See: http://forum.kodi.tv/showthread.php?tid=269814&pid=2412845#pid2412845 - kernel: config: Enable SENSORS_INA2XX module See: raspberrypi/linux#1632 - kernel: overlays: Add dpi18 overlay See: raspberrypi/linux#1634 - kernel: brcmfmac: do not use internal roaming engine by default See: http://projectable.me/optimize-my-pi-wi-fi/ - firmware: arm_display: Fix alpha of warning icons See: #367 - firmware: mmal: Advertise sliced formats in MMAL_PARAMETER_SUPPORTED_ENCODINGS - firmware: IL Resize: Accept strides greater than the minimum - firmware: vmcs_host: Poll for multiple dispmanx messages - firmware: VCHI clients: Poll for messages until empty See: https://discourse.osmc.tv/t/april-update-causes-system-freezes/15361/183 - firmware: tvservice/cecservice: Make unexpected messages more fatal - kernel: drm/vc4: Allow some more signals to be packed with uniform resets See: raspberrypi/linux#1636 - kernel: Rpi 4.4.y firmware kms See: raspberrypi/linux#1637 - firmware: tvservice/cecservice: We only care about unexpected message sizes - firmware: khronos: Avoid starting khronos service when vc4-kms-v3d is active - firmeare: arm_loader: Enable fake_vsync_isr when fkms overlay is used See: raspberrypi/linux#1637 - firmware: vcdbg: Add support for vchiq debugging
thanks! I had the same problem and this solves it. |
@gitbf okay to close? |
Yes, indeed. Thanks for sorting this one out - case closed! |
Ever since I got my hands on the first R-Pi released there have been issues with VC memory being corrupted when running OpenVG applications for a long time (hours/days). The application I run starts out by allocating a font (256 glyphs), and corruption becomes apparent when, after some time, a few glyphs become corrupted ("funny" characters appear on the display).
I have one Raspbian kernel, however (nightly build 40185a95ac04ffcec406d9e1ef934406d7221939), from a few weeks back (I'm using this with an R-Pi 2) that is rock solid. My theory is that the bug is caused by an uninitialized variable related to VC garbage collection and that this kernel incidentally got it right.
I would love to see someone take on the challenge to eradicate this long-term bug as it is a show-stopper (in terms of using OpenVG) for any non-trivial project.
See the last few posts in this thread for more detail:
https://www.raspberrypi.org/forums/viewtopic.php?f=69&t=96899