Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmal/h264: corrupt buffers returned by encoder #443

Closed
malc0 opened this issue Jun 22, 2015 · 13 comments
Closed

mmal/h264: corrupt buffers returned by encoder #443

malc0 opened this issue Jun 22, 2015 · 13 comments

Comments

@malc0
Copy link

malc0 commented Jun 22, 2015

Under particular circumstances, files saved from the h264 encoder's output do not decode cleanly using FFmpeg. A typical FFmpeg message for these frames would be

[h264 @ 0xf491a0] error while decoding MB 50 46, bytestream -10
[h264 @ 0xf491a0] Cannot use next picture in error concealment
[h264 @ 0xf491a0] concealing 2639 DC, 2639 AC, 2639 MV errors in P frame

though the "error while decoding" line may or may not be present.

This error seems almost exclusively to occur when multiple output buffers are allocated (though I have seen one instance of this error while using a single buffer) and it only seems to happen when an output frame is split across multiple buffers. A guess might be some locking bug.

I've made a minimal patch against RaspiVid.c that reproduces the issue (using a pool of 3 buffers), uploaded to https://gist.github.com/malc0/d7d3b0d90af0e39d335c . When running the resultant binary with -ss 2000 -t 0 -sp -o %d.h264 -td 1000,1000 -n within 100 files I get one or more with corruption (-ss 2000 I suspect simply makes the frames noisier and hence bigger: appropriate clips tend to be >1.5Mb, -n is necessary due to the change in connect_ports()). I'm detecting corruption using:

#!/bin/bash

for i in *h264; do
    echo -n $i
    gre=`ffmpeg -i $i  -c:v copy test.mp4 2>&1 | grep error`
    if [ "$gre" != "" ]; then
        echo " FAILED"
    else
        echo " ok"
    fi
    rm -f test.mp4
done

Do let me know if there's any more tests I can do, or pertinent information I can provide.

@6by9
Copy link

6by9 commented Jun 23, 2015

Do you have to change the connection on the input side to use I420, and return the buffers to your app? Or is just setting buffer_num on the output to >1 sufficient?

I'm a little surprised that fragmenting frames is an issue as that used to be the way the codec was always driven. Under Android we had to avoid it due to issues in the Android framework if you did so. The output data is just saved into a circular buffer and read out as appropriate to fill the buffers - the codec itself doesn't know any different. The buffer at the wrap point in the circular buffer will always get fragmented - IIRC that is at about 2MB. You can avoid that by setting the MMAL parameter MMAL_PARAMETER_MINIMISE_FRAGMENTATION (added due to the Android limitation above).

You may be able to provoke this more easily if you drop the buffer_size on the output port. As long as it is >=buffer_size_min then the system will run with it.

@malc0
Copy link
Author

malc0 commented Jun 25, 2015

I've found that the buffer_num change alone is sufficient, but without the other changes instances of corruption become less frequent: only 19 cases out of 1000, in one test.

I may have misunderstood, but setting the fragmentation parameter on encoder_output at the same time as MMAL_PARAMETER_VIDEO_ENCODE_INLINE_HEADER et al. are set resulted in the encoder jamming up, after some varying number of clips has been emitted (4 -- 20, typically). Nothing useful logged, even when using VC_LOGLEVEL=mmal:trace, regarding why it gets stuck though.

buffer_size is already equal to buffer_size_min and buffer_size_recommended (65536). Increasing buffer_size, arbitrarily to 262144, leads to zero corrupt clips (out of 1000), but of course there's probably no fragmentation going on either, except perhaps on I-frames.

@6by9
Copy link

6by9 commented Jun 26, 2015

OK on the mods. The only thing I can think of is that the IPC mechanism will be significantly more stressed with you pulling all the images back to the ARM before encoding.

Inline headers was never used with minimise fragmentation, so I guess may have some nasty lurking. I'm a little surprised as the mechanisms should be independent, but I haven't looked at the code. It will be an issue in the encode component rather than MMAL, hence the lack of any useful logging.

Thanks for checking on buffer_size - I couldn't remember what the settings are, and it was possibly a useful one.

Are you on a Pi 1 or Pi 2? And what firmware version and kernel version are you using?
I just wonder if it is a caching issue. You could try setting MMAL_PARAMETER_ZERO_COPY to MMAL_TRUE on the ports. That switches to using shared memory (via /dev/vcsm - just double check permissions on that) rather than copying every byte between ARM and GPU memory. Slight grasping at straws, but I do recall having a similar issue when developing the V4L2 driver as it needed an ARM L1 cache flush.
I have just checked and the circular buffer appears to be 2MB in size, so if it were wrapping around the end then I would only expect corruption at the 2MB mark. Your comment says >1.5MB which may or may not tally with that.

@malc0
Copy link
Author

malc0 commented Jun 30, 2015

Inline headers was never used with minimise fragmentation, so I guess
may have some nasty lurking. I'm a little surprised as the mechanisms
should be independent, but I haven't looked at the code. It will be an
issue in the encode component rather than MMAL, hence the lack of any
useful logging.

Understood; the below may make this irrelevant, but is there any way to get debugging passed back from the encoder, or is it mostly invisible from the ARM side (vcgencmd set_logging level=64 seems to mostly talk about the camera RIL)?

Are you on a Pi 1 or Pi 2? And what firmware version and kernel version
are you using?

This is on a Pi 2, with most testing done with kernel/firmware from raspberrypi/firmware/8aca5762f984f6decbeda294cee8418966c3d8d3, but also reproduces on an up-to-date Raspbian image, and with an rpi-update done last week.

Interestingly, I've now tried a Pi 1, running both 8aca576 and last week's firmware, and cannot reproduce the issue, even with I420 buffers bouncing through the ARM side. The issue is still present on a Pi 2 booted with nosmp.

I just wonder if it is a caching issue. You could try setting
MMAL_PARAMETER_ZERO_COPY to MMAL_TRUE on the ports. That
switches to using shared memory (via /dev/vcsm - just double check
permissions on that) rather than copying every byte between ARM and
GPU memory. Slight grasping at straws, but I do recall having a similar
issue when developing the V4L2 driver as it needed an ARM L1 cache
flush.

Seems like you might be on to something here, so far I haven't been able to hit it when using zero copy!

@6by9
Copy link

6by9 commented Jun 30, 2015

@pelwell You know more about caching differences on Pi2.
https://github.com/raspberrypi/linux/blob/rpi-4.0.y/drivers/media/platform/bcm2835/mmal-vchiq.c#L281 was the nasty call we had to add to the V4L2 driver. I thought VCHIQ dealt with it all, but that wasn't the case here. Is it definitely the case more generally from userspace?

logging 64 or 192 should give out more useful information from the codec, but the camera is also on that flag. 4 or 8 will dump out logging from the lower level codec, but that is unlikely to be of much use without source.

@pelwell
Copy link
Contributor

pelwell commented Jul 1, 2015

I have been looking at this today. By putting cache flushes into vchiq (which it has largely escaped up to now) I can make it slower, but my simple test is failing just as often with and without them. There must be some other factor at work.

@6by9
Copy link

6by9 commented Jul 1, 2015

Hmm. VPU side should be treating all of those buffers as direct alias addresses (mmal_server.c mmal_do_allocate_buffers() has MEM_FLAG_DIRECT).
I can't quite remember why we had to faff with caches in V4L2 when we haven't previously with other userspace apps or platforms, but do remember we got corrupted JPEGs otherwise. Possibly because we don't get to allocate the buffers in V4L2 (videobuf2 does) so can't set the caching up there? Does the ARM even support uncached access?

Zero copy should remove the use of bulk transfers completely which does move the goalposts somewhat.

Can you remind me (via email if necessary) of the Pi2 cache structure. Pi1 the ARM had a dedicated L1, and then went to SDRAM via the VPU L2.

@popcornmix
Copy link
Contributor

Pi1: ARM->L1->GertMMU->0x40000000 alias (so allocating/coherent with GPU L2)
Pi2: ARM->L1->L2->GertMMU->0xC0000000 alias (so direct to SDRAM)

@pelwell
Copy link
Contributor

pelwell commented Jul 2, 2015

Update: I can see anomalous cache behaviour using a test programme. The ARM is writing stale data back over bulk data transmitted from the VPU, and this despite attempts to invalidate the destination.

@pelwell
Copy link
Contributor

pelwell commented Jul 8, 2015

I've just pushed changes to the active kernel branches that make allowances for the increased cache line size on 2836, and include an extra cache invalidate after a bulk transmit from VC. There is a corresponding firmware patch, and when both are present it passes my tests.

The extra invalidate should mean that the additional flush in the MMAL code is unnecessary, but I've left it in place.

popcornmix added a commit that referenced this issue Jul 8, 2015
kernel: BCM270X_DT: I2S needs function Alt2
See: raspberrypi/linux#1046

kernel: vchiq_arm: Two cacheing fixes
See: #443

kernel: BCM270X_DT: Overlay for the Fen Logic VGA666 board

firmware: arm_loader: Increase stack and ensure icache flush is done before threads in execute multi

firmware: arm_loader: Switch to vpu queues and more profile logging

firmware: clocks: Allow arm to be overclocked to 1.6GHz

firmware: gpioman: Don't force all pin pulls to their defaults

firmware: arm_loader: Fix length on get palette mailbox call
See: raspberrypi/linux#1026

firmware: vchiq: Better error handling
firmware: vchiq: Make fragment size vary with cache line size
See: #443
popcornmix added a commit to Hexxeh/rpi-firmware that referenced this issue Jul 8, 2015
kernel: BCM270X_DT: I2S needs function Alt2
See: raspberrypi/linux#1046

kernel: vchiq_arm: Two cacheing fixes
See: raspberrypi/firmware#443

kernel: BCM270X_DT: Overlay for the Fen Logic VGA666 board

firmware: arm_loader: Increase stack and ensure icache flush is done before threads in execute multi

firmware: arm_loader: Switch to vpu queues and more profile logging

firmware: clocks: Allow arm to be overclocked to 1.6GHz

firmware: gpioman: Don't force all pin pulls to their defaults

firmware: arm_loader: Fix length on get palette mailbox call
See: raspberrypi/linux#1026

firmware: vchiq: Better error handling
firmware: vchiq: Make fragment size vary with cache line size
See: raspberrypi/firmware#443
@popcornmix
Copy link
Contributor

Firmware is rebuild with @pelwell fix. Not sure if it is the problem reported here, but worth running rpi-update and retesting.

@malc0
Copy link
Author

malc0 commented Jul 9, 2015

I'm delighted to say the issue no longer reproduces for me when using the latest firmware. Thank you very much for the work in producing a fix.

@malc0 malc0 closed this as completed Jul 9, 2015
@pelwell
Copy link
Contributor

pelwell commented Jul 9, 2015

No problem, and thanks for the feedback.

neuschaefer pushed a commit to neuschaefer/raspi-binary-firmware that referenced this issue Feb 27, 2017
kernel: BCM270X_DT: I2S needs function Alt2
See: raspberrypi/linux#1046

kernel: vchiq_arm: Two cacheing fixes
See: raspberrypi#443

kernel: BCM270X_DT: Overlay for the Fen Logic VGA666 board

firmware: arm_loader: Increase stack and ensure icache flush is done before threads in execute multi

firmware: arm_loader: Switch to vpu queues and more profile logging

firmware: clocks: Allow arm to be overclocked to 1.6GHz

firmware: gpioman: Don't force all pin pulls to their defaults

firmware: arm_loader: Fix length on get palette mailbox call
See: raspberrypi/linux#1026

firmware: vchiq: Better error handling
firmware: vchiq: Make fragment size vary with cache line size
See: raspberrypi#443
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants