[Config Support]: Whole Machine Crashing - looking for some tips #8470
Comments
There's no info provided here, so there is nothing to go off of. You first need to figure out why the machine is actually freezing (is it a memory issue, a kernel panic, etc.). |
Nothing is shown on the console. I'll check to see if there is anything in syslog. The last time I checked there was nothing - the whole machine was just frozen. |
It can happen for many different reasons. If no information can be provided, there's not much that can be done on the Frigate side. There are plenty of options, like having logs written to a file so the cause can be seen after restarting the machine. You can also try putting a memory limit on the Frigate container. |
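For anyone wondering what a container memory limit looks like in practice, here is a minimal sketch of a Docker Compose fragment. The service name, image tag, and the `4g` value are assumptions for illustration only and are not taken from anyone's setup in this thread; pick a cap that is comfortably below the host's physical RAM.

```yaml
# docker-compose.yml fragment (illustrative sketch, not a recommended setting)
services:
  frigate:                                        # assumption: service name
    image: ghcr.io/blakeblackshear/frigate:stable # assumption: image tag
    restart: unless-stopped
    mem_limit: 4g   # assumption: hard memory cap for the container
```

With a cap in place, a runaway process should get the container killed by the OOM killer instead of starving the whole host, which at least helps rule memory exhaustion in or out as the cause of the freeze.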
The next steps would be to back down frigate to a bare minimum config and slowly add parts back until you can see what is causing the issue. |
Thanks - I'm following the other thread also. I also added some more debugging to Ubuntu to see if I can capture anything in the logs before the freeze. Do you have any suggestions on where to start removing the config from? Is the hwaccel param a place to start? ffmpeg: |
This comment is not very helpful, but I had the exact same issue running the containers on Kubernetes (microk8s on Ubuntu). I ended up removing my Coral (M.2) and switching to CPU/VAAPI detection for now. It's been a few weeks without issue. It's a long shot, but it could be worth trying the same to rule it out. I have not gone back to the Coral, as I have only 1 camera doing detection and CPU usage is not high. |
I have a similar issue, running an i5-6500T, no external accelerator, and so far I've been able to ascertain the following:
Here's my config using three random camera feeds from the Internet that I use for debugging, currently the hardware acceleration for decoding/encoding is commented out:
I tried to grab a kernel crash dump via kdump, and also tried kernel netconsole (dmesg) logging to another server on the same network, but neither produced any output, which makes me think it's a driver issue that affects the CPU itself, not even a kernel crash. I'm running the beta2 image in docker-compose; the beta3 image has an issue with go2rtc failing to parse the camera feed URLs. If you have any ideas for further troubleshooting I could do, please let me know. |
@ggidofalvy-tc After roughly 12-24h the host crashes when using OpenVINO. I tried different drivers on the host, but it didn't help. Currently thinking about buying a Coral... |
I have the same issue on K3s on Debian with an i3-6100U, with VAAPI HW encode/decode + OpenVINO set up in the config. With object detection turned off it's rock solid. If I turn on object detection for a single object on a single camera, the whole node hangs within 3 days. Those of us using the official Helm chart can't update go2rtc or ffmpeg with custom versions. |
Adding onto my previous comment: Running Ubuntu 22.04, tried both the GA (5.15) and HWE (6.2) kernels, both exhibited the same crash behaviour. |
#8338 (comment) may be relevant with a couple suggestions (and other linked issue) |
@ggidofalvy-tc @madasus especially if your frigate machine is headless, I would recommend removing the often-default It's certainly suspicious that what I reported in #8338 is also using a i7-6600(U) / Skylake GPU - same generation as you both - wondering if there is a driver bug / hardware quirk that other generations don't have that the i915 driver isn't handling |
@kevin-david my host is headless, so I'll give this a try. Will the debug output then be written to syslog? How are you grabbing it? Can you point me to where you made this change in your Linux distro? (I'm using Ubuntu.) I'm glad I opened this thread, as it appears this is not an isolated problem - and while it's not a Frigate issue, it's likely something that Frigate exposes due to the load it puts on the underlying hardware/software. Thanks M |
@madasus sure - I am using proxmox, so it should be similar. In my case the message never appeared in syslog, only on the physically connected screen - I guess because the machine was hung, it wasn't able to be written to syslog. this might mean you need to temporarily connect a monitor to the machine. To do what I was talking about, you'll want to change This describes it a little more: https://askubuntu.com/a/19487. Again in my case, I removed |
I gave This is what I got in dmesg:
I'll keep a look out for more messages in the netconsole destination now that I rebooted again and set the |
I probably found a solution... I've been running a yolov8s model for a few days and it has currently been stable for >48h without any crash. Perhaps you can try this as well? |
@Pingbo can you share your model and detector config.yml snippets? Sorry for the mild derail, I would like to see if this might be a model-specific issue, not an OpenVINO-related one. Running the beta2 branch, since beta3 has issues with go2rtc with my config. I've been trying to get yolov8n/yolov8s running on my setup based on the notebook linked in this comment: #5184 (comment) But I keep getting an error when the detector starts up:
My config.yml bits, attempting to run the yolov8n model:
(all 3 output files are mounted /config/openvino-model, I'm reusing the labelmap from the original mobileSSD model used) |
When mine froze, I managed to check the console this time, and no messages at all were written to the console before the crash. @Pingbo can you elaborate on how to use the model you are suggesting? Is this being used instead of the Coral? |
That's how I have done it:
@madasus |
Thank you for the detail @Pingbo The only thing I would clarify for others is 1. you want to put all 3 files in the .zip file from the yolo model generation in the model folder, and 2. the files that were generated for me were yolo8n.xml, so make sure your file path is correct. Hopefully this is the fix. Edit: 2 weeks running with solid person detection using the yolov8n model on a single camera. Looks like CPU usage dropped significantly for me. Enabling it on the rest of my cameras now. |
@Pingbo Thank you for the help and the detailed instructions! I've been using yolov8n for nearly two weeks now without any crashing on beta2. I think the issue might indeed be caused by the combination of the bundled ssdlite_mobilenet_v2 model and Skylake-gen OpenVINO -- is this perhaps worth documenting somewhere? |
Wanted to chime in here. I'm a new Frigate user as of about two weeks ago. My hardware is an i7-7700 (Kaby Lake). I am running Frigate and wyze-bridge together. Wyze-bridge is correctly using Intel QSV with ffmpeg, and Frigate uses it fine with ffmpeg as well. However, if I tried to use any OpenVINO detector, it would crash the container every time. If I set the detector to cpu (not OpenVINO CPU), the container would start and detect fine. Today I followed these steps by @Pingbo and finally my OpenVINO detector starts with GPU selected. My inference speed went from 45ms (cpu) to 15ms (ov gpu). The only error I could make out from the container was Anyway, the yolov8 model from the above comment seems to have resolved my issue for now. I've been stable for a few hours (where previously I was unable to even start the containers). I will continue to monitor. (thanks @Pingbo !!) Edit: I am on Frigate version 0.12.1-367D724 |
I use unRAID, I don't know but probably |
Update on my 0.14 setup on a Skylake Intel: I managed to get it stable by setting the hwaccel to only use QSV instead of VAAPI. Been running for almost 3 weeks with 4 cameras with both h264 and h265 decode, with go2rtc and YOLO-NAS on OpenVINO. Here are the relevant config bits:
In the system metrics tab (and in intel_gpu_top), you should only see QSV jobs for hardware acceleration. If you still see VAAPI in the GPU usage graph, double-check your config (and make sure other services don't use VAAPI either). |
@vista- your config crashes my i5-6500T a couple hours later. |
Hello, similar issue here. When I use the GPU Statistics plugin, it also makes the GUI unresponsive. Hardware: AMD Ryzen 7900, 64 GB RAM, Coral TPU (USB). |
Without logs, dmesg, etc there's not much to go off of here. I have two Frigate instances running using my AMD 5700G iGPU without any issues. |
Here, from unraid:
|
Yeah, looks like issues on the host. Since yours is newer perhaps the older kernel that Unraid uses has issues, perhaps Unraid 7 with the newer kernel will improve things. |
Yes, issues on the host triggered by Frigate 0.14. 0.13 worked flawlessly with the same setup. |
Not sure how 0.14 would be the cause as the AMD GPU driver used is the same version as 0.13 as well as the same ffmpeg version. You could try updating to a newer ffmpeg version and see if that helps. https://docs.frigate.video/configuration/advanced#custom-ffmpeg-build |
Thank you for your feedback. I will try a few steps (including swapping ffmpeg for a newer version). |
So strange. I just stumbled upon this thread. I get an OS hang around once a month. I am using a Dell Optiplex full form factor. Only thing running is frigate in docker (latest beta). My measurement tools don't show anything erratic on CPU/GPU/MEM just before the hang, but I haven't yet grabbed any kernel logs. I'll have to do that next. The machine is running the latest Debian. I'll be trying to set quicksync instead of vaapi. Definitely annoying not knowing the cause. Latest crash I was only at around 75% memory usage. CPU is below 12% all the time. Doesn't ever deviate outside of 8-12%. It definitely seems like a GPU/CPU kernel thing. What should my SHM size be for 6 4k cameras? I have it set at 6gb I think. I'll have to double check. |
That's fine for 0.14. In 0.15 not as much is needed (it could just be 1gb). |
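For readers unsure where the SHM size is set, this is a minimal sketch of the relevant Docker Compose option. The service name, image tag, and the `1gb` value are placeholders based on the figure mentioned above, not a sizing recommendation for any particular camera count.

```yaml
# docker-compose.yml fragment (illustrative sketch)
services:
  frigate:                                        # assumption: service name
    image: ghcr.io/blakeblackshear/frigate:stable # assumption: image tag
    shm_size: "1gb"   # size of /dev/shm inside the container; adjust per setup
```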
In my case I had to switch OpenVINO from GPU to CPU and use QSV. No crash so far for 1 week. |
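A rough sketch of what that combination could look like in the Frigate config, for anyone who wants to try it. The detector name `ov` and the h264 preset are assumptions borrowed from other configs in this thread; this is not the commenter's actual config, and a model section is still needed as in your existing setup.

```yaml
# Frigate config fragment (hedged sketch: OpenVINO on CPU, QSV for decoding)
detectors:
  ov:                 # assumption: detector name
    type: openvino
    device: CPU       # run inference on the CPU instead of the iGPU

ffmpeg:
  hwaccel_args: preset-intel-qsv-h264   # assumption: cameras stream h264
```

The idea is to keep video decoding on the iGPU through QSV while taking OpenVINO inference off the GPU, so only one workload touches the graphics driver.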
Are you on 0.14 or dev? Dev has a newer driver and newer openvino that may help |
Dev, but I think I'm fine with the current config if it is stable. |
Any luck with a Raspberry Pi 4? I am not able to run many tests because it is an office server. |
this is stable for me with latest dev:
detectors:
  ov:
    type: openvino
    device: GPU
ffmpeg:
  hwaccel_args: ' '
|
Side Question for all of you: I have been running the yolov8s since Dec '23 without issue, however I have since upgraded and am no longer running on Skylake. One thing I've noticed with the yolo is that cars and pickups are detected fine, but box trucks or vans are not picked up. Do those running the yolov8n/s models have this behavior as well? |
When you say yolo what do you mean? UPS and FedEx are only supported with frigate+ model. If you are referring to the yolonas frigate+ model then I have no issues detecting these labels |
Are you running 0.13 or 0.12? |
I just spent a week debugging the amdgpu crash issue on my 6900HX-based system and finally have a setup that seems stable for more than 12 hours at a time. The key seems to be to install a recent version of mesa-libgallium in the container (and possibly upgrade mesa-va-drivers); I'm at 24.2.4-1~bpo12+1 for both. My specific working setup is:
I started with the same LXC container setup (debian 12 standard), with vanilla Frigate 0.14.1, and things would work great for about 12 hours before I'd get the amdgpu crash. The same would happen running everything through a VM (also with debian 12 standard) with GPU passthrough, but the amdgpu crash would happen on the VM instead of the host. I tried all kinds of kernel cmdline parameters, disabling the IOMMU, changing the iGPU VRAM setting in the BIOS, upgrading the PVE kernel, etc. The system was actually less stable with all of those changes. The single thing that seems to have brought consistent stability is the libgallium install with mesa-va-drivers upgrade. I suspect that this same tweak in docker in the VM would have fixed the issue there too. Also, side note, using an LXC container and passing through the USB coral is getting about 2x better inference speed (7.5 ms vs 15 ms) for me compared to running in a VM and passing through the port, as recommended by the docs. Anyway, I'm going to test a clean install to evaluate the minimum required change for stability, but if it ends up being an install of libgallium and an upgrade of mesa-va-drivers, is that something that you guys would be willing to push to the docker config? |
The problem with updating mesa drivers is it breaks older hardware, so we'd have to consider what that would look like |
@NickM-27 That makes sense; what about a variant of the container with a layer that just applies tweaks for Rembrandt and newer AMD GPUs that benefit from libgallium? I'm barely docker literate but can't we just slap on a layer that makes those tweaks on top of the standard image (i.e. it would be fairly trivial to maintain)? |
That's more effort for us to maintain, test, and deploy before every release. |
Still on 0.12. I've been playing around with trying to get yolov11 converted to OpenVINO. I think my issue is just the COCO models. Still confused as to why "car is listed twice because truck has been renamed to car by default." and what counts as a "truck", but that's probably a different discussion. |
Hi folks, just wanted to update that I had many freezes / crashes for a long, long time. I recently upgraded my Intel NUC from 4GB to 12GB RAM and all the freeze problems disappeared. |
I have migrated my whole Unraid-based container configuration (2x Frigate, 2x HDD, 2x TPU) to a new AMD-based PC (moving from a Core i7 14700K Intel box). I'm happy to report that it is suddenly a lot more stable (no crashes after 4 days versus crashing within hours). I wonder how many people reporting issues were on Core i5/i7/i9 Intel platforms versus AMD or other Intel variants (such as the NUC-based CPUs). |
I have k3s on debian 6.15 with an Intel Arc A380, and the whole node crashes after 30-60 minutes using version 0.14. Xeon server, not arm64. It's not ffmpeg's fault: I tried both hwaccel and cpu, and both yolo-nas and ov models, and I still get the whole node crash. I can't catch a kernel log because it's headless; even with KVM I can't see any log before the crash. When I turn off the detector I get no server freeze. The Arc is being used, as shown in Frigate system metrics (privileged container with sys_*). Shm size is 512 MB and the host has 128 GB, so it's not OOM. Update: this is still happening on 0.15.0-beta3. My config:
using stable image, and talking about tips - maybe this will be at least penny-valuable |
I'm not sure what Debian 6.15 is. Debian 12 with a 6.15 kernel? I think only 6.10 or 6.11 is supported. You might want to go over your config. I can run 15b3 stable in K3s 1.30 on a NUC with:
detectors:
  ov_0:
    type: openvino
    device: GPU
  ov_1:
    type: openvino
    device: GPU
model:
  model_type: yolonas
  width: 640
  height: 640
  input_pixel_format: bgr
  input_tensor: nchw
  path: /config/yolo_nas_s.onnx
  labelmap_path: /labelmap/coco-80.txt
ffmpeg:
  global_args: -hide_banner -loglevel warning
  hwaccel_args: preset-intel-qsv-h264
  output_args:
    detect: -f rawvideo -pix_fmt yuv420p
    record: preset-record-generic-audio-aac
detect:
  width: 640
  height: 480
  fps: 5
  enabled: true
  max_disappeared: 25
  stationary:
    interval: 25
    threshold: 50
go2rtc:
  streams:
    Entrance:
      - rtsp://user:[email protected]:554/cam/realmonitor?channel=1&subtype=0
    Entrance_sub:
      - rtsp://user:[email protected]:554/cam/realmonitor?channel=1&subtype=1
cameras:
  Entrance:
    enabled: True
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/Entrance_sub
          input_args: preset-rtsp-restream
          roles:
            - audio
            - detect
        - path: rtsp://127.0.0.1:8554/Entrance
          input_args: preset-rtsp-restream
          roles:
            - record
    objects:
      track:
        - person
|
Hi all, I'm also dealing with this issue, and I'm running out of solutions to try.. Any other ideas? My machine is an Intel J5005 with integrated graphics, running Debian 12 with 6.1 kernel and a PCIe Coral TPU. Frigate is running as Home Assistant add-on. The only way to get my system stable is by removing my cameras from frigate.yaml; as soon as I add one, the crashes start occurring every 6 to 24 hours. Things I've tried so far:
Still on my list to try are an older Frigate version (0.13?) and a yolo model for detection. Any other tips to try are very welcome! |
Describe the problem you are having
I have two Docker hosts and both have a Coral. I find that Frigate seems to cause the whole host to freeze completely (the console is not responsive) at frequent intervals - right now I would say on average every 48 hours, but it's not consistent. I've moved the Docker container to the other host and cleared out all the other containers, and the freeze follows Frigate.
It's likely Frigate is pushing the hosts much harder than any other container, and perhaps it's finding a bug somewhere in the hardware or OS. The devices are BeeLink devices running the latest Ubuntu.
Looking for some advice - has anyone seen this sort of behavior and identified the cause?
This has been happening for many months so it is not related to the beta Frigate or any particular Frigate (and likely this is NOT a Frigate bug)
Version
0.13 Beta 3
Frigate config file
Relevant log output
Frigate stats
No response
Operating system
Other
Install method
Docker Compose
Coral version
USB
Any other information that may be helpful
No response