[Support]: OpenVINO detector crashing, hanging machine #8338

Closed
kevin-david opened this issue Oct 26, 2023 · 17 comments

@kevin-david
Contributor

kevin-david commented Oct 26, 2023

Describe the problem you are having

I'm having the same issue as #7607 with an OpenVINO detector using the i915 driver for an older Skylake GPU. When I restart Frigate, it does not come back online either. I haven't rebooted the machine yet in case this state is interesting.

00:02.0 VGA compatible controller: Intel Corporation Skylake GT2 [HD Graphics 520] (rev 07) (prog-if 00 [VGA controller])
	Subsystem: Microsoft Corporation Skylake GT2 [HD Graphics 520]
	Kernel driver in use: i915
	Kernel modules: i915

As far as I understand, the only driver is the one bundled with the installed kernel, which is:

Linux proxmox-surf 6.2.16-15-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-15 (2023-09-28T13:53Z)

Unfortunately, I don't have a Coral device to use. Are there any other diagnostics I can gather for this?

(side note, I don't recommend using old laptop hardware if you can help it - but I was hoping to make use of something sitting around - https://devopsx.com/intel-gpu-hang/)

It doesn't line up with the time, but I pulled the message from /sys/class/drm/card0/error in case it's interesting, because the GPU definitely hung on this before.

I can go play with more of the i915.* driver settings above, but was hoping someone may have better ideas before I do that. My current set of kernel params is just disabling power management:
GRUB_CMDLINE_LINUX_DEFAULT="debug intel_idle.max_cstate=1 i915.enable_dc=0 ahci.mobile_lpm_policy=1 i915.mitigations=off mitigations=off"

I think this may have been the initial crash from the kernel logs:

Oct 25 15:04:49 proxmox-surf kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Oct 25 15:04:49 proxmox-surf kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8ed97ff2, in frigate.detecto [2432]
Oct 25 15:05:02 proxmox-surf kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8ed97ff2, in frigate.detecto [2432]
Oct 25 15:05:02 proxmox-surf kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
Oct 25 15:05:02 proxmox-surf kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Oct 25 15:05:02 proxmox-surf kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
Oct 25 15:05:03 proxmox-surf kernel: i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip
Oct 25 15:05:03 proxmox-surf kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x191/0x340 [i915]
Oct 25 15:05:03 proxmox-surf kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
Oct 25 15:05:03 proxmox-surf kernel: i915 0000:00:02.0: [drm] frigate.detecto[2432] context reset due to GPU hang
Oct 25 17:59:14 proxmox-surf kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by __intel_gt_unset_wedged+0x213/0x240 [i915]
Oct 25 17:59:43 proxmox-surf kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by __intel_gt_unset_wedged+0x213/0x240 [i915]

I'm pretty sure this is a kernel/hardware issue, but figured I'd report here in case anyone else has seen this before and addressed it.
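For anyone comparing logs, here is a minimal Python sketch that pulls the `GPU HANG` entries out of kernel log lines like the ones above. The regular expression is an assumption based only on the line format shown in this issue, and the function names are hypothetical; other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Assumed format, taken from the log lines in this issue:
#   "... [drm] GPU HANG: ecode 9:1:8ed97ff2, in frigate.detecto [2432]"
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-f:]+), in (?P<comm>\S+) \[(?P<pid>\d+)\]"
)


def find_gpu_hangs(lines):
    """Return one (ecode, process name, pid) tuple per GPU HANG line."""
    hangs = []
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs.append((match["ecode"], match["comm"], int(match["pid"])))
    return hangs


def count_by_process(hangs):
    """Count hangs per (process name, pid) pair."""
    return Counter((comm, pid) for _ecode, comm, pid in hangs)


if __name__ == "__main__":
    # Two lines copied from the kernel log in this issue.
    sample = [
        "Oct 25 15:04:49 proxmox-surf kernel: i915 0000:00:02.0: [drm] "
        "GPU HANG: ecode 9:1:8ed97ff2, in frigate.detecto [2432]",
        "Oct 25 15:05:02 proxmox-surf kernel: i915 0000:00:02.0: [drm] "
        "Resetting chip for stopped heartbeat on rcs0",
    ]
    for ecode, comm, pid in find_gpu_hangs(sample):
        print(f"hang ecode={ecode} process={comm} pid={pid}")
```

Feeding it the lines of a saved kernel log makes it easier to see whether the same process and ecode recur across crashes.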

/sys/class/drm/card0/error output
root@proxmox-surf in /root/
# cat /sys/class/drm/card0/error
GPU HANG: ecode 9:1:8ed97ff2, in frigate.detecto [2432]
Kernel: 6.2.16-15-pve x86_64
Driver: 20201103
Time: 1698260689 s 3211 us
Boottime: 1137 s 104045 us
Uptime: 1130 s 369808 us
Capture: 4295176544 jiffies; 7236 ms ago
Active process (on ring rcs0): frigate.detecto [2432]
Reset count: 0
Suspend count: 0
Platform: SKYLAKE
Subplatform: 0x1
PCI ID: 0x1916
PCI Revision: 0x07
PCI Subsystem: 1414:0014
IOMMU enabled?: 0
DMC loaded: yes
DMC fw version: 1.27
RPM wakelock: yes
PM suspended: no
IER: 0x08080000
DERRMR: 0x2077efef
GT awake: yes
CS timestamp frequency: 0 Hz, 0 ns
EIR: 0x00000000
PGTBL_ER: 0x00000000
GTIER[0]: 0x09090909
GTIER[1]: 0x09090909
GTIER[2]: 0x00000000
GTIER[3]: 0x00000909
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000
  fence[3] = 00000000
  fence[4] = 00000000
  fence[5] = 00000000
  fence[6] = 00000000
  fence[7] = 00000000
  fence[8] = 00000000
  fence[9] = 00000000
  fence[10] = 00000000
  fence[11] = 00000000
  fence[12] = 00000000
  fence[13] = 00000000
  fence[14] = 00000000
  fence[15] = 00000000
  fence[16] = 00000000
  fence[17] = 00000000
  fence[18] = 00000000
  fence[19] = 00000000
  fence[20] = 00000000
  fence[21] = 00000000
  fence[22] = 00000000
  fence[23] = 00000000
  fence[24] = 00000000
  fence[25] = 00000000
  fence[26] = 00000000
  fence[27] = 00000000
  fence[28] = 00000000
  fence[29] = 00000000
  fence[30] = 00000000
  fence[31] = 00000000
FORCEWAKE: 0xffff0001
ERROR: 0x00000000
DONE_REG: 0x07ffffff
FAULT_TLB_DATA: 0x0000001a 0x97f8da8c
GTT_CACHE_EN: 0xf0007fff
rcs0 command stream:
  CCID:  0x00000000
  START: 0xfffc6000
  HEAD:  0x00002168 [0x00002110]
  TAIL:  0x000021c8 [0x00002170, 0x000021c8]
  CTL:   0x00003001
  MODE:  0x00000000
  HWS:   0xffffe000
  ACTHD: 0x00007f46 cdeab914
  IPEIR: 0x00000000
  IPEHR: 0x7105000d
  ESR:   0x00000000
  INSTDONE: 0xffdc7fff
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  SAMPLER_INSTDONE[0][1]: 0xffffffff
  SAMPLER_INSTDONE[0][2]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][1]: 0xffffffff
  ROW_INSTDONE[0][2]: 0xffffffff
  batch: [0x00007f47_2ac9f000, 0x00007f47_2acaf000]
  BBADDR: 0x00007f46_cdeab915
  BB_STATE: 0x00000020
  INSTPS: 0x00009080
  INSTPM: 0x00000000
  FADDR: 0x00007f46 cdeabac0
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x0000000175d43000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  ELSP[0]:  pid 2432, seqno       17:000123b4+, prio 0, head 00002110, tail 000021c8
  ELSP[1]:  pid 2399, seqno     572d:00000002+, prio 0, head 00000000, tail 000000b8
  hung: 1
  engine reset count: 0
  Active context: frigate.detecto[2432] prio 0, guilty 0 active 0, runtime total 201898106082ns, avg 7041216ns
rcs0 --- WA context = 0x00000000 ffffd000
:bjCpF0RsTe^YskXJJ^sQ"KZ!Q&dhS`LqllT!?`?sZm[=Qm@An]?KK(QB)qC@!P_os`:ll:@kq7R)@%s<J%I8GE?4Uo?-eS1;tBs2j<^9u3=TPkWD-/OLkpkEaG>Y.D36HC!!'P>
rcs0 --- HW Status = 0x00000000 ffffe000
:cL%-H+93"I5f!R3s$DrOKoLV:rjr^aAthNP05\iPZL):B$/sp6p](9ob+Jsg!!$Ub
rcs0 --- batch = 0x00007f47 2ac9f000
:?1"/,`E0Ej%k$3GQ3``i<9tW]"dMubTaE6F6kfc<YnG':?Bu?7dAZU&9#ckC2&Mc=:pDQh7HkL;@hGH?(WAk<*1hcUIWrDPrL)F7E9F^nAoH?f]tWcT;PcamX/_FpPU\eEDRiiKI4??1I7u,3]_uNunp]U?U(QHfe)qWdQbVPRDMO!3;*Q>CT.-di2M!FY/CQh`F*.D`3MHFa0Sb,&)aKJ<_.fOla_nj@<N:Q;>dOH6>&Bi.2S\;\>@6uY(Mj%?S=+js7Am3R-g#n]3MGt3n2a(bA^`<e%K*t2UGhm06We)gfC[Auf3T;nIi'M/&_+B5JS/iLE@a3H=O%rb@C-<TG`qOqmh;Dpg+p6(hA\_@X%RPa9Vd#""5Pg5Vt.%e@]FbLc2YaN[l^OgpeB(S7WJ;4KjFYtRd-oPfBL$JMQ3\2q`"OL1>;]_[-2@6P^?3ZDk);ZT@p%5LTUT>BE'7n;oS)!)/jX^R%p_Z6IGhuktf(eM2u0G=QVj2hfO@l:Z)-Q(\dp%V#e@%N""UI#X(Cb>T06@(d.i6Ip/EV5@Z-loFp3Lm-u7$p0rcSf=V-jp&TpJKV(#@ODrJ'hAFB7&i4ca^f0R:f)h$+WV,6]JZ&F=o6>oYVu:Qb6(d_#qmdfm?ge/"RsBV9Ih96loGmmIbLM</rBVGmr<I!Lf7jlM[6+JJW5ZK"T^;eFqLFn2e_uBich[pQn0Yn3k(i_^k:[i8_-Xm_1:9?M(G,Qf<L[WN#-\'"L(/Y2XPP)/2ZMA4"F'jT"2LPgj-h(L0oingL_\hHlPTBrX>YVGm@!_KrdK'`eGc'@ZhjTTrt$4SO7VtU]k?@Bj6uu_gp?jaS9%'qZsc/snc_88q2D6/ef-l8q8A$fr5LP]r.-NqnH%Q<T7>&1J)C)YP:u.Zq>UiobPUN-U%&;Q^.6I/VI-gll5pO0Du$a8rSDoV@l=2s&0lW$&0:t(qtRi(D18aUI"0c__>ebHa#$$*d>G-\s#t1HT-C'01dAn`#-k;N&rQ]FSCP-p\%'`3N8af\[Igc<-EDOtn"fs-VlI/Fg&K/_j7]q8U$MWMqU?NYOl<`_,:dV<$1RW&nb1PR_:;'Rn-6r2NN1f@Qg;2i#G3L&qtpE3kGgBa9)jh`nO`T2T?ktn9""[50RDWCr\Ze`ce-EUI0B=kp3=2bp1)iUf=V-jp&OCXJF>77OBkT>hAD:Q'=2q8K0.\>>Qs$N]@Vj'#M8NuB(6,f]DZ\-61=Asqii2Hf&#=0^YT9eD$Th?qr8H6pOchNmfq(0l?FH_)5FC]W"+%ErBYM>*clRWI1?)R0DFITJ%dTQc\BdiX%Y.lT+QK9TZ,F[1O+QP3Bt.\9E0qdG_>\)r'76&r]$3NIuJ"nr]"`r0@hku@K,a@p\3ZpD4UT0s1AUarV2bJpr4TJpP.$<^Cuf\30Kj4Zt2u3S2Pp?jtWb5q#]fqr8L1ag#'chr]m>ms#?:558_4Bo9;p\itfl!4[l9L$5:tiS9%'WYN@22kK#J7e34r*WAA2YrqASnoql9@M,)kg+$i-Opjhi[WW2)Ce'n/@VPj+^je()$j-h(Pd=XCEfHW2$ci5XOI6Xc)<;oRfs-T'QVQXj-rUc`Hq;Rk:>5sdF4_=J]We79m;]Tq/W*[19"Sd<BG_Z*D`,;@Ln-6r28\b:3/DgB!Pdc/'CNY*?%pAclbl+Onn,*?+RJVGVVsItVrI$0ap3?$!p0dVNi%Mjop&au<LYjk.+T9jL]J'Tjs-+(D6#R/T>Sqb!r1G;7s7nse.0%Eq^rO3'rHNtf8:N?R1YCU?jXbn#R(\LXrAj?75Mck<^Usg*or6l%rhBM#g\QL?&KfL=g#)?raXPQ*IV$.ja6ru1BJfa=q3h2V)`Zu8J*4Th4rZbQXY;h$HpRE(K*P?O:#ZEJAH)'C2#f7>iXaWAs*dm@`tHYIpr7B)j+bo.qSj*HrSU#qosDn0GO]^on0MeW>'6bBN&I[6%A6)Y-JLcRRN2@aA$u=r2LusW1YVl!!Y3!
-ia]1fho3Y65a/(u31T=6kf+YCki,ltQW')$a6nEOiei6CO.:8Ys*9lOHPO/kpVZcU^PX86T>h:H"Ec>P5braL*)C_TlPMd7lo'D2mh7kahBdJ4ZtAr[]Jb<_XXZ=1R3-8E.?8-?lF$Th\Gbk8jJ(#oIf<$us%E#I6N<uN48Ph(%fWCl@=QWT1&j@4dJkqRs6pcDn*1):2uZ,ML&XE6J,9hOJ,9<Vs1ha,Se;l&#Y+htnF6JNaO^1o:-\]5"Fp:L)uj76T>-s$T:_\^*C'*!nH](VqsY$VqV[d^&(^m/T\9"tq*CeR#<BF_TXk)-lp^+8s*OkI7.G*=_HQrscg'uf/a)kUJ'`@i+5R(urSUS"obh,Ue/b2Prh*3IO7*:EIfI;K5^3pYS9%'qpr<>oJQ_(5XYqR$Ihs[8K>3mgkT'X%pOgflH+ZEjMuR#_+ohDs*X)KS*74]s(mj9\JS"lAZ%[]&_e9X2m59CiLHS0)>j_L["@)X3]`/NaN-Xa\?sb_[i`c*Vr7ndUrL@prhm@$&n+d#(8,UA$J*RQ:\,>5O/cRPe"FnK*]`/Nad?qE@lpR0>lkH`)l&;nITlT]%4s%c=ZY/mZa8`,j>PQ31"Fgt/]`/Na=Nt(Flj21WhBpf<&+aaiHhMI(ms`ZRej&L,s)[biSseX9hoY!;rh2LOkSN\4`'Nn-Jfc6:@Qm<\U#J<%rDXe1oh>%JqOnT&oFp3Pp*3B#p&Nbf^#02=5ahZOaB\cDTO%Qs&Jr@bBR#U8^1Zh1?[pubq>C-sOoJ7c+7onYs'dn^s'UhKp^6JPrcf/bk@2H7G[XhNlb`g'Rf6,UG_4chs',q1g\d_q#1rErZkhQ\qVZPS6a"'fO7&4LoC)RL!$hKZ>`YQXkl1@me,Ji'+oU=fT<1llDsB+pZ+m4G]KlIhrSU)Shhe30pr6R2)aK)U4M*gAEKC->$pX]LhB;j^:[\OIJ)^N(&:V,5`!Ic/n\jMgr\9*kpu/Rd^KAXNJ%Q)brV2tOB7C^ZmfFArIO8ZV0oJ,CiNrl>M&"qHR9ne.`EiL'D#uq8c!C?/&[)/$-J*agr*4`s'<Z2C8\]brB<M9)+re;h&\.l7'N+nsI,gY;pRrqpqZXaTeBH]tQUA[j]t/eVIf:HYDZ2V^"CJ6AhuBq7EW:9La8P"E,PI':+8:N:s6_4Xs%p1EoEor+@Q1e]/IrBljN[Zrs/D'oqLfs2<6s>QWXp8%kS=]D`),s<K+B('.3:_P"ge6e6DO#7JQV[GMLQ$5ET@)O'?C-WK?sH!or6mnrh2pP5s?>3@Oc]]_5E@o]E',R!M9>C"eX\ua=Y`ar9u`HkF9C30-BZ!&V'+sf8)Xr^[cq/iXa0?qVWmKTB-5`#QM%BrBl'Z,eNoSpO)rsq$tn5,=5XZm/hd9"k<RF"4Zf.\bahF#l^SD_u@cbm18[Y]mdH/Ck$[cch7X2n0Yn3k(i_^SH&OIo"4p!:Xr8,`QLKH2eLb#%dea"i#_ppD\,M=^XiV^^&GMX^n9TiRNLMk&%m#E&3'MsN/%@g)aKV[beF,.5<I3bInu-_rI062L^sPiq>&lsF7G1ioDZOuZhh&%F$ol\Ae+d0JKnO/0ppSj_e9Y^30J4Vb[i[3nc[3e`!:i82?,<o(;KKSad;X>IXff!e,K211d3%^CIn9A\n1bfStYZG`AX388,^!fho7G*8+?YScg_%oN-;so^qHfAi^9L?s#12]^LnP5lh69tkl7k4qL;P9WVi0's-T%^nMP0SrUQ6Or:+R4NShjI%p=@ZlCdEJ<5BR3eI0!=TUua%kD].Rjqk$^pOG.dJT$&7OL0B2I4=jhGhD<[=rt$D^7"jtT<J.rrI<//#H-p"4+=o.bl++eH@F'\49#>#5CYd*5<h7O"@(rJ$6T:1r!Yg4c/Jd':\s"+%Njd6PQF(lIaDX*KD!&*@OaNcqqr"b\Vk'HXq1P?)"[dr#L+@J#/*7in3S5R3r/X%!8d^MW$_93(2jaK?iRbEao2"%-hkV
)q=*)@rP[^Glseg9oE;H!Q>[=aQRDNZ*5D#p2i'\'J)US]?hiVl"TK=XYP`@nCL=/TX?YPE*WGt2DMPFV$N>8F:EIXg/Im4CO!sR[nrG3i..WEIO!sR[nrG3inFDVgklLgD!.VT7
rcs0 --- ring = 0x00000000 fffc6000
:gV',%;2fGHDt!o1e?tsh=t6JDC@+d(XY$D'?MdQ1g=L+Eg%nh8?Il_lCVJB3&#J3HG`a0qG`Wsh(\gO_RPAL*jPY=)jO;?8QfdFG,gUOq=_5TI>eCGnI&##Q^T5rUY:Vi-(Ag_Y,sQ^SWVc5:kBsAu#jaNkbNn1=&`DH/EJ_uUD=bZ#4;$>OZ_-MukMq=KZ_qZ4oi<$e5f`]Y[ZgQ\ep7$HD7C$<kZp@"Nq.am5-3iE1YpsJ-s>tlmk-`k2LD->F.YP:kIM8O4NZq.E^"iDVM+%I#\SDqhl)t!d"-P>:LMP_Mqn[TQO3POH_D)Hkm?W(06AOI/<e^F\/RN']XT&s/):J4FI]so=1T$J)Th=bE\2X3f=\TudHfTp%IT=L_2`_+mY1G-iuQ4#,i8*i4[s@3KNm7U4bI#-XmUnJR4!@Fd$oC*msQ4i665KU4bHN(o=>UWbEG#M^!N*mkm?_rmrTf2I?W;@[!qa94>.j*KEp@-:M>CKd"-C3-a78i_sW25\9_KFh$B(SESaP_d$X^#:LMt+7GS5E>^T=\mY*%F85Entqk7`uS=[QUU!L:bo&@bS?Z[qbF(]b`]:i^%/l,lnd$NPqo&)(4?Z[j5aB1Mih$U?rjRs#6"+AP-QRPMnqk6aYN4[4J\ea1Eam9l5*rDX:>h>1nG=&._2j;r5d(]CIbM<8+-/4pprC#7ZCUqN8%CP!uTb,E(qt1Xb*I!i%g+L"2:L2B^Y]$].G=#^cXH)?DkWf'>U)UZZNSPj##0-(aF=n2QP']JR<8HZOp4dJK#Y/uip'o:&DWjtD>j:$pG=#0`UaNNBh9_G!U/,1NZl&4bGF(38(-A3/DWqJp1UhHV6M?Vq06@GC41r+^:n?DLgRKSH)96hDTk?s`Qb!R:]=b[Md,!LbguCR.*q"a<;pp!sgkG1[j1MppB\r02+*U9]_6:JE\eF_-Gs.r"4o0m)H_CG]a-54"kWq;#-a7,&+mt&:j1\j`h$WlHp[9MekXdk+PLqcg0_[o->SSWnG=!b8muuHTd("Fm:LM%&UYQ8H*W)>!*p?CpFHC[*a1;kDRRuiam1);+pQm'q!dgVcc6SV-T'0$7P:*SFHl^]9L`QQp!!'ID
rcs0 --- HW context = 0x00000000 fffaf000
:h!LVVFd,i9#P#jK'U&]2Et]1DfZ^cTfj,IGnoQT:jMJ=[G2mmd4_IBtlmAhRq*nZ?)mQb+=$Op,MeMK`:pG&4E1'*al[""jNAnNl8Yl)]l6_oQnq[e^o?9><k%DHB;b'nVm.[j9,O`?io/GgGH0:gtjV3M9eGn7R;IRPYj#s::D[/f<`\RB"L,6I#qsDk"<u1"W`uK*<EQ9&=gi:@S7NsZ@T6-W\6n(Nua[HsoaXF81<79UN9A>W/jKR'HYs3$^FG<C<Z%;IjP&jMUA\lcI\9Pj/*@uXY51(8;bFmO&f..rp7=OdA(-/,Bpb033LF`U6g).&*q5Jc"Erq-q?g%?WbPo`AanK<@X,s9N?6W;SA/T0IH[fPn)#nJ[BScJB2e,[$_9o^2:6mEtZ;Ze0ci/CT$faCubd/:.j,YGFCo870KnthX![J=A(hg`T4CI1OQkk3G,f%G+?rIK$)1tU4%%IrLS/,Nh%Ul[J@'bR2$Nln?K4oYj>LS#>at(3er>Oca%f[SNl#O6Qa11[R`uH2V!18#I.M4?d_+"J[-3cGC!gmN7bSjng52_WTiVTgKhduQpo0`KPLZK^u"W.a%&J(BH_aXteis-D<_^5P\dgDHV.U30coZCgUdM8@l_!X3R(PeRT_9=Q@q2X?PY[rJP>O!g\-dc0_f*u*L(GL!KS/-%`@5@Y;k6@Fl(*gP'I/`U+?u3*o-p+3]-!99gDm74I_GH.tr""+^_bRk>.Li"n,:a4qr1G6[P6rGr5;WSj1"*`hK&U%_K^Nm=GV8K*AT5G/[nD>J-b3J1ZJmuW!J]C7Xs:Q!lhtb:./eEc]YtT^j[_lnP)3L^N=^H*K8^4Y=og>I!urLm9JU">%5=c4E(RiW=bP!SL<LQTc5//p%WH]FU%<dc1;ak:i%#:>Xq\Ai>2cN4#JL\XE)"2.57ZI!DW&-[#M_'/i6T=8p^uJWSdPBA=l?3._p8A93`iPD8q3AI!B=?RYd;/AEhAp*-564:!gK>'5WouR<l[)>Q#*lh6X5T54`aZ`Ql*uf$=T,$0PQ57PCcO3K$5QMbVikX(3$g8?uptED%E-;_3PqB(BIWY"`U<S#1j;he],.YP_47)TD`T<9F^c?9F]<pVn[G&/n.e>47"kndf=]kD3GAj<*2a1]AbTApXKr/P>":#]eLT3\4Kpab!Fm#ZYt<6g0kr5JapPuYpi.R6cNC0i:$*$#PiqG0jaT&LN9nSSal:<ihMOkMA]7Pmb;FF4H8JH$tY=^Q$RmU#5u.#-a3:<J!\'`LA-l!UZU6_60@@gg`*K?H?2'LIdaiC:VaQr_UsVZmb>__\8);CTrkEes&Kq4R?H"oK0L]_\uOK:+-cpk"$L^1]W9,<Gcq2[i;<LKPTo;Fk$;%cqOcjl.CW*d/Z27-@iYK[QoP%!YTeF-Ec"bT;uG_:qD7<Y0XH5?j[K(nIdeR4$OJ:Nq7S_i^njKg+V[=lLkj=TJW;FY+r7%q'".g6b3Qm(W)n7,$d\F+I.<D\o>6GKgKO*oT%q^L&]_^$UcSW*%3/eE\/=.6*J\MI7'"L/f*"n4e)uUT#rj9iE1)GUF+_[P"RoLNqm-GgKUrDGn8gmtcX%*-Y`%h`J*rYOZN&^UbNH_+i:Q#;Nu/UIql_2A$`3j:A@l_X#VH<#[3ZVQEm!4&s6l?\bL1YjUs$rbITLgW&GNi]jc`58EMP,VMfIlf&_u!W#@h*aiuo;[mdZCEPb5(Y[Q[pbU6b=Ygde)"OhN@"c'n>Pe%UuPHO:Ogs&dZekD+A>?B@sAGAci<F^I-@,(@#PND!X"hn'2jWKK(aq<<CtjjTnpp*s3IoaDbMh\YV:]!Js"HVg,=nGft?:.#DLMD*HI4LTiaqXc]/Mq\J6o`L5E\FkRJ<S>.AjB]DK5geG%g1kck\GpZV=Q*:Z^G*1TYrL3Gohj\a25SPDMA;r3["c7P.B&eb>d3F[Y=ad;4fl6fD(f@_iiB@:qiRGOU?UK>l,OTHa,&U`U=Q3$*P)n16L-N
T*W<W$K3q=<.#guGBNfa^4XUErG?m'`oZ\+fn+0bWlaO>7G!"j(/r*HZ*Em)jm*A_ID>L>[?C(3#%HdK^6J&ahBF5:lrEh*4BSD5:pAPZ(^:Hc[dPWja8'?6-^7ij?SF3tFWZJJ(:q>pUmZ[ZaL"hfc4iji^72`&Rpt!#pd;PTCrL5[1D*\BGbu:;gY1.ZY/mSt`%HEZ@U2Dd(qK_UuPId?rXfEhWO(FoC'N&k6rGsaB_=dW"\nfj0GHDCW>_u9_7U=UAN1I/I%^5C^2*oL#-@Zt%SXlE;5ek3SjIL5'79s4SDc45)ZrI_hNKVIm4at(=hPfqoQ'UR8gZ.(2oss;W&fi[c%5f!kd/U6@N7TmoFGmPJ8R%U.rbI\a](eCO55tes^"NEerMsn365NX1M<<LY>##[t?0N-nlUUlBI+u8Kp#47>olnKof'E&kN^ZMArY4X>Fnk^kZ7=ed[aB<Q8P;KrXlS?*hd3u-A_)VhQW`%(CY..@Rak]%:T4.oc"JJq2;=fGm5t5kI^H[SdF@?_p:m5Rb416+osSjVR(jTmfic[KdDiLrL[68<mHckErYeb^eMb_G)R/*LCO26r54lS0qq#=T[&L7i;(YkX8CoHFcRd[2b[4AdXdMR@coMZAhOBr[kfu#ejn/FN]%6F)aVRs7%&h&7ETR@n:&UP(U#g\#8%Mko>*IK[SR"324qAFm/'t4?/i@eL9NRgclhOIY'hQ<r9-#P+>W,OsB0jjgaHOd@eTPB3@556-;b%suLij),\IU'u6Ug/p'H=,XK489"s2p>$h-I;]dV6e&nc.q,A'K#1.;M-N++1QF]sj`/$(b-LlY0EioiAJM=LJI'BA9bp]Xr;J<.9o4l6cP+0%XO&,#J\(Z?I3Ns#_1N'hf<(M@\t*MmkUaPHa"`O*&97rU!.j[6K-$F\17Ddo@O9PLgTYgX7]]j>&(AQDfQ\<9esSRiQg5W[Ca4.ZJ'.qR4p)44iqEl:[PqkP_.Q4'L/0l1qH^m8[dsF\d$tj;ZXMnl=GejD(]Y]=7*g;fKIN\Z7XY*ccFqlO?#QkH110a1bKLR]0@b0AGN:g,1sk?DJl?(SG&Q&ROL6aG6amRd,%\LQrT]jI7+/XsSK^NgcA_q*)5fNs3onPA<ufSpsJ.Bo8rRK)8i,rglK1h@F.^oGiYk`.$pO;80_JK79D`juHBf=e22InuY=0\aeoR/IajHmWt'diT8`PCA@C^\IT,gRR%uXWU]@MW;CdE[5!&W1lXVEo_ke.[q'Q5NR[KiDZ$puipf4%0?Q?Hj(`^YrSQ=4H6-:`ARCn@;Y*@!ch4I1>ggB0&($&$>GYAa>(!ap7X$)GW6JoqSmb(dAGIlYnf)]dW+&3(d^F""GkG8)0\!G`S,;86X0p7m\d[\UVm`nBmMPRrjX4?c](/>PQ2f+3.DD`fX&iu0G>meMiIA>(#i:?!"[email protected]",=HgSD)\pg=SDO[kMNZF[6eS)35k!nT_5,R?1M/]negn4IhZe=*^6)&>@cF_oeG:2amqLkh7H\S3ka7Z)V0'pd6:9,7q:"4L"#GhGG_i%DQbI7V^C,P/(Rph>,j49MmrA3l(<8pjj>>e=kg>/:)InQfBYMuk$ob2HTGNuI@9AK7N!RKSdjonMKK85)jn$/lSt36$1@c`TcA"3ODelsK\l3&][fTHQ*kkq(!(QYjjp=8m!E*_HeiR`X8$2QL"CAD8"76g]'rI%94)]9m57Pj]B5mRD-^3^r9N-&6$10Dl1ug"hh>?[:os#*I=#(_2e>++8A<pK1cjg0l!2kEKii=F8OE0?Wck_JVigGJCM)mRF7W1/;Y7#dr>.<=k8mp=&,AXm3aTpQp[r7tm\bNuDU86Sd!qMHVrddQ0G2@A([email 
protected]$(,8]Qo*a-).7LpE#"-Z`2FXfL>MN/H.OY"Y<(],*G]!E%>BLa=\EQ=39DQ;!8Dkg^+6YR&t8CC:?&lsKLDMt`^6GZR,]@;aBW_HS7FfCKXNWI`^:Du8]F),OQ>FF364^3(7\nm)$nMC:DjL8L1_A6`^7"jfu?A-OIY>SGfXcB00%$RA1'KMSh^j^Ksi`Y`^>ZCT+73!e!UD>IEF'qTNir?I'@09*-%\`V"bFls6Z'A%MfR]$bjC]n$!&]Y`Yjhqs)&eK5NftL#6<k!ZX:F$bjK5!T"SF(O^u*DX^rk?1n[d4lO;G]BfZ>Hbr9LJS6Ijm'9L`cafM_e$%"H'0Ze11gm'Z0QDZ&$U?$[?t4j`.>CM/_57r$:'ilg$S\!'0\M(R=i0;+J^+c[Zp&F4#$>?N*`A]T_+"JOir^-3J4r(5`sA%J%%&<U9IaFK&o\_@@5A=^ZNcp8L!1$8S/u(EI,GA7Yf26.V8JT-i%$E>kVfqcD]!Wn*qVu3[nNG72ufHASFTn4(%74=)NF*i1[olR;8&8"ILkTD2mUC\*-'"D`uK*>QhMJ>\_[20O+XX3?>B1:8c801HRs9VH'PQj3auY!*1FR,`uK*>EQ9&=D>hKp&r_)0n6Go2iNhA2pYRti<cA"Ls5&7^i9c96kPaTBs8D-Zs8:IJrX*M9YMWo'k%B+_C!.U*A+/OZMk9bq4o$^04=[SWk[[E_>UfWab:F/1)Tk8g*!p4q`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=`uK*>EQ9&=*-'"=qO`2V`Q*sF!!!!n
available engines: 0
slice total: 0, mask=0000
subslice total: 0
EU total: 0
EU per subslice: 0
has slice power gating: no
has subslice power gating: no
has EU power gating: no
Unavailable
graphics version: 9
media version: 9
display version: 9
gt: 2
memory-regions: 21
page-sizes: 11000
platform: SKYLAKE
ppgtt-size: 48
ppgtt-type: 2
dma_mask_size: 39
is_mobile: no
is_lp: no
require_force_probe: no
is_dgfx: no
has_64bit_reloc: yes
has_64k_pages: no
gpu_reset_clobbers_display: no
has_reset_engine: yes
has_3d_pipeline: yes
has_4tile: no
has_flat_ccs: no
has_global_mocs: no
has_gmd_id: no
has_gt_uc: yes
has_heci_pxp: no
has_heci_gscfi: no
has_guc_deprivilege: no
has_l3_ccs_read: no
has_l3_dpf: no
has_llc: yes
has_logical_ring_contexts: yes
has_logical_ring_elsq: no
has_media_ratio_mode: no
has_mslice_steering: no
has_oa_bpc_reporting: no
has_oa_slice_contrib_limits: no
has_one_eu_per_fuse_bit: no
has_pxp: no
has_rc6: yes
has_rc6p: no
has_rps: yes
has_runtime_pm: yes
has_snoop: no
has_coherent_ggtt: yes
tuning_thread_rr_after_dep: no
unfenced_needs_alignment: no
hws_needs_physical: no
has_pooled_eu: no
cursor_needs_physical: no
has_cdclk_crawl: no
has_cdclk_squash: no
has_ddi: yes
has_dp_mst: yes
has_dsb: no
has_fpga_dbg: yes
has_gmch: no
has_hotplug: yes
has_hti: no
has_ipc: yes
has_overlay: no
has_psr: yes
has_psr_hw_tracking: yes
overlay_needs_physical: no
supports_tv: no
has_hdcp: yes
has_dmc: yes
has_dsc: no
rawclk rate: 24000 kHz
Has logical contexts? yes
scheduler: 1f
i915.vbt_firmware=(null)
i915.modeset=-1
i915.lvds_channel_mode=0
i915.panel_use_ssc=-1
i915.vbt_sdvo_panel_type=-1
i915.enable_dc=0
i915.enable_fbc=1
i915.enable_psr=-1
i915.psr_safest_params=no
i915.enable_psr2_sel_fetch=yes
i915.disable_power_well=1
i915.enable_ips=1
i915.invert_brightness=0
i915.enable_guc=0
i915.guc_log_level=-1
i915.guc_firmware_path=(null)
i915.huc_firmware_path=(null)
i915.dmc_firmware_path=(null)
i915.memtest=no
i915.mmio_debug=0
i915.edp_vswing=0
i915.reset=3
i915.inject_probe_failure=0
i915.fastboot=-1
i915.enable_dpcd_backlight=-1
i915.force_probe=
i915.request_timeout_ms=20000
i915.lmem_size=0
i915.lmem_bar_size=0
i915.enable_hangcheck=yes
i915.load_detect_test=no
i915.force_reset_modeset_test=no
i915.error_capture=yes
i915.disable_display=no
i915.verbose_state_checks=yes
i915.nuclear_pageflip=no
i915.enable_dp_mst=yes
i915.enable_gvt=no
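Since the dump above is long, a small Python sketch like the following can pull out the handful of fields that are usually compared between reports. It assumes only the `Key: value` and `i915.name=value` line shapes visible in this dump; the function name and the choice of header keys are hypothetical, and the encoded blob sections are simply ignored.

```python
# Header fields to keep; chosen for illustration from the dump in this issue.
HEADER_KEYS = ("GPU HANG", "Kernel", "Platform", "PCI ID", "Reset count")


def summarize_error_state(text):
    """Return (header fields, i915 module parameters) from an error dump."""
    header = {}
    params = {}
    for line in text.splitlines():
        # Module parameters are printed as "i915.name=value".
        if line.startswith("i915.") and "=" in line:
            name, _, value = line.partition("=")
            params[name] = value
            continue
        # Top-level header fields are printed as "Key: value".
        key, sep, value = line.partition(": ")
        if sep and key in HEADER_KEYS and key not in header:
            header[key] = value.strip()
    return header, params


if __name__ == "__main__":
    # A few lines copied from the dump in this issue.
    sample = "\n".join([
        "GPU HANG: ecode 9:1:8ed97ff2, in frigate.detecto [2432]",
        "Kernel: 6.2.16-15-pve x86_64",
        "Platform: SKYLAKE",
        "PCI ID: 0x1916",
        "i915.enable_dc=0",
        "i915.enable_guc=0",
    ])
    header, params = summarize_error_state(sample)
    print(header)
    print(params)
```

On a live system the input would be the contents of `/sys/class/drm/card0/error`, read as root.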

Version

0.12.1-367D724

Frigate config file

detectors:
  ov:
    type: openvino
    device: AUTO
    model:
      path: /openvino-model/ssdlite_mobilenet_v2.xml

model:
  width: 300
  height: 300
  input_tensor: nhwc
  input_pixel_format: bgr
  labelmap_path: /openvino-model/coco_91cl_bkgr.txt

Relevant log output

2023-10-26 08:28:32.567195978  Abort was called at 349 line in file:
2023-10-26 08:28:32.567206908  ./shared/source/os_interface/linux/drm_neo.cpp
2023-10-26 08:28:32.567458049  Fatal Python error: Aborted
2023-10-26 08:28:32.567464583  
2023-10-26 08:28:32.567739698  Thread 0x00007f9cbf5126c0 (most recent call first):
2023-10-26 08:28:32.568538526    File "/usr/lib/python3.9/threading.py", line 312 in wait
2023-10-26 08:28:32.569452250    File "/usr/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
2023-10-26 08:28:32.570242394    File "/usr/lib/python3.9/threading.py", line 892 in run
2023-10-26 08:28:32.571659639    File "/usr/lib/python3.9/threading.py", line 954 in _bootstrap_inner
2023-10-26 08:28:32.572554908    File "/usr/lib/python3.9/threading.py", line 912 in _bootstrap
2023-10-26 08:28:32.572638681  
2023-10-26 08:28:32.572956562  Current thread 0x00007f9cda9ee740 (most recent call first):
2023-10-26 08:28:32.574168890    File "/usr/local/lib/python3.9/dist-packages/openvino/runtime/ie_api.py", line 387 in compile_model
2023-10-26 08:28:32.575222775    File "/opt/frigate/frigate/detectors/plugins/openvino.py", line 32 in __init__
2023-10-26 08:28:32.576224422    File "/opt/frigate/frigate/detectors/__init__.py", line 24 in create_detector
2023-10-26 08:28:32.577076981    File "/opt/frigate/frigate/object_detection.py", line 52 in __init__
2023-10-26 08:28:32.578028463    File "/opt/frigate/frigate/object_detection.py", line 98 in run_detector
2023-10-26 08:28:32.578899209    File "/usr/lib/python3.9/multiprocessing/process.py", line 108 in run
2023-10-26 08:28:32.579924191    File "/usr/lib/python3.9/multiprocessing/process.py", line 315 in _bootstrap
2023-10-26 08:28:32.580881800    File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 71 in _launch
2023-10-26 08:28:32.581830101    File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19 in __init__
2023-10-26 08:28:32.582765925    File "/usr/lib/python3.9/multiprocessing/context.py", line 277 in _Popen
2023-10-26 08:28:32.584059866    File "/usr/lib/python3.9/multiprocessing/context.py", line 224 in _Popen
2023-10-26 08:28:32.585022467    File "/usr/lib/python3.9/multiprocessing/process.py", line 121 in start
2023-10-26 08:28:32.586171794    File "/opt/frigate/frigate/object_detection.py", line 179 in start_or_restart
2023-10-26 08:28:32.587198099    File "/opt/frigate/frigate/object_detection.py", line 147 in __init__
2023-10-26 08:28:32.588047323    File "/opt/frigate/frigate/app.py", line 214 in start_detectors
2023-10-26 08:28:32.589339699    File "/opt/frigate/frigate/app.py", line 379 in start
2023-10-26 08:28:32.590234132    File "/opt/frigate/frigate/__main__.py", line 16 in <module>
2023-10-26 08:28:32.591943599    File "/usr/lib/python3.9/runpy.py", line 87 in _run_code
2023-10-26 08:28:32.592619363    File "/usr/lib/python3.9/runpy.py", line 197 in _run_module_as_main

Operating system

Proxmox

Install method

Docker Compose

Coral version

CPU (no coral), using OpenVINO/Intel detector with Skylake architecture.

Network connection

Wired

@NateMeyer
Contributor

I don't think I've seen that error before. Do you have any issue if you change the detector device to CPU?
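One way to try this outside Frigate is a small Python sketch that compiles the same model on each device in turn. It is only a sketch under assumptions: `try_compile` is a hypothetical helper, the model path is the one from the config above, and it relies on `Core.read_model`, `Core.compile_model`, and `Core.available_devices` from `openvino.runtime`. Note that a hard abort like the one in the log above (from `drm_neo.cpp`) would kill the process rather than raise a Python exception, so it may be safer to run this once per device in a separate process.

```python
import os


def try_compile(core, model_path, devices=("CPU", "GPU")):
    """Compile the model on each device; map device name -> error text or None."""
    model = core.read_model(model_path)
    results = {}
    for device in devices:
        try:
            core.compile_model(model, device)
            results[device] = None  # compiled without raising
        except Exception as exc:  # record ordinary failures and keep going
            results[device] = str(exc)
    return results


if __name__ == "__main__":
    # Path taken from the Frigate config in this issue; adjust as needed.
    model_path = "/openvino-model/ssdlite_mobilenet_v2.xml"
    try:
        from openvino.runtime import Core
    except ImportError:
        print("openvino is not installed in this environment")
    else:
        if not os.path.exists(model_path):
            print(f"model not found: {model_path}")
        else:
            core = Core()
            print("available devices:", core.available_devices)
            print(try_compile(core, model_path))
```

If CPU compiles and GPU does not, that points at the GPU driver stack rather than at the model or at Frigate itself.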

@kevin-david kevin-david changed the title [Support]: OpenVINO detector crashing [Support]: OpenVINO detector crashing, hanging machine Nov 13, 2023
@kevin-david
Contributor Author

kevin-david commented Nov 13, 2023

I will experiment with CPU - but my hunch is this device isn't fast enough to do it across 4 cameras. But we'll see!

Thought I had a fix with what I described in a comment here - echo 0 | sudo tee /sys/class/drm/card0/engine/rcs0/preempt_timeout_ms

But unfortunately that only lasted about a week before:

[drm] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8ed97ff2, in frigate.detect ... 
[drm] Resetting chip for stopped heartbeat on rcs0

Which hung the system. Found a similar issue: #5799
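For reference, the `preempt_timeout_ms` tweak mentioned above can also be read and applied from a short Python sketch. The sysfs path is the one quoted in this comment; the helper names are hypothetical. As far as I understand, a value written to sysfs does not survive a reboot, so it would have to be reapplied at boot, and as noted above it did not prevent the hang in the long run.

```python
from pathlib import Path

# Engine directory from the command quoted in this comment.
ENGINE_DIR = Path("/sys/class/drm/card0/engine/rcs0")


def read_preempt_timeout(engine_dir=ENGINE_DIR):
    """Return the current preempt timeout in milliseconds."""
    return int((Path(engine_dir) / "preempt_timeout_ms").read_text().strip())


def set_preempt_timeout(value_ms, engine_dir=ENGINE_DIR):
    """Write a new preempt timeout in milliseconds (needs root on a real system)."""
    (Path(engine_dir) / "preempt_timeout_ms").write_text(f"{value_ms}\n")


if __name__ == "__main__":
    if (ENGINE_DIR / "preempt_timeout_ms").exists():
        print("current preempt_timeout_ms:", read_preempt_timeout())
    else:
        print("no preempt_timeout_ms found under", ENGINE_DIR)
```

Calling `set_preempt_timeout(0)` as root would be the equivalent of the `echo 0 | sudo tee ...` command above.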

@martini1992

I have seen this before too, switched to OpenVINO on CPU and it's been running fine for months. I'll update when I have time to repro.

@markuznw

OK, but that's just a workaround. This is caused by some Intel GPU bug; trying the latest drivers/libraries (which are really old in the currently tracked Debian) might help.

I had the same problem, and the only real fix was to use a recent machine (Intel N100).

@martini1992

I agree, I switched because I needed it to just work and didn't have time to chase the issue down. For reference my hardware is an i5-8500t, I am using Debian bookworm currently but I believe the issues I had were before the release so they could be fixed now. Like I said, I'll update when I have the chance to try to reproduce the error.

@martini1992

Fingers crossed: I have been running in OpenVINO GPU mode since my last post with no crashes. I'm not running any special packages, just current Debian Testing.

@martini1992

Never mind, it crashed again, taking the system with it (eventually; it ground to a halt over a few minutes and I had to pull the power).

kern.log

2023-12-08T18:26:05.711278+00:00 server kernel: [7340131.565299] BUG: kernel NULL pointer dereference, address: 0000000000000270
2023-12-08T18:26:05.711292+00:00 server kernel: [7340131.565851] #PF: supervisor read access in kernel mode
2023-12-08T18:26:05.711293+00:00 server kernel: [7340131.566420] #PF: error_code(0x0000) - not-present page
2023-12-08T18:26:05.711294+00:00 server kernel: [7340131.566936] PGD 80000001d16ed067 P4D 80000001d16ed067 PUD 143373067 PMD 0 
2023-12-08T18:26:05.711294+00:00 server kernel: [7340131.567466] Oops: 0000 [#1] PREEMPT SMP PTI
2023-12-08T18:26:05.711295+00:00 server kernel: [7340131.567994] CPU: 4 PID: 158 Comm: kworker/4:1H Tainted: G     U             6.4.0-3-amd64 #1  Debian 6.4.11-1
2023-12-08T18:26:05.711296+00:00 server kernel: [7340131.568568] Hardware name: HP HP ProDesk 600 G4 DM/83EF, BIOS Q22 Ver. 02.20.01 07/28/2022
2023-12-08T18:26:05.711296+00:00 server kernel: [7340131.569191] Workqueue: events_highpri heartbeat [i915]
2023-12-08T18:26:05.711297+00:00 server kernel: [7340131.569954] RIP: 0010:__i915_gpu_coredump+0x223/0x760 [i915]
2023-12-08T18:26:05.711297+00:00 server kernel: [7340131.570716] Code: 44 24 08 85 c0 79 37 49 8b 74 24 08 48 8b 44 24 20 49 8d 54 24 18 48 8b 36 48 8b 48 20 4c 8b 40 28 48 8b 7e 08 48 8b 74 24 18 <44> 0f b7 8e 70 02 00 00 48 c7 c6 b8 67 4b c1 e8 49 df 2b d6 48 8b
2023-12-08T18:26:05.711298+00:00 server kernel: [7340131.571398] RSP: 0000:ffffa10dc052bcb0 EFLAGS: 00010286
2023-12-08T18:26:05.711298+00:00 server kernel: [7340131.572041] RAX: ffff93bc03c33c80 RBX: ffff93bc32ab7800 RCX: 0000000000000889
2023-12-08T18:26:05.711299+00:00 se

log stops dead there.

Obviously since the machine crashed I couldn't collect the GPU error log.

The server runs headless and the only tasks running on the GPU are the OpenVINO detector and the FFMPEG threads for each camera (6).

@kevin-david
Contributor Author

I will experiment with CPU - but my hunch is this device isn't fast enough to do it across 4 cameras. But we'll see!

Just to hopefully close this loop - beyond one weird, unrelated storage error, my machine has been stable and working with CPU-only detectors and GPU-accelerated ffmpeg for two straight weeks now.

14:31:27 up 14 days, 1:13, 1 user, load average: 0.94, 0.94, 0.96

So I think I'm going to just leave it at that.

I am vaguely curious if CPU OpenVINO would also solve the problem for me, like you said @martini1992? But I am wondering if it's worth the headache and further debugging of hangs honestly given this machine is remote to me.

Are you switching back to that? Does that have a discernible benefit over regular old CPU detectors (one per camera?)

@kevin-david
Contributor Author

P.S. Did you see this alternative solution from a related issue? #8470 (comment)

@martini1992

I hadn't seen that I'll check it out when I have time.

It seems to use slightly less CPU for me than the other detector, and I vaguely remember the inference speed being faster.
It's quick to test: literally just swap out device: AUTO or device: GPU for device: CPU in the standard OpenVINO config.

@FeatherKing

FeatherKing commented Dec 8, 2023

hi @kevin-david, i commented in the other thread #8470 . Moving my openvino model to yolov8 seems to have fixed my issues on my i7-7700 kaby lake. I am using this config now successfully for the last few hours

detectors:
  ov:
    type: openvino
    device: GPU
    model:
      path: /config/yolov8n/yolov8n.xml
model:
  width: 416
  height: 416
  input_tensor: nchw
  input_pixel_format: bgr
  model_type: yolov8
  labelmap_path: /config/yolov8n/coco_80cl.txt


github-actions bot commented Jan 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 8, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 12, 2024
@martini1992

Finally found the time to switch to yolov8n on GPU, I'll update if I have trouble.

@martini1992

Nope, went out for the day and it had crashed by the time I'd got home.

@nickp27

nickp27 commented Feb 6, 2024

I just discovered that I am having this issue on an i5 6500 Debian Book/Linux kernel 6.5 build. Exactly the same error in the logs and the same (random) frequency. Almost happy to see that it is a known issue. I will change OpenVINO to CPU for now and await a fix.

@roldengarm

hi @kevin-david, i commented in the other thread #8470 . Moving my openvino model to yolov8 seems to have fixed my issues on my i7-7700 kaby lake. I am using this config now successfully for the last few hours

detectors:
  ov:
    type: openvino
    device: GPU
    model:
      path: /config/yolov8n/yolov8n.xml
model:
  width: 416
  height: 416
  input_tensor: nchw
  input_pixel_format: bgr
  model_type: yolov8
  labelmap_path: /config/yolov8n/coco_80cl.txt

@FeatherKing I've used this and it worked well, thanks for that! But after upgrading to Frigate 0.14 it broke, as "yolov8" is not a valid model type. Valid values are ssd, yolox, yolonas.

For now I've switched to the config on the website. Would yolov8 be possible as well?

@sakcaj

sakcaj commented Nov 21, 2024

Same issue here, running on i5 6500. For me it's the same for both the default config from the website and on yolonas. I do not recall having such issues with 0.13.

I am running Frigate under Docker on the latest TrueNAS SCALE. I tried a few tweaks from the internet, but none have worked so far.

What makes me think it's not Frigate-specific but rather a host/OS issue is that the crash usually happened when I received a notification from one of the cams about a person/car in a zone, but yesterday it seems to have crashed while Jellyfin was being used, which runs on the same host as a container. Unless this was just a coincidence and there was a detection running on another cam, just not within any of the notification-configured zones...

8 participants