From 7788145d2463cabfb9f2f676a455b1f1386d93f8 Mon Sep 17 00:00:00 2001
From: shi-su
Date: Mon, 15 Mar 2021 20:43:32 +0000
Subject: [PATCH 01/11] Doc for SAI failure handling

---
 doc/SAI_failure_handling/Framework.png | Bin 0 -> 12299 bytes
 .../SAI_failure_handling.md | 121 ++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 doc/SAI_failure_handling/Framework.png
 create mode 100644 doc/SAI_failure_handling/SAI_failure_handling.md

diff --git a/doc/SAI_failure_handling/Framework.png b/doc/SAI_failure_handling/Framework.png
new file mode 100644
index 0000000000000000000000000000000000000000..9f9a80d9d2fd1adc9bf9c0662ec1699337ec6aa3
GIT binary patch
literal 12299
[binary PNG data for Framework.png omitted]

+
+### 2.3 Failure handling functions in Orchagent
+#### 2.3.1 Failure handling functions
+To support a failure handling logic in general while also allow each orch to have its specific logic, we include the following virtual functions in Orch
+1. `virtual bool handleSaiCreateStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
+2. `virtual bool handleSaiSetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
+3. `virtual bool handleSaiRemoveStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
+
+The three functions handle SAI failures in create, set, and remove operations, respectively.
+With the type of SAI API and SAI status as an input, the function could handle the failure according to the two information.
+
+In the scenario where a specific logic is required in one of the Orchs, this design allows the Orch to inherit the function and include the specific login in the inherited function.
+
+#### 2.3.2 Possible execution results
+1. Return True -- No crash, no retry
+
+The failure handling function should return true when the failed SAI call does not require a retry after executing the funciton.
+This behavior should happen in two scenraios:
+
+* The failure is properly handled without need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).
+
+* The failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such scenario, the funciton should prevent orchagent from retrying and escalate the failure to upper layers.
+
+2. Return False -- No crash, retry
+
+The failure handling function should return true when the failed SAI call may be resolved in a subsequent attempt.
+
+3. exit(EXIT_FAILURE) -- Crash and trigger SwSS auto-restart
+
+Some of the failures can be resolved by restarting SwSS.
+In the scenario where such failures happens, the failure handling function in orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart.
+
+
+
+### 2.4 DB changes
+An ERROR_DB will be introduced to escalate the failures from orchagent to upper layers such as fpmsyncd.
+
+The schema of ERROR_DB is designed as follows: `is a counter needed?`
+```
+ERROR_{{SAI_API}}_TABLE|entry
+    "opcode": {{method}}
+    "status": {{sai_status}}
+    {{attr1}}: {{value1}}
+    {{attr2}}: {{value2}}
+    ...
+```
+
+An example ERROR_DB entry for route table and neighbor table in BGP error handling is available at https://github.com/Azure/SONiC/blob/master/doc/error-handling/error_handling_design_spec.md#3431-Error-Tables
+```
+ERROR_ROUTE_TABLE|prefix
+    "opcode": {{method}}
+    "nexthop": {{list_of_nexthops}}
+    "intf": ifindex ? PORT_TABLE.key
+    "status": {{return_code}}
+```
+
+```
+ERROR_NEIGH_TABLE|INTF_TABLE.name/ VLAN_INTF_TABLE.name / LAG_INTF_TABLE.name|prefix
+    "opcode": {{method}}
+    "neigh": {{mac_address}}
+    "family": {{ip_address_family}}
+    "status": {{return_code}}
+```
+
+# 3. Failure handling logic in orchagent
+### 3.1 Failure status that could be handled in orchagent
+| SAI status | Create | Set | Remove |
+|-----|-----|-----|-----|
+| ITEM ALREADY EXISTS | `Set the corresponding attribute instead?` | Should not happen. No retry. | Should not happen. No retry. |
+| ITEM NOT FOUND  | Should not happen. No retry. | `Create the item and set attribute?` | No retry.
+| OBJECT IN USE | Should not happen. No retry. | Retry for a few times | Retry for a few times |
+
+
+
+
+
+### 3.2 SAI API specific handling logic
+TODO: Add SAI API specific handling logic
+
+
+### 3.3 Orch specific handling logic
+TODO: Add Orch specific handling logic

From 2dae5183dade1c184fa293361e974b623658c43d Mon Sep 17 00:00:00 2001
From: shi-su
Date: Mon, 15 Mar 2021 23:27:57 +0000
Subject: [PATCH 02/11] polish doc

---
 doc/SAI_failure_handling/SAI_failure_handling.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md
index 990ca6b848..9d4b3edfbe 100644
--- a/doc/SAI_failure_handling/SAI_failure_handling.md
+++ b/doc/SAI_failure_handling/SAI_failure_handling.md
@@ -30,6 +30,8 @@ With the type of SAI API and SAI status as an input, the function could handle t
 
 In the scenario where a specific logic is required in one of the Orchs, this design allows the Orch to inherit the function and include the specific login in the inherited function.
 
+The function also allows an optional input `context`, which allow to pass context (e.g., object entry, attribute, etc.) into the function so that it could escalate the information to the ERROR_DB and upper layers.
+
 #### 2.3.2 Possible execution results
 1. Return True -- No crash, no retry
 
@@ -63,6 +65,15 @@ ERROR_{{SAI_API}}_TABLE|entry
     {{attr2}}: {{value2}}
     ...
 ```
+
+The tables in ERROR_DB correspond to the SAI API type (e.g., ERROR_ROUTE_TABLE, ERROR_NEIGH_TABLE, etc.), and the key of each entry corresponds to the entry of SAI failure.
+
+The field `opcode` indicates the method that failed.
+Possible values include `CREATE/SET/DELETE`.
+
+The field `status` saves the status of the SAI operation (e.g., SAI_STATUS_NOT_SUPPORTED, SAI_STATUS_FAILURE).
+
+The ERROR_DB also include a list of attributes and the corresponding values that the failed operation tries to set.
 
 An example ERROR_DB entry for route table and neighbor table in BGP error handling is available at https://github.com/Azure/SONiC/blob/master/doc/error-handling/error_handling_design_spec.md#3431-Error-Tables
 ```

From dd0cd9d927eae8409d283b248b6098e743bd040f Mon Sep 17 00:00:00 2001
From: shi-su
Date: Tue, 16 Mar 2021 00:39:57 +0000
Subject: [PATCH 03/11] Remove comments and add warm boot support

---
 .../SAI_failure_handling.md | 27 +++----------------
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md
index 9d4b3edfbe..ace8b98cbd 100644
--- a/doc/SAI_failure_handling/SAI_failure_handling.md
+++ b/doc/SAI_failure_handling/SAI_failure_handling.md
@@ -100,29 +100,6 @@ ERROR_NEIGH_TABLE|INTF_TABLE.name/ VLAN_INTF_TABLE.name / LAG_INTF_TABLE.name|pr
 | ITEM NOT FOUND  | Should not happen. No retry. | `Create the item and set attribute?` | No retry.
 | OBJECT IN USE | Should not happen. No retry. | Retry for a few times | Retry for a few times |
 
-
-
-
 ### 3.2 SAI API specific handling logic
 TODO: Add SAI API specific handling logic
 
@@ -130,3 +107,7 @@ TODO: Add SAI API specific handling logic
 
 ### 3.3 Orch specific handling logic
 TODO: Add Orch specific handling logic
+
+# Warm boot support
+A warm reboot should not be issued in the scenario with unhandled SAI failures.
+A check or ERROR_DB should be added to pre-warm-reboot check functions to prevent doing warm reboots with unhandled SAI failures.

From efd115f2b9784ca23c76748c7f5c53479413e4ef Mon Sep 17 00:00:00 2001
From: shi-su
Date: Tue, 16 Mar 2021 04:29:59 +0000
Subject: [PATCH 04/11] Fix grammar issues

---
 .../SAI_failure_handling.md | 20 +++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md
index ace8b98cbd..fc1e03b8ce 100644
--- a/doc/SAI_failure_handling/SAI_failure_handling.md
+++ b/doc/SAI_failure_handling/SAI_failure_handling.md
@@ -11,10 +11,10 @@ This document describes the high-level design for orchagent in handling SAI fail
 
 ### 2.2 An overview of the failure handling framework
 An illustrative figure of the failure handling framework is shown below.
-The orchagent generates SAI calls according to the information in APPL_DB given by upper layer.
+The orchagent generates SAI calls according to the information in APPL_DB given by upper layers.
 In the case of SAI failures, the orchagent gets the failure status via the feedback mechanism in synchronous mode.
 Based on the failure information, the failure handling functions in orchagent make the first attempt to address the failure.
-An ERROR_DB is also introduced to suppport escalation to upper layers.
+An ERROR_DB is also introduced to support escalation to upper layers.
 In the scenario where orchagent is unable to resolve the problem, the failure handling functions would escalate the failure to upper layers by pushing the failure into the ERROR_DB.
 
@@ -26,21 +26,21 @@ To support a failure handling logic in general while also allow each orch to hav
 3. `virtual bool handleSaiRemoveStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
 
 The three functions handle SAI failures in create, set, and remove operations, respectively.
-With the type of SAI API and SAI status as an input, the function could handle the failure according to the two information.
+With the type of SAI API and SAI status as an input, the function could handle the failure according to the two pieces ofinformation.
 
 In the scenario where a specific logic is required in one of the Orchs, this design allows the Orch to inherit the function and include the specific login in the inherited function.
 
-The function also allows an optional input `context`, which allow to pass context (e.g., object entry, attribute, etc.) into the function so that it could escalate the information to the ERROR_DB and upper layers.
+The function also allows an optional input `context`, which allows passing context (e.g., object entry, attribute, etc.) into the function so that it could escalate the information to the ERROR_DB and upper layers.
 
 #### 2.3.2 Possible execution results
 1. Return True -- No crash, no retry
 
-The failure handling function should return true when the failed SAI call does not require a retry after executing the funciton.
-This behavior should happen in two scenraios:
+The failure handling function should return true when the failed SAI call does not require a retry after executing the function.
+This behavior should happen in two scenarios:
 
-* The failure is properly handled without need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).
+* The failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).
 
-* The failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such scenario, the funciton should prevent orchagent from retrying and escalate the failure to upper layers.
+* The failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such a scenario, the function should prevent orchagent from retrying and escalate the failure to upper layers.
 
 2. Return False -- No crash, retry
 
@@ -49,7 +49,7 @@ The failure handling function should return true when the failed SAI call may be
 3. exit(EXIT_FAILURE) -- Crash and trigger SwSS auto-restart
 
 Some of the failures can be resolved by restarting SwSS.
-In the scenario where such failures happens, the failure handling function in orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart.
+In the scenario where such failures happen, the failure handling function in orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart.
 
 
 
@@ -73,7 +73,7 @@ Possible values include `CREATE/SET/DELETE`.
 
 The field `status` saves the status of the SAI operation (e.g., SAI_STATUS_NOT_SUPPORTED, SAI_STATUS_FAILURE).
 
-The ERROR_DB also include a list of attributes and the corresponding values that the failed operation tries to set.
+The ERROR_DB also includes a list of attributes and the corresponding values that the failed operation tries to set.
 
 An example ERROR_DB entry for route table and neighbor table in BGP error handling is available at https://github.com/Azure/SONiC/blob/master/doc/error-handling/error_handling_design_spec.md#3431-Error-Tables
 ```

From 93e54212845ea387cd99717e162c5ba9f9905b90 Mon Sep 17 00:00:00 2001
From: shi-su
Date: Wed, 17 Mar 2021 04:51:33 +0000
Subject: [PATCH 05/11] Update return value type

---
 .../SAI_failure_handling.md | 55 +++++++------------
 1 file changed, 21 insertions(+), 34 deletions(-)

diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md
index fc1e03b8ce..e75a7a3d1b 100644
--- a/doc/SAI_failure_handling/SAI_failure_handling.md
+++ b/doc/SAI_failure_handling/SAI_failure_handling.md
@@ -21,9 +21,9 @@ In the scenario where orchagent is unable to resolve the problem, the failure ha
 ### 2.3 Failure handling functions in Orchagent
 #### 2.3.1 Failure handling functions
 To support a failure handling logic in general while also allow each orch to have its specific logic, we include the following virtual functions in Orch
-1. `virtual bool handleSaiCreateStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
-2. `virtual bool handleSaiSetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
-3. `virtual bool handleSaiRemoveStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
+1. `virtual task_process_status handleSaiCreateStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
+2. `virtual task_process_status handleSaiSetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
+3. `virtual task_process_status handleSaiRemoveStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
 
 The three functions handle SAI failures in create, set, and remove operations, respectively.
 With the type of SAI API and SAI status as an input, the function could handle the failure according to the two pieces ofinformation.
@@ -33,20 +33,19 @@ In the scenario where a specific logic is required in one of the Orchs, this des
 The function also allows an optional input `context`, which allows passing context (e.g., object entry, attribute, etc.) into the function so that it could escalate the information to the ERROR_DB and upper layers.
 
 #### 2.3.2 Possible execution results
-1. Return True -- No crash, no retry
+1. Return `task_success` -- No crash, no retry, handled successfully.
 
-The failure handling function should return true when the failed SAI call does not require a retry after executing the function.
-This behavior should happen in two scenarios:
-
-* The failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).
+The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).
 
-* The failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such a scenario, the function should prevent orchagent from retrying and escalate the failure to upper layers.
+2. Return `task_failed` -- No crash, no retry, not handled successfully.
 
-2. Return False -- No crash, retry
+The failure handling function should return `task_failed` when the failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such a scenario, the function should prevent orchagent from retrying and escalate the failure to upper layers.
 
-The failure handling function should return true when the failed SAI call may be resolved in a subsequent attempt.
+3. Return `task_need_retry` -- No crash, retry
 
-3. exit(EXIT_FAILURE) -- Crash and trigger SwSS auto-restart
+The failure handling function should return `task_need_retry` when the failed SAI call may be resolved in a subsequent attempt.
+
+4. exit(EXIT_FAILURE) -- Crash and trigger SwSS auto-restart
 
 Some of the failures can be resolved by restarting SwSS.
 In the scenario where such failures happen, the failure handling function in orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart.
@@ -56,11 +55,12 @@ In the scenario where such failures happen, the failure handling function in orc
 ### 2.4 DB changes
 An ERROR_DB will be introduced to escalate the failures from orchagent to upper layers such as fpmsyncd.
 
-The schema of ERROR_DB is designed as follows: `is a counter needed?`
+The schema of ERROR_DB is designed as follows:
 ```
 ERROR_{{SAI_API}}_TABLE|entry
     "opcode": {{method}}
     "status": {{sai_status}}
+    "counter": {{count}}
     {{attr1}}: {{value1}}
     {{attr2}}: {{value2}}
     ...
@@ -71,35 +71,22 @@ The tables in ERROR_DB correspond to the SAI API type (e.g., ERROR_ROUTE_TABLE,
 The field `opcode` indicates the method that failed.
 Possible values include `CREATE/SET/DELETE`.
 
-The field `status` saves the status of the SAI operation (e.g., SAI_STATUS_NOT_SUPPORTED, SAI_STATUS_FAILURE).
+The field `status` stores the status of the SAI operation (e.g., SAI_STATUS_NOT_SUPPORTED, SAI_STATUS_FAILURE).
 
-The ERROR_DB also includes a list of attributes and the corresponding values that the failed operation tries to set.
-
-An example ERROR_DB entry for route table and neighbor table in BGP error handling is available at https://github.com/Azure/SONiC/blob/master/doc/error-handling/error_handling_design_spec.md#3431-Error-Tables
-```
-ERROR_ROUTE_TABLE|prefix
-    "opcode": {{method}}
-    "nexthop": {{list_of_nexthops}}
-    "intf": ifindex ? PORT_TABLE.key
-    "status": {{return_code}}
-```
+The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure.
+
+The ERROR_DB entry also includes a list of attributes and the corresponding values that the failed operation tries to set.
 
-```
-ERROR_NEIGH_TABLE|INTF_TABLE.name/ VLAN_INTF_TABLE.name / LAG_INTF_TABLE.name|prefix
-    "opcode": {{method}}
-    "neigh": {{mac_address}}
-    "family": {{ip_address_family}}
-    "status": {{return_code}}
-```
 
 # 3. Failure handling logic in orchagent
 ### 3.1 Failure status that could be handled in orchagent
 | SAI status | Create | Set | Remove |
 |-----|-----|-----|-----|
-| ITEM ALREADY EXISTS | `Set the corresponding attribute instead?` | Should not happen. No retry. | Should not happen. No retry. |
-| ITEM NOT FOUND  | Should not happen. No retry. | `Create the item and set attribute?` | No retry.
-| OBJECT IN USE | Should not happen. No retry. | Retry for a few times | Retry for a few times |
+| ITEM ALREADY EXISTS | Set the corresponding attribute instead? | Should not happen. No retry. | Should not happen. No retry. |
+| ITEM NOT FOUND  | Should not happen. No retry. | Create the item and set attribute? | Return success. No retry.
+| OBJECT IN USE | Should not happen. No retry. | Should not happen. Retry after a while. | Retry after a while. |
 
+TODO: Add handling logic for other SAI statuses.
 ### 3.2 SAI API specific handling logic
 TODO: Add SAI API specific handling logic
 

From 312e885c3c19f3e9506cfd10fcc86dbb8eac0309 Mon Sep 17 00:00:00 2001
From: shi-su
Date: Wed, 17 Mar 2021 04:53:42 +0000
Subject: [PATCH 06/11] fix a typo

---
 doc/SAI_failure_handling/SAI_failure_handling.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md
index e75a7a3d1b..953722f1f2 100644
--- a/doc/SAI_failure_handling/SAI_failure_handling.md
+++ b/doc/SAI_failure_handling/SAI_failure_handling.md
@@ -82,8 +82,8 @@ The ERROR_DB entry also includes a list of attributes and the corresponding valu
 ### 3.1 Failure status that could be handled in orchagent
 | SAI status | Create | Set | Remove |
 |-----|-----|-----|-----|
-| ITEM ALREADY EXISTS | Set the corresponding attribute instead? | Should not happen. No retry. | Should not happen. No retry. |
-| ITEM NOT FOUND  | Should not happen. No retry. | Create the item and set attribute? | Return success. No retry.
+| ITEM ALREADY EXISTS | Set the corresponding attribute instead. | Should not happen. No retry. | Should not happen. No retry. |
+| ITEM NOT FOUND  | Should not happen. No retry. | Create the item and set attribute. | Return success. No retry.
 | OBJECT IN USE | Should not happen. No retry. | Should not happen. Retry after a while. | Retry after a while. |
 
 TODO: Add handling logic for other SAI statuses.
@@ -95,6 +95,6 @@ TODO: Add SAI API specific handling logic
 ### 3.3 Orch specific handling logic
 TODO: Add Orch specific handling logic
 
-# 4. Warm boot support
+# Warm boot support
+# 4. Warm boot support
 A warm reboot should not be issued in the scenario with unhandled SAI failures.
 A check or ERROR_DB should be added to pre-warm-reboot check functions to prevent doing warm reboots with unhandled SAI failures.
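
Reviewer note, not part of the patch series: the handling table in section 3.1 (as of PATCH 06) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the orchagent implementation. `SaiStatus`, `TaskStatus`, `Op`, and `handleSaiStatus` are hypothetical stand-ins for `sai_status_t`, `task_process_status`, and the three `handleSai*Status` functions, chosen so the snippet compiles without SAI or SwSS headers. The sketch assumes that "No retry" in a should-not-happen cell maps to `task_failed`, and it does not model the two recovery actions in the table (set instead of create, create then set); those cells fall through to `Failed` here.

```cpp
// Hypothetical stand-ins; the real types come from the SAI and SwSS headers.
enum class SaiStatus { Success, ItemAlreadyExists, ItemNotFound, ObjectInUse, Failure };
enum class TaskStatus { Success, Failed, NeedRetry };
enum class Op { Create, Set, Remove };

// Maps a SAI status to a handling decision, loosely following the
// section 3.1 table. Recovery actions from the table are not modeled.
TaskStatus handleSaiStatus(Op op, SaiStatus status)
{
    switch (status) {
    case SaiStatus::Success:
        return TaskStatus::Success;
    case SaiStatus::ItemNotFound:
        // Removing an item that is already gone counts as handled.
        if (op == Op::Remove) {
            return TaskStatus::Success;
        }
        // Create: should not happen. Set: recovery not modeled in this sketch.
        return TaskStatus::Failed;
    case SaiStatus::ObjectInUse:
        // Create: should not happen, no retry. Set/Remove: retry after a while.
        if (op == Op::Create) {
            return TaskStatus::Failed;
        }
        return TaskStatus::NeedRetry;
    default:
        // Everything else is escalated (assumed to mean task_failed).
        return TaskStatus::Failed;
    }
}
```

A per-Orch override would keep the same shape and special-case only the statuses that Orch cares about before delegating to the general mapping.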
From f7efad19c2c483228f9e0968f73320e01d24b1f1 Mon Sep 17 00:00:00 2001
From: Shi Su
Date: Tue, 18 May 2021 16:06:42 +0000
Subject: [PATCH 07/11] Update ERROR_DB design

---
 .../SAI_failure_handling.md | 54 ++++++++++++-------
 1 file changed, 36 insertions(+), 18 deletions(-)

diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md
index 953722f1f2..3618417e44 100644
--- a/doc/SAI_failure_handling/SAI_failure_handling.md
+++ b/doc/SAI_failure_handling/SAI_failure_handling.md
@@ -1,26 +1,35 @@
-# 1. Scope
+# HLD for handling SAI failures
+
+## 1. Scope
+
 This document describes the high-level design for orchagent in handling SAI failures
 
+## 2. Failure Handling Framework
 
-# 2. Failure Handling Framework
 ### 2.1 Requirements for failure handling functions in orchagent
+
 1. Allow different handling for Create/Set/Remove operations.
 1. Allow each Orch to have its specific handling logic.
 1. Adapt handling logic based on SAI API type and SAI status.
 1. Escalate the failure to upper layers when the failure cannot be handled in orchagent.
 
 ### 2.2 An overview of the failure handling framework
+
 An illustrative figure of the failure handling framework is shown below.
 The orchagent generates SAI calls according to the information in APPL_DB given by upper layers.
 In the case of SAI failures, the orchagent gets the failure status via the feedback mechanism in synchronous mode.
 Based on the failure information, the failure handling functions in orchagent make the first attempt to address the failure.
 An ERROR_DB is also introduced to support escalation to upper layers.
 In the scenario where orchagent is unable to resolve the problem, the failure handling functions would escalate the failure to upper layers by pushing the failure into the ERROR_DB.
+
+ ### 2.3 Failure handling functions in Orchagent + #### 2.3.1 Failure handling functions + To support a failure handling logic in general while also allow each orch to have its specific logic, we include the following virtual functions in Orch + 1. `virtual task_process_status handleSaiCreateStatus(sai_api_t api, sai_status_t status, void *context = nullptr)` 2. `virtual task_process_status handleSaiSetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)` 3. `virtual task_process_status handleSaiRemoveStatus(sai_api_t api, sai_status_t status, void *context = nullptr)` @@ -33,6 +42,7 @@ In the scenario where a specific logic is required in one of the Orchs, this des The function also allows an optional input `context`, which allows passing context (e.g., object entry, attribute, etc.) into the function so that it could escalate the information to the ERROR_DB and upper layers. #### 2.3.2 Possible execution results + 1. Return `task_success` -- No crash, no retry, handled successfully. The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation). @@ -50,51 +60,59 @@ The failure handling function should return `task_need_retry` when the failed SA Some of the failures can be resolved by restarting SwSS. In the scenario where such failures happen, the failure handling function in orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart. - - ### 2.4 DB changes + An ERROR_DB will be introduced to escalate the failures from orchagent to upper layers such as fpmsyncd. The schema of ERROR_DB is designed as follows: + ``` -ERROR_{{SAI_API}}_TABLE|entry - "opcode": {{method}} +ERROR_{{TABLE_TYPE}}_TABLE|entry + "failed_orch": {{failed orch type}} + "failed_SAI": {{failed SAI type}} + "opcode": {{failed SAI operation type}} "status": {{sai_status}} + "attributes": "attr_type0,attr_type1,..." 
+ "attr_values": "attr_value0,attr_value1,..." "counter": {{count}} - {{attr1}}: {{value1}} - {{attr2}}: {{value2}} - ... ``` -The tables in ERROR_DB correspond to the SAI API type (e.g., ERROR_ROUTE_TABLE, ERROR_NEIGH_TABLE, etc.), and the key of each entry corresponds to the entry of SAI failure. +The table and key in ERROR_DB correspond to the table and key in APPL_DB where SAI failures happen (e.g., SAI failure happens when conducting operations for APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, the corresponding key in ERROR_DB should be `ERROR_ROUTE_TABLE:0.0.0.0/0`). + +The field `failed_orch` indicates the type of orch where the SAI failure happens. -The field `opcode` indicates the method that failed. +The field `failed_SAI` indicates the type of SAI in which the SAI failure happens. + +The field `opcode` indicates the method that failed. Possible values include `CREATE/SET/DELETE`. The field `status` stores the status of the SAI operation (e.g., SAI_STATUS_NOT_SUPPORTED, SAI_STATUS_FAILURE). -The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure. +The ERROR_DB entry also includes a list of attributes (comma separated) and the corresponding values (comma separated) that the failed operation tries to set. -The ERROR_DB entry also includes a list of attributes and the corresponding values that the failed operation tries to set. +The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure. +## 3. Failure handling logic in orchagent -# 3. Failure handling logic in orchagent ### 3.1 Failure status that could be handled in orchagent + | SAI status | Create | Set | Remove | |-----|-----|-----|-----| | ITEM ALREADY EXISTS | Set the corresponding attribute instead. | Should not happen. No retry. | Should not happen. No retry. | -| ITEM NOT FOUND  | Should not happen. No retry. | Create the item and set attribute. | Return success. 
No retry. +| ITEM NOT FOUND  | Should not happen. No retry. | Create the item and set attribute. | Return success. No retry. | OBJECT IN USE | Should not happen. No retry. | Should not happen. Retry after a while. | Retry after a while. | -TODO: Add handling logic for other SAI statuses. +TODO: Add handling logic for other SAI statuses. ### 3.2 SAI API specific handling logic -TODO: Add SAI API specific handling logic +TODO: Add SAI API specific handling logic ### 3.3 Orch specific handling logic + TODO: Add Orch specific handling logic -# 4. Warm boot support +## 4. Warm boot support + A warm reboot should not be issued in the scenario with unhandled SAI failures. A check or ERROR_DB should be added to pre-warm-reboot check functions to prevent doing warm reboots with unhandled SAI failures. From 910126b6a44396fe9215f6353df31811aa12052e Mon Sep 17 00:00:00 2001 From: Shi Su Date: Thu, 20 May 2021 17:09:29 +0000 Subject: [PATCH 08/11] Add handling for get operation and discussion for writing ERROR_DB --- .../SAI_failure_handling.md | 33 +++++++++++-------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md index 3618417e44..c3a6743012 100644 --- a/doc/SAI_failure_handling/SAI_failure_handling.md +++ b/doc/SAI_failure_handling/SAI_failure_handling.md @@ -8,7 +8,7 @@ This document describes the high-level design for orchagent in handling SAI fail ### 2.1 Requirements for failure handling functions in orchagent -1. Allow different handling for Create/Set/Remove operations. +1. Allow different handling for Create/Set/Remove/Get operations. 1. Allow each Orch to have its specific handling logic. 1. Adapt handling logic based on SAI API type and SAI status. 1. Escalate the failure to upper layers when the failure cannot be handled in orchagent. 
@@ -18,9 +18,9 @@ This document describes the high-level design for orchagent in handling SAI fail An illustrative figure of the failure handling framework is shown below. The orchagent generates SAI calls according to the information in APPL_DB given by upper layers. In the case of SAI failures, the orchagent gets the failure status via the feedback mechanism in synchronous mode. -Based on the failure information, the failure handling functions in orchagent make the first attempt to address the failure. +Based on the failure information, the failure handling functions in the orchagent make the first attempt to address the failure. An ERROR_DB is also introduced to support escalation to upper layers. -In the scenario where orchagent is unable to resolve the problem, the failure handling functions would escalate the failure to upper layers by pushing the failure into the ERROR_DB. +In the scenario where the orchagent is unable to resolve the problem, the failure handling functions would escalate the failure to the upper layers by pushing the failure into the ERROR_DB. @@ -33,9 +33,10 @@ To support a failure handling logic in general while also allow each orch to hav 1. `virtual task_process_status handleSaiCreateStatus(sai_api_t api, sai_status_t status, void *context = nullptr)` 2. `virtual task_process_status handleSaiSetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)` 3. `virtual task_process_status handleSaiRemoveStatus(sai_api_t api, sai_status_t status, void *context = nullptr)` +4. `virtual task_process_status handleSaiGetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)` -The three functions handle SAI failures in create, set, and remove operations, respectively. -With the type of SAI API and SAI status as an input, the function could handle the failure according to the two pieces ofinformation. +The four functions handle SAI failures in create, set, remove, and get operations, respectively. 
+With the type of SAI API and SAI status as an input, the function could handle the failure according to the two pieces of information. In the scenario where a specific logic is required in one of the Orchs, this design allows the Orch to inherit the function and include the specific login in the inherited function. @@ -47,9 +48,9 @@ The function also allows an optional input `context`, which allows passing conte The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation). -2. Return `task_failed` -- No crash, no retry, not handled successfully. +2. Return `task_failed` -- No crash, no retry, although the failure is not handled successfully, orchagent will keep running without exit. -The failure handling function should return `task_failed` when the failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such a scenario, the function should prevent orchagent from retrying and escalate the failure to upper layers. +The failure handling function should return `task_failed` when the failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such a scenario, the function should prevent orchagent from retrying and escalate the failure to upper layers. Example scenarios where the functions should return this status include invalid user input (e.g., a wrong ACL, conflicting IP addresses), hardware permanent error, non-critical internal logic error, etc. 3. Return `task_need_retry` -- No crash, retry @@ -58,7 +59,7 @@ The failure handling function should return `task_need_retry` when the failed SA 4. exit(EXIT_FAILURE) -- Crash and trigger SwSS auto-restart Some of the failures can be resolved by restarting SwSS. 
-In the scenario where such failures happen, the failure handling function in orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart. +In the scenario where such failures happen, the failure handling function in the orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart. ### 2.4 DB changes @@ -92,15 +93,21 @@ The ERROR_DB entry also includes a list of attributes (comma separated) and the The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure. +To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB. +To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. +One possible implementation could be escalating failures to ERROR_DB when the input `context` is valid. +And during development, we only give the `context` to the failure handling functions when the corresponding failure handling in the upper layer is available. + ## 3. Failure handling logic in orchagent ### 3.1 Failure status that could be handled in orchagent -| SAI status | Create | Set | Remove | -|-----|-----|-----|-----| -| ITEM ALREADY EXISTS | Set the corresponding attribute instead. | Should not happen. No retry. | Should not happen. No retry. | -| ITEM NOT FOUND  | Should not happen. No retry. | Create the item and set attribute. | Return success. No retry. -| OBJECT IN USE | Should not happen. No retry. | Should not happen. Retry after a while. | Retry after a while. | +| SAI status | Create | Set | Remove | Get +|-----|-----|-----|-----|-----| +| ITEM ALREADY EXISTS | Set the corresponding attribute instead. | Should not happen. No retry. | Should not happen. No retry. |Should not happen. No retry. | +| ITEM NOT FOUND  | Should not happen. No retry. 
| Create the item and set attribute. | Return success. No retry. | No retry. | +| OBJECT IN USE | Should not happen. No retry. | Should not happen. Retry after a while. | Retry after a while. | Should not happen. No retry. | +| NOT_SUPPORTED | Should not happen. Crash orchagent. | Should not happen. Crash orchagent. | Should not happen. Crash orchagent. | Should not happen. Crash orchagent. | TODO: Add handling logic for other SAI statuses. From e9f14814c38d40022f584c1484537103037d5089 Mon Sep 17 00:00:00 2001 From: Shi Su Date: Fri, 21 May 2021 16:38:43 +0000 Subject: [PATCH 09/11] Minor fix --- doc/SAI_failure_handling/SAI_failure_handling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md index c3a6743012..e44ec8c0a0 100644 --- a/doc/SAI_failure_handling/SAI_failure_handling.md +++ b/doc/SAI_failure_handling/SAI_failure_handling.md @@ -44,7 +44,7 @@ The function also allows an optional input `context`, which allows passing conte #### 2.3.2 Possible execution results -1. Return `task_success` -- No crash, no retry, handled successfully. +1. Return `task_success` -- No crash, no retry, handled successfully. The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).
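
The virtual handler design and the Remove column of the status table can be illustrated with a small C++ sketch. It is not the orchagent implementation: the types in namespace `sketch` are local stand-ins for `sai_api_t`, `sai_status_t`, and `task_process_status` so that the example compiles on its own, and `RouteOrchSketch` together with its override rule is invented purely to show how an Orch could add its own logic.

```cpp
#include <cstdlib>

namespace sketch {

// Stand-ins for the SAI types; the real definitions live in the SAI headers.
enum sai_api_t { API_ROUTE, API_NEIGHBOR };
enum sai_status_t {
    STATUS_ITEM_NOT_FOUND,
    STATUS_OBJECT_IN_USE,
    STATUS_NOT_SUPPORTED,
    STATUS_FAILURE
};

// Result values named in section 2.3.2.
enum task_process_status { task_success, task_failed, task_need_retry };

class Orch {
public:
    virtual ~Orch() = default;

    // Generic handling of a failed remove, following the Remove column
    // of the table in section 3.1.
    virtual task_process_status handleSaiRemoveStatus(
        sai_api_t api, sai_status_t status, void *context = nullptr)
    {
        (void)api;
        (void)context;
        switch (status) {
        case STATUS_ITEM_NOT_FOUND:
            return task_success;        // nothing left to remove
        case STATUS_OBJECT_IN_USE:
            return task_need_retry;     // retry after a while
        case STATUS_NOT_SUPPORTED:
            std::exit(EXIT_FAILURE);    // crash and let SwSS auto-restart
        default:
            return task_failed;         // no retry, orchagent keeps running
        }
    }
};

// Hypothetical Orch with its own rule, for illustration only.
class RouteOrchSketch : public Orch {
public:
    task_process_status handleSaiRemoveStatus(
        sai_api_t api, sai_status_t status, void *context = nullptr) override
    {
        // Invented rule: this orch gives up instead of retrying routes in use.
        if (api == API_ROUTE && status == STATUS_OBJECT_IN_USE) {
            return task_failed;
        }
        // Everything else falls back to the generic logic.
        return Orch::handleSaiRemoveStatus(api, status, context);
    }
};

} // namespace sketch
```

Falling back to `Orch::handleSaiRemoveStatus` in the override keeps the generic table as the single default, so an Orch only has to spell out the cases where it differs.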
From 22b0a577ce411c258faad6e2a56fe9ee7b922456 Mon Sep 17 00:00:00 2001 From: Shi Su Date: Wed, 26 May 2021 01:10:18 +0000 Subject: [PATCH 10/11] Clarify ERROR_DB removal logic --- doc/SAI_failure_handling/SAI_failure_handling.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md index e44ec8c0a0..c95148bf69 100644 --- a/doc/SAI_failure_handling/SAI_failure_handling.md +++ b/doc/SAI_failure_handling/SAI_failure_handling.md @@ -93,10 +93,8 @@ The ERROR_DB entry also includes a list of attributes (comma separated) and the The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure. -To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB. -To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. -One possible implementation could be escalating failures to ERROR_DB when the input `context` is valid. -And during development, we only give the `context` to the failure handling functions when the corresponding failure handling in the upper layer is available. +The upstream processes are expected to consume the ERROR_DB entries and remove the handled failures from the ERROR_DB. +Assuming the upstream processes properly consume ERROR_DB entries and implement failure handling logic (neither is currently available in the upstream processes and both need to be added), the ERROR_DB should not keep accumulating failures and consuming memory. ## 3.
Failure handling logic in orchagent From e181ec782d8e9940cead9e20d0c5f668d4a1be34 Mon Sep 17 00:00:00 2001 From: Shi Su Date: Mon, 21 Jun 2021 02:53:20 +0000 Subject: [PATCH 11/11] Add DB type for ERROR_DN key --- doc/SAI_failure_handling/SAI_failure_handling.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/SAI_failure_handling/SAI_failure_handling.md b/doc/SAI_failure_handling/SAI_failure_handling.md index c95148bf69..b9ca290eb6 100644 --- a/doc/SAI_failure_handling/SAI_failure_handling.md +++ b/doc/SAI_failure_handling/SAI_failure_handling.md @@ -68,17 +68,17 @@ An ERROR_DB will be introduced to escalate the failures from orchagent to upper The schema of ERROR_DB is designed as follows: ``` -ERROR_{{TABLE_TYPE}}_TABLE|entry +ERROR_{{DB_TYPE}}_{{TABLE_TYPE}}_TABLE|entry "failed_orch": {{failed orch type}} "failed_SAI": {{failed SAI type}} "opcode": {{failed SAI operation type}} "status": {{sai_status}} - "attributes": "attr_type0,attr_type1,..." - "attr_values": "attr_value0,attr_value1,..." + "attributes": "attr_type0,attr_type1,..." (Optional) + "attr_values": "attr_value0,attr_value1,..." (Optional) "counter": {{count}} ``` -The table and key in ERROR_DB correspond to the table and key in APPL_DB where SAI failures happen (e.g., SAI failure happens when conducting operations for APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, the corresponding key in ERROR_DB should be `ERROR_ROUTE_TABLE:0.0.0.0/0`). +The table and key in ERROR_DB correspond to the DB, table, and key where SAI failures happen (e.g., SAI failure happens when conducting operations for APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, the corresponding key in ERROR_DB should be `ERROR_APPL_ROUTE_TABLE:0.0.0.0/0`). The field `failed_orch` indicates the type of orch where the SAI failure happens.
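
The key and field layout described by the schema can be sketched in C++ as follows. This is a minimal illustration, not the orchagent code: `makeErrorDbKey`, `joinWithCommas`, `SaiFailure`, and `makeErrorDbFields` are hypothetical names, the `:` separator is taken from the `ERROR_APPL_ROUTE_TABLE:0.0.0.0/0` example in the text, and the optional `attributes` and `attr_values` fields are simply omitted when no attributes are given.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Builds an ERROR_DB key from the DB type, table, and key where the SAI
// failure happened, e.g. ("APPL", "ROUTE_TABLE", "0.0.0.0/0").
std::string makeErrorDbKey(const std::string &dbType,
                           const std::string &table,
                           const std::string &key,
                           char separator = ':')
{
    return "ERROR_" + dbType + "_" + table + separator + key;
}

// Joins items into the comma-separated form used by the schema.
std::string joinWithCommas(const std::vector<std::string> &items)
{
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i)
    {
        if (i != 0)
        {
            out += ',';
        }
        out += items[i];
    }
    return out;
}

// Hypothetical container for one failed SAI operation.
struct SaiFailure
{
    std::string failedOrch;               // type of orch that hit the failure
    std::string failedSai;                // type of SAI API that failed
    std::string opcode;                   // CREATE, SET, or DELETE
    std::string status;                   // e.g. SAI_STATUS_FAILURE
    std::vector<std::string> attributes;  // optional attribute types
    std::vector<std::string> attrValues;  // optional attribute values
    unsigned counter;                     // failures seen for this entry
};

// Produces the field/value pairs of one ERROR_DB entry.
std::map<std::string, std::string> makeErrorDbFields(const SaiFailure &f)
{
    std::map<std::string, std::string> fields;
    fields["failed_orch"] = f.failedOrch;
    fields["failed_SAI"] = f.failedSai;
    fields["opcode"] = f.opcode;
    fields["status"] = f.status;
    fields["counter"] = std::to_string(f.counter);
    if (!f.attributes.empty())
    {
        // The two lists are optional and only written together.
        fields["attributes"] = joinWithCommas(f.attributes);
        fields["attr_values"] = joinWithCommas(f.attrValues);
    }
    return fields;
}
```

For the example in the text, `makeErrorDbKey("APPL", "ROUTE_TABLE", "0.0.0.0/0")` yields `ERROR_APPL_ROUTE_TABLE:0.0.0.0/0`.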