[syncd.sh,pmon.service] Prevent pmon from starting ahead of syncd #8
During system startup, pmon must not start before syncd, to avoid a race condition between the two. Currently this is enforced by killing pmon, if it is alive, when syncd starts. However, that implementation is still risky. Consider the following flow:
1. pmon is inactive when syncd.sh checks it, but syncd.sh is scheduled out just before "chipdown" is called.
2. systemd is scheduled in and starts the pmon service.
3. At this point pmon and syncd are running simultaneously: the critical section is broken and a race condition arises.
To prevent this, one solution is to add syncd as an "After" dependency in pmon.service, which ensures that syncd has been started whenever pmon starts. However, doing so requires deferring the start of pmon.service until syncd.service has fully started; otherwise a deadlock forms as follows:
1. syncd.sh starts pmon before syncd itself has fully started, while
2. pmon cannot start because syncd, one of its "After" dependencies, has not fully started.
3. As a result, syncd and pmon wait for each other forever.
To solve that, move the start of pmon.service into "wait()" so that pmon is started after syncd has fully started, breaking the deadlock.
@stephenxs Maybe it would be a good idea to base the entire synchronization flow on systemd dependencies and remove everything else from syncd.sh? Stopping pmon for the warm-fast flows can be done in a dedicated script. Could you please check such an option and compare it to the existing approach? If you have any concerns, we can discuss them.
The idea is to have Wants=pmon.service in syncd.service, and After=syncd.service together with Requires=syncd.service in pmon.service. This will definitely simplify the entire flow. What do you think?
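For comparison, the dependency layout proposed in this comment could be sketched as systemd drop-in files. This is only an illustration under assumptions: the `write_dropins` helper and the drop-in file names are hypothetical, and the target directory is passed as a parameter so the sketch can be tried outside the real systemd configuration directory.

```shell
#!/bin/bash
# Sketch only: write drop-ins expressing the proposed dependencies.
# write_dropins and the *.conf file names are hypothetical.
write_dropins() {
    local dir="$1"
    mkdir -p "$dir/syncd.service.d" "$dir/pmon.service.d"

    # Starting syncd also pulls in pmon (weak dependency).
    cat > "$dir/syncd.service.d/wants-pmon.conf" <<'EOF'
[Unit]
Wants=pmon.service
EOF

    # pmon is ordered after syncd and requires it to be started.
    cat > "$dir/pmon.service.d/after-syncd.conf" <<'EOF'
[Unit]
Requires=syncd.service
After=syncd.service
EOF
}
```

On a real system, such drop-ins would typically live under /etc/systemd/system and take effect after `systemctl daemon-reload`; whether this layout also covers the warm-fast flows is exactly the open question raised above.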
- What I did
Prevent pmon from starting ahead of syncd by adding syncd.service as "After" of pmon.service.
This PR is one of the options for addressing the review comments on "[syncd.sh] stop pmon ahead of syncd in flows except warm reboot" (#7).
- How I did it
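During system startup, pmon must not start before syncd, to avoid a race condition between the two.
Currently this is enforced by killing pmon, if it is alive, when syncd starts. However, that implementation is still risky. Consider the following flow:
1. pmon is inactive when syncd.sh checks it, but syncd.sh is scheduled out just before "chipdown" is called.
2. systemd is scheduled in and starts the pmon service.
3. At this point pmon and syncd are running simultaneously: the critical section is broken and a race condition arises.
To prevent this, one solution is to add syncd as an "After" dependency in pmon.service, which ensures that syncd has been started whenever pmon starts.
However, doing so requires deferring the start of pmon.service until syncd.service has fully started; otherwise a deadlock forms as follows:
1. syncd.sh starts pmon before syncd itself has fully started, while
2. pmon cannot start because syncd, one of its "After" dependencies, has not fully started.
3. As a result, syncd and pmon wait for each other forever.
To solve that, move the start of pmon.service into "wait()" so that pmon is started after syncd has fully started, breaking the deadlock.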
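The change described above can be sketched roughly as follows. This is a simplified illustration, not the actual syncd.sh: the `SYSTEMCTL` variable and the `start_syncd_container` / `wait_syncd_container` helpers are hypothetical stand-ins, and `SYSTEMCTL` defaults to a dry-run `echo` so the ordering can be observed without systemd.

```shell
#!/bin/bash
# Dry run by default; point SYSTEMCTL at the real systemctl to act on units (assumption).
SYSTEMCTL="${SYSTEMCTL:-echo systemctl}"

# Hypothetical stand-ins for the real container start/wait logic.
start_syncd_container() { echo "syncd: started"; }
wait_syncd_container()  { echo "syncd: waiting on container"; }

start() {
    # pmon is deliberately NOT started here: pmon.service is ordered
    # After=syncd.service, so starting it before syncd has fully started
    # would leave both services waiting on each other.
    start_syncd_container
}

wait() {
    # Per the description, syncd has fully started by the time wait() runs,
    # so the After= ordering no longer blocks pmon.
    $SYSTEMCTL start pmon
    wait_syncd_container
}

# Dispatch on the first argument; does nothing when no argument is given.
case "${1:-}" in
    start|wait) "$1" ;;
esac
```

Calling `start` and then `wait` in the dry run prints the pmon start request only after the syncd start line, which is the ordering the PR relies on.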
- How to verify it
- Description for the changelog