-
Notifications
You must be signed in to change notification settings - Fork 664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot #2286
[fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot #2286
Conversation
Why is this PR in draft mode? Is this not complete? |
# Clear all routes except of default routes for faster reconciliation time. | ||
sonic-db-cli APPL_DB eval " | ||
for _, k in ipairs(redis.call('keys', '*')) do | ||
if string.match(k, 'ROUTE_TABLE:') and not string.match(k, 'ROUTE_TABLE:0.0.0.0/0') and not string.match(k, 'ROUTE_TABLE:::/0') then \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this improve reconciliation time? Is that due to less routes to reconcile?
Why do we need this for fast-reboot, if this is not done for warm-reboot?
In fast-reboot case, does this mean we will have dataplane impact much sooner than the time when ASIC reset happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the same as the previous implementation of "Fast-Reboot", we save only the default routes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, can you please point me to the present logic for save only the default routes
in fast-reboot case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saving of default routes was done by the fast-reboot-dump script other than that it was saving also ARP and FDB tables. It was being saved to /host/fast-reboot/default_routes.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new suggested approach was tested side-by-side with the previous approach of dumping the arp/fdb/default_routes to json files and reloading them. This single change improves dataplane downtime by more than a second by itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted this change, will use the current implementation of using fast-reboot-dump and filter_fdb_entries scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I inspected the last change made - reintroducing the use of fast-reboot-dump and filder_fdb_entries and the implications it had on fast-reboot performance and I wanted to rest you assured that it is not needed.
The change of having fast-reboot over warm-reboot infrastructure introduces the call to backup_database which wasn't used previously for fast-reboot. This function backups FDB_TABLE and APP_DB is being dumped as well to dump.rdb so also media settings are being stored and alongside with ARPs. This is being restored in the change made in the sonic-buildimage PR which should be merged together. In this case the issue https://github.com/sonic-net/sonic-buildimage/issues/9165 won't reproduce.
The main change that the relevant data will be restored in a similar way to as it is being restored in warm-reboot scenario and not using fdb.json; arp.json; default_routes.json; media_config.json files.
If we will call backup_database function without deleting all routes except default routes will result in storing and then restoring all the routes after reset. This will affect dataplane downtime as the amount of routes increases and this is not a desired behavior.
The specific implementation wasn't explicitly described in the HLD but the use of warm-reboot infrastructure i.e. use backup_database and dump.rdb is the cause for this change as the current logic of storing arp/fdb/default_routes and media settings is not changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't read in detail. but the explanation sound reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tjchadaga can you please check this PR to ensure sonic-net/sonic-buildimage#9165 will not be seen again after this new design change gets merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vaibhavhd - based on the design of backup and restore of app db using dump.rdb, i think media settings should be restored on time to avoid sonic-net/sonic-buildimage#9165
/easycla |
|
sonic-db-cli STATE_DB SET "FAST_REBOOT|system" "1" "EX" "180" &>/dev/null | ||
config warm_restart enable system |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For people who does not know the background, it is really confuse here, I would suggest to add comment here to explain why fastboot requires "enable warmboot".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it confused me.
This reverts commit 5154f42.
/azpw run Azure.sonic-utilities |
/AzurePipelines run Azure.sonic-utilities |
Azure Pipelines successfully started running 1 pipeline(s). |
@vaibhavhd Could you help review? |
@@ -659,6 +631,15 @@ if [[ "$REBOOT_TYPE" = "fast-reboot" ]]; then | |||
unload_kernel | |||
exit "${EXIT_COUNTERPOLL_DELAY_FAILURE}" | |||
fi | |||
|
|||
# Clear all routes except of default routes for faster reconciliation time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a delete call (cannot recover from here), shouldn't this step be executed after set +e
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! modified.
will update also in #2365 (PR for 202205)
This PR is similar to #2286 This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1121) and sonic-buildimage PR (sonic-net/sonic-buildimage#12026). Improve fast-reboot flow by using warm-reboot infrastructure. Clear all routes except of default routes for faster reconciliation time. Verified by community fast-reboot test, manual testing.
Feel free to merge this PR. bandit checker is optional. |
Update sonic-utilities submodule pointer to include the following: * 0a7557b [minigraph] add option to specify golden path in load_minigraph ([sonic-net#2350](sonic-net/sonic-utilities#2350)) * 322aefc [GCU]Remove GCU unique lane check for duplicate lanes platforms ([sonic-net#2343](sonic-net/sonic-utilities#2343)) * 7099fff [fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot ([sonic-net#2286](sonic-net/sonic-utilities#2286)) * 09026ed [warm-reboot] fix warm-reboot when /tmp/cache is missing ([sonic-net#2367](sonic-net/sonic-utilities#2367)) * a3c404c Fix typo in platform_sfputil_helper.is_rj45_port ([sonic-net#2374](sonic-net/sonic-utilities#2374)) * 637d834 Vnet_route_check Vxlan tunnel route update. ([sonic-net#2281](sonic-net/sonic-utilities#2281)) * 29a3e51 Added support for tunnel route status in show vnet routes all. ([sonic-net#2341](sonic-net/sonic-utilities#2341)) * 1ac584b Use 'default' VRF when VRF name is not provided ([sonic-net#2368](sonic-net/sonic-utilities#2368)) * 4d377a6 [subinterface]Added additional checks in portchannel and subinterface commands ([sonic-net#2345](sonic-net/sonic-utilities#2345)) * bbcdf2e disk_check: Publish event for RO state ([sonic-net#2320](sonic-net/sonic-utilities#2320)) * 3fd537b Support the bandit check by GitHub Action ([sonic-net#2358](sonic-net/sonic-utilities#2358)) * 491d3d3 [generate dump]Added error message when saisdkdump fails ([sonic-net#2356](sonic-net/sonic-utilities#2356)) * 6830e01 [counterpoll]Fixing counterpoll show for tunnel and acl stats ([sonic-net#2355](sonic-net/sonic-utilities#2355)) * 3be2ad7 [fast-reboot]Avoid stopping masked services during fast-reboot ([sonic-net#2335](sonic-net/sonic-utilities#2335)) * 0e1b0cf [GCU] Fix missing backend in dry run ([sonic-net#2347](sonic-net/sonic-utilities#2347)) * 676c31b Add verification for override ([sonic-net#2305](sonic-net/sonic-utilities#2305)) * 48997c2 Add Password Hardening CLI support ([sonic-net#2338](sonic-net/sonic-utilities#2338)) * 414e239 update unit tests for swap ([#locato](https://github.com/sonic-net/sonic-utilities/pull/locato)) * a91a492 consider swap checking memory in ([#stalle](https://github.com/sonic-net/sonic-utilities/pull/stalle)) * f0ce586 [route_check]: Ignore standalone tunnel routes ([sonic-net#2325](sonic-net/sonic-utilities#2325)) Signed-off-by: dprital <[email protected]>
This PR should be merged together with the sonic-utilities PR (sonic-net/sonic-utilities#2286) and sonic-sairedis PR (sonic-net/sonic-sairedis#1100). Use redis contents from dump file in fast-reboot. Improve fast-reboot flow by utilizing the warm-reboot infrastructure. This followes https://github.com/sonic-net/SONiC/blob/master/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md
…t-boot (sonic-net#2286) This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594). This is done to improve fast-reboot flow by: Using warm-reboot infrastructure. Clear all routes except of default routes for faster reconciliation time.
…691) **Depends on:** - sonic-net/sonic-sairedis#1100 - sonic-net/sonic-utilities#2286 - sonic-net/sonic-buildimage#11594 **Why I did this?** Daemons which are not related to warm/fast restart might affect the performance of warm/fast restart. A hardcoded start up delay is the current solution to avoid this. This PR implements a function to wait warm/fast restart done. This function provided a efficiency and graceful way for daemons to wait warm/fast restart done. **How I did it?** Implement a utility function RestartWaiter::waitRestartDone. This function waits warm restart done flag in STATE DB and return true if the flag is set by warm restart finalizer. This function is also exposed as python extension so that python daemons can utilize it. This PR depends on new fastboot design: https://github.com/sonic-net/SONiC/blob/master/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md
…t-boot (sonic-net#2286) This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594). This is done to improve fast-reboot flow by: Using warm-reboot infrastructure. Clear all routes except of default routes for faster reconciliation time.
…t-boot (sonic-net#2286) This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594). This is done to improve fast-reboot flow by: Using warm-reboot infrastructure. Clear all routes except of default routes for faster reconciliation time.
This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594).
What I did
Improve fast-reboot flow.
How I did it
Using warm-reboot infrastructure.
Clear all routes except of default routes for faster reconciliation time.
How to verify it
Community fast-reboot test, manual testing.
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)