-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast-reboot Flow Improvements HLD #980
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,14 +30,14 @@ Some feature flows in SONiC are delayed with a timer to keep the CPU dedicated t | |
In order to have such indicator, re-use of the fastfast-reboot infrastructure can be used. | ||
|
||
Each network application will experience similar processing flow. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have a hard time to understand the relation in the paragraph. Can you rephrase? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. is it clear now? |
||
Application and corresponding orchagent sub modules need to work together to restore the original data and push it to the ASIC. | ||
Application and corresponding orchagent sub modules need to work together to restore the preboot state and push it to the ASIC. | ||
Take neighbor as an example, upon restart operation every neighbor we had prior to the reboot should be created again after resetting the ASIC. | ||
We should also synchronize the actual neighbor state after recovering it, the MAC of the neighbor could have changed, went down for some reason etc. | ||
In this case, restore_neighbors.py script will align the network state with the switch state by sending ARP/NDP to all known neighbors prior the reboot. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please mark up and link to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
neighsyncd will update the internal cache with all neighbors and push all to APP DB, orchagent will then add/remove/update any neighbor and get syncd to program the HW with the new data. | ||
|
||
In addition to the recover mechanism, the warmboot-finalizer can be enhanced to finalize fast-reboot as well and introduce a new flag indicating the process is done. | ||
This new flag can be used later on for any functionality which we want to start only after init flow finished in case of fast-reboot. | ||
This new flag can be used later on for any functionality, we want to start only after init flow finished in case of fast-reboot. | ||
This is to prevent interference in the fast-reboot reconciliation process and impair the performance, for example enablement of flex counters. | ||
|
||
References: | ||
|
@@ -49,11 +49,11 @@ https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-orchagent/e | |
|
||
The new Fast-reboot design should meet the following requirments: | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mention the time goal again? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is mentioned, 2 last bullets. |
||
- Reboot the switch into a new SONiC software version using kexec. | ||
- Upgrade the FW by the new SONiC image if needed. | ||
- Recover all applications state with the new image to the previous state prior the reboot. | ||
- Reboot the switch into a new SONiC software version using kexec - less than 5 seconds. | ||
- Upgrade the switch FW by the new SONiC image if needed. | ||
- Recover all application's state with the new image to the previous state prior the reboot. | ||
- Recover ASIC state after reset to the previous state prior the reboot. | ||
- Recover the Kernel state after reset to the previous state prior the reboot. | ||
- Recover the Kernel internal DB state after reset to the previous state prior the reboot. | ||
- Sync the Kernel and ASIC with changes on the network which happen during fast-reboot. | ||
- Control plane downtime will not exceed 90 seconds. | ||
- Data plane downtime will not exceed 30 seconds. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All these requirements are already mostly being met. Do you want to instead only emphasize the new-requirements that are expected to be met by new design? Like "Do not exceed CPU/mem consumption during bootup path until fast-reboot flow is finished. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I thought it will be better to gather all requirements, even the ones which are already satisfied with previous implementation since we are changing it to a new one. |
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: can we just call it warm-reboot instead of fastfast-reboot. Warm-reboot is generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warm-reboot is indeed generic, but the approach is different.
warm-reboot is divided into 2 implementations, the normal warm-reboot and fastfast-reboot.
Although the CLI command is the same "warm-reboot" under the hood the flow is different.
Call it warm-reboot might be confusing, so this is why I mentioned it as fastfast-reboot - to be more accurate.
Do you agree?