Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast-reboot Flow Improvements HLD #980

Merged
merged 5 commits into from
May 25, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Fix more review comments
Shlomi Bitton committed May 1, 2022
commit 316979e2e7b4a755b9d7afcab8f17f9fdb06f0c6
12 changes: 6 additions & 6 deletions doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md
Original file line number Diff line number Diff line change
@@ -30,14 +30,14 @@ Some feature flows in SONiC are delayed with a timer to keep the CPU dedicated t
In order to have such indicator, re-use of the fastfast-reboot infrastructure can be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: can we just call it warm-reboot instead of fastfast-reboot. Warm-reboot is generic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warm-reboot is indeed generic, but the approach is different.
warm-reboot is divided into 2 implementations, the normal warm-reboot and fastfast-reboot.
Although the CLI command is the same "warm-reboot" under the hood the flow is different.
Call it warm-reboot might be confusing, so this is why I mentioned it as fastfast-reboot - to be more accurate.
Do you agree?


Each network application will experience similar processing flow.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a hard time to understand the relation in the paragraph. Can you rephrase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. is it clear now?
If not can you specify what is not clear?

Application and corresponding orchagent sub modules need to work together to restore the original data and push it to the ASIC.
Application and corresponding orchagent sub modules need to work together to restore the preboot state and push it to the ASIC.
Take neighbor as an example, upon restart operation every neighbor we had prior to the reboot should be created again after resetting the ASIC.
We should also synchronize the actual neighbor state after recovering it, the MAC of the neighbor could have changed, went down for some reason etc.
In this case, restore_neighbors.py script will align the network state with the switch state by sending ARP/NDP to all known neighbors prior the reboot.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mark up and link to restore_neighbors.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

neighsyncd will update the internal cache with all neighbors and push all to APP DB, orchagent will then add/remove/update any neighbor and get syncd to program the HW with the new data.

In addition to the recover mechanism, the warmboot-finalizer can be enhanced to finalize fast-reboot as well and introduce a new flag indicating the process is done.
This new flag can be used later on for any functionality which we want to start only after init flow finished in case of fast-reboot.
This new flag can be used later on for any functionality, we want to start only after init flow finished in case of fast-reboot.
This is to prevent interference in the fast-reboot reconciliation process and impair the performance, for example enablement of flex counters.

References:
@@ -49,11 +49,11 @@ https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-orchagent/e

The new Fast-reboot design should meet the following requirments:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention the time goal again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is mentioned, 2 last bullets.

- Reboot the switch into a new SONiC software version using kexec.
- Upgrade the FW by the new SONiC image if needed.
- Recover all applications state with the new image to the previous state prior the reboot.
- Reboot the switch into a new SONiC software version using kexec - less than 5 seconds.
- Upgrade the switch FW by the new SONiC image if needed.
- Recover all application's state with the new image to the previous state prior the reboot.
- Recover ASIC state after reset to the previous state prior the reboot.
- Recover the Kernel state after reset to the previous state prior the reboot.
- Recover the Kernel internal DB state after reset to the previous state prior the reboot.
- Sync the Kernel and ASIC with changes on the network which happen during fast-reboot.
- Control plane downtime will not exceed 90 seconds.
- Data plane downtime will not exceed 30 seconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these requirements are already mostly being met.

Do you want to instead only emphasize the new-requirements that are expected to be met by new design? Like "Do not exceed CPU/mem consumption during bootup path until fast-reboot flow is finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it will be better to gather all requirements, even the ones which are already satisfied with previous implementation since we are changing it to a new one.
I think it is important to mention all when the implementation is different.
What do you think?