-
Notifications
You must be signed in to change notification settings - Fork 684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[warm-reboot]: added pre-check for ISSU file #915
[warm-reboot]: added pre-check for ISSU file #915
Conversation
Signed-off-by: Vadym Hlushko <[email protected]>
scripts/fast-reboot
Outdated
exit "${MLNX_ISSU_BANK_BROKEN}" | ||
fi | ||
|
||
issu_file_chars_count=`wc -m ${ISSU_BANK_FILE} | awk '{print $1}'`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wc -m ${ISSU_BANK_FILE} | awk '{print $1}' [](start = 27, length = 42)
A more simple solution
stat -c %s FILE
``` #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try? This comment is still applicable.
In reply to: 427512540 [](ancestors = 427512540)
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As comments
Signed-off-by: Vadym Hlushko <[email protected]>
request_pre_shutdown | ||
|
||
wait_for_pre_shutdown_complete_or_fail | ||
|
||
if [[ "x$sonic_asic_type" == x"mellanox" ]]; then | ||
check_issu_bank_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the last minute issue.
How can we distinguish if the error message came out from line 531 or 523? Are you able to passing in a differentiating token to check issu_bank_file and use that to distinguish if the error happens before or after pre-shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the last minute issue.
How can we distinguish if the error message came out from line 531 or 523? Are you able to passing in a differentiating token to check issu_bank_file and use that to distinguish if the error happens before or after pre-shutdown?
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't see this earlier, even before pre-shutdown, the swss has been stopped. Exiting here will leave the device eventually in failure mode requires human interaction.
I think both check should generate log event stating problem found and mark if the problem was found before or after pre-shutdown, but not exiting.
Is there a chance that there was a problem before pre-shutdown, but pre-shutdown fixed it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only safe place to check and exit is here: https://github.com/Azure/sonic-utilities/blob/master/scripts/fast-reboot#L255
… issu_bank.txt was corrupted Signed-off-by: Vadym Hlushko <[email protected]>
scripts/fast-reboot
Outdated
|
||
if [[ ! -s "$ISSU_BANK_FILE" ]]; then | ||
error "(${ISSU_BANK_FILE}) does NOT exist or empty ..." | ||
recover_issu_bank_file_instruction | ||
exit "${MLNX_ISSU_BANK_BROKEN}" | ||
if [[ "$1" = true ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't mix tab and space. We prefer 4 spaces for each indentation level. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
scripts/fast-reboot
Outdated
if [[ "$1" = true ]]; then | ||
exit "${MLNX_ISSU_BANK_BROKEN_BEFORE_PRE_SHUTDOWN}" | ||
else | ||
exit "${MLNX_ISSU_BANK_BROKEN_AFTER_PRE_SHUTDOWN}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this issue in my second round of review.
Exit here will cause the fast/warm reboot to abort. I don't think this is what we want here. Because at the time when this check happens, we've already shutdown some critical services, e.g. BGP, RADV. Depending on if pre-shutdown is called, we could also lose syncd. It is unfortunate that there was an issue. But I think at this point, we would have to proceed with reboot and take the consequence. The logic is that aborting reboot sequence will require human intervention to recover. Continue rebooting will cause data plane disruption in the future. That sucks, but that would self heal. Right?
I think here the best we can do is to generate a log message so that we have log to look into after reboot is done and learn what happened when.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this issue in my second round of review.
Exit here will cause the fast/warm reboot to abort. I don't think this is what we want here. Because at the time when this check happens, we've already shutdown some critical services, e.g. BGP, RADV. Depending on if pre-shutdown is called, we could also lose syncd. It is unfortunate that there was an issue. But I think at this point, we would have to proceed with reboot and take the consequence. The logic is that aborting reboot sequence will require human intervention to recover. Continue rebooting will cause data plane disruption in the future. That sucks, but that would self heal. Right?
I think here the best we can do is to generate a log message so that we have log to look into after reboot is done and learn what happened when.
Fixed
Signed-off-by: Vadym Hlushko <[email protected]>
Signed-off-by: Vadym Hlushko <[email protected]>
scripts/fast-reboot
Outdated
@@ -483,10 +512,18 @@ systemctl stop swss | |||
if [[ "$REBOOT_TYPE" = "warm-reboot" || "$REBOOT_TYPE" = "fastfast-reboot" ]]; then | |||
initialize_pre_shutdown | |||
|
|||
if [[ "x$sonic_asic_type" == x"mellanox" ]]; then | |||
check_issu_bank_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_issu_bank_file [](start = 8, length = 20)
This check is before request_pre_shutdown
. If you find anything wrong, you may still exit
the script and control/data plane are still running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice new issue
…re call request_pre_shutdown Signed-off-by: Vadym Hlushko <[email protected]>
Retest this please |
* [warm-reboot]: added pre-check for ISSU file Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: fixed problems according to PR review comments Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: added distinguish token and error code to clarify when issu_bank.txt was corrupted Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: fixed indentation problems Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: removed script interrupt when issu_bank.txt is broken Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: added script interrupt if issu_bank.txt is broken before call request_pre_shutdown Signed-off-by: Vadym Hlushko <[email protected]>
* [warm-reboot]: added pre-check for ISSU file Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: fixed problems according to PR review comments Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: added distinguish token and error code to clarify when issu_bank.txt was corrupted Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: fixed indentation problems Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: removed script interrupt when issu_bank.txt is broken Signed-off-by: Vadym Hlushko <[email protected]> * [warm-reboot]: added script interrupt if issu_bank.txt is broken before call request_pre_shutdown Signed-off-by: Vadym Hlushko <[email protected]>
[201911][thermal control] Backport changes from master branch (sonic-net#929) [201911][config] Support abbreviation (sonic-net#933) Add 'hw-management-generate-dump.sh' to 'show techsupport' command (sonic-net#934) [fwutil]: Update fwutil to v2.0.0.0. (sonic-net#942) Fixes bug for PFCWD feature parameters (sonic-net#838) Fixed fast-reboot for BFN platform (sonic-net#871) [config] Add 'interface transceiver' subgroup with 'lpmode' and 'reset' subcommands (sonic-net#904) [warm-reboot]: added pre-check for ISSU file (sonic-net#915) [config] Don't attempt to restart disabled services (sonic-net#944)
Signed-off-by: Vadym Hlushko [email protected]
- What I did
Added pre-check for file /host/warmboot/issu_bank.txt in warm-reboot script.
Added instruction on how to recover issu_bank.txt.
- How I did it
Modify the 'fast-reboot' script
- How to verify it
Need to somehow corrupt /host/warmboot/issu_bank.txt (list below) and then run the warm-reboot command.
- Previous command output (if the output of a command-line utility has changed)
root@arc-switch1041:~# warm-reboot
Warning: Stopping telemetry.service, but it can still be activated by:
telemetry.timer
- New command output (if the output of a command-line utility has changed)
root@arc-switch1041:~# warm-reboot
(/host/warmboot/issu_bank.txt) does NOT exist or empty ...
To recover (/host/warmboot/issu_bank.txt) file, do the following:
$ docker exec -it syncd sx_api_dbg_generate_dump.py
$ docker exec -it syncd cat /tmp/sdkdump | grep 'ISSU Bank'
Command above will print the VALUE of ISSU BANK - 0 or 1, use this VALUE in the next command
$ printf VALUE > /host/warmboot/issu_bank.txt