Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[warm-reboot]: added pre-check for ISSU file #915

Merged
merged 6 commits into from
Jun 9, 2020
Merged

[warm-reboot]: added pre-check for ISSU file #915

merged 6 commits into from
Jun 9, 2020

Conversation

vadymhlushko-mlnx
Copy link
Contributor

Signed-off-by: Vadym Hlushko [email protected]

- What I did
Added pre-check for file /host/warmboot/issu_bank.txt in warm-reboot script.
Added instruction on how to recover issu_bank.txt.

- How I did it
Modify the 'fast-reboot' script

- How to verify it
Need to somehow corrupt /host/warmboot/issu_bank.txt (list below) and then run the warm-reboot command.

  1. Remove issu_bank.txt
  2. Clear issu_bank.txt
  3. Change characters count in issu_bank.txt
  4. Write invalid content into issu_bank.txt

- Previous command output (if the output of a command-line utility has changed)
root@arc-switch1041:~# warm-reboot
Warning: Stopping telemetry.service, but it can still be activated by:
telemetry.timer

- New command output (if the output of a command-line utility has changed)
root@arc-switch1041:~# warm-reboot
(/host/warmboot/issu_bank.txt) does NOT exist or empty ...
To recover (/host/warmboot/issu_bank.txt) file, do the following:
$ docker exec -it syncd sx_api_dbg_generate_dump.py
$ docker exec -it syncd cat /tmp/sdkdump | grep 'ISSU Bank'
Command above will print the VALUE of ISSU BANK - 0 or 1, use this VALUE in the next command
$ printf VALUE > /host/warmboot/issu_bank.txt

@msftclas
Copy link

msftclas commented May 14, 2020

CLA assistant check
All CLA requirements met.

stepanblyschak
stepanblyschak previously approved these changes May 15, 2020
exit "${MLNX_ISSU_BANK_BROKEN}"
fi

issu_file_chars_count=`wc -m ${ISSU_BANK_FILE} | awk '{print $1}'`;
Copy link
Contributor

@qiluo-msft qiluo-msft May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wc -m ${ISSU_BANK_FILE} | awk '{print $1}' [](start = 27, length = 42)

A more simple solution

stat -c %s FILE
``` #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try? This comment is still applicable.


In reply to: 427512540 [](ancestors = 427512540)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try? This comment is still applicable.

In reply to: 427512540 [](ancestors = 427512540)

Fixed

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As comments

qiluo-msft
qiluo-msft previously approved these changes May 27, 2020
request_pre_shutdown

wait_for_pre_shutdown_complete_or_fail

if [[ "x$sonic_asic_type" == x"mellanox" ]]; then
check_issu_bank_file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the last minute issue.

How can we distinguish if the error message came out from line 531 or 523? Are you able to passing in a differentiating token to check issu_bank_file and use that to distinguish if the error happens before or after pre-shutdown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the last minute issue.

How can we distinguish if the error message came out from line 531 or 523? Are you able to passing in a differentiating token to check issu_bank_file and use that to distinguish if the error happens before or after pre-shutdown?

Fixed

Copy link
Contributor

@yxieca yxieca Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't see this earlier, even before pre-shutdown, the swss has been stopped. Exiting here will leave the device eventually in failure mode requires human interaction.

I think both check should generate log event stating problem found and mark if the problem was found before or after pre-shutdown, but not exiting.

Is there a chance that there was a problem before pre-shutdown, but pre-shutdown fixed it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… issu_bank.txt was corrupted

Signed-off-by: Vadym Hlushko <[email protected]>

if [[ ! -s "$ISSU_BANK_FILE" ]]; then
error "(${ISSU_BANK_FILE}) does NOT exist or empty ..."
recover_issu_bank_file_instruction
exit "${MLNX_ISSU_BANK_BROKEN}"
if [[ "$1" = true ]]; then
Copy link
Contributor

@qiluo-msft qiluo-msft May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't mix tab and space. We prefer 4 spaces for each indentation level. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if [[ "$1" = true ]]; then
exit "${MLNX_ISSU_BANK_BROKEN_BEFORE_PRE_SHUTDOWN}"
else
exit "${MLNX_ISSU_BANK_BROKEN_AFTER_PRE_SHUTDOWN}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this issue in my second round of review.

Exit here will cause the fast/warm reboot to abort. I don't think this is what we want here. Because at the time when this check happens, we've already shutdown some critical services, e.g. BGP, RADV. Depending on if pre-shutdown is called, we could also lose syncd. It is unfortunate that there was an issue. But I think at this point, we would have to proceed with reboot and take the consequence. The logic is that aborting reboot sequence will require human intervention to recover. Continue rebooting will cause data plane disruption in the future. That sucks, but that would self heal. Right?

I think here the best we can do is to generate a log message so that we have log to look into after reboot is done and learn what happened when.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this issue in my second round of review.

Exit here will cause the fast/warm reboot to abort. I don't think this is what we want here. Because at the time when this check happens, we've already shutdown some critical services, e.g. BGP, RADV. Depending on if pre-shutdown is called, we could also lose syncd. It is unfortunate that there was an issue. But I think at this point, we would have to proceed with reboot and take the consequence. The logic is that aborting reboot sequence will require human intervention to recover. Continue rebooting will cause data plane disruption in the future. That sucks, but that would self heal. Right?

I think here the best we can do is to generate a log message so that we have log to look into after reboot is done and learn what happened when.

Fixed

@@ -483,10 +512,18 @@ systemctl stop swss
if [[ "$REBOOT_TYPE" = "warm-reboot" || "$REBOOT_TYPE" = "fastfast-reboot" ]]; then
initialize_pre_shutdown

if [[ "x$sonic_asic_type" == x"mellanox" ]]; then
check_issu_bank_file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_issu_bank_file [](start = 8, length = 20)

This check is before request_pre_shutdown. If you find anything wrong, you may still exit the script and control/data plane are still running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice new issue

…re call request_pre_shutdown

Signed-off-by: Vadym Hlushko <[email protected]>
@qiluo-msft
Copy link
Contributor

Retest this please

@yxieca yxieca merged commit 1ddd3f2 into sonic-net:master Jun 9, 2020
abdosi pushed a commit that referenced this pull request Jun 16, 2020
* [warm-reboot]: added pre-check for ISSU file

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: fixed problems according to PR review comments

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: added distinguish token and error code to clarify when issu_bank.txt was corrupted

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: fixed indentation problems

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: removed script interrupt when issu_bank.txt is broken

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: added script interrupt if issu_bank.txt is broken before call request_pre_shutdown

Signed-off-by: Vadym Hlushko <[email protected]>
abdosi pushed a commit to abdosi/sonic-utilities that referenced this pull request Aug 4, 2020
* [warm-reboot]: added pre-check for ISSU file

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: fixed problems according to PR review comments

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: added distinguish token and error code to clarify when issu_bank.txt was corrupted

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: fixed indentation problems

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: removed script interrupt when issu_bank.txt is broken

Signed-off-by: Vadym Hlushko <[email protected]>

* [warm-reboot]: added script interrupt if issu_bank.txt is broken before call request_pre_shutdown

Signed-off-by: Vadym Hlushko <[email protected]>
stepanblyschak pushed a commit to stepanblyschak/sonic-utilities that referenced this pull request Apr 28, 2022
 [201911][thermal control] Backport changes from master branch (sonic-net#929)
     [201911][config] Support abbreviation (sonic-net#933)
       Add 'hw-management-generate-dump.sh' to 'show techsupport'
       command (sonic-net#934)
       [fwutil]: Update fwutil to v2.0.0.0. (sonic-net#942)
       Fixes bug for PFCWD feature parameters (sonic-net#838)
     Fixed fast-reboot for BFN platform (sonic-net#871)
     [config] Add 'interface transceiver' subgroup with 'lpmode' and
     'reset' subcommands (sonic-net#904)
      [warm-reboot]: added pre-check for ISSU file (sonic-net#915)
       [config] Don't attempt to restart disabled services (sonic-net#944)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants