Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_reboot.py validation issue for the warmboot_finalizer activating state #4817

Closed
vikneels opened this issue Dec 6, 2021 · 10 comments · Fixed by #5083
Closed

test_reboot.py validation issue for the warmboot_finalizer activating state #4817

vikneels opened this issue Dec 6, 2021 · 10 comments · Fixed by #5083
Assignees

Comments

@vikneels
Copy link

vikneels commented Dec 6, 2021

This is with respect to the below PR that went in
#4706

The above PR adds check to validate the warmboot finalizer state with below
if reboot_type == 'warm':
logger.info('waiting for warmboot-finalizer service to become activating on {}'.format(hostname))
# Check if finalizer state reaches "activating" before the "wait" period,
# the default wait is 90s since issue of warm-reboot).
# If the finalizer state is activating, however time passed is greater than "wait",
# then fail the testcase. Start with empty value to verify time passed before
# checking finalizer state for the first time.
finalizer_state = ''
while finalizer_state != 'activating':
dut_datetime_after_ssh = duthost.get_now_time()
time_passed = float(dut_datetime_after_ssh.strftime("%s")) - float(dut_datetime.strftime("%s"))
if time_passed > wait:
raise Exception('warmboot-finalizer never reached state "activating" on {}'.format(hostname))
time.sleep(1)
finalizer_state = get_warmboot_finalizer_state(duthost)

This was done to make sure we dont allow the false positive (first time we get warmboot finalizer state it might be more than 90s and the script used to let it go through before).

However the above change causes failure to valid case:

t0 - > warmboot was issued
t0+50sec -> warmboot finalizer state became activating
t0+91sec -> script checks the warmboot finalizer state and thinks that 90sec has passed and fails the case. This is not really valid since warmboot finalizer state did become activating below 90s and just that ssh to device to validate took long.
Since the test case is about validating "When warmboot finalizer state became activating", the check that was added causes right case to fail and we might have to address that.

@liat-grozovik
Copy link
Collaborator

@yxieca could you please help to triage this issue?

@yxieca
Copy link
Collaborator

yxieca commented Dec 19, 2021

@vaibhavhd do we need to adjust the wait time?

@liat-grozovik , if we spent 90 seconds waiting for finalizer to become activating, would that imply we missed LAG 90 seconds limit already?

@vaibhavhd
Copy link
Contributor

@vikneels I don't understand the failure case that you mentioned:

t0 - > warmboot was issued
t0+50sec -> warmboot finalizer state became activating
t0+91sec -> script checks the warmboot finalizer state and thinks that 90sec has passed and fails the case. This is not really valid since warmboot finalizer state did become activating below 90s and just that ssh to device to validate took long.
Since the test case is about validating "When warmboot finalizer state became activating", the check that was added causes right case to fail and we might have to address that.

In this example, if warmboot finalizer state becomes activating in t0 + 50s, we do not wait for the remaining of 90s in the checks following that.
Specifically, after warmboot state reaches activating within 90s, we then wait for 300s for the state to become NOT activating.

Also, the part just that ssh to device to validate took long is also incorrect. The SSH started working when we detected the state to be activating.

The timeouts:
To reach IN activating state: https://github.com/Azure/sonic-mgmt/blob/4db6018c1f2f67c8433a4cfbb4bcd8c733874a07/tests/common/reboot.py#L67
https://github.com/Azure/sonic-mgmt/blob/4db6018c1f2f67c8433a4cfbb4bcd8c733874a07/tests/common/reboot.py#L186

To come OUT of activating state: https://github.com/Azure/sonic-mgmt/blob/4db6018c1f2f67c8433a4cfbb4bcd8c733874a07/tests/common/reboot.py#L66
https://github.com/Azure/sonic-mgmt/blob/4db6018c1f2f67c8433a4cfbb4bcd8c733874a07/tests/common/reboot.py#L198

@vaibhavhd
Copy link
Contributor

Regarding the case where SSH took too long (>90s), and then warmboot finalizer has already reached and crossed activating state, then it it should also be a valid failure case.

It should be noted that the 90s check is not only to test When warmboot finalizer state became activating.
The 90s wait time is actually for the control plane downtime period. The present checks are in fact a little too lenient.
Some changes that I think should be incorporated into test:

  1. Do not start the 90s timeout at the time of issuing the warmboot command.
    1.1. Start the T0 when the control plane goes down OR when the teamd is stopped gracefully.
  2. Do not end the 90s timeout based on when FINALIZER state has reached activating.
    2.1 End the 90s timeout when control plane is back up OR when the teamd starts again.

I think we can move these checks into test_advance_reboot::test_warm_reboot instead of this testcase (test_reboot::test_warm_reboot).
The reason is that log_analyzer is already enabled in advanced_reboot cases.

@vikneels
Copy link
Author

vikneels commented Dec 20, 2021

Regarding the case where SSH took too long (>90s), and then warmboot finalizer has already reached and crossed activating state, then it it should also be a valid failure case.

usually from our platform we do some checks before we let the login happen. however the activating state before that. So what you are saying is even dut ssh should happen before 90sec? that was not explicit from test case. hence wanted the clarification. if thats the case, I will check how i can let the ssh happen before 90s. is there is any separate requirement for ssh timeout? if so let me know.

@vikneels
Copy link
Author

Hello.. can you please respond to this?
usually from our platform we do some checks before we let the login happen. however the activating state before that. So what you are saying is even dut ssh should happen before 90sec? that was not explicit from test case. hence wanted the clarification. if thats the case, I will check how i can let the ssh happen before 90s. is there is any separate requirement for ssh timeout? if so let me know.
Thanks..

@vaibhavhd
Copy link
Contributor

Hi @vikneels,
To clarify about the significance of 90s time expectation:

  1. It is NOT for SSH to work within 90s per-se.
  2. It is for LAGs to be back up in 90s.

90s is the default lacp timeout - the amount of time that a port-channel interface waits for a LACPDU from the remote system before terminating the LACP session.

The whole confusion about SSH and finalizer timeout arises due to not-so-precise timers in the test. We rely on SSH in the test as we want to check the state of warmboot-finalizer, and for that we would need the DUT's SSH to work. This mechanism of verifying LACP-timeout can be improved without involving SSH and finalizer.
The placement of start and end timer for 90s should be fixed by the changes I mentioned in #4817 (comment).

I will target making these changes this week.

Does this answer your concerns?

@vikneels
Copy link
Author

Thanks. It sounds good . As long as we dont assume things based on ssh I am fine.

Please share the PR once its out. Thanks!

@sachinv-msft
Copy link

@vaibhavhd any update on this issue?

@vaibhavhd
Copy link
Contributor

Sorry this took a while to be addressed. Added a fix for this at PR 5083. Please review the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants