test_reboot.py validation issue for the warmboot_finalizer activating state #4817
Comments
@yxieca could you please help to triage this issue? |
@vaibhavhd do we need to adjust the wait time? @liat-grozovik, if we spent 90 seconds waiting for the finalizer to become activating, would that imply we have already missed the 90-second LAG limit? |
@vikneels I don't understand the failure case that you mentioned:
In this example, if warmboot finalizer state becomes activating in t0 + 50s, we do not wait for the remaining of 90s in the checks following that. Also, the part The timeouts: To come OUT of |
Regarding the case where SSH took too long (>90s), and then warmboot finalizer has already reached and crossed It should be noted that the 90s check is not only to test
I think we can move these checks into |
Regarding the case where SSH took too long (>90s) and the warmboot finalizer has already reached and passed the activating state: that should also be a valid failure case.
|
Hello.. can you please respond to this? |
Hi @vikneels,
90s is the default LACP timeout: the amount of time that a port-channel interface waits for an LACPDU from the remote system before terminating the LACP session. The confusion about the SSH and finalizer timeouts arises from the imprecise timers in the test. We rely on SSH in the test because we want to check the state of warmboot-finalizer, and for that the DUT's SSH needs to work. This mechanism of verifying the LACP timeout can be improved without involving SSH and the finalizer. I will target making these changes this week. Does this answer your concerns? |
Thanks. It sounds good. As long as we don't assume things based on SSH, I am fine. Please share the PR once it's out. Thanks! |
@vaibhavhd any update on this issue? |
Sorry this took a while to be addressed. Added a fix for this at PR 5083. Please review the PR. |
This is regarding the merged PR #4706.
That PR adds a check that validates the warmboot finalizer state using the code below:
if reboot_type == 'warm':
    logger.info('waiting for warmboot-finalizer service to become activating on {}'.format(hostname))
    # Check if finalizer state reaches "activating" before the "wait" period,
    # the default wait is 90s since issue of warm-reboot).
    # If the finalizer state is activating, however time passed is greater than "wait",
    # then fail the testcase. Start with empty value to verify time passed before
    # checking finalizer state for the first time.
    finalizer_state = ''
    while finalizer_state != 'activating':
        dut_datetime_after_ssh = duthost.get_now_time()
        time_passed = float(dut_datetime_after_ssh.strftime("%s")) - float(dut_datetime.strftime("%s"))
        if time_passed > wait:
            raise Exception('warmboot-finalizer never reached state "activating" on {}'.format(hostname))
        time.sleep(1)
        finalizer_state = get_warmboot_finalizer_state(duthost)
This was done to prevent a false positive: the first time we read the warmboot finalizer state, more than 90s might already have passed, and the script previously let that go through.
However, the above change causes a valid case to fail:
t0 -> warmboot was issued
t0+50sec -> warmboot finalizer state became activating
t0+91sec -> the script checks the warmboot finalizer state, sees that more than 90 seconds have passed, and fails the test case. This failure is not valid: the finalizer state did become activating in under 90s, and only the SSH connection to the device used for validation took long.
Since the test case is about validating when the warmboot finalizer state became activating, the added check causes a correct case to fail, and we may need to address that.
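To make the timeline above concrete, here is a minimal sketch that replays it with fixed timestamps. It is not the sonic-mgmt code: `check_at_poll_time` only mimics the comparison in the quoted loop, and `check_at_transition_time` is a hypothetical alternative that assumes the test can somehow learn when the finalizer actually entered the activating state (how to obtain that timestamp on the DUT is left open here).

```python
from datetime import datetime, timedelta

WAIT_SECONDS = 90  # default wait mentioned in this issue


def check_at_poll_time(issued_at, polled_at, wait=WAIT_SECONDS):
    """Mimics the current check: the limit is compared against the time
    at which the script managed to poll the DUT."""
    return (polled_at - issued_at).total_seconds() <= wait


def check_at_transition_time(issued_at, activating_at, wait=WAIT_SECONDS):
    """Hypothetical alternative: the limit is compared against the time
    at which the finalizer actually became activating."""
    return (activating_at - issued_at).total_seconds() <= wait


# Timeline from the issue; the absolute date is arbitrary.
t0 = datetime(2022, 1, 1, 0, 0, 0)          # warmboot issued
activating_at = t0 + timedelta(seconds=50)  # finalizer became activating
polled_at = t0 + timedelta(seconds=91)      # first successful poll over SSH

print(check_at_poll_time(t0, polled_at))            # False: test fails
print(check_at_transition_time(t0, activating_at))  # True: test passes
```

With the poll-time comparison the 50s transition is reported as a failure only because the first successful poll happened at 91s. Comparing against the transition time removes the dependency on how long SSH took, while a transition that really happens after 90s would still fail.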