
Parameterizing sdn to ovn migration timeout #42672

Conversation

@vishnuchalla (Contributor) commented Aug 24, 2023

Adding a generic timeout for the ovn-sdn-migration ref, as it is resulting in some job failures. More context here.
This option should give us the flexibility to decide on a generic timeout for all the steps involved. If not set, the steps will run with their default values.
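A minimal sketch of the single-knob approach, assuming the step script wraps its wait steps in explicit timeouts. The `OVN_SDN_MIGRATION_TIMEOUT` name and the 1200s/60s defaults come up later in this thread; the `oc` commands below are illustrative stand-ins, not taken from the PR:

```bash
#!/usr/bin/env bash
set -euo pipefail

# If OVN_SDN_MIGRATION_TIMEOUT is set, it overrides every per-step default;
# when it is left unset, each step keeps its original default.
co_timeout="${OVN_SDN_MIGRATION_TIMEOUT:-1200s}"   # cluster operators stop progressing
cno_timeout="${OVN_SDN_MIGRATION_TIMEOUT:-60s}"    # network operator becomes available

# Illustrative step: wait for all cluster operators to stop progressing.
oc wait co --all --for='condition=Progressing=False' --timeout="${co_timeout}"

# Illustrative step: wait for the cluster network operator deployment.
oc wait -n openshift-network-operator deployment/network-operator \
  --for='condition=Available=True' --timeout="${cno_timeout}"
```

If the variable is exported as, say, `2700s`, every wait gets the larger budget; if it is left unset, behaviour is unchanged.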

@openshift-ci-robot (Contributor)

[REHEARSALNOTIFIER]
@vishnuchalla: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.15-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.15-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.14-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.14-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.13-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.13-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.12-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.12-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.11-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.11-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.10-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.10-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.9-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.9-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-arm64-nightly-azure-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.10-amd64-nightly-azure-ipi-sdn-migration-ovn-f28 N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.8-e2e-network-migration N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.11-e2e-network-migration N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.8-e2e-network-migration-rollback N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-network-migration N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-azure-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-arm64-nightly-gcp-ipi-sdn-migration-ovn-f28 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-gcp-ipi-sdn-migration-ovn-f28 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.12-arm64-nightly-aws-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.12-e2e-network-migration-rollback N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.13-e2e-network-migration-rollback N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-aws-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.13-amd64-nightly-gcp-ipi-sdn-migration-ovn-p2-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-ec-aws-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.11-e2e-network-migration-rollback N/A periodic Registry content changed

A total of 67 jobs have been affected by this change. The above listing is non-exhaustive and limited to 35 jobs.

A full list of affected jobs can be found here
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 10 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 20 rehearsals
Comment: /pj-rehearse max to run up to 35 rehearsals
Comment: /pj-rehearse auto-ack to run up to 10 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci bot requested review from danwinship and jluhrsen — August 24, 2023 23:11
@vishnuchalla (Contributor, Author)

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14

@vishnuchalla (Contributor, Author)

/assign @jtaleric
/assign @krishvoor

@vishnuchalla (Contributor, Author)

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14

@vishnuchalla (Contributor, Author)

/joke

@openshift-ci bot (Contributor) commented Aug 25, 2023

@vishnuchalla: What's the best thing about elevator jokes? They work on so many levels.

In response to this:

> /joke

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci bot (Contributor) commented Aug 25, 2023

@vishnuchalla: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14 5710db3 link unknown /pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vishnuchalla (Contributor, Author)

/assign @jluhrsen

@pliurh (Contributor) commented Aug 31, 2023

@vishnuchalla I guess you want to increase the timeout when running this script in a large cluster. May I ask what the size of the cluster is? And at which step is the current timeout not big enough?

@vishnuchalla (Contributor, Author)

> @vishnuchalla I guess you want to increase the timeout when running this script in a large cluster. May I ask what the size of the cluster is? And at which step is the current timeout not big enough?

We have a cluster with 120 worker nodes on which we plan to trigger some large-scale tests.

@vishnuchalla (Contributor, Author)

/pj-rehearse ack

@openshift-ci-robot added the rehearsals-ack label (signifies that rehearsal jobs have been acknowledged) — Aug 31, 2023
@pliurh (Contributor) commented Aug 31, 2023

/lgtm

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) — Aug 31, 2023
@danwinship (Contributor)

> I guess you want to increase the timeout when running this script in a large cluster.

It seems like it would be better to just count the number of nodes at the top and then adjust the timeouts based on that? (eg timeout $((num_nodes * 20)) bash)
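For comparison, a sketch of that node-scaled alternative. The 20-seconds-per-node multiplier comes from the comment above; the floor value and the wrapped wait loop are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Derive the timeout from cluster size instead of exposing a knob.
num_nodes=$(oc get nodes --no-headers | wc -l)
co_timeout=$(( num_nodes * 20 ))          # seconds, scaled per node
if (( co_timeout < 1200 )); then          # keep the current default as a floor
  co_timeout=1200
fi

# Illustrative wait loop bounded by the derived timeout.
timeout "${co_timeout}" bash -c \
  'until oc wait co --all --for=condition=Progressing=False --timeout=30s; do sleep 10; done'
```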

@vishnuchalla (Contributor, Author) commented Aug 31, 2023

> It seems like it would be better to just count the number of nodes at the top and then adjust the timeouts based on that? (eg timeout $((num_nodes * 20)) bash)

That might work in some cases, but we cannot guarantee it will be stable across all types of runs, and we might often have to revisit this part of the code to change that fixed value if something breaks. Instead, I think we should expose a variable as in this PR, so that users have the flexibility to experiment and decide on the right number for their use case. That way the code needs minimal changes, and the timeout can be set to whatever value the user wants.

@jluhrsen (Contributor) commented Sep 1, 2023

> That might work in some cases, but we cannot guarantee it will be stable across all types of runs, and we might often have to revisit this part of the code to change that fixed value if something breaks. Instead, I think we should expose a variable as in this PR, so that users have the flexibility to experiment and decide on the right number for their use case. That way the code needs minimal changes, and the timeout can be set to whatever value the user wants.

If this variable is set to some value (non-zero), then that is the timeout used for all steps, whereas before the timeouts were specific to each step, e.g. 1200s for the CO to stop progressing, 60s for the CNO to be patched, etc.

That could really blow up on us if there is an issue. Would it be overkill to add specific timeouts for each step, so you can keep defaults like 60s for one thing but tweak something more major like CO progressing?
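If per-step knobs were preferred instead, one possible shape is sketched below; the variable names are hypothetical and not part of this PR, and the two waits are illustrative stand-ins for the real steps:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical per-step overrides: each step keeps its own default unless overridden.
CO_PROGRESSING_TIMEOUT="${CO_PROGRESSING_TIMEOUT:-1200s}"   # large, cluster-size-dependent step
CNO_PATCH_TIMEOUT="${CNO_PATCH_TIMEOUT:-60s}"               # small, fixed-cost step

# Big step: tunable for large clusters.
oc wait co --all --for='condition=Progressing=False' --timeout="${CO_PROGRESSING_TIMEOUT}"

# Small step: keeps its 60s default unless a user explicitly raises it.
timeout "${CNO_PATCH_TIMEOUT}" bash -c \
  'until oc get network.operator.openshift.io cluster -o jsonpath="{.spec.migration.networkType}" | grep -q OVNKubernetes; do sleep 5; done'
```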

@vishnuchalla (Contributor, Author) commented Sep 1, 2023

> If this variable is set to some value (non-zero), then that is the timeout used for all steps, whereas before the timeouts were specific to each step, e.g. 1200s for the CO to stop progressing, 60s for the CNO to be patched, etc.
>
> That could really blow up on us if there is an issue. Would it be overkill to add specific timeouts for each step, so you can keep defaults like 60s for one thing but tweak something more major like CO progressing?

Yes, the defaults will stay in place. This change is to give users the flexibility to modify the timeout according to their use case. If they set a lower value, the resulting failure is their own doing, but if they set a greater one it will be beneficial. I also don't think a user who sets this timeout will do so without being aware of where and how it is used. And if someone reaches out with a failure caused by a timeout, we can simply redirect them to adjust the OVN_SDN_MIGRATION_TIMEOUT value for their own use case. It's totally up to them to decide.
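As a usage example, a job that expects a long migration on a large cluster might simply export a bigger value before the step runs (the value and the export mechanism are illustrative; in CI the variable would normally come from the job's environment configuration):

```bash
# A larger budget for a 120-node cluster (the 2700s figure mentioned below, doubled);
# leaving the variable unset keeps all existing per-step defaults.
export OVN_SDN_MIGRATION_TIMEOUT="5400s"
```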

@jluhrsen (Contributor) commented Sep 2, 2023

> Yes, the defaults will stay in place. This change is to give users the flexibility to modify the timeout according to their use case. If they set a lower value, the resulting failure is their own doing, but if they set a greater one it will be beneficial. I also don't think a user who sets this timeout will do so without being aware of where and how it is used. And if someone reaches out with a failure caused by a timeout, we can simply redirect them to adjust the OVN_SDN_MIGRATION_TIMEOUT value for their own use case. It's totally up to them to decide.

Gotcha, and it makes sense. I like the idea. My only concern is that if someone tweaks this timeout on purpose, they are presumably going to use a larger value (> 2700, for example), and that is going to override a 60s timeout, which is pretty dramatic. There are two cases where that applies. Anyway, if you think it's overkill to add a new variable for each specific timeout case, then no worries on my side.

@vishnuchalla (Contributor, Author)

> Gotcha, and it makes sense. I like the idea. My only concern is that if someone tweaks this timeout on purpose, they are presumably going to use a larger value (> 2700, for example), and that is going to override a 60s timeout, which is pretty dramatic. There are two cases where that applies. Anyway, if you think it's overkill to add a new variable for each specific timeout case, then no worries on my side.

Yes. Even if the value is larger, a timeout is just a timeout. Since a process does not need to wait until the timeout to finish, and it is only an upper limit on execution time, I think it should be fine.

@jtaleric (Contributor) commented Sep 7, 2023

/lgtm

@jluhrsen (Contributor) commented Sep 7, 2023

/lgtm

@openshift-ci bot (Contributor) commented Sep 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jluhrsen, jtaleric, pliurh, vishnuchalla

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) — Sep 7, 2023
@openshift-merge-robot merged commit af193b3 into openshift:master — Sep 7, 2023
prb112 pushed a commit to prb112/openshift-release that referenced this pull request Sep 12, 2023
heyselbi pushed a commit to heyselbi/release that referenced this pull request Sep 14, 2023
Labels
approved — Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm — Indicates that a PR is ready to be merged.
rehearsals-ack — Signifies that rehearsal jobs have been acknowledged.