
Parameterizing sdn to ovn migration timeout #42672

Conversation

@vishnuchalla (Contributor) commented Aug 24, 2023

Adding a generic timeout for the ovn-sdn-migration ref, as it is resulting in some job failures. More context here.
This option should give us the flexibility to decide on a generic timeout for all the steps involved. If not set, the steps will run with their default values.
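A minimal sketch of the single-knob approach, assuming the step script wraps its wait steps in explicit timeouts. The `OVN_SDN_MIGRATION_TIMEOUT` name and the 1200s/60s defaults come up later in this thread; the `oc` commands below are illustrative stand-ins, not taken from the PR:

```bash
#!/usr/bin/env bash
set -euo pipefail

# If OVN_SDN_MIGRATION_TIMEOUT is set, it overrides every per-step default;
# when it is left unset, each step keeps its original default.
co_timeout="${OVN_SDN_MIGRATION_TIMEOUT:-1200s}"   # cluster operators stop progressing
cno_timeout="${OVN_SDN_MIGRATION_TIMEOUT:-60s}"    # network operator becomes available

# Illustrative step: wait for all cluster operators to stop progressing.
oc wait co --all --for='condition=Progressing=False' --timeout="${co_timeout}"

# Illustrative step: wait for the cluster network operator deployment.
oc wait -n openshift-network-operator deployment/network-operator \
  --for='condition=Available=True' --timeout="${cno_timeout}"
```

If the variable is exported as, say, `2700s`, every wait gets the larger budget; if it is left unset, behaviour is unchanged.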

@openshift-ci-robot (Contributor)

[REHEARSALNOTIFIER]
@vishnuchalla: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.15-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.15-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.14-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.14-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.13-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.13-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.12-e2e-aws-ovn-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.12-e2e-aws-sdn-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.11-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.11-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.10-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.10-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.9-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.9-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-network-migration openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-network-migration-rollback openshift/cluster-network-operator presubmit Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-arm64-nightly-azure-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.10-amd64-nightly-azure-ipi-sdn-migration-ovn-f28 N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.8-e2e-network-migration N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.11-e2e-network-migration N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.8-e2e-network-migration-rollback N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-network-migration N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-azure-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-arm64-nightly-gcp-ipi-sdn-migration-ovn-f28 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-gcp-ipi-sdn-migration-ovn-f28 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.12-arm64-nightly-aws-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.12-e2e-network-migration-rollback N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.13-e2e-network-migration-rollback N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-aws-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.13-amd64-nightly-gcp-ipi-sdn-migration-ovn-p2-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-ec-aws-ipi-sdn-migration-ovn-f14 N/A periodic Registry content changed
periodic-ci-openshift-release-master-ci-4.11-e2e-network-migration-rollback N/A periodic Registry content changed

A total of 67 jobs have been affected by this change. The above listing is non-exhaustive and limited to 35 jobs.

A full list of affected jobs can be found here
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 10 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 20 rehearsals
Comment: /pj-rehearse max to run up to 35 rehearsals
Comment: /pj-rehearse auto-ack to run up to 10 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci bot requested review from danwinship and jluhrsen — August 24, 2023 23:11
@vishnuchalla (Contributor, Author)

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14

@vishnuchalla (Contributor, Author)

/assign @jtaleric
/assign @krishvoor

@vishnuchalla (Contributor, Author)

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14

@vishnuchalla (Contributor, Author)

/joke

@openshift-ci bot (Contributor) commented Aug 25, 2023

@vishnuchalla: What's the best thing about elevator jokes? They work on so many levels.

In response to this:

> /joke

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci bot (Contributor) commented Aug 25, 2023

@vishnuchalla: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14 5710db3 link unknown /pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-gcp-ipi-sdn-migration-ovn-f14

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vishnuchalla (Contributor, Author)

/assign @jluhrsen

@pliurh (Contributor) commented Aug 31, 2023

@vishnuchalla I guess you want to increase the timeout when running this script in a large cluster. May I ask what the size of the cluster is? And at which step is the current timeout not big enough?

@vishnuchalla (Contributor, Author)

> @vishnuchalla I guess you want to increase the timeout when running this script in a large cluster. May I ask what the size of the cluster is? And at which step is the current timeout not big enough?

We have a cluster with 120 worker nodes on which we plan to trigger some large-scale tests.

@vishnuchalla (Contributor, Author)

/pj-rehearse ack

@openshift-ci-robot added the rehearsals-ack label (signifies that rehearsal jobs have been acknowledged) — Aug 31, 2023
@pliurh (Contributor) commented Aug 31, 2023

/lgtm

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) — Aug 31, 2023
@danwinship (Contributor)

> I guess you want to increase the timeout when running this script in a large cluster.

It seems like it would be better to just count the number of nodes at the top and then adjust the timeouts based on that? (eg timeout $((num_nodes * 20)) bash)
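For comparison, a sketch of that node-scaled alternative. The 20-seconds-per-node multiplier comes from the comment above; the floor value and the wrapped wait loop are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Derive the timeout from cluster size instead of exposing a knob.
num_nodes=$(oc get nodes --no-headers | wc -l)
co_timeout=$(( num_nodes * 20 ))          # seconds, scaled per node
if (( co_timeout < 1200 )); then          # keep the current default as a floor
  co_timeout=1200
fi

# Illustrative wait loop bounded by the derived timeout.
timeout "${co_timeout}" bash -c \
  'until oc wait co --all --for=condition=Progressing=False --timeout=30s; do sleep 10; done'
```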

@vishnuchalla (Contributor, Author) commented Aug 31, 2023

> It seems like it would be better to just count the number of nodes at the top and then adjust the timeouts based on that? (eg timeout $((num_nodes * 20)) bash)

That might work in some cases, but we cannot guarantee it will be stable across all types of runs, and we might often have to revisit this part of the code to change that fixed value if something breaks. Instead, I think we should expose a variable as in this PR, so that users have the flexibility to experiment and decide on the right number for their use case. That way the code needs minimal changes, and the timeout can be set to whatever value the user wants.

@jluhrsen (Contributor) commented Sep 1, 2023

> That might work in some cases, but we cannot guarantee it will be stable across all types of runs, and we might often have to revisit this part of the code to change that fixed value if something breaks. Instead, I think we should expose a variable as in this PR, so that users have the flexibility to experiment and decide on the right number for their use case. That way the code needs minimal changes, and the timeout can be set to whatever value the user wants.

If this variable is set to some value (non-zero), then that is the timeout used for all steps, whereas before the timeouts were specific to each step, e.g. 1200s for the CO to stop progressing, 60s for the CNO to be patched, etc.

That could really blow up on us if there is an issue. Would it be overkill to add specific timeouts for each step, so you can keep defaults like 60s for one thing but tweak something more major like CO progressing?
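If per-step knobs were preferred instead, one possible shape is sketched below; the variable names are hypothetical and not part of this PR, and the two waits are illustrative stand-ins for the real steps:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical per-step overrides: each step keeps its own default unless overridden.
CO_PROGRESSING_TIMEOUT="${CO_PROGRESSING_TIMEOUT:-1200s}"   # large, cluster-size-dependent step
CNO_PATCH_TIMEOUT="${CNO_PATCH_TIMEOUT:-60s}"               # small, fixed-cost step

# Big step: tunable for large clusters.
oc wait co --all --for='condition=Progressing=False' --timeout="${CO_PROGRESSING_TIMEOUT}"

# Small step: keeps its 60s default unless a user explicitly raises it.
timeout "${CNO_PATCH_TIMEOUT}" bash -c \
  'until oc get network.operator.openshift.io cluster -o jsonpath="{.spec.migration.networkType}" | grep -q OVNKubernetes; do sleep 5; done'
```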

@vishnuchalla (Contributor, Author) commented Sep 1, 2023

> If this variable is set to some value (non-zero), then that is the timeout used for all steps, whereas before the timeouts were specific to each step, e.g. 1200s for the CO to stop progressing, 60s for the CNO to be patched, etc.
>
> That could really blow up on us if there is an issue. Would it be overkill to add specific timeouts for each step, so you can keep defaults like 60s for one thing but tweak something more major like CO progressing?

Yes, the defaults will stay in place. This change is to give users the flexibility to modify the timeout according to their use case. If they set a lower value, the resulting failure is their own doing, but if they set a greater one it will be beneficial. I also don't think a user who sets this timeout will do so without being aware of where and how it is used. And if someone reaches out with a failure caused by a timeout, we can simply redirect them to adjust the OVN_SDN_MIGRATION_TIMEOUT value for their own use case. It's totally up to them to decide.
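As a usage example, a job that expects a long migration on a large cluster might simply export a bigger value before the step runs (the value and the export mechanism are illustrative; in CI the variable would normally come from the job's environment configuration):

```bash
# A larger budget for a 120-node cluster (the 2700s figure mentioned below, doubled);
# leaving the variable unset keeps all existing per-step defaults.
export OVN_SDN_MIGRATION_TIMEOUT="5400s"
```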

@jluhrsen (Contributor) commented Sep 2, 2023

> Yes, the defaults will stay in place. This change is to give users the flexibility to modify the timeout according to their use case. If they set a lower value, the resulting failure is their own doing, but if they set a greater one it will be beneficial. I also don't think a user who sets this timeout will do so without being aware of where and how it is used. And if someone reaches out with a failure caused by a timeout, we can simply redirect them to adjust the OVN_SDN_MIGRATION_TIMEOUT value for their own use case. It's totally up to them to decide.

Gotcha, and it makes sense. I like the idea. My only concern is that if someone tweaks this timeout on purpose, they are presumably going to use a larger value (> 2700, for example), and that is going to override a 60s timeout, which is pretty dramatic. There are two cases where that applies. Anyway, if you think it's overkill to add a new variable for each specific timeout case, then no worries on my side.

@vishnuchalla (Contributor, Author)

> Gotcha, and it makes sense. I like the idea. My only concern is that if someone tweaks this timeout on purpose, they are presumably going to use a larger value (> 2700, for example), and that is going to override a 60s timeout, which is pretty dramatic. There are two cases where that applies. Anyway, if you think it's overkill to add a new variable for each specific timeout case, then no worries on my side.

Yes. Even if the value is larger, a timeout is just a timeout. Since a process does not need to wait until the timeout to finish, and it is only an upper limit on execution time, I think it should be fine.

@jtaleric (Contributor) commented Sep 7, 2023

/lgtm

@jluhrsen (Contributor) commented Sep 7, 2023

/lgtm

@openshift-ci bot (Contributor) commented Sep 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jluhrsen, jtaleric, pliurh, vishnuchalla

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) — Sep 7, 2023
@openshift-merge-robot merged commit af193b3 into openshift:master — Sep 7, 2023
prb112 pushed a commit to prb112/openshift-release that referenced this pull request Sep 12, 2023
heyselbi pushed a commit to heyselbi/release that referenced this pull request Sep 14, 2023
Labels
approved — Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm — Indicates that a PR is ready to be merged.
rehearsals-ack — Signifies that rehearsal jobs have been acknowledged.