Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPCLOUD-2431: rebase on upstream 1.29.0 release #285

Merged
merged 374 commits into from
Feb 2, 2024

Conversation

elmiko
Copy link

@elmiko elmiko commented Jan 29, 2024

This commit rebases the autoscaler on top of the Kubernetes/Autoscaler 1.29.0 release. There are several commits that we carry on top of the upstream autoscaler and the rebase process allows us to preserve those. Here is a description of the process I used to create this PR.

(inspired by the commit description for the 1.18 rebase. pr #139)

Process

First we need to identify the carry commits that we currently have, this is done against our previous rebase to catch new changes. Once identified we will drop commits which have merged upstream and only carry unique commits. (see below for the carried and dropped commits).

Identify carry commits (run from the openshift/master branch), these are the commits that begin with UPSTREAM: up until the merge commit for the previous rebase commit (merge upstream/cluster-autoscaler-1.28.0)

git log -n 30 --oneline --no-merges

After identifying the carry commits, the next step is to create the new commit-tree that will be used for the rebase and then cherry pick the carry commits into the new branch. The following commands cover these steps:

$ git remote update # make sure we update our refs
$ git checkout cluster-autoscaler-1.29.0
$ git checkout -b merge-tmp # create a temporary branch for our merge commit
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-1.29.0' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F - # create a new merge commit for our history
deadbeef12345678 # id of new merge commit
$ git branch merge-1.29 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.29
$ git cherry-pick <carry commits> # cherry pick the needed commits into the new branch

With the merge-1.29 branch in place, I cherry picked the carry commits which applied, resolved merge conflicts, and finally tested the resulting tree against the unit test and end-to-end suite.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are specific to the releases we create for OpenShift.

2120c26a5 UPSTREAM: <drop>: Temporarily reintroduce the flag --scale-up-from-zero-default-arch
10ff5b566 UPSTREAM: 6491: add informer argument to clusterapi provider builder
efeb8e308 UPSTREAM: <carry>: Fix unstructured taint parsing in Cluster API provider
1f937037f UPSTREAM: <carry>: revert capacity annotations
b3faab26a UPSTREAM: <carry>: add machine api label and taint functionality
5ad4b142e UPSTREAM: <carry>: update os::util::list_test_packages_under
cca0d470b UPSTREAM: <carry>: remove deprecated go test command
9ce518c79 UPSTREAM: <carry>: Have VPA ignore phantom containers named "POD"
9b1d287f3 UPSTREAM: <carry>: Handle old Machine API specific machine delete annotation
c3ff7aaad UPSTREAM: <carry>: Rename FailureMessage to ErrorMessage
fcc7648db UPSTREAM: <carry>: configure repository for OpenShift releases

Dropped Commits

These commits were squashed into the fcc7648db commit.

d017ad320 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler-container image to be consistent with ART for 4.16 Reconciling with https://github.com/openshift/ocp-build-data/tree/79948f91df396fed6784d24a983f2a6acb2c4fb8/images/atomic-openshift-cluster-autoscaler.yml
e239872a0 UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART for 4.16 Reconciling with https://github.com/openshift/ocp-build-data/tree/17f81aa502d841340a0b97ae1365568f8253b6ec/images/vertical-pod-autoscaler.yml
dd8ccf7d9 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler-container image to be consistent with ART for 4.16 Reconciling with https://github.com/openshift/ocp-build-data/tree/b7ba6c3881993b1c2e0a96905cce738d604e1f1d/images/atomic-openshift-cluster-autoscaler.yml
908ccaaf9 UPSTREAM: <carry>: Add Snyk file to exclude vendor directory on scan
57d3f0f3e UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/4022cd290f00a44d667dda03f2d78d84a488c7ed/images/vertical-pod-autoscaler.yml
050999591 UPSTREAM: <carry>: Sync OWNERS file
67cfac729 UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/e5df10e41b14c724468c8f78e2690f219279bfc4/images/vertical-pod-autoscaler.yml
1f412862e UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/79604dfeca884ebe3d33791aaca81b0a77d21d2c/images/vertical-pod-autoscaler.yml
db61a2ee9 UPSTREAM: <carry>: Add vendor for balancer

This commit was dropped

efea62a0b UPSTREAM: 6066: Allow overriding the kubernetes.io/arch label set by the scale from zero methods via a new cmdline arg

Of special note in this rebase is this commit

c74af5632 UPSTREAM: <carry>: revert capacity annotations

due to the scale from zero changes being accepted upstream we can now drop our carried patch. but, the upstream implementation has differed slightly from our's (mainly around annotation names). we will need to carry this patch until we can fix all the providers to properly use the new annotations. This patch should be able to be dropped in the 4.17 release of openshift, see https://issues.redhat.com/browse/OCPCLOUD-944.

raywainman and others added 30 commits September 26, 2023 19:58
Fix vpa-up.sh script by removing duplicate recommender
…port

Adds support for startup taints and extends support for status taints
CA - Update k/k vendoring to 1.29.0-alpha.1
…belling

CA - Update CAPI Cloudprovider PR labeling to existing label
Trim managedFields in shared informer factory
InitLogs unconditionally disables contextual logging, while ValidateAndApply
checks the feature gate for that. InitLogs must come first, otherwise

    --feature-gates=ContextualLogging=true

doesn't work.
…le-dynamic

Refactor scale-down for better integration with drainability rules
When PermissionToAddNode gets called without actually adding a new node, the
node counter in the thresholdBasedEstimationLimiter gets out of sync with the
actual number of new nodes. This can happen when the "is the last node empty"
check triggers.

The solution used here is to rearrange the checks so that PermissionToAddNode
is followed by adding a new node. Alternatively, it might also be possible
to pass the current number of nodes as parameter.
It's possible that an EC2 instance is in the Terminated state for
multiple autoscaler loops, so to avoid trying to terminate it more
than once and causing the bug seen for this initial fix, we include
this check as well.

Also, make the log more verbose and write out what state we found the
instance in.
Fixing clutter-autoscaler docs to point to correct tutorial for auto-discovery
aws: check for all possible Terminated states
Bump default VPA version to 1.0.0 in master
Fix duplicate -addext when generating certificates with admission-controller/gencerts.sh
@elmiko
Copy link
Author

elmiko commented Jan 31, 2024

@aleskandro i was looking at your pr as i was cherry-picking the commits, it seems like the code to support the environment variable is in here, do we need an extra patch?

@elmiko
Copy link
Author

elmiko commented Jan 31, 2024

i read the CAO thread, i think we should just add the no-op commit as a drop here.

@aleskandro
Copy link
Member

aleskandro commented Jan 31, 2024

Hey @elmiko, the operator is still setting the --scale-up-from-zero-default-arch parameter in the Args of the autoscaler's deployment, but with the rebase, this parameter is not defined anymore. I'm getting a doubt now as I don't remember anymore the tests that I did at the time: can the autoscaler accept unrecognised command line arguments? My commit linked above redefined the flag to let the autoscaler recognize it

elmiko and others added 10 commits January 31, 2024 17:23
This change carries files and modifications that are used by OpenShift
release infrastructure and related files.

* spec file
* dockerfiles
  * vertical-pod-autoscaler/Dockerfile.rhel
  * images/cluster-autoscaler/Dockerfile
  * images/cluster-autoscaler/Dockerfile.rhel
* hack scripts (ci and build related)
* Makefile
* JUnit tools
* update gitignore
* update/remove OWNERS files
* ci-operator config yaml
* update vendor folders
* add vendor folder for balancer
* add Snyk file to exclude vendor directory on scan
…otation

The delete annotation upstream has a different format, but is now
inferred dynamically from the API group. If we update this in MAO to use
the new format, we can drop this old key
this change removes the `go test -i ...` command from the
hack/test-go.sh file. this option is deprecated in the go command and
attempts to install dependencies of the test. when this is enabled with
the build tags we are using, it appears to make the test builds unstable
in CI.
this change adds all the cloudproviders except clusterapi to the list of
ignored directories in for the cluster-autoscaler tests. this is being
done so that the tests match our build tags and do not cause errors when
passing the package list to the test script.
This change re-adds the machine api support for labels and taints on
node groups. The code was removed upstream as it is openshift specific,
see this pull request[0].

It also adds in the functionality of the upstream override annotation
for labels and taints[1] to support
https://issues.redhat.com/browse/MIXEDARCH-259

[0]: kubernetes#5249
[1]: kubernetes#5382
the upstream annotations for the scale from zero capacity resources is
slighty different than the openshift implementation. the largest
difference is the addition of a gpu type annotation. openshift does not
yet utilize this annotation and thus this patch should be carried until
the machineset controllers for the various providers on openshift have
been modified to use the new annotations.

another important change is the modification of the memory annotation.
previously in openshift we expected this value to be a count of memory
in Mebibytes. the conversion function and tests have been modified to
allow continued openshift operation.

this change can be dropped when the annotations in openshift have been
updated, the progress for this effort can be followed at
https://issues.redhat.com/browse/OCPCLOUD-944
This change adds the informer factory as an argument to the
`buildCloudProvider` function for clusterapi so that building with tags
will work properly.
@elmiko
Copy link
Author

elmiko commented Jan 31, 2024

adding a fix for the goimports failure

@elmiko
Copy link
Author

elmiko commented Feb 1, 2024

looks like the failures are coming from the missing flag now, i will add that patch tomorrow.

…ro-default-arch

This commit reintroduces the flag --scale-up-from-zero-default-arch. It is a temporary workaround to allow a
smoother transition from the downstream changes in openshift#261 to the upstream ones in
kubernetes#6066. It can be removed after the cluster-autoscaler-operator
is updated not to set this flag anymore.
Note that the flag is a no-op because the upstream behavior is guaranteed
by setting up the expected CAPI environment variable, as implemented in openshift/cluster-autoscaler-operator#297

See openshift/cluster-autoscaler-operator#311 for more details.
@elmiko
Copy link
Author

elmiko commented Feb 1, 2024

i added @aleskandro 's no-op patch as a drop

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 1, 2024

@elmiko: This pull request references OCPCLOUD-2431 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

This commit rebases the autoscaler on top of the Kubernetes/Autoscaler 1.29.0 release. There are several commits that we carry on top of the upstream autoscaler and the rebase process allows us to preserve those. Here is a description of the process I used to create this PR.

(inspired by the commit description for the 1.18 rebase. pr #139)

Process

First we need to identify the carry commits that we currently have, this is done against our previous rebase to catch new changes. Once identified we will drop commits which have merged upstream and only carry unique commits. (see below for the carried and dropped commits).

Identify carry commits (run from the openshift/master branch), these are the commits that begin with UPSTREAM: up until the merge commit for the previous rebase commit (merge upstream/cluster-autoscaler-1.28.0)

git log -n 30 --oneline --no-merges

After identifying the carry commits, the next step is to create the new commit-tree that will be used for the rebase and then cherry pick the carry commits into the new branch. The following commands cover these steps:

$ git remote update # make sure we update our refs
$ git checkout cluster-autoscaler-1.29.0
$ git checkout -b merge-tmp # create a temporary branch for our merge commit
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-1.29.0' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F - # create a new merge commit for our history
deadbeef12345678 # id of new merge commit
$ git branch merge-1.29 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.29
$ git cherry-pick <carry commits> # cherry pick the needed commits into the new branch

With the merge-1.29 branch in place, I cherry picked the carry commits which applied, resolved merge conflicts, and finally tested the resulting tree against the unit test and end-to-end suite.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are specific to the releases we create for OpenShift.

2120c26a5 UPSTREAM: <drop>: Temporarily reintroduce the flag --scale-up-from-zero-default-arch
10ff5b566 UPSTREAM: 6491: add informer argument to clusterapi provider builder
efeb8e308 UPSTREAM: <carry>: Fix unstructured taint parsing in Cluster API provider
1f937037f UPSTREAM: <carry>: revert capacity annotations
b3faab26a UPSTREAM: <carry>: add machine api label and taint functionality
5ad4b142e UPSTREAM: <carry>: update os::util::list_test_packages_under
cca0d470b UPSTREAM: <carry>: remove deprecated go test command
9ce518c79 UPSTREAM: <carry>: Have VPA ignore phantom containers named "POD"
9b1d287f3 UPSTREAM: <carry>: Handle old Machine API specific machine delete annotation
c3ff7aaad UPSTREAM: <carry>: Rename FailureMessage to ErrorMessage
fcc7648db UPSTREAM: <carry>: configure repository for OpenShift releases

Dropped Commits

These commits were squashed into the fcc7648db commit.

d017ad320 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler-container image to be consistent with ART for 4.16 Reconciling with https://github.com/openshift/ocp-build-data/tree/79948f91df396fed6784d24a983f2a6acb2c4fb8/images/atomic-openshift-cluster-autoscaler.yml
e239872a0 UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART for 4.16 Reconciling with https://github.com/openshift/ocp-build-data/tree/17f81aa502d841340a0b97ae1365568f8253b6ec/images/vertical-pod-autoscaler.yml
dd8ccf7d9 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler-container image to be consistent with ART for 4.16 Reconciling with https://github.com/openshift/ocp-build-data/tree/b7ba6c3881993b1c2e0a96905cce738d604e1f1d/images/atomic-openshift-cluster-autoscaler.yml
908ccaaf9 UPSTREAM: <carry>: Add Snyk file to exclude vendor directory on scan
57d3f0f3e UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/4022cd290f00a44d667dda03f2d78d84a488c7ed/images/vertical-pod-autoscaler.yml
050999591 UPSTREAM: <carry>: Sync OWNERS file
67cfac729 UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/e5df10e41b14c724468c8f78e2690f219279bfc4/images/vertical-pod-autoscaler.yml
1f412862e UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/79604dfeca884ebe3d33791aaca81b0a77d21d2c/images/vertical-pod-autoscaler.yml
db61a2ee9 UPSTREAM: <carry>: Add vendor for balancer

This commit was dropped

efea62a0b UPSTREAM: 6066: Allow overriding the kubernetes.io/arch label set by the scale from zero methods via a new cmdline arg

Of special note in this rebase is this commit

c74af5632 UPSTREAM: <carry>: revert capacity annotations

due to the scale from zero changes being accepted upstream we can now drop our carried patch. but, the upstream implementation has differed slightly from our's (mainly around annotation names). we will need to carry this patch until we can fix all the providers to properly use the new annotations. This patch should be able to be dropped in the 4.17 release of openshift, see https://issues.redhat.com/browse/OCPCLOUD-944.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

openshift-ci bot commented Feb 1, 2024

@elmiko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/git-history 2120c26 link false /test git-history

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@elmiko
Copy link
Author

elmiko commented Feb 1, 2024

tests looking good

.snyk Outdated
# https://docs.snyk.io/snyk-cli/commands/ignore
exclude:
global:
- vendor/**

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs quoting, can we fix it up?

@JoelSpeed
Copy link

Fixup is minor
Lets merge and sort later
/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 2, 2024
Copy link

openshift-ci bot commented Feb 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit cb4f7f2 into openshift:master Feb 2, 2024
11 of 12 checks passed
@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

This PR has been included in build atomic-openshift-cluster-autoscaler-container-v4.16.0-202402021641.p0.gcb4f7f2.assembly.stream for distgit atomic-openshift-cluster-autoscaler.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.