Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2104619: Remove rollback deployment #3243

Merged
merged 2 commits into from
Jul 13, 2022

Conversation

jmarrero
Copy link
Member

This reverts commit 25a7812.
to provide a fix to:
https://bugzilla.redhat.com/show_bug.cgi?id=2104619

@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jul 11, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 11, 2022

@jmarrero: This pull request references Bugzilla bug 1907333, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target the "4.12.0" release, but it targets "4.7.0" instead
  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is CLOSED (ERRATA) instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Revert "Bug 1907333: daemon: Revert code to remove rollback"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Copy link
Member

Let's retitle the commit to be: Bug 2104619: Remove rollback deployment or so (the OCP bugbot is getting confused by the prior BZ reference)

One thing to debate here too is whether we should always remove the rollback, or whether we should only do so before upgrades.

If we always remove the rollback, it kind of obviates the grub password bits, and means it will never be available if administrators need it for some reason.

(This is the same debate as we need to have in ostreedev/ostree#2670 - do we always cleanup or do we only do so when we need to because of disk spaces issues)

@@ -720,6 +720,20 @@ func (dn *Daemon) updateHypershift(oldConfig, newConfig *mcfgv1.MachineConfig, d
return nil
}

// removeRollback removes the rpm-ostree rollback deployment. It
// takes up space, and we don't generally expect administrators to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's tweak this comment now to link to https://bugzilla.redhat.com/show_bug.cgi?id=2104619
(IIRC when I was talking about "space" there I was mainly thinking of / not /boot, though both are valid)

Comment on lines 733 to 734
_, err := runGetOut("rpm-ostree", "cleanup", "-r")
return err
Copy link
Member

@cgwalters cgwalters Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can now be return runRpmOstree("cleanup", "-r")

@jmarrero jmarrero changed the title Revert "Bug 1907333: daemon: Revert code to remove rollback" Bug 2104619: Remove rollback deployment Jul 11, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. and removed bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. labels Jul 11, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 11, 2022

@jmarrero: This pull request references Bugzilla bug 2104619, which is invalid:

  • expected the bug to target the "4.12.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2104619: Remove rollback deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Prashanth684
Copy link
Contributor

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jul 11, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 11, 2022

@Prashanth684: This pull request references Bugzilla bug 2104619, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.12.0) matches configured target release for branch (4.12.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @djslavens

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from djslavens July 11, 2022 15:10
@jmarrero
Copy link
Member Author

jmarrero commented Jul 11, 2022

One thing to debate here too is whether we should always remove the rollback, or whether we should only do so before upgrades.

If we always remove the rollback, it kind of obviates the grub password bits, and means it will never be available if administrators need it for some reason.

(This is the same debate as we need to have in ostreedev/ostree#2670 - do we always cleanup or do we only do so when we need to because of disk spaces issues)

Does the MCO has a way to check for a Config Map value for example? It could be something where we always delete unless "KEEP_ROLLBACK" is set. That way is more intentional in case an Admin is trying to debug something.

If we just do it for upgrades, (If you mean OCP upgrades) then if someone install kernel-rt we might get caught again by the lack of space.

@travier
Copy link
Member

travier commented Jul 11, 2022

Should we restrict this workaround to ppc64le only or maybe check for X% amount of free space before cleaning-up?

@jmarrero
Copy link
Member Author

jmarrero commented Jul 11, 2022

Should we restrict this workaround to ppc64le only or maybe check for X% amount of free space before cleaning-up?

I don't have strong opinions on this, but I think we might start seeing the same issues in other architectures eventually. If we need to check for a specific condition I would prefer space.

@jmarrero
Copy link
Member Author

jmarrero commented Jul 11, 2022

We could also merge this this as is, and change it once: ostreedev/ostree#2670 lands and change the MCO/RHCOS to just add the new config in a future version?

@sinnykumari
Copy link
Contributor

Should we restrict this workaround to ppc64le only or maybe check for X% amount of free space before cleaning-up?

I will lean towards keeping it consistent across all arches

// do not attempt to rollback on non-RHCOS/FCOS machines
return nil
}
return runRpmOstree("cleanup", "r")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably already noticed, but if not -- looks like the dash on the r is missing ( -r ) 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch John :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops, thanks for catching that.

@travier
Copy link
Member

travier commented Jul 11, 2022

I will lean towards keeping it consistent across all arches

Unfortunately this is slightly weakening our "resiliency on upgrade failure" stance thus the suggestion for using this only when needed. But I also agree that consistency helps with debugging.

@sinnykumari
Copy link
Contributor

My thoughts were also in line with what Colin's point.
/approve

Will let other reviewers add lgtm

@jmarrero Thanks for working on the fix. As a follow-up can you also create a jira card in MCO board to track removal of this workaround when ostreedev/ostree#2670 is fixed.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2022
@jmarrero
Copy link
Member Author

/retest-required

@sinnykumari
Copy link
Contributor

/retest

@sinnykumari
Copy link
Contributor

sinnykumari commented Jul 12, 2022

gcp-op test failure:

    mcd_config_drift.go:273: Setting contents of /rootfs/etc/etc-file on ci-op-qs493q58-1354f-fvtgf-worker-c-n7gsj to expected-file-data
    mcd_config_drift.go:242: 
        	Error Trace:	helpers.go:33
        	            				mcd_config_drift.go:242
        	Error:      	Not equal: 
        	            	expected: "BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-a7cbdbb7fe083f943e4041a8f16a8bab9d20520eaa303c1080351c2aaa292838/vmlinuz-4.18.0-372.13.1.el8_6.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/a7cbdbb7fe083f943e4041a8f16a8bab9d20520eaa303c1080351c2aaa292838/0 ignition.platform.id=gcp root=UUID=45d3e726-4780-41ff-a3e5-29dc3eaaa7fc rw rootflags=prjquota boot=UUID=01b28731-c73c-45cd-9f90-e0ca753bcd47 foo=bar foo=baz\n"
        	            	actual  : "BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-a7cbdbb7fe083f943e4041a8f16a8bab9d20520eaa303c1080351c2aaa292838/vmlinuz-4.18.0-372.13.1.el8_6.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.0/rhcos/a7cbdbb7fe083f943e4041a8f16a8bab9d20520eaa303c1080351c2aaa292838/0 ignition.platform.id=gcp root=UUID=45d3e726-4780-41ff-a3e5-29dc3eaaa7fc rw rootflags=prjquota boot=UUID=01b28731-c73c-45cd-9f90-e0ca753bcd47 foo=bar foo=baz\n"

Since we are doing rollback, index of boot changes from 1 to 0 and hence mismatch is happening.
Looks like we are checking for all the kargs in the test https://github.com/openshift/machine-config-operator/blob/master/test/e2e-shared-tests/mcd_config_drift.go#L242 .
@cheesesashimi wondering if need to check kargs in this test considering we are not performing any karg update in this test? kargs test should be ideally handled by existing TestKernelArguments test.

@sinnykumari
Copy link
Contributor

Both e2e aws and agnostic-upgrade are failing because of known infra issue Failed while waiting on imagestream import

The Config Drift Monitor has no interaction with kernel args. This
particular case is better covered by the TestKernelArguments test.
@sinnykumari
Copy link
Contributor

Since aws test will most likely fail, let's test upgrade on gcp
/test e2e-gcp-upgrade

@sinnykumari
Copy link
Contributor

/test e2e-azure-upgrade
/test e2e-gcp-upgrade

@sinnykumari
Copy link
Contributor

/test e2e-gcp-op

@jmarrero
Copy link
Member Author

Looks like the image import affects gcp and azure too?

@sinnykumari
Copy link
Contributor

Looks like imagestream issue is fixed now, some of our recent test are green. Let's give another try
/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 13, 2022

@jmarrero: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-upgrade 01af7ed link false /test e2e-gcp-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sinnykumari
Copy link
Contributor

Tests are looking good, let's merge it
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 13, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 13, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmarrero, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot merged commit dd64813 into openshift:master Jul 13, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 13, 2022

@jmarrero: All pull requests linked via external trackers have merged:

Bugzilla bug 2104619 has been moved to the MODIFIED state.

In response to this:

Bug 2104619: Remove rollback deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Copy link
Contributor

/cherry-pick release-4.11

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 13, 2022
@openshift-cherrypick-robot

@sinnykumari: new pull request created: #3249

In response to this:

/cherry-pick release-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cgwalters added a commit to cgwalters/os that referenced this pull request Jul 18, 2022
This is a hack to work around the lack of
coreos/rpm-ostree@0556152
for RHEL 8.6 and below.

A recent PR in the MCO openshift/machine-config-operator#3243
tipped things over the edge and we now see failures a lot more often.
cgwalters added a commit to cgwalters/os that referenced this pull request Jul 19, 2022
This is a hack to work around the lack of
coreos/rpm-ostree@0556152
for RHEL 8.6 and below.

A recent PR in the MCO openshift/machine-config-operator#3243
tipped things over the edge and we now see failures a lot more often.
@travier
Copy link
Member

travier commented Jul 19, 2022

We've cherry-picked this one into 4.11, but don't we also need to cherry-pick it into 4.10?
I've read https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md but it's not clear to me if the 4.11 MCO is used to upgrade a 4.10 node to 4.11.

@cgwalters
Copy link
Member

cgwalters commented Jul 19, 2022

We've cherry-picked this one into 4.11, but don't we also need to cherry-pick it into 4.10?

This is a great question! I have had to dig into this at least twice before and I forget the answer, but clearly we need to add the answer to the docs.

OK right, the answer is in the ordering listed here:

var syncFuncs = []syncFunc{

Notice that the MCD is rolled out before we block on the pools.

So I don't think we need to backport this to 4.10 - but that said, doing so seems like a good idea.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/os that referenced this pull request Jul 19, 2022
This is a hack to work around the lack of
coreos/rpm-ostree@0556152
for RHEL 8.6 and below.

A recent PR in the MCO openshift/machine-config-operator#3243
tipped things over the edge and we now see failures a lot more often.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants