Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auto-update: simple rollback #11074

Merged
merged 1 commit into from
Aug 6, 2021

Conversation

vrothberg
Copy link
Member

@vrothberg vrothberg commented Jul 29, 2021

Add support for simple rollbacks during podman auto-update. Rollbacks
are enabled by default. If a systemd unit cannot be restarted after an
update, the previous image will be retagged and the unit will be
restarted a second time.

Add system tests for rollbacks. Also fix a bug in the restart sequence;
we have to use the channel to actually know whether the restart was
successful or not.

NOTE: To make rollbacks really useful, users must run their containers
with --sdnotify=container such that the containers send the ready
message over the (mounted) socket. This way, restarting the systemd
units during auto update will block until the message has been received
(or a timeout kicked in).

Signed-off-by: Valentin Rothberg [email protected]

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2021
@vrothberg
Copy link
Member Author

@edsantiago, I'd love your eyes on the system tests.

@vrothberg
Copy link
Member Author

FYI: @fatherlinux @mrguitar @nullr0ute

@containers/podman-maintainers PTAL

@vrothberg
Copy link
Member Author

Restarted. Had to make the CI image quay.io/libpod/autoupdatebroken public. It worked locally since I was logged into quay.

@Luap99
Copy link
Member

Luap99 commented Jul 29, 2021

What happens if the application in the container crashes (exits with non zero exit code)? Will the rollback be performed? Or does it only work when podman run -d ... fails.

@vrothberg
Copy link
Member Author

What happens if the application in the container crashes (exits with non zero exit code)? Will the rollback be performed? Or does it only work when podman run -d ... fails.

Good call. I tested only with an empty image. Let me add a test when the command inside the container fails. I guess we need some massaging for that.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 29, 2021
@vrothberg
Copy link
Member Author

vrothberg commented Jul 29, 2021

What happens if the application in the container crashes (exits with non zero exit code)? Will the rollback be performed? Or does it only work when podman run -d ... fails.

OK. It only fails when podman run -d ... fails. Which makes sense since we are only looking at conmon which has started successfully.

I think we can merge as it. The remainder may require healthchecks.

@vrothberg vrothberg changed the title auto-update: simple rollback WIP - auto-update: simple rollback Jul 29, 2021
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 29, 2021
@vrothberg
Copy link
Member Author

Moving to WIP. I had a chat with @giuseppe to discuss a solution. We gravitated toward letting the container send sdnotify messages such that a restart will wait until the container send message. To do that, we need to massage the generate-systemd code a bit to not always enforce --sdnotify=conmon.

Will tackle that tomorrow.

@edsantiago
Copy link
Member

@vrothberg can you make sure the autoupdatebroken image is multiarch-safe please?

@vrothberg vrothberg force-pushed the auto-update-rollback branch from b5f087a to e8467a1 Compare August 4, 2021 11:35
@vrothberg vrothberg changed the title WIP - auto-update: simple rollback auto-update: simple rollback Aug 4, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 4, 2021
@vrothberg
Copy link
Member Author

vrothberg commented Aug 4, 2021

@vrothberg can you make sure the autoupdatebroken image is multiarch-safe please?

It's now a manifest list with one image for amd64 👍

@vrothberg
Copy link
Member Author

not ok 188 podman auto-update - label io.containers.autoupdate=local with rollback

@edsantiago, do I need to the "socat dance" here as in the sdnotify tests?

@edsantiago
Copy link
Member

It's now a manifest list with one image for amd64

We need the actual images (or, the multiarch-testing group does). See https://github.com/containers/podman/blob/cbad5616961520831b1f169f03da2a9f81203f71/test/system/build-testimage for an example of how to produce those images.

@edsantiago
Copy link
Member

@vrothberg I'm sorry; I've looked at this one off-and-on today, but been unable to understand what's going wrong. My intuition is telling me "please do not use alpine", because that is docker.io/alpine, which is rate-limited, which is going to randomly screw up CI runs. I don't think that's the cause of the problem, but it's all I can recommend for now.

@vrothberg
Copy link
Member Author

It's now a manifest list with one image for amd64

We need the actual images (or, the multiarch-testing group does).

Can you elaborate on "actual images"?

See https://github.com/containers/podman/blob/cbad5616961520831b1f169f03da2a9f81203f71/test/system/build-testimage for an example of how to produce those images.

I'll add images for arm64, ppc64le and s390x to the manifest list.

@vrothberg
Copy link
Member Author

I'll add images for arm64, ppc64le and s390x to the manifest list.

Done ✔️

$ skopeo inspect --raw docker://quay.io/libpod/autoupdatebroken:latest|jq .
{                                                                                                        
  "schemaVersion": 2,                                                                                    
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",                              
  "manifests": [                                                                                         
    {                                                                                                    
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",                               
      "size": 422,                                                                                       
      "digest": "sha256:40bc10ecedfbea7594e452757b3bd2b2440317e1c296bdf91c4545f539995c67",               
      "platform": {                                                                                      
        "architecture": "amd64",                                                                         
        "os": "linux"                                                                                    
      }                                                                                                  
    },                                                                                                   
    {                                                                                                    
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",                               
      "size": 422,                                                                                       
      "digest": "sha256:a1cdef8749843f33eb5af1ec86dac4728cfa3d940119d37dbda6e9692d2d0592",               
      "platform": {                                                                                      
        "architecture": "arm64",                                                                         
        "os": "linux"                                                                                    
      }                                                                                                  
    },                                                                                                   
    {                                                                                                    
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",                               
      "size": 422,                                                                                       
      "digest": "sha256:ef3e97abba2c003766a660c5ffc957091346509f9e3836585ff2ea84324fb193",               
      "platform": {                                                                                      
        "architecture": "ppc64le",                                                                       
        "os": "linux"                                                                                    
      }                                                                                                  
    },                                                                                                   
    {                                                                                                    
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",                               
      "size": 422,                                                                                       
      "digest": "sha256:bb228dfe14265aeb8f458fad72fbad1b13e42badce0f5eee95c3896e833ab66c",               
      "platform": {                                                                                      
        "architecture": "s390x",                                                                         
        "os": "linux"                                                                                    
      }                                                                                                  
    }                                                                                                    
  ]                                                                                                      
}                                                                                                        

@vrothberg vrothberg force-pushed the auto-update-rollback branch from e8467a1 to a041246 Compare August 5, 2021 11:39
@vrothberg
Copy link
Member Author

I don't think that's the cause of the problem, but it's all I can recommend for now.

Thanks a lot for checking, @edsantiago! I restarted one of the two failed jobs earlier and it passed. It smells like bug/flake that I don't want to introduce with this PR. I repushed with the updates tests to see how it's going now.

@vrothberg vrothberg force-pushed the auto-update-rollback branch from a041246 to b313ff5 Compare August 5, 2021 11:50
Add support for simple rollbacks during `podman auto-update`.  Rollbacks
are enabled by default.  If a systemd unit cannot be restarted after an
update, the previous image will be retagged and the unit will be
restarted a second time.

Add system tests for rollbacks.  Also fix a bug in the restart sequence;
we have to use the channel to actually know whether the restart was
successful or not.

NOTE: To make rollbacks really useful, users must run their containers
with `--sdnotify=container` such that the containers send the ready
message over the (mounted) socket.  This way, restarting the systemd
units during auto update will block until the message has been received
(or a timeout kicked in).

Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg vrothberg force-pushed the auto-update-rollback branch from b313ff5 to 30df551 Compare August 5, 2021 13:20
@vrothberg
Copy link
Member Author

First run green. Let's see what the second one does.

@vrothberg
Copy link
Member Author

The empirical law: once is never, twice is always :)

PTAL

@rhatdan
Copy link
Member

rhatdan commented Aug 5, 2021

LGTM

@fatherlinux
Copy link
Contributor

This sounds like a great feature. I assume it just tags the penultimate image layer as latest and restarts? It obviously can't go modify the image on the registry right?

@vrothberg
Copy link
Member Author

This sounds like a great feature. I assume it just tags the penultimate image layer as latest and restarts?

That's right. If we have an image quay.io/foo:latest and pull down a new image, the previous one will be untagged. If the restart failed, we'll rollback by retagging the old image with quay.io/foo:latest and doing another restart.

It obviously can't go modify the image on the registry right?

In some cases it could. Are you thinking about some kind of self-healing mechanism to prevent other nodes from pulling a potentially broken image?

@vrothberg
Copy link
Member Author

@containers/podman-maintainers PTAL

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrothberg
Copy link
Member Author

ready to merge :)

@rhatdan
Copy link
Member

rhatdan commented Aug 6, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 6, 2021
@rhatdan
Copy link
Member

rhatdan commented Aug 6, 2021

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 6, 2021
@openshift-ci openshift-ci bot merged commit 8409817 into containers:main Aug 6, 2021
@vrothberg vrothberg deleted the auto-update-rollback branch August 6, 2021 12:42
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants