Elastic Agent docker: new revisions are not getting released #24198

Closed
mtojek opened this issue Feb 24, 2021 · 14 comments

mtojek (Contributor) commented Feb 24, 2021

Hi,

Due to a bug introduced in the 7.x (and 7.12) branches, the latest snapshot is failing. The bug has been fixed in #24163 and #24161, but the Docker image hasn't been released due to other issues in the unified release process.

With this problem in the SNAPSHOT, package development has been blocked for a few days in elastic/integrations and elastic/elastic-package (all statuses are red). We could hardcode the correct Docker image in both repos, but that means a lot of operational work for us: start each day by reviewing all Beats branches to check for blockers (failed builds), then update a map of hardcoded image references in elastic-package (for 7.13, 7.12, 7.11).

The goal of this issue is to figure out and implement a way of publishing the Agent's images independently of other, potentially problematic parts of the release.

cc @andresrc @ph @ruflin @ycombinator

@mtojek mtojek added the Agent label Feb 24, 2021
@botelastic botelastic bot added the needs_team label Feb 24, 2021
ruflin (Contributor) commented Feb 24, 2021

I'm not sure the goal should be to introduce a separate build process. It is unfortunate that the build is broken, and we should ensure on the Beats / Agent end that we detect such problems before they are merged.

Is package development really blocked? It seems the master snapshots are available, so development against master should still work?

mtojek (Contributor, Author) commented Feb 24, 2021

It is, because we can't test against current/older releases. Let me bring up a few issues:

  • elastic/integrations#736 - I can't re-verify whether the problem is fixed for 7.11.
  • elastic/elastic-package#260 - updating the package spec and checking whether it's correct (packages can be installed in a particular stack version).
  • elastic/integrations#740 - package compatibility check with older stacks (work in progress, but won't be possible).
  • elastic/elastic-package#261 - can't verify whether we can bump the stack to 7.13 and everything still works. We would have to do it blindly and keep our fingers crossed that it eventually works.

Also:

  • Currently every CI status blinks red for master; we can't just change it to 8.0.0, as there is a risk of introducing faulty PRs (valid for 8.0.0, not valid for 7.12/7.13).
  • Integrations devs will start ignoring the CI status, as it can't even bring the Elastic stack to a stable state ("CI issues unrelated").

> I'm not sure the goal should be to introduce a separate build process. It is unfortunate that the build is broken, and we should ensure on the Beats / Agent end that we detect such problems before they are merged.

My impression is that there is too much coupling in the build process, such that even correct products/bugfixes are blocked from being released by a single failing item.

@mtojek mtojek removed the needs_team label Feb 24, 2021
@botelastic botelastic bot added the needs_team label Feb 24, 2021
@mtojek mtojek added the Team:Elastic-Agent label Feb 24, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team label Feb 24, 2021
@mtojek mtojek added the needs_team label Feb 24, 2021
@botelastic botelastic bot removed the needs_team label Feb 24, 2021
botelastic bot commented Feb 24, 2021

This issue doesn't have a Team:<team> label.

andresrc (Contributor) commented:

Why are old versions a problem?

mtojek (Contributor, Author) commented Feb 24, 2021

It's because of the affected branches (backports not yet released):

  • 7.x contains changes scheduled for 7.13
  • 7.12

7.11 is old enough that we'll skip testing for some packages (those compatible with newer Kibana versions).

EDIT:

Yet another build has just failed, which probably means another ~12h of delay.

ph (Contributor) commented Feb 24, 2021

@mtojek Just to clarify: the problem is that the artifacts produced by the unified process are old and do not include the latest fixes, because the new builds aren't completing?

First, are we responsible for the failure of the build? Are our Beats or Elastic Agent breaking the build? That we can control and prioritize; if that is not the case, we need to look into it with infra, who are looking into better ways to notify us.

Like @ruflin said, I don't think introducing a new build is the solution. We have many moving parts (Elastic Agent, Endpoint, Beats, ES, and Kibana); there are a lot of dependencies and things that need to happen to have confidence in the binary.

mtojek (Contributor, Author) commented Feb 24, 2021

> @mtojek Just to clarify: the problem is that the artifacts produced by the unified process are old and do not include the latest fixes, because the new builds aren't completing?

Yes.

> First, are we responsible for the failure of the build? Are our Beats or Elastic Agent breaking the build? That we can control and prioritize; if that is not the case, we need to look into it with infra, who are looking into better ways to notify us.

I think we should put more emphasis on ownership here and be more proactive in emergency situations, not just sit and wait until the next build appears. In this particular situation the image was built correctly, but it was discovered later that it doesn't boot up correctly. I had an interesting conversation around this issue with @ycombinator. Of course we can improve the test coverage, but there may always be a situation in which it is insufficient. Such cases should be treated differently to reduce the blast radius and not block downstream consumers (e.g. integrations developers).

What do you think about introducing a tagging-based solution for Docker images? Let's say the "unified" builder tags the latest built image with a -STABLE tag. If faulty behavior is detected, we control the tag and can simply "revert" to a known-good image by retagging.
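
To illustrate the idea, a minimal sketch of what such a retag could look like (the image path, the commit-suffixed tag, and the 8.0.0-STABLE tag are all hypothetical, and this assumes the older known-good snapshot image is still available in the registry):

# Hypothetical example: point the STABLE tag back at the last known-good snapshot image.
docker pull docker.elastic.co/beats/elastic-agent:8.0.0-abc1234-SNAPSHOT
docker tag docker.elastic.co/beats/elastic-agent:8.0.0-abc1234-SNAPSHOT docker.elastic.co/beats/elastic-agent:8.0.0-STABLE
docker push docker.elastic.co/beats/elastic-agent:8.0.0-STABLE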

ruflin (Contributor) commented Feb 25, 2021

I expect future community developers to develop against stable versions of the stack, so I would assume this should not happen. Similarly on our end, I expect us to become more and more able to develop against stable / released versions.

Broken / failed builds will keep happening from time to time, whether we are in control or someone else is. One thing that was great during the development of the package-registry was that each PR and each commit to master had its own Docker image / tag, so in case of a broken "latest/SNAPSHOT", a temporary tag could be used. It would be nice to have something similar for the SNAPSHOT builds, so that not only 8.0.0-SNAPSHOT but also 8.0.0-3ac34-SNAPSHOT were available, and we could go back a few days in case things are broken. AFAIK this does not exist yet? This is very similar to what @mtojek proposed, or rather would be a requirement for it, because a STABLE tag cannot be introduced if the older images no longer exist. @kuisathaverat you might know more here?
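
A hedged sketch of how a consumer could fall back to such a commit-pinned snapshot if it existed (the 8.0.0-3ac34-SNAPSHOT tag reuses the hypothetical naming above; as noted, such tags don't exist in the official registry yet):

# Hypothetical example: pull a known-good commit-pinned snapshot and retag it locally
# so that tooling expecting 8.0.0-SNAPSHOT picks up the older, working image.
docker pull docker.elastic.co/beats/elastic-agent:8.0.0-3ac34-SNAPSHOT
docker tag docker.elastic.co/beats/elastic-agent:8.0.0-3ac34-SNAPSHOT docker.elastic.co/beats/elastic-agent:8.0.0-SNAPSHOT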

kuisathaverat (Contributor) commented Mar 2, 2021

We have something like that: recently we added the packaging step to the main pipeline, so every time you merge into master and the build reaches the elastic-agent package stage, if that stage ends well, a bunch of Docker images is published to our repository:

docker push docker.elastic.co/observability-ci/elastic-agent:8.0.0-SNAPSHOT-amd64
docker push docker.elastic.co/observability-ci/elastic-agent:a84508c749455ef9228ba1024580279e4cc86ab7-amd64
docker push docker.elastic.co/observability-ci/elastic-agent:8.0-SNAPSHOT-amd64
docker push docker.elastic.co/observability-ci/elastic-agent-ubi8:8.0.0-SNAPSHOT-amd64
docker push docker.elastic.co/observability-ci/elastic-agent-ubi8:a84508c749455ef9228ba1024580279e4cc86ab7-amd64
docker push docker.elastic.co/observability-ci/elastic-agent-ubi8:8.0-SNAPSHOT-amd64

ARM Docker images are also published.

PRs also publish Docker images:

docker push docker.elastic.co/observability-ci/elastic-agent:pr-24220-amd64
docker push docker.elastic.co/observability-ci/elastic-agent:896efa2c57bb8be6eeca1d5a62b76a613960a614-amd64
docker push docker.elastic.co/observability-ci/elastic-agent-ubi8:pr-24220-amd64
docker push docker.elastic.co/observability-ci/elastic-agent-ubi8:896efa2c57bb8be6eeca1d5a62b76a613960a614-amd64
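
For reference, a hedged sketch of how one of these CI images could be consumed locally, assuming you have credentials for the docker.elastic.co registry (the observability-ci namespace is not public, see below):

# Hypothetical usage: authenticate against the registry, then pull a PR-specific image.
docker login docker.elastic.co
docker pull docker.elastic.co/observability-ci/elastic-agent:pr-24220-amd64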

mtojek (Contributor, Author) commented Mar 2, 2021

Are these images publicly available (docker.elastic.co/observability-ci/elastic-agent)? For community contributors, we'll need something that doesn't require any special auth.

kuisathaverat (Contributor) commented:

You need to log in to access that namespace; we can publish them in another, public place.

mtojek (Contributor, Author) commented Mar 3, 2021

> You need to log in to access that namespace; we can publish them in another, public place.

Yes, that would be a good idea. I don't know the green/red stats for Elastic Agent packaging, but I hope failures are relatively rare, so we can benefit from such tags.

botelastic bot commented Mar 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the Stalled label Mar 3, 2022
@botelastic botelastic bot closed this as completed Aug 30, 2022