
[RFC] Buildkite #95070

Merged (47 commits) on May 12, 2021

Conversation

@brianseeders (Contributor) commented on Mar 22, 2021:

RFC for using Buildkite, rather than Jenkins, as our primary CI system.

Easy to read link: https://github.com/brianseeders/kibana/blob/buildkite-rfc/rfcs/text/0018_buildkite.md

The commenting period will last two weeks, ending on April 12, 2021.

[skip-ci]

@brianseeders added the Feature:CI, release_note:skip, v8.0.0, and Team:Operations labels on Mar 22, 2021
@spalger (Contributor) left a comment:

LGTM!

@lukeelmers (Member) left a comment:

Took some time to read this yesterday and, assuming the ops team is on board with managing the infra, this sounds like it will be a huge win in terms of DX.

Also, just wanted to say that this is one of the most thoughtful, well-written RFCs I've read recently. Thanks for taking the time to lay out all the details!


> For self-hosted options, containers will allow us to utilize longer-running instances (with cached layers, git repos, etc) without worrying about polluting the build environment between builds.
>
> If we use containers for CI stages, when a test fails, developers can pull the image and reproduce the failure in the same environment that was used in CI.
Reply (Member):

😍
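
For illustration, a minimal sketch of what a containerized CI step could look like in a Buildkite pipeline using the Docker plugin; the step label, command, image name/tag, and plugin version below are hypothetical and not taken from the RFC:

```yaml
steps:
  - label: "Jest tests"
    command: "yarn test:jest"
    plugins:
      - docker#v3.8.0:
          # Hypothetical CI image; a developer could `docker pull` this same image
          # and re-run the failing command inside it to reproduce a CI failure.
          image: "gcr.io/example-ci/kibana-ci-env:2021-03-22"
          propagate-environment: true
```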

@kobelb (Contributor) left a comment:

Great RFC! Next time, it'd be good to include more pedantic details that we can bike-shed about ;)

@gtback (Member) commented on Apr 6, 2021:

I'll admit that I haven't read through the whole RFC in detail, but based on a skim, I'm really impressed by the thought and effort that went into this.

My primary interest is in how this will impact jobs that run in the Kibana repo that aren't on kibana-ci and (theoretically) wouldn't be ported to Buildkite; specifically, docs-related ones :-) Is there anything that would cause the docs builds (which currently run on elasticsearch-ci) to interfere with jobs running in Buildkite?

@brianseeders (Author) replied:

> Is there anything that would cause the docs builds (which currently run on elasticsearch-ci) to interfere with jobs running in Buildkite?

I don't think so. As far as I know, the docs jobs on elasticsearch-ci and the jobs on kibana-ci don't interact with or even know about each other. They both just add Checks to the same PR, which would continue to happen.

@afharo (Member) left a comment:

I love this RFC and all the thought that went into providing this level of detail! 🧡

As long as the operations team is OK with maintaining this architecture (I'd guess that, with a more stable CI, it'll be even less work than dealing with Jenkins), I'm looking forward to seeing the first PR comment posted by Buildkite 🚀

I added a couple of nits for your consideration. Feel free to discard them if they don't apply.


> If we use containers for CI stages, when a test fails, developers can pull the image and reproduce the failure in the same environment that was used in CI.
>
> So, we need a solution that at least allows us to build and run our own containers. The more features that exist for managing this, the easier it will be.
Reply (Member):

Have we considered https://concourse-ci.org/? AFAIK, it's based on containerisation, so you can replay from certain cached stages.

Reply (Contributor):

From my skimming, Concourse looks like a build tool, and we've already chosen Bazel for that.

@brianseeders (Author) replied:

I'm not super familiar with Concourse, but I know of it. I didn't really consider it in this round mostly because I've heard not great things about it from people who have used it, and the UI is such a turn-off for me: https://ci.concourse-ci.org/teams/main/pipelines/concourse

Comment on lines 544 to 547
> - The GCP APIs (both read and create) are returning success codes
> - The GCP API for listing instances is returning partial/missing/erroneous data, with a success code
> - GCP instances are successfully being created
> - Created GCP instances are unable to connect to Buildkite, or Buildkite Agents API is returning partial/missing/erroneous data
Reply (Member):

I'm not familiar with GCP APIs. Sorry for asking this question if it doesn't make sense:

Could we find any situation where the scaled-up host fails to start the agent, gets stuck starting it, or even crashes while running? Do we have a clear status to identify that and potentially self-heal?

Reply (Contributor):

Buildkite offers retry logic that can be based on the exit code, and there are specific exit codes for agent connectivity issues.
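
As a rough sketch (the step label and script path are hypothetical), an automatic retry on agent loss could be configured like this in a pipeline definition:

```yaml
steps:
  - label: "Functional tests"
    command: ".buildkite/scripts/functional_tests.sh"
    retry:
      automatic:
        # Exit status -1 indicates the agent was lost or disconnected
        - exit_status: -1
          limit: 2
```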

@brianseeders (Author) replied:

If the agent crashes or stops for any reason, the GCP instance shuts itself down automatically. A new agent will then be created automatically, if one is still needed for the current CI workload.

I also pull the agent information from Buildkite and am able to associate running agents with GCP instances. If a GCP instance has been online for several minutes but isn't associated with a running agent in Buildkite, I can terminate it so that it gets replaced. This isn't implemented yet, but it's on the TODO list: #95711


> #### Build / Deploy
>
> Currently, the agent manager is built and deployed using [Google Cloud Build](https://cloud.google.com/build). It is deployed to and hosted using [GKE Auto-Pilot](https://cloud.google.com/blog/products/containers-kubernetes/introducing-gke-autopilot) (Kubernetes). GKE was used, rather than Cloud Run, primarily because the agent manager runs continuously (with a 30sec pause between executions) whereas Cloud Run is for services that respond to HTTP requests.
Reply (Member):

Do we have an estimate of how much that one-run-every-30sec approach would cost? Are we OK with that cost?
I would expect it's much cheaper than maintaining our current Jenkins primary agent infra.

@brianseeders (Author) replied:

Sure, the agent manager itself costs about $23/mo as it's currently configured. It's actually configured with quite a bit more resources than it currently needs, so it could likely be scaled down even further. It has handled thousands of concurrent agents with its current configuration.
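
For context on the Build / Deploy excerpt above, a minimal sketch of what a Cloud Build configuration for building and deploying the agent manager could look like; the image, deployment, cluster, and region names are hypothetical:

```yaml
# cloudbuild.yaml (hypothetical names throughout)
steps:
  # Build and push the agent manager image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA']
  # Roll the GKE (Autopilot) deployment over to the new image
  - name: 'gcr.io/cloud-builders/kubectl'
    args:
      - 'set'
      - 'image'
      - 'deployment/buildkite-agent-manager'
      - 'agent-manager=gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA'
    env:
      - 'CLOUDSDK_COMPUTE_REGION=us-central1'
      - 'CLOUDSDK_CONTAINER_CLUSTER=buildkite-agent-manager'
images:
  - 'gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA'
```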

@chloeruka commented on Apr 7, 2021:

I'm looking forward to seeing the first PR comment posted by Buildkite 🚀

@afharo: Just commenting to say hi and that we have been listening! We've been abuzz talking about this internally and are very excited about the thorough investigation and thought you've been putting into this. I'm keen to work with you all to help deliver the best CI for your team. So thanks for keeping us in the loop!

I'll try to find some time to write some longer-form thoughts about this RFC soon, but figured I'd break the silence. 😄 Do let me know if there are any questions you'd like answered.

@brianseeders (Author) replied:

Thanks @chloeruka, we're super excited to work with Buildkite as well! No questions at the moment, the docs make everything pretty straightforward.


> ![Example Build](../images/0016_buildkite_build.png)
>
> Note that dependencies between steps are mostly not shown in the UI. See screenshot below for an example. There are several layers of dependencies between all of the steps in this pipeline. The only one that is shown is the final step (`Post All`), which executes after all steps beforehand are finished. There are some other strategies to help organize the steps (such as the new grouping functionality) if we need them.

Reply:

Psst we have a secret page you can use to visualise step deps here: https://buildkite.com/elastic/kibana/builds/84/dag

We're still polishing this so we can display it in more places, but it might help for debugging.
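
For reference, a minimal sketch of how such step dependencies are declared in a Buildkite pipeline definition; the step keys, labels, and script paths are hypothetical, with only the final `Post All` step mirroring the excerpt above:

```yaml
steps:
  - label: "Build Kibana Distribution"
    key: "build"
    command: ".buildkite/scripts/build.sh"

  - label: "Functional Tests"
    key: "functional"
    command: ".buildkite/scripts/functional_tests.sh"
    depends_on: "build"

  # Runs only after every step above has finished
  - label: "Post All"
    command: ".buildkite/scripts/post_all.sh"
    depends_on:
      - "build"
      - "functional"
```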

@brianseeders (Author) commented:

We were planning to accept and merge the RFC today, but we need to wait until some internal discussions are completed. I expect that they will be completed this week, but will post an update on Thursday if they are not.

@mark-vieira commented:

On the Elasticsearch side Buildkite looks very intriguing. It meets a few critical needs for us:

  1. Scalable
  2. Unopinionated
  3. Bring your own build agent

It'd be nice to have some slightly more formal discussions about what a scenario in which other teams move to Buildkite might look like. While the Kibana ops team may be comfortable with taking on the responsibility of managing their own CI infrastructure, it's a very different thing when other teams start to rely on Kibana-ops-owned infra. There's also a lot of current investment by the engineering infra team that would need to be adapted. For Elasticsearch, this is primarily the building of our CI worker images. If we were to "get on board" we'd probably require at least some support from both the infra automation and Kibana ops teams. Ideally we'd have some discussions around how those teams might balance existing commitments against aiding other teams to migrate, or even whether that's a goal at all (i.e. if you want to move to Buildkite, you're on your own).

There is some precedent for cross-team infrastructure being owned by another team (not infra). Gradle Enterprise, for example, is owned and managed by the Elasticsearch delivery team, yet it's also leveraged by Cloud. So far it's worked out fine, but it's frankly a pretty trivial piece of infrastructure to maintain, and it piggybacks on existing work by infra (it runs inside elastic-apps).

So to summarize, Buildkite looks pretty sweet. If the Elasticsearch team wanted to do something similar, what kind of support could we expect from engineering infra and kibana ops, if any?

@brianseeders (Author) commented:

No updates at the moment. We are still waiting on some internal discussions to complete. I will provide an update when they do.

@brianseeders (Author) commented:

Just a quick update. We are still working through concerns with various teams internally. I will post another update when there is more to share.

@brianseeders (Author) commented:

We got approval to move forward with Buildkite today! 🎉 🚀

@brianseeders merged commit 3bf21c3 into elastic:master on May 12, 2021
@brianseeders deleted the buildkite-rfc branch on May 12, 2021 at 21:58
@kibanamachine added the backport missing label on May 14, 2021
@kibanamachine (Contributor) commented:

Friendly reminder: Looks like this PR hasn't been backported yet.
To create backports, run `node scripts/backport --pr 95070`, or prevent reminders by adding the `backport:skip` label.

@spalger added the backport:skip label on May 17, 2021
@kibanamachine removed the backport missing label on May 17, 2021
Labels: backport:skip, Feature:CI, release_note:skip, RFC, Team:Operations, v8.0.0