
[RFC] Buildkite #95070

Merged (47 commits) on May 12, 2021

Conversation

@brianseeders (Contributor) commented on Mar 22, 2021:

RFC for using Buildkite, rather than Jenkins, as our primary CI system.

Easy to read link: https://github.com/brianseeders/kibana/blob/buildkite-rfc/rfcs/text/0018_buildkite.md

The commenting period will last two weeks, ending on April 12, 2021.

[skip-ci]

@brianseeders added the Feature:CI, release_note:skip, v8.0.0, and Team:Operations labels on Mar 22, 2021
@spalger (Contributor) left a comment:

LGTM!

@lukeelmers (Member) left a comment:

Took some time to read this yesterday and, assuming the ops team is on board with managing the infra, this sounds like it will be a huge win in terms of DX.

Also, just wanted to say that this is one of the most thoughtful, well-written RFCs I've read recently. Thanks for taking the time to lay out all the details!


> For self-hosted options, containers will allow us to utilize longer-running instances (with cached layers, git repos, etc) without worrying about polluting the build environment between builds.
>
> If we use containers for CI stages, when a test fails, developers can pull the image and reproduce the failure in the same environment that was used in CI.
Reply (Member):

😍
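
For illustration, a minimal sketch of what a containerized CI step could look like in a Buildkite pipeline using the Docker plugin; the step label, command, image name/tag, and plugin version below are hypothetical and not taken from the RFC:

```yaml
steps:
  - label: "Jest tests"
    command: "yarn test:jest"
    plugins:
      - docker#v3.8.0:
          # Hypothetical CI image; a developer could `docker pull` this same image
          # and re-run the failing command inside it to reproduce a CI failure.
          image: "gcr.io/example-ci/kibana-ci-env:2021-03-22"
          propagate-environment: true
```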

@kobelb (Contributor) left a comment:

Great RFC! Next time, it'd be good to include more pedantic details that we can bike-shed about ;)

@gtback (Member) commented on Apr 6, 2021:

I'll admit that I haven't read through the whole RFC in detail, but based on a skim, I'm really impressed by the thought and effort that went into this.

My primary interest is in how this will impact jobs that run in the Kibana repo that aren't on kibana-ci and (theoretically) wouldn't be ported to Buildkite; specifically, docs-related ones :-) Is there anything that would cause the docs builds (which currently run on elasticsearch-ci) to interfere with jobs running in Buildkite?

@brianseeders (Author) replied:

> Is there anything that would cause the docs builds (which currently run on elasticsearch-ci) to interfere with jobs running in Buildkite?

I don't think so. As far as I know, the docs jobs on elasticsearch-ci and the jobs on kibana-ci don't interact with or even know about each other. They both just add Checks to the same PR, which would continue to happen.

@afharo (Member) left a comment:

I love this RFC and all the thought that went into providing this level of detail! 🧡

As long as the operations team is OK with maintaining this architecture (I'd guess that, with a more stable CI, it'll be even less work than dealing with Jenkins), I'm looking forward to seeing the first PR comment posted by Buildkite 🚀

I added a couple of nits for your consideration. Feel free to discard them if they don't apply.


> If we use containers for CI stages, when a test fails, developers can pull the image and reproduce the failure in the same environment that was used in CI.
>
> So, we need a solution that at least allows us to build and run our own containers. The more features that exist for managing this, the easier it will be.
Reply (Member):

Have we considered https://concourse-ci.org/? AFAIK, it's based on containerisation, so you can replay from certain cached stages.

Reply (Contributor):

From my skimming, Concourse looks like a build tool, and we've already chosen Bazel for that.

@brianseeders (Author) replied:

I'm not super familiar with Concourse, but I know of it. I didn't really consider it in this round mostly because I've heard not great things about it from people who have used it, and the UI is such a turn-off for me: https://ci.concourse-ci.org/teams/main/pipelines/concourse

Comment on lines 544 to 547
> - The GCP APIs (both read and create) are returning success codes
> - The GCP API for listing instances is returning partial/missing/erroneous data, with a success code
> - GCP instances are successfully being created
> - Created GCP instances are unable to connect to Buildkite, or Buildkite Agents API is returning partial/missing/erroneous data
Reply (Member):

I'm not familiar with GCP APIs. Sorry for asking this question if it doesn't make sense:

Could we find any situation where the scaled-up host fails to start the agent, gets stuck starting it, or even crashes while running? Do we have a clear status to identify that and potentially self-heal?

Reply (Contributor):

Buildkite offers retry logic that can be based on the exit code, and there are specific exit codes for agent connectivity issues.
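
As a rough sketch (the step label and script path are hypothetical), an automatic retry on agent loss could be configured like this in a pipeline definition:

```yaml
steps:
  - label: "Functional tests"
    command: ".buildkite/scripts/functional_tests.sh"
    retry:
      automatic:
        # Exit status -1 indicates the agent was lost or disconnected
        - exit_status: -1
          limit: 2
```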

@brianseeders (Author) replied:

If the agent crashes or stops for any reason, the GCP instance shuts itself down automatically. A new agent will then be created automatically, if one is still needed for the current CI workload.

I also pull the agent information from Buildkite and am able to associate running agents with GCP instances. If a GCP instance has been online for several minutes but isn't associated with a running agent in Buildkite, I can terminate it so that it gets replaced. This isn't implemented yet, but it's on the TODO list: #95711


> #### Build / Deploy
>
> Currently, the agent manager is built and deployed using [Google Cloud Build](https://cloud.google.com/build). It is deployed to and hosted using [GKE Auto-Pilot](https://cloud.google.com/blog/products/containers-kubernetes/introducing-gke-autopilot) (Kubernetes). GKE was used, rather than Cloud Run, primarily because the agent manager runs continuously (with a 30sec pause between executions) whereas Cloud Run is for services that respond to HTTP requests.
Reply (Member):

Do we have an estimate of how much that one-run-every-30sec approach would cost? Are we OK with that cost?
I would expect it's much cheaper than maintaining our current Jenkins primary agent infra.

@brianseeders (Author) replied:

Sure, the agent manager itself costs about $23/mo as it's currently configured. It's actually configured with quite a bit more resources than it currently needs, so it could likely be scaled down even further. It has handled thousands of concurrent agents with its current configuration.
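
For context on the Build / Deploy excerpt above, a minimal sketch of what a Cloud Build configuration for building and deploying the agent manager could look like; the image, deployment, cluster, and region names are hypothetical:

```yaml
# cloudbuild.yaml (hypothetical names throughout)
steps:
  # Build and push the agent manager image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA']
  # Roll the GKE (Autopilot) deployment over to the new image
  - name: 'gcr.io/cloud-builders/kubectl'
    args:
      - 'set'
      - 'image'
      - 'deployment/buildkite-agent-manager'
      - 'agent-manager=gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA'
    env:
      - 'CLOUDSDK_COMPUTE_REGION=us-central1'
      - 'CLOUDSDK_CONTAINER_CLUSTER=buildkite-agent-manager'
images:
  - 'gcr.io/$PROJECT_ID/buildkite-agent-manager:$SHORT_SHA'
```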

@chloeruka commented on Apr 7, 2021:

I'm looking forward to seeing the first PR comment posted by Buildkite 🚀

@afharo: Just commenting to say hi and that we have been listening! We've been abuzz talking about this internally and are very excited about the thorough investigation and thought you've been putting into this. I'm keen to work with you all to help deliver the best CI for your team. So thanks for keeping us in the loop!

I'll try to find some time to write some longer-form thoughts about this RFC soon, but figured I'd break the silence. 😄 Do let me know if there are any questions you'd like answered.

@brianseeders (Author) replied:

Thanks @chloeruka, we're super excited to work with Buildkite as well! No questions at the moment, the docs make everything pretty straightforward.


> ![Example Build](../images/0016_buildkite_build.png)
>
> Note that dependencies between steps are mostly not shown in the UI. See screenshot below for an example. There are several layers of dependencies between all of the steps in this pipeline. The only one that is shown is the final step (`Post All`), which executes after all steps beforehand are finished. There are some other strategies to help organize the steps (such as the new grouping functionality) if we need them.

Reply:

Psst we have a secret page you can use to visualise step deps here: https://buildkite.com/elastic/kibana/builds/84/dag

We're still polishing this so we can display it in more places, but it might help for debugging.
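
For reference, a minimal sketch of how such step dependencies are declared in a Buildkite pipeline definition; the step keys, labels, and script paths are hypothetical, with only the final `Post All` step mirroring the excerpt above:

```yaml
steps:
  - label: "Build Kibana Distribution"
    key: "build"
    command: ".buildkite/scripts/build.sh"

  - label: "Functional Tests"
    key: "functional"
    command: ".buildkite/scripts/functional_tests.sh"
    depends_on: "build"

  # Runs only after every step above has finished
  - label: "Post All"
    command: ".buildkite/scripts/post_all.sh"
    depends_on:
      - "build"
      - "functional"
```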

@brianseeders (Author) commented:

We were planning to accept and merge the RFC today, but we need to wait until some internal discussions are completed. I expect that they will be completed this week, but will post an update on Thursday if they are not.

@mark-vieira commented:

On the Elasticsearch side Buildkite looks very intriguing. It meets a few critical needs for us:

  1. Scalable
  2. Unopinionated
  3. Bring your own build agent

It'd be nice to have some slightly more formal discussions about what a scenario in which other teams move to Buildkite might look like. While the Kibana ops team may be comfortable with taking on the responsibility of managing their own CI infrastructure, it's a very different thing when other teams start to rely on Kibana-ops-owned infra. There's also a lot of current investment by the engineering infra team that would need to be adapted. For Elasticsearch, this is primarily the building of our CI worker images. If we were to "get on board" we'd probably require at least some support from both the infra automation and Kibana ops teams. Ideally we'd have some discussions around how those teams might balance existing commitments against aiding other teams to migrate, or even whether that's a goal at all (i.e. if you want to move to Buildkite, you're on your own).

There is some precedent for cross-team infrastructure being owned by another team (not infra). Gradle Enterprise, for example, is owned and managed by the Elasticsearch delivery team, yet it's also leveraged by Cloud. So far it's worked out fine, but it's frankly a pretty trivial piece of infrastructure to maintain, and it piggybacks on existing work by infra (it runs inside elastic-apps).

So to summarize, Buildkite looks pretty sweet. If the Elasticsearch team wanted to do something similar, what kind of support could we expect from engineering infra and kibana ops, if any?

@brianseeders (Author) commented:

No updates at the moment. We are still waiting on some internal discussions to complete. I will provide an update when they do.

@brianseeders (Author) commented:

Just a quick update. We are still working through concerns with various teams internally. I will post another update when there is more to share.

@brianseeders (Author) commented:

We got approval to move forward with Buildkite today! 🎉 🚀

@brianseeders merged commit 3bf21c3 into elastic:master on May 12, 2021
@brianseeders deleted the buildkite-rfc branch on May 12, 2021 at 21:58
@kibanamachine added the backport missing label on May 14, 2021
@kibanamachine (Contributor) commented:

Friendly reminder: Looks like this PR hasn't been backported yet.
To create backports, run `node scripts/backport --pr 95070`, or prevent reminders by adding the `backport:skip` label.

@spalger added the backport:skip label on May 17, 2021
@kibanamachine removed the backport missing label on May 17, 2021
Labels: backport:skip, Feature:CI, release_note:skip, RFC, Team:Operations, v8.0.0