[CI] Use GCS buckets for bazel remote caching #131345

Merged 14 commits on May 3, 2022

Conversation


@brianseeders brianseeders commented May 2, 2022

TLDR: Use GCS buckets for bazel remote caching in CI for a cheap and easy bootstrap speed boost. Local remote cache is currently unchanged. All steps can read and write.

Bootstrap times in CI vary a lot over the course of the day, and can get pretty long on smaller machines when changes are made that invalidate the local cache inside the agent images (which are refreshed daily in the morning). We would like to enable caching across CI in a performant and cost-effective way.

I trialed:

  • Using bazel-remote, with grpc, hosted on an instance in our GCP project
  • Using a single, "multi-regional" GCS bucket located in the U.S.
  • Using single-region GCS buckets in every GCP region where we run CI (there are currently 5).

bazel-remote notes:

  • Probably the most expensive and least performant option, at least as I had it configured. Performance was worse for instances farther from where the service was hosted
  • Could possibly host a separate instance in each region to gain performance across regions, but it will be 5x as expensive
  • Does not have an HA solution
  • Would have to be hosted, maintained, upgraded, and monitored by us
  • With 100 instances running bootstrap starting from 0 cache (a worst-case scenario), CPU load spiked to around 4
  • Example with 100 workers, no disk cache, full remote cache - 2min-4min for bootstrap, depending on the region of the agent. There is also a lot of variability within the same region

GCS single bucket:

  • Uses https instead of grpc
  • Was faster than bazel-remote for worst-case scenario
  • Similar to bazel-remote, it's much slower for instances not close to the U.S.
  • Cheap and zero maintenance
  • File retention set to 48 hours, as we only need to cache objects not present in local cache, which is updated every 24 hours
  • Example with 100 workers, no disk cache, full remote cache - 2min-4min for bootstrap, depending on the region of the agent
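
The 48-hour retention noted above could be expressed as a bucket lifecycle rule. The following is a minimal sketch, assuming the retention is implemented as a GCS lifecycle "Delete" action (the PR description does not say how it was configured). GCS lifecycle `age` conditions are counted in days, so 48 hours becomes 2. The function name `lifecycle_config` is hypothetical.

```python
import json

HOURS_PER_DAY = 24


def lifecycle_config(retention_hours: int) -> dict:
    """Build a GCS lifecycle configuration that deletes old cache objects.

    Assumption: retention is enforced with a lifecycle Delete rule.
    """
    # The lifecycle "age" condition is in whole days, so round hours up.
    age_days = -(-retention_hours // HOURS_PER_DAY)
    return {
        "rule": [
            {"action": {"type": "Delete"}, "condition": {"age": age_days}}
        ]
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved to a file and applied to a bucket.
    print(json.dumps(lifecycle_config(48), indent=2))
```

The printed JSON has the shape accepted by `gsutil lifecycle set <file> gs://<bucket>`; the bucket name is left as a placeholder because the real one is not given here.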

GCS bucket-per-region:

  • Same as GCS single bucket, except:
  • Storage cost will be 5x that of a single bucket, but it's a pretty small cost. I'm not sure how much storage 48 hours' worth of objects will need (it should be pretty small, probably in MB), but 1TB is about $20/mo.
  • Objects have to be cached separately in each region, but this will generally happen during the on-merge job, via the jest jobs, and via the FTR jobs if objects are still missing by then
  • All regions are fast
  • Bandwidth costs should be smaller as objects are stored in the same region
  • Example with 100 workers, no disk cache, full remote cache - about 2min for bootstrap
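
The storage-cost comparison above is simple arithmetic and can be sketched as follows. The $20 per TB per month figure and the five regions come from the description; the 0.5 GB per-bucket size is purely a hypothetical placeholder, since the actual size of 48 hours of cache objects was not measured.

```python
PRICE_PER_TB_MONTH_USD = 20.0  # rough figure quoted in the description
GB_PER_TB = 1024
CI_REGIONS = 5  # number of GCP regions where CI currently runs


def monthly_storage_cost(stored_gb_per_bucket: float, buckets: int) -> float:
    """Estimate monthly storage cost in USD for `buckets` equally sized buckets."""
    return stored_gb_per_bucket / GB_PER_TB * PRICE_PER_TB_MONTH_USD * buckets


if __name__ == "__main__":
    hypothetical_gb = 0.5  # assumption for illustration only, not a measurement
    single = monthly_storage_cost(hypothetical_gb, 1)
    per_region = monthly_storage_cost(hypothetical_gb, CI_REGIONS)
    print(f"single bucket: ${single:.4f}/mo, bucket-per-region: ${per_region:.4f}/mo")
```

Even a full terabyte in every region would come to roughly $100/mo under these assumptions, which is why the 5x multiplier is treated as negligible. Bandwidth costs are not modeled here.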

Given all of this, the last option (GCS bucket-per-region) seems like the best choice. This may change in the future when we're utilizing bazel for even more, and we can always reassess.
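
To illustrate the bucket-per-region approach, here is a minimal sketch of how a CI agent might pick the bucket for its own region and emit the matching Bazel flags. `--remote_cache` and `--google_default_credentials` are real Bazel flags, and `https://storage.googleapis.com/<bucket>` is the documented URL form for a GCS-backed cache. The bucket prefix `example-bazel-cache`, the naming scheme, and the function names are assumptions, not what this PR actually uses.

```python
from typing import List


def region_from_zone(zone: str) -> str:
    """Derive a GCP region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def remote_cache_flags(zone: str, bucket_prefix: str = "example-bazel-cache") -> List[str]:
    """Return .bazelrc lines pointing Bazel at the bucket in the agent's region.

    Assumption: buckets are named '<bucket_prefix>-<region>' (hypothetical scheme).
    """
    region = region_from_zone(zone)
    return [
        f"build --remote_cache=https://storage.googleapis.com/{bucket_prefix}-{region}",
        # Authenticate with the credentials already available on the agent.
        "build --google_default_credentials",
    ]


if __name__ == "__main__":
    for line in remote_cache_flags("europe-west1-b"):
        print(line)
```

Keeping the region lookup on the agent side means the same CI script works in every region without a hard-coded bucket list.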

As a side note: We could also use this for the local remote cache if we wanted. We just wouldn't get the UIs, statistics, historical tracking, etc. that BuildBuddy provides.

@brianseeders brianseeders added Feature:CI Continuous integration release_note:skip Skip the PR/issue when compiling release notes v8.3.0 Team:Operations Team label for Operations Team labels May 2, 2022
@brianseeders
Contributor Author

buildkite build this

@brianseeders
Contributor Author

buildkite build this

@brianseeders brianseeders added v8.2.1 v7.17.4 auto-backport Deprecated - use backport:version if exact versions are needed labels May 3, 2022
@brianseeders brianseeders changed the title Trying our own bazel remote cache [CI] Use GCS buckets for bazel remote caching May 3, 2022
@brianseeders brianseeders marked this pull request as ready for review May 3, 2022 20:12
@brianseeders brianseeders requested a review from a team as a code owner May 3, 2022 20:12
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@brianseeders brianseeders requested a review from mistic May 3, 2022 20:12
Contributor

@spalger spalger left a comment

🎉🎉🎉 This is awesome!! 🎉🎉🎉

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mistic
Member

mistic commented May 3, 2022

I believe it is also worth it to try https://github.com/znly/bazel-cache in the future

@brianseeders brianseeders merged commit 3bc9c42 into elastic:main May 3, 2022
@brianseeders brianseeders deleted the bazel-remote-cache branch May 3, 2022 20:49
kibanamachine pushed a commit that referenced this pull request May 3, 2022
kibanamachine pushed a commit that referenced this pull request May 3, 2022
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.2
7.17

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request May 3, 2022
(cherry picked from commit 3bc9c42)

Co-authored-by: Brian Seeders <[email protected]>
kibanamachine added a commit that referenced this pull request May 3, 2022
(cherry picked from commit 3bc9c42)

Co-authored-by: Brian Seeders <[email protected]>
academo pushed a commit to academo/kibana that referenced this pull request May 4, 2022
academo pushed a commit to academo/kibana that referenced this pull request May 4, 2022
academo added a commit that referenced this pull request May 5, 2022
* Add severity field to create API and migration

* Adds integration test for severity field migration

* remove exclusive test

* Change severity levels

* Update integration tests for post case

* Add more integration tests

* Fix all cases list test

* Fix some server test

* Fix util server test

* Fix client util test

* Convert event log's duration from number to string in Kibana (keep as "long" in Elasticsearch) (#130819)

* Convert event.duration to string in TypeScript, keep as long in Elasticsearch

* Fix jest test

* Fix functional tests

* Add ecsStringOrNumber to event log schema

* Fix jest test

* Add utility functions to event log plugin

* Use new event log utility functions

* PR fixes

Co-authored-by: Kibana Machine <[email protected]>

* filter o11y rule aggregations (#131301)

* [Cloud Posture] Display and save rules per benchmark (#131412)

* Adding aria-label for discover data grid select document checkbox (#131277)

* Update API docs (#130999)

Co-authored-by: Kibana Machine <[email protected]>

* [CI] Use GCS buckets for bazel remote caching (#131345)

* [Actionable Observability] Add license modal to rules table (#131232)

* Add fix license link

* fix localization

* fix CI error

* fix more translation issues

Co-authored-by: Kibana Machine <[email protected]>

* [RAM] Add shareable rule status filter (#130705)

* rule state filter

* turn off experiment

* [CI] Auto-commit changed files from 'node scripts/eslint --no-cache --fix'

* Status filter API call

* Fix tests

* rename state to status, added tests

* Address comments and fix tests

* Revert experiment flag

* Remove unused translations

* Addressed comments

Co-authored-by: kibanamachine <[email protected]>

* [storybook] Watch for changes in packages (#131467)

* [storybook] Watch for changes in packages

* Update default_config.ts

* Improve saved objects migrations failure errors and logs (#131359)

* [Unified observability] Add tour step to guided setup (#131149)

* [Lens] Improved interval input (#131372)

* [Vega] Adjust vega doc for usage of ems files (#130948)

* adjust vega doc

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <[email protected]>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <[email protected]>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <[email protected]>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <[email protected]>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <[email protected]>

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Nick Peihl <[email protected]>

* Excess intersections

* Create severity user action

* Add severity to create_case user action

* Fix and add integration tests

* Minor improvements

Co-authored-by: Mike Côté <[email protected]>
Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: mgiota <[email protected]>
Co-authored-by: Jordan <[email protected]>
Co-authored-by: Bhavya RM <[email protected]>
Co-authored-by: Thomas Neirynck <[email protected]>
Co-authored-by: Brian Seeders <[email protected]>
Co-authored-by: Jiawei Wu <[email protected]>
Co-authored-by: Clint Andrew Hall <[email protected]>
Co-authored-by: Christiane (Tina) Heiligers <[email protected]>
Co-authored-by: Alejandro Fernández Gómez <[email protected]>
Co-authored-by: Joe Reuter <[email protected]>
Co-authored-by: Nick Peihl <[email protected]>
Co-authored-by: Christos Nasikas <[email protected]>
kertal pushed a commit to kertal/kibana that referenced this pull request May 24, 2022
Labels
auto-backport Deprecated - use backport:version if exact versions are needed Feature:CI Continuous integration release_note:skip Skip the PR/issue when compiling release notes Team:Operations Team label for Operations Team v7.17.4 v8.2.1 v8.3.0

6 participants