
v2 migrations adds significant CI overhead #91618

Closed
rudolf opened this issue Feb 17, 2021 · 13 comments
Labels
discuss project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Operations Team label for Operations Team

Comments

@rudolf (Contributor) commented Feb 17, 2021

v2 migrations cause the CI run to take 15 minutes longer on average, which consumes significant CI resources. In short, v2 migrations have many more steps and do much more work, so they are much slower, especially for small data sets like our test fixtures.

This has caused Jenkins timeouts, and although infra bumped the limits, there's still a concern that we're adding this much extra time to CI for something that isn't directly testing v2 migrations and only marginally improves our confidence in the code.

The medium-term solution is to use the SO import API to load test fixtures instead of having esArchiver write potentially outdated documents directly into the index. This avoids Kibana having to run a full migration every time fixtures are loaded: objects are migrated in memory before they are written, which should in theory be even faster than a full v1 migration. The QA team has already started working on this in #89368.
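The import-based approach can be sketched as follows (a minimal illustration, assuming a locally running Kibana; the URL, fixture contents, and `overwrite=true` flag are assumptions for illustration, not details from this thread):

```shell
#!/usr/bin/env sh
# Sketch: load fixtures through the Saved Objects import API instead of
# writing raw documents into .kibana with esArchiver. Objects are migrated
# in memory as they are imported, so Kibana never has to run a full index
# migration for the fixture data. KIBANA_URL and the fixture below are
# illustrative assumptions.
KIBANA_URL="${KIBANA_URL:-http://localhost:5601}"

# A minimal fixture in the export (ndjson) format.
cat > fixtures.ndjson <<'EOF'
{"type":"index-pattern","id":"logstash-*","attributes":{"title":"logstash-*"}}
EOF

# Import only if a Kibana instance is actually reachable.
if command -v curl >/dev/null 2>&1 && curl -sf "$KIBANA_URL/api/status" >/dev/null 2>&1; then
  curl -s -X POST "$KIBANA_URL/api/saved_objects/_import?overwrite=true" \
    -H 'kbn-xsrf: true' \
    --form file=@fixtures.ndjson
else
  echo "Kibana not reachable at $KIBANA_URL; skipping import"
fi
```

The ndjson format here is the same one the SO export API produces, so exported fixtures can be re-imported without touching the `.kibana` index directly.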

In the short term we have two options:

  1. Keep running v2 migrations and accept the CI overhead. After speeding up the spaces tests (Speed up spaces tests by letting v2 migrations do less work #91829) we were able to bring the CI runtime back under 2 hours.
  2. Use v1 migrations (Disable v2 migrations to speed-up FTR #91402). This is a quick fix, but it prevents us from removing the v1 code, which we would ideally do as soon as we have confidence that v2 migrations are stable (v7.13 at the earliest, but possibly v7.14).
@rudolf rudolf added the project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient label Feb 17, 2021
@rudolf (Contributor, Author) commented Feb 17, 2021

@LeeDr @wayneseymour Do you have a target for completing #89368?

@rudolf rudolf added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Operations Team label for Operations Team labels Feb 17, 2021
@elasticmachine (Contributor)

Pinging @elastic/kibana-operations (Team:Operations)

@elasticmachine (Contributor)

Pinging @elastic/kibana-core (Team:Core)

@rudolf rudolf added the discuss label Feb 17, 2021
@mshustov (Contributor)

Keep running v2 migrations and accept the CI overhead

I suspect the overhead will decrease as more plugins migrate to the SO API.
I suspect the overhead will decrease as more plugins migrate to the SO API. We can use v1 migrations until the end of 7.13, then evaluate how much #89368 has relieved the situation.

@spalger (Contributor) commented Feb 17, 2021

@LeeDr @wayneseymour Do you have a target for completing #89368?

I'm working with @wayneseymour on getting that PR ready for merge ASAP. As for migrating all our .kibana esArchives to use SO import/export, I'm confident that won't ever happen without a good deal of coordination and pushing, or someone just going and migrating things for folks.

@LeeDr commented Feb 17, 2021

Are we sure v2 migrations are adding that much time? I'm getting set up to run locally, but in one case from a recent master Jenkins job it looks like a migration took just over 2 seconds (Migration completed after 2138ms).

Most of our tests have very few saved objects. A dashboard test, for example, might have up to a dozen or so visualizations that it will add to a dashboard. Maybe there are some Fleet tests or others that actually do have a lot?

09:34:15             └-: dashboard time
09:34:15               └-> "before all" hook in "dashboard time"
09:34:15               └-> "before all" hook in "dashboard time"
09:34:15                 │ proc [kibana]   log   [15:34:13.349] [info][savedobjects-service] [.kibana] INIT -> LEGACY_SET_WRITE_BLOCK
09:34:15                 │ proc [kibana]   log   [15:34:13.420] [info][savedobjects-service] [.kibana] LEGACY_SET_WRITE_BLOCK -> LEGACY_CREATE_REINDEX_TARGET
09:34:15                 │ proc [kibana]   log   [15:34:13.514] [info][savedobjects-service] [.kibana] LEGACY_CREATE_REINDEX_TARGET -> LEGACY_REINDEX
09:34:15                 │ proc [kibana]   log   [15:34:13.550] [info][savedobjects-service] [.kibana] LEGACY_REINDEX -> LEGACY_REINDEX_WAIT_FOR_TASK
09:34:15                 │ proc [kibana]   log   [15:34:13.780] [info][savedobjects-service] [.kibana] LEGACY_REINDEX_WAIT_FOR_TASK -> LEGACY_DELETE
09:34:15                 │ proc [kibana]   log   [15:34:13.829] [info][savedobjects-service] [.kibana] LEGACY_DELETE -> SET_SOURCE_WRITE_BLOCK
09:34:15                 │ proc [kibana]   log   [15:34:13.891] [info][savedobjects-service] [.kibana] SET_SOURCE_WRITE_BLOCK -> CREATE_REINDEX_TEMP
09:34:15                 │ proc [kibana]   log   [15:34:13.989] [info][savedobjects-service] [.kibana] CREATE_REINDEX_TEMP -> REINDEX_SOURCE_TO_TEMP
09:34:15                 │ proc [kibana]   log   [15:34:13.996] [info][savedobjects-service] [.kibana] REINDEX_SOURCE_TO_TEMP -> REINDEX_SOURCE_TO_TEMP_WAIT_FOR_TASK
09:34:15                 │ proc [kibana]   log   [15:34:14.108] [info][savedobjects-service] [.kibana] REINDEX_SOURCE_TO_TEMP_WAIT_FOR_TASK -> SET_TEMP_WRITE_BLOCK
09:34:15                 │ proc [kibana]   log   [15:34:14.172] [info][savedobjects-service] [.kibana] SET_TEMP_WRITE_BLOCK -> CLONE_TEMP_TO_TARGET
09:34:15                 │ proc [kibana]   log   [15:34:14.352] [info][savedobjects-service] [.kibana] CLONE_TEMP_TO_TARGET -> OUTDATED_DOCUMENTS_SEARCH
09:34:15                 │ proc [kibana]   log   [15:34:14.385] [info][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH -> OUTDATED_DOCUMENTS_TRANSFORM
09:34:15                 │ proc [kibana]   log   [15:34:15.255] [info][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_TRANSFORM -> OUTDATED_DOCUMENTS_SEARCH
09:34:15                 │ proc [kibana]   log   [15:34:15.270] [info][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH -> UPDATE_TARGET_MAPPINGS
09:34:15                 │ proc [kibana]   log   [15:34:15.325] [info][savedobjects-service] [.kibana] UPDATE_TARGET_MAPPINGS -> UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK
09:34:15                 │ proc [kibana]   log   [15:34:15.435] [info][savedobjects-service] [.kibana] UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK -> MARK_VERSION_INDEX_READY
09:34:15                 │ proc [kibana]   log   [15:34:15.476] [info][savedobjects-service] [.kibana] MARK_VERSION_INDEX_READY -> DONE
09:34:15                 │ proc [kibana]   log   [15:34:15.477] [info][savedobjects-service] [.kibana] Migration completed after 2138ms

@spalger (Contributor) commented Feb 17, 2021

While investigating flakiness this weekend, I was suspicious about the increase in build time and started stepping back through the https://kibana-ci.elastic.co/job/elastic+kibana+master/ builds; the point where the overall build time jumped from ~1h45m to ~2h10m was the build where v2 migrations were enabled in the FTR.

Those builds are no longer available in Jenkins, but I can put up a PR reverting the source PR to show the change in execution time if you want to see it.

@LeeDr commented Feb 17, 2021

I started running functional tests locally while looking at the migration start/end times. I see most of them only take about 4 seconds, but we probably do it hundreds of times over the whole CI run.

But @spalger is also right that switching tests to use saved object import isn't going to happen quickly or for every test.

$ node.exe scripts/functional_test_runner.js | while read line; do echo "[`date +%H:%M:%S.%N`] $line"; done | grep Migrat
[13:21:45.158666400] │                "coreMigrationVersion" : "7.12.0",
[13:22:09.136904600] │ debg Migrating saved objects
[13:22:14.312429300] │ debg [visualize] Migrated Kibana index after loading Kibana data
[13:24:25.604134300] │ debg Migrating saved objects
[13:24:27.636234900] │ debg [date_nanos] Migrated Kibana index after loading Kibana data
[13:25:14.512925400] │ debg Migrating saved objects
[13:25:20.698772200] │ debg [dashboard/current/kibana] Migrated Kibana index after loading Kibana data
[13:26:43.553638800] │ debg Migrating saved objects
[13:26:47.024954600] │ debg [dashboard/current/kibana] Migrated Kibana index after loading Kibana data
[13:27:44.079090000] │ debg Migrating saved objects
[13:27:47.088599400] │ debg [dashboard/current/kibana] Migrated Kibana index after loading Kibana data
[13:30:58.585865400] │ debg Migrating saved objects
[13:31:02.086695200] │ debg [dashboard/current/kibana] Migrated Kibana index after loading Kibana data
[13:33:38.682639400] │ debg Migrating saved objects
[13:33:42.109750900] │ debg [dashboard/current/kibana] Migrated Kibana index after loading Kibana data

@rudolf (Contributor, Author) commented Feb 19, 2021

Yeah, v2 migrations are only a few seconds slower per load, but it really adds up over hundreds of tests that load new data.
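As a back-of-the-envelope check that a few seconds per fixture load accounts for roughly the 15 minutes of extra CI time, with illustrative (not measured) numbers:

```shell
# Rough estimate: extra v2-migration cost per fixture load, multiplied by the
# number of loads in a CI run. Both numbers are assumptions for illustration,
# not measurements from the thread.
EXTRA_SECONDS_PER_LOAD=4
LOADS_PER_CI_RUN=225
TOTAL_SECONDS=$((EXTRA_SECONDS_PER_LOAD * LOADS_PER_CI_RUN))
echo "$((TOTAL_SECONDS / 60)) minutes of extra CI time"   # prints: 15 minutes of extra CI time
```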

@jbudz (Member) commented Apr 6, 2021

@rudolf is this something we're still addressing, or was #91829 the fix?

@LeeDr commented Apr 6, 2021

Tre' (@wayneseymour) is starting to make good progress on this now, but there are a lot of tests. We're trying to remove cases where we unload a .kibana index, and replace them with cleaning up the saved objects instead. We don't know whether we'll be able to switch all functional UI tests over to the Saved Objects APIs or not. If we got one or two developers to help work on some tests it would go faster; otherwise it's going to take several weeks.

@rudolf (Contributor, Author) commented Apr 8, 2021

@jbudz #91829 was the short term fix.

@rudolf (Contributor, Author) commented Sep 20, 2021

Closing: moving tests to kbnArchiver will remove the v2 migrations overhead (#102552).

@rudolf rudolf closed this as completed Sep 20, 2021