-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v7.9.0] Filebeat constantly trying to migrate the registry to a new format #20705
Comments
Pinging @elastic/integrations-platforms (Team:Platforms) |
Pinging @elastic/integrations-services (Team:Services) |
Filebeat migrates old registry files by reading+deserializing all state from the old state into memory first. Once it has read all state into memory it fixes some old entries, updates the schema and finally opens the new store and writes entry by entry into the new store. Once the migration is finished the old file is deleted and the meta.json file is updated with the new verson ID. Unfortunately from the the trace and logs provided I can't tell if this process is just slow. But all in all, the migration happens on startup only. Support for running inputs or autodiscovery is not yet enabled in Filebeat. From the logs it looks like Filebeat was killed during the migration process:
A we are still in the migration process, this log indicates that 15969 entries have already been written. Due to the restart during the migration Filebeat will attempt to migrate again on startup, overwriting already written entries. How many entries does the original registry file has? Have you noticed IO errors in older runs? The stack trace shows that the migration is doing a write. Normally entries are written to an append only log file. If the log file gets full (default 10MiB limit) a checkpoint operation is triggered (the What is the state of the |
Spend some time debugging and benchmarking. While testing I noticed that the migrator disables the limit check, which forces the store to do an fsync operation after every entry migrated. The sweet spot between, this is fast and this is unbearable depends on the machine. On my machine 10000+ registry entries seem to take forever. Will work on a patch that fixes the fsync on every write (already fixes the issue on my machine with 100k taking <30s) and also will try to do the migration in less IO steps, such that hopefully only one fsync will be required in total. |
@urso you are correct, this just takes a very long time. I see that there are files
It took on one Filebeat instance to do this operation 2 hours and 15 minutes which is quite a lot. I can close this as a non-issue but it'd be great to have some performance improvements in this area in the future. Thanks for this software and your work on it!!! ❤️ If you will do any follow-up PR with those improvements then feel free to close this |
Let's keep the issue open. This long migration time is much too bad for my likings and should be treated as a bug. |
## What does this PR do? Ensure that fsync is called only once after the migration of old state entries is complete. ## Why is it important? The registry uses a checkpoint-predicate that when true writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migration the old registry file. During migration, the old state will be directly copied (after cleanup and schema changes are applied). The old state will only be deleted after the migration of all states is complete. Unfortunately, the checkpoint predicate did return true, instead of false, which did trigger a checkpoint operation per state to be migrated. The fix disables fsync, and now finalizes the migration by calling Checkpoint directly. The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries): ``` BenchmarkMigration0To1/1-32 286 4203353 ns/op BenchmarkMigration0To1/10-32 34 35730680 ns/op BenchmarkMigration0To1/100-32 2 720890839 ns/op BenchmarkMigration0To1/1000-32 1 31633569085 ns/op ... test timed out after 10min ``` Benchmark results with the fix (migration 100k entries took ~7.6s): ``` BenchmarkMigration0To1/1-32 274 4371400 ns/op BenchmarkMigration0To1/10-32 259 4639209 ns/op BenchmarkMigration0To1/100-32 100 13374147 ns/op BenchmarkMigration0To1/1000-32 13 104220944 ns/op BenchmarkMigration0To1/10000-32 2 916656798 ns/op BenchmarkMigration0To1/100000-32 1 7616648790 ns/op PASS ``` Closes #20705
## What does this PR do? Ensure that fsync is called only once after the migration of old state entries is complete. ## Why is it important? The registry uses a checkpoint-predicate that when true writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migration the old registry file. During migration, the old state will be directly copied (after cleanup and schema changes are applied). The old state will only be deleted after the migration of all states is complete. Unfortunately, the checkpoint predicate did return true, instead of false, which did trigger a checkpoint operation per state to be migrated. The fix disables fsync, and now finalizes the migration by calling Checkpoint directly. The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries): ``` BenchmarkMigration0To1/1-32 286 4203353 ns/op BenchmarkMigration0To1/10-32 34 35730680 ns/op BenchmarkMigration0To1/100-32 2 720890839 ns/op BenchmarkMigration0To1/1000-32 1 31633569085 ns/op ... test timed out after 10min ``` Benchmark results with the fix (migration 100k entries took ~7.6s): ``` BenchmarkMigration0To1/1-32 274 4371400 ns/op BenchmarkMigration0To1/10-32 259 4639209 ns/op BenchmarkMigration0To1/100-32 100 13374147 ns/op BenchmarkMigration0To1/1000-32 13 104220944 ns/op BenchmarkMigration0To1/10000-32 2 916656798 ns/op BenchmarkMigration0To1/100000-32 1 7616648790 ns/op PASS ``` Closes elastic#20705 (cherry picked from commit 03748b3)
## What does this PR do? Ensure that fsync is called only once after the migration of old state entries is complete. ## Why is it important? The registry uses a checkpoint-predicate that when true writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migration the old registry file. During migration, the old state will be directly copied (after cleanup and schema changes are applied). The old state will only be deleted after the migration of all states is complete. Unfortunately, the checkpoint predicate did return true, instead of false, which did trigger a checkpoint operation per state to be migrated. The fix disables fsync, and now finalizes the migration by calling Checkpoint directly. The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries): ``` BenchmarkMigration0To1/1-32 286 4203353 ns/op BenchmarkMigration0To1/10-32 34 35730680 ns/op BenchmarkMigration0To1/100-32 2 720890839 ns/op BenchmarkMigration0To1/1000-32 1 31633569085 ns/op ... test timed out after 10min ``` Benchmark results with the fix (migration 100k entries took ~7.6s): ``` BenchmarkMigration0To1/1-32 274 4371400 ns/op BenchmarkMigration0To1/10-32 259 4639209 ns/op BenchmarkMigration0To1/100-32 100 13374147 ns/op BenchmarkMigration0To1/1000-32 13 104220944 ns/op BenchmarkMigration0To1/10000-32 2 916656798 ns/op BenchmarkMigration0To1/100000-32 1 7616648790 ns/op PASS ``` Closes elastic#20705 (cherry picked from commit 03748b3)
## What does this PR do? Ensure that fsync is called only once after the migration of old state entries is complete. ## Why is it important? The registry uses a checkpoint-predicate that when true writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migration the old registry file. During migration, the old state will be directly copied (after cleanup and schema changes are applied). The old state will only be deleted after the migration of all states is complete. Unfortunately, the checkpoint predicate did return true, instead of false, which did trigger a checkpoint operation per state to be migrated. The fix disables fsync, and now finalizes the migration by calling Checkpoint directly. The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries): ``` BenchmarkMigration0To1/1-32 286 4203353 ns/op BenchmarkMigration0To1/10-32 34 35730680 ns/op BenchmarkMigration0To1/100-32 2 720890839 ns/op BenchmarkMigration0To1/1000-32 1 31633569085 ns/op ... test timed out after 10min ``` Benchmark results with the fix (migration 100k entries took ~7.6s): ``` BenchmarkMigration0To1/1-32 274 4371400 ns/op BenchmarkMigration0To1/10-32 259 4639209 ns/op BenchmarkMigration0To1/100-32 100 13374147 ns/op BenchmarkMigration0To1/1000-32 13 104220944 ns/op BenchmarkMigration0To1/10000-32 2 916656798 ns/op BenchmarkMigration0To1/100000-32 1 7616648790 ns/op PASS ``` Closes #20705 (cherry picked from commit 03748b3) Co-authored-by: Steffen Siering <[email protected]>
#20769) * Improve registry file migration performance (#20717) ## What does this PR do? Ensure that fsync is called only once after the migration of old state entries is complete. ## Why is it important? The registry uses a checkpoint-predicate that when true writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migration the old registry file. During migration, the old state will be directly copied (after cleanup and schema changes are applied). The old state will only be deleted after the migration of all states is complete. Unfortunately, the checkpoint predicate did return true, instead of false, which did trigger a checkpoint operation per state to be migrated. The fix disables fsync, and now finalizes the migration by calling Checkpoint directly. The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries): ``` BenchmarkMigration0To1/1-32 286 4203353 ns/op BenchmarkMigration0To1/10-32 34 35730680 ns/op BenchmarkMigration0To1/100-32 2 720890839 ns/op BenchmarkMigration0To1/1000-32 1 31633569085 ns/op ... test timed out after 10min ``` Benchmark results with the fix (migration 100k entries took ~7.6s): ``` BenchmarkMigration0To1/1-32 274 4371400 ns/op BenchmarkMigration0To1/10-32 259 4639209 ns/op BenchmarkMigration0To1/100-32 100 13374147 ns/op BenchmarkMigration0To1/1000-32 13 104220944 ns/op BenchmarkMigration0To1/10000-32 2 916656798 ns/op BenchmarkMigration0To1/100000-32 1 7616648790 ns/op PASS ``` Closes #20705 (cherry picked from commit 03748b3) * fix changelog Co-authored-by: Steffen Siering <[email protected]>
## What does this PR do? Ensure that fsync is called only once after the migration of old state entries is complete. ## Why is it important? The registry uses a checkpoint-predicate that when true writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migration the old registry file. During migration, the old state will be directly copied (after cleanup and schema changes are applied). The old state will only be deleted after the migration of all states is complete. Unfortunately, the checkpoint predicate did return true, instead of false, which did trigger a checkpoint operation per state to be migrated. The fix disables fsync, and now finalizes the migration by calling Checkpoint directly. The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries): ``` BenchmarkMigration0To1/1-32 286 4203353 ns/op BenchmarkMigration0To1/10-32 34 35730680 ns/op BenchmarkMigration0To1/100-32 2 720890839 ns/op BenchmarkMigration0To1/1000-32 1 31633569085 ns/op ... test timed out after 10min ``` Benchmark results with the fix (migration 100k entries took ~7.6s): ``` BenchmarkMigration0To1/1-32 274 4371400 ns/op BenchmarkMigration0To1/10-32 259 4639209 ns/op BenchmarkMigration0To1/100-32 100 13374147 ns/op BenchmarkMigration0To1/1000-32 13 104220944 ns/op BenchmarkMigration0To1/10000-32 2 916656798 ns/op BenchmarkMigration0To1/100000-32 1 7616648790 ns/op PASS ``` Closes elastic#20705
…formance (elastic#20769) * Improve registry file migration performance (elastic#20717) ## What does this PR do? Ensure that fsync is called only once after the migration of old state entries is complete. ## Why is it important? The registry uses a checkpoint-predicate that when true writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migration the old registry file. During migration, the old state will be directly copied (after cleanup and schema changes are applied). The old state will only be deleted after the migration of all states is complete. Unfortunately, the checkpoint predicate did return true, instead of false, which did trigger a checkpoint operation per state to be migrated. The fix disables fsync, and now finalizes the migration by calling Checkpoint directly. The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries): ``` BenchmarkMigration0To1/1-32 286 4203353 ns/op BenchmarkMigration0To1/10-32 34 35730680 ns/op BenchmarkMigration0To1/100-32 2 720890839 ns/op BenchmarkMigration0To1/1000-32 1 31633569085 ns/op ... test timed out after 10min ``` Benchmark results with the fix (migration 100k entries took ~7.6s): ``` BenchmarkMigration0To1/1-32 274 4371400 ns/op BenchmarkMigration0To1/10-32 259 4639209 ns/op BenchmarkMigration0To1/100-32 100 13374147 ns/op BenchmarkMigration0To1/1000-32 13 104220944 ns/op BenchmarkMigration0To1/10000-32 2 916656798 ns/op BenchmarkMigration0To1/100000-32 1 7616648790 ns/op PASS ``` Closes elastic#20705 (cherry picked from commit c2be4a7) * fix changelog Co-authored-by: Steffen Siering <[email protected]>
Same problem as:
#11834 (comment)
Looking at the goroutine dump this comes up over and over again:
The full goroutine dump:
Filebeat logs on boot with the debug log level and stops doing any useful work:
Pull request that broke this most likely: https://github.com/elastic/beats/pull/19633/files
cc @urso
The text was updated successfully, but these errors were encountered: