Retry osquery instance launch faster when we see the stale lockfile issue #2041

RebeccaMahany · 2025-01-10T16:20:09Z

Relates to #2004.

Background

Currently, the process for launcher self-remediating this issue is the following:

the osquery process repeatedly attempts the rocksdb open call (reporting the error Rocksdb open failed (5:0) IO error: Failed to create lock file: osquery.db/LOCK: The process cannot access the file because it is being used by another process via the logs) -- I haven't seen these retries be successful
after one minute, launcher's osquery instance times out because the socket file has not been created yet
launcher's osquery runner receives the osquery instance timeout error
launcher's osquery runner waits 30 seconds
launcher's osquery runner retries launching an osquery instance; this is usually successful

This means that it can take up to 2.5 minutes for osquery to start up successfully (1 minute timeout, 30 second delay before retry, and another 1 minute for the second process to start up successfully).

This PR attempts to speed this process up. Likely some amount of time is needed for the lock to be released, but probably not the full 2.5 minutes.

We'd discussed moving the log processing into the osquery instance so that the instance can take note of the stale lockfile log and return an error earlier from Launch -- however, given #2044, I'm hopeful that we don't need to do that work and can avoid complicating the instance code further. 🤞

Updates made in this PR

Shorten the delay before retrying instance launch from 30 seconds to 10 seconds
Move starting the osquery process into its own function so we get a nice clean span for it -- so we can collect good trace data indicating how long the osquery process launch actually takes
Wait for a maximum of 20 seconds, instead of 60 seconds, for socket file to appear after starting osquery process

Wait! Is 3) really safe to do?

I collected a bunch of data to determine that 3) would indeed be safe to do!

Methodology: Over the past week of logging data, I found devices that had logged osquery extension socket not created yet ... will retry, indicating osquery process startup took at least a second. From those devices, I collected the timestamps of the launched osquery process and osquery socket created for unique run IDs, and took the "launch time" as the difference between the two. I discarded all launch times that were negative or over 60 seconds. I stored this data both for all OSes and per-OS so that I could see if there was a significant difference in launch time for Windows. I then used this stats package to compute mean, max, median, standard deviation, quartiles, and 95th percentile. I found that our average launch time is barely over a second, and our 95th percentile was still in the low single digits.

count: 5179; mean: 1.254870; max: 59.921116; median: 1.002222; stddev: 1.958687
quartiles: {Q1:1.001286 Q2:1.002222 Q3:1.008645}
extreme quartile outliers: [0.881872 1.031092 1.031159 1.0317174 1.031944 1.032705 1.0334 1.034522 1.034733 1.034971 1.035437 1.035578 1.035591 1.035709 1.035748 1.038638 1.038813 1.038819 1.038827 1.039648 1.040033 1.040388 1.040465 1.040628 1.041717 1.042567 1.042971 1.043006 1.045206 1.04523 1.046422 1.046578 1.047276 1.047739 1.047836 1.049473 1.051319 1.05186 1.052119 1.053003 1.0537 1.05402 1.054642 1.055601 1.056623 1.056969 1.057184 1.057372 1.057488 1.057564 1.058522 1.059046 1.059128 1.059215 1.059278 1.059392 1.059496 1.059727 1.059839 1.060296 1.062149 1.062333 1.063071 1.065606 1.068027 1.068914 1.070867 1.071454 1.073207 1.073799 1.07432 1.076378 1.076859 1.078256 1.080052 1.080564 1.080812 1.081592 1.081847 1.086187 1.086303 1.086669 1.088533 1.089328 1.09049 1.095944 1.096319 1.097659 1.098879 1.099537 1.099883 1.102047 1.102101 1.118254 1.119318 1.124571 1.12537 1.133389 1.133844 1.134983 1.152333 1.152562 1.15344 1.154097 1.157752 1.161304 1.173369 1.183236 1.189785 1.194428 1.195821 1.198293 1.199453 1.210717 1.21886 1.224965 1.2385760000000001 1.238634 1.245378 1.256235 1.299682 1.341122 1.372197 1.389167 1.461176 1.542286 1.704709 2.0000409 2.00018 2.000322 2.000581 2.000595 2.000659 2.000713 2.000781 2.00104 2.001096 2.001114 2.00122 2.001229 2.001245 2.001313 2.001329 2.001343 2.001406 2.001433 2.0015313 2.001534 2.001548 2.001565 2.001622 2.001641 2.001675 2.001719 2.001723 2.001742 2.0018472 2.001867 2.001884 2.0019015 2.001917 2.001999 2.002032 2.002124 2.002232 2.002309 2.002378 2.002378 2.002477 2.002494 2.002499 2.002517 2.002601 2.002747 2.00277 2.002846 2.003007 2.003073 2.003196 2.003338 2.003372 2.003478 2.003551 2.003604 2.003617 2.003623 2.003692 2.00371 2.00384 2.003847 2.003853 2.003899 2.003985 2.00402 2.00403 2.004085 2.004346 2.004353 2.004369 2.004406 2.004521 2.004557 2.00459 2.004594 2.004666 2.004757 2.004818 2.004856 2.004859 2.004873 2.004922 2.004957 2.005067 2.005078 2.005481 2.005545 2.0055589 2.005677 2.005699 2.005759 2.005867 2.006046 2.00605 2.006159 2.006177 2.006191 2.006207 2.006561 2.006593 2.006734 2.006806 2.006834 2.006858 2.006895 2.007024 2.007026 2.007068 2.007077 2.0071088 2.007164 2.007331 2.0073696 2.007371 2.007492 2.00753 2.007573 2.0075921 2.007616 2.007683 2.007928 2.00793 2.007988 2.008432 2.00845 2.008491 2.008496 2.008658 2.008774 2.008779 2.008779 2.00882 2.008863 2.008908 2.008983 2.008997 2.009114 2.009241 2.00925 2.009316 2.009345 2.009378 2.009449 2.009453 2.009468 2.009549 2.00965 2.009799 2.0099467 2.009996 2.009999 2.010175 2.01018 2.010183 2.010427 2.010694 2.010701 2.010703 2.010753 2.010761 2.010813 2.011075 2.011155 2.011157 2.011561 2.011608 2.011619 2.011716 2.011725 2.011914 2.011978 2.012346 2.012397 2.012418 2.0125 2.012566 2.012716 2.012838 2.012846 2.012875 2.012915 2.012959 2.01298 2.013006 2.013074 2.0131 2.013139 2.013346 2.0134452 2.013651 2.013704 2.013823 2.013873 2.013947 2.014047 2.014078 2.014117 2.01419 2.01433 2.014354 2.014416 2.014433 2.014441 2.014559 2.01461 2.014641 2.0146571 2.014719 2.014724 2.014848 2.014864 2.014993 2.015002 2.015067 2.015097 2.01517 2.015261 2.015364 2.015526 2.015744 2.0159444 2.015995 2.016124 2.017511 2.018193 2.018257 2.018582 2.01875 2.020991 2.021056 2.0222899 2.023497 2.024054 2.0242811 2.025068 2.0254555 2.02683 2.026835 2.026952 2.02777 2.030895 2.031675 2.03306 2.038352 2.041914 2.042155 2.045516 2.047113 2.066552 2.074534 2.075919 2.080778 2.09415 2.109709 2.154792 2.156498 2.16023 2.24414 2.803797 2.982621 2.998497 3.000421 3.000632 3.000702 3.001154 3.001743 3.001746 3.00182 3.001848 3.001947 3.002288 3.002756 3.003094 3.00344 3.00367 3.004638 3.004656 3.00557 3.005749 3.006661 3.0068746 3.007392 3.0077009 3.007862 3.008545 3.009329 3.009398 3.009765 3.010029 3.010536 3.01063 3.010917 3.011019 3.011195 3.011479 3.01167 3.011764 3.0125993 3.012684 3.012778 3.012965 3.013129 3.013202 3.013608 3.013673 3.013898 3.014417 3.014671 3.015029 3.015045 3.015045 3.015102 3.018567 3.163079 3.608755 3.787354 3.830931 3.998248 4.000557 4.000846 4.001081 4.001232 4.001233 4.001288 4.001413 4.001448 4.001573 4.001667 4.001696 4.00171 4.001782 4.001806 4.001823 4.001873 4.00191 4.002143 4.002299 4.002479 4.002557 4.002566 4.002586 4.002627 4.002669 4.002708 4.003034 4.003184 4.003405 4.005904 4.00675 4.007547 4.008209 4.009311 4.01022 4.01087 4.011177 4.011589 4.011651 4.014525 4.014744 4.020341 4.020562 4.021754 4.035839 4.049943 4.061777 4.116469 4.143633 4.45856 4.499234 4.837318 5.00034 5.000418 5.002849 5.003336 5.003469 5.003835 5.004518 5.005353 5.00587 5.005886 5.006602 5.006693 5.007048 5.0072657 5.007738 5.008488 5.008746 5.008892 5.009384 5.009983 5.010952 5.011351 5.011481 5.012336 5.012412 5.014006 5.014175 5.014335 5.014379 5.014401 5.01477 5.017513 5.01766 5.019419 5.031059 5.0984 5.104308 5.46558 6.004034 6.004892 6.006551 6.007116 6.007948 6.009157 6.029858 6.052674 6.497381 7.000514 7.00602 7.007122 7.093005 7.094768 7.440747 7.917579 8.014586 8.16653 9.006461 10.007662 10.088746 10.825169 11.914705 12.0181458 12.404775 27.559957 29.812818 30.034976 31.021865 32.250788 34.041418 35.53309 36.578075 51.722469 59.546507 59.921116]
95th percentile: 2.013087

The stats don't change much if we look only at Windows devices:

count: 1605; mean: 1.587680; max: 59.921116; median: 1.010032; stddev: 2.941969
quartiles: {Q1:1.0051014 Q2:1.010032 Q3:1.016191}
extreme quartile outliers: [1.058522 2.0000409 2.00018 2.000581 2.000595 2.000659 2.000781 2.00104 2.001096 2.001114 2.001329 2.001343 2.0015313 2.001548 2.001565 2.001622 2.001641 2.001675 2.0018472 2.0019015 2.001917 2.001999 2.002032 2.002124 2.002232 2.002309 2.002378 2.002378 2.002477 2.002499 2.002517 2.002601 2.00277 2.002846 2.003007 2.003073 2.003196 2.003338 2.003372 2.003478 2.003551 2.003604 2.003617 2.003623 2.003692 2.00371 2.00384 2.003847 2.003853 2.003899 2.003985 2.00402 2.00403 2.004346 2.004353 2.004369 2.004557 2.00459 2.004594 2.004757 2.004856 2.004859 2.004873 2.004922 2.004957 2.005067 2.005078 2.005481 2.005545 2.0055589 2.005677 2.005699 2.005759 2.005867 2.00605 2.006177 2.006191 2.006207 2.006561 2.006593 2.006734 2.006806 2.006834 2.006858 2.006895 2.007024 2.007026 2.007068 2.007077 2.0071088 2.007164 2.007331 2.0073696 2.007371 2.007492 2.007573 2.0075921 2.007616 2.007683 2.007928 2.00793 2.008432 2.00845 2.008491 2.008496 2.008658 2.008774 2.008779 2.008779 2.00882 2.008863 2.008908 2.008983 2.008997 2.009114 2.009241 2.00925 2.009316 2.009345 2.009378 2.009449 2.009453 2.009468 2.009549 2.00965 2.009799 2.0099467 2.009996 2.009999 2.010175 2.01018 2.010183 2.010427 2.010694 2.010701 2.010703 2.010753 2.010761 2.010813 2.011075 2.011155 2.011157 2.011561 2.011619 2.011716 2.011725 2.011914 2.011978 2.012346 2.012397 2.012418 2.0125 2.012566 2.012716 2.012838 2.012846 2.012875 2.012915 2.012959 2.01298 2.013006 2.013074 2.0131 2.013139 2.013346 2.0134452 2.013651 2.013704 2.013823 2.013873 2.013947 2.014047 2.014078 2.014117 2.01419 2.01433 2.014354 2.014416 2.014433 2.014441 2.014559 2.01461 2.014641 2.0146571 2.014719 2.014724 2.014848 2.014864 2.014993 2.015002 2.015067 2.015097 2.01517 2.015261 2.015364 2.015526 2.015744 2.0159444 2.015995 2.016124 2.017511 2.018257 2.018582 2.01875 2.020991 2.021056 2.0222899 2.0242811 2.025068 2.0254555 2.02683 2.026835 2.026952 2.02777 2.030895 2.041914 2.042155 2.047113 2.09415 2.109709 2.156498 2.803797 3.000421 3.000632 3.001743 3.00182 3.001848 3.001947 3.002288 3.002756 3.003094 3.00344 3.00367 3.004638 3.004656 3.00557 3.006661 3.0068746 3.007392 3.0077009 3.007862 3.008545 3.009329 3.009398 3.009765 3.010029 3.010536 3.01063 3.010917 3.011019 3.011195 3.011479 3.01167 3.011764 3.0125993 3.012684 3.012778 3.012965 3.013129 3.013202 3.013608 3.013673 3.013898 3.014417 3.014671 3.015029 3.015045 3.015045 3.015102 3.018567 4.000557 4.001233 4.001573 4.001667 4.001696 4.001806 4.001823 4.002557 4.002586 4.002669 4.002708 4.003184 4.003405 4.005904 4.00675 4.007547 4.008209 4.009311 4.01022 4.011177 4.011589 4.011651 4.014525 4.014744 4.020341 4.020562 4.021754 5.00034 5.000418 5.002849 5.003336 5.003469 5.003835 5.004518 5.00587 5.005886 5.006602 5.006693 5.007048 5.0072657 5.007738 5.008488 5.008746 5.008892 5.009384 5.009983 5.010952 5.011351 5.011481 5.012336 5.012412 5.014006 5.014175 5.014335 5.014379 5.014401 5.01477 5.017513 5.01766 5.019419 5.031059 6.004034 6.004892 6.006551 6.007948 6.009157 6.029858 7.000514 7.00602 7.007122 8.16653 9.006461 10.007662 11.914705 12.0181458 27.559957 29.812818 30.034976 31.021865 34.041418 36.578075 59.546507 59.921116]
95th percentile: 4.001403

But what about our outliers? I wanted to make sure that these outliers didn't come from devices that regularly took 20+ seconds to start the osquery process -- i.e., I wanted to make sure that dropping the timeout to 20 seconds wouldn't prevent some slower devices from ever starting up osquery because they are not able to do so within 20 seconds. So, for the outlier data points (>20 seconds), I looked at the launch times for those devices. I found that these data points were outliers for these devices, too. In other words, I think this change is safe to make.

Launch times for devices with outlier data points, coming from 8 Windows devices and 3 macOS devices:

[1.0144824 59.5465072 1.010674 2.0030565 1.0078394]
[1.051232 51.722469 1.000404]
[1.008462 36.5780746 1.0108634 1.0108551 2.0072245 1.0206152 4.0027787 1.0075808 1.01057 1.0030817 1.0144788 1.0154687]
[1.0107257 27.5599573 2.015347 1.0021883 26.4561418 1.0026172 1.0144635]
[1.004254 32.250788]
[1.0097023 34.0414178 1.0116487 1.0027989 1.0089395 1.0066617 1.0147624]
[1.0040958 1.0014276 31.0218652 2.0015471 1.0020685 2.0162383 1.0034327 1.0054222 1.0131869]
[1.0051553 2.0150777 30.0349758 1.0129265 1.0079709 1.0030081 1.0092103]
[1.0005964 1.0039296 59.9211163 1.0064562]
[1.015651 29.8128184 2.0047665]
[1.000629 1.001306 1.000844 1.001385 1.001274 1.001472 1.00172 1.001452 1.001476 1.001141 1.001298 1.001046 1.001291 1.002266 1.002306 1.002261 1.002328 1.002301 1.002424 1.002515 1.00078 1.001274 1.002165 35.53309 1.002251 1.002372 1.002381 1.002316 1.002281 1.002268 1.002364 1.002348 1.002282 1.002274 1.002363 1.002278 1.00227 1.002362 1.002379 1.002315 1.002787 1.001251 1.001269 1.001339 1.001284 1.002271 1.001553 1.002187 1.002304 1.002934 1.002361 1.001288 1.001309 1.000988 1.001304 1.001113 1.001277 1.002634 1.001307 1.001233 1.001397 1.001214 1.001357 1.001343 1.002361 1.002311 1.002392 1.002358 1.002411 1.002319 1.004797 1.002291 1.002414 1.000967 1.002544 1.00228 1.002327 1.002252 1.00227 1.002509 4.000787 1.002374 1.0024 1.002313 1.002281 1.002254 1.002412 1.002313 1.002284 1.002342 1.002294 1.002335 1.002224 1.000315 1.002626 1.002424 1.002287 1.0023 1.001358 1.000688 1.001297 1.00123 1.001486 1.002799 1.001202 1.00241 1.002296 1.002324 1.002313 51.704948 1.00268 1.002345 1.00262 1.001357 1.002368 1.001334 1.002401]

(If you're thinking wow, all of this sounds like it would've been way easier to do with trace data, yes 😭 . We currently send launched osquery process and osquery socket created as events on spans, and the trace API does not allow for viewing these events. Hopefully 2) above will address this if we want this data in the future.)

(It was also very annoying to collect this data via the logging API because it has a 60 request/minute rate limit.)

…an for it

…nch-faster

James-Pickett

nice work and thank you for doing all this research and clearly presenting the data! 🔥

zackattack01

great write up 🔥

pkg/osquery/runtime/osqueryinstance.go

directionless · 2025-01-15T02:09:03Z

pkg/osquery/runtime/osqueryinstance.go

+	// 95th percentile startup takes just over two seconds. We rounded up to 20 seconds to give
+	// extra time for our outliers.
+	// See writeup in https://github.com/kolide/launcher/pull/2041 for data and details.
+	osqueryStartupTimeout = 20 * time.Second


FWIW I think this used to be 10s, and that I made it longer trying to track down the monterey bug. Back when I thought it was a slow startup issue.

https://github.com/kolide/launcher/blob/0.9.6/pkg/osquery/runtime/runtime.go#L295-L300

Ooh, good to know, I could not remember what it had been historically. We could probably cut this down even further to 10 seconds again.

RebeccaMahany added 3 commits January 10, 2025 11:18

Shorten delay before retrying osquery instance launch

13762bb

Move osquery process start into its own function so we get a clean sp…

22f648a

…an for it

Merge remote-tracking branch 'upstream/main' into becca/retry-osq-lau…

1148544

…nch-faster

directionless previously approved these changes Jan 11, 2025

View reviewed changes

RebeccaMahany added 2 commits January 13, 2025 14:06

Merge remote-tracking branch 'upstream/main' into becca/retry-osq-lau…

c5d494b

…nch-faster

Reduce startup timeout from 1 minute to 20 seconds

fab2a13

RebeccaMahany dismissed directionless’s stale review via fab2a13 January 13, 2025 19:07

Merge remote-tracking branch 'upstream/main' into becca/retry-osq-lau…

c11d87b

…nch-faster

RebeccaMahany added the features-improvements Features and Improvements label Jan 14, 2025

RebeccaMahany marked this pull request as ready for review January 14, 2025 18:17

James-Pickett previously approved these changes Jan 14, 2025

View reviewed changes

zackattack01 previously approved these changes Jan 14, 2025

View reviewed changes

zackattack01 reviewed Jan 14, 2025

View reviewed changes

pkg/osquery/runtime/osqueryinstance.go Show resolved Hide resolved

Document timeout and link to PR

fcea886

RebeccaMahany dismissed stale reviews from zackattack01 and James-Pickett via fcea886 January 14, 2025 19:24

James-Pickett approved these changes Jan 14, 2025

View reviewed changes

RebeccaMahany added this pull request to the merge queue Jan 14, 2025

zackattack01 approved these changes Jan 14, 2025

View reviewed changes

Merged via the queue into kolide:main with commit 774a775 Jan 14, 2025
32 checks passed

RebeccaMahany deleted the becca/retry-osq-launch-faster branch January 14, 2025 20:30

RebeccaMahany mentioned this pull request Jan 14, 2025

Need to remediate when osquery cannot create database lock file #2004

Open

directionless reviewed Jan 15, 2025

View reviewed changes

James-Pickett mentioned this pull request Jan 24, 2025

james/tpm runner handle no tpm #2066

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry osquery instance launch faster when we see the stale lockfile issue #2041

Retry osquery instance launch faster when we see the stale lockfile issue #2041

RebeccaMahany commented Jan 10, 2025 •

edited

Loading

James-Pickett left a comment •

edited

Loading

zackattack01 left a comment

directionless Jan 15, 2025

RebeccaMahany Jan 15, 2025

Retry osquery instance launch faster when we see the stale lockfile issue #2041

Retry osquery instance launch faster when we see the stale lockfile issue #2041

Conversation

RebeccaMahany commented Jan 10, 2025 • edited Loading

Background

Updates made in this PR

Wait! Is 3) really safe to do?

James-Pickett left a comment • edited Loading

Choose a reason for hiding this comment

zackattack01 left a comment

Choose a reason for hiding this comment

directionless Jan 15, 2025

Choose a reason for hiding this comment

RebeccaMahany Jan 15, 2025

Choose a reason for hiding this comment

RebeccaMahany commented Jan 10, 2025 •

edited

Loading

James-Pickett left a comment •

edited

Loading