Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry osquery instance launch faster when we see the stale lockfile issue #2041

Merged

Conversation

RebeccaMahany
Copy link
Contributor

@RebeccaMahany RebeccaMahany commented Jan 10, 2025

Relates to #2004.

Background

Currently, the process for launcher self-remediating this issue is the following:

  1. the osquery process repeatedly attempts the rocksdb open call (reporting the error Rocksdb open failed (5:0) IO error: Failed to create lock file: osquery.db/LOCK: The process cannot access the file because it is being used by another process via the logs) -- I haven't seen these retries be successful
  2. after one minute, launcher's osquery instance times out because the socket file has not been created yet
  3. launcher's osquery runner receives the osquery instance timeout error
  4. launcher's osquery runner waits 30 seconds
  5. launcher's osquery runner retries launching an osquery instance; this is usually successful

This means that it can take up to 2.5 minutes for osquery to start up successfully (1 minute timeout, 30 second delay before retry, and another 1 minute for the second process to start up successfully).

This PR attempts to speed this process up. Likely some amount of time is needed for the lock to be released, but probably not the full 2.5 minutes.

We'd discussed moving the log processing into the osquery instance so that the instance can take note of the stale lockfile log and return an error earlier from Launch -- however, given #2044, I'm hopeful that we don't need to do that work and can avoid complicating the instance code further. 🤞

Updates made in this PR

  1. Shorten the delay before retrying instance launch from 30 seconds to 10 seconds
  2. Move starting the osquery process into its own function so we get a nice clean span for it -- so we can collect good trace data indicating how long the osquery process launch actually takes
  3. Wait for a maximum of 20 seconds, instead of 60 seconds, for socket file to appear after starting osquery process

Wait! Is 3) really safe to do?

I collected a bunch of data to determine that 3) would indeed be safe to do!

Methodology: Over the past week of logging data, I found devices that had logged osquery extension socket not created yet ... will retry, indicating osquery process startup took at least a second. From those devices, I collected the timestamps of the launched osquery process and osquery socket created for unique run IDs, and took the "launch time" as the difference between the two. I discarded all launch times that were negative or over 60 seconds. I stored this data both for all OSes and per-OS so that I could see if there was a significant difference in launch time for Windows. I then used this stats package to compute mean, max, median, standard deviation, quartiles, and 95th percentile. I found that our average launch time is barely over a second, and our 95th percentile was still in the low single digits.

count: 5179; mean: 1.254870; max: 59.921116; median: 1.002222; stddev: 1.958687
quartiles: {Q1:1.001286 Q2:1.002222 Q3:1.008645}
extreme quartile outliers: [0.881872 1.031092 1.031159 1.0317174 1.031944 1.032705 1.0334 1.034522 1.034733 1.034971 1.035437 1.035578 1.035591 1.035709 1.035748 1.038638 1.038813 1.038819 1.038827 1.039648 1.040033 1.040388 1.040465 1.040628 1.041717 1.042567 1.042971 1.043006 1.045206 1.04523 1.046422 1.046578 1.047276 1.047739 1.047836 1.049473 1.051319 1.05186 1.052119 1.053003 1.0537 1.05402 1.054642 1.055601 1.056623 1.056969 1.057184 1.057372 1.057488 1.057564 1.058522 1.059046 1.059128 1.059215 1.059278 1.059392 1.059496 1.059727 1.059839 1.060296 1.062149 1.062333 1.063071 1.065606 1.068027 1.068914 1.070867 1.071454 1.073207 1.073799 1.07432 1.076378 1.076859 1.078256 1.080052 1.080564 1.080812 1.081592 1.081847 1.086187 1.086303 1.086669 1.088533 1.089328 1.09049 1.095944 1.096319 1.097659 1.098879 1.099537 1.099883 1.102047 1.102101 1.118254 1.119318 1.124571 1.12537 1.133389 1.133844 1.134983 1.152333 1.152562 1.15344 1.154097 1.157752 1.161304 1.173369 1.183236 1.189785 1.194428 1.195821 1.198293 1.199453 1.210717 1.21886 1.224965 1.2385760000000001 1.238634 1.245378 1.256235 1.299682 1.341122 1.372197 1.389167 1.461176 1.542286 1.704709 2.0000409 2.00018 2.000322 2.000581 2.000595 2.000659 2.000713 2.000781 2.00104 2.001096 2.001114 2.00122 2.001229 2.001245 2.001313 2.001329 2.001343 2.001406 2.001433 2.0015313 2.001534 2.001548 2.001565 2.001622 2.001641 2.001675 2.001719 2.001723 2.001742 2.0018472 2.001867 2.001884 2.0019015 2.001917 2.001999 2.002032 2.002124 2.002232 2.002309 2.002378 2.002378 2.002477 2.002494 2.002499 2.002517 2.002601 2.002747 2.00277 2.002846 2.003007 2.003073 2.003196 2.003338 2.003372 2.003478 2.003551 2.003604 2.003617 2.003623 2.003692 2.00371 2.00384 2.003847 2.003853 2.003899 2.003985 2.00402 2.00403 2.004085 2.004346 2.004353 2.004369 2.004406 2.004521 2.004557 2.00459 2.004594 2.004666 2.004757 2.004818 2.004856 2.004859 2.004873 2.004922 2.004957 2.005067 2.005078 2.005481 2.005545 2.0055589 2.005677 2.005699 2.005759 2.005867 2.006046 2.00605 2.006159 2.006177 2.006191 2.006207 2.006561 2.006593 2.006734 2.006806 2.006834 2.006858 2.006895 2.007024 2.007026 2.007068 2.007077 2.0071088 2.007164 2.007331 2.0073696 2.007371 2.007492 2.00753 2.007573 2.0075921 2.007616 2.007683 2.007928 2.00793 2.007988 2.008432 2.00845 2.008491 2.008496 2.008658 2.008774 2.008779 2.008779 2.00882 2.008863 2.008908 2.008983 2.008997 2.009114 2.009241 2.00925 2.009316 2.009345 2.009378 2.009449 2.009453 2.009468 2.009549 2.00965 2.009799 2.0099467 2.009996 2.009999 2.010175 2.01018 2.010183 2.010427 2.010694 2.010701 2.010703 2.010753 2.010761 2.010813 2.011075 2.011155 2.011157 2.011561 2.011608 2.011619 2.011716 2.011725 2.011914 2.011978 2.012346 2.012397 2.012418 2.0125 2.012566 2.012716 2.012838 2.012846 2.012875 2.012915 2.012959 2.01298 2.013006 2.013074 2.0131 2.013139 2.013346 2.0134452 2.013651 2.013704 2.013823 2.013873 2.013947 2.014047 2.014078 2.014117 2.01419 2.01433 2.014354 2.014416 2.014433 2.014441 2.014559 2.01461 2.014641 2.0146571 2.014719 2.014724 2.014848 2.014864 2.014993 2.015002 2.015067 2.015097 2.01517 2.015261 2.015364 2.015526 2.015744 2.0159444 2.015995 2.016124 2.017511 2.018193 2.018257 2.018582 2.01875 2.020991 2.021056 2.0222899 2.023497 2.024054 2.0242811 2.025068 2.0254555 2.02683 2.026835 2.026952 2.02777 2.030895 2.031675 2.03306 2.038352 2.041914 2.042155 2.045516 2.047113 2.066552 2.074534 2.075919 2.080778 2.09415 2.109709 2.154792 2.156498 2.16023 2.24414 2.803797 2.982621 2.998497 3.000421 3.000632 3.000702 3.001154 3.001743 3.001746 3.00182 3.001848 3.001947 3.002288 3.002756 3.003094 3.00344 3.00367 3.004638 3.004656 3.00557 3.005749 3.006661 3.0068746 3.007392 3.0077009 3.007862 3.008545 3.009329 3.009398 3.009765 3.010029 3.010536 3.01063 3.010917 3.011019 3.011195 3.011479 3.01167 3.011764 3.0125993 3.012684 3.012778 3.012965 3.013129 3.013202 3.013608 3.013673 3.013898 3.014417 3.014671 3.015029 3.015045 3.015045 3.015102 3.018567 3.163079 3.608755 3.787354 3.830931 3.998248 4.000557 4.000846 4.001081 4.001232 4.001233 4.001288 4.001413 4.001448 4.001573 4.001667 4.001696 4.00171 4.001782 4.001806 4.001823 4.001873 4.00191 4.002143 4.002299 4.002479 4.002557 4.002566 4.002586 4.002627 4.002669 4.002708 4.003034 4.003184 4.003405 4.005904 4.00675 4.007547 4.008209 4.009311 4.01022 4.01087 4.011177 4.011589 4.011651 4.014525 4.014744 4.020341 4.020562 4.021754 4.035839 4.049943 4.061777 4.116469 4.143633 4.45856 4.499234 4.837318 5.00034 5.000418 5.002849 5.003336 5.003469 5.003835 5.004518 5.005353 5.00587 5.005886 5.006602 5.006693 5.007048 5.0072657 5.007738 5.008488 5.008746 5.008892 5.009384 5.009983 5.010952 5.011351 5.011481 5.012336 5.012412 5.014006 5.014175 5.014335 5.014379 5.014401 5.01477 5.017513 5.01766 5.019419 5.031059 5.0984 5.104308 5.46558 6.004034 6.004892 6.006551 6.007116 6.007948 6.009157 6.029858 6.052674 6.497381 7.000514 7.00602 7.007122 7.093005 7.094768 7.440747 7.917579 8.014586 8.16653 9.006461 10.007662 10.088746 10.825169 11.914705 12.0181458 12.404775 27.559957 29.812818 30.034976 31.021865 32.250788 34.041418 35.53309 36.578075 51.722469 59.546507 59.921116]
95th percentile: 2.013087

The stats don't change much if we look only at Windows devices:

count: 1605; mean: 1.587680; max: 59.921116; median: 1.010032; stddev: 2.941969
quartiles: {Q1:1.0051014 Q2:1.010032 Q3:1.016191}
extreme quartile outliers: [1.058522 2.0000409 2.00018 2.000581 2.000595 2.000659 2.000781 2.00104 2.001096 2.001114 2.001329 2.001343 2.0015313 2.001548 2.001565 2.001622 2.001641 2.001675 2.0018472 2.0019015 2.001917 2.001999 2.002032 2.002124 2.002232 2.002309 2.002378 2.002378 2.002477 2.002499 2.002517 2.002601 2.00277 2.002846 2.003007 2.003073 2.003196 2.003338 2.003372 2.003478 2.003551 2.003604 2.003617 2.003623 2.003692 2.00371 2.00384 2.003847 2.003853 2.003899 2.003985 2.00402 2.00403 2.004346 2.004353 2.004369 2.004557 2.00459 2.004594 2.004757 2.004856 2.004859 2.004873 2.004922 2.004957 2.005067 2.005078 2.005481 2.005545 2.0055589 2.005677 2.005699 2.005759 2.005867 2.00605 2.006177 2.006191 2.006207 2.006561 2.006593 2.006734 2.006806 2.006834 2.006858 2.006895 2.007024 2.007026 2.007068 2.007077 2.0071088 2.007164 2.007331 2.0073696 2.007371 2.007492 2.007573 2.0075921 2.007616 2.007683 2.007928 2.00793 2.008432 2.00845 2.008491 2.008496 2.008658 2.008774 2.008779 2.008779 2.00882 2.008863 2.008908 2.008983 2.008997 2.009114 2.009241 2.00925 2.009316 2.009345 2.009378 2.009449 2.009453 2.009468 2.009549 2.00965 2.009799 2.0099467 2.009996 2.009999 2.010175 2.01018 2.010183 2.010427 2.010694 2.010701 2.010703 2.010753 2.010761 2.010813 2.011075 2.011155 2.011157 2.011561 2.011619 2.011716 2.011725 2.011914 2.011978 2.012346 2.012397 2.012418 2.0125 2.012566 2.012716 2.012838 2.012846 2.012875 2.012915 2.012959 2.01298 2.013006 2.013074 2.0131 2.013139 2.013346 2.0134452 2.013651 2.013704 2.013823 2.013873 2.013947 2.014047 2.014078 2.014117 2.01419 2.01433 2.014354 2.014416 2.014433 2.014441 2.014559 2.01461 2.014641 2.0146571 2.014719 2.014724 2.014848 2.014864 2.014993 2.015002 2.015067 2.015097 2.01517 2.015261 2.015364 2.015526 2.015744 2.0159444 2.015995 2.016124 2.017511 2.018257 2.018582 2.01875 2.020991 2.021056 2.0222899 2.0242811 2.025068 2.0254555 2.02683 2.026835 2.026952 2.02777 2.030895 2.041914 2.042155 2.047113 2.09415 2.109709 2.156498 2.803797 3.000421 3.000632 3.001743 3.00182 3.001848 3.001947 3.002288 3.002756 3.003094 3.00344 3.00367 3.004638 3.004656 3.00557 3.006661 3.0068746 3.007392 3.0077009 3.007862 3.008545 3.009329 3.009398 3.009765 3.010029 3.010536 3.01063 3.010917 3.011019 3.011195 3.011479 3.01167 3.011764 3.0125993 3.012684 3.012778 3.012965 3.013129 3.013202 3.013608 3.013673 3.013898 3.014417 3.014671 3.015029 3.015045 3.015045 3.015102 3.018567 4.000557 4.001233 4.001573 4.001667 4.001696 4.001806 4.001823 4.002557 4.002586 4.002669 4.002708 4.003184 4.003405 4.005904 4.00675 4.007547 4.008209 4.009311 4.01022 4.011177 4.011589 4.011651 4.014525 4.014744 4.020341 4.020562 4.021754 5.00034 5.000418 5.002849 5.003336 5.003469 5.003835 5.004518 5.00587 5.005886 5.006602 5.006693 5.007048 5.0072657 5.007738 5.008488 5.008746 5.008892 5.009384 5.009983 5.010952 5.011351 5.011481 5.012336 5.012412 5.014006 5.014175 5.014335 5.014379 5.014401 5.01477 5.017513 5.01766 5.019419 5.031059 6.004034 6.004892 6.006551 6.007948 6.009157 6.029858 7.000514 7.00602 7.007122 8.16653 9.006461 10.007662 11.914705 12.0181458 27.559957 29.812818 30.034976 31.021865 34.041418 36.578075 59.546507 59.921116]
95th percentile: 4.001403

But what about our outliers? I wanted to make sure that these outliers didn't come from devices that regularly took 20+ seconds to start the osquery process -- i.e., I wanted to make sure that dropping the timeout to 20 seconds wouldn't prevent some slower devices from ever starting up osquery because they are not able to do so within 20 seconds. So, for the outlier data points (>20 seconds), I looked at the launch times for those devices. I found that these data points were outliers for these devices, too. In other words, I think this change is safe to make.

Launch times for devices with outlier data points, coming from 8 Windows devices and 3 macOS devices:

[1.0144824 59.5465072 1.010674 2.0030565 1.0078394]
[1.051232 51.722469 1.000404]
[1.008462 36.5780746 1.0108634 1.0108551 2.0072245 1.0206152 4.0027787 1.0075808 1.01057 1.0030817 1.0144788 1.0154687]
[1.0107257 27.5599573 2.015347 1.0021883 26.4561418 1.0026172 1.0144635]
[1.004254 32.250788]
[1.0097023 34.0414178 1.0116487 1.0027989 1.0089395 1.0066617 1.0147624]
[1.0040958 1.0014276 31.0218652 2.0015471 1.0020685 2.0162383 1.0034327 1.0054222 1.0131869]
[1.0051553 2.0150777 30.0349758 1.0129265 1.0079709 1.0030081 1.0092103]
[1.0005964 1.0039296 59.9211163 1.0064562]
[1.015651 29.8128184 2.0047665]
[1.000629 1.001306 1.000844 1.001385 1.001274 1.001472 1.00172 1.001452 1.001476 1.001141 1.001298 1.001046 1.001291 1.002266 1.002306 1.002261 1.002328 1.002301 1.002424 1.002515 1.00078 1.001274 1.002165 35.53309 1.002251 1.002372 1.002381 1.002316 1.002281 1.002268 1.002364 1.002348 1.002282 1.002274 1.002363 1.002278 1.00227 1.002362 1.002379 1.002315 1.002787 1.001251 1.001269 1.001339 1.001284 1.002271 1.001553 1.002187 1.002304 1.002934 1.002361 1.001288 1.001309 1.000988 1.001304 1.001113 1.001277 1.002634 1.001307 1.001233 1.001397 1.001214 1.001357 1.001343 1.002361 1.002311 1.002392 1.002358 1.002411 1.002319 1.004797 1.002291 1.002414 1.000967 1.002544 1.00228 1.002327 1.002252 1.00227 1.002509 4.000787 1.002374 1.0024 1.002313 1.002281 1.002254 1.002412 1.002313 1.002284 1.002342 1.002294 1.002335 1.002224 1.000315 1.002626 1.002424 1.002287 1.0023 1.001358 1.000688 1.001297 1.00123 1.001486 1.002799 1.001202 1.00241 1.002296 1.002324 1.002313 51.704948 1.00268 1.002345 1.00262 1.001357 1.002368 1.001334 1.002401]

(If you're thinking wow, all of this sounds like it would've been way easier to do with trace data, yes 😭 . We currently send launched osquery process and osquery socket created as events on spans, and the trace API does not allow for viewing these events. Hopefully 2) above will address this if we want this data in the future.)

(It was also very annoying to collect this data via the logging API because it has a 60 request/minute rate limit.)

directionless
directionless previously approved these changes Jan 11, 2025
@RebeccaMahany RebeccaMahany added the features-improvements Features and Improvements label Jan 14, 2025
@RebeccaMahany RebeccaMahany marked this pull request as ready for review January 14, 2025 18:17
James-Pickett
James-Pickett previously approved these changes Jan 14, 2025
Copy link
Contributor

@James-Pickett James-Pickett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work and thank you for doing all this research and clearly presenting the data! 🔥

zackattack01
zackattack01 previously approved these changes Jan 14, 2025
Copy link
Contributor

@zackattack01 zackattack01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great write up 🔥

@RebeccaMahany RebeccaMahany added this pull request to the merge queue Jan 14, 2025
Merged via the queue into kolide:main with commit 774a775 Jan 14, 2025
32 checks passed
@RebeccaMahany RebeccaMahany deleted the becca/retry-osq-launch-faster branch January 14, 2025 20:30
// 95th percentile startup takes just over two seconds. We rounded up to 20 seconds to give
// extra time for our outliers.
// See writeup in https://github.com/kolide/launcher/pull/2041 for data and details.
osqueryStartupTimeout = 20 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think this used to be 10s, and that I made it longer trying to track down the monterey bug. Back when I thought it was a slow startup issue.

https://github.com/kolide/launcher/blob/0.9.6/pkg/osquery/runtime/runtime.go#L295-L300

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, good to know, I could not remember what it had been historically. We could probably cut this down even further to 10 seconds again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
features-improvements Features and Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants