Retry osquery instance launch faster when we see the stale lockfile issue #2041
nice work and thank you for doing all this research and clearly presenting the data! 🔥
great write up 🔥
fcea886
// 95th percentile startup takes just over two seconds. We rounded up to 20 seconds to give
// extra time for our outliers.
// See writeup in https://github.com/kolide/launcher/pull/2041 for data and details.
osqueryStartupTimeout = 20 * time.Second
FWIW I think this used to be 10s, and that I made it longer while trying to track down the Monterey bug, back when I thought it was a slow startup issue.
https://github.com/kolide/launcher/blob/0.9.6/pkg/osquery/runtime/runtime.go#L295-L300
Ooh, good to know, I could not remember what it had been historically. We could probably cut this down even further to 10 seconds again.
Relates to #2004.
Background
Currently, the process for launcher self-remediating this issue is the following:
Rocksdb open failed (5:0) IO error: Failed to create lock file: osquery.db/LOCK: The process cannot access the file because it is being used by another process
via the logs) -- I haven't seen these retries be successful
This means that it can take up to 2.5 minutes for osquery to start up successfully (1 minute timeout, 30 second delay before retry, and another 1 minute for the second process to start up successfully).
This PR attempts to speed this process up. Likely some amount of time is needed for the lock to be released, but probably not the full 2.5 minutes.
We'd discussed moving the log processing into the osquery instance so that the instance can take note of the stale lockfile log and return an error earlier from `Launch` -- however, given #2044, I'm hopeful that we don't need to do that work and can avoid complicating the instance code further. 🤞
Updates made in this PR
Wait! Is 3) really safe to do?
I collected a bunch of data to determine that 3) would indeed be safe to do!
Methodology: Over the past week of logging data, I found devices that had logged `osquery extension socket not created yet ... will retry`, indicating osquery process startup took at least a second. From those devices, I collected the timestamps of the `launched osquery process` and `osquery socket created` for unique run IDs, and took the "launch time" as the difference between the two. I discarded all launch times that were negative or over 60 seconds. I stored this data both for all OSes and per-OS so that I could see if there was a significant difference in launch time for Windows. I then used this stats package to compute mean, max, median, standard deviation, quartiles, and 95th percentile. I found that our average launch time is barely over a second, and our 95th percentile was still in the low single digits.
The stats don't change much if we look only at Windows devices:
But what about our outliers? I wanted to make sure that these outliers didn't come from devices that regularly took 20+ seconds to start the osquery process -- i.e., I wanted to make sure that dropping the timeout to 20 seconds wouldn't prevent some slower devices from ever starting up osquery because they are not able to do so within 20 seconds. So, for the outlier data points (>20 seconds), I looked at the launch times for those devices. I found that these data points were outliers for these devices, too. In other words, I think this change is safe to make.
Launch times for devices with outlier data points, coming from 8 Windows devices and 3 macOS devices:
(If you're thinking wow, all of this sounds like it would've been way easier to do with trace data, yes 😭. We currently send `launched osquery process` and `osquery socket created` as events on spans, and the trace API does not allow for viewing these events. Hopefully 2) above will address this if we want this data in the future.)
(It was also very annoying to collect this data via the logging API because it has a 60 request/minute rate limit.)