I saw this error recently in the automated tests:
I1217 06:16:53.786639 2924 rocksdb.cpp:181] Rocksdb open failed (5:0) IO error: Failed to create lock file: some-root-dir\osquery.db/LOCK: The process cannot access the file because it is being used by another process.
At least at the time of the error, there was only one osquery process. It's not clear from the logs what happened, but I think the osquery process didn't get a chance to terminate completely and the lock file got left behind.
launcher cannot handle when the osquery process fails to start up fully with an error like this -- the only remediation currently available is manual.
We should a) improve the shutdown routine to make sure that launcher doesn't get into this situation, and b) update the osquery runner to detect this state and take corrective action.
Thoughts re: a) improve the shutdown routine to make sure that launcher doesn't get into this situation:
Do we need a longer timeout for the shutdown routines? (This timeout is not enforced by the runner or instance -- it comes from the rungroup and/or from whatever is managing the launcher service.) I think timeout during shutdown is the most likely explanation for this issue. In this particular example, the issue occurred on Windows, so it may be something specific to the Windows service.
Do we want to introduce some kind of ordering to the shutdown routines? I think it's possible we could be running into this issue because some of the shutdown routines happen simultaneously with the osquery process shutdown, and this disrupts clean shutdown.
Do we need to change how we're shutting down osquery to shut it down more gracefully? I don't think there's much more we can do here, but maybe I'm missing something.
Thoughts re: b) update the osquery runner to detect this state and take corrective action:
Is it a good idea for the osquery instance to remove the lock file (if one exists) before starting up a new osquery process? (I think we would prefer this to having the osquery instance remove the lock file on shutdown because the shutdown tasks run more or less simultaneously, so we could end up with e.g. the instance removing the lock file before the osquery process gets the chance to shut itself down cleanly.)
The osquery instance and osquery runner both currently have no visibility into the osquery logs -- they're processed by the log adapter, which is entirely separate. However, here and previously (when trying to figure out how to handle launcher falling back to an old version of osquery that is incompatible with the database -- some details here) we've tentatively wanted the osquery instance to be able to respond to particular logs from osquery. Do we want to move log processing into the osquery instance? Or open up some line of communication between the instance and the log adapter?
Currently, the process for launcher self-remediating this issue is the following:
1. The osquery process repeatedly attempts the rocksdb open call, reporting the error "Rocksdb open failed (5:0) IO error: Failed to create lock file: osquery.db/LOCK: The process cannot access the file because it is being used by another process" via the logs. I haven't seen these retries be successful.
2. After one minute, launcher's osquery instance times out because the socket file has not been created yet.
3. launcher's osquery runner retries launching an osquery instance; this is usually successful.
This means that it can take up to 2.5 minutes for osquery to start up successfully (1 minute timeout, 30 second delay before retry, and another 1 minute for the second process to start up successfully).
We may be able to speed this process up. Likely some amount of time is needed for the lock to be released, but probably not the full 2.5 minutes.
We discussed the last suggestion in section b) of this issue description and adapted it into the following:
1. Notice the "Rocksdb open failed (5:0) IO error: Failed to create lock file: osquery.db/LOCK: The process cannot access the file because it is being used by another process" log from osquery.
2. Stop trying to start up the instance immediately (rather than waiting for a full minute) and return an error to the runner.
3. The runner will wait out a shorter delay, and then retry starting the instance.
I am hopeful that #2044 will fix this issue (or at least that #2041 will help mitigate it more quickly) -- we'll know for sure after the next release, so I'm going to leave this issue open until then.