[core] increase sqlite3 lock timeout to 60s #4552
Merged
understanding the issue
In certain high-load scenarios, we see `sqlite3.OperationalError: database is locked`. This was first reported in #1507, and various changes have partially addressed it (notably WAL: #1509, #3923, #4283). However, I still observed the issue on the spot database while cancelling or completing 1000+ managed jobs at once.
This PR avoids the issue by increasing the sqlite timeout to a high value (60 seconds), which should be well beyond what we expect to see in practice.
What is sqlite3.OperationalError: database is locked?
SQLite has various locking mechanisms to ensure consistency. If a transaction tries to execute but cannot obtain the necessary lock, it gives an SQLITE_BUSY error. In Python, this is translated to `sqlite3.OperationalError: database is locked`.
If the necessary lock is currently held, SQLite will keep trying to obtain it until it times out. The timeout is set by sqlite3_busy_timeout() in the C API, which corresponds to the `timeout=` kwarg of the sqlite3.connect() call in the Python API. In Python, the default is 5 seconds.
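For example (illustrative snippet; the file name is a placeholder, not skypilot's actual database path):

```python
import sqlite3

# Default behavior: SQLite retries for ~5 seconds, then raises
# sqlite3.OperationalError: database is locked.
conn_default = sqlite3.connect('state.db')

# With a higher timeout, SQLite keeps retrying for up to 60 seconds
# before giving up and raising the error.
conn_patient = sqlite3.connect('state.db', timeout=60.0)
```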
To obtain the lock, SQLite does not use a condition variable or any kind of notification-based synchronization. Instead, SQLite will retry acquisition at set intervals that back off until the retry interval maxes out at 100ms.
https://github.com/sqlite/sqlite/blob/f1747f93e0f8df7984b595b91649c7789217fe59/src/main.c#L1695-L1698
This means that we can calculate how many attempts a process gets to obtain the lock: approximately `TIMEOUT / 100ms` (ten times the timeout in seconds), plus a few extra before the retry interval fully backs off. For the default 5s timeout, we get exactly 59 attempts to obtain the lock. We also aren't guaranteed that SQLite will retry at all - some cases, such as those that would lead to deadlock, are explicitly excluded and immediately return the error.
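To sanity-check that number, here is a small sketch that mirrors the delay schedule from the linked sqliteDefaultBusyCallback (the schedule is copied from that source; the counting logic is a simplification):

```python
# Delay schedule (ms) used by SQLite's default busy handler. It backs off
# until it reaches 100ms, then keeps sleeping 100ms between retries.
DELAYS_MS = [1, 2, 5, 10, 15, 20, 25, 25, 25, 50, 50, 100]

def count_retries(timeout_ms: int) -> int:
    """Count how many times the busy handler sleeps before giving up."""
    retries = 0
    slept = 0
    while slept < timeout_ms:
        delay = DELAYS_MS[min(retries, len(DELAYS_MS) - 1)]
        # The last sleep is truncated so the total never exceeds the timeout.
        slept += min(delay, timeout_ms - slept)
        retries += 1
    return retries

print(count_retries(5_000))   # 59  (default 5s timeout)
print(count_retries(60_000))  # 609 (proposed 60s timeout)
```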
This error can also indicate that some code failed to close its database cursor, indefinitely blocking all other uses of the database. However, we can be reasonably confident that this is NOT the case for skypilot, since we consistently use context managers for the database connection and cursor.
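As a sketch of that pattern (generic illustration; `db_cursor` and the file path are hypothetical, not the actual skypilot helper):

```python
import contextlib
import sqlite3

@contextlib.contextmanager
def db_cursor(path: str, timeout: float = 60.0):
    """Open a connection and cursor, and always release them on exit."""
    conn = sqlite3.connect(path, timeout=timeout)
    try:
        cursor = conn.cursor()
        try:
            yield cursor
            conn.commit()
        finally:
            cursor.close()
    finally:
        conn.close()

# The cursor and connection are released even if the query raises:
# with db_cursor('state.db') as cur:
#     cur.execute('UPDATE jobs SET status = ? WHERE id = ?', ('DONE', 1))
```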
WAL
WAL mode (Write-Ahead Logging) changes the way writes are added to the database and significantly reduces lock contention. Specifically, writes are appended to a separate WAL file rather than modifying the main database file in place, so readers do not block writers and writers do not block readers.
This is much better than the default mode, where writing requires an exclusive lock on the database. Switching to WAL almost completely eliminated locking issues in skypilot. However, it is still possible to have lock contention:

- Only one write transaction can be active at a time, so concurrent writers still contend for the write lock.
- Checkpointing (folding the WAL back into the main database file) needs to briefly coordinate with readers and writers.

In high-concurrency situations, the first case is the most problematic one: we may have 1000+ processes trying to write to the database at the same time. Our goal in this situation should almost certainly be to avoid crashing, even if it takes a long time to write the change.
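For reference, WAL mode is enabled with a pragma like the one below (illustrative snippet; the file name is a placeholder). The setting is persistent: once set, the database stays in WAL mode for future connections until it is changed back.

```python
import sqlite3

conn = sqlite3.connect('state.db', timeout=60.0)
conn.execute('PRAGMA journal_mode=WAL')
```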
simulation and timeout choice
Created a test table and a script that writes to it 100 times, timing each write. Ran the script 1000x in parallel with a very high database timeout to capture per-write latency data (n=100,000).
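For illustration only (the original schema and script are not reproduced here; the table name, columns, and path below are placeholders consistent with the description above):

```python
import sqlite3
import time

DB_PATH = 'contention_test.db'  # placeholder path

def run_worker(worker_id: int, n_writes: int = 100):
    """Perform n_writes timed inserts; return per-write latencies in seconds."""
    conn = sqlite3.connect(DB_PATH, timeout=1000.0)  # effectively "never give up"
    latencies = []
    for i in range(n_writes):
        start = time.monotonic()
        with conn:  # wraps the insert in a transaction and commits it
            conn.execute('INSERT INTO test (worker, seq) VALUES (?, ?)',
                         (worker_id, i))
        latencies.append(time.monotonic() - start)
    conn.close()
    return latencies

# Launched ~1000x in parallel (separate processes), this yields ~100,000 samples.
```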
We see an exponential dropoff in the long tail of high-latency requests. From this data, we can estimate that the probability of a write latency >60s is on the order of 1e-7 to 1e-6.
This simulation is significantly worse than what we expect in the real world - even with 1000 managed jobs transitioning at the same time, each job only writes to the database a small number of times (<5) instead of 100x.
So, 60 seconds seems like a safe choice. Since skypilot is not really realtime-sensitive software, we should prefer a very long delay to crashing.
testing
Tested (run the relevant ones):
- `bash format.sh`
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`