[core] increase sqlite3 lock timeout to 60s #4552
Merged
understanding the issue
In certain high-load scenarios, we see `sqlite3.OperationalError: database is locked`. This was first reported in #1507, and various changes have partially addressed it (notably WAL: #1509, #3923, #4283). However, I still observed the issue on the spot database while cancelling or completing 1000+ managed jobs at once.
This PR avoids the issue by increasing the sqlite timeout to a high value (60 seconds), which should be well beyond what we expect to see in practice.
What is sqlite3.OperationalError: database is locked?
SQLite has various locking mechanisms to ensure consistency. If a transaction tries to execute but cannot obtain the necessary lock, it gives an SQLITE_BUSY error. In Python, this is translated to `sqlite3.OperationalError: database is locked`.
If the necessary lock is currently held, SQLite will keep trying to obtain it until it times out. The timeout is set by sqlite3_busy_timeout() in the C API, which corresponds to the `timeout=` kwarg of the sqlite3.connect() call in the Python API. In Python, the default is 5 seconds.
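For example (illustrative snippet; the file name is a placeholder, not skypilot's actual database path):

```python
import sqlite3

# Default behavior: SQLite retries for ~5 seconds, then raises
# sqlite3.OperationalError: database is locked.
conn_default = sqlite3.connect('state.db')

# With a higher timeout, SQLite keeps retrying for up to 60 seconds
# before giving up and raising the error.
conn_patient = sqlite3.connect('state.db', timeout=60.0)
```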
To obtain the lock, SQLite does not use a condition variable or any kind of notification-based synchronization. Instead, SQLite will retry acquisition at set intervals that back off until the retry interval maxes out at 100ms.
https://github.com/sqlite/sqlite/blob/f1747f93e0f8df7984b595b91649c7789217fe59/src/main.c#L1695-L1698
This means that we can calculate how many attempts a process gets to obtain the lock: approximately `TIMEOUT / 100ms` (ten times the timeout in seconds), plus a few extra before the retry interval fully backs off. For the default 5s timeout, we get exactly 59 attempts to obtain the lock. We also aren't guaranteed that SQLite will retry at all - some cases, such as those that would lead to deadlock, are explicitly excluded and immediately return the error.
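To sanity-check that number, here is a small sketch that mirrors the delay schedule from the linked sqliteDefaultBusyCallback (the schedule is copied from that source; the counting logic is a simplification):

```python
# Delay schedule (ms) used by SQLite's default busy handler. It backs off
# until it reaches 100ms, then keeps sleeping 100ms between retries.
DELAYS_MS = [1, 2, 5, 10, 15, 20, 25, 25, 25, 50, 50, 100]

def count_retries(timeout_ms: int) -> int:
    """Count how many times the busy handler sleeps before giving up."""
    retries = 0
    slept = 0
    while slept < timeout_ms:
        delay = DELAYS_MS[min(retries, len(DELAYS_MS) - 1)]
        # The last sleep is truncated so the total never exceeds the timeout.
        slept += min(delay, timeout_ms - slept)
        retries += 1
    return retries

print(count_retries(5_000))   # 59  (default 5s timeout)
print(count_retries(60_000))  # 609 (proposed 60s timeout)
```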
This error can also indicate that some code failed to close its database cursor, indefinitely blocking all other uses of the database. However, we can be reasonably confident that this is NOT the case for skypilot, since we consistently use context managers for the database connection and cursor.
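As a sketch of that pattern (generic illustration; `db_cursor` and the file path are hypothetical, not the actual skypilot helper):

```python
import contextlib
import sqlite3

@contextlib.contextmanager
def db_cursor(path: str, timeout: float = 60.0):
    """Open a connection and cursor, and always release them on exit."""
    conn = sqlite3.connect(path, timeout=timeout)
    try:
        cursor = conn.cursor()
        try:
            yield cursor
            conn.commit()
        finally:
            cursor.close()
    finally:
        conn.close()

# The cursor and connection are released even if the query raises:
# with db_cursor('state.db') as cur:
#     cur.execute('UPDATE jobs SET status = ? WHERE id = ?', ('DONE', 1))
```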
WAL
WAL mode (Write-Ahead Logging) changes the way writes are added to the database and significantly reduces lock contention. Specifically, writes are appended to a separate WAL file rather than modifying the main database file in place, so readers do not block writers and writers do not block readers.
This is much better than the default mode, where writing requires an exclusive lock on the database. Switching to WAL almost completely eliminated locking issues in skypilot. However, it is still possible to have lock contention:

- Only one write transaction can be active at a time, so concurrent writers still contend for the write lock.
- Checkpointing (folding the WAL back into the main database file) needs to briefly coordinate with readers and writers.

In high-concurrency situations, the first case is the most problematic one: we may have 1000+ processes trying to write to the database at the same time. Our goal in this situation should almost certainly be to avoid crashing, even if it takes a long time to write the change.
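For reference, WAL mode is enabled with a pragma like the one below (illustrative snippet; the file name is a placeholder). The setting is persistent: once set, the database stays in WAL mode for future connections until it is changed back.

```python
import sqlite3

conn = sqlite3.connect('state.db', timeout=60.0)
conn.execute('PRAGMA journal_mode=WAL')
```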
simulation and timeout choice
Created a test table and a script that writes to it 100 times, timing each write. Ran the script 1000x in parallel with a very high database timeout to capture per-write latency data (n=100,000).
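For illustration only (the original schema and script are not reproduced here; the table name, columns, and path below are placeholders consistent with the description above):

```python
import sqlite3
import time

DB_PATH = 'contention_test.db'  # placeholder path

def run_worker(worker_id: int, n_writes: int = 100):
    """Perform n_writes timed inserts; return per-write latencies in seconds."""
    conn = sqlite3.connect(DB_PATH, timeout=1000.0)  # effectively "never give up"
    latencies = []
    for i in range(n_writes):
        start = time.monotonic()
        with conn:  # wraps the insert in a transaction and commits it
            conn.execute('INSERT INTO test (worker, seq) VALUES (?, ?)',
                         (worker_id, i))
        latencies.append(time.monotonic() - start)
    conn.close()
    return latencies

# Launched ~1000x in parallel (separate processes), this yields ~100,000 samples.
```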
We see an exponential dropoff in the long tail of high-latency requests. From this data, we can estimate that the probability of a write latency >60s is on the order of 1e-7 to 1e-6.
This simulation is significantly worse than what we expect in the real world - even with 1000 managed jobs transitioning at the same time, each job only writes to the database a small number of times (<5) instead of 100x.
So, 60 seconds seems like a safe choice. Since skypilot is not really realtime-sensitive software, we should prefer a very long delay to crashing.
testing
Tested (run the relevant ones):
- `bash format.sh`
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`