
[core] increase sqlite3 lock timeout to 60s #4552

Merged · 1 commit · Jan 15, 2025

Conversation

cg505 (Collaborator) commented Jan 10, 2025

understanding the issue

In certain high-load scenarios, we see sqlite3.OperationalError: database is locked. This was first reported in #1507, and various fixes have partially addressed it (notably WAL: #1509, #3923, #4283).

However, I still observed the issue on the spot database while cancelling or completing 1000+ managed jobs at once.

This PR avoids the issue by increasing the sqlite timeout to a high value (60 seconds), which should be well beyond what we expect to see in practice.

What is sqlite3.OperationalError: database is locked?

SQLite has various locking mechanisms to ensure consistency. If a transaction tries to execute but cannot obtain the necessary lock, it will give an SQLITE_BUSY error. In Python, this is translated to sqlite3.OperationalError: database is locked.

If the necessary lock is currently held, SQLite will keep trying to obtain it until it times out. The timeout is set by sqlite3_busy_timeout() in the C API, which corresponds to the timeout= kwarg of the sqlite3.connect() call in the Python API.

In Python, the default is 5 seconds.
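For illustration, a minimal sketch of where this lives in the Python API (the file name is a placeholder and this is not the PR's diff, just the API surface involved):

import sqlite3

conn = sqlite3.connect('state.db')                # default: wait up to 5 s for a lock
conn = sqlite3.connect('state.db', timeout=60.0)  # the value this PR moves to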

To obtain the lock, SQLite does not use a condition variable or any kind of notification-based synchronization. Instead, SQLite will retry acquisition at set intervals that back off until the retry interval maxes out at 100ms.
https://github.com/sqlite/sqlite/blob/f1747f93e0f8df7984b595b91649c7789217fe59/src/main.c#L1695-L1698
This means we can calculate how many attempts a process gets to obtain the lock: approximately TIMEOUT / 100ms (ten times the timeout in seconds), plus a few extra during the initial ramp-up while the retry interval is still shorter than 100ms. For the default 5s timeout, that works out to exactly 59 retries.
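To make this concrete, here is a back-of-the-envelope sketch (illustrative Python, not skypilot code) that mirrors the delay table from the C source linked above and counts how many retries fit into a given timeout:

# Delay schedule (in ms) used by SQLite's default busy handler.
DELAYS_MS = [1, 2, 5, 10, 15, 20, 25, 25, 25, 50, 50, 100]

def busy_retries(timeout_ms: int) -> int:
    """Number of times the busy handler sleeps and retries before giving up."""
    retries = 0
    waited = 0
    while True:
        delay = DELAYS_MS[retries] if retries < len(DELAYS_MS) else 100
        if waited + delay > timeout_ms:
            delay = timeout_ms - waited
        if delay <= 0:
            return retries
        waited += delay
        retries += 1

print(busy_retries(5_000))   # 59 with the default 5 s timeout
print(busy_retries(60_000))  # 609 with the 60 s timeout in this PR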

We aren't guaranteed that SQLite will retry obtaining the lock - some cases, such as those that would lead to deadlock, are explicitly excluded and will immediately return the error.

If SQLite determines that invoking the busy handler could result in a deadlock, it will go ahead and return SQLITE_BUSY to the application instead of invoking the busy handler.
https://www.sqlite.org/c3ref/busy_handler.html

This error can also indicate that some code failed to close its database cursor, indefinitely blocking all other users of the database. However, we can be reasonably confident that this is NOT the case for skypilot, since we consistently use context managers for the database connection and cursor.
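For illustration, the pattern looks roughly like this (a sketch of the idea, not necessarily skypilot's exact safe_cursor implementation); the cursor and connection are released even if the body raises:

import contextlib
import sqlite3

@contextlib.contextmanager
def safe_cursor(db_path: str):
    """Yield a cursor, then commit and close even if the caller raises."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    try:
        yield cursor
    finally:
        cursor.close()
        conn.commit()
        conn.close()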

WAL

WAL mode (Write-Ahead Logging) changes the way writes are added to the database and significantly reduces lock contention. Specifically:

  • Writers no longer block readers
  • Readers no longer block writers
  • But, there can still only be a single writer at once

This is much better than the default mode, where writing requires an exclusive lock on the database. Switching to WAL almost completely eliminated locking issues in skypilot. However, it is still possible to have lock contention:

  • Multiple simultaneous writers will block each other
  • When only one process has the database open and it closes its connection, it holds an exclusive lock while it checkpoints the write-ahead log back into the main database file
  • If the last process to have the database open crashed, the next process to open the database will hold an exclusive lock while it recovers the database.

In high concurrency situations, the first case is the most problematic one: we may have 1000+ processes trying to write to the database at the same time. Our goal in this situation should almost certainly be to avoid crashing, even if it takes a long time to write the change.
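For reference, WAL is not SQLite's default; it is enabled with a pragma that is persisted in the database file itself, so it only has to be set once. A minimal sketch (illustrative path and timeout):

import sqlite3

conn = sqlite3.connect('state.db', timeout=60.0)
# journal_mode is stored in the database file, so once any connection
# switches it to WAL, later connections see WAL automatically.
conn.execute('PRAGMA journal_mode=WAL')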

simulation and timeout choice

Created a table as

CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY AUTOINCREMENT)

Script:

import random
import sqlite3
import time

from sky.utils import db_utils  # skypilot helper providing safe_cursor

_DB_PATH = 'test.db'  # placeholder; point at the database holding the test table

for _ in range(100):
    start = time.time()
    try:
        with db_utils.safe_cursor(_DB_PATH) as cursor:
            cursor.execute('INSERT INTO test DEFAULT VALUES')
    except sqlite3.OperationalError as e:
        if 'database is locked' not in str(e):
            raise
        print(f'LOCK FAILED AFTER {time.time() - start}', flush=True)
    else:
        print(f'ran for {time.time() - start}', flush=True)
    sleep_time = random.random() / 10
    time.sleep(sleep_time)

Ran this 1000x in parallel with a very high database timeout to capture the following latency data (n=100,000):

[Figures: latency histogram, linear and log scale]

We see an exponential dropoff in the long tail of high-latency requests. From this data we can estimate that the probability of a latency >60s is on the order of 1e-7 to 1e-6.

This simulation is significantly worse than what we expect in the real world - even with 1000 managed jobs transitioning at the same time, each job only writes to the database a small number of times (<5) instead of 100x.
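As a rough sanity check using the numbers above (an estimate, not a measurement): 1000 jobs x ~5 writes each is about 5000 writes in a transition wave, so even at the pessimistic 1e-6 tail probability we would expect roughly 5000 * 1e-6 = 0.005 lock timeouts per wave, i.e. on the order of one failure per 200 such waves.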

So, 60 seconds seems like a safe choice. Since skypilot is not really realtime-sensitive software, we should prefer a very long delay to crashing.

testing

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 marked this pull request as ready for review January 13, 2025 23:50
Collaborator

@Michaelvll Michaelvll left a comment


Great findings @cg505! This is awesome! LGTM

@cg505 cg505 merged commit d9bb51a into skypilot-org:master Jan 15, 2025
18 checks passed