Fix random SQLite busy database is locked errors #2527
Conversation
I think insights from @wz1000 would be useful with sqlite issues.
We already turn on WAL mode in hiedb: https://github.com/wz1000/HieDb/blob/305a896adbaf78175ebd866c0e665d99c093cab8/src/HieDb/Create.hs#L74
Are we sure this is not responsible for the errors? Currently we serialize all writes to the database (that is what the writer thread is for). Retrying on exceptions is a plausible solution, but I don't quite understand why it is necessary given our usage scenario. Is it possible to reproduce the same errors in a simple test program with the same usage pattern (i.e. WAL mode, one writer, multiple readers)?
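For reference, turning on WAL mode amounts to a single pragma on the connection. Below is a minimal sketch of what that looks like, using direct-sqlite directly (hiedb itself issues the pragma through its sqlite wrapper, as in the linked Create.hs); the file name is just an example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Minimal sketch: open a database and switch it to WAL mode with a PRAGMA.
-- In WAL mode readers no longer block on the single writer, which is why it
-- avoids most SQLITE_BUSY errors seen with the default rollback journal.
import qualified Database.SQLite3 as SQLite3

main :: IO ()
main = do
  db <- SQLite3.open "example.hiedb"          -- example path, not the real one
  SQLite3.exec db "PRAGMA journal_mode=WAL;"
  SQLite3.close db
```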
Out of curiosity, is there a reported issue related to this, or even informal reports in any channel?
This was creating problems in our test suite, reported here: #1430 (comment). I haven't seen these errors reproduce again since this commit: a044156
This reminds me, perhaps we should be using …. Last time I tried, the version of sqlite bundled with direct-sqlite ….
I think I read …. It does seem like #2484 fixes the rare cases, and my PR is not needed because of WAL mode.
I think it could still be valuable, especially if a log message was emitted every time a transaction was retried. You can never really be too sure about this. These show up much more consistently when we have multiple HLS processes accessing the same database, possibly because of multiple editor sessions. Retrying is the best we can do in that case. We also probably want a random exponential backoff instead of a fixed delay; see https://en.wikipedia.org/wiki/Exponential_backoff. I found https://hackage.haskell.org/package/retry after a quick search, though it might be best to implement it ourselves instead of incurring a dependency. Feel free to reopen if you want to continue with this.
Oh, I didn't think of multiple HLS instances hitting a single database; in that case the writes might also need to retry. I'll add logging of retries and random exponential backoff.
Looks good so far. We need to use the backoff in the writer thread too (the indexHieFile function).
Are there any performance implications?
I don't think there should be in the absence of contention, since we only retry if a query fails.
I see... I thought having the thread on the receive side of the … So I've threaded the retry wrapper through so the client can decide what to retry.
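To illustrate the shape of that change: rather than the writer applying retries around everything, each call site receives a wrapper and chooses which database work goes inside it. This is only a sketch under assumed names; the WithHieDb alias matches the idea described in this thread, but the exact definitions in the PR may differ, and lookupSomething is purely illustrative.

```haskell
{-# LANGUAGE RankNTypes #-}
-- Sketch: the client is handed a wrapper that runs an action against the
-- database (retrying on SQLITE_BUSY internally) and decides which pieces of
-- work are wrapped.  HieDb comes from the hiedb package.
import HieDb (HieDb)

type WithHieDb = forall a. (HieDb -> IO a) -> IO a

lookupSomething :: WithHieDb -> IO ()
lookupSomething withHieDb = do
  -- only the database access is subject to retries; any surrounding work
  -- (logging, pure computation, ...) runs exactly once
  _rows <- withHieDb $ \db -> runSomeQuery db
  pure ()
  where
    runSomeQuery :: HieDb -> IO [()]
    runSomeQuery _db = pure []   -- placeholder for a real hiedb query
```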
I'm not sure that we ever want to support two concurrent HLS instances writing to the same cache. In such a scenario the internal invariant that "no one else is mutating the interface store" no longer holds.
That said, even though this PR is not trivial, the approach is thoughtful and I cannot see any problem with it.
Perhaps we should also retry around ….
I realised this shortly after leaving my last comment, but I still think it could be valuable. Many people, especially beginners, tend to use the IDE on standalone Haskell files in their home directory or somewhere, without a ….
OK. Since I also noticed that wrapping ….
Looks good. Thanks!
🎄 Hey, this merged. Thanks! Improving test stability is appreciated. We did have problems with it, and you never know when future structural changes or optimizations would uncover real DB issues again.
* fix sql busy database is locked errors using retries
* fix ghc 9.0+
* hlint fixes
* fix ghc 9.0+ again
* remove accidentally added redundant liftIO, remove accidentally added empty lines
* add logging to retries, add exponential backoff, refactor tests
* add random-1.2.1 to older stack.yamls
* use Random typeclass instead of uniformR, revert stack.yamls
* logInfo instead of logDebug
* dont wrap action with hiedb retry on the writer side, give hiedb retry wrapper to action instead
* bump log priorities to warning, wrap all hiedb calls in runWithDb and writerThread with retries, promote time duration and maxRetryCount constants to top level
* fix ghc 9.0.1
* refactor retryOnSqliteBusy into retryOnException et al, wrap Hiedb.runCommand with retry, fix tests
* push WithHieDb into createExportsMapHieDb to potentially retry less stuff, move WithHieDb to Types.Shake to avoid circular dependency

Co-authored-by: Javier Neira <[email protected]>
If hiedb raises an SQLITE_BUSY exception, retry with random exponential backoff for both reads and writes. The max delay is 1 second, the base delay is 1 millisecond, and the max number of retries is 10.

This potentially solves the case where we have multiple HLS instances using the same SQLite database. Although, maybe if multiple clients could connect to the same server this solution would be moot again.
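As a rough illustration of the retry loop described above: this is only a sketch, not the code in the PR; the function name, parameters, and the busy-detection predicate are all made up here and left to the caller.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, throwIO, try)
import System.Random (randomRIO)

-- Retry an IO action with random exponential backoff; a sketch of the
-- behaviour this PR describes, with hypothetical names.
retryWithBackoff
  :: (SomeException -> Bool) -- ^ does this exception look like SQLITE_BUSY?
  -> Int                     -- ^ base delay in microseconds (e.g. 1000 = 1 ms)
  -> Int                     -- ^ maximum delay in microseconds (e.g. 1000000 = 1 s)
  -> Int                     -- ^ maximum number of retries (e.g. 10)
  -> IO a
  -> IO a
retryWithBackoff shouldRetry baseDelay maxDelay maxRetries act = go 0 baseDelay
  where
    go attempt delay = do
      result <- try act
      case result of
        Right a -> pure a
        Left e
          | shouldRetry e && attempt < maxRetries -> do
              -- sleep for a random duration up to the current cap, then double
              -- the cap (bounded by maxDelay) for the next attempt
              jitter <- randomRIO (baseDelay, delay)
              threadDelay jitter
              go (attempt + 1) (min maxDelay (delay * 2))
          | otherwise -> throwIO e
```

With baseDelay = 1000 (1 ms), maxDelay = 1000000 (1 s) and maxRetries = 10, this corresponds to the parameters described above.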
Edit: The below is now mostly irrelevant because WAL mode is already used for the sqlite db. Oops.
There seem to be a lot of ways of solving this problem, so what I've done might not be what we want to do.
Here's a summary of what I've learned...
Background:
The SQLite we use is bundled in direct-sqlite and uses the default configuration for most things. This means that SQLite uses a rollback journal. It also means its sqlite3_busy_handler is null. With a null busy handler, when a connection tries to acquire a lock and is prevented by another connection already holding certain locks, it returns SQLITE_BUSY immediately, which manifests as an exception.

When using a rollback journal there are 4 locks that can be acquired: "shared", "reserved", "pending", "exclusive".
Depending on which of these locks other connections already hold, an attempt to acquire a lock fails immediately with SQLITE_BUSY; for example, a reader needing the shared lock gets SQLITE_BUSY while a writer holds the pending or exclusive lock.
.Problem:
The above means that a write can cause a read to return SQLITE_BUSY. In fact, in rare cases it seems even a read can cause a read or write to return SQLITE_BUSY, although I think that has to be super rare because the rollback journal has to be "hot" for that to happen.

Currently we serialize writes on one connection and do reads on another. The writes happen on one thread; the reads can happen from multiple threads. Our random exceptions are most likely a read trying to acquire a shared lock while a write is finishing up.
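A synthetic reproduction of that usage pattern (one writer thread, several reader threads, separate connections, default rollback journal) might look roughly like the sketch below, using sqlite-simple. The file name and loop counts are arbitrary, and whether SQLITE_BUSY actually surfaces depends on timing and platform.

```haskell
{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
-- Sketch: one writer and several readers hammering the same database file on
-- separate connections, printing any SQLError (e.g. SQLITE_BUSY) they hit.
import Control.Concurrent (forkIO, threadDelay)
import Control.Exception (handle)
import Control.Monad (forM_, replicateM_, void)
import Database.SQLite.Simple

main :: IO ()
main = do
  setup <- open "busy-test.db"
  execute_ setup "CREATE TABLE IF NOT EXISTS t (x INTEGER)"
  close setup

  -- Writer: serialized inserts on its own connection.
  void $ forkIO $ do
    wconn <- open "busy-test.db"
    forM_ [1 :: Int .. 1000] $ \i ->
      handle (\(e :: SQLError) -> putStrLn ("writer: " ++ show e)) $
        execute wconn "INSERT INTO t VALUES (?)" (Only i)
    close wconn

  -- Readers: several threads querying concurrently on their own connections.
  forM_ [1 :: Int .. 4] $ \_ -> forkIO $ do
    rconn <- open "busy-test.db"
    replicateM_ 1000 $
      handle (\(e :: SQLError) -> putStrLn ("reader: " ++ show e)) $
        void (query_ rconn "SELECT count(*) FROM t" :: IO [Only Int])
    close rconn

  threadDelay 5000000  -- let the threads run for a few seconds
```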
Solutions:
1. Use WAL mode, which allows readers to proceed concurrently with the single writer. SQLITE_BUSY can still occur in rare cases: ….
2. When SQLITE_BUSY is received, retry the read. This is what this PR does. Not really that satisfying, because what the sleep times should be depends on how long the pertinent locks are held. Currently the thread sleeps for at least 1 millisecond before retrying, for a maximum of 10 retries.
3. Use sqlite3_busy_handler. This more or less would do what the retrying above does, but maybe better. I think this requires sending PRs to expose this functionality in direct-sqlite, but maybe I didn't look hard enough. (A related pragma-based alternative is sketched after this list.)
4. Have the readers and writer coordinate themselves, e.g. via a TVar in the hiedbWriter in ShakeExtras, or by using some kind of broadcast signaling synchronization abstraction. There are still some edge cases that probably don't matter, e.g. a read is recovering a hot journal and another read or write comes in?
5. Use a single MVar protecting a single connection, and spawn a thread per write if we don't want to block on writes. I don't think performance will be much different.

I don't know much about SQLite, so maybe I've got something wrong or missed some alternatives. I might try WAL mode on another branch since it seems like a one-liner, but I don't know how to test it in HLS. If I do this, I only know how to create synthetic examples in sqlite-simple or direct-sqlite.
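On the sqlite3_busy_handler option: a middle ground that needs no new bindings is SQLite's built-in busy timeout, which can be set per connection with a pragma. This is a sketch, not something the PR does; the 1-second value is arbitrary.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: set SQLite's built-in busy timeout through a PRAGMA using
-- direct-sqlite.  With this, lock acquisition blocks and retries internally
-- for up to the given number of milliseconds before returning SQLITE_BUSY.
import qualified Database.SQLite3 as SQLite3

main :: IO ()
main = do
  db <- SQLite3.open "busy-test.db"
  SQLite3.exec db "PRAGMA busy_timeout = 1000;"  -- 1000 ms
  -- ... run queries as usual ...
  SQLite3.close db
```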