
Add a busy_handler_timeout setter #443

Merged (12 commits) Jan 3, 2024

Conversation

@fractaledmind (Contributor)

One of the largest pain points when using SQLite in a Rails app is the limited concurrency support. While SQLite supports only one writer at a time, from a Rails app's point of view this does not mean the app cannot run in, for example, Puma clustered mode. As I have detailed, the primary issues are that the GVL isn't released while SQLite's busy_timeout C function is running, and that transactions run in DEFERRED mode.

We can solve the first of these by providing a Ruby busy_handler callback that will release the GVL between retries. This allows multiple threads to naturally coordinate and "queue up" to acquire SQLite's single write lock.
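As a concrete illustration, here is a minimal sketch of such a handler built on the gem's public `busy_handler` API. The helper name, timeout value, and sleep interval are illustrative assumptions, not the PR's actual code:

```ruby
# Sketch: a busy handler whose sleeps release the GVL between retries.
# `make_busy_handler` is a hypothetical helper; in real use you would pass
# the resulting lambda to SQLite3::Database#busy_handler.
def make_busy_handler(timeout_seconds, interval: 0.001)
  deadline = nil
  lambda do |count|
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    deadline = now + timeout_seconds if count.zero? # start the clock on first retry
    if now > deadline
      false           # give up: SQLite raises SQLite3::BusyException
    else
      sleep(interval) # sleeping in Ruby yields the GVL to other threads
      true            # tell SQLite to retry acquiring the lock
    end
  end
end

handler = make_busy_handler(0.01)
handler.call(0) # first invocation: sets the deadline, sleeps briefly, retries
```

Because the sleep happens in Ruby rather than inside SQLite's C-level busy_timeout, other threads in the process get scheduled while this connection waits for the write lock.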

I initially proposed providing this in the Rails codebase (see: rails/rails#50370); however, it was suggested that this functionality more naturally belongs in this codebase. So, I would love to start brainstorming the interface for this. Once built and released, we can then update Rails to use this instead of the busy_timeout.

Jean Boussier suggested lock_wait_timeout as a name. I'm opening this PR to get the ball rolling. I'm open to naming suggestions as well as test suggestions.

@fractaledmind (Contributor Author)

@suwyn Do you happen to have a better alternative name for this? I know that you have been thinking about this problem space for a while, so I imagine you might have some language around distinguishing a Ruby busy_timeout from the SQLite/C busy_timeout in a way that is succinct but communicative.

@suwyn commented Dec 27, 2023

Oh boy, naming is hard! Maybe busy_handler_timeout?

@fractaledmind (Contributor Author)

I like that; I'll switch to it.

@suwyn commented Dec 27, 2023

@fractaledmind Also, consider that we'll want to configure this from the database config file, so adding it as an option would be beneficial. Here is an example from #426.

A default of nil would leave the busy handler unset.

@suwyn commented Dec 27, 2023

Also noting for posterity why this is needed. The underlying issue with the interpreter is documented here.

From my experiments, there is a lot of speed to be gained if there were a way to fix the underlying issue.

@fractaledmind (Contributor Author)

This will be set in Rails the same way that busy_timeout is currently set. Rails' database.yml won't change at all; the timeout value will simply be passed to this function instead of the busy_timeout function.

@flavorjones (Member) commented Dec 27, 2023

Ignore any CI failures for "native packaging" with "head"; that's only because the HEAD version rolled over to 3.4.dev. (You should be able to fix this with a rebase now that #447 has been merged.)

Review thread on lib/sqlite3/database.rb (outdated, resolved)
@fractaledmind changed the title from "Add a lock_wait_timeout setter" to "Add a busy_handler_timeout setter" on Dec 27, 2023.
Review thread on lib/sqlite3/database.rb (outdated, resolved)
@fractaledmind (Contributor Author)

@djmb @rosa: I was just reading through the SolidQueue source as I work on integrating it into a project I have at work. I came across this commit that patches Rails' SQLite3 adapter to improve concurrency. Reading the commit description, it sounds like you have had experiences similar to mine in getting SQLite to handle concurrency in a reasonable and resilient way.

I note that in your comment and patch, you lean on the retries option and patch it to have linear backoff (sleeping count milliseconds on each busy_handler invocation). I'd love to hear more about what, in particular, was causing problems with the current Rails implementation of retries that doesn't backoff between retries. I'd also be keen to get your perspective on this implementation of a busy_handler and the general plan for Rails. To recap conversations had across repos, the plan as it stands currently is:

  1. get this PR merged into the SQLite3 driver to provide a busy_handler_timeout method that accepts a timeout in milliseconds, but uses a Ruby busy_handler instead of the SQLite C busy_timeout to ensure that the GVL is released between retries
  2. deprecate the retries option in Rails for the SQLite adapter
  3. have the timeout option in Rails for the SQLite adapter call this busy_handler_timeout method from the SQLite3 driver instead of the busy_timeout method

After working with the retries option for some time, and getting feedback from the community, I believe the mental model of timeout is clearer: people better understand what kind of value to set for their needs. However, we need to deal with the GVL lock issue of the busy_timeout method in this driver. Does the current plan make sense given your and your team's experience with SQLite usage in real-world apps? Does the current implementation of busy_handler_timeout here make sense given that experience?

I believe your perspective could help ensure that both this PR and the larger plan for Rails' SQLite support are as solid as possible.

@fractaledmind (Contributor Author)

@flavorjones @byroot: Added two simple tests focusing on the GVL blocking difference between busy_timeout and busy_handler_timeout. AppVeyor is green now. Is there anything still blocking here?

```ruby
# while SQLite sleeps and retries.
def busy_handler_timeout=( milliseconds )
  timeout_seconds = milliseconds.fdiv(1000)
  timeout_deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout_seconds
```
@suwyn commented Jan 2, 2024

@fractaledmind Looking at the code again, I believe there is a bug here.

This sets a clock-based timeout deadline at the time busy_handler_timeout= is called, rather than when the busy handler is first invoked. Essentially it fixes one hard timeout_deadline shared by all subsequent invocations of the busy handler.

Consider the following (untested, but I believe it shows the issue, if I am reading the code correctly):

```ruby
busy_handler_timeout_db = SQLite3::Database.open( "test.db" )
busy_handler_timeout_db.busy_handler_timeout = 1000
sleep 1.5

# timeout_deadline should be expired now, since it was set
# at the point in time when we set busy_handler_timeout to 1000.
# Any invocations of the busy handler will now throw busy errors.
```
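The defect can also be reproduced without a database, since it is purely about when the deadline is captured. The helper below is a hypothetical stand-in mirroring the shape of the code under review:

```ruby
# Buggy shape: the deadline is computed when the handler is *installed*,
# not when SQLite first invokes it.
def make_handler_with_eager_deadline(timeout_seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout_seconds
  lambda do |_count|
    # true => retry; false => give up with SQLite3::BusyException
    Process.clock_gettime(Process::CLOCK_MONOTONIC) <= deadline
  end
end

handler = make_handler_with_eager_deadline(0.05) # a "50ms" budget
sleep 0.1                                        # connection sits idle past the deadline
handler.call(0) # => false: the very first retry already "times out"
```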

@fractaledmind (Contributor Author)

I think you're right. Great catch. Will fix when I'm back at my laptop

@fractaledmind (Contributor Author)

@suwyn Check out 48daae1.

@suwyn

Looks better @fractaledmind

It still has me wondering whether it's thread-safe, though, since it uses an instance variable. If two threads used the same connection, you'd have an issue. I did some experimenting in Rails, and the connection pool appears to be thread-safe (as the docs state).

Experimenting with the gem on its own, I wasn't quickly able to cause any issue. SQLite doesn't have a built-in way to sleep, so I stopped short of writing a long-running UDF.

I think the ideal solution is to either bake the busy handler into the C code or pass the time of first invocation into the handler from C, as it does with count. I'd strongly prefer the former, but I have no idea what that would take, and it's outside the scope of this PR.

So, short of using count to estimate the elapsed time (e.g. the (count * RETRY_INTERVAL) > timeout_seconds approach) instead of the wall clock, we'll just have to accept that a race condition probably exists.
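The count-based estimate can be sketched as follows; because count arrives from C on each invocation, the handler needs no clock state shared across threads. RETRY_INTERVAL and the helper name are illustrative assumptions:

```ruby
RETRY_INTERVAL = 0.001 # seconds slept per retry (illustrative)

# Estimate elapsed wait time from the invocation count instead of the
# wall clock, avoiding any instance variable shared between threads.
def count_based_handler(timeout_seconds)
  lambda do |count|
    if (count * RETRY_INTERVAL) > timeout_seconds
      false                 # estimated budget exhausted: give up
    else
      sleep(RETRY_INTERVAL) # release the GVL before the next retry
      true
    end
  end
end
```

The trade-off is accuracy: sleep can overshoot its argument, so count * RETRY_INTERVAL underestimates real elapsed time and the effective timeout can stretch well past the nominal one.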

@fractaledmind (Contributor Author)

@suwyn: You should join this Discord server where we talk a lot about the SQLite ecosystem in Ruby: https://discord.gg/ehdDh5C4

That would allow us to connect more easily and even pair on this. I'd love some help writing useful tests. I already have a long-running query written:

```sql
WITH RECURSIVE r(i) AS (
  VALUES(0)
  UNION ALL
  SELECT i FROM r
  LIMIT 10000000
)
SELECT i FROM r ORDER BY i LIMIT 1;
```

What I don't have are resilient and expansive tests. Want to work on getting this over the line together?

@djmb commented Jan 3, 2024

> I note that in your comment and patch, you lean on the retries option and patch it to have linear backoff (sleeping count milliseconds on each busy_handler invocation). I'd love to hear more about what, in particular, was causing problems with the current Rails implementation of retries that doesn't back off between retries.

Hi @fractaledmind! The issue with using retries without a backoff was that we would still get SQLite3::BusyExceptions.

I've just tried it out with no sleep and to avoid the BusyExceptions, I need to set the retries to about 1,000,000, at which point it seems to hang. I don't know exactly what the mechanism is there though - whether the thread holding the lock is actually blocked or if it is being starved.

> I'd also be keen to get your perspective on this implementation of a busy_handler and the general plan for Rails. To recap conversations had across repos, the plan as it stands currently is:
>
>   1. get this PR merged into the SQLite3 driver to provide a busy_handler_timeout method that accepts a timeout in milliseconds, but uses a Ruby busy_handler instead of the SQLite C busy_timeout to ensure that the GVL is released between retries
>   2. deprecate the retries option in Rails for the SQLite adapter
>   3. have the timeout option in Rails for the SQLite adapter call this busy_handler_timeout method from the SQLite3 driver instead of the busy_timeout method
>
> After working with the retries option for some time, and getting feedback from the community, I believe that the mental model of timeout is clearer and people better understand what kind of value to set for their needs. However, we need to deal with the GVL lock issue of the busy_timeout method in this driver. Does the current plan make sense given your team's experience with SQLite usage in real-world apps? Does the current implementation of the busy_handler_timeout here make sense given that experience?
>
> I believe your perspective could help ensure that both this PR and the larger plan for Rails' SQLite support are as solid as possible.

We only used retries here because the current timeout wouldn't release the GVL lock, so your plan here sounds perfect. I agree that timeout is much more intuitive.

It seems likely we will still need a sleep in the handler to avoid the issues we've been seeing, but it would be good to know why. I'll see if I can debug what's going on in Solid Queue myself. It's tricky because the I/O of logging anything from the handler seems to have the same effect as a sleep and the BusyExceptions go away.

If you want to dig around yourself: the patch for sleeping is there to allow the tests to pass; you can run them with TARGET_DB=sqlite rails test. Removing the patch should give you the BusyExceptions.

```ruby
elsif now > @timeout_deadline
  next false
else
  sleep(0.001)
```
@tenderlove (Member)

Is this sleep necessary? I think the VM is interruptible on any method call so the calls to clock_gettime etc should do the trick (I think).

@byroot (Contributor)

@tenderlove It's not about being interruptible; it's about forcibly releasing the GVL. The assumption is that the client currently holding the SQLite3 write lock might be another thread in the same process, hence we should try to switch back to it rather than busy-loop for 100ms until the thread scheduler quantum is reached.

And even if it's not in that process, yielding the GVL allows other, unrelated threads to proceed.

NB: for the "same process" case, a shared Ruby mutex would be far more efficient, but we'd need some way to tell that two clients point at the same database and hence should share a single Mutex.

@tenderlove (Member)

> The assumption is that the client that is currently holding the SQLite3 write lock might be another thread in the same process, hence we should try to switch back to it

I see. In that case wouldn't Thread.pass also do the trick?

@byroot (Contributor)

It would, yes, but then the risk is that there's no other ready thread, causing the process to essentially busy-loop, which would pin one CPU at 100% and may not be desirable. A very short sleep actually makes sense, IMO.
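The difference is easy to observe in isolation: Thread.pass returns immediately when no other thread is runnable, so a pass-based retry loop spins a CPU, while sleep always deschedules the thread for roughly the requested interval. A small, illustrative measurement (the counts are arbitrary):

```ruby
require "benchmark"

# With no other runnable threads, Thread.pass is effectively a no-op,
# so 1,000 calls complete almost instantly (a busy loop in disguise).
pass_time = Benchmark.realtime { 1_000.times { Thread.pass } }

# sleep always yields the GVL and deschedules the thread, so ten 1ms
# sleeps take at least ~10ms of wall time.
sleep_time = Benchmark.realtime { 10.times { sleep(0.001) } }

puts format("1_000 x Thread.pass: %.4fs (spins the CPU)", pass_time)
puts format("10 x sleep(0.001):   %.4fs (deschedules)", sleep_time)
```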

@tenderlove (Member)

Sounds good to me!

@tenderlove (Member) left a comment

I'm fine with the patch if @byroot is OK with it. I had one question regarding the sleep (but including the sleep is fine if it's necessary).

@tenderlove tenderlove merged commit 0d487ae into sparklemotion:main Jan 3, 2024
1 check failed
@flavorjones (Member)

@fractaledmind This is causing failures in CI on Windows:

```
  1) Failure:
TC_Integration_Pending#test_busy_timeout_blocks_gvl [D:/a/sqlite3-ruby/sqlite3-ruby/test/test_integration_pending.rb:100]:
SQLite3::BusyException expected but nothing was raised.
```

see e.g. https://github.com/sparklemotion/sqlite3-ruby/actions/runs/7403622873/job/20143757496

Can you please take a look? Or let me know if you'd prefer me to revert.

@fractaledmind (Contributor Author) commented Jan 4, 2024

I have been trying to debug, but can't reproduce. I am now trying to research a more direct way to test "holds GVL" vs "releases GVL" for these two. I opened a PR with a fix that grounds the tests in the same kind of setup that was used in the other tests, which I am confident is more deterministic: #456

On a related note, is there a way to ensure local tests run against the current head version of SQLite? It seems my tests run against my default macOS version of SQLite.

@flavorjones (Member)

> On a related note, is there a way to ensure local tests are running with current head version of SQLite? It seems my tests run against my default macOS version of SQLite.

By default this should not happen. Try:

```sh
bundle exec rake clean clobber
bundle exec rake compile
bundle exec rake test
```

If you're still using the system libraries, open a new issue and attach:

  • full output from these commands
  • the Makefile and mkmf.log from the compilation phase (which should be under ./tmp/x86_64-darwin/sqlite3_native/3.2.2 or similarly-named directory)

flavorjones added a commit that referenced this pull request on Jan 4, 2024, noting:

> This is the outcome of running: `git revert f80c5ff..0c93d30`
tenderlove added a commit (…ndler-timeout-for-now) that referenced this pull request on Jan 4, 2024: "Revert pull request #443 from fractaledmind/lock-wait-timeout"
@fractaledmind fractaledmind deleted the lock-wait-timeout branch January 7, 2024 11:55