investigate replacing resque with good_job #1245

Open
1 of 6 tasks
jrochkind opened this issue Jul 15, 2021 · 7 comments
Labels
maintenance/performance (not new features or bugfixes, but keeping/improving the app running well)

Comments

@jrochkind
Contributor

jrochkind commented Jul 15, 2021

"ActiveJob" is the framework for rails background jobs -- it can have a number of different adapters, which can also use different storage back-ends (for storing the queue).

We currently use resque, which uses a redis backend. resque is somewhat unmaintained and has seemed a bit fragile lately.

Sidekiq is much more popular and better maintained -- but the free version is somewhat crippled. For instance, it can only run multi-threaded, not multi-process, which might be fine for our workload, but it's annoying. The free one also lacks guarantees of not losing jobs, I think (!?!). The "pro" version is somewhat expensive (looks like at present it might be a straight $950/year, regardless of installation size? Not sure. https://sidekiq.org/products/pro.html)

There aren't a whole lot of other options for ActiveJob back-ends. But one new one is good_job. In addition to being entirely open source, I like that it uses postgres as a storage back-end instead of redis. We already have postgres. This would let us get rid of our redis dependency: one less thing to pay heroku for, one less moving part to manage.

(Note our recent problems, which we think came from exceeding heroku redis connection limits -- although heroku postgres has connection limits too, I don't think switching to good_job will increase our postgres connection needs -- our bg job workers already each needed a PG connection.)

good_job, while fairly new, looks pretty solid, and has made a lot of functionality gains over the past year. It may still be missing some features we'd really like (possibly we could contribute some), but it's worth investigating.

good_job issues to investigate/fix

  • good_job 3.0 is expected to come out soon, so we should definitely wait for that to actually go live

  • job handling order:

    • good_job does not actually handle jobs in "first-in first-out" order! Which makes me nervous, and ideally we'd fix it. Also, good_job does not have the feature of handling multiple queues in priority order, which our current queue setup for on_demand_derivatives kind of counts on -- we set up a single resque worker that should always work on jobs in on_demand_derivatives if present, and only go on to work on default when nothing is there (see the resque sketch after this list). You can't set up good_job this way at present. See "In what orders are multiple participating queues in a single process handled?" bensheldon/good_job#624
  • Verify "graceful shutdown" that will be used in heroku restarts etc.

    • We want it such that when heroku sends a SIGTERM, the workers will stop taking new jobs off the queue, but keep processing in-progress jobs -- for a certain (settable?) amount of time less than heroku's 30s timeout. When that time is exceeded, jobs should be hard-cancelled, but in a way that they get automatically retried when workers come back up.
  • Figure out RAM/CPU sizing on heroku dynos

    • we decided that we could run two (formerly 3) resque workers simultaneously on a heroku standard-2x. But good_job uses a different execution model -- multiple threads (which take up less RAM) -- which also means that if we want to take advantage of possible multi-core on a heroku dyno, we'd need to run more than one good_job process. We just have to figure out the right way to configure processes/threads to take best advantage of the dyno (see the Procfile sketch after this list).
  • Effect on DB connections?

    • Heroku postgres plans give us a limited number of connections to postgres. good_job being db/postgres based... will each worker use an extra connection? Or, since our workers were mostly already using a database connection, will good_job reuse that same connection? This could have an effect on our connection limits.
    • In production, we're currently using a heroku postgres standard-0, which has a connection limit of 120. Looking at our current production system at 2:30pm on June 20 (via heroku pg:info), it looks like only 11 connections are currently in use! So we seem to have quite a bit of spare capacity, not a concern -- still, it would be good to know good_job's effect.
  • Make sure our hirefire autoscaler can handle good_job! Oh hey, the README says it does, so that's good! Might want to test to be sure. https://github.com/hirefire/hirefire-resource
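
(Re: the queue-ordering item above.) For reference, a sketch of the priority-ordered resque worker setup we currently count on -- resque works the listed queues in order, so on_demand_derivatives is always drained before default is touched (exact Procfile entry may differ from ours):

    # Procfile -- sketch of our current resque worker entry
    worker: QUEUES=on_demand_derivatives,default bundle exec rake resque:work

good_job's --queues option takes a similar comma-separated list, but per bensheldon/good_job#624 it does not (at present) guarantee this strict priority ordering.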
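
(Re: the RAM/CPU sizing item above.) A hypothetical good_job worker entry for comparison -- flag name as I understand good_job's CLI, and the thread count is just a placeholder we'd have to tune; to use multiple cores on a dyno we'd need to run more than one such process:

    # Procfile -- hypothetical good_job worker entry; thread count needs tuning
    worker: bundle exec good_job start --max-threads=5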

@jrochkind added the maintenance/performance label Jul 15, 2021
@jrochkind
Contributor Author

delayed_job is another possibility (with delayed_job_web as an admin UI). It's much older than good_job, which could mean it's more mature, or could mean it's creakier.

Rob from notch8 likes DelayedJob.

@eddierubeiz mentioned this issue Oct 13, 2021
@jrochkind
Contributor Author

We may want to re-prioritize this; resque is currently really annoying me with lots of deprecation notices in logs. See resque/resque#1796

@jrochkind
Contributor Author

jrochkind commented Jun 22, 2022

Done some research, have a local branch....

Also see:

bensheldon/good_job#624
bensheldon/good_job#626
bensheldon/good_job#627
bensheldon/good_job#630

@jrochkind self-assigned this Jun 27, 2022
@jrochkind
Contributor Author

jrochkind commented Jun 30, 2022

Verify graceful shutdown

  • We have a TestFailJob that just does sleep 120 -- it waits 120 seconds, so it takes a long time to complete (sketched below).
  1. start up a good_job worker with bundle exec good_job start --shutdown-timeout=20 (wait up to 20 seconds for the job to complete, then kill it)
  2. TestFailJob.perform_later from console
    • Notice job shows up in good_job admin as 'running'
  3. kill [pid] from a terminal, targeting the good_job process (kill's default signal is SIGTERM)
  4. good_job process takes 20 seconds to stop; job still shows up in "running" state during this time
    • Another job I tried to enqueue remains in queued state; good_job is not taking it off the queue, great!
  5. After 20 seconds, the good_job process exits -- and the job that was running now shows in the "queue" tab instead of the "running" tab, great!
    • One weird thing: it shows up in the dashboard's "queue" tab, but with state "running"... will ask good_job a question about this one.
  6. run good_job start again -- the job moves from "queued" to "running", hooray!

Mostly it looks like this is behaving as expected. There are some idiosyncrasies. Asked some questions here: bensheldon/good_job#650
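
For reference, the TestFailJob used in this test is nothing more than a long sleep; roughly this (class details are a sketch, the real job may differ):

    # app/jobs/test_fail_job.rb -- sketch of the long-running test job
    class TestFailJob < ApplicationJob
      def perform
        sleep 120  # stay "running" long enough to exercise graceful shutdown
      end
    end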

@jrochkind
Contributor Author

Jeremy Friesen from SoftServ (a consultancy that does lots of samvera work) has a good review of good_job:

Super happy with Good Job for a few reasons:

  1. A Crash of the Worker does not lose the jobs (like Sidekiq does)
  2. Having a database to query is way better than the Redis query of Sidekiq
  3. The configuration options are wonderful (with built in cron consideration)

All told, we’ve noticed it to be more durable than Sidekiq.

@jrochkind
Contributor Author

jrochkind commented Dec 19, 2023

Okay, dhh/37signals/basecamp (who are some of the main Rails maintainers) have actually just released their own database-backed ActiveJob back-end, which has a design that may be closer to what we're looking for, as well as maybe higher confidence of future maintenance since it's got a team behind it: Solid Queue.

But it doesn't YET have an admin dashboard -- they say one is coming "early next year". Once it has one, it could be a real contender -- as much as I'd love to support a third-party thing not from basecamp.

They seem to have learned from good_job how to be a bit simpler, with better defaults and without relying on postgres-specific features (meaning it doesn't interfere with using pgbouncer to share pg connections), while having the features we need -- including not losing jobs even if a worker is hard-crashed (using a "process heartbeat" design to detect crashed workers, which is what resque and sidekiq use; not sure if this feature is available in free sidekiq).

We just need the admin dashboard, and hope it has the features we need -- I'd look at writing an admin dashboard myself if they hadn't said one was hopefully coming soon.

https://dev.37signals.com/introducing-solid-queue/

https://github.com/basecamp/solid_queue

Right now, for Oral History request stuff, we could definitely really use "schedule future job" functionality, which resque doesn't give us without a plugin that adds functionality we don't need. :( ActiveJob lets you say "schedule this job for 1 hour from now" (see the sketch below), but resque doesn't support that. So this can't come soon enough.
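
For concreteness, the standard ActiveJob scheduling call we'd like to be able to use -- the job and argument names here are made up just for illustration:

    # standard ActiveJob API; OralHistoryDeliveryJob is a hypothetical job name
    OralHistoryDeliveryJob.set(wait: 1.hour).perform_later(request_id)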

@jrochkind
Contributor Author

After Rails World, we plan to focus again on Mission Control – Jobs for a bit, with the goal of releasing v1.0 soon.

https://dev.37signals.com/solid-queue-v1-0/

Should probably wait at least until the 1.0 of Mission Control – Jobs to migrate, and maybe a bit beyond that -- basecamp has a history of releasing not-quite-ready-for-prime-time software.

Projects
Status: Infrastructure Backlog