QoS extensions #105

Closed
wants to merge 9 commits into from

Conversation

dfoody commented Mar 16, 2012

Hi, I've created a bunch of QoS extensions to Kue that would be useful to incorporate into master, if you're interested. There are three main extensions that I added (a brief usage sketch follows the list):

  • watchdog+heartbeat to auto-restart stuck/dead jobs (useful if, for example, a server that's processing jobs reboots without a graceful shutdown).
  • dependencies to allow jobs to wait for other jobs to complete before starting, giving more granular control of execution ordering even with a large pool of available workers (dependencies even work across different job types).
  • serialized execution to ensure that only one of a related group of jobs can execute at the same time (so, for example, you could use this to ensure that no two jobs related to the same user run at the same time across all workers). As with dependencies, this works across different job types.
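
For illustration, a minimal sketch of the dependency feature, based on the .after() API shown in the diff quoted later in this thread (the job types, data, and error handling are illustrative; the serialized-execution and watchdog APIs are not shown here):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// two precursor jobs that may run in parallel, possibly on different workers
var newcustomer = jobs.create('newcustomer', { email: '[email protected]' }).save();
var config      = jobs.create('customconfig', { title: 'S5' }).save();

// the successor is only released for processing once both precursors complete,
// even though the three jobs have different types
var charge = jobs.create('charge', { car: 'S5', charge: '$59,000' })
  .after(newcustomer)
  .after(config)
  .save();
```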

Added additional quality of service extensions including:
- watchdog+heartbeat to auto-restart stuck/dead jobs
- dependencies to allow jobs to wait for other jobs to complete before starting
- serialization to ensure that only one of a related group of jobs can execute at the same time.
@tj
Contributor

tj commented Mar 16, 2012

sounds good! there's a lot here so I'll have to go through and review but from the description it's a +1

```js
    car: 'S5',
    charge: '$59,000'
}).after(newcustomer).after(config).save();
```
Contributor

I like this, I was doing similar but just creating the jobs within the job processor haha

Author

Yep, creating them within the job works if your precursors execute serially.
But what I wanted to do was make it easy to execute precursors in parallel, and then only trigger the successor when all the precursors have finished.

By the way, the example "QoS" runs all of these types of combinations if you want to see it in action.

Cheers.

- Dan

On Mar 16, 2012, at 12:24 PM, TJ Holowaychuk wrote:

```diff
+var config = jobs.create('customconfig', {
+    title: 'S5',
+    color: 'sprint blue',
+    transmission: 'dsg'
+}).save();
+
+var charge = jobs.create('change', {
+    email: '[email protected]',
+    car: 'S5',
+    charge: '$59,000'
+}).after(newcustomer).after(config).save();
+```
```

I like this, I was doing similar but just creating the jobs within the job processor haha


Reply to this email directly or view it on GitHub:
https://github.com/LearnBoost/kue/pull/105/files#r568644

Dan Foody added 2 commits March 17, 2012 20:55
- Make sure staged jobs that are deleted are removed from the staging area.
- Repair issues detected with staging while attempting to assign the lock.
- Fixed cleanup of precursor job data when jobs have already completed before the precursor is created.
- Added error reporting when calls to redis fail.
- Add precursor information to UI
Corrected an issue where the new QoS logic prevented job states from being changed from the UI.
@rgarcia

rgarcia commented May 8, 2012

+1 on job dependencies

Dan Foody added 4 commits May 9, 2012 10:16
- When the heartbeat is triggered, make sure the value is set before updating redis
- When a job is deleted, also remove the logs related to the job
- When a job completes, update the updated_at timestamp.
Previously the logic called the callback before all state related to the job was saved (the rest was saved in the background).  This works in many cases, but when run from a command line process which creates a job and then quits, jobs were not getting created.

This change makes sure that the job is fully saved before the callback is called.
When you have many jobs (for example, in the completed state), descending sort was retrieving the oldest jobs and sorting them in descending order, so if you had jobs 10-99, choosing descending sort would show 20, 19, 18, 17, … instead of 99, 98, 97, 96, …

This fix corrects descending sort so that it works from the newest jobs backwards.

(caveat: sorting is still done by lexicographical comparison of job IDs, not by numeric comparison - so 2 is higher than 10 - this fix does not change this behavior).
This fix lets the UI be more tolerant of null fields in jobs.
While these should not typically occur, I have seen cases where a job ends up with null fields due to redis bug/corruption issues.
Once the job is corrupted, it won't show up in the UI (so you can't delete it, fix it, etc. without doing so manually).
This change lets the UI be a bit more lenient so corrupt jobs can still be seen and deleted.
@bigal488

Any thoughts on merging these?

@spollack

+1 on merging these features, particularly auto-restarting stuck/dead jobs

@dmhaiduchonak

+1 for serialized execution

@pushrax

pushrax commented Aug 20, 2012

👍 for restarting stuck jobs

@Jellyfrog
Contributor

+1

@podviaznikov

+1

Under heavy use - especially with failures, process restarts, and network issues - kue would often get corrupted. The root cause was that kue does not change state transactionally, so if a change got part way through and an error occurred, the data structures would be left in an inconsistent state.
This change makes the critical-path internals (state change and job dequeuing) transactional (or at least as close as can be achieved using redis).
Some notes on this:
- This requires redis 2.6 (it uses Lua scripting)
- When dequeuing jobs there are still some conditions where it's impossible to achieve atomic dequeue-and-execute semantics.  So, there is now an error handler which is called to notify the queue processor when this condition occurs.
@ChrisCinelli
Contributor

@dfoody: These changes are really interesting, but it also looks like you have not kept merging in the new commits made on the LearnBoost repo over the past year. This looks more worthy of being a new project than of being merged now into the current version of learnboost/kue - unless you want to go through the pain of adding the latest changes.

@tadeuszwojcik

@dfoody does the serialize method work after workers restart?
I've tried your fork (https://github.com/dfoody/kue/tree/QoS) with the serialize option, and after a restart all jobs go to the staged state and nothing is processed.
I'm really interested in getting serialize working, so any help would be appreciated :)

@dfoody
Author

dfoody commented Mar 15, 2013

@CodeFather If there's still a job in active or failed that holds the serialization lock, then all other jobs that need the same serialize lock go to staged. The active job that holds the lock must either move to finished or be deleted to release the lock.

By default, there's nothing that moves a job from active to finished other than successfully completing the job. So, if a worker crashes during job processing, the job will just stay in active until you restart it manually. This sounds like what's happening to you.

But, the QoS branch also has a "heartbeat" mechanism that can be used to auto-restart jobs that haven't triggered the heartbeat in a certain amount of time. The only thing to be careful of is that you need to be sure to call the heartbeat often enough. If a job is still running "properly" and doesn't heartbeat in time, kue will think it has failed and start another copy of the job running.

There are three separate things you need to do to use the heartbeat (a minimal sketch follows the list):

  1. one-time enable the watchdog that checks for jobs that haven't triggered their heartbeat - using jobs.watchdog(ms)
  2. on each job set the expected heartbeat interval and number of retries that will be allowed with job.heartbeat(ms).retries(num).
  3. trigger the heartbeat periodically on a job with job.heartbeat() (no arguments).
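
Putting those three pieces together, a minimal sketch of how they might be combined, assuming the jobs.watchdog(ms), job.heartbeat(ms).retries(num), and job.heartbeat() calls described above (the job type, the intervals, the watchdog-interval semantics, and the doWork helper are illustrative, not taken from the branch):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// 1) one-time: start the watchdog that looks for jobs whose heartbeat expired
//    (assumption: the argument is how often the watchdog checks, in ms)
jobs.watchdog(60000);

// 2) per job: declare the expected heartbeat interval and allowed retries
var job = jobs.create('encode', { title: 'video 42' })
  .heartbeat(30000)   // restart the job if no heartbeat arrives within 30s
  .retries(3)         // ...at most 3 times
  .save();

// 3) in the worker: keep triggering the heartbeat while the work is running
jobs.process('encode', function (job, done) {
  var beat = setInterval(function () {
    job.heartbeat();  // no-argument form triggers the heartbeat
  }, 10000);          // comfortably inside the 30s window

  doWork(job.data, function (err) {  // doWork is a hypothetical long-running task
    clearInterval(beat);
    done(err);
  });
});
```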

@tadeuszwojcik

@dfoody many thanks for the comprehensive answer :) It works as you described.
Actually, in my case a failed job was holding the serialization lock, and as soon as I removed that job it worked!
Great job with QoS. I hope it will be merged into kue (as well as job shutdowns), or that someone other than @visionmedia will take over maintaining it.
Thanks again.

@auser

auser commented Apr 8, 2013

+1 please merge @visionmedia

@ElliotChong

+1

@jcspencer
Contributor

Please merge @visionmedia 👍

```js
if (delay)
    amount = Math.round(Math.random() * 30000);
else
    amount = Math.round(Math.random() * 2000);
```
Contributor

Why delay job.process when no delay is passed in?

Author

Before the transactional changes, this was originally there to avoid having many workers all start processing at the same time (which happens in two scenarios: when a new worker starts, or when you've queued a set of short-running jobs) - it helped avoid stuck jobs.
But with the more recent move to transactional queuing it's likely not necessary - though we've never tested without it (we have ~4 months in production with no stuck jobs at all, after tens of millions of jobs, so it's pretty battle-hardened as it is).

Note that the 'delay' setting (the 30s random interval) is still really important. It protects against an error storm when redis (or the connection to it) goes down.

@bulkan
Contributor

bulkan commented Jun 30, 2013

@dfoody I've been trying to manually merge this into our fork of https://github.com/LearnBoost/kue but I'm having lots of trouble understanding the state the jobs are left in. Mixing in Lua code to change the state of the job gets very confusing.

At this moment I'm giving up on this code 😦

@behrad
Collaborator

behrad commented Aug 18, 2013

@dfoody I noticed that your branch is based on an old kue, and also limits the node version to 0.7 at most!
Any plan or idea to merge with the latest kue version and add support for the latest node?!

@dfoody
Author

dfoody commented Aug 24, 2013

@behrad - sorry no immediate plans to merge with newer kue or upgrade node version (we're still using older versions in production).

But, I'd be happy to walk anyone that wants to port it through how it works (why various changes were made the way they were made, how the control flow works, etc.).

Queue.stop() initiates stopping all workers. Active workers finish up their current jobs; idle workers stop as soon as possible (worst case it may take up to a minute for an idle worker to stop, but in most cases they stop almost immediately).

Once all workers are stopped, the "stopped" event is emitted.
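
A minimal sketch of how this might be wired up, assuming the Queue.stop() call and the "stopped" event described in this commit are exposed on the queue object (the signal handling is illustrative, not from the branch):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// on a deploy/restart signal, stop taking new work and let active jobs finish
process.once('SIGTERM', function () {
  jobs.stop();  // initiates stopping all workers, per the commit above

  // assumption: the "stopped" event is emitted on the queue object
  jobs.on('stopped', function () {
    process.exit(0);
  });
});
```
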
@behrad
Collaborator

behrad commented Jan 27, 2014

I merged the server-side redis Lua scripts from @dfoody's QOS branch into kue 0.7.4 as a new QOS branch. Can anybody test it and provide feedback?!

@Nibbler999

Works as expected here, no problems.

@behrad
Collaborator

behrad commented Jan 31, 2014

Good news. I personally can't convince myself to move to server-side Lua scripts yet. Developing concurrent branches also seems no good, but I did that merge to be used as a comparison.
Users should tell us whether Kue 0.7 looks unreliable in different deployments.
How much need is there for the QOS branch?

@dfoody
Author

dfoody commented Jan 31, 2014

Our experience with Kue was that it worked fine at low volume and for testing. But as we got up to >10k jobs a day (many running concurrently), we started to see on the order of 1-10 corrupted jobs every day. Now scale that up to >200k jobs a day (what we now run) and it becomes more than a full-time job just to fix corruptions. But with the QoS branch changes we've had no job corruptions in months.

So, I suspect in light testing you'd find no difference. But, if you really want to see the difference write a test that queues 100k jobs, with 1000 concurrent workers, where each job takes <1s to execute. Then hit "ctrl-c" mid execution (for example, while it's still queuing jobs). If you then restart and see if you can get the jobs all running again, you'll see the difference between the QoS and non-QoS versions.
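
A rough sketch of that scenario, assuming kue's jobs.create() and a jobs.process(type, concurrency, fn) form (the job type and timings are illustrative):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// queue 100k short jobs
for (var i = 0; i < 100000; i++) {
  jobs.create('stress', { i: i }).save();
}

// many concurrent workers, each job taking well under a second
jobs.process('stress', 1000, function (job, done) {
  setTimeout(done, Math.random() * 500);
});

// hit ctrl-c part way through (e.g. while jobs are still being queued),
// restart the process, and check whether every job still reaches 'complete'
// without any being left stuck in 'inactive' or 'active'
```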

@behrad
Collaborator

behrad commented Jan 31, 2014

That was right with Kue 3.4, which you branched and based the QOS version upon.
We are running Kue with 1 million jobs/day coming in and being processed. We have tested it with 1000, 500, and 100 concurrent workers distributed as 8 cluster workers. No problems in 0.7! Jobs are mostly doing RESTful IO with <1s durations.
I want to make sure this holds for other runtime scenarios under different stress conditions (longer jobs, thunder jobs, ...).

And about hitting "ctrl+c": we are not talking about a specific feature like graceful termination. We added graceful shutdown into 0.7 and it has been tested under heavy load with long-running jobs. However, it should be tested more and more.
@dfoody, is it feasible for you to replace your QOS version with 0.7 in a simple test bed that mirrors your production environment and let us know?

@dfoody
Author

dfoody commented Jan 31, 2014

Hi @behrad, I've looked at the 0.7 code and, unfortunately, it still suffers from all the same issues as previous versions. Because it's not atomic, any unexpected failure will corrupt kue - and when the failure is a process crash, a bug due to an unhandled exception, a network failure, a power failure, etc., no graceful-termination code will help you.

I'm guessing you're not experiencing any of these types of failures, but we do (we run large clusters of machines on AWS and routinely experience all of these types of things). And we've seen it fail at virtually every line of code in a state change (so we've seen every possible variation of a partial state change having taken effect and corrupting kue).

I'd really like kue to be as stable and reliable as a database. But, without atomic operations (whether via multi or lua) it can never be.

@behrad
Collaborator

behrad commented Jan 31, 2014

Internal operations can mostly be re-written using multi; I don't expect kue indexes to be 100% healthy on network/power failures. Wrong expectation?
You can see that there's only one difference between QOS and normal Kue, which is in setState (and that's a big part), in job state change handling. Other redis operations are the same (with or without multi).

@dfoody
Author

dfoody commented Jan 31, 2014

The primary areas where atomicity is important are Job.prototype.state and Worker.prototype.getJob (which eventually calls Job.prototype.state).

Unfortunately, when we looked at it, it turned out that multi is at best super-tricky, and at worst won't work for Job.prototype.state (though, granted, we use a much more complicated state change model in the QoS branch - one that accommodates resource locking).

Here's an example of what can happen: You have a low priority job queued before a high priority job is queued, and a worker goes to pull a job from the queue while the high priority one is still being queued.

With the current 0.7 implementation (even without any failures) the job can be corrupted: once line 487 of Job.prototype.state happens, the worker will pull the high priority job off the queue first, but its state is not yet set - so if the worker then changes the state before the code queuing the job keeps going, the state will now be inconsistent. We actually saw this specific issue happening in production after we first started using job prioritization.

But, even with multi+watch, because you have to retry a state change if it fails (since one of the sorted set keys being modified may have been related to a different job changing state, not the current job), you can end up with a job being stuck between states or changing to the wrong state.

So, it turned out to be significantly simpler to build the entire state change as an atomic operation.
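
To make the idea concrete, a minimal sketch of a state change done as a single server-side script, so a worker can never observe the job half-moved. This is not the branch's actual script; the q:* key names are simplified, and the node_redis eval pass-through is an assumption about the client being used:

```js
var redis = require('redis');
var client = redis.createClient();

// move a job between state sets and update its hash in one atomic step;
// redis >= 2.6 runs the whole script without interleaving other commands
var changeState = [
  "local id       = ARGV[1]",
  "local oldState = ARGV[2]",
  "local newState = ARGV[3]",
  "local priority = ARGV[4]",
  "redis.call('zrem', 'q:jobs:' .. oldState, id)",
  "redis.call('zadd', 'q:jobs:' .. newState, priority, id)",
  "redis.call('hset', 'q:job:' .. id, 'state', newState)",
  "return newState"
].join('\n');

client.eval(changeState, 0, '42', 'inactive', 'active', '0', function (err, state) {
  if (err) throw err;
  console.log('job 42 is now', state);  // observers only ever see a consistent state
});
```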

@behrad
Collaborator

behrad commented Feb 1, 2014

I totally see what you are talking about, @dfoody - this merge I did was also a response to these issues. But those failures seem to happen only in very rare conditions, and this gives us even more time to evaluate a complete change to depend on Redis >= 2.6.

@oliverlloyd

For us, the sort of action that seems to trigger stuck jobs (and we see them for both the active and inactive queues) is servers being restarted. The problem is that this is not a rare occurrence: we use AWS and hosting providers, so it happens all the time. Note: our jobs tend to be longer running (around 3-15 seconds), so this might exacerbate things.

Thing is, this seems like a fairly typical use case, no?

@behrad as well as wading in with opinions, we're also happy to help out with testing for this issue.

@ElliotChong

👍 👍 for better handling of stuck / dead jobs - I'm seeing at least a few corrupted jobs in between restarts as well

@behrad
Collaborator

behrad commented Feb 5, 2014

Would you please help us reproduce it? It would be awesome if you could write related test cases.

> stuck jobs

By stuck jobs we mean corrupted indexes which cause a job to stay in the inactive/active state despite other jobs being handled.

> when servers are restarted

@oliverlloyd by server do you mean redis? or redis+kue? And what kind of restart: a hard one (redis aborts)? or a signaled one which lets apps gracefully shut down?

> I'm seeing at least a few corrupted jobs in between restarts as well

@ElliotChong which version of Kue/redis? And what kind of restart, again?

@oliverlloyd

@behrad By server I mean app/web server, not redis. Typically we run on Heroku where ps:restart will cycle all servers.

@behrad
Collaborator

behrad commented Feb 5, 2014

@oliverlloyd nice, then! I don't think you are using 0.7's graceful shutdown. If so, can you test with 0.7, listening for the right signal to shut down Kue on app server reboot, and let us know?

@oliverlloyd

@behrad Can't trigger shutdown method #291...

@behrad
Collaborator

behrad commented Feb 18, 2014

@oliverlloyd any feedback?

@oliverlloyd

Yes and no. We now see graceful shutdown upon app restarts, which is good - it has clearly mitigated the issue a lot. However, we do sometimes still get stuck jobs, and I am not able to state with confidence that this is not caused by other factors (I think it is fair that a job could get stuck in certain aggressive scenarios, such as Redis running out of memory).

On the other hand, the fact that jobs can still get stuck - for whatever reason - remains an argument in favour of a background job checking for stuck jobs.

@behrad
Collaborator

behrad commented Feb 18, 2014

  1. Would you please confirm whether each stuck job gets processed after manually running LPUSH q:JOB_TYPE:jobs 1?

  2. Can you dig more to see how that case can be reproduced? (I may be able to help solve it if you can describe the pattern.)

  3. If it happens when redis crashes, server-side scripting also won't work! How can you tell redis won't crash in the middle of a transaction or Lua script? So, try to keep redis as up and running as you can. There are some memory-specific configurations to stop redis from being terminated by the OS's out-of-memory killer. We spent a month configuring our redis to stay alive, and IT IS!

@joshbedo

+1 for job dependencies, I need that right now :)

@LobeTia

LobeTia commented Jan 25, 2016

+1 for serialized execution

behrad closed this Oct 14, 2016