QoS extensions #105
Conversation
Added additional quality of service extensions, including:
- watchdog + heartbeat to auto-restart stuck/dead jobs
- dependencies to allow jobs to wait for other jobs to complete before starting
- serialization to ensure that only one of a related group of jobs can execute at a time
Sounds good! There's a lot here so I'll have to go through and review, but from the description it's a +1
```
  car: 'S5',
  charge: '$59,000'
}).after(newcustomer).after(config).save();
```
I like this, I was doing similar but just creating the jobs within the job processor haha
Yep, creating them within the job works if your precursors execute serially.
But, what I wanted to do was make it easy to execute precursors in parallel, and then only trigger the successor when all the precursors finished.
By the way, the example "QoS" runs all of these types of combinations if you want to see it in action.
Cheers.
- Dan
On Mar 16, 2012, at 12:24 PM, TJ Holowaychuk wrote:
```
  name: 'Jim Steele',
  email: '[email protected]'
}).save();
var config = jobs.create('customconfig', {
  title: 'S5',
  color: 'sprint blue',
  transmission: 'dsg'
}).save();
var charge = jobs.create('change', {
  email: '[email protected]',
  car: 'S5',
  charge: '$59,000'
}).after(newcustomer).after(config).save();
```
I like this, I was doing similar but just creating the jobs within the job processor haha
Reply to this email directly or view it on GitHub:
https://github.com/LearnBoost/kue/pull/105/files#r568644
- Make sure staged jobs that are deleted are removed from the staging area.
- Repair issues detected with staging while attempting to assign the lock.
- Fixed cleanup of precursor job data when jobs have already completed before the precursor is created.
- Added error reporting when calls to redis fail.
- Add precursor information to the UI.
Corrected an issue where the new QoS logic prevented job states from being changed from the UI.
+1 on job dependencies
- When heartbeat is triggered, make sure the value is set before updating redis.
- When a job is deleted, also remove the logs related to the job.
- When a job completes, update the updated_at timestamp.
Previously the logic called the callback before all state related to the job was saved (the rest was saved in the background). This works in many cases, but when run from a command line process which creates a job and then quits, jobs were not getting created. This change makes sure that the job is fully saved before the callback is called.
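For illustration, a minimal sketch of the command-line scenario described above, assuming this branch's behavior that `save()`'s callback only fires once the job is fully persisted (the job type and data are made up):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// One-shot script: create a single job, then exit.
jobs.create('nightly-report', { day: '2012-03-16' }).save(function (err) {
  if (err) console.error('failed to enqueue job:', err);
  // Only quit once the callback confirms the job is fully written to redis;
  // exiting any earlier is what used to lose jobs.
  process.exit(err ? 1 : 0);
});
```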
When you have many jobs (for example in the completed state), descending sort was retrieving the oldest jobs and sorting them in descending order, so if you had jobs 10-99, choosing descending sort would show 20, 19, 18, 17, … instead of 99, 98, 97, 96, … This fix corrects descending sort so that it works from the newest jobs backwards. (Caveat: sorting is still done by lexicographical comparison of job IDs, not by numeric comparison - so 2 is higher than 10 - this fix does not change that behavior.)
This fix lets the UI be more tolerant of null fields in jobs. While these should not typically occur, I have seen cases where a job ends up with null fields due to redis bug/corruption issues. Once the job is corrupted, it won't show up in the UI (so you can't delete it, fix it, etc. without doing so manually). This change lets the UI be a bit more lenient so corrupt jobs can still be seen and deleted.
Any thoughts on merging these?
+1 on merging these features, particularly auto-restarting stuck/dead jobs
+1 for serialized execution
👍 for restarting stuck jobs
+1
+1
Under heavy use, especially with failures, process restarts, and network issues, kue would often get corrupted. The root cause was that kue does not transactionally change state - so if a change went part way through and an error occurred, the data structures would be left in an inconsistent state. This change makes the critical-path internals (state change and job dequeuing) transactional (or at least as close as can be achieved using redis). Some notes on this:
- This requires redis 2.6 (it uses lua scripting).
- When dequeuing jobs there are still some conditions where it's impossible to achieve atomic dequeue-and-execute semantics. So, there is now an error handler which is called to notify the queue processor when this condition occurs.
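These are not the scripts from this branch, but a rough sketch of the general idea, assuming node_redis's `eval()` and redis >= 2.6; the key names follow kue's usual `q:job:<id>` / `q:jobs:<state>` scheme but are otherwise illustrative:

```js
var redis = require('redis');
var client = redis.createClient();

// The whole script runs atomically inside redis, so a crash or dropped
// connection can never leave a job half-moved between state sets.
var script = [
  'redis.call("zrem", "q:jobs:" .. ARGV[2], ARGV[1])',          // leave old state set
  'redis.call("zadd", "q:jobs:" .. ARGV[3], ARGV[4], ARGV[1])', // join new state set
  'redis.call("hset", "q:job:" .. ARGV[1], "state", ARGV[3])',  // record state on the job hash
  'return 1'
].join('\n');

function changeState(jobId, from, to, priority, done) {
  // EVAL <script> <numkeys> <args...> -- no KEYS are used here, only ARGV
  client.eval(script, 0, jobId, from, to, priority, done);
}

changeState(42, 'inactive', 'active', 0, function (err) {
  if (err) console.error('state change failed:', err);
});
```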
@dfoody: These changes are really interesting, but it also looks like you have not merged in the new commits made to the learnboost repo over the past year. This seems more worthy of being called a new project than of being merged now into the current version of learnboost/kue - unless you want to go through the pain of adding the latest changes.
@dfoody does the serialize method work after workers restart?
@CodeFather If there's still a job in active or failed that holds the serialization lock, then all other jobs that need the same serialize lock go to staged. The active job that holds the lock must either move to finished or be deleted to release the lock.

By default, there's nothing that moves a job from active to finished other than successfully completing the job. So, if a worker crashes during job processing, the job will just stay in active until you restart it manually. This sounds like what's happening to you.

But, the QoS branch also has a "heartbeat" mechanism that can be used to auto-restart jobs that haven't triggered the heartbeat in a certain amount of time. The only thing to be careful of is that you need to be sure to call the heartbeat often enough. If a job is still running "properly" and doesn't heartbeat in time, kue will think it has failed and start another copy of the job running.

There are three separate things you need to do to use the heartbeat:
@dfoody many thanks for the comprehensive answer :) Works as you described.
+1 please merge @visionmedia
+1
Please merge @visionmedia 👍
// Random jitter before starting job processing: up to 30s when `delay` is
// passed (protects against an error storm when redis goes down), up to 2s
// otherwise (avoids many workers all starting at once).
if (delay)
  amount = Math.round(Math.random() * 30000);
else
  amount = Math.round(Math.random() * 2000);
Why delay `job.process` when no `delay` is passed in?
Before the transactional changes, this was originally there to avoid having many workers all start processing at the same time (which happens in two scenarios - when a new worker starts or when you've queued a set of short running jobs) - it helped avoid stuck jobs.
But, with the more recent move to transactional queuing it's likely not necessary - though we've never tested w/o it (we have ~4 months in production with no stuck jobs at all - after 10s of millions of jobs - so it's pretty battle hardened as it is).
Note that the setting with 'delay' (the 30s random interval) is really important still. It protects against an error storm when redis (or the connection to it) goes down.
@dfoody I've been trying to manually merge this into our fork at https://github.com/LearnBoost/kue but I'm having lots of trouble understanding the … At this moment I'm giving up on this code 😦
@dfoody I noticed that your branch is based on an old kue, and also limits the node version up to 0.7!
@behrad - sorry, no immediate plans to merge with newer kue or upgrade the node version (we're still using older versions in production). But I'd be happy to walk anyone who wants to port it through how it works (why various changes were made the way they were made, how the control flow works, etc.).
Queue.stop() initiates stopping all workers. Active workers finish up their current jobs; workers that are idle stop as soon as possible (worst case it may take up to a minute for an idle worker to stop, but in most cases they stop almost immediately). Once all workers are stopped, the "stopped" event is emitted.
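A sketch of how this might be wired into an app, assuming the Queue.stop() method and "stopped" event described above (the signal handling around it is only illustrative):

```js
var kue = require('kue');
var jobs = kue.createQueue();

process.on('SIGTERM', function () {
  // Exit only after every worker has wound down.
  jobs.on('stopped', function () {
    console.log('all workers stopped, exiting');
    process.exit(0);
  });
  // Ask all workers to stop; active ones finish their current job first.
  jobs.stop();
});
```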
I merged the server-side redis lua scripts from @dfoody's QOS branch into kue 0.7.4 as a new QOS branch. Can anybody test it and provide feedback?
Works as expected here, no problems.
Good news. I personally can't convince myself to move to server-side lua scripts yet. Developing concurrent branches also seems no good, but I did that merge to be used as a comparison.
Our experience with Kue was that it worked fine in low volume and for testing. But, as we got up to >10k jobs a day (many running concurrently) we started to see on the order of 1-10 corrupted jobs every day. Now, scale that up to >200k jobs a day (what we now run) and it becomes more than a full-time job just to fix corruptions. But, with the QoS branch changes we've had no job corruptions in months.

So, I suspect in light testing you'd find no difference. But, if you really want to see the difference, write a test that queues 100k jobs, with 1000 concurrent workers, where each job takes <1s to execute. Then hit "ctrl-c" mid execution (for example, while it's still queuing jobs). If you then restart and see if you can get the jobs all running again, you'll see the difference between the QoS and non-QoS versions.
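A rough sketch of that kind of stress test (job type, payload, and timings are arbitrary; the concurrency argument to `process()` is assumed to be available):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// Queue 100k short jobs...
for (var i = 0; i < 100000; i++) {
  jobs.create('stress', { n: i }).save();
}

// ...and process them with 1000 concurrent workers, each taking <1s.
jobs.process('stress', 1000, function (job, done) {
  setTimeout(done, Math.round(Math.random() * 900));
});

// While this runs, hit ctrl-c (e.g. while jobs are still being queued),
// restart, and check whether every job still drains cleanly or some are
// left stuck/corrupted.
```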
That was right for Kue 3.4, which you branched and created the QOS version from. And about hitting "ctrl+c": we are not talking about a specific feature like graceful termination. We added graceful shutdown in 0.7 and it has been tested under heavy load with long running jobs. However, it should be tested more and more.
Hi @behrad, I've looked at the 0.7 code and, unfortunately, it still suffers from all the same issues as previous versions. Because it's not atomic, any unexpected failure will corrupt kue. When the failure is a process crash, a bug due to unhandled exceptions, a network failure, a power failure, etc., no graceful termination code will help you.

I'm guessing you're not experiencing any of these types of failures, but we do (we run large clusters of machines on AWS and routinely experience all of these types of things). And we've seen it fail at virtually every line of code in a state change (so we've seen every possible variation of parts of a partial state change having taken effect and corrupting kue).

I'd really like kue to be as stable and reliable as a database. But, without atomic operations (whether via multi or lua) it can never be.
Internal operations can be mostly re-written using multi, I don't expect …
The primary areas where atomicity is important are Job.prototype.state and Worker.prototype.getJob (which eventually calls Job.prototype.state). Unfortunately, when we looked at it, it turned out that multi is at best super-tricky, and at worst won't work for Job.prototype.state (though, granted, we use a much more complicated state change model in the QoS branch - one that accommodates resource locking).

Here's an example of what can happen: you have a low priority job queued before a high priority job is queued, and a worker goes to pull a job from the queue while the high priority one is still being queued. With the current 0.7 implementation (even without any failures) the job can be corrupted: once line 487 of Job.prototype.state happens, the worker will pull the high priority job off the queue first, but its state is not yet set - so if the worker then changes the state before the code queuing the job keeps going, the state will now be inconsistent. We actually saw this specific issue happening in production after we first started using job prioritization.

But, even with multi+watch, because you have to retry a state change if it fails (since one of the sorted set keys being modified may have been related to a different job changing state, not the current job), you can end up with a job being stuck between states or changing to the wrong state. So, it turned out to be significantly simpler to build the entire state change as an atomic operation.
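For comparison, a sketch of what the multi+watch variant looks like with node_redis (key names illustrative). The retry branch is exactly the tricky part: watching the shared sorted sets would abort whenever any other job changes state, and by the time the retry runs, a competing worker may already have moved this job, so a blind retry can push it into the wrong state:

```js
var redis = require('redis');
var client = redis.createClient();

function changeState(jobId, from, to, priority, done) {
  client.watch('q:job:' + jobId, function (err) {
    if (err) return done(err);
    client.multi()
      .zrem('q:jobs:' + from, jobId)
      .zadd('q:jobs:' + to, priority, jobId)
      .hset('q:job:' + jobId, 'state', to)
      .exec(function (err, replies) {
        if (err) return done(err);
        // exec() yields null when a watched key changed underneath us.
        if (replies === null) {
          // Dangerous: the job may already be in a different state by now.
          return changeState(jobId, from, to, priority, done);
        }
        done();
      });
  });
}
```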
I totally see what you are talking about @dfoody; this merge I did was also a response to these issues. But those failures seem to happen in very rare conditions, and this gives us even more time to evaluate a complete change to depend on Redis >= 2.6.
For us, the sort of action that seems to trigger stuck jobs (and we see them both for the active and inactive queues) is when servers are restarted. The problem is this is not a rare occurrence; we use AWS & hosting providers, so it happens all the time. Note: our jobs tend to be longer running (around 3-15 seconds), so this might exacerbate things. Thing is, this seems like a fairly typical use case, no? @behrad as well as wading in with opinions, we're also happy to help out with testing for this issue.
👍 👍 for better handling of stuck / dead jobs - I'm seeing at least a few corrupted jobs in between restarts as well
Would you please help us to reproduce it? It would be awesome if you can write related test cases.
By stuck jobs we mean corrupted indexes which cause a job to stay in the inactive/active state despite other jobs being handled.
@oliverlloyd by server do you mean redis? or redis+kue? And what kind of restart? A hard one (redis aborts)? Or a signaled one which lets apps gracefully shut down?
@ElliotChong which version of Kue/redis? And what kind of restart again?
@behrad By server I mean the app/web server, not redis. Typically we run on Heroku, where ps:restart will cycle all servers.
@oliverlloyd nice then! I don't think you are using 0.7's graceful shutdown. If so, can you test with 0.7, listening for the right signal to shut down Kue on app server reboot, and let us know?
@oliverlloyd any feedback?
Yes and no. We now see graceful shutdown upon app restarts, which is good; it has clearly mitigated the issue a lot. However, we do sometimes still get stuck jobs, but I am not able to state with confidence that this is not caused by other factors (I think it is fair that a job could get stuck in certain aggressive scenarios such as Redis running out of memory). On the other hand, the fact that jobs can still get stuck - for whatever reason - remains an argument in favour of a background job checking for stuck jobs.
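As a stopgap until something like the QoS watchdog is merged, such a background check could look roughly like this. It assumes kue's `queue.active()` and `kue.Job.get()` helpers and blindly re-queues anything whose `updated_at` is stale, so treat it as a sketch rather than a safe recipe:

```js
var kue = require('kue');
var jobs = kue.createQueue();

var MAX_ACTIVE_MS = 10 * 60 * 1000; // consider a job stuck after 10 minutes

setInterval(function () {
  jobs.active(function (err, ids) {
    if (err) return console.error(err);
    ids.forEach(function (id) {
      kue.Job.get(id, function (err, job) {
        if (err || !job) return;
        if (Date.now() - job.updated_at > MAX_ACTIVE_MS) {
          // Push the job back to inactive so a worker picks it up again.
          job.inactive();
        }
      });
    });
  });
}, 60 * 1000);
```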
+1 for job dependencies, I need that right now :)
+1 for serialized execution
Hi, I've created a bunch of QoS extensions to Kue that would be useful to incorporate into the master if you're interested. There are three main extensions that I added:
- watchdog + heartbeat to auto-restart stuck/dead jobs
- dependencies to allow jobs to wait for other jobs to complete before starting
- serialization to ensure that only one of a related group of jobs can execute at a time