QoS extensions #105

Closed
wants to merge 9 commits into from

Conversation

dfoody commented Mar 16, 2012

Hi, I've created a bunch of QoS extensions to Kue that would be useful to incorporate into master, if you're interested. There are three main extensions that I added (a brief usage sketch follows the list):

  • watchdog+heartbeat to auto-restart stuck/dead jobs (useful if, for example, a server that's processing jobs reboots without a graceful shutdown).
  • dependencies to allow jobs to wait for other jobs to complete before starting, giving more granular control of execution ordering even with a large pool of available workers (dependencies even work across different job types).
  • serialized execution to ensure that only one of a related group of jobs can execute at the same time (so, for example, you could use this to ensure that no two jobs related to the same user run at the same time across all workers). As with dependencies, this works across different job types.
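
For illustration, a minimal sketch of the dependency feature, based on the .after() API shown in the diff quoted later in this thread (the job types, data, and error handling are illustrative; the serialized-execution and watchdog APIs are not shown here):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// two precursor jobs that may run in parallel, possibly on different workers
var newcustomer = jobs.create('newcustomer', { email: '[email protected]' }).save();
var config      = jobs.create('customconfig', { title: 'S5' }).save();

// the successor is only released for processing once both precursors complete,
// even though the three jobs have different types
var charge = jobs.create('charge', { car: 'S5', charge: '$59,000' })
  .after(newcustomer)
  .after(config)
  .save();
```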

Added additional quality of service extensions including:
- watchdog+heartbeat to auto-restart stuck/dead jobs
- dependencies to allow jobs to wait for other jobs to complete before starting
- serialization to ensure that only one of a related group of jobs can execute at the same time.
@tj
Contributor

tj commented Mar 16, 2012

sounds good! there's a lot here so I'll have to go through and review but from the description it's a +1

```js
    car: 'S5',
    charge: '$59,000'
}).after(newcustomer).after(config).save();
```
Contributor

I like this, I was doing similar but just creating the jobs within the job processor haha

Author

Yep, creating them within the job works if your precursors execute serially.
But what I wanted to do was make it easy to execute precursors in parallel, and then only trigger the successor when all the precursors have finished.

By the way, the example "QoS" runs all of these types of combinations if you want to see it in action.

Cheers.

- Dan

On Mar 16, 2012, at 12:24 PM, TJ Holowaychuk wrote:

```diff
+var config = jobs.create('customconfig', {
+    title: 'S5',
+    color: 'sprint blue',
+    transmission: 'dsg'
+}).save();
+
+var charge = jobs.create('change', {
+    email: '[email protected]',
+    car: 'S5',
+    charge: '$59,000'
+}).after(newcustomer).after(config).save();
+```
```

I like this, I was doing similar but just creating the jobs within the job processor haha


Reply to this email directly or view it on GitHub:
https://github.com/LearnBoost/kue/pull/105/files#r568644

Dan Foody added 2 commits March 17, 2012 20:55
- Make sure staged jobs that are deleted are removed from the staging area.
- Repair issues detected with staging while attempting to assign the lock.
- Fixed cleanup of precursor job data when jobs have already completed before the precursor is created.
- Added error reporting when calls to redis fail.
- Add precursor information to UI
Corrected an issue where the new QoS logic prevented job states from being changed from the UI.
@rgarcia

rgarcia commented May 8, 2012

+1 on job dependencies

Dan Foody added 4 commits May 9, 2012 10:16
- When the heartbeat is triggered, make sure the value is set before updating redis
- When a job is deleted, also remove the logs related to the job
- When a job completes, update the updated_at timestamp.
Previously the logic called the callback before all state related to the job was saved (the rest was saved in the background).  This works in many cases, but when run from a command line process which creates a job and then quits, jobs were not getting created.

This change makes sure that the job is fully saved before the callback is called.
When you have many jobs (for example, in the completed state), descending sort was retrieving the oldest jobs and sorting them in descending order, so if you had jobs 10-99, choosing descending sort would show 20, 19, 18, 17, … instead of 99, 98, 97, 96, …

This fix corrects descending sort so that it works from the newest jobs backwards.

(caveat: sorting is still done by lexicographical comparison of job IDs, not by numeric comparison - so 2 is higher than 10 - this fix does not change this behavior).
This fix lets the UI be more tolerant of null fields in jobs.
While these should not typically occur, I have seen cases where a job ends up with null fields due to redis bug/corruption issues.
Once the job is corrupted, it won't show up in the UI (so you can't delete it, fix it, etc. without doing so manually).
This change lets the UI be a bit more lenient so corrupt jobs can still be seen and deleted.
@bigal488

Any thoughts on merging these?

@spollack

+1 on merging these features, particularly auto-restarting stuck/dead jobs

@dmhaiduchonak

+1 for serialized execution

@pushrax

pushrax commented Aug 20, 2012

👍 for restarting stuck jobs

@Jellyfrog
Contributor

+1

@podviaznikov

+1

Under heavy use - especially with failures, process restarts, and network issues - kue would often get corrupted. The root cause was that kue does not change state transactionally, so if a change got part way through and an error occurred, the data structures would be left in an inconsistent state.
This change makes the critical-path internals (state change and job dequeuing) transactional (or at least as close as can be achieved using redis).
Some notes on this:
- This requires redis 2.6 (it uses Lua scripting)
- When dequeuing jobs there are still some conditions where it's impossible to achieve atomic dequeue-and-execute semantics.  So, there is now an error handler which is called to notify the queue processor when this condition occurs.
@ChrisCinelli
Contributor

@dfoody: These changes are really interesting, but it also looks like you have not kept merging in the new commits made on the LearnBoost repo over the past year. This looks more worthy of being a new project than of being merged now into the current version of learnboost/kue - unless you want to go through the pain of adding the latest changes.

@tadeuszwojcik

@dfoody does the serialize method work after workers restart?
I've tried your fork (https://github.com/dfoody/kue/tree/QoS) with the serialize option, and after a restart all jobs go to the staged state and nothing is processed.
I'm really interested in getting serialize working, so any help would be appreciated :)

@dfoody
Author

dfoody commented Mar 15, 2013

@CodeFather If there's still a job in active or failed that holds the serialization lock, then all other jobs that need the same serialize lock go to staged. The active job that holds the lock must either move to finished or be deleted to release the lock.

By default, there's nothing that moves a job from active to finished other than successfully completing the job. So, if a worker crashes during job processing, the job will just stay in active until you restart it manually. This sounds like what's happening to you.

But, the QoS branch also has a "heartbeat" mechanism that can be used to auto-restart jobs that haven't triggered the heartbeat in a certain amount of time. The only thing to be careful of is that you need to be sure to call the heartbeat often enough. If a job is still running "properly" and doesn't heartbeat in time, kue will think it has failed and start another copy of the job running.

There are three separate things you need to do to use the heartbeat (a minimal sketch follows the list):

  1. one-time enable the watchdog that checks for jobs that haven't triggered their heartbeat - using jobs.watchdog(ms)
  2. on each job set the expected heartbeat interval and number of retries that will be allowed with job.heartbeat(ms).retries(num).
  3. trigger the heartbeat periodically on a job with job.heartbeat() (no arguments).
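
Putting those three pieces together, a minimal sketch of how they might be combined, assuming the jobs.watchdog(ms), job.heartbeat(ms).retries(num), and job.heartbeat() calls described above (the job type, the intervals, the watchdog-interval semantics, and the doWork helper are illustrative, not taken from the branch):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// 1) one-time: start the watchdog that looks for jobs whose heartbeat expired
//    (assumption: the argument is how often the watchdog checks, in ms)
jobs.watchdog(60000);

// 2) per job: declare the expected heartbeat interval and allowed retries
var job = jobs.create('encode', { title: 'video 42' })
  .heartbeat(30000)   // restart the job if no heartbeat arrives within 30s
  .retries(3)         // ...at most 3 times
  .save();

// 3) in the worker: keep triggering the heartbeat while the work is running
jobs.process('encode', function (job, done) {
  var beat = setInterval(function () {
    job.heartbeat();  // no-argument form triggers the heartbeat
  }, 10000);          // comfortably inside the 30s window

  doWork(job.data, function (err) {  // doWork is a hypothetical long-running task
    clearInterval(beat);
    done(err);
  });
});
```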

@tadeuszwojcik

@dfoody many thanks for the comprehensive answer :) It works as you described.
Actually, in my case a failed job was holding the serialization lock, and as soon as I removed that job it worked!
Great job with QoS. I hope it will be merged into kue (as well as job shutdowns), or that someone other than @visionmedia will take over maintaining it.
Thanks again.

@auser

auser commented Apr 8, 2013

+1 please merge @visionmedia

@ElliotChong

+1

@jcspencer
Contributor

Please merge @visionmedia 👍

```js
if (delay)
    amount = Math.round(Math.random() * 30000);
else
    amount = Math.round(Math.random() * 2000);
```
Contributor

Why delay job.process when no delay is passed in?

Author

Before the transactional changes, this was originally there to avoid having many workers all start processing at the same time (which happens in two scenarios: when a new worker starts, or when you've queued a set of short-running jobs) - it helped avoid stuck jobs.
But with the more recent move to transactional queuing it's likely not necessary - though we've never tested without it (we have ~4 months in production with no stuck jobs at all, after tens of millions of jobs, so it's pretty battle-hardened as it is).

Note that the 'delay' setting (the 30s random interval) is still really important. It protects against an error storm when redis (or the connection to it) goes down.

@bulkan
Contributor

bulkan commented Jun 30, 2013

@dfoody I've been trying to manually merge this into our fork of https://github.com/LearnBoost/kue but I'm having lots of trouble understanding the state the jobs are left in. Mixing in Lua code to change the state of the job gets very confusing.

At this moment I'm giving up on this code 😦

@behrad
Collaborator

behrad commented Aug 18, 2013

@dfoody I noticed that your branch is based on an old kue, and also limits the node version to 0.7 at most!
Any plan or idea to merge with the latest kue version and add support for the latest node?!

@dfoody
Author

dfoody commented Aug 24, 2013

@behrad - sorry no immediate plans to merge with newer kue or upgrade node version (we're still using older versions in production).

But, I'd be happy to walk anyone that wants to port it through how it works (why various changes were made the way they were made, how the control flow works, etc.).

Queue.stop() initiates stopping all workers. Active workers finish up their current jobs; idle workers stop as soon as possible (worst case it may take up to a minute for an idle worker to stop, but in most cases they stop almost immediately).

Once all workers are stopped, the "stopped" event is emitted.
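
A minimal sketch of how this might be wired up, assuming the Queue.stop() call and the "stopped" event described in this commit are exposed on the queue object (the signal handling is illustrative, not from the branch):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// on a deploy/restart signal, stop taking new work and let active jobs finish
process.once('SIGTERM', function () {
  jobs.stop();  // initiates stopping all workers, per the commit above

  // assumption: the "stopped" event is emitted on the queue object
  jobs.on('stopped', function () {
    process.exit(0);
  });
});
```
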
@behrad
Collaborator

behrad commented Jan 27, 2014

I merged the server-side redis Lua scripts from @dfoody's QOS branch into kue 0.7.4 as a new QOS branch. Can anybody test it and provide feedback?!

@Nibbler999

Works as expected here, no problems.

@behrad
Collaborator

behrad commented Jan 31, 2014

Good news. I personally can't convince myself to move to server-side Lua scripts yet. Developing concurrent branches also seems no good, but I did that merge to be used as a comparison.
Users should tell us whether Kue 0.7 looks unreliable in different deployments.
How much need is there for the QOS branch?

@dfoody
Author

dfoody commented Jan 31, 2014

Our experience with Kue was that it worked fine at low volume and for testing. But as we got up to >10k jobs a day (many running concurrently), we started to see on the order of 1-10 corrupted jobs every day. Now scale that up to >200k jobs a day (what we now run) and it becomes more than a full-time job just to fix corruptions. But with the QoS branch changes we've had no job corruptions in months.

So, I suspect in light testing you'd find no difference. But, if you really want to see the difference write a test that queues 100k jobs, with 1000 concurrent workers, where each job takes <1s to execute. Then hit "ctrl-c" mid execution (for example, while it's still queuing jobs). If you then restart and see if you can get the jobs all running again, you'll see the difference between the QoS and non-QoS versions.
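
A rough sketch of that scenario, assuming kue's jobs.create() and a jobs.process(type, concurrency, fn) form (the job type and timings are illustrative):

```js
var kue = require('kue');
var jobs = kue.createQueue();

// queue 100k short jobs
for (var i = 0; i < 100000; i++) {
  jobs.create('stress', { i: i }).save();
}

// many concurrent workers, each job taking well under a second
jobs.process('stress', 1000, function (job, done) {
  setTimeout(done, Math.random() * 500);
});

// hit ctrl-c part way through (e.g. while jobs are still being queued),
// restart the process, and check whether every job still reaches 'complete'
// without any being left stuck in 'inactive' or 'active'
```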

@behrad
Collaborator

behrad commented Jan 31, 2014

That was right with Kue 3.4, which you branched and based the QOS version upon.
We are running Kue with 1 million jobs/day coming in and being processed. We have tested it with 1000, 500, and 100 concurrent workers distributed as 8 cluster workers. No problems in 0.7! Jobs are mostly doing RESTful IO with <1s durations.
I want to make sure this holds for other runtime scenarios under different stress conditions (longer jobs, thunder jobs, ...).

And about hitting "ctrl+c": we are not talking about a specific feature like graceful termination. We added graceful shutdown into 0.7 and it has been tested under heavy load with long-running jobs. However, it should be tested more and more.
@dfoody, is it feasible for you to replace your QOS version with 0.7 in a simple test bed that mirrors your production environment and let us know?

@dfoody
Author

dfoody commented Jan 31, 2014

Hi @behrad, I've looked at the 0.7 code and, unfortunately, it still suffers from all the same issues as previous versions. Because it's not atomic, any unexpected failure will corrupt kue - and when the failure is a process crash, a bug due to an unhandled exception, a network failure, a power failure, etc., no graceful-termination code will help you.

I'm guessing you're not experiencing any of these types of failures, but we do (we run large clusters of machines on AWS and routinely experience all of these types of things). And we've seen it fail at virtually every line of code in a state change (so we've seen every possible variation of a partial state change having taken effect and corrupting kue).

I'd really like kue to be as stable and reliable as a database. But, without atomic operations (whether via multi or lua) it can never be.

@behrad
Collaborator

behrad commented Jan 31, 2014

Internal operations can mostly be re-written using multi; I don't expect kue indexes to be 100% healthy on network/power failures. Wrong expectation?
You can see that there's only one difference between QOS and normal Kue, which is in setState (and that's a big part), in job state change handling. Other redis operations are the same (with or without multi).

@dfoody
Author

dfoody commented Jan 31, 2014

The primary areas where atomicity is important are Job.prototype.state and Worker.prototype.getJob (which eventually calls Job.prototype.state).

Unfortunately, when we looked at it, it turned out that multi is at best super-tricky, and at worst won't work for Job.prototype.state (though, granted, we use a much more complicated state change model in the QoS branch - one that accommodates resource locking).

Here's an example of what can happen: You have a low priority job queued before a high priority job is queued, and a worker goes to pull a job from the queue while the high priority one is still being queued.

With the current 0.7 implementation (even without any failures) the job can be corrupted: once line 487 of Job.prototype.state happens, the worker will pull the high priority job off the queue first, but its state is not yet set - so if the worker then changes the state before the code queuing the job keeps going, the state will now be inconsistent. We actually saw this specific issue happening in production after we first started using job prioritization.

But, even with multi+watch, because you have to retry a state change if it fails (since one of the sorted set keys being modified may have been related to a different job changing state, not the current job), you can end up with a job being stuck between states or changing to the wrong state.

So, it turned out to be significantly simpler to build the entire state change as an atomic operation.
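
To make the idea concrete, a minimal sketch of a state change done as a single server-side script, so a worker can never observe the job half-moved. This is not the branch's actual script; the q:* key names are simplified, and the node_redis eval pass-through is an assumption about the client being used:

```js
var redis = require('redis');
var client = redis.createClient();

// move a job between state sets and update its hash in one atomic step;
// redis >= 2.6 runs the whole script without interleaving other commands
var changeState = [
  "local id       = ARGV[1]",
  "local oldState = ARGV[2]",
  "local newState = ARGV[3]",
  "local priority = ARGV[4]",
  "redis.call('zrem', 'q:jobs:' .. oldState, id)",
  "redis.call('zadd', 'q:jobs:' .. newState, priority, id)",
  "redis.call('hset', 'q:job:' .. id, 'state', newState)",
  "return newState"
].join('\n');

client.eval(changeState, 0, '42', 'inactive', 'active', '0', function (err, state) {
  if (err) throw err;
  console.log('job 42 is now', state);  // observers only ever see a consistent state
});
```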

@behrad
Collaborator

behrad commented Feb 1, 2014

I totally see what you are talking about, @dfoody - this merge I did was also a response to these issues. But those failures seem to happen only in very rare conditions, and this gives us even more time to evaluate a complete change to depend on Redis >= 2.6.

@oliverlloyd

For us, the sort of action that seems to trigger stuck jobs (and we see them for both the active and inactive queues) is servers being restarted. The problem is that this is not a rare occurrence: we use AWS and hosting providers, so it happens all the time. Note: our jobs tend to be longer running (around 3-15 seconds), so this might exacerbate things.

Thing is, this seems like a fairly typical use case, no?

@behrad as well as wading in with opinions, we're also happy to help out with testing for this issue.

@ElliotChong

👍 👍 for better handling of stuck / dead jobs - I'm seeing at least a few corrupted jobs in between restarts as well

@behrad
Collaborator

behrad commented Feb 5, 2014

Would you please help us reproduce it? It would be awesome if you could write related test cases.

> stuck jobs

By stuck jobs we mean corrupted indexes which cause a job to stay in the inactive/active state despite other jobs being handled.

> when servers are restarted

@oliverlloyd by server do you mean redis? or redis+kue? And what kind of restart: a hard one (redis aborts)? or a signaled one which lets apps gracefully shut down?

> I'm seeing at least a few corrupted jobs in between restarts as well

@ElliotChong which version of Kue/redis? And what kind of restart, again?

@oliverlloyd

@behrad By server I mean app/web server, not redis. Typically we run on Heroku where ps:restart will cycle all servers.

@behrad
Collaborator

behrad commented Feb 5, 2014

@oliverlloyd nice, then! I don't think you are using 0.7's graceful shutdown. If so, can you test with 0.7, listening for the right signal to shut down Kue on app server reboot, and let us know?

@oliverlloyd

@behrad Can't trigger shutdown method #291...

@behrad
Collaborator

behrad commented Feb 18, 2014

@oliverlloyd any feedback?

@oliverlloyd

Yes and no. We now see graceful shutdown upon app restarts, which is good - it has clearly mitigated the issue a lot. However, we do sometimes still get stuck jobs, and I am not able to state with confidence that this is not caused by other factors (I think it is fair that a job could get stuck in certain aggressive scenarios, such as Redis running out of memory).

On the other hand, the fact that jobs can still get stuck - for whatever reason - remains an argument in favour of a background job checking for stuck jobs.

@behrad
Collaborator

behrad commented Feb 18, 2014

  1. Would you please confirm whether each stuck job gets processed after manually running LPUSH q:JOB_TYPE:jobs 1?

  2. Can you dig more to see how that case can be reproduced? (I may be able to help solve it if you can describe the pattern.)

  3. If it happens when redis crashes, server-side scripting also won't work! How can you tell redis won't crash in the middle of a transaction or Lua script? So, try to keep redis as up and running as you can. There are some memory-specific configurations to stop redis from being terminated by the OS's out-of-memory killer. We spent a month configuring our redis to stay alive, and IT IS!

@joshbedo

+1 for job dependencies, I need that right now :)

@LobeTia

LobeTia commented Jan 25, 2016

+1 for serialized execution

behrad closed this Oct 14, 2016