Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with releasing uniquejobs locks after timeout expires #169

Closed
davehartnoll opened this issue Mar 2, 2016 · 8 comments
Closed

Problem with releasing uniquejobs locks after timeout expires #169

davehartnoll opened this issue Mar 2, 2016 · 8 comments

Comments

@davehartnoll
Copy link

The redis/aquire_lock.lua script sets two keys in redis for each unique job; the first has an expiration time:

if redis.pcall('set', unique_key, job_id, 'nx', 'ex', expires) then
    redis.pcall('hsetnx', 'uniquejobs', job_id, unique_key)
    return 1
else
    return 0
end

The redis/release_lock.lua script contains this to delete the same two keys:

if redis.pcall('del', unique_key) then
    redis.pcall('hdel', 'uniquejobs', job_id)
    return 1
 end

However, if the job has already expired by the time this release_lock script is called then the first redis.pcall will return false and the 2nd one never gets executed. This causes the uniquejobs hash to keep some entries forever, and it just gets bigger and bigger, and in my case it was consuming all available memory and causing Sidekiq to reject all additional jobs even though the underlying queues were apparently empty.

One simple fix may be to always remove the key from the uniquejobs hash before testing whether the timed-out key can also be deleted, but I'll leave it to someone who understands the locking mechanism better to decide if it's good:

redis.pcall('hdel', 'uniquejobs', job_id)
if redis.pcall('del', unique_key) then
    return 1
 end

My workaround (to avoid having to change the library) is to add a configuration setting the increase the value of the default timeout:

SidekiqUniqueJobs.configure do |config|
  config.default_queue_lock_expiration = 24 * 60 * 60
end

P.S.: the correct spelling of 'aquire' is 'acquire'

@mhenrixon
Copy link
Owner

Thanks for reporting! I never thought that part of the code would be an issue. I'll make sure both calls are always made.

@mhenrixon
Copy link
Owner

Fixed by c22a5a3

@mathieujobin
Copy link

thank you very much ! I was waiting for this

@ecnalyr
Copy link

ecnalyr commented Mar 4, 2016

+1

@mathieujobin
Copy link

this is unfortunately not solving our problem...

@mhenrixon
Copy link
Owner

Bah I know why, sorry about that I forgot about the check for the key to existence.

@davehartnoll
Copy link
Author

The fix doesn't actually solve my problem either. Should this issue be reopened?

@stephankaag
Copy link

Any idea how to fix this @mhenrixon ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants