First job never unlocks the lock / Endless waiting #352

Closed
mensfeld opened this issue Nov 21, 2018 · 6 comments

mensfeld commented Nov 21, 2018

Describe the bug

I have a simple worker (presented below) that I run in the following way within the example app context:

  1. I start rails s, check the Sidekiq web UI, and see empty queues (after a flushall just in case)
  2. I open rails c and run the following command: 10.times { BlockedWorker.perform_async }
  3. I start Sidekiq with bundle exec sidekiq

Expected behavior

I expect the jobs to finish, one after another.

Current behavior

As presented below, the first worker job finishes while the rest hang. It seems the lock is never released, and it blocks the other jobs.

[Screenshot from 2018-11-21 16:54:54]

And the Sidekiq log looks as follows:

2018-11-21T15:53:09.636Z 6699 TID-gs84uvizf INFO: Running in ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
2018-11-21T15:53:09.636Z 6699 TID-gs84uvizf INFO: See LICENSE and the LGPL-3.0 for licensing details.
2018-11-21T15:53:09.636Z 6699 TID-gs84uvizf INFO: Upgrade to Sidekiq Pro for more features and support: http://sidekiq.org
2018-11-21T15:53:09.637Z 6699 TID-gs84uvizf INFO: Starting processing, hit Ctrl-C to stop
2018-11-21T15:53:09.653Z 6699 TID-gs85czyzv BlockedWorker JID-a7192cb339651d7d29d6fc1f INFO: start
2018-11-21T15:53:09.658Z 6699 TID-gs85dogz7 BlockedWorker JID-4878e0be35c2f387069a107b INFO: start
2018-11-21T15:53:09.659Z 6699 TID-gs85czzh3 BlockedWorker JID-7d9d0ebd99f872b17133d4ff INFO: start
2018-11-21T15:53:09.661Z 6699 TID-gs85doevr BlockedWorker JID-b4e55637d755d2a6f3a46ad3 INFO: start
2018-11-21T15:53:09.661Z 6699 TID-gs85d0y4z BlockedWorker JID-60106e4081d529b2699de89c INFO: start
2018-11-21T15:53:09.661Z 6699 TID-gs85czzoz BlockedWorker JID-42e1e0ace202f2cf0ddbb02c INFO: start
2018-11-21T15:53:09.662Z 6699 TID-gs85czycb BlockedWorker JID-fcd68ffba277de047d587421 INFO: start
2018-11-21T15:53:09.663Z 6699 TID-gs85dofwz BlockedWorker JID-4da8071cec6e83c93a2cd533 INFO: start
2018-11-21T15:53:09.664Z 6699 TID-gs85doeob BlockedWorker JID-feee33a9e2573de083d9c029 INFO: start
2018-11-21T15:53:09.664Z 6699 TID-gs85czjob BlockedWorker JID-9a55eac09d367d3db03a806e INFO: start
2018-11-21T15:53:19.684Z 6699 TID-gs85dofwz BlockedWorker JID-4da8071cec6e83c93a2cd533 INFO: done: 10.022 sec

Worker class

class BlockedWorker
  include Sidekiq::Worker

  # Runtime lock: duplicates may sit in the queue, but only one job with the
  # same unique arguments executes at a time; the rest wait for the lock.
  sidekiq_options lock: :while_executing,
    unique_across_workers: true,
    lock_timeout: nil

  def perform(*args)
    sleep 10
  end

  # Constant unique arguments, so every job contends for the same lock.
  def self.unique_args(args)
    [1]
  end
end

Additional info

If after a while, let's say 1 minute, I enqueue the next 10 jobs with the same command, one job from the old pool finishes but then it gets stuck again:

2018-11-21T15:53:09.636Z 6699 TID-gs84uvizf INFO: Running in ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
2018-11-21T15:53:09.636Z 6699 TID-gs84uvizf INFO: See LICENSE and the LGPL-3.0 for licensing details.
2018-11-21T15:53:09.636Z 6699 TID-gs84uvizf INFO: Upgrade to Sidekiq Pro for more features and support: http://sidekiq.org
2018-11-21T15:53:09.637Z 6699 TID-gs84uvizf INFO: Starting processing, hit Ctrl-C to stop
2018-11-21T15:53:09.653Z 6699 TID-gs85czyzv BlockedWorker JID-a7192cb339651d7d29d6fc1f INFO: start
2018-11-21T15:53:09.658Z 6699 TID-gs85dogz7 BlockedWorker JID-4878e0be35c2f387069a107b INFO: start
2018-11-21T15:53:09.659Z 6699 TID-gs85czzh3 BlockedWorker JID-7d9d0ebd99f872b17133d4ff INFO: start
2018-11-21T15:53:09.661Z 6699 TID-gs85doevr BlockedWorker JID-b4e55637d755d2a6f3a46ad3 INFO: start
2018-11-21T15:53:09.661Z 6699 TID-gs85d0y4z BlockedWorker JID-60106e4081d529b2699de89c INFO: start
2018-11-21T15:53:09.661Z 6699 TID-gs85czzoz BlockedWorker JID-42e1e0ace202f2cf0ddbb02c INFO: start
2018-11-21T15:53:09.662Z 6699 TID-gs85czycb BlockedWorker JID-fcd68ffba277de047d587421 INFO: start
2018-11-21T15:53:09.663Z 6699 TID-gs85dofwz BlockedWorker JID-4da8071cec6e83c93a2cd533 INFO: start
2018-11-21T15:53:09.664Z 6699 TID-gs85doeob BlockedWorker JID-feee33a9e2573de083d9c029 INFO: start
2018-11-21T15:53:09.664Z 6699 TID-gs85czjob BlockedWorker JID-9a55eac09d367d3db03a806e INFO: start
2018-11-21T15:53:19.684Z 6699 TID-gs85dofwz BlockedWorker JID-4da8071cec6e83c93a2cd533 INFO: done: 10.022 sec
2018-11-21T15:56:25.790Z 6699 TID-gs85dofwz BlockedWorker JID-3e75945180e6e4220845fc37 INFO: start
2018-11-21T15:56:25.790Z 6699 TID-gs85czjvj BlockedWorker JID-81c00a23df92545d3d78de30 INFO: start
2018-11-21T15:56:25.791Z 6699 TID-gs85czypb BlockedWorker JID-0aa5700546fbd36bb26f44e1 INFO: start
2018-11-21T15:56:25.792Z 6699 TID-gs85czhfv BlockedWorker JID-435f349b7b7fd73ce5c1e4ef INFO: start
2018-11-21T15:56:25.792Z 6699 TID-gs85czyjr BlockedWorker JID-f7505d0370b94597980b9497 INFO: start
2018-11-21T15:56:25.793Z 6699 TID-gs85dofsn BlockedWorker JID-e0cad54f68fe068ff9897a51 INFO: start
2018-11-21T15:56:25.793Z 6699 TID-gs85dof5r BlockedWorker JID-d53e68507940c602de4fad85 INFO: start
2018-11-21T15:56:25.793Z 6699 TID-gs85dofcv BlockedWorker JID-53b46ca993053401da89ba6b INFO: start
2018-11-21T15:56:25.792Z 6699 TID-gs85czhbj BlockedWorker JID-e65f1be6fdd576cf30a77f06 INFO: start
2018-11-21T15:56:25.792Z 6699 TID-gs85d0vzz BlockedWorker JID-8a79a84f53f4e847b78b5fdd INFO: start
2018-11-21T15:56:35.807Z 6699 TID-gs85czzh3 BlockedWorker JID-7d9d0ebd99f872b17133d4ff INFO: done: 206.148 sec
@mensfeld (Author)

The specs fail, but when the LPUSH in unlock.lua is moved so that it is the last statement:

return redis.call('LPUSH', available_key, token)

things seem to work fine.
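
To make the ordering concrete, here is a rough Lua sketch of the pattern being discussed. It is not the actual unlock.lua from the gem; only available_key and the token come from the line above, everything else is assumed. The relevant Redis behaviour is that a Lua script executes atomically, so clients blocked on BRPOP are only served after the whole script has finished, based on the final state of the key:

-- hypothetical sketch, not the gem's unlock.lua
local available_key = KEYS[1] -- list that blocked workers BRPOP on
local token         = ARGV[1] -- lock token being released

-- Problematic ordering: if cleanup deletes available_key right after the
-- LPUSH, the token is already gone when the script ends, so a worker
-- blocked on BRPOP is never woken up.
--   redis.call('LPUSH', available_key, token)
--   redis.call('DEL', available_key)

-- Ordering that appears to work: do the cleanup first and make the LPUSH
-- the very last statement, so the token is still present when blocked
-- clients are served.
redis.call('DEL', available_key)
return redis.call('LPUSH', available_key, token)

From redis-cli the same sequence behaves fine, because there the LPUSH and the DEL are separate commands and the blocked BRPOP can be served in between them.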

@mhenrixon (Owner)

Of course! Nice find @mensfeld and many thanks for the detailed investigation. I will cut a new version right away. Just have a couple of tests to fix first!

@mensfeld (Author)

Anytime. It seems the LPUSH, with the DEL just after it, conflicts with the unlocking on the BRPOP side.

@mhenrixon (Owner)

Indeed! Can't believe I didn't see this until you pointed it out! No wonder people have been complaining.

@mensfeld (Author)

This is interesting anyhow, as it seems to be a problem with the Lua library (or the Ruby client), because when I do the same directly from redis-cli, things work. That is, I do a blocking pop and then an LPUSH with a DEL right after, within a single execution flow, and the BRPOP gets unlocked.

@mhenrixon (Owner)

Yeah, they still haven't managed to make this super smooth. If we use the Ruby layer, on the other hand, performance suffers greatly. It is a fine balance between performance and reliability. I have had other similar situations as well.
