Please try v7.1.5 #617
After the gem upgrade I noticed this issue (in rails console). I've checked the source code of:

```ruby
def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
  synchronize do |client|
    command = [:brpoplpush, source, destination, timeout]
    require 'pry'; binding.pry # I added it for debugging
    timeout += client.timeout if timeout > 0 # this line is failing on `timeout > 0`
    client.call_with_timeout(command, timeout)
  end
end
```

I did
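For what it's worth, one way that comparison can blow up is the timeout reaching redis-rb as `nil` instead of an Integer. This is an assumption about the failure mode, not something confirmed from the gem's code:

```ruby
# Hypothetical reproduction: if a caller passes timeout as nil,
# the `timeout > 0` guard raises instead of comparing.
timeout = nil
begin
  timeout += 1 if timeout > 0
rescue NoMethodError => e
  puts e.class # nil has no `>' method
end
```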
Is it possible that sidekiq-unique-jobs is passing the timeout in the wrong format down to redis-rb?
Must be, I'll see if I can replicate this in a test.

@ArturT worker configuration? Just need the

@ArturT I tried to replicate the problem in both the rails console and in tests.
Here is my worker:

```ruby
class MyWorker
  include Sidekiq::Worker

  sidekiq_options queue: :my_queue_name,
                  # Locks when the client pushes the job to the queue.
                  # The queue will be unlocked when the server starts processing the job.
                  # The server then goes on to create a runtime lock for the job to prevent
                  # simultaneous jobs from being executed. As soon as the server starts
                  # processing a job, the client can push the same job to the queue.
                  # https://github.com/mhenrixon/sidekiq-unique-jobs#until-and-while-executing
                  lock: :until_and_while_executing,
                  # search in logs for the SidekiqUniqueJobs keyword to find duplicates
                  log_duplicate_payload: true

  def perform(id)
  end
end
```

I'll try to debug it more later on when I find some time.
It was good you had me have a look at this, because I discovered that I unintentionally created a breaking change, which was fixed in 7.1.1 just now.

I can confirm v7.1.1 works fine now. Thanks :)
@mhenrixon I have a different error this time. Some of the jobs are timing out due to failures connecting with Redis. It's recommended to use a higher timeout for the RedisCloud addon on Heroku, so I use 5 seconds.

Is it possible that sidekiq-unique-jobs no longer respects the 5 second timeout? This is my config:

```ruby
# config/initializers/sidekiq.rb
sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing
  # occasional timeout errors for the Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config
  ...
end

...
```
@ArturT I can't think of a single thing. I've checked both the connection pool gem and the sidekiq gem, and this behavior seems like a mystery to me.
I suspect this could be a coincidental issue. Maybe RedisCloud had temporary issues yesterday or during the night that caused the timeout, and this revealed a misconfiguration in my project. I found out that I have a separate redis config for the sidekiq server and the sidekiq client. In the case of the sidekiq server config I had
Thanks for reporting back @ArturT! I was always wondering why sidekiq uses
I was wrong about the misconfiguration in my config. In one place I use `Redis.new` (it requires …). I found in another place that I use:

```ruby
Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }
  # Add 2 more connections to the pool just in case.
  # This should help with the error ConnectionPool::TimeoutError when a connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2
  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)
  ...
end
```
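The `ConnectionPool::TimeoutError` mentioned in the comment above can be mimicked with a stdlib `SizedQueue` as a stand-in for the connection_pool gem. This is a toy illustration of the failure mode, not the gem's actual implementation:

```ruby
require 'timeout'

# A toy 2-slot "pool": with both connections checked out, a third
# checkout blocks until it times out -- the same shape as
# ConnectionPool::TimeoutError when the whole pool is busy.
pool = SizedQueue.new(2)
2.times { pool << :conn } # two available connections
conn_a = pool.pop         # worker thread 1 checks out
conn_b = pool.pop         # worker thread 2 checks out

begin
  Timeout.timeout(0.1) { pool.pop } # nothing left to check out
rescue Timeout::Error
  puts "checkout timed out"
end
```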
Hi @ArturT, it seems like

Super confusing; I can't think of a single time when I would like my pool timeout to be less than the connect timeout. Quite the contrary, maybe I'd like to wait for an additional 1 second in the pool timeout.
```ruby
sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing
  # occasional timeout errors for the Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5,
  pool_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config
end

redis_conn = lambda do
  Redis.new(timeout: 5, url: Secret.sidekiq_redis_url, namespace: REDIS_SIDEKIQ_NAMESPACE)
end

Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }
  # Add 2 more connections to the pool just in case.
  # This should help with the error ConnectionPool::TimeoutError when a connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2
  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)
end
```

Try something like this @ArturT
@mhenrixon That's a good catch with
We started seeing a lot of

Is there any chance that a connection isn't checked back into the pool?
@mhenrixon I don't think increasing

Please correct me if I'm wrong, but the way I understand it, Sidekiq tries to automatically configure the pool to be big enough. Seeing the

Did something change in this release that uses an additional connection? #604 maybe? Is there a way to create a new connection outside of the pool for this?
Maybe there is something going on here in what @benedikt said. It seems like I had

Is it possible that sidekiq creates too small a pool of connections, and sidekiq-unique-jobs consumes them, which prevents opening new connections? The whole pool of connections could be busy, hence the `pool_timeout=1` limit is reached.
It seems like #604 is adding using a

I believe Sidekiq itself needs two additional connections on top of the number of worker threads it creates. With the

My guess is that this was okay until now because Sidekiq doesn't really use one of the additional two connections, so it was fine until the
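A back-of-the-envelope sketch of that accounting. The numbers are illustrative and the "second connection per priming job" is my reading of the hypothesis above, not confirmed behavior:

```ruby
concurrency = 10
pool_size   = concurrency + 2 # the "+ 2" being Sidekiq's own internal connections

# Each busy worker thread holds one connection; under the hypothesis above,
# a while_executing job briefly needs a second one for the blocking brpoplpush.
busy_workers         = concurrency
extra_for_brpoplpush = concurrency # worst case: every job priming at once

demand = busy_workers + extra_for_brpoplpush
puts demand > pool_size # the pool can be exhausted under load
```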
Point taken, I'll have a look at making sure we only use a connection when needed.
There is no leaking of connections from what I can tell. That said, it does require more redis connections for the extra work. If you didn't already have a ton of trouble with the reaper, you can turn the resurrector off with
Let me know if that helps

Upgraded to the latest version and disabled the reaper resurrector. No issues so far 👍

Thank you for the confirmation @benedikt! I will disable it per default in the next release and refactor the resurrector.
Just to let you know: despite a pool timeout of 5 seconds, I got this exception last night
We've got a busy sidekiq worker pool, and I'm also seeing the same

We've tried setting the sidekiq connection pool to a larger size than the default with some code like this, but we're still seeing this issue.

However, this does not seem to have helped much. Our experience of the issue is that the exceptions start a few hours after a new code deploy / sidekiq process restart and then accelerate from there, until the sidekiq worker instance is unable to find connections to process its own jobs. It feels to us like there must be some issue that causes connections to be used but not returned to the pool with this version of the gem. We're planning on disabling the reaper process as our next attempt at correcting this issue.
@mhenrixon First off, thanks for the awesome gem! We were seeing this issue in production after upgrading, and had to revert back to our much older version once we started noticing jobs being dropped. I believe I was able to isolate the circumstances by introducing latency in the redis connection for

It appears that if the latency is somewhere around 18ms, then the job will be rejected. This was tested over localhost on OSX, sidekiq-unique-jobs version 7.1.2.
The sidekiq portion of this test is:

```ruby
class TestWorker
  include Sidekiq::Worker

  def perform(timestamp)
    Rails.logger.info("performing")
    File.write('/tmp/timestamp', timestamp)
    sleep(10)
  end
end

class UniqueUntilAndWhileExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_and_while_executing
end

class UniqueUntilExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executing
end

class UniqueUntilExecutedTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executed
end
```

And executed in a rails console:

```ruby
def test(worker)
  timestamp = Time.now.to_i.to_s
  p worker.perform_async(timestamp)
  puts "check #{timestamp}?"
  gets
  p File.read('/tmp/timestamp') == timestamp
end

test(TestWorker)
test(UniqueUntilAndWhileExecutingTestWorker)
test(UniqueUntilExecutingTestWorker)
test(UniqueUntilExecutedTestWorker)
```

The ruby TCP proxy script used to introduce latency is in the following gist: https://gist.github.com/millerjs/8ab3a4ce63c4502da4c34db017ae417d

```shell
ruby ~/bp/scratch/latency_proxy.rb localhost:6379 8379
```

with the following results
I'm happy to provide any more info, logs, redis keys, etc. if that will help debug this issue.

edit: If I crank
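For anyone who can't reach the gist, the idea can be sketched as a tiny TCP proxy that sleeps before forwarding each chunk. This is my own minimal version based on the description above, not the script from the gist:

```ruby
require 'socket'

# Forwards 127.0.0.1:listen_port to upstream_host:upstream_port, sleeping
# `delay` seconds before relaying each chunk in either direction.
# Passing listen_port 0 lets the OS pick a free port.
def run_latency_proxy(listen_port, upstream_host, upstream_port, delay)
  server = TCPServer.new('127.0.0.1', listen_port)
  Thread.new do
    loop do
      client   = server.accept
      upstream = TCPSocket.new(upstream_host, upstream_port)
      [[client, upstream], [upstream, client]].each do |from, to|
        Thread.new do
          while (chunk = (from.readpartial(4096) rescue nil))
            sleep delay # the injected latency
            to.write(chunk)
          end
          to.close rescue nil
        end
      end
    end
  end
  server
end
```

Something like `run_latency_proxy(8379, 'localhost', 6379, 0.018)` would mirror the invocation above, with the redis client pointed at port 8379 instead of 6379.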
@mhenrixon the README says

sidekiq-unique-jobs/lib/sidekiq_unique_jobs/locksmith.rb, lines 273 to 277 in 620fe75

`config.timeout` is `nil`.

Are these two things contradictory?

edit 1: it looks like

sidekiq-unique-jobs/lib/sidekiq_unique_jobs/locksmith.rb, lines 101 to 107 in 620fe75

isn't respecting any `lock_timeout` wait settings. If

sidekiq-unique-jobs/lib/sidekiq_unique_jobs/locksmith.rb, lines 239 to 243 in 620fe75

so the

sidekiq-unique-jobs/lib/sidekiq_unique_jobs/locksmith.rb, lines 260 to 261 in 620fe75

this explains why upping the
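The dropped-job mechanism described above can be sketched with a stdlib stand-in for the promise: `Thread#join(limit)` behaves like waiting on the future, returning `nil` when the wait is shorter than the work takes, which here would mean a `nil` `primed_jid`:

```ruby
# Simulate a blocking "brpoplpush" that takes 50ms because of network latency.
worker = Thread.new do
  sleep 0.05
  "jid-123"
end

# Waiting for less time than the call takes: join returns nil,
# which is the shape of the bug -- the job would be dropped.
puts worker.join(0.01).inspect

# Waiting at least as long as the call takes: we get the value back.
puts worker.join(1).value.inspect
```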
@mhenrixon I believe the solution (at least to the problem I'm seeing) is to ensure that the

```diff
diff --git a/lib/sidekiq_unique_jobs/locksmith.rb b/lib/sidekiq_unique_jobs/locksmith.rb
index f90f1ce0..41fb02ab 100644
--- a/lib/sidekiq_unique_jobs/locksmith.rb
+++ b/lib/sidekiq_unique_jobs/locksmith.rb
@@ -102,7 +102,7 @@ module SidekiqUniqueJobs
       raise SidekiqUniqueJobs::InvalidArgument, "#execute needs a block" unless block

       redis(redis_pool) do |conn|
-        lock!(conn, method(:primed_async), &block)
+        lock!(conn, method(:primed_async), config.timeout, &block)
       end
     end
```

though that doesn't handle the

I think a workaround for now would be to set both the

Sorry for the comment spam.

edit: Pull request with the above: #624
relates to mhenrixon#617

There is an issue where an integer `config.timeout` was being respected in `pop_queued`, but the promise in `primed_async` was only waiting the `config.ttl` plus the drift. This means that high latency redis connections will return a `nil` `primed_jid`, and while_executing jobs will get dropped.

The proposed solution is to ensure that `config.timeout` gets passed in as the `wait` argument to the bound `method(:primed_async)`, which allows the `Concurrent::Promises` to time out after the same duration (plus drift) as the `brpoplpush` call.

The issue triage and approach is documented in:

- mhenrixon#617 (comment)
- mhenrixon#617 (comment)
- mhenrixon#617 (comment)
I added an UPGRADING.md document that I will use for guiding people on how to upgrade.
A ton of bug fixes went into this release! Even `UntilAndWhileExecuting` should be less problematic now.

Finally, a special shoutout to my sponsors. You make all of this work possible and worth it!
Link to the release