10X increase in read ops/sec after upgrade from v6 to v7 #684
Comments
Sorry about the trouble; you can tweak the reaper settings to cut the reads a bit. Most of the work (apart from the reaper) is done in Lua, so it shouldn't cause too much trouble.
For example, reaping every 30 seconds seems a little overkill. I'd try reaping every 5-30 minutes instead. Also, some people reported issues with the resurrector, which is why I default that to off. That wouldn't be many reads, but it helps if you really need to cut down on reads. Never heard of a provider that charges per read before, but I'm happy to optimize the gem 💎
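For illustration only - a minimal sketch of what a relaxed reaper schedule could look like via the gem's configure block, assuming these option names match the installed version of sidekiq-unique-jobs:

```ruby
# A sketch only: option names mirror the keys in this issue's config and the
# gem's documented reaper settings; verify them against your installed version.
SidekiqUniqueJobs.configure do |config|
  config.reaper          = :ruby   # :lua runs the reaper inside Redis instead
  config.reaper_count    = 1_000   # max digests cleaned per run
  config.reaper_interval = 600     # every 10 minutes rather than every 30 seconds
  config.reaper_timeout  = 20      # seconds a reaper run may take before timing out
end
```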
Hi 👋🏻 For context, @jkhulme and I both work on the same application which depends on your gem (thank you!). We have different deployments with different scales, but a lot of our usage patterns look similar.

I took some time today to examine our Redis reads and I believe you're right that the largest increase is coming from the reaper. It fails while our Sidekiq queues are full, but once they clear out we could have 300,000 locks to clean out (@jkhulme may have millions). Unfortunately I don't expect reaping less frequently to help - in the scenario where we have 300k locks to clear, it's going to cost a lot of reads whether we clean 1k every 30 seconds or 10k every 5 minutes.

It's possible that switching to the Lua reaper would fix a lot of our problems, but I am concerned about its blocking behaviour, especially when we can build up such a large number of locks to be reaped. Currently, when the reaper is iterating over a list of digests to find orphans, it performs the following operations per digest:
After that, it uses a batch delete process, and it also avoids looping over every digest since #666 was merged. But even if the reaper only ever ran the above for digests it was going to delete, that's roughly 12 reads per digest, so cleaning up 300k digests will produce 3.9M reads. If that's spread evenly over 3 hours, that's still 300 reads/second. In practice, it's not spread out evenly, and we see peaks over 2k reads/second every minute or so (our reaper interval is a minute).

Looking at the list above, there are three operations which I suggest could be saved and re-used, at least within a given run of the reaper:
That would get us from 12 down to 9. I think, ideally, we would find some way to batch some of the other operations. For example, for the two ZSCANs, we might consider testing an entire batch of keys together, by first storing them in a set (e.g., …).

@mhenrixon If I make a PR roughly following the strategy above, would that be helpful? I'm happy to make any further changes you suggest in terms of style/architecture/specs etc.

@jkhulme do you think a 50% reduction would get us back within the range of acceptable read traffic?
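As a rough illustration of the batching idea above (a sketch only - the helper, key names, and digests are hypothetical, and this is not the gem's actual reaper code), a single scan of a sorted set could be shared across a whole batch of digests instead of issuing a ZSCAN per digest:

```ruby
require "redis"
require "set"

# Hypothetical sketch: check a whole batch of digests against one of Sidekiq's
# sorted sets (e.g. "schedule" or "retry") in a single scan per reaper run.
def digests_found_in_zset(redis, zset, digests)
  found = Set.new
  redis.zscan_each(zset, count: 1_000) do |member, _score|
    digests.each { |digest| found << digest if member.include?(digest) }
  end
  found
end

redis  = Redis.new
batch  = ["uniquejobs:abc123", "uniquejobs:def456"] # digests being considered for reaping
active = digests_found_in_zset(redis, "schedule", batch) |
         digests_found_in_zset(redis, "retry", batch)
orphan_candidates = batch.reject { |digest| active.include?(digest) }
```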
If you have any suggestions for how to improve performance I am wide open to them, @francesmcmullin. The only thing that is extremely important is to make sure that Redis version compatibility isn't broken without a major release. I'm open to increasing the required Redis version, but I am not sure I have test coverage for the various versions.
Thanks for the pointer! I foolishly ignored the version number of …

I have a very rough changeset with passing specs here for reducing the SSCAN and LLEN calls, but having looked at the problem more closely I'm interested in zooming out a little bit. I think the root of the problem, for @jkhulme and me, is that we use an enormous amount of …

By its nature, the reaper must thoroughly check if digests are queued, retrying, active or scheduled, because it is a safety-net cleanup system which knows nothing about the locks related to these digests. However, none of that is necessary for an …
For us, I'm pursuing a simpler SETNX lock for our highest volume use case, rather than using …
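For readers unfamiliar with the pattern, a minimal sketch of such a SETNX-style lock with plain redis-rb might look like this (the key name and TTL are made up for illustration; this is not sidekiq-unique-jobs code):

```ruby
require "redis"

redis = Redis.new

# SET with NX only succeeds if the key does not already exist, and EX makes
# the lock expire on its own if the worker dies before releasing it.
lock_key = "myapp:locks:import:42" # hypothetical key, purely illustrative
acquired = redis.set(lock_key, Process.pid, nx: true, ex: 600)

if acquired
  begin
    # do the unique work here
  ensure
    redis.del(lock_key) # release early rather than waiting for the TTL
  end
else
  # another worker already holds the lock; skip or reschedule the job
end
```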
@francesmcmullin, there is nothing per se keeping us from having all the digests in one place. I have the digests set to give some insight into what's going on without having to loop through all keys. I'll give this some thought; it would be pretty sweet to use a more efficient reaper.

Perhaps you could contribute one suitable for your use case, and we can prefer that when the Redis version is appropriate and fall back to the existing one if not? I would probably duplicate the existing Ruby reaper and make the changes in a separate class to avoid adding if statements to the existing one.
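A rough sketch of that version-gated selection, assuming hypothetical class names rather than anything that exists in the gem:

```ruby
require "redis"

# Hypothetical stand-ins; none of these are real sidekiq-unique-jobs constants.
ExistingRubyReaper = Struct.new(:redis) # current behaviour, left untouched
NewBatchedReaper   = Struct.new(:redis) # more read-efficient implementation

module ReaperPicker
  MIN_VERSION = Gem::Version.new("6.2.0") # assumed cutoff, purely illustrative

  # Pick the reaper based on the connected Redis server version,
  # falling back to the existing implementation on older servers.
  def self.build(redis)
    version = Gem::Version.new(redis.info("server").fetch("redis_version"))
    version >= MIN_VERSION ? NewBatchedReaper.new(redis) : ExistingRubyReaper.new(redis)
  end
end
```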
@francesmcmullin I'd love to get some improvements into the gem. Did you make any more improvements? Any PR you could open? I appreciate your support!
Hi @mhenrixon 👋🏻 Thanks for the nudge! I'll open a few PRs today; please treat them as suggestions only - I won't mind if they need more work or don't fit with the gem's broader use cases etc.
@francesmcmullin you just really inspired me. I will reduce the complexity considerably and create different Lua scripts for different locks, and also different sets for different locks. I believe you just made v8 happen :)
Describe the bug
We upgraded from 6.0.24 to 7.1.12 (alongside upgrading Sidekiq from 6.1.2 to 6.4.0).
Before the upgrade we were seeing ~300 reads/sec and ~300 writes/sec.
After the upgrade we are seeing ~3000 reads/sec and still ~300 writes/sec.
This is a very big increase, and it is proving to be quite expensive with our Redis provider.
Expected behavior
Not to see a 10x increase in reads/sec
Current behavior
We are seeing a 10x increase in reads/sec.
Worker class
n/a
Additional context
Our unique jobs config:

```ruby
"unique_jobs_debug" => false,
"unique_jobs_reaper_type" => "ruby",
"unique_jobs_reaper_count" => 1000,
"unique_jobs_reaper_interval" => 30,
"unique_jobs_reaper_timeout" => 20,
"unique_jobs_reaper_resurrector_interval" => 3600
```
Please let us know if there is any other info we can provide.