
redis script cache gets bloated after update to bull 2.0 #426

Closed
prdn opened this issue Jan 3, 2017 · 32 comments

@prdn

prdn commented Jan 3, 2017

Hello,

I'm experiencing a growing Lua script cache after updating to bull 2.0.
After 20 days of uptime my Redis Lua cache has grown to 15GB.

Is anyone else experiencing the same issue?

Thanks

@manast
Member

manast commented Jan 3, 2017

Can you check what entries are in the cache? There must be a spurious entry that for some reason is not being reused but is created anew every time it is used.

@prdn
Author

prdn commented Jan 3, 2017

@manast I've added some console logs to the script caching function in lib/scripts.js. The hashes all seem to be the same.
Just to be sure we are on the same page: the value that is growing fast is used_memory_lua.
Reverting to v1.1.3 works fine.

Do you have a preferred way to analyse the Redis Lua script cache?

@prdn
Author

prdn commented Jan 3, 2017

@manast could it be that _retryAtOnce is using eval directly, instead of caching the script and using evalsha, so the same script accumulates again and again every time there is a retry?

@manast
Member

manast commented Jan 3, 2017

eval does not cache the script, so I don't think that is the reason for the problem. I am not aware of any other command in Redis besides SCRIPT LOAD that caches scripts. So if you have time you could put a console.log just before calling load, and you should be able to see which script is being cached continuously: https://redis.io/commands/script-load
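
(For illustration, a minimal sketch of that kind of logging, assuming an ioredis client and a helper name of my own, loadScript, rather than bull's actual internals:)

```js
const crypto = require('crypto');
const Redis = require('ioredis');

const client = new Redis();

// Log the SHA1 of every script body before caching it with SCRIPT LOAD.
// A healthy application should print only a handful of distinct hashes.
function loadScript(script) {
  const sha = crypto.createHash('sha1').update(script).digest('hex');
  console.log('SCRIPT LOAD', sha);
  return client.script('load', script); // Redis returns the same SHA
}
```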

@prdn
Author

prdn commented Jan 3, 2017

You're right. I added a console.log in my dev environment.
Everything looks normal, yet a single Node.js process (concurrency = 5) is bloating my Redis instance.
In production I need to revert to 1.1.3. The main change seems to be the move to ioredis, and that looks good. I will try to spend a bit of time on this in the next days, since it's crucial for me to keep bull as a dependency.

Thanks for helping

@manast
Member

manast commented Jan 4, 2017

The major new thing in 2.0 is the redlock algorithm; I would bet that is the root of the problem. @doublerebel any ideas?

@prdn
Author

prdn commented Jan 7, 2017

I think you're right @manast @doublerebel.
Is there an easy way to test the previous locking system?
Otherwise I will set up a branch for it.

@manast
Member

manast commented Jan 7, 2017

When the old locking system was replaced by redlock, many other changes were performed as well, so it is not easy to swap it back. I will investigate why the script cache is increasing when I have some time; there could be an issue in the redlock dependency as well: https://github.com/mike-marcacci/node-redlock

@manast
Member

manast commented Jan 8, 2017

I have analysed the code, and I cannot see anything in redlock or in bull that would currently cache a lot of scripts. Redlock actually runs eval directly (which is not very performant; I would like to improve this in the future), and in bull the code related to locking does not use any scripts of its own; it is delegated to redlock. If you put a console.log here https://github.com/OptimalBits/bull/blob/master/lib/scripts.js#L46 outputting the hashes, you should only see a few of them (correct behaviour); otherwise you will see a lot of them.
Now, is this cache bloating actually creating problems? It could be some Redis optimization that caches Lua data but gives that memory back to actual keys if needed, and would therefore be nothing to worry about.
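
(One way to keep an eye on the growth is to poll INFO memory for the used_memory_lua field; a small sketch, assuming ioredis:)

```js
const Redis = require('ioredis');
const client = new Redis();

// Print the size of Redis's Lua script cache once a minute.
setInterval(async () => {
  const info = await client.info('memory');
  const match = info.match(/used_memory_lua:(\d+)/);
  if (match) {
    console.log('used_memory_lua:', Number(match[1]), 'bytes');
  }
}, 60 * 1000);
```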

@manast
Member

manast commented Jan 8, 2017

From redis documentation https://redis.io/commands/eval

"The EVAL command forces you to send the script body again and again. Redis does not need to recompile the script every time as it uses an internal caching mechanism, however paying the cost of the additional bandwidth may not be optimal in many contexts.hanism, however paying the cost of the additional bandwidth may not be optimal in many contexts."

@prdn
Author

prdn commented Jan 8, 2017

@manast well, my Redis DB, which is around 1GB, has reached 15GB of memory. That's a pretty big overhead in my opinion, and it is a risk in a production environment since there is no clear limit on it.

I think I found the issue:

var lockAcquired = ['and redis.call("HSET", "', queue.toKey(job.jobId), '", "lockAcquired", "1")'].join('');

Here you call redlock passing a locking Lua script that varies per job.
This is probably what's bloating the Redis script cache.
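
(To make the failure mode concrete: because the job key is spliced into the Lua source, every job produces a byte-different script and therefore a different SHA1 entry in the script cache. A minimal sketch, with made-up key names:)

```js
const crypto = require('crypto');

// The job key is part of the Lua source, so each job ID yields a different
// script body and thus a different entry in Redis's script cache.
function lockAcquiredSnippet(jobKey) {
  return 'redis.call("HSET", "' + jobKey + '", "lockAcquired", "1")';
}

const sha1 = s => crypto.createHash('sha1').update(s).digest('hex');

console.log(sha1(lockAcquiredSnippet('bull:video:1'))); // one cache entry
console.log(sha1(lockAcquiredSnippet('bull:video:2'))); // a second, different entry
```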

@manast
Member

manast commented Jan 8, 2017

without checking redlock, I guess that line is there because it is not currently possible to specify extra keys to be used by redlock. That would certainly need a change in redlock as well, to be merged and released before we can fix it in bull...

@prdn
Author

prdn commented Jan 8, 2017

Sounds good. At least it seems to explain what I'm seeing. I'll see if I can prepare something for graylog.
I think that any real use of bull 2.0 in production will end up causing trouble. I'm handling around 1 million jobs per day, and with the current cache growth I reach 15GB in 3 days.

Thanks for this awesome lib btw, I truly like it

@manast
Member

manast commented Jan 8, 2017

Even if the 15GB is scary, it may not be a problem if Redis just reuses that memory when needed. I checked the redis.conf file, but unfortunately there is no setting to limit the maximum size of the Lua cache. Maybe @antirez could give some explanation of this behaviour.

@prdn
Author

prdn commented Jan 8, 2017

I agree that it would not be scary if the memory were recycled. I checked for that parameter as well.
From here https://redis.io/commands/eval

"The reason why scripts can be cached for long time is that it is unlikely for a well written application to have enough different scripts to cause memory problems. Every script is conceptually like the implementation of a new command, and even a large application will likely have just a few hundred of them. Even if the application is modified many times and scripts will change, the memory used is negligible."

I have the feeling that changing the script each time goes against Redis's design.

@manast manast added the bug label Jan 9, 2017
@gblock0

gblock0 commented Jan 17, 2017

We are seeing this as well; our used_memory_lua was at 1118090240 (about 1GB). Our backend locks up, and once we run SCRIPT FLUSH things start working again.
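
(As a stopgap, the script cache can be flushed by hand; a sketch with ioredis, equivalent to running SCRIPT FLUSH in redis-cli:)

```js
const Redis = require('ioredis');
const client = new Redis();

// Drop every cached Lua script. Scripts sent afterwards via EVAL or
// SCRIPT LOAD are simply re-cached, so the only cost is a recompile.
client.script('flush').then(() => console.log('Lua script cache flushed'));
```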

joeandrews added a commit to goradish/bull that referenced this issue Jan 17, 2017
…ts#426

NOTES: this will still violate the Redis Cluster rule that each script explicitly defines the keys it writes to in the KEYS array, however the current implementation also does this, and this should fix the growing lua_memory_cache issue.
@joeandrews
Contributor

joeandrews commented Jan 17, 2017

I think we have fixed this in our pull request.

The issue is that the line

var lockAcquired = ['and redis.call("HSET", "', queue.toKey(job.jobId), '", "lockAcquired", "1")'].join('');

was creating a new script for every single job ID: 10,000 jobs result in 10,000 scripts in the script cache, and hence this issue.
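
(For reference, one general way to avoid the per-job scripts, shown here as a hypothetical sketch rather than the actual PR: keep the Lua source constant and pass the job key in as a parameter, so Redis caches a single script no matter how many jobs run.)

```js
const Redis = require('ioredis');
const client = new Redis();

// Constant Lua source: the lock key and the job hash key arrive via KEYS,
// the lock token and TTL via ARGV, so the script body never changes.
const lockScript = [
  'if redis.call("SET", KEYS[1], ARGV[1], "NX", "PX", ARGV[2]) then',
  '  redis.call("HSET", KEYS[2], "lockAcquired", "1")',
  '  return 1',
  'end',
  'return 0'
].join('\n');

// Hypothetical usage for job 42 of a queue named "video"; the key names are
// illustrative, not bull's exact key layout.
client.eval(lockScript, 2, 'bull:video:42:lock', 'bull:video:42', 'some-token', 5000)
  .then(acquired => console.log('lock acquired:', acquired === 1));
```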

@jeltok

jeltok commented Jan 21, 2017

We are experiencing this issue as well with about 600k jobs a day. We use Redis 3.0.7 on Ubuntu 14.04.

After 7 days of use, Redis memory is around 8GB: data is less than 1GB, and the rest is reported as Lua script cache.

@prdn
Author

prdn commented Feb 3, 2017

Any update on this issue?
Thanks

@manast
Member

manast commented Feb 3, 2017

we have a PR that solves this issue but there is another issue with that PR that needs to be resolved before it can be merged.

@jeltok

jeltok commented Feb 25, 2017

This is how the memory on my Redis instance (the one backing Bull) looks. We are running about 1 million messages a day through the Bull queues. The server has 16 GB of memory and loses about 3GB a day to the Lua cache.
Running the SCRIPT FLUSH command every other day helps, but it is a hassle. Other than that, Bull works just great for us.
I hope this issue will be fixed soon.

[screenshot: Redis memory usage graph]

@manast
Member

manast commented Feb 27, 2017

@jeltok a fix for this issue is the next thing on my list.

manast added a commit that referenced this issue Mar 1, 2017
Fix Lock Redis script to not be a different script for every Job ID #426
@manast
Member

manast commented Mar 1, 2017

@jeltok can you please verify that the issue is resolved with bull 2.2.0 so that we can close it?

@engvik

engvik commented Mar 1, 2017

Looks like I'm still seeing the issue. Upgraded from 1.1.3 to 2.2.0 today. Have kept an eye on the used_memory_lua field and it's been steadily increasing since the upgrade. Flushing the scripts from the redis instance drops the usage down to normal.

[screenshot: Redis memory stats showing used_memory_lua still increasing]

@manast
Member

manast commented Mar 1, 2017

hmm, ok, I will need to dig more into it.

@engvik

engvik commented Mar 1, 2017

Thanks! Just switched from kue, and other than this it's been working great! Let me know if you need anything 😄

@jeltok

jeltok commented Mar 2, 2017

I installed version 2.2.0 yesterday. Unfortunately it does not improve the Lua cache issue by much.

[screenshot: Redis memory usage graph]

@manast
Member

manast commented Mar 2, 2017

@jeltok I would say nothing at all :). I will try to do a deeper analysis.

@joeandrews
Contributor

@manast @jeltok I have some time today and will try to look into it further as well.

Joe

manast added a commit that referenced this issue Mar 3, 2017
manast added a commit that referenced this issue Mar 3, 2017
@manast
Member

manast commented Mar 8, 2017

Ok, I have created a custom bull-redlock package and fixed one last issue. The queue works pretty well now without leaks. I will make a new release in the following hours.

@manast manast closed this as completed Mar 8, 2017
@jeltok

jeltok commented Mar 9, 2017

Thanks a lot!!! Looks much better now :-)

[screenshot: Redis memory usage after the fix]

@boatmeme

Sorry for the noise, just wanted to send some gratitude to @manast for fixing this issue.
