[rllib] Memory leak in MADDPG #6857
Comments
Probably your replay buffer is just using too much memory. Does the leak still happen if you set `num_workers=0`, `buffer_size=1000`, and leave `ray.init()` alone?
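For reference, a minimal sketch of that test config, assuming the `contrib/MADDPG` entry point in RLlib and a placeholder env name and stopping condition (the actual multiagent policy setup from run_maddpg.py would still be needed):

```python
import ray
from ray import tune

# Leave ray.init() at its defaults, as suggested above.
ray.init()

tune.run(
    "contrib/MADDPG",                 # MADDPG contrib trainer in RLlib
    config={
        "env": "my_multiagent_env",   # placeholder: register your own env
        "num_workers": 0,             # sample in the trainer process only
        "buffer_size": 1000,          # shrink the replay buffer
        # plus the usual "multiagent" policies / policy_mapping_fn section
    },
    stop={"timesteps_total": 100_000},
)
```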
You can add CNN to
I have the same issue using DDPG. I use 50 workers and a replay buffer of size 100000. It is consuming more than 60 GB after 50M iterations, and it has been increasing linearly since the beginning. I'm using release 0.8.6. @ericl Do you think this is normal?
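A rough back-of-envelope on whether the buffer alone can explain that growth, assuming float32 numpy observations and ignoring per-object overhead:

```python
# How many bytes per transition would the replay buffer need for 60 GB?
buffer_size = 100_000                       # transitions
observed_gb = 60
per_transition_kb = observed_gb * 1024**3 / buffer_size / 1024
print(f"{per_transition_kb:.0f} KB per stored transition")    # ~630 KB

# A transition holding obs + new_obs of D float32 values costs ~2 * 4 * D bytes,
# so D would need to be around 80,000 floats per observation to reach that.
print(2 * 4 * 80_000 / 1024, "KB")                            # 625.0 KB
```

Unless each observation is on the order of hundreds of KB, the buffer by itself would not account for 60 GB, and continued linear growth after the buffer is full would point at something else.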
I'm running on two custom (proprietary) environments, and it otherwise works fine, but with MADDPG Ray throws `RayOutOfMemoryError` after 52 episodes (or 49k timesteps). The error.txt output log is attached.

Ray was initialized with `ray.init(redis_max_memory=int(4e9), object_store_memory=int(10e9), memory=int(16e9))`, on a machine with 32 GB of RAM. I've attached my slightly modified version of run_maddpg.py from the example code to include all interesting parameters. This issue persists after upgrading to tf 1.15 and ray 0.9dev0, which GitHub reports say fixed similar issues. I'm also not the only person having this problem, as evidenced by the comments here in the old example code repo: wsjeon/maddpg-rllib#7 (the one that was marked off topic).

I followed this comment from #3884 to set my object store size. Each observation in my game that is returned by a `step` call is of size 538 KB. Therefore, `538 KB * max(48, 12*4) * 16 ≈ 414 MB` (I believe the default value of `LEARNER_QUEUE_SIZE` is 16), so I set the `object_store_memory` to 10 GB. That should be enough.
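Spelled out as a quick script (the 48, 12*4, and 16 multipliers come from the #3884 comment and the queue-size assumption above):

```python
# Rough budget for observations in flight, following the #3884 comment.
obs_kb = 538                     # size of one observation from step()
batches_in_flight = max(48, 12 * 4)
learner_queue_size = 16          # believed RLlib default

estimate_mb = obs_kb * batches_in_flight * learner_queue_size / 1000
print(f"~{estimate_mb:.0f} MB of observations in flight")    # ~413 MB

object_store_mb = 10_000         # the 10 GB passed to ray.init()
print(f"object_store_memory budget: {object_store_mb} MB")
```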
One super curious thing in this issue is that, while the `num_workers` parameter is the number of distributed compute nodes Ray uses, when I run `pstree` to check for the spawned processes, I see that there are 89 `RolloutWorker` threads running. The number changes depending on the memory parameters I set, but it never goes below 60. This doesn't make sense. After the initial start, the number of processes doesn't keep growing, so this isn't some sort of leak in terms of process count with processes not being released.
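For anyone trying to reproduce the count, here is a sketch of one way to tally those processes with psutil; it assumes the worker process titles contain the string "RolloutWorker", which is what pstree shows on my machine:

```python
import psutil

def count_rollout_workers():
    """Count OS processes whose title or cmdline mentions RolloutWorker."""
    n = 0
    for p in psutil.process_iter(["name", "cmdline"]):
        name = p.info["name"] or ""
        cmdline = " ".join(p.info["cmdline"] or [])
        if "RolloutWorker" in name or "RolloutWorker" in cmdline:
            n += 1
    return n

print(count_rollout_workers(), "RolloutWorker processes")
```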
The nature of the memory management here also leads me to believe that Ray somehow isn't respecting the limits I set, which I filed a separate bug report about here: #6785
The documentation also seems to say that we can enforce memory quotas on workers: "Enforcement: If an actor exceeds its memory quota, calls to it will throw RayOutOfMemoryError and it may be killed. Memory quota is currently enforced on a best-effort basis for actors only (but quota is taken into account during scheduling in all cases)." As far as I can tell, there's no way to set this in MADDPG? Am I missing something?
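For context, the actor-level quota that passage describes looks roughly like this, if I'm reading the memory-management docs correctly (a sketch for a plain Ray actor with an arbitrary 2 GiB limit; whether and how RLlib forwards such a quota to the MADDPG rollout workers is exactly the open question):

```python
import ray

ray.init()

# An actor declaring a memory quota; exceeding it should raise
# RayOutOfMemoryError (best-effort enforcement, also used for scheduling).
@ray.remote(memory=2 * 1024**3)
class QuotaedActor:
    def ping(self):
        return "ok"

a = QuotaedActor.remote()
print(ray.get(a.ping.remote()))
```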
Getting this resolved is incredibly important to me, and I'm very happy to submit PRs and work with you guys on debugging this issue (I'm already the new maintainer of the maddpg example repo), but I'm completely at a loss. Can anyone offer some insight here?
cc @wsjeon @richardliaw @ericl
Here's the code mentioned above; I had to change the extension from .py to .txt to post it.
run_maddpg.txt