[rllib] Memory usage constantly growing while training IMPALA #3884
So it's expected that there will be a lot of shared memory used -- that's probably the Ray object store. By default object store memory is capped to min(20GB, 40% of total memory), so you should see it stabilize there. You can also set the object store memory with `ray.init(object_store_memory=)` to lower it (I don't think IMPALA actually requires that much shared memory, just O(BATCH_SIZE * NUM_WORKERS)).
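For reference, a minimal sketch of what capping the object store can look like (the 1 GB value is an arbitrary example, not a recommendation from this thread):

```python
import ray

# Limit the plasma object store to roughly 1 GB (value in bytes).
ray.init(object_store_memory=10**9)
```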
Hmm, I would try to reproduce this without SC2, or otherwise isolate the cause (if there is a reproduction script, I can help track down the issue as well).
Ok, I think I created a repro:
The object store memory limit is there so that it reaches the limit faster. The train and sample processes grow and grow until they have ~6 GB of RAM each (in a few minutes), the app hangs, and in the end it crashes (though the final error seems different - no more memory in the plasma store - probably a different wording of the same issue).
Thanks! I'll be able to take a look later this week.
How much memory does the machine have? Are the changes you made to the Ray source code possibly relevant (what do they do)? You can also try setting `--ray-redis-max-memory=5000000000`, though that probably isn't the issue.
It has 32 GB, with ~6 GB used before starting RLlib; some of that is taken by the SC2 environments (~0.5 GB per environment), and Ray workers have some footprint as well (even unused ones, ~300 MB). But the trainer and sampler processes keep growing until the crash, which comes even faster if object store memory is limited; the other processes use memory at a steady level.
No changes to the Ray base. In RLlib, the only important change is that each environment is a remote worker itself, but I don't think that's relevant. In the last post, I think I made a repro with vanilla Ray, installed via pip, with no changes.
Tried. Didn't help.
I was able to reproduce this, and #3938 fixes the issue for me. After adding the copy, memory usage seems to stabilize once the object store memory limit is reached. Could you try it out? I was running Pong with
Hey Eric, I installed the Ray repo from a git clone, with your fix, and it seems your fix doesn't work for me. Both my command (
Can we try to iterate on this issue, as it seems it still persists? Can you also share the basics of how Ray manages memory so I can try to understand it? It seems to me that process memory is ever-growing for each Ray process.
Sure, basically when you return objects from remote Ray workers, they are copied into a shared-memory object store. When processes call ray.get() on objects, they get a memory-mapped view of the object when possible. When the object store memory is full, objects are evicted in LRU order. Objects can't be evicted, though, if they are referenced by some Python object (i.e. some numpy array has a view of the shared memory).
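As a rough illustration of that description (a generic sketch, not code from this issue), returning a numpy array from a task places it in the shared-memory store, and `ray.get()` hands back a memory-mapped view:

```python
import numpy as np
import ray

ray.init()

@ray.remote
def produce_batch():
    # The returned array is copied into the shared-memory object store.
    return np.zeros((1000, 1000), dtype=np.float32)

batch = ray.get(produce_batch.remote())  # read-only view backed by shared memory
# While `batch` is referenced, the underlying object cannot be evicted;
# dropping the reference lets the store reclaim it under LRU pressure.
del batch
```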
I can also reproduce what you mentioned now -- a hang, and then
@stephanie-wang @robertnishihara do you have any insight into what's going on here? I don't know why we are getting reconstruction on a single node. Also, there is another potential issue with that benchmark -- doing some back-of-the-envelope math, the peak memory usage is actually 20GB:
The 16 constant comes from https://github.com/ray-project/ray/blob/master/python/ray/rllib/optimizers/async_samples_optimizer.py#L27
However, even reducing `num_envs_per_worker` to 1 still causes the error (
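For intuition, here is a hypothetical version of that back-of-the-envelope estimate; the actual numbers from the comment were not preserved, so all the values below are illustrative assumptions, not the issue's real configuration:

```python
obs_bytes = 100 * 1024        # assumed ~100 KB per observation
sample_batch_size = 500       # assumed
num_envs_per_worker = 25      # assumed
learner_queue_size = 16       # the constant referenced above

per_batch = obs_bytes * sample_batch_size * num_envs_per_worker  # ~1.3 GB
peak = per_batch * learner_queue_size                            # ~20 GB once the queue fills
print(f"per batch: {per_batch / 1e9:.2f} GB, queued peak: {peak / 1e9:.1f} GB")
```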
So the following trains for me up to 150K timesteps; can you confirm? Memory seems to reach a steady state at ~4 GB shared.
One thing I did do was comment out the `compute_apply()` block in async_samples_optimizer, so that it could run quickly on my laptop, but that part seems unrelated to the leak. So, a couple of observations:
This is almost definitely happening due to object eviction, but I thought that we had fixed that in #3490, which throws an error to the application if an evicted object is needed. I believe that either there is a bug in that fix, or the application is somehow not handling the error.
Yes, I can. Shared memory stays at ~4.2 GB. I'm not sure I understand the vectorization issue. The thing is that RLlib works OK for a minute or two (it is learning, but using ever more memory) and then crashes as it uses all the memory. It manages to execute the learner several times, so it doesn't look like the learning data takes an unacceptably large share of the RAM, but rather that it is not being freed...
The issue is just that the memory used is too much, since the batches are of size `num_envs_per_worker * sample_batch_size` (see the size estimates). In those cases there is no leak; just materializing the arrays in memory costs 20 GB already. You might see some learning before the internal queue fills up. I think you just need to reduce the sample batch size and/or increase the memory allocation.
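A minimal sketch of that suggestion, assuming the 0.6-era Tune/RLlib API and arbitrary example values (not the config used in this thread):

```python
import ray
from ray import tune

ray.init(object_store_memory=10 * 10**9)  # e.g. allow ~10 GB of shared memory

tune.run_experiments({
    "pong-impala-small-batches": {
        "run": "IMPALA",
        "env": "PongNoFrameskip-v4",
        "config": {
            "num_workers": 2,
            "num_envs_per_worker": 1,   # fewer envs per worker -> smaller sample batches
            "sample_batch_size": 50,    # smaller rollout fragments
            "train_batch_size": 500,    # keep the learner batch modest as well
        },
    },
})
```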
I think one thing we can do is make the learner queue size configurable, since a queue size of 16 batches might be too large for many cases.
Aha, so I understand your idea now. Still, I don't think the learner queue ever gets full. I was debugging some more, with this change to LearnerThread (prints added):
Running this
it never printed more than 1 (after/before), and ended up with:
When running the same experiment but with object store memory at 25 GB, it again doesn't queue learner batches, it doesn't break, and memory usage seems to stabilize.
And when I start our own SC2 IMPALA training with 1 worker and 10 envs inside it, it dies after ~1 hour, with
If I try the same thing with 15 GB of object store memory, it breaks after a few minutes with something like:
Conclusion: we still haven't managed to get our own training running.
PS. I really have trouble understanding the memory usage. Old objects are evicted from the object cache only when another object needs to get in and there's no more memory inside? Can we change some setting so that object cache cleanup is done more often, even before it is full? That way, RAM usage should stabilize much faster? Also, look at the example below - I don't understand what resident memory is, what shared memory is, or which process uses how much memory; everything seems confusing. How can a process have resident memory bigger than the whole RAM usage of the computer?
I think the resident memory is misleading here -- the memory used by the process is just "Memory", and the object store memory is "Shared Mem". So in the screenshot there is ~16GB used by the object store, and ~1.4GB, ~1.1GB used by the processes.
Will look into this, it seems related to #3490
That's right, it's LRU order once full.
I agree some sort of more eager cleanup would be nice here. It's challenging since the object store contains objects from multiple nodes, so you have to do distributed reference counting. In addition, it is not always possible to shrink the shared memory region, depending on memory fragmentation (which can also cause "leaks" by itself if memory is fragmented into chunks that are too small).
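On the resident vs. shared memory confusion a few comments up, a small generic sketch (not from this thread) of separating the two per process with psutil on Linux:

```python
import psutil

for proc in psutil.process_iter(["pid", "name"]):
    name = proc.info["name"] or ""
    if "ray" not in name:
        continue
    try:
        mem = proc.memory_info()  # on Linux this includes both rss and shared
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
    private = mem.rss - mem.shared  # rough private (non-shared) footprint
    print(f"{proc.info['pid']:>7} {name}: rss={mem.rss/1e6:.0f} MB "
          f"shared={mem.shared/1e6:.0f} MB private~{private/1e6:.0f} MB")
```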
I'm able to reproduce the above errors in #3970. I think the easiest path forward is to greatly reduce the sample batch size, i.e., ensure that
The slowly increasing shared memory usage is harder to explain, but could be memory fragmentation in the object store. I don't think it's a simple leak, since otherwise you'd be able to reproduce it with the Pong example.
Note that `train_batch_size` could also increase shared memory usage, since the code buffers sample batches until the target `train_batch_size` is reached. So you want to consider
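A hypothetical sanity check along those lines (the exact expression from the comment was not preserved, and every value below is an assumption):

```python
obs_bytes = 100 * 1024      # assumed per-step observation size
sample_batch_size = 50
num_envs_per_worker = 1
num_workers = 2
train_batch_size = 500
learner_queue_size = 16

in_flight = obs_bytes * sample_batch_size * num_envs_per_worker * num_workers
buffered = obs_bytes * train_batch_size  # samples held until train_batch_size is reached
queued = obs_bytes * sample_batch_size * num_envs_per_worker * learner_queue_size
total = in_flight + buffered + queued
print(f"rough shared-memory estimate: {total / 1e6:.0f} MB")
```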
Ah, this might be it: the replay buffer is filled with up to
After this fix,
Maybe I haven't emphasized enough that I don't think it is possible that we're using that much memory. The obs for SC2 has the same size as the one for Pong. Math for our SC2 learning:
Let's say that there is another batch in the making while one is training (we never saw more than 1 batch queued to the learner); that is ~320 MB. Also, our network weights are 2.9 MB, and the Pong weights are 4 MB (checked). So memory usage for the learning data should be less than 0.5 GB. It should behave similarly to Pong and shouldn't eat much more RAM or grow constantly... I tried running with `'replay_buffer_num_slots': 0`; it helped a bit, but I don't think it helped much. I see a new process growing in memory that I don't think I saw before - redis-server.

Idea: the SC2 difference is due to the remote workers. It seems that the Pong examples are stable now, but SC2 still can't learn without using ever more memory. I think I managed to pinpoint the current issue I face a bit more - this new redis-server process helped a lot.

Idea: the redis-server process is growing because the obs from the remote workers never get cleaned up. It seems that the remote workers are actually creating the problem. With 550 samples per second, something like 550 * 100 KB of data should be stored in Redis per second, which is ~53 MB, but as I understand it Redis compresses values before storing them, so it seems possible they are compressed down ~25x (100 KB -> 4 KB; obs are mostly the same values, highly compressible). Do you think this might be the case? If yes, how can I fix that issue? If this is indeed the case, you can close this issue and we can continue the discussion in #3968 ([wingman -> rllib] Remote and entangled environments).
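The arithmetic in that guess, spelled out (values as stated in the comment; the ~25x compression ratio is the commenter's own assumption):

```python
samples_per_sec = 550
obs_kb = 100
raw_mb_per_sec = samples_per_sec * obs_kb / 1024   # ~53.7 MB/s uncompressed
compressed_mb_per_sec = raw_mb_per_sec / 25        # ~2.1 MB/s if compressed ~25x
print(f"raw: {raw_mb_per_sec:.1f} MB/s, compressed: {compressed_mb_per_sec:.1f} MB/s")
```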
If redis-server is the only problem now, that's easy to fix by setting `redis_max_memory` to a smaller value (it defaults to 5 GB). Redis is used to organize communication between workers (i.e. task metadata). The reason it's allowed to use a bunch of memory is to track task lineage for fault-tolerance purposes. 5 GB is probably overkill for your use case though. There was also another issue before where all logs got written into Redis and not removed, but that may have been fixed.
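A minimal sketch of lowering that limit, assuming the 0.6-era `ray.init()` keywords mentioned in this thread (the specific values are arbitrary examples):

```python
import ray

ray.init(
    redis_max_memory=10**9,         # cap Redis (task/object metadata) at ~1 GB
    object_store_memory=5 * 10**9,  # cap the plasma object store at ~5 GB
)
```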
I think I finally figured out what was sending us down misleading paths of investigation. The train/sample processes had the biggest memory usage and their shared memory was growing. Shared memory kept growing until it leveled off at ~10 GB (we often didn't even wait until it leveled). We thought those processes and shared memory were the culprits, and we were doing everything around fixing those, like increasing object store memory. Accidentally, while trying a bunch of random stuff, after changing object store memory to 3 GB, the shared memory of these processes leveled off faster, at ~2.5 GB. When we saw these processes were not growing, we took a look at what was, and that was redis-server. So there were two separate issues:
I tried your suggestion, starting with
I created a repro. You could run this with my remote worker changes to reproduce the issue on the Pong environment: pong_experiment. Redis memory is limited to 100 MB; very quickly it reaches ~130 MB and hangs. If you change `remote_worker_envs` to False, it works forever.
Redis is used to store object metadata, and the object store is used to store object data in shared memory. So the Redis memory usage will be proportional to the number of remote tasks launched, which would naturally be much higher when each step of the environment creates a new object. I am also able to reproduce the hang on a branch with remote envs, and filed #3997. It's possible we will discover a fix soon, but the only feasible workaround at the moment is to not use Ray tasks for the small step()s (i.e., use https://github.com/openai/baselines/blob/master/baselines/common/vec_env/subproc_vec_env.py for the remote env wrapper).

Update: looks like we understand the root cause and it should be fixed soon.
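For illustration, a minimal sketch in the spirit of that workaround (not RLlib's actual env wrapper): step the environment in an ordinary subprocess over a pipe, so no Ray task, and hence no Redis metadata, is created per step. `ToyEnv` is a hypothetical stand-in for a real environment:

```python
import multiprocessing as mp

class ToyEnv:
    """Trivial placeholder environment with a gym-like reset/step interface."""
    def reset(self):
        return 0.0
    def step(self, action):
        return 0.0, 1.0, False, {}  # obs, reward, done, info

def _worker(pipe, env_fn):
    # Child process: owns the env and serves reset/step requests over the pipe.
    env = env_fn()
    while True:
        cmd, data = pipe.recv()
        if cmd == "reset":
            pipe.send(env.reset())
        elif cmd == "step":
            pipe.send(env.step(data))
        elif cmd == "close":
            pipe.close()
            break

class SubprocEnv:
    """Runs one env in a child process; steps go over a pipe, not Ray tasks."""
    def __init__(self, env_fn):
        self._parent, child = mp.Pipe()
        self._proc = mp.Process(target=_worker, args=(child, env_fn), daemon=True)
        self._proc.start()
    def reset(self):
        self._parent.send(("reset", None))
        return self._parent.recv()
    def step(self, action):
        self._parent.send(("step", action))
        return self._parent.recv()
    def close(self):
        self._parent.send(("close", None))
        self._proc.join()

if __name__ == "__main__":
    env = SubprocEnv(ToyEnv)
    obs = env.reset()
    obs, reward, done, info = env.step(0)
    env.close()
```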
Great! We're waiting for the fix. |
Hey @bjg2 it should be fixed now. |
Hey, thanks! I tried it out. It no longer seems to hang when Redis gets to the limit as it used to, but it doesn't seem to enforce the limit either. Using
Here is the repro: https://pastebin.com/ktRfq5ZY
That leak should be fixed now too. |
Hey Eric, I can confirm Redis is not growing anymore. Our SC2 training is finally working! We'll stress it more in the next few days, but the issue seems resolved! Thank you very much!
System information
Describe the problem
Memory usage (resident/shared memory) in the sampler and trainer grows constantly while training IMPALA. Very quickly (within a few minutes) it fills all the RAM.
Some important remarks on how I use RLlib:
With the usage explained, the `ray_PolicyEvaluator:sample()` and `ray_ImpalaAgent:train()` processes are just getting bigger and bigger (they take more and more resident/shared memory), as shown in this system manager screenshot.

Am I doing something wrong? Is it doing something I don't expect, like keeping old episodes or something? Should I change some config?