[rllib] Memory usage constantly growing while training IMPALA #3884
So it's expected that there will be a lot of shared memory used -- that's probably the Ray object store. By default object store memory is capped to min(20GB, 40% of total memory), so you should see it stabilize there. You can also set the object store memory with `ray.init(object_store_memory=)` to lower it (I don't think IMPALA actually requires that much shared memory, just O(BATCH_SIZE * NUM_WORKERS)).
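For reference, a minimal sketch of what capping the object store can look like (the 1 GB value is an arbitrary example, not a recommendation from this thread):

```python
import ray

# Limit the plasma object store to roughly 1 GB (value in bytes).
ray.init(object_store_memory=10**9)
```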
Hmm, I would try to reproduce this without SC2, or otherwise isolate the cause (if there is a reproduction script, I can help track down the issue as well).
Ok, I think I created a repro:
The object store memory limit is there so that it reaches the limit faster. The train and sample processes grow and grow until they have ~6 GB of RAM each (in a few minutes), the app hangs, and in the end it crashes (though the final error seems different - no more memory in the plasma store - probably a different wording of the same issue).
Thanks! I'll be able to take a look later this week.
How much memory does the machine have? Are the changes you made to the Ray source code possibly relevant (what do they do)? You can also try setting `--ray-redis-max-memory=5000000000`, though that probably isn't the issue.
It has 32 GB, with ~6 GB used before starting RLlib; some of that is taken by the SC2 environments (~0.5 GB per environment), and Ray workers have some footprint as well (even unused ones, ~300 MB). But the trainer and sampler processes keep growing until the crash, which comes even faster if object store memory is limited; the other processes use memory at a steady level.
No changes to the Ray base. In RLlib, the only important change is that each environment is a remote worker itself, but I don't think that's relevant. In the last post, I think I made a repro with vanilla Ray, installed via pip, with no changes.
Tried. Didn't help.
I was able to reproduce this, and #3938 fixes the issue for me. After adding the copy, memory usage seems to stabilize once the object store memory limit is reached. Could you try it out? I was running Pong with
Hey Eric, I installed the Ray repo from a git clone, with your fix, and it seems your fix doesn't work for me. Both my command (
Can we try to iterate on this issue, as it seems it still persists? Can you also share the basics of how Ray manages memory so I can try to understand it? It seems to me that process memory is ever-growing for each Ray process.
Sure, basically when you return objects from remote Ray workers, they are copied into a shared-memory object store. When processes call ray.get() on objects, they get a memory-mapped view of the object when possible. When the object store memory is full, objects are evicted in LRU order. Objects can't be evicted, though, if they are referenced by some Python object (i.e. some numpy array has a view of the shared memory).
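As a rough illustration of that description (a generic sketch, not code from this issue), returning a numpy array from a task places it in the shared-memory store, and `ray.get()` hands back a memory-mapped view:

```python
import numpy as np
import ray

ray.init()

@ray.remote
def produce_batch():
    # The returned array is copied into the shared-memory object store.
    return np.zeros((1000, 1000), dtype=np.float32)

batch = ray.get(produce_batch.remote())  # read-only view backed by shared memory
# While `batch` is referenced, the underlying object cannot be evicted;
# dropping the reference lets the store reclaim it under LRU pressure.
del batch
```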
I can also reproduce what you mentioned now -- a hang, and then
@stephanie-wang @robertnishihara do you have any insight into what's going on here? I don't know why we are getting reconstruction on a single node. Also, there is another potential issue with that benchmark -- doing some back-of-the-envelope math, the peak memory usage is actually 20GB:
The 16 constant comes from https://github.com/ray-project/ray/blob/master/python/ray/rllib/optimizers/async_samples_optimizer.py#L27
However, even reducing `num_envs_per_worker` to 1 still causes the error (
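For intuition, here is a hypothetical version of that back-of-the-envelope estimate; the actual numbers from the comment were not preserved, so all the values below are illustrative assumptions, not the issue's real configuration:

```python
obs_bytes = 100 * 1024        # assumed ~100 KB per observation
sample_batch_size = 500       # assumed
num_envs_per_worker = 25      # assumed
learner_queue_size = 16       # the constant referenced above

per_batch = obs_bytes * sample_batch_size * num_envs_per_worker  # ~1.3 GB
peak = per_batch * learner_queue_size                            # ~20 GB once the queue fills
print(f"per batch: {per_batch / 1e9:.2f} GB, queued peak: {peak / 1e9:.1f} GB")
```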
So the following trains for me up to 150K timesteps; can you confirm? Memory seems to reach a steady state at ~4 GB shared.
One thing I did do was comment out the `compute_apply()` block in async_samples_optimizer, so that it could run quickly on my laptop, but that part seems unrelated to the leak. So, a couple of observations:
This is almost definitely happening due to object eviction, but I thought that we had fixed that in #3490, which throws an error to the application if an evicted object is needed. I believe that either there is a bug in that fix, or the application is somehow not handling the error.
Yes, I can. Shared memory stays at ~4.2 GB. I'm not sure I understand the vectorization issue. The thing is that RLlib works OK for a minute or two (it is learning, but using ever more memory) and then crashes as it uses all the memory. It manages to execute the learner several times, so it doesn't look like the learning data takes an unacceptably large share of the RAM, but rather that it is not being freed...
The issue is just that the memory used is too much, since the batches are of size `num_envs_per_worker * sample_batch_size` (see the size estimates). In those cases there is no leak; just materializing the arrays in memory costs 20 GB already. You might see some learning before the internal queue fills up. I think you just need to reduce the sample batch size and/or increase the memory allocation.
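A minimal sketch of that suggestion, assuming the 0.6-era Tune/RLlib API and arbitrary example values (not the config used in this thread):

```python
import ray
from ray import tune

ray.init(object_store_memory=10 * 10**9)  # e.g. allow ~10 GB of shared memory

tune.run_experiments({
    "pong-impala-small-batches": {
        "run": "IMPALA",
        "env": "PongNoFrameskip-v4",
        "config": {
            "num_workers": 2,
            "num_envs_per_worker": 1,   # fewer envs per worker -> smaller sample batches
            "sample_batch_size": 50,    # smaller rollout fragments
            "train_batch_size": 500,    # keep the learner batch modest as well
        },
    },
})
```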
I think one thing we can do is make the learner queue size configurable, since a queue size of 16 batches might be too large for many cases.
Aha, so I understand your idea now. Still, I don't think the learner queue ever gets full. I was debugging some more, with this change to LearnerThread (prints added):
Running this
it never printed more than 1 (after/before), and ended up with:
When running the same experiment but with object store memory at 25 GB, it again doesn't queue learner batches, it doesn't break, and memory usage seems to stabilize.
And when I start our own SC2 IMPALA training with 1 worker and 10 envs inside it, it dies after ~1 hour, with
If I try the same thing with 15 GB of object store memory, it breaks after a few minutes with something like:
Conclusion: we still haven't managed to get our own training running.
PS. I really have trouble understanding the memory usage. Old objects are evicted from the object cache only when another object needs to get in and there's no more memory inside? Can we change some setting so that object cache cleanup is done more often, even before it is full? That way, RAM usage should stabilize much faster? Also, look at the example below - I don't understand what resident memory is, what shared memory is, or which process uses how much memory; everything seems confusing. How can a process have resident memory bigger than the whole RAM usage of the computer?
I think the resident memory is misleading here -- the memory used by the process is just "Memory", and the object store memory is "Shared Mem". So in the screenshot there is ~16GB used by the object store, and ~1.4GB, ~1.1GB used by the processes.
Will look into this, it seems related to #3490
That's right, it's LRU order once full.
I agree some sort of more eager cleanup would be nice here. It's challenging since the object store contains objects from multiple nodes, so you have to do distributed reference counting. In addition, it is not always possible to shrink the shared memory region, depending on memory fragmentation (which can also cause "leaks" by itself if memory is fragmented into chunks that are too small).
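On the resident vs. shared memory confusion a few comments up, a small generic sketch (not from this thread) of separating the two per process with psutil on Linux:

```python
import psutil

for proc in psutil.process_iter(["pid", "name"]):
    name = proc.info["name"] or ""
    if "ray" not in name:
        continue
    try:
        mem = proc.memory_info()  # on Linux this includes both rss and shared
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
    private = mem.rss - mem.shared  # rough private (non-shared) footprint
    print(f"{proc.info['pid']:>7} {name}: rss={mem.rss/1e6:.0f} MB "
          f"shared={mem.shared/1e6:.0f} MB private~{private/1e6:.0f} MB")
```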
I'm able to reproduce the above errors in #3970. I think the easiest path forward is to greatly reduce the sample batch size, i.e., ensure that
The slowly increasing shared memory usage is harder to explain, but could be memory fragmentation in the object store. I don't think it's a simple leak, since otherwise you'd be able to reproduce it with the Pong example.
Note that `train_batch_size` could also increase shared memory usage, since the code buffers sample batches until the target `train_batch_size` is reached. So you want to consider
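A hypothetical sanity check along those lines (the exact expression from the comment was not preserved, and every value below is an assumption):

```python
obs_bytes = 100 * 1024      # assumed per-step observation size
sample_batch_size = 50
num_envs_per_worker = 1
num_workers = 2
train_batch_size = 500
learner_queue_size = 16

in_flight = obs_bytes * sample_batch_size * num_envs_per_worker * num_workers
buffered = obs_bytes * train_batch_size  # samples held until train_batch_size is reached
queued = obs_bytes * sample_batch_size * num_envs_per_worker * learner_queue_size
total = in_flight + buffered + queued
print(f"rough shared-memory estimate: {total / 1e6:.0f} MB")
```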
Ah, this might be it: the replay buffer is filled with up to
After this fix,
Maybe I haven't emphasized enough that I don't think it is possible that we're using that much memory. The obs for SC2 has the same size as the one for Pong. Math for our SC2 learning:
Let's say that there is another batch in the making while one is training (we never saw more than 1 batch queued to the learner); that is ~320 MB. Also, our network weights are 2.9 MB, and the Pong weights are 4 MB (checked). So memory usage for the learning data should be less than 0.5 GB. It should behave similarly to Pong and shouldn't eat much more RAM or grow constantly... I tried running with `'replay_buffer_num_slots': 0`; it helped a bit, but I don't think it helped much. I see a new process growing in memory that I don't think I saw before - redis-server.

Idea: the SC2 difference is due to the remote workers. It seems that the Pong examples are stable now, but SC2 still can't learn without using ever more memory. I think I managed to pinpoint the current issue I face a bit more - this new redis-server process helped a lot.

Idea: the redis-server process is growing because the obs from the remote workers never get cleaned up. It seems that the remote workers are actually creating the problem. With 550 samples per second, something like 550 * 100 KB of data should be stored in Redis per second, which is ~53 MB, but as I understand it Redis compresses values before storing them, so it seems possible they are compressed down ~25x (100 KB -> 4 KB; obs are mostly the same values, highly compressible). Do you think this might be the case? If yes, how can I fix that issue? If this is indeed the case, you can close this issue and we can continue the discussion in #3968 ([wingman -> rllib] Remote and entangled environments).
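The arithmetic in that guess, spelled out (values as stated in the comment; the ~25x compression ratio is the commenter's own assumption):

```python
samples_per_sec = 550
obs_kb = 100
raw_mb_per_sec = samples_per_sec * obs_kb / 1024   # ~53.7 MB/s uncompressed
compressed_mb_per_sec = raw_mb_per_sec / 25        # ~2.1 MB/s if compressed ~25x
print(f"raw: {raw_mb_per_sec:.1f} MB/s, compressed: {compressed_mb_per_sec:.1f} MB/s")
```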
If redis-server is the only problem now, that's easy to fix by setting `redis_max_memory` to a smaller value (it defaults to 5 GB). Redis is used to organize communication between workers (i.e. task metadata). The reason it's allowed to use a bunch of memory is to track task lineage for fault-tolerance purposes. 5 GB is probably overkill for your use case though. There was also another issue before where all logs got written into Redis and not removed, but that may have been fixed.
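A minimal sketch of lowering that limit, assuming the 0.6-era `ray.init()` keywords mentioned in this thread (the specific values are arbitrary examples):

```python
import ray

ray.init(
    redis_max_memory=10**9,         # cap Redis (task/object metadata) at ~1 GB
    object_store_memory=5 * 10**9,  # cap the plasma object store at ~5 GB
)
```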
I think I finally figured out what was sending us down misleading paths of investigation. The train/sample processes had the biggest memory usage and their shared memory was growing. Shared memory kept growing until it leveled off at ~10 GB (we often didn't even wait until it leveled). We thought those processes and shared memory were the culprits, and we were doing everything around fixing those, like increasing object store memory. Accidentally, while trying a bunch of random stuff, after changing object store memory to 3 GB, the shared memory of these processes leveled off faster, at ~2.5 GB. When we saw these processes were not growing, we took a look at what was, and that was redis-server. So there were two separate issues:
I tried your suggestion, starting with
I created a repro. You could run this with my remote worker changes to reproduce the issue on the Pong environment: pong_experiment. Redis memory is limited to 100 MB; very quickly it reaches ~130 MB and hangs. If you change `remote_worker_envs` to False, it works forever.
Redis is used to store object metadata, and the object store is used to store object data in shared memory. So the Redis memory usage will be proportional to the number of remote tasks launched, which would naturally be much higher when each step of the environment creates a new object. I am also able to reproduce the hang on a branch with remote envs, and filed #3997. It's possible we will discover a fix soon, but the only feasible workaround at the moment is to not use Ray tasks for the small step()s (i.e., use https://github.com/openai/baselines/blob/master/baselines/common/vec_env/subproc_vec_env.py for the remote env wrapper).

Update: looks like we understand the root cause and it should be fixed soon.
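For illustration, a minimal sketch in the spirit of that workaround (not RLlib's actual env wrapper): step the environment in an ordinary subprocess over a pipe, so no Ray task, and hence no Redis metadata, is created per step. `ToyEnv` is a hypothetical stand-in for a real environment:

```python
import multiprocessing as mp

class ToyEnv:
    """Trivial placeholder environment with a gym-like reset/step interface."""
    def reset(self):
        return 0.0
    def step(self, action):
        return 0.0, 1.0, False, {}  # obs, reward, done, info

def _worker(pipe, env_fn):
    # Child process: owns the env and serves reset/step requests over the pipe.
    env = env_fn()
    while True:
        cmd, data = pipe.recv()
        if cmd == "reset":
            pipe.send(env.reset())
        elif cmd == "step":
            pipe.send(env.step(data))
        elif cmd == "close":
            pipe.close()
            break

class SubprocEnv:
    """Runs one env in a child process; steps go over a pipe, not Ray tasks."""
    def __init__(self, env_fn):
        self._parent, child = mp.Pipe()
        self._proc = mp.Process(target=_worker, args=(child, env_fn), daemon=True)
        self._proc.start()
    def reset(self):
        self._parent.send(("reset", None))
        return self._parent.recv()
    def step(self, action):
        self._parent.send(("step", action))
        return self._parent.recv()
    def close(self):
        self._parent.send(("close", None))
        self._proc.join()

if __name__ == "__main__":
    env = SubprocEnv(ToyEnv)
    obs = env.reset()
    obs, reward, done, info = env.step(0)
    env.close()
```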
Great! We're waiting for the fix. |
Hey @bjg2 it should be fixed now. |
Hey, thanks! I tried it out. It no longer seems to hang when Redis gets to the limit as it used to, but it doesn't seem to enforce the limit either. Using
Here is the repro: https://pastebin.com/ktRfq5ZY
That leak should be fixed now too. |
Hey Eric, I can confirm Redis is not growing anymore. Our SC2 training is finally working! We'll stress it more in the next few days, but the issue seems resolved! Thank you very much!
System information
Describe the problem
Memory usage (resident/shared memory) in the sampler and trainer grows constantly while training IMPALA. Very quickly (within a few minutes) it fills all the RAM.
Some important remarks on how I use RLlib:
With the usage explained, the `ray_PolicyEvaluator:sample()` and `ray_ImpalaAgent:train()` processes are just getting bigger and bigger (they take more and more resident/shared memory), as shown in this system manager screenshot.

Am I doing something wrong? Is it doing something I don't expect, like keeping old episodes or something? Should I change some config?