
[rllib] Memory usage constantly growing while training IMPALA #3884

Closed
bjg2 opened this issue Jan 28, 2019 · 30 comments · Fixed by #4054
Labels: bug (Something that is supposed to be working; but isn't)

Comments

bjg2 (Contributor) commented Jan 28, 2019

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.1 LTS
  • Ray installed from (source or binary): source, with a few changes
  • Ray version: 0.6.2
  • Python version: Python 3.6.6 :: Anaconda, Inc.
  • Exact command to reproduce: train IMPALA, described in more detail below

Describe the problem

Memory usage (resident/shared memory) in the sampler and trainer processes grows constantly while training IMPALA. Very quickly (within a few minutes) it fills all the RAM.

Some important remarks on how I use RLlib:

  • I'm training SC2 via pysc2 in my custom-created env
  • My neural net is fairly big, as are the observations, and batching observations from multiple environments for inference is important for me - it improves performance a lot. Training is done with a small number of workers (currently 1 on my machine) and many environments per worker (currently 10 for testing).
  • SC2 env communication is slow compared to Atari environments and is done via ports, so I implemented creating the environments in separate remote processes and doing step/reset on them in parallel in the policy evaluator.

With the usage explained above, the ray_PolicyEvaluator:sample() and ray_ImpalaAgent:train() processes just keep getting bigger (they take more and more resident/shared memory), as shown in this system monitor screenshot:
[screenshot]

Am I doing something wrong? Is it doing something I don't expect, like keeping old episodes? Should I change some config?

ericl (Contributor) commented Jan 28, 2019

So it's expected that there will be a lot of shared memory used -- that's probably the Ray object store.

By default the object store memory is capped at min(20GB, 40% of total memory), so you should see it stabilize there. You can also set the object store memory with ray.init(object_store_memory=) to lower it (I don't think IMPALA actually requires that much shared memory, just O(BATCH_SIZE * NUM_WORKERS)).
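For reference, a minimal sketch of capping the object store at init time (the 5 GB value is only an example, and the argument is in bytes):

import ray

# A sketch: cap the shared-memory object store at ~5 GB instead of the
# default min(20 GB, 40% of total memory); the value is in bytes.
ray.init(object_store_memory=5 * 10**9)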

bjg2 (Contributor, Author) commented Jan 29, 2019

Setting object_store_memory=int(5e9) in ray.init() didn't help - memory was rising as fast as before, and RLlib hangs after a few minutes, with memory usage as below.
[screenshot]

Without it, it hangs as well, a few minutes later, with memory usage as below (using a bit more memory):
[screenshot]

If you wait a few minutes after it hangs, it starts throwing these errors in a loop:

I0129 12:12:36.829390 33426 node_manager.cc:1790] Resubmitting task 000000000b3c42d45dced6236f781d17e7fd34e6 on client 86d28c0403aefea0128331a2f637e9cbb87c0762
W0129 12:12:36.829485 33426 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.
I0129 12:12:46.829478 33426 node_manager.cc:1790] Resubmitting task 000000000b3c42d45dced6236f781d17e7fd34e6 on client 86d28c0403aefea0128331a2f637e9cbb87c0762
W0129 12:12:46.829512 33426 node_manager.cc:1230] Task 000000000b3c42d45dced6236f781d17e7fd34e6 already in lineage cache. This is most likely due to reconstruction.

This one is a real blocker and stops me from using RLlib at the moment. Can you point me to steps I can take to mitigate the issue or to debug it properly?

ericl (Contributor) commented Jan 29, 2019

Hmm, I would try to reproduce this without SC2, or otherwise isolate the cause (if there is a reproduction script, I can help track down the issue as well).

bjg2 (Contributor, Author) commented Jan 29, 2019

Ok, I think I created a repro:

virtualenv -p python3 venv
source venv/bin/activate
pip install ray
pip install ray[rllib]
pip install psutil
pip install tensorflow-gpu
rllib train --run IMPALA --ray-object-store-memory 5000000000 --env PongDeterministic-v4 --config '{"num_workers": 1, "num_gpus": 0.5, "num_gpus_per_worker":0.5, "num_envs_per_worker": 64, "num_cpus_per_worker":11}'

The object store memory limit is there so that it hits the limit faster. The train and sample processes grow until they take ~6 GB of RAM each (within a few minutes), the app hangs, and it eventually crashes (though the final error seems to be different - no more memory in the plasma store - probably a different wording of the same issue).

ericl self-assigned this Jan 29, 2019
ericl (Contributor) commented Jan 29, 2019

Thanks! I'll be able to take a look later this week.

ericl added the bug and stability-blocker labels Jan 29, 2019
robertnishihara (Collaborator) commented Jan 29, 2019 via email

bjg2 (Contributor, Author) commented Jan 30, 2019

How much memory does the machine have?

It has 32 GB, with ~6 GB used before starting RLlib; some of that is taken by the SC2 environments (~0.5 GB per environment), and Ray workers have some footprint as well (even unused ones, ~300 MB). But the trainer and sampler processes grow constantly until the crash, which comes even faster if object store memory is limited, while the other processes use memory at a steady level.

Are the changes you made to the Ray source code possibly relevant (what do they do)?

No changes to the Ray core. In RLlib, the only important change is that each environment is a remote worker itself, but I don't think that's relevant. In my last post, I made a repro with vanilla Ray, installed via pip, with no changes.

You can also try setting --ray-redis-max-memory=5000000000, though that probably isn't the issue.

Tried. Didn't help.

ericl (Contributor) commented Feb 3, 2019

I was able to reproduce this, and #3938 fixes the issue for me. After adding the copy, memory usage seems to stabilize once the object store memory limit is reached. Could you try it out?

I was running Pong with rllib train --run IMPALA --env PongDeterministic-v4 --config '{"num_workers": 1, "num_envs_per_worker": 64}' --ray-object-store-memory=1000000000.

bjg2 (Contributor, Author) commented Feb 4, 2019

Hey Eric, I installed the Ray repo from a git clone with your fix, and it seems the fix doesn't work for me.

Both my command (rllib train --run IMPALA --ray-object-store-memory 5000000000 --env PongDeterministic-v4 --config '{"num_workers": 1, "num_gpus": 0.5, "num_gpus_per_worker":0.5, "num_envs_per_worker": 64, "num_cpus_per_worker":11}') and yours (rllib train --run IMPALA --env PongDeterministic-v4 --config '{"num_workers": 1, "num_envs_per_worker": 64}' --ray-object-store-memory=1000000000) break after a few minutes, either with object does not fit in the plasma store or with something like:

I0204 14:24:17.736359 121399 node_manager.cc:1790] Resubmitting task 00000000930b03f67c2dcee7022eeaf9c46a1d39 on client cbd6f7b1ddbbf08a4488ef1023bb0490acb7f719
W0204 14:24:17.736397 121399 node_manager.cc:1230] Task 00000000930b03f67c2dcee7022eeaf9c46a1d39 already in lineage cache. This is most likely due to reconstruction.
W0204 14:24:17.736407 121399 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.

Can we iterate on this issue, as it seems it still persists? Can you also share the basics of how Ray manages memory so I can try to understand it, since it seems to me that process memory is ever-growing for each Ray process?

ericl reopened this Feb 4, 2019
ericl (Contributor) commented Feb 4, 2019

Sure, basically when you return objects from remote ray workers, they are copied into a shared memory object store.

When processes call ray.get() on objects, they get a memory mapped view of the object when possible. When the object store memory is full, objects are evicted in LRU order.

Objects can't be evicted, though, if they are referenced by some Python object (e.g. a numpy array holding a view of the shared memory).
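A small illustration of that pinning behavior (a sketch, not code from this thread):

import numpy as np
import ray

ray.init(object_store_memory=10**9)

@ray.remote
def rollout():
    # The returned array is copied into the shared-memory object store.
    return np.zeros((1000, 1000), dtype=np.float32)

batch = ray.get(rollout.remote())  # read-only, memory-mapped view when possible

# As long as `batch` (or any view of it) is alive in this process, the
# underlying object stays pinned in the object store and cannot be
# evicted, even under LRU pressure.
view = batch[:10]
del batch, view  # dropping all references makes the object evictable again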

ericl (Contributor) commented Feb 4, 2019

I can also reproduce what you mentioned now -- a hang, and then

I0204 10:25:24.883201 26613 node_manager.cc:1790] Resubmitting task 0000000079c33f09947fa0f6931718f77cc49a86 on client efb5dbf89abe70bf940c2a3a293a157503cc5316
W0204 10:25:24.883232 26613 node_manager.cc:1230] Task 0000000079c33f09947fa0f6931718f77cc49a86 already in lineage cache. This is most likely due to reconstruction.
W0204 10:25:24.883247 26613 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.

@stephanie-wang @robertnishihara do you have any insight into what's going on here? I don't know why we are getting reconstruction on a single node.

Also, there is another potential issue with that benchmark -- doing some back-of-the-envelope math, the peak memory usage is actually 20GB:

>>> obs_size = 80*80*4*4
>>> train_batch_size = 500*obs_size       # assuming the default train_batch_size of 500, in bytes here
>>> sample_batch_size = 200*obs_size*64   # num_envs_per_worker=64
>>> queue_size = max(train_batch_size, sample_batch_size)*16  # LEARNER_QUEUE_MAX_SIZE=16
>>> queue_size / 1e9
20.97152

The 16 constant comes from https://github.com/ray-project/ray/blob/master/python/ray/rllib/optimizers/async_samples_optimizer.py#L27

However, even reducing num_envs_per_worker to 1 still causes the error (./train.py --run IMPALA --env PongDeterministic-v4 --config '{"num_workers": 1, "num_envs_per_worker": 1, "sample_batch_size": 200, "num_gpus": 0}'), so I will keep looking into it.

ericl (Contributor) commented Feb 4, 2019

So the following trains for me up to 150K timesteps, can you confirm?
./train.py --run IMPALA --env PongDeterministic-v4 --config '{"num_workers": 1, "num_gpus": 0}' --ray-object-store-memory=5000000000

Memory seems to reach steady state at ~4GB shared.

27990 eric      20   0 7895676 4.393g 4.170g S  11.8 38.4   0:40.05 ray_Impala+ 
27991 eric      20   0 6922052 4.390g 4.172g R 121.6 38.4   6:56.36 ray_Policy+ 

One thing I did do was comment out the compute_apply() block in async_samples_optimizer, so that it could run quickly on my laptop, but that part seems unrelated to the leak.

So, a couple of observations:

  • There seems to be some "minimum memory" required, between 2-5GB. This might be due to memory fragmentation or other issues.
  • num_envs_per_worker will inflate the queue size, since the returned batches will be of size num_envs_per_worker * sample_batch_size, increasing the peak memory usage. One option when using vectorization is to also reduce sample_batch_size proportionally (see the sketch below).
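A tiny sketch of that scaling rule (the helper name and the base value of 200 are illustrative, not RLlib API):

# Hypothetical helper illustrating the second point: keep the number of
# rows per returned batch roughly constant as vectorization increases.
def scaled_sample_batch_size(base_rollout_length, num_envs_per_worker):
    # Each batch returned by a worker has num_envs_per_worker * sample_batch_size
    # rows, so shrink the per-env rollout length proportionally.
    return max(1, base_rollout_length // num_envs_per_worker)

config = {
    "num_envs_per_worker": 64,
    "sample_batch_size": scaled_sample_batch_size(200, 64),  # 200 -> 3 rows per env
}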

stephanie-wang (Contributor) commented

I can also reproduce what you mentioned now -- a hang, and then

I0204 10:25:24.883201 26613 node_manager.cc:1790] Resubmitting task 0000000079c33f09947fa0f6931718f77cc49a86 on client efb5dbf89abe70bf940c2a3a293a157503cc5316
W0204 10:25:24.883232 26613 node_manager.cc:1230] Task 0000000079c33f09947fa0f6931718f77cc49a86 already in lineage cache. This is most likely due to reconstruction.
W0204 10:25:24.883247 26613 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.

@stephanie-wang @robertnishihara do you have any insight into what's going on here? I don't know why we are getting reconstruction on a single node.

This is almost definitely happening due to object eviction, but I thought that we had fixed that in #3490, which throws an error to the application if an evicted object is needed. I believe that either there is a bug in that fix, or the application is somehow not handling the error.

bjg2 (Contributor, Author) commented Feb 5, 2019

So the following trains for me up to 150K timesteps, can you confirm?

Yes, I can. Shared memory stays at ~4.2GB.

I'm not sure I understand the vectorization issue. The thing is that RLlib works OK for a minute or two - it is learning - but it uses ever more memory and then crashes once all the memory is used. It manages to execute the learner several times, so it doesn't look like the learning data takes an unacceptable amount of RAM, but rather that it is not being freed...

ericl (Contributor) commented Feb 5, 2019 via email

ericl (Contributor) commented Feb 5, 2019

I think one thing we can do is make the learner queue size configurable, since a queue size of 16 batches might be too large for many cases.

bjg2 (Contributor, Author) commented Feb 6, 2019

Aha, I understand your idea now. Still, I don't think the learner queue ever gets full.

I was debugging some more, with this change to LearnerThread (prints added)

def step(self):
    print('before', self.inqueue.qsize())
    with self.queue_timer:
        batch, _ = self.minibatch_buffer.get()

    with self.grad_timer:
        fetches = self.local_evaluator.compute_apply(batch)
        self.weights_updated = True
        self.stats = fetches.get("stats", {})
    print('after', self.inqueue.qsize())

    self.outqueue.put(batch.count)
    self.learner_queue_size.push(self.inqueue.qsize())

Running this

rllib train --run IMPALA --ray-object-store-memory 5000000000 --env PongDeterministic-v4 --config '{"num_workers": 1, "num_gpus": 0.5, "num_gpus_per_worker":0.5, "num_envs_per_worker": 64, "num_cpus_per_worker":11}'

it never printed a queue size greater than 1 (before/after), and ended up with:

before 0
2019-02-06 11:35:27,395 WARNING metrics.py:61 -- WARNING: 1 workers have NOT returned metrics
after 0
before 0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0206 11:36:37.398226 91278 node_manager.cc:1790] Resubmitting task 000000009a457efa0a34886410bbe3bcbc3f5ea9 on client 706a70588175087bd56148f08457ddeb2dc4d706
W0206 11:36:37.398291 91278 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.
I0206 11:36:47.398284 91278 node_manager.cc:1790] Resubmitting task 000000009a457efa0a34886410bbe3bcbc3f5ea9 on client 706a70588175087bd56148f08457ddeb2dc4d706
W0206 11:36:47.398316 91278 node_manager.cc:1230] Task 000000009a457efa0a34886410bbe3bcbc3f5ea9 already in lineage cache. This is most likely due to reconstruction.
W0206 11:36:47.398329 91278 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.
I0206 11:36:57.398454 91278 node_manager.cc:1790] Resubmitting task 000000009a457efa0a34886410bbe3bcbc3f5ea9 on client 706a70588175087bd56148f08457ddeb2dc4d706
W0206 11:36:57.398486 91278 node_manager.cc:1230] Task 000000009a457efa0a34886410bbe3bcbc3f5ea9 already in lineage cache. This is most likely due to reconstruction.
W0206 11:36:57.398497 91278 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.
I0206 11:37:07.398537 91278 node_manager.cc:1790] Resubmitting task 000000009a457efa0a34886410bbe3bcbc3f5ea9 on client 706a70588175087bd56148f08457ddeb2dc4d706
W0206 11:37:07.398568 91278 node_manager.cc:1230] Task 000000009a457efa0a34886410bbe3bcbc3f5ea9 already in lineage cache. This is most likely due to reconstruction.

When running the same experiment but with an object store memory of 25 GB, it again never queues more than one learner batch, but it doesn't break and seems to stabilize memory usage:

rllib train --run IMPALA --ray-object-store-memory 25000000000 --env PongDeterministic-v4 --config '{"num_workers": 1, "num_gpus": 0.5, "num_gpus_per_worker":0.5, "num_envs_per_worker": 64, "num_cpus_per_worker":11}'

And when I start our own SC2 IMPALA training with 1 worker and 10 envs inside it, it dies after ~1 hour with More than 95% of the memory on node bjg-desktop is used (32.05 / 33.64 GB). Like the others, it never has any queued learner batches. 12.8 GB is used before I start training and 22.8 GB after the first few learning steps; the other 10 GB accumulate slowly over the hour.

If I try the same thing with 15 GB of object store memory, it breaks after a few minutes with something like:

I0206 14:04:38.422194 128948 node_manager.cc:1790] Resubmitting task 00000000525e7b956187e732337d0c4cb8b923fc on client 4ae6e110c07bc1f041c76588b285edc2cb14e015
W0206 14:04:38.422307 128948 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.

Conclusion: we still haven't managed to get our training running.
What I noticed is that each time the "task was resubmitted" error occurred, these warnings appeared right before it:

2019-02-06 14:03:28,418 WARNING metrics.py:61 -- WARNING: 1 workers have NOT returned metrics
WARNING: Logging before InitGoogleLogging() is written to STDERR

PS. I really have trouble understanding the memory usage. Are old objects evicted from the object store only when another object needs to get in and there's no more memory left? Can we change some setting so that object store cleanup happens more often, even before it is full? That way, RAM usage should stabilize much faster.

Also, look at the example below - I don't understand what resident memory is, what shared memory is, or how much memory each process uses; everything seems confusing. How can a process have resident memory bigger than the whole RAM usage of the computer?

[screenshots]

ericl (Contributor) commented Feb 6, 2019

I don't understand what resident memory is, what shared memory is, or how much memory each process uses; everything seems confusing. How can a process have resident memory bigger than the whole RAM usage of the computer?

I think the resident memory is misleading here -- the memory actually used by each process is just "Memory", and the object store shows up as "Shared Mem". So in the screenshot there is ~16GB used by the object store, and ~1.4GB and ~1.1GB used by the two processes.

I0206 11:36:37.398226 91278 node_manager.cc:1790] Resubmitting task 000000009a457efa0a34886410bbe3bcbc3f5ea9 on client 706a70588175087bd56148f08457ddeb2dc4d706
W0206 11:36:37.398291 91278 node_manager.cc:1263] A task was resubmitted, so we are ignoring it. This should only happen during reconstruction.

Will look into this, it seems related to #3490

PS. I really have trouble understanding the memory usage. Are old objects evicted from the object store only when another object needs to get in and there's no more memory left?

That's right, it's LRU order once full.

Can we change some setting so that object store cleanup happens more often, even before it is full? That way, RAM usage should stabilize much faster.

I agree some sort of more eager cleanup would be nice here. It's challenging since the object store contains objects from multiple nodes, so you have to do distributed reference counting. In addition, it is not always possible to shrink the shared memory region, depending on memory fragmentation (which can itself cause "leaks" if memory is fragmented into chunks that are too small).

ericl (Contributor) commented Feb 7, 2019

I'm able to reproduce the above errors in #3970. In that example script, if you change the BYTES_IN_FLIGHT parameter to be less than the object store memory (e.g., 300MB), then it runs OK forever. Otherwise, you get the errors about resubmission (which are normal when the object store is out of memory, just misleading).

I think the easiest path forward is to greatly reduce the sample batch size, i.e., ensure that obs_size_bytes * max(train_batch_size, sample_batch_size * num_envs_per_worker) * LEARNER_QUEUE_SIZE << object_store_size.
Alternatively, increase the available memory.
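A rough numeric check of that inequality (every size below is an illustrative assumption, not a measurement from this issue):

# Rough sanity check of the rule of thumb above; all numbers are illustrative.
obs_size_bytes = 80 * 80 * 4 * 4          # 80x80 obs, 4 stacked frames, 4 bytes each
sample_batch_size = 50
num_envs_per_worker = 64
train_batch_size = 500
learner_queue_size = 16                   # LEARNER_QUEUE_MAX_SIZE

rows_per_queue_item = max(train_batch_size,
                          sample_batch_size * num_envs_per_worker)
peak_bytes = obs_size_bytes * rows_per_queue_item * learner_queue_size
print(peak_bytes / 1e9, "GB")             # ~5.2 GB -- keep this well below the object store size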

The slowly increasing shared memory usage is harder to explain, but could be memory fragmentation in the object store. I don't think it's a simple leak since otherwise you'd be able to reproduce with the Pong example.

ericl (Contributor) commented Feb 7, 2019

Note that train_batch_size could also increase shared memory usage, since the code buffers sample batches until the target train_batch_size is reached. So you want to consider max(train_batch_size, sample_batch_size * num_envs_per_worker) as the memory usage per item in the learner queue.

ericl (Contributor) commented Feb 7, 2019

Ah, this might be it: the replay buffer is filled with up to 100 items even when replay_proportion = 0.0 🤦‍♂️

#3971

After this fix, ./train.py --run IMPALA --env PongDeterministic-v4 --config '{"num_workers": 1, "num_gpus": 0, "num_envs_per_worker": 64, "sample_batch_size": 50, "train_batch_size": 50, "learner_queue_size": 1}' --ray-object-store-memory=500000000 (500MB) is stable for me.

bjg2 (Contributor, Author) commented Feb 7, 2019

Maybe I haven't emphasized enough that I don't think it is possible we're using that much memory. The obs for SC2 have the same size as the ones for Pong. The math for our SC2 training:

each environment observation is ~100 KB
behaviour_logits are ~10 KB
everything else that goes into an episode learning step is much smaller
single_sample_size_kb ~ obs + new_obs + behaviour_logits = 210 KB
num_workers = 1
num_envs_per_worker = 16
sample_batch_size = 50
train_batch_size = num_envs_per_worker * sample_batch_size (set to this value) = 800
train_batch_size_mb = 800 * 210 KB ~ 164 MB

Let's say there is another batch in the making while one is training (we never saw more than 1 batch queued to the learner) - that's ~320 MB. Also, our network weights are 2.9 MB, and the Pong weights are 4 MB (checked). So memory usage for the learning data should be less than 0.5 GB. It should behave similarly to Pong and shouldn't eat much more RAM or grow constantly...
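For reference, the same arithmetic as a quick script (the numbers are the rough figures from the list above):

# Reproducing the estimate above; sizes are the rough figures quoted in this comment.
single_sample_kb = 100 + 100 + 10       # obs + new_obs + behaviour_logits ~ 210 KB
num_envs_per_worker = 16
sample_batch_size = 50

train_batch_size = num_envs_per_worker * sample_batch_size   # 800 samples
train_batch_mb = train_batch_size * single_sample_kb / 1024  # ~164 MB
print(train_batch_mb)  # two batches in flight plus a few MB of weights stays under 0.5 GB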

I tried running with 'replay_buffer_num_slots': 0; it helped a bit, but I don't think it helped much. I see a new process growing in memory that I don't think I saw before - redis-server. It seems to be growing all the time. I'm not sure how it fits into the memory usage story we've discussed so far, i.e. object store memory and shared memory in processes.

[screenshot]

Idea: the SC2 difference is due to the remote workers

It seems that the Pong examples are stable now, but SC2 still can't train without using ever more memory. I think I managed to pinpoint the current issue a bit further - this new redis-server process helped a lot. Idea: the redis-server process is growing because the obs from remote workers never get cleaned up.

  • SC2 with remote workers (~550 samples per second): redis-server process grows ~2mb/s
  • pong: redis-server process grows ~0.007 mb/s
  • SC2 without remote workers (~145 samples per second): redis-server process grows ~0.007 mb/s

It seems that the remote workers are actually causing the problem. With 550 samples per second, something like 550 * 100 KB of data would be stored in Redis per second, which is ~53 MB; but as I understand it, Redis compresses values before storing them, so it seems possible they are compressed down ~25x, 100 KB -> 4 KB (obs are mostly the same values, highly compressible).

Do you think this might be the case? If so, how can I fix it? If this is indeed the case, you can close this issue and we can continue the discussion in #3968 ([wingman -> rllib] Remote and entangled environments).

ericl (Contributor) commented Feb 7, 2019

If redis-server is the only problem now, that's easy to fix by setting redis_max_memory to a smaller value (it defaults to 5 GB).

Redis is used to organize communication between workers (i.e. task metadata). The reason it's allowed to use a lot of memory is to track task lineage for fault tolerance purposes. 5 GB is probably overkill for your use case, though.

There was also another issue before where all logs got written into Redis and never removed, but that may have been fixed.
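A sketch of lowering that limit at init time (hedged: redis_max_memory as a ray.init() keyword is assumed from its usage later in this thread; both values are examples in bytes):

import ray

# A sketch, assuming the redis_max_memory keyword accepted by ray.init()
# in this Ray version; it caps how much task metadata Redis retains.
ray.init(
    object_store_memory=5 * 10**9,
    redis_max_memory=10**9,
)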

bjg2 (Contributor, Author) commented Feb 8, 2019

I think I finally figured out what was leading us down misleading paths of investigation. The train/sample processes had the biggest memory usage, and their shared memory was growing. It grew until it leveled off at ~10 GB (we often didn't even wait for it to level off). We thought those processes and their shared memory were the culprits, and we kept trying to fix them, e.g. by increasing object store memory. By accident, while trying a bunch of random things, after changing object store memory to 3 GB, the shared memory of these processes leveled off faster, at ~2.5 GB. Once we saw those processes were not growing, we looked at what was - and that was redis-server.

So there were two separate issues:

  • the default object store is too big, eating a huge part of our RAM needlessly instead of recycling it - in our initial tests this seems to have been the biggest cause of the crashes
  • redis-server growing in the remote setup - in our later tests the crashes were due to this process growing at ~2 MB/s

I tried your suggestion, starting with redis_max_memory=int(1e9), but that didn't help. Redis would grow to ~1.3 GB and then hang (learning would stop, RLlib would seem to freeze). Can you explain the difference between the object store and Redis? The object store is not stored in Redis? How come only our remote worker communication ends up filling Redis, while the framework communication (like sending batches from the remote sampler to the local optimizer) does not?

I created a repro. You can run this with my remote worker changes to reproduce the issue on the Pong environment: pong_experiment. Redis memory is limited to 100 MB; it very quickly reaches ~130 MB and then hangs. If you change remote_worker_envs to False, it works forever.

ericl (Contributor) commented Feb 8, 2019

Redis is used to store object metadata, and the object store is used to store object data in shared memory. So the redis memory usage will be proportional to the number of remote tasks launched, which would naturally be much higher when each step of the environment creates a new object.

I am also able to reproduce the hang on a branch with remote envs, and filed: #3997

It's possible we will discover a fix soon, but the only feasible workaround at the moment is to not use Ray tasks for the small step()s (i.e., use https://github.com/openai/baselines/blob/master/baselines/common/vec_env/subproc_vec_env.py for the remote env wrapper).
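For context, a minimal sketch of that kind of subprocess-based env wrapper (class and function names here are hypothetical, not RLlib or baselines API): each env lives in a plain OS process and steps over a pipe, so no Ray task metadata accumulates per step.

import multiprocessing as mp

def _env_worker(pipe, env_fn):
    # Runs in a child process; steps the env on request over the pipe.
    env = env_fn()
    while True:
        cmd, data = pipe.recv()
        if cmd == "step":
            obs, rew, done, info = env.step(data)
            if done:
                obs = env.reset()
            pipe.send((obs, rew, done, info))
        elif cmd == "reset":
            pipe.send(env.reset())
        elif cmd == "close":
            pipe.close()
            break

class SubprocEnv:
    """Hypothetical wrapper: one gym-style env stepped in a subprocess."""

    def __init__(self, env_fn):
        self._pipe, child = mp.Pipe()
        self._proc = mp.Process(target=_env_worker, args=(child, env_fn), daemon=True)
        self._proc.start()

    def reset(self):
        self._pipe.send(("reset", None))
        return self._pipe.recv()

    def step(self, action):
        self._pipe.send(("step", action))
        return self._pipe.recv()

    def close(self):
        self._pipe.send(("close", None))
        self._proc.join()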

Update: looks like we understand the root cause and it should be fixed soon.

bjg2 (Contributor, Author) commented Feb 11, 2019

Great! We're waiting for the fix.

ericl (Contributor) commented Feb 14, 2019

Hey @bjg2 it should be fixed now.

bjg2 (Contributor, Author) commented Feb 14, 2019

Hey, thanks!

Tried it out. It no longer seems to hang when Redis hits the limit as it used to, but it doesn't seem to enforce the limit either. Using a redis_max_memory=int(1e8) limit, it went well over 100 MB and keeps growing:
[screenshot]

Here is the repro: https://pastebin.com/ktRfq5ZY

ericl (Contributor) commented Feb 15, 2019

That leak should be fixed now too.

ericl reopened this Feb 15, 2019
bjg2 (Contributor, Author) commented Feb 18, 2019

Hey Eric, I can confirm Redis is not growing anymore. Our SC2 training is finally working! I will stress it more in the next few days, but the issue seems resolved. Thank you very much!
