Apparent memory leak in PPO training #55
Here's the error:
Actually, I'm not so sure this is a memory error now. I've run it with varying numbers of steps per episode (which effectively varies the overall number of episodes) and it always crashes after 3 RLlib iterations. I'm going to try re-running on the medium mesh as well to see whether the size of the problem has any impact.
Same behavior on the medium-resolution mesh, and almost the same error message appears using the SpinningUp PPO implementation:
So I guess this is not a Ray issue after all. Just for reference, the error appears after 125 epochs of 100 steps each using the medium mesh. Also, it always pops up on line 174, which is here:
I'm going to try a few things:
The control updates definitely are different between the original version (here in ...) and the current one (in ...). If I remember correctly, the former may not have actually been updating the controls, but possibly it didn't crash?
Looks like it was actually much simpler: after some more tracking with the SpinningUp implementation, it seems the issue was just that the solver was eventually diverging. Decreasing the time step seems to have resolved the problem; here's the ...
Sorry, I have a deadline today so I can't comment in depth right now. But this is a known problem, and there are a number of potential approaches to remedy it. It would probably be prudent to implement several of them to protect the user from diverging simulator trajectories. I'll post the references in here once I've managed to get out of the abyss of deadline hell.
Yeah, that would be great! Ideally we should be able to set things up so that if the regular simulation is stable with respect to the CFL condition, then the RL training also won't diverge. I'll leave this open to track.
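As a rough illustration of the kind of safeguard discussed above, here is a minimal sketch of a divergence guard around a gym-style step(); the threshold, penalty, and the 4-tuple step API are assumptions for illustration, not the project's actual interface:

```python
import numpy as np

# Hypothetical values; a real guard would tie these to the physics/CFL limits.
DIVERGENCE_THRESHOLD = 1e6   # cap on the observation norm
DIVERGENCE_PENALTY = -100.0  # reward assigned when an episode is cut short


def guarded_step(env, action):
    """Wrap env.step() and end the episode early if the solver diverges."""
    obs, reward, done, info = env.step(action)  # assumes a gym-style 4-tuple
    diverged = (not np.all(np.isfinite(obs))
                or np.linalg.norm(obs) > DIVERGENCE_THRESHOLD)
    if diverged:
        # Hand the RL library a finite observation and a penalty instead of
        # NaNs/inf, and terminate the episode so training can continue.
        obs = np.zeros_like(obs)
        reward = DIVERGENCE_PENALTY
        done = True
        info["solver_diverged"] = True
    return obs, reward, done, info
```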
Actually, after apparently training successfully once with RLlib, I'm still getting this error (on the main branch):
This doesn't actually seem related to Ray or RL training at all; I can reproduce it just by running the PD control example, which should hopefully make it a bit easier to zero in on.
Alright, this should be fixed now. As best I can tell it was actually a discrepancy between floating-point types? For some reason I was using ... Tested with ...
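The actual dtype fix isn't shown in the thread, but the general pattern is to pin one floating-point type at the boundary between the RL code and the solver; a hypothetical example:

```python
import numpy as np


def apply_control(solver_step, action):
    """Cast the agent's action to float64 before handing it to the solver.

    RL libraries typically emit float32 actions, while a PDE solver usually
    works in float64; mixing the two silently is the kind of discrepancy
    described above. (solver_step is a placeholder, not this repo's API.)
    """
    action = np.asarray(action, dtype=np.float64)
    return solver_step(action)
```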
When running a very simple (serial) PPO training with the ppo_train.py script, the training runs successfully for 3 iterations and then crashes (I'll post the error message later).
I'm not sure whether this is an issue on the Firedrake side or the Ray side. I've run into memory-leak-type behavior with Firedrake before, but there are also a couple of documented instances of this kind of thing with Ray:
Debugging ideas:
- env.reset()
- ray.rllib.algorithms.callbacks.MemoryTrackingCallbacks to track memory usage in Tensorboard (see the sketch below)
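A sketch of how the MemoryTrackingCallbacks idea could be wired up with the Ray 2.x config API; the environment name and iteration count are placeholders (not the Firedrake env), and the callback may require psutil to be installed:

```python
from ray.rllib.algorithms.callbacks import MemoryTrackingCallbacks
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")          # placeholder env, not the Firedrake one
    .callbacks(MemoryTrackingCallbacks)  # records per-worker memory stats
)
algo = config.build()

for i in range(3):
    algo.train()
    # The memory metrics land in the result dict / TensorBoard logs under
    # the trial's logdir (~/ray_results/... by default).
    print(f"finished training iteration {i + 1}")
```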