
Apparent memory leak in PPO training #55

Closed
jcallaham opened this issue Sep 27, 2022 · 8 comments
Assignee: jcallaham
Labels: bug (Something isn't working), priority (High-priority core feature)

@jcallaham (Collaborator)

When running a very simple (serial) PPO training with the ppo_train.py script, training runs successfully for 3 iterations and then crashes (I'll post the error message later).

I'm not sure if this is an issue on the Firedrake or Ray side - I've run into memory-leak-type behavior with Firedrake before, but there are also a couple of documented instances of this kind of thing with Ray.

Debugging ideas:

  • Rebuild the image with the latest versions of OpenAI Gym (0.26 currently) and Ray (2.0.0)... may also require resolving Gym API Compliance #54
  • Try garbage collection during env.reset() (see the sketch after this list)
  • Use ray.rllib.algorithms.callbacks.MemoryTrackingCallbacks to track memory usage in TensorBoard (also sketched below)
  • Compare memory usage between the SpinningUp PPO implementation and RLlib to see whether the problem is in our environment
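
As a rough sketch of the second and third ideas (the wrapper class and env id below are illustrative placeholders, not existing hydrogym names; the config API assumes Ray 2.0):

import gc

import gym
from ray.rllib.algorithms.callbacks import MemoryTrackingCallbacks
from ray.rllib.algorithms.ppo import PPOConfig


class GCOnReset(gym.Wrapper):
    """Hypothetical wrapper: force a garbage-collection pass on every reset."""

    def reset(self, **kwargs):
        gc.collect()
        return self.env.reset(**kwargs)


# Log per-worker memory stats to TensorBoard via RLlib's built-in callback.
config = (
    PPOConfig()
    .environment("hydrogym-cylinder-v0")  # placeholder env id
    .callbacks(MemoryTrackingCallbacks)
)
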
jcallaham added the bug and priority labels Sep 27, 2022
jcallaham self-assigned this Sep 27, 2022
@jcallaham (Collaborator, Author)

Here's the error:

Traceback (most recent call last):
  File "/home/hydrogym/examples/cylinder/rllib/ppo_train.py", line 81, in <module>
    result = trainer.train()
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 347, in train
    result = self.step()
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 661, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2378, in _run_one_training_iteration
    num_recreated += self.try_recover_from_step_attempt(
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2185, in try_recover_from_step_attempt
    raise error
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2373, in _run_one_training_iteration
    results = self.training_step()
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 407, in training_step
    train_batch = synchronous_parallel_sample(
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/execution/rollout_ops.py", line 100, in synchronous_parallel_sample
    sample_batches = ray.get(
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/_private/worker.py", line 2275, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::RolloutWorker.sample() (pid=4376, ip=172.17.0.2, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7fd1117fec80>)
KeyError: 140535696007088

During handling of the above exception, another exception occurred:

ray::RolloutWorker.sample() (pid=4376, ip=172.17.0.2, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7fd1117fec80>)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 370, in get_so
    return ctypes.CDLL(soname)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/firedrake/firedrake/.cache/pyop2/64/1b8ba2b3c45b8676d2fb085520a503.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

ray::RolloutWorker.sample() (pid=4376, ip=172.17.0.2, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7fd1117fec80>)
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 806, in sample
    batches = [self.input_reader.next()]
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/evaluation/sampler.py", line 92, in next
    batches = [self.get_data()]
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/evaluation/sampler.py", line 282, in get_data
    item = next(self._env_runner)
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/evaluation/sampler.py", line 734, in _env_runner
    base_env.send_actions(actions_to_send)
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/env/vector_env.py", line 396, in send_actions
    ) = self.vector_env.vector_step(action_vector)
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/env/vector_env.py", line 309, in vector_step
    raise e
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/ray/rllib/env/vector_env.py", line 302, in vector_step
    obs, r, done, info = self.envs[i].step(actions[i])
  File "/home/hydrogym/hydrogym/env.py", line 35, in step
    self.iter += 1
  File "/home/hydrogym/hydrogym/ts.py", line 174, in step
    self.u += Bu * ctrl
  File "/home/firedrake/firedrake/src/firedrake/firedrake/adjoint/function.py", line 132, in wrapper
    func = __iadd__(self, other, **kwargs)
  File "<decorator-gen-31>", line 2, in __iadd__
  File "/home/firedrake/firedrake/src/firedrake/firedrake/utils.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/home/firedrake/firedrake/src/firedrake/firedrake/function.py", line 431, in __iadd__
    assemble_expressions.evaluate_expression(
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "<decorator-gen-20>", line 2, in evaluate_expression
  File "/home/firedrake/firedrake/src/firedrake/firedrake/utils.py", line 71, in wrapper
    return f(*args, **kwargs)
  File "/home/firedrake/firedrake/src/firedrake/firedrake/assemble_expressions.py", line 526, in evaluate_expression
    firedrake.op2.par_loop(kernel, subset or iterset, *args)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 628, in par_loop
    parloop(*args, **kwargs)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 641, in parloop
    LegacyParloop(knl, *args, **kwargs)()
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 210, in __call__
    self._compute(self.iterset.core_part)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 192, in _compute
    self.global_kernel(self.comm, part.offset, part.offset+part.size, *self.arglist)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 290, in __call__
    func = self.compile(comm)
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 360, in compile
    return compilation.load(self, extension, self.name,
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 641, in load
    dll = compiler(cppargs, ldargs, cpp=cpp, comm=comm).get_so(code, extension)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 450, in get_so
    return ctypes.CDLL(soname)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/firedrake/firedrake/.cache/pyop2/64/1b8ba2b3c45b8676d2fb085520a503.so: failed to map segment from shared object

Actually, I'm not so sure this is a memory error now. I've run it with varying numbers of steps per episode (which effectively varies the total number of episodes), and it always crashes after 3 RLlib iterations. I'm going to try re-running on the medium mesh as well to see if the size of the problem has any impact.

@jcallaham (Collaborator, Author)

Same behavior on the medium-resolution mesh, and actually almost the same message appears using the SpinningUp PPO implementation:

Traceback (most recent call last):
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 288, in __call__
    func = self._func_cache[key]
KeyError: 140381756667824

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 370, in get_so
    return ctypes.CDLL(soname)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/firedrake/firedrake/.cache/pyop2/49/75a957f46960a6a7d9f63290e99968.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hydrogym/examples/ppo/cyl.py", line 35, in <module>
    ppo.ppo(
  File "/home/hydrogym/examples/ppo/ppo.py", line 453, in ppo
    next_o, r, d, _ = env.step(a)
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/gym/wrappers/order_enforcing.py", line 13, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/hydrogym/hydrogym/env.py", line 34, in step
    self.solver.step(self.iter, control=action)
  File "/home/hydrogym/hydrogym/ts.py", line 174, in step
    self.u += Bu * ctrl
  File "/home/firedrake/firedrake/src/firedrake/firedrake/adjoint/function.py", line 132, in wrapper
    func = __iadd__(self, other, **kwargs)
  File "<decorator-gen-31>", line 2, in __iadd__
  File "/home/firedrake/firedrake/src/firedrake/firedrake/utils.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/home/firedrake/firedrake/src/firedrake/firedrake/function.py", line 431, in __iadd__
    assemble_expressions.evaluate_expression(
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "<decorator-gen-20>", line 2, in evaluate_expression
  File "/home/firedrake/firedrake/src/firedrake/firedrake/utils.py", line 71, in wrapper
    return f(*args, **kwargs)
  File "/home/firedrake/firedrake/src/firedrake/firedrake/assemble_expressions.py", line 526, in evaluate_expression
    firedrake.op2.par_loop(kernel, subset or iterset, *args)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 628, in par_loop
    parloop(*args, **kwargs)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 641, in parloop
    LegacyParloop(knl, *args, **kwargs)()
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 210, in __call__
    self._compute(self.iterset.core_part)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 192, in _compute
    self.global_kernel(self.comm, part.offset, part.offset+part.size, *self.arglist)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 290, in __call__
    func = self.compile(comm)
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 360, in compile
    return compilation.load(self, extension, self.name,
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 641, in load
    dll = compiler(cppargs, ldargs, cpp=cpp, comm=comm).get_so(code, extension)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 450, in get_so
    return ctypes.CDLL(soname)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/firedrake/firedrake/.cache/pyop2/49/75a957f46960a6a7d9f63290e99968.so: cannot apply additional memory protection after relocation: Cannot allocate memory

So I guess this is not a Ray issue after all. For reference, the error appears after 125 epochs of 100 steps each on the medium mesh. It also always pops up on line 174 of ts.py, which is here:

            for (B, ctrl) in zip(self.B, control):
                Bu, _ = B.split()
                print(ctrl)
                print(Bu*ctrl)
                self.u += Bu * ctrl  # <--- ERROR

I'm going to try a few things:

  • Rerun on coarse mesh to see if there's any difference in the number of successful epochs (which might help determine if it's actually a memory issue)
  • Try a different method of updating the velocity field with the control inputs (rough sketch after this list - was it the same as above the first time I tried training?)
  • Run with more steps per epoch (again, just to see if there's any change)
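
For the second item, a minimal sketch of one alternative I'd try (whether Function.assign avoids the failing __iadd__/compilation path is an open question, not something established here):

            # Same loop as the snippet above, but written with Function.assign
            # instead of the in-place += that shows up in the traceback.
            for (B, ctrl) in zip(self.B, control):
                Bu, _ = B.split()
                self.u.assign(self.u + Bu * ctrl)
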

jcallaham changed the title from "RLlib memory leak" to "Apparent memory leak in PPO training" Sep 29, 2022
@jcallaham (Collaborator, Author) commented Sep 29, 2022

The control updates are definitely different. Originally (in ts.IPCS):

        for (u, v) in zip(self.control, control):
            u.assign( u + (self.dt/self.flow.TAU)*(v - u) )

Currently (in core.PDEModel):

        for i, (u, v) in enumerate(zip(self.control, act)):
            self.control[i] += (dt / self.TAU) * (v - u)

If I remember correctly, the former may not actually have been updating the controls, but possibly it also didn't crash?

@jcallaham (Collaborator, Author)

Looks like it was actually much simpler... after some more tracking with the SpinningUp implementation, it seems the issue was just that the solver was eventually diverging. Decreasing the time step seems to have resolved the problem - here's the episode_reward_mean so far with RLlib PPO training:

[Plot: episode_reward_mean over RLlib PPO training iterations]

@ludgerpaehler (Collaborator)

Sorry, I have a deadline today so I can't comment in depth right now. But this is a known problem, and there are a number of potential approaches to remedy it.

It would probably be prudent to implement several of them to protect the user from diverging simulator trajectories.

I'll post the references here once I've managed to escape the abyss of deadline hell.

@jcallaham (Collaborator, Author)

Yeah, that would be great! Ideally, we should be able to set things up so that if the regular simulation is reasonably stable with respect to CFL and so on, then the RL training also won't diverge.
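
One simple option along those lines (just a sketch of the general idea, not existing hydrogym API - the threshold and penalty below are arbitrary placeholders, and this only catches divergence that shows up in the observation, not exceptions raised inside the solver):

import gym
import numpy as np

class DivergenceGuard(gym.Wrapper):
    """Hypothetical wrapper: end the episode cleanly if the state blows up,
    rather than letting an exception propagate into the RL training loop."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs_arr = np.asarray(obs, dtype=float)
        if not np.all(np.isfinite(obs_arr)) or np.max(np.abs(obs_arr)) > 1e6:
            done = True              # terminate the episode
            reward = -1e3            # arbitrary penalty for a diverged trajectory
            info["diverged"] = True
        return obs, reward, done, info
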

I'll leave this open to track.

@jcallaham (Collaborator, Author)

Actually, after apparently training successfully once with RLlib, I'm still getting this error (on the main branch):

Traceback (most recent call last):
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 288, in __call__
    func = self._func_cache[key]
KeyError: 140018891712752

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 370, in get_so
    return ctypes.CDLL(soname)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/firedrake/firedrake/.cache/pyop2/13/fc0045378f1278afa135c09e20bd74.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hydrogym/examples/ppo/cyl.py", line 36, in <module>
    ppo.ppo(
  File "/home/hydrogym/examples/ppo/ppo.py", line 454, in ppo
    next_o, r, d, _ = env.step(a)
  File "/home/firedrake/firedrake/lib/python3.10/site-packages/gym/wrappers/order_enforcing.py", line 13, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/hydrogym/hydrogym/env.py", line 35, in step
    self.solver.step(self.iter, control=action)
  File "/home/hydrogym/hydrogym/ts.py", line 174, in step
    self.u += Bu * ctrl
  File "/home/firedrake/firedrake/src/firedrake/firedrake/adjoint/function.py", line 132, in wrapper
    func = __iadd__(self, other, **kwargs)
  File "<decorator-gen-31>", line 2, in __iadd__
  File "/home/firedrake/firedrake/src/firedrake/firedrake/utils.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/home/firedrake/firedrake/src/firedrake/firedrake/function.py", line 431, in __iadd__
    assemble_expressions.evaluate_expression(
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "<decorator-gen-20>", line 2, in evaluate_expression
  File "/home/firedrake/firedrake/src/firedrake/firedrake/utils.py", line 71, in wrapper
    return f(*args, **kwargs)
  File "/home/firedrake/firedrake/src/firedrake/firedrake/assemble_expressions.py", line 526, in evaluate_expression
    firedrake.op2.par_loop(kernel, subset or iterset, *args)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 628, in par_loop
    parloop(*args, **kwargs)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 641, in parloop
    LegacyParloop(knl, *args, **kwargs)()
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 210, in __call__
    self._compute(self.iterset.core_part)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/parloop.py", line 192, in _compute
    self.global_kernel(self.comm, part.offset, part.offset+part.size, *self.arglist)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 290, in __call__
    func = self.compile(comm)
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/global_kernel.py", line 360, in compile
    return compilation.load(self, extension, self.name,
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 641, in load
    dll = compiler(cppargs, ldargs, cpp=cpp, comm=comm).get_so(code, extension)
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/compilation.py", line 450, in get_so
    return ctypes.CDLL(soname)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/firedrake/firedrake/.cache/pyop2/13/fc0045378f1278afa135c09e20bd74.so: failed to map segment from shared object

This doesn't actually seem related to Ray or RL training at all - I can reproduce it just by running the PD control example, which should hopefully make it a bit easier to zero in on.

@jcallaham (Collaborator, Author)

Alright, this should be fixed now. As best I can tell, it was somehow a discrepancy between floating-point types: I was using np.float64 or np.float32 in different places, and AdjFloat when the value might need to be differentiable. I'm not sure I understand why it matters, but now it's plain float in the general case and AdjFloat where needed.
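
For reference, a minimal sketch of that convention (the helper name is made up for illustration; AdjFloat is the one from pyadjoint):

from pyadjoint import AdjFloat

def as_control_scalar(value, differentiable=False):
    """Plain Python float in the general case; AdjFloat only when the value
    needs to participate in the adjoint tape."""
    return AdjFloat(value) if differentiable else float(value)
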

Tested with pd-control.py and ppo_train.py on release branch.
