
Possible memory leak in Ape-X #3452

Closed · ericl opened this issue Dec 2, 2018 · 15 comments

@ericl (Contributor) commented Dec 2, 2018

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.0
  • Python version: 2.7
  • Exact command to reproduce: rllib train -f crash.yaml

You can run this on any 64-core CPU machine:

crash.yaml:

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000

Describe the problem

Source code / logs

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/workers/default_worker.py", line 99, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 1010, in main_loop
    self._wait_for_and_process_task(task)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 967, in _wait_for_and_process_task
    self._process_task(task, execution_info)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 865, in _process_task
    traceback_str)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 889, in _handle_process_task_failure
    self._store_outputs_in_object_store(return_object_ids, failure_objects)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 798, in _store_outputs_in_object_store
    self.put_object(object_ids[i], outputs[i])
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 411, in put_object
    self.store_and_register(object_id, value)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 346, in store_and_register
    self.task_driver_id))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 534, in pyarrow._plasma.PlasmaClient.put
    buffer = self.create(target_id, serialized.total_bytes)
  File "pyarrow/_plasma.pyx", line 344, in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
    raise ArrowIOError(message)
ArrowIOError: Broken pipe

  This error is unexpected and should not have happened. Somehow a worker
  crashed in an unanticipated way causing the main_loop to throw an exception,
  which is being caught in "python/ray/workers/default_worker.py".
  

The rest of the experiment keeps running, but the particular trial fails.

@ericl (Contributor, Author) commented Dec 2, 2018

E1202 01:37:11.016106 29049 node_manager.cc:1562] The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang.
E1202 01:37:19.988700 29049 node_manager.cc:1562] The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang.
E1202 01:37:21.016217 29049 node_manager.cc:1562] The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang.

I suspect Ape-X might be leaking object store memory unintentionally, perhaps through a view on a numpy array that is allocated in shared memory. In other words, this would be an application-level error producing these odd error messages rather than a Ray bug.

This also seems to happen if the replay buffer is disabled (?)
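
For intuition, here is a minimal sketch of the suspected mechanism (sample_batch below is a hypothetical task, not actual RLlib code): a numpy view onto an array returned from the object store keeps the whole shared-memory (plasma) buffer referenced, so it can never be evicted.

import numpy as np
import ray

ray.init()

@ray.remote
def sample_batch():
    # A worker returns a large observation array; Ray stores it in the
    # object store and ray.get hands back a zero-copy view of that buffer.
    return np.zeros((1000, 84, 84, 4), dtype=np.uint8)

batch = ray.get(sample_batch.remote())

# Hypothesized leak: holding on to even a small slice of the zero-copy
# array keeps the entire plasma buffer pinned, so object store usage
# grows over time.
leaky = batch[0]          # numpy view, still backed by shared memory
safe = batch[0].copy()    # explicit copy, lets the buffer be released
                          # once `batch` and any views go out of scope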

ericl changed the title from "Ape-X task failure after several hours with ArrowIOError" to "Possible memory leak in Ape-X" on Dec 2, 2018
ericl self-assigned this Dec 2, 2018
@stephanie-wang (Contributor)

@ericl and I determined that error messages like "The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang." are expected, but we should fix the backend so that the job doesn't hang in that case. I'm currently working on a PR that treats the task as failed if the object really has been evicted.

@ericl (Contributor, Author) commented Dec 5, 2018

I'm still looking into the mysterious ArrowIOError. It's unclear whether it's related to memory usage, but it does happen deterministically after a few hours.

@robertnishihara (Collaborator)

@stephanie-wang so an object that had been evicted was needed? Do we understand why that's happening?

@stephanie-wang (Contributor) commented Dec 5, 2018 via email

@ericl (Contributor, Author) commented Dec 6, 2018

The above PR doesn't entirely solve the stability issue:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 36/64 CPUs, 0/0 GPUs
Memory usage on this node: 195.2/270.5 GB
Result logdir: /home/ubuntu/ray_results/apex
RUNNING trials:
 - APEX_BreakoutNoFrameskip-v4_0_env=BreakoutNoFrameskip-v4:    RUNNING [pid=104786], 28039 s, 892 iter, 41957280 ts, 36.3 rew
 - APEX_BeamRiderNoFrameskip-v4_1_env=BeamRiderNoFrameskip-v4:  RUNNING [pid=104820], 28037 s, 892 iter, 41728480 ts, 611 rew
 - APEX_QbertNoFrameskip-v4_2_env=QbertNoFrameskip-v4:  RUNNING [pid=104781], 28035 s, 892 iter, 42232000 ts, 359 rew
 - APEX_SpaceInvadersNoFrameskip-v4_3_env=SpaceInvadersNoFrameskip-v4:  RUNNING [pid=104782], 28066 s, 895 iter, 44041920 ts, 306 rew
W1206 16:47:06.093418 104746 client_connection.cc:247] [worker]ProcessMessage with type 19 took 10022 ms.
F1206 16:47:06.094967 104746 node_manager.cc:386]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
    @           0x5768e6  google::LogMessage::Fail()
    @           0x576832  google::LogMessage::SendToLog()
    @           0x5761b6  google::LogMessage::Flush()
    @           0x575fc5  google::LogMessage::~LogMessage()
    @           0x4dd798  ray::RayLog::~RayLog()
    @           0x516f19  ray::raylet::NodeManager::ClientRemoved()
    @           0x4ae0a1  ray::gcs::ClientTable::HandleNotification()
    @           0x4ae90b  _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8UniqueIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_S6_SG_E_clES3_S6_SG_EUlS3_S6_SC_E_E9_M_invokeERKSt9_Any_dataS3_S6_SC_
    @           0x4bf5bc  _ZZN3ray3gcs3LogINS_8UniqueIDE15ClientTableDataE9SubscribeERKS2_S6_RKSt8functionIFvPNS0_14AsyncGcsClientES6_RKSt6vectorI16ClientTableDataTSaISB_EEEERKS7_IFvS9_EEENKUlRKSsE_clESP_
    @           0x4d9869  (anonymous namespace)::ProcessCallback()
    @           0x4da228  ray::gcs::SubscribeRedisCallback()
    @           0x527d26  redisProcessCallbacks
    @           0x4dd223  RedisAsioClient::handle_read()
    @           0x4dd6b1  boost::asio::detail::reactive_null_buffers_op<>::do_complete()
    @           0x497461  boost::asio::detail::task_io_service::run()
    @           0x48fb4e  main
    @     0x7f98e2b8c830  __libc_start_main
    @           0x494351  (unknown)

Note that message type 19 is NotifyUnblocked; note also the high memory usage.

Is it possible this is a slowdown related to very high Redis memory usage?

@stephanie-wang (Contributor)

Actually, I think that message id corresponds to FetchOrReconstruct according to the protocol.

My first guess would have been that some large data structure is getting resized, but it's not obvious from the message handler what that data structure might be. I'm guessing you didn't see anything obvious from the debug state?

@ericl (Contributor, Author) commented Dec 7, 2018 via email

@stephanie-wang (Contributor)

I guess it's possible... you could try something like PEXPIRE with a long timeout to see if that's the issue? (There's an example where we use it for one of the tables.) But the raylet interacts with Redis asynchronously, so it would have to be an indirect effect due to context switching or something like that.
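
As a hedged illustration of the suggestion (the key name below is made up, and the real change would go through the raylet's asynchronous Redis client in C++ rather than redis-py): PEXPIRE simply attaches a millisecond TTL to a key so it is eventually reclaimed instead of accumulating forever.

import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.set("TASK_TABLE:example-task-id", b"...")                   # hypothetical GCS entry
r.pexpire("TASK_TABLE:example-task-id", 24 * 60 * 60 * 1000)  # expire after 24 h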

@ericl (Contributor, Author) commented Dec 8, 2018

It's not high Redis memory usage: I'm running with LRU eviction on, and Redis memory is steady at 500 MB.

Messages of type 8 seem to take longer to process over time though:

W1208 02:55:47.824949 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 122 ms.
W1208 03:13:23.504184 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 572 ms.
W1208 03:13:23.504999 20918 node_manager.cc:243] Last heartbeat was sent 616 ms ago 
W1208 03:13:37.007567 20918 node_manager.cc:243] Last heartbeat was sent 638 ms ago 
W1208 03:37:19.096535 20918 node_manager.cc:243] Last heartbeat was sent 1104 ms ago 
W1208 03:38:03.787739 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 1013 ms.
W1208 03:38:03.788363 20918 node_manager.cc:243] Last heartbeat was sent 1028 ms ago 
W1208 04:27:43.250385 20918 node_manager.cc:243] Last heartbeat was sent 2276 ms ago 
W1208 04:30:23.140275 20918 node_manager.cc:243] Last heartbeat was sent 2198 ms ago 
W1208 06:14:52.880249 20918 node_manager.cc:243] Last heartbeat was sent 4472 ms ago 
W1208 06:22:02.923844 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 4497 ms.
W1208 06:22:02.924873 20918 node_manager.cc:243] Last heartbeat was sent 4519 ms ago 

@ericl (Contributor, Author) commented Dec 8, 2018

@stephanie-wang do you think it's #3470 here?

For the 4.5 s delay, the hash map size was around 28 million entries. That seems about right if you assume a hash map resize can process 5-6 million entries per second.
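
For intuition on those numbers, here is a rough stand-in experiment in Python 3 (a plain dict instead of the raylet's C++ hash map; absolute timings differ, but the pattern of rare, size-proportional stalls on insert is the same):

import time

table = {}
worst = 0.0
for i in range(30_000_000):              # needs a few GB of RAM
    t0 = time.perf_counter()
    table[i] = None                      # occasionally triggers a full rehash
    dt = time.perf_counter() - t0
    if dt > worst:
        worst = dt
        # The slowest inserts cluster at resize points and grow with table size.
        print("entry {:>11,}: slowest insert so far {:.1f} ms".format(i, dt * 1e3))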

@ericl (Contributor, Author) commented Dec 10, 2018

Closing since the issue is confirmed to just be the hash map resizing.

ericl closed this as completed Dec 10, 2018
@ericl (Contributor, Author) commented Dec 22, 2018

Note: to re-run this case for QA in the future, use these YAML files:

apex.yaml:

cluster_name: apex
min_workers: 0
max_workers: 0
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a
auth:
    ssh_user: ubuntu
head_node:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
worker_nodes:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
file_mounts:
    "/home/ubuntu/crash.yaml": "~/Desktop/crash.yaml"
head_setup_commands: []
setup_commands:
    - echo ok
    - rm -rf /home/ubuntu/.local/lib/python2.7/site-packages/ray
    - rm -rf /tmp/ray
    - source activate tensorflow_p27 && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.0-cp27-cp27mu-manylinux1_x86_64.whl --user
head_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
worker_start_ray_commands:
    - echo ok

crash.yaml:

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000

And this command line:

ray exec apex.yaml "source activate tensorflow_p27 && rllib train -f crash.yaml --ray-redis-max-memory=5000000000" --start --tmux

For multiple local schedulers, run:
ray exec apex.yaml "source activate tensorflow_p27 && rllib train -f crash.yaml --ray-num-local-schedulers=4 --ray-redis-max-memory=5000000000" --start --tmux

