
Possible memory leak in Ape-X #3452

Closed · ericl opened this issue Dec 2, 2018 · 15 comments

@ericl (Contributor) commented Dec 2, 2018

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.0
  • Python version: 2.7
  • Exact command to reproduce: rllib train -f crash.yaml

You can run this on any 64-core CPU machine:

crash.yaml:

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000

Describe the problem

Source code / logs

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/workers/default_worker.py", line 99, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 1010, in main_loop
    self._wait_for_and_process_task(task)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 967, in _wait_for_and_process_task
    self._process_task(task, execution_info)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 865, in _process_task
    traceback_str)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 889, in _handle_process_task_failure
    self._store_outputs_in_object_store(return_object_ids, failure_objects)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 798, in _store_outputs_in_object_store
    self.put_object(object_ids[i], outputs[i])
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 411, in put_object
    self.store_and_register(object_id, value)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 346, in store_and_register
    self.task_driver_id))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 534, in pyarrow._plasma.PlasmaClient.put
    buffer = self.create(target_id, serialized.total_bytes)
  File "pyarrow/_plasma.pyx", line 344, in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
    raise ArrowIOError(message)
ArrowIOError: Broken pipe

  This error is unexpected and should not have happened. Somehow a worker
  crashed in an unanticipated way causing the main_loop to throw an exception,
  which is being caught in "python/ray/workers/default_worker.py".
  

The rest of the experiment keeps running, but the particular trial fails.

@ericl (Contributor, Author) commented Dec 2, 2018

E1202 01:37:11.016106 29049 node_manager.cc:1562] The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang.
E1202 01:37:19.988700 29049 node_manager.cc:1562] The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang.
E1202 01:37:21.016217 29049 node_manager.cc:1562] The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang.

I suspect Ape-X might be leaking object store memory unintentionally, perhaps through a view on a numpy array that is allocated in shared memory. In other words, this would be an application-level error producing these odd error messages rather than a Ray bug.

This also seems to happen if the replay buffer is disabled (?)
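
For intuition, here is a minimal sketch of the suspected mechanism (sample_batch below is a hypothetical task, not actual RLlib code): a numpy view onto an array returned from the object store keeps the whole shared-memory (plasma) buffer referenced, so it can never be evicted.

import numpy as np
import ray

ray.init()

@ray.remote
def sample_batch():
    # A worker returns a large observation array; Ray stores it in the
    # object store and ray.get hands back a zero-copy view of that buffer.
    return np.zeros((1000, 84, 84, 4), dtype=np.uint8)

batch = ray.get(sample_batch.remote())

# Hypothesized leak: holding on to even a small slice of the zero-copy
# array keeps the entire plasma buffer pinned, so object store usage
# grows over time.
leaky = batch[0]          # numpy view, still backed by shared memory
safe = batch[0].copy()    # explicit copy, lets the buffer be released
                          # once `batch` and any views go out of scope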

ericl changed the title from "Ape-X task failure after several hours with ArrowIOError" to "Possible memory leak in Ape-X" on Dec 2, 2018
ericl self-assigned this Dec 2, 2018
@stephanie-wang (Contributor)

@ericl and I determined that error messages like "The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang." are expected, but we should fix the backend so that the job doesn't hang in that case. I'm currently working on a PR that treats the task as failed if the object really has been evicted.

@ericl (Contributor, Author) commented Dec 5, 2018

I'm still looking into the mysterious ArrowIOError. It's unclear whether it's related to memory usage, but it does happen deterministically after a few hours.

@robertnishihara (Collaborator)

@stephanie-wang so an object that had been evicted was needed? Do we understand why that's happening?

@stephanie-wang (Contributor) commented Dec 5, 2018 via email

@ericl (Contributor, Author) commented Dec 6, 2018

The above PR doesn't entirely solve the stability issue:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 36/64 CPUs, 0/0 GPUs
Memory usage on this node: 195.2/270.5 GB
Result logdir: /home/ubuntu/ray_results/apex
RUNNING trials:
 - APEX_BreakoutNoFrameskip-v4_0_env=BreakoutNoFrameskip-v4:    RUNNING [pid=104786], 28039 s, 892 iter, 41957280 ts, 36.3 rew
 - APEX_BeamRiderNoFrameskip-v4_1_env=BeamRiderNoFrameskip-v4:  RUNNING [pid=104820], 28037 s, 892 iter, 41728480 ts, 611 rew
 - APEX_QbertNoFrameskip-v4_2_env=QbertNoFrameskip-v4:  RUNNING [pid=104781], 28035 s, 892 iter, 42232000 ts, 359 rew
 - APEX_SpaceInvadersNoFrameskip-v4_3_env=SpaceInvadersNoFrameskip-v4:  RUNNING [pid=104782], 28066 s, 895 iter, 44041920 ts, 306 rew
W1206 16:47:06.093418 104746 client_connection.cc:247] [worker]ProcessMessage with type 19 took 10022 ms.
F1206 16:47:06.094967 104746 node_manager.cc:386]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
    @           0x5768e6  google::LogMessage::Fail()
    @           0x576832  google::LogMessage::SendToLog()
    @           0x5761b6  google::LogMessage::Flush()
    @           0x575fc5  google::LogMessage::~LogMessage()
    @           0x4dd798  ray::RayLog::~RayLog()
    @           0x516f19  ray::raylet::NodeManager::ClientRemoved()
    @           0x4ae0a1  ray::gcs::ClientTable::HandleNotification()
    @           0x4ae90b  _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8UniqueIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_S6_SG_E_clES3_S6_SG_EUlS3_S6_SC_E_E9_M_invokeERKSt9_Any_dataS3_S6_SC_
    @           0x4bf5bc  _ZZN3ray3gcs3LogINS_8UniqueIDE15ClientTableDataE9SubscribeERKS2_S6_RKSt8functionIFvPNS0_14AsyncGcsClientES6_RKSt6vectorI16ClientTableDataTSaISB_EEEERKS7_IFvS9_EEENKUlRKSsE_clESP_
    @           0x4d9869  (anonymous namespace)::ProcessCallback()
    @           0x4da228  ray::gcs::SubscribeRedisCallback()
    @           0x527d26  redisProcessCallbacks
    @           0x4dd223  RedisAsioClient::handle_read()
    @           0x4dd6b1  boost::asio::detail::reactive_null_buffers_op<>::do_complete()
    @           0x497461  boost::asio::detail::task_io_service::run()
    @           0x48fb4e  main
    @     0x7f98e2b8c830  __libc_start_main
    @           0x494351  (unknown)

Note that message type 19 is NotifyUnblocked; note also the high memory usage.

Is it possible this is a slowdown related to very high Redis memory usage?

@stephanie-wang (Contributor)

Actually, I think that message id corresponds to FetchOrReconstruct according to the protocol.

My first guess would have been that some large data structure is getting resized, but it's not obvious from the message handler what that data structure might be. I'm guessing you didn't see anything obvious from the debug state?

@ericl (Contributor, Author) commented Dec 7, 2018 via email

@stephanie-wang (Contributor)

I guess it's possible... you could try something like PEXPIRE with a long timeout to see if that's the issue? (There's an example where we use it for one of the tables.) But the raylet interacts with Redis asynchronously, so it would have to be an indirect effect due to context switching or something like that.
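
As a hedged illustration of the suggestion (the key name below is made up, and the real change would go through the raylet's asynchronous Redis client in C++ rather than redis-py): PEXPIRE simply attaches a millisecond TTL to a key so it is eventually reclaimed instead of accumulating forever.

import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.set("TASK_TABLE:example-task-id", b"...")                   # hypothetical GCS entry
r.pexpire("TASK_TABLE:example-task-id", 24 * 60 * 60 * 1000)  # expire after 24 h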

@ericl (Contributor, Author) commented Dec 8, 2018

It's not high Redis memory usage: I'm running with LRU eviction on, and Redis memory is steady at 500 MB.

Messages of type 8 seem to take longer to process over time though:

W1208 02:55:47.824949 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 122 ms.
W1208 03:13:23.504184 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 572 ms.
W1208 03:13:23.504999 20918 node_manager.cc:243] Last heartbeat was sent 616 ms ago 
W1208 03:13:37.007567 20918 node_manager.cc:243] Last heartbeat was sent 638 ms ago 
W1208 03:37:19.096535 20918 node_manager.cc:243] Last heartbeat was sent 1104 ms ago 
W1208 03:38:03.787739 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 1013 ms.
W1208 03:38:03.788363 20918 node_manager.cc:243] Last heartbeat was sent 1028 ms ago 
W1208 04:27:43.250385 20918 node_manager.cc:243] Last heartbeat was sent 2276 ms ago 
W1208 04:30:23.140275 20918 node_manager.cc:243] Last heartbeat was sent 2198 ms ago 
W1208 06:14:52.880249 20918 node_manager.cc:243] Last heartbeat was sent 4472 ms ago 
W1208 06:22:02.923844 20918 client_connection.cc:247] [worker]ProcessMessage with type 8 took 4497 ms.
W1208 06:22:02.924873 20918 node_manager.cc:243] Last heartbeat was sent 4519 ms ago 

@ericl (Contributor, Author) commented Dec 8, 2018

@stephanie-wang do you think it's #3470 here?

For the 4.5 s delay, the hash map size was around 28 million entries. That seems about right if you assume a hash map resize can process 5-6 million entries per second.
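
For intuition on those numbers, here is a rough stand-in experiment in Python 3 (a plain dict instead of the raylet's C++ hash map; absolute timings differ, but the pattern of rare, size-proportional stalls on insert is the same):

import time

table = {}
worst = 0.0
for i in range(30_000_000):              # needs a few GB of RAM
    t0 = time.perf_counter()
    table[i] = None                      # occasionally triggers a full rehash
    dt = time.perf_counter() - t0
    if dt > worst:
        worst = dt
        # The slowest inserts cluster at resize points and grow with table size.
        print("entry {:>11,}: slowest insert so far {:.1f} ms".format(i, dt * 1e3))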

@ericl (Contributor, Author) commented Dec 10, 2018

Closing since the issue is confirmed to just be the hash map resizing.

ericl closed this as completed Dec 10, 2018
@ericl (Contributor, Author) commented Dec 22, 2018

Note: to re-run this case for QA in the future, use these YAML files:

apex.yaml:

cluster_name: apex
min_workers: 0
max_workers: 0
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a
auth:
    ssh_user: ubuntu
head_node:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
worker_nodes:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
file_mounts:
    "/home/ubuntu/crash.yaml": "~/Desktop/crash.yaml"
head_setup_commands: []
setup_commands:
    - echo ok
    - rm -rf /home/ubuntu/.local/lib/python2.7/site-packages/ray
    - rm -rf /tmp/ray
    - source activate tensorflow_p27 && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.0-cp27-cp27mu-manylinux1_x86_64.whl --user
head_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
worker_start_ray_commands:
    - echo ok

crash.yaml:

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000

And this command line:

ray exec apex.yaml "source activate tensorflow_p27 && rllib train -f crash.yaml --ray-redis-max-memory=5000000000" --start --tmux

For multiple local schedulers, run:
ray exec apex.yaml "source activate tensorflow_p27 && rllib train -f crash.yaml --ray-num-local-schedulers=4 --ray-redis-max-memory=5000000000" --start --tmux

