[Bug] Unable to succeed in selecting a random port #22352

birgerbr · 2022-02-14T13:31:17Z

Search before asking

I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

After running for 6 days, the ray server fails to accept new clients.
I found this error repeated in ray_client_server.err:

RROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 1 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 2 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 3 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 4 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 5 attempts failed.
ERROR:grpc._server:Exception iterating responses: Unable to succeed in selecting a random port.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_server.py", line 461, in _take_response_from_response_iterator
    return next(response_iterator), True
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 608, in Datapath
    server = self.proxy_manager.create_specific_server(client_id)
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 196, in create_specific_server
    port = self._get_unused_port()
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 152, in _get_unused_port
    raise RuntimeError("Unable to succeed in selecting a random port.")
RuntimeError: Unable to succeed in selecting a random port.

I've can not recall seeing this error before upgrading to 1.10.0.

Versions / Dependencies

Ray version 1.10.0, Python 3.8, Ubuntu 20.04.

Reproduction script

I do not have any script for reproduction, the server was running for 6 days before the issue started.

Anything else

The server was running in Kubernetes.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

The text was updated successfully, but these errors were encountered:

architkulkarni · 2022-02-14T19:03:04Z

Hi @ckw017, can I assign this to you?

ckw017 · 2022-02-14T19:07:01Z

@architkulkarni Yep, that's fine

@birgerbr As a sanity check, in the logs how many ray_client_server_23***.err files are there? Guessing if the server is accepting a lot of client connections it exhausted all the ports. If you have an estimate of how many concurrent connections are usually active, that would be handy. It looks like we might have an effective limit at ~1000 since the current port range is only 23001-23999

Also, what version of Ray were you on before upgrading?

birgerbr · 2022-02-15T07:03:37Z

We were using version 1.9.2.

There are 16 ray_client_server_23***.err files. This was on our staging cluster, and it seems to not have been very active during those days. The number of concurrent connections was probably below 3.

ckw017 · 2022-02-16T19:07:53Z

Got it. If possible can you share the full ray_client_server.err file?

If you still have the cluster up, or if you run into this again can you try running this on the head node of the cluster:

for port in range(23000, 24000):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("", port))
    except OSError:
        print("failed to bind to", port)
        traceback.print_exc()
        continue
    finally:
        s.close()

birgerbr · 2022-02-17T06:43:24Z

Here is the log file
ray_client_server.err. The cluster has been restarted, but I can run that code if I see the issue again.

FYI: After the cluster was restarted I found another issue with our cluster. Not sure these issues can be connected, but the other issue was that the latest ray operator image for Kubernetes had some changes that made it incompatible with our current configuration. We solved that by using the 1.10.0 operator image for now. I'm assuming that we will need to update our configuration as done here f51566e when we update to ray 1.11.0.

ckw017 · 2022-02-17T17:45:14Z

Sounds good, digging through the logs it looks like it hit "Server startup failed" 1,000 times before it ran into the "Unable to succeed in selecting a random port", which likely explains how all the ports were exhausted. If the incompatible image is what's causing then server failures then reverting might resolve this as well.

birgerbr · 2022-02-21T08:48:02Z

Our cluster is again getting "Server startup failed", and the operator is running rayproject/ray:1.10.0 which matches the version used in the head node.

The "Unable to succeed in selecting a random port." were not in the logs, but they might have come if we had let it continue.

Your script above ran without any issues.

vicyap · 2022-03-03T20:53:52Z

Hello, I think I am hitting a similar issue. My issue has nothing to do with ports, but I do see "Server startup failed". If my issue isn't related, I can file a new one.

My Ray cluster is installed through the Helm chart in this repo. The operator and head node are both using version 1.10.0.

This line fails for me:
ray.init("ray://mycluster.internal:10001", runtime_env={"pip": ["torchaudio==0.10.0", "boto3"]})

However, this does work if I am on a node in the cluster, eg. ray.init("auto").

Here is the client traceback:

>>> ray.init("ray://mycluster.internal:10001", runtime_env={"pip": ["torchaudio==0.10.0", "boto3"]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/venv38/lib/python3.8/site-packages/ray/worker.py", line 785, in init
    return builder.connect()
  File "/venv38/lib/python3.8/site-packages/ray/client_builder.py", line 151, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/venv38/lib/python3.8/site-packages/ray/util/client_connect.py", line 33, in connect
    conn = ray.connect(
  File "/venv38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 228, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/venv38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 88, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/venv38/lib/python3.8/site-packages/ray/util/client/worker.py", line 697, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 623, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 279, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 233, in _create_runtime_env
    raise RuntimeError(
RuntimeError: Failed to create runtime_env for Ray client server: Failed to install pip requirements:
Collecting torchaudio==0.10.0
Using cached torchaudio-0.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
Collecting boto3
Using cached boto3-1.21.12-py3-none-any.whl (132 kB)
Collecting torch==1.10.0

And on the head node, I tailed some logs...

==> dashboard_agent.log <==
2022-03-03 12:48:07,199 INFO runtime_env_agent.py:169 -- Runtime env already failed. Env: {"extensions": {"_ray_commit": "5ea565317a8104c04ae7892bb9bb41c6d72f12df"}, "pipRuntimeEnv": {"config": {"packages": ["torchaudio==0
.10.0", "boto3"]}}, "uris": {"pipUri": "pip://4ceac73cac9531aae34bb906bc8ce6b1b6c04183"}}, err: Failed to install pip requirements:
Collecting torchaudio==0.10.0
Using cached torchaudio-0.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
Collecting boto3
Using cached boto3-1.21.12-py3-none-any.whl (132 kB)
Collecting torch==1.10.0

==> ray_client_server.err <==
INFO:ray.util.client.server.proxier:New data connection from client e9e585f631d5412ead897bda8238d092:

==> debug_state_gcs.txt <==

==> debug_state.txt <==

==> debug_state_gcs.txt <==

==> debug_state.txt <==

==> debug_state_gcs.txt <==

==> debug_state.txt <==

==> ray_client_server.err <==
INFO:ray.util.client.server.proxier:e9e585f631d5412ead897bda8238d092 last started stream at 1646340487.1929243. Current stream started at 1646340487.1929243.
ERROR:grpc._server:Exception iterating responses: Server startup failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_server.py", line 461, in _take_response_from_response_iterator
    return next(response_iterator), True
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 712, in Logstream
    channel = self.proxy_manager.get_channel(client_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 346, in get_channel
    server.wait_ready()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 69, in wait_ready
    raise RuntimeError("Server startup failed.")
RuntimeError: Server startup failed.

I also get "Server startup failed".

Also the first line of dashboard_agent.log is
2022-03-03 12:48:07,199 INFO runtime_env_agent.py:169 -- Runtime env already failed.

Any help is appreciated.

ckw017 · 2022-03-03T21:18:45Z

cc @shrekris-anyscale can you help triage vicyap's issue, looks like an issue with runtime envs

shrekris-anyscale · 2022-03-03T21:59:18Z

Sure, let me take a look.

simon-mo · 2022-03-03T22:06:23Z

@vicyap looking at the traceback, it looks like the the process or container might be OOM killed. This is a common problem for installing PyTorch within a container.

Please take a look at pytorch/pytorch#1022 (comment) for workaround.

architkulkarni · 2022-03-03T22:06:27Z

@vicyap is it possible to try the same thing without the torchaudio package and see if it works? We've had a couple other users report about the same thing, using torch with Helm charts.

simon-mo · 2022-03-03T22:07:26Z

@architkulkarni is there a way to do something like pip --no-cache-dir install torch?

architkulkarni · 2022-03-03T22:14:20Z

@vicyap you can try setting PIP_NO_CACHE_DIR=1 on the cluster:https://pip.pypa.io/en/latest/topics/configuration/#environment-variables

Actually, you might have to set it to 0 instead of 1, as this issue seems to still be open: pypa/pip#5735

Another user reported that using the "conda" field instead seemed to work https://discuss.ray.io/t/failed-to-lease-worker-from-node/5140/6?u=architkulkarni though it's not clear why it would fix a memory issue. If you're okay with running in an isolated conda environment (so not inheriting the existing python packages already installed on the cluster), this could work.

peterhaddad3121 · 2022-04-12T19:54:52Z

@architkulkarni I seem to be experiencing similar behaviors but a little differently.

I documented an issue here, if it's okay for you to take a look at. Could use some eyes. #23865

stale · 2022-08-12T05:39:01Z

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale · 2022-09-20T18:21:21Z

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

birgerbr added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Feb 14, 2022

ckw017 self-assigned this Feb 14, 2022

stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Aug 12, 2022

stale bot closed this as completed Sep 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Unable to succeed in selecting a random port #22352

[Bug] Unable to succeed in selecting a random port #22352

birgerbr commented Feb 14, 2022

architkulkarni commented Feb 14, 2022

ckw017 commented Feb 14, 2022 •

edited

Loading

birgerbr commented Feb 15, 2022

ckw017 commented Feb 16, 2022

birgerbr commented Feb 17, 2022

ckw017 commented Feb 17, 2022 •

edited

Loading

birgerbr commented Feb 21, 2022

vicyap commented Mar 3, 2022 •

edited

Loading

ckw017 commented Mar 3, 2022

shrekris-anyscale commented Mar 3, 2022

simon-mo commented Mar 3, 2022

architkulkarni commented Mar 3, 2022

simon-mo commented Mar 3, 2022

architkulkarni commented Mar 3, 2022 •

edited

Loading

peterhaddad3121 commented Apr 12, 2022

stale bot commented Aug 12, 2022

stale bot commented Sep 20, 2022

[Bug] Unable to succeed in selecting a random port #22352

[Bug] Unable to succeed in selecting a random port #22352

Comments

birgerbr commented Feb 14, 2022

Search before asking

Ray Component

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Anything else

Are you willing to submit a PR?

architkulkarni commented Feb 14, 2022

ckw017 commented Feb 14, 2022 • edited Loading

birgerbr commented Feb 15, 2022

ckw017 commented Feb 16, 2022

birgerbr commented Feb 17, 2022

ckw017 commented Feb 17, 2022 • edited Loading

birgerbr commented Feb 21, 2022

vicyap commented Mar 3, 2022 • edited Loading

ckw017 commented Mar 3, 2022

shrekris-anyscale commented Mar 3, 2022

simon-mo commented Mar 3, 2022

architkulkarni commented Mar 3, 2022

simon-mo commented Mar 3, 2022

architkulkarni commented Mar 3, 2022 • edited Loading

peterhaddad3121 commented Apr 12, 2022

stale bot commented Aug 12, 2022

stale bot commented Sep 20, 2022

ckw017 commented Feb 14, 2022 •

edited

Loading

ckw017 commented Feb 17, 2022 •

edited

Loading

vicyap commented Mar 3, 2022 •

edited

Loading

architkulkarni commented Mar 3, 2022 •

edited

Loading