Failure to connect to Redis when running in a docker container #11943

Closed
roireshef opened this issue Nov 11, 2020 · 9 comments · Fixed by #11944
Labels
bug (Something that is supposed to be working; but isn't)
P1 (Issue that should be fixed within a few weeks)

Comments

roireshef (Contributor) commented Nov 11, 2020

When running in a docker container, address resolution yields the docker container's IP rather than the physical machine's IP, which produces the following traceback (*** marks fields hidden for security reasons):

2020-11-11 14:48:28,960	INFO worker.py:672 -- Connecting to existing Ray cluster at address: ***:***
2020-11-11 14:48:28,968	WARNING services.py:218 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-11-11 14:48:29,977	WARNING services.py:218 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-11-11 14:48:30,986	WARNING services.py:218 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-11-11 14:48:31,996	WARNING services.py:218 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-11-11 14:48:33,005	WARNING services.py:218 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "***", line 39, in <module>
    ray.init(address='***:***', _redis_password='***')
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 779, in init
    connect_only=True)
  File "/usr/local/lib/python3.6/dist-packages/ray/node.py", line 179, in __init__
    redis_password=self.redis_password))
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/services.py", line 211, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/services.py", line 194, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
RuntimeError: Redis has started but no raylets have registered yet.
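
For context, a minimal, illustrative sketch of the kind of self-IP detection that goes wrong here. It uses the common UDP-socket trick (similar in spirit to the node IP detection in ray._private/services.py, but not the exact Ray code); inside a container on Docker's default bridge network it returns the container's virtual IP:

import socket

def resolve_self_ip():
    # connect() on a UDP socket sends no packets; it only picks the local
    # interface the kernel would route through to reach the given address.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()

# On the host this prints the machine's LAN IP; inside a container on the
# default bridge network it typically prints something like 172.17.0.x,
# which other machines cannot reach, so the driver looks itself up in Redis
# under an address no raylet ever registered.
print(resolve_self_ip())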

Ray version and other system information (Python version, TensorFlow version, OS):
Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only
on both latest master and releases/1.0.1

  • Running in a docker container

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

ray start --block --head --port=*** --redis-password=*** --node-ip-address=*** --gcs-server-port=6005 --dashboard-port=6006 --node-manager-port=6007 --object-manager-port=6008 --redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 --min-worker-port=6100 --max-worker-port=6299 --include-dashboard=false

then in python:
import ray
ray.init(address='***:***', _redis_password='***')

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
roireshef added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Nov 11, 2020
roireshef (Contributor, Author) commented:

Looks like the root cause is in https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L187
I'll open a PR for that.

rkooo567 added the autoscaler and P1 (Issue that should be fixed within a few weeks) labels and removed the triage (Needs triage (eg: priority, bug/not-bug, and owning component)) label on Nov 11, 2020
ijrsvt (Contributor) commented Nov 11, 2020

@roireshef Quick question, what networking options are you applying to the Docker container?

roireshef (Contributor, Author) commented Nov 12, 2020

> @roireshef Quick question, what networking options are you applying to the Docker container?

I'm passing to Ray all the port configurations in https://docs.ray.io/en/master/configure.html#ports-configurations manually and making sure docker is set up with port forwarding for all of those ports, if that's what you are asking. Could you please verify this list is complete?

If not, please be more specific about which parameters you want to know about.
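
For reference, the docker invocation I mean is roughly along these lines (illustrative only: the image name is a placeholder, the published ports mirror the ray start flags from the reproduction above, and the head node's Redis port, hidden as *** above, would need to be published as well):

docker run --rm -it \
  -p 6005:6005 -p 6006:6006 -p 6007:6007 -p 6008:6008 \
  -p 6100-6299:6100-6299 \
  -p 6400-6409:6400-6409 \
  my-ray-image:latest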

roireshef (Contributor, Author) commented Nov 12, 2020

Please be aware that running hostname --ip-address inside docker gives a different result (the docker container's virtual IP) than running the same command outside docker on the host machine (which returns the machine's IP). The same happens when calling address resolution in _private/services.py - the docker IP is returned, which is invalid to use from a different machine, for example. I'm not sure why there is any motivation to call address resolution if the user already specifies the node IP address manually, but it might all be related.
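
To illustrate the difference (the addresses below are only example values, not taken from a real machine):

# On the host machine:
hostname --ip-address                            # e.g. 10.0.0.12  (the machine's real IP)

# Inside the container (default bridge networking), e.g. via docker exec:
docker exec <container> hostname --ip-address    # e.g. 172.17.0.2  (the container's virtual IP)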

ijrsvt (Contributor) commented Nov 12, 2020

> I'm passing to Ray all the port configurations in https://docs.ray.io/en/master/configure.html#ports-configurations manually and making sure docker is set up with port forwarding for all of those ports, if that's what you are asking. Could you please verify this list is complete?

@roireshef That is what I'm looking for. I've usually run Docker with --net=host. Does using that option avoid this problem?

ijrsvt (Contributor) commented Nov 12, 2020

> I'm not sure why there is any motivation to call address resolution if the user already specifies the node IP address manually, but it might all be related.

cc @ericl Can you provide any context here?

roireshef (Contributor, Author) commented Nov 12, 2020

> I'm passing to Ray all the port configurations in https://docs.ray.io/en/master/configure.html#ports-configurations manually and making sure docker is set up with port forwarding for all of those ports, if that's what you are asking. Could you please verify this list is complete?

> @roireshef That is what I'm looking for. I've usually run Docker with --net=host. Does using that option avoid this problem?

Hi @ijrsvt, assuming --net=host is a flag you're passing to the docker run command, I was already doing that, so no, it doesn't solve the problem. BTW, even if you run docker with this flag, calling hostname --ip-address from inside the container won't give you the node's external IP, only the docker virtual IP, which is not the IP that should be used.

roireshef (Contributor, Author) commented Nov 19, 2020

> I'm not sure why there is any motivation to call address resolution if the user already specifies the node IP address manually, but it might all be related.

> cc @ericl Can you provide any context here?

@ericl, @ijrsvt - should we just propagate node-ip-address when it's provided by the user, to avoid these issues? I'm not sure how to do that myself, otherwise I would have opened a PR for it... unless I'm missing some conflicting use case. A rough sketch of the idea is below.
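
For illustration only; the function and parameter names here are hypothetical placeholders, not the actual services.py code or the change made in the linked PR:

import socket

def choose_node_ip(user_supplied_ip=None):
    # Hypothetical sketch: if the user already passed an explicit node IP
    # (e.g. via --node-ip-address or ray.init), propagate it as-is instead of
    # auto-detecting, which inside a Docker container yields the container IP.
    if user_supplied_ip:
        return user_supplied_ip
    # Fallback: plain hostname-based detection (one of several possible methods).
    return socket.gethostbyname(socket.gethostname())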

defangc23 commented:
Encountering the same issue: Ray docker containers cannot find each other.
