
[Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_ray_serve flaky #650

Merged: 5 commits into ray-project:master on Nov 3, 2022

Conversation

@jasoonn (Contributor) commented Oct 23, 2022

Why are these changes needed?

Misuse of Docker API

As shown in the following code segment, the existing ray-serve test in compatibility-test.py has the following pattern:

  • Create a container that launches a Python REPL process via the Docker API.
  • Attach a socket to the container with params={'stdin': 1, 'stream': 1, 'stdout': 1, 'stderr': 1}.
  • Use the socket to interact with the Python REPL.
  • Use a while loop to check the messages received from the socket.
```python
def test_ray_serve(self):
    client = docker.from_env()
    container = client.containers.run(ray_image, remove=True, detach=True, stdin_open=True, tty=True,
                                      network_mode='host', command=["/bin/sh", "-c", "python"])
    s = container.attach_socket(
        params={'stdin': 1, 'stream': 1, 'stdout': 1, 'stderr': 1})
    s._sock.setblocking(0)
    s._sock.sendall(b'''
import ray
import time
import ray.serve as serve
import os
import requests
from ray._private.test_utils import wait_for_condition

def retry_with_timeout(func, count=90):
    tmp = 0
    err = None
    while tmp < count:
        try:
            return func()
        except Exception as e:
            err = e
            tmp += 1
    assert err is not None
    raise err

ray.init(address='ray://127.0.0.1:10001')

@serve.deployment
def d(*args):
    return f"{os.getpid()}"

d.deploy()
pid1 = ray.get(d.get_handle().remote())
print('ready')
''')
    count = 0
    while count < 90:
        try:
            buf = s._sock.recv(4096)
            logger.info(buf.decode())
            if buf.decode().find('ready') != -1:
                break
        except Exception as e:
            pass
        time.sleep(1)
        count += 1
    if count >= 90:
        raise Exception('failed to run script')

    # kill the gcs on head node. If fate sharing is enabled
    # the whole head node pod will terminate.
    utils.shell_assert_success(
        'kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \\$2}") -- /bin/bash -c "ps aux | grep gcs_server | grep -v grep | awk \'{print \$2}\' | xargs kill"')
    # wait for new head node getting created
    time.sleep(10)
    # make sure the new head is ready
    utils.shell_assert_success(
        'kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s')

    s._sock.sendall(b'''
def get_new_value():
    return ray.get(d.get_handle().remote())
pid2 = retry_with_timeout(get_new_value)

if pid1 == pid2:
    print('successful: {} {}'.format(pid1, pid2))
    sys.exit(0)
else:
    print('failed: {} {}'.format(pid1, pid2))
    raise Exception('failed')
''')
    count = 0
    while count < 90:
        try:
            buf = s._sock.recv(4096)
            logger.info(buf.decode())
            if buf.decode().find('successful') != -1:
                break
            if buf.decode().find('failed') != -1:
                raise Exception('test failed {}'.format(buf.decode()))
        except Exception as e:
            pass
        time.sleep(1)
        count += 1
    if count >= 90:
        raise Exception('failed to run script')
    container.stop()
    client.close()
```

However, this is buggy because the data received from the socket includes STDIN, STDOUT, and STDERR. That is, the received data includes the script sent by `sendall` (STDIN) in the example above. Hence, the condition `buf.decode().find('ready') != -1` will always be fulfilled (by `def ready(self):` at L210). In addition, STDOUT may also cause some bugs, e.g. #617.

Solution

Check exit_code instead.
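For illustration, here is a minimal sketch of an exit-code-based check using the Docker SDK; the script name test_ray_serve_1.py, the ray_image variable, and the way the script gets into the container are assumptions for this sketch, not necessarily how the merged test does it:

```python
import docker

client = docker.from_env()
# Keep a long-running container alive so the test script can be exec'd in it.
# `ray_image` is the image variable used elsewhere in compatibility-test.py.
container = client.containers.run(
    ray_image, remove=True, detach=True, tty=True, network_mode='host',
    command=["/bin/sh", "-c", "tail -f /dev/null"])

# Assumes the test script has already been copied into the container.
# exec_run returns (exit_code, output); rely on the exit code instead of
# scraping 'ready'/'successful' strings from an attached socket.
exit_code, output = container.exec_run(["python", "test_ray_serve_1.py"])
print(output.decode())
assert exit_code == 0, "test script failed with exit code {}".format(exit_code)

container.stop()
client.close()
```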

Misunderstanding of Ray HA

Originally, the test was defined as follows. First, it deploys a model on Ray Serve that simply returns its PID. Then, it gets the PID from the deployed model. Next, it kills the GCS server and waits for the new head pod to become ready. Finally, it connects to the deployed model again and compares the two PIDs. Check this for more details.

```python
if pid1 == pid2:
    print('successful: {} {}'.format(pid1, pid2))
    sys.exit(0)
else:
    print('failed: {} {}'.format(pid1, pid2))
    raise Exception('failed')
```

Ideally, the two PIDs are expected to be the same. However, there is currently a bug: when the GCS server on the old head pod is killed, the head pod and all worker pods are recreated. Hence, the PID may differ, which causes the test to fail.

Solution

Design new test scripts for ray serve.

Explanations for some changes

  • Initialize a Serve instance again in tests/scripts/test_ray_serve_2.py
    Although a Serve application is deployed in tests/scripts/test_ray_serve_1.py, the Serve instance is gone after the gcs_server process is killed. Querying the application then reports the error ray.serve.exceptions.RayServeException: There is no instance running on this Ray cluster. Please call `serve.start(detached=True)` to start one. Hence, test_ray_serve_2.py needs to initialize the Serve instance again, and the previously deployed application will appear in the newly initialized Serve instance (see the sketch below).
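For orientation, here is a rough sketch of what tests/scripts/test_ray_serve_2.py does, reconstructed from the snippets quoted in the review comments below (the retry helper is simplified here); the merged file may differ in details:

```python
import sys
import time
import ray
import ray.serve as serve

def retry_with_timeout(func, count=90):
    # Retry `func` up to `count` times, raising the last error if it never succeeds.
    err = None
    for _ in range(count):
        try:
            return func()
        except Exception as e:
            err = e
            time.sleep(1)
    raise err

# Reconnect to the restarted cluster and re-initialize the Serve instance;
# the application deployed in test_ray_serve_1.py re-appears in the new instance.
retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1]))
retry_with_timeout(lambda: serve.start(detached=True))

@serve.deployment
def d(*args):
    return "HelloWorld"

# Query the deployment; the handle resolves to the application from test_ray_serve_1.py.
val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
print(val)  # expected: "HelloWorld"
```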

Related issue number

Closes #621

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 self-requested a review on October 23, 2022 04:12
@kevin85421 (Member) left a comment:

Thanks @jasoonn for your contribution!

Based on #619 (comment), worker pods being recreated is a bug, so identical PIDs are the expected behavior in the ideal case.

```python
    count += 1
if count >= 90:
    raise Exception('failed to run script')
time.sleep(180)
```
Member:
After #635 is merged, we can replace the sleep function with the new wait function.

Member:
#635 is merged.

Contributor Author:
Got it. I will replace the sleep function with the new wait function.

```python
from ray._private.test_utils import wait_for_condition

ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1])
serve.start(detached=True)
```
Member:
Based on the Ray Serve API doc, serve.start is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.

```python
@serve.deployment
def d(*args):
    return "HelloWorld"
d.deploy()
```
Member:
Based on the Ray Serve API doc, deploy() is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.

```python
def d(*args):
    return "HelloWorld"
d.deploy()
val = ray.get(d.get_handle().remote())
```
Member:
Based on the Ray Serve API doc, get_handle() is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.
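For reference, a rough sketch of the non-deprecated pattern as of the Ray 2.0-era Serve API (this is an illustration, not code from this PR; later Ray releases may differ again):

```python
import ray
from ray import serve

@serve.deployment
def d(*args):
    return "HelloWorld"

# serve.run deploys the bound application and returns a handle,
# replacing the deprecated d.deploy() / d.get_handle() pattern.
handle = serve.run(d.bind())
val = ray.get(handle.remote())
assert val == "HelloWorld"
```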

```python
import ray.serve as serve
import os
import requests
from ray._private.test_utils import wait_for_condition
```
Member:
Some imported packages are unused in this script.

```python
    raise err

retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1]))
retry_with_timeout(lambda: serve.start(detached=True))
```
Member:
DEPRECATED

```python
def d(*args):
    return "HelloWorld"

val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
```
Member:
DEPRECATED

```python
def d(*args):
    return "HelloWorld"

val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
```
Member:
I am not sure whether this serve deployment is the same as the deployment in test_ray_serve_1.py. We can test it by updating return "HelloWorld" to return "123". If val is still equal to HelloWorld, the deployment is probably the same.

@jasoonn (Contributor Author) commented Oct 24, 2022:
After I updated return "HelloWorld" to return "123" in test_ray_serve_2.py, val was still equal to HelloWorld. I will update the test scripts to remove this kind of ambiguity.

@jasoonn (Contributor Author) commented Oct 23, 2022:

> Thanks @jasoonn for your contribution!
>
> Based on #619 (comment), worker pods being recreated is a bug, so identical PIDs are the expected behavior in the ideal case.

@kevin85421 Thanks for the notification. I have updated the PR description to reflect this information.
As for the deprecated APIs, I will try to substitute them with other APIs.

@jasoonn (Contributor Author) commented Oct 24, 2022:

  1. I updated tests/scripts/test_ray_serve_1.py to remove the deprecated API call serve.start(detached=True). However, the Serve instance is gone after the GCS server is killed, so tests/scripts/test_ray_serve_2.py still needs to call serve.start(detached=True) to initialize the Serve instance.
  2. For the get_handle() API in tests/scripts/test_ray_serve_2.py, it is possible to substitute the API call with an HTTP request to the Serve application (on port 8000 in the head pod). However, this requires extra setup to expose the head pod's port, or running the shell command kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \$2}") -- /bin/bash -c "wget 127.0.0.1:8000 -o wgetLog -O rayServeTest && cat rayServeTest" to get the output. For simplicity, I chose to use the get_handle API here; a rough sketch of the HTTP alternative is shown below.
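For illustration, the HTTP alternative mentioned in item 2 might look roughly like this, assuming the deployment named d is exposed under the default route prefix /d and that port 8000 on the head pod is reachable from where the script runs (which, as noted above, would require extra setup):

```python
import requests

# Hypothetical HTTP-based check that avoids the deprecated get_handle() API:
# query the Serve application over HTTP on the head pod.
resp = requests.get("http://127.0.0.1:8000/d", timeout=10)
assert resp.status_code == 200
print(resp.text)  # expected: "HelloWorld"
```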

@jasoonn (Contributor Author) commented Oct 25, 2022:

There is an interesting thing happening in my previous version of the test scripts. In tests/scripts/test_ray_serve_1.py, I first deployed a model that returns "HelloWorld". Then, in tests/scripts/test_ray_serve_2.py, I declared another model d and used d.get_handle() to get the handle of the deployed model. To my surprise, this returns the handle of the model from test_ray_serve_1.py instead of the model in test_ray_serve_2.py. To verify this, I updated return "HelloWorld" to return "123" in test_ray_serve_2.py, and the val I got in test_ray_serve_2.py was still equal to HelloWorld.

@kevin85421 (Member) commented:

> There is an interesting thing happening in my previous version of the test scripts. In tests/scripts/test_ray_serve_1.py, I first deployed a model that returns "HelloWorld". Then, in tests/scripts/test_ray_serve_2.py, I declared another model d and used d.get_handle() to get the handle of the deployed model. To my surprise, this returns the handle of the model from test_ray_serve_1.py instead of the model in test_ray_serve_2.py. To verify this, I updated return "HelloWorld" to return "123" in test_ray_serve_2.py, and the val I got in test_ray_serve_2.py was still equal to HelloWorld.

This behavior seems a little weird to me. Is it expected behavior? It might be solved by #647. cc @simon-mo @sihanwang41. Thank you!

@kevin85421 (Member) commented:

> 1. I updated tests/scripts/test_ray_serve_1.py to remove the deprecated API call serve.start(detached=True). However, the Serve instance is gone after the GCS server is killed, so tests/scripts/test_ray_serve_2.py still needs to call serve.start(detached=True) to initialize the Serve instance.
> 2. For the get_handle() API in tests/scripts/test_ray_serve_2.py, it is possible to substitute the API call with an HTTP request to the Serve application (on port 8000 in the head pod). However, this requires extra setup to expose the head pod's port, or running the shell command kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \$2}") -- /bin/bash -c "wget 127.0.0.1:8000 -o wgetLog -O rayServeTest && cat rayServeTest" to get the output. For simplicity, I chose to use the get_handle API here.

I asked @jasoonn to replace the deprecated APIs (e.g. serve.start, get_handle, deploy) with the new APIs. However, in this case, we cannot avoid using serve.start and get_handle. Is there any way to avoid them? Thanks! @simon-mo @sihanwang41

@DmitriGekhtman (Collaborator) commented:

@kevin85421 if you think the PR looks good, feel free to approve and merge.

We can note the use of deprecated APIs in a separate KubeRay issue if necessary, for later cleanup.
You can also open Ray issues for any unexpected Serve behavior.

@DmitriGekhtman added this to the v0.4.0 release milestone on Nov 3, 2022
@DmitriGekhtman (Collaborator) commented:

I'm going to merge this.
@kevin85421 should feel free to sync with @sihanwang41 and/or @simon-mo to discuss if there are Ray-Serve-related details to refine.

@DmitriGekhtman merged commit 1ab5a00 into ray-project:master on Nov 3, 2022