[Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_ray_serve flaky #650
Thanks, @jasoonn, for your contribution!
Based on #619 (comment), the recreation of worker pods is a bug, so identical PIDs are the expected behavior under ideal conditions.
tests/compatibility-test.py
Outdated
count += 1
if count >= 90:
    raise Exception('failed to run script')
time.sleep(180)
After #635 is merged, we can replace the sleep function with the new wait function.
#635 is merged.
Got it. I will replace the sleep function with the new wait function.
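For illustration, a polling helper along the following lines could stand in for a fixed sleep. This is only a sketch: wait_until and its parameters are hypothetical names, not the wait function added in #635, whose signature is not shown in this thread.

```python
import time


def wait_until(condition, timeout_s=10.0, interval_s=0.5):
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Hypothetical helper for illustration; not the function from #635.
    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Compared with time.sleep(180), a helper like this returns as soon as the condition holds and reports a timeout explicitly instead of always waiting the full duration.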
tests/scripts/test_ray_serve_1.py
Outdated
from ray._private.test_utils import wait_for_condition

ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1])
serve.start(detached=True)
Based on the Ray Serve API doc, serve.start is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.
tests/scripts/test_ray_serve_1.py
Outdated
@serve.deployment
def d(*args):
    return "HelloWorld"
d.deploy()
Based on the Ray Serve API doc, deploy() is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.
tests/scripts/test_ray_serve_1.py
Outdated
def d(*args):
    return "HelloWorld"
d.deploy()
val = ray.get(d.get_handle().remote())
Based on the Ray Serve API doc, get_handle() is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.
tests/scripts/test_ray_serve_1.py
Outdated
import ray.serve as serve
import os
import requests
from ray._private.test_utils import wait_for_condition
Some imported packages are unused in this script.
tests/scripts/test_ray_serve_2.py
Outdated
raise err

retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1]))
retry_with_timeout(lambda: serve.start(detached=True))
DEPRECATED
tests/scripts/test_ray_serve_2.py
Outdated
def d(*args):
    return "HelloWorld"

val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
DEPRECATED
tests/scripts/test_ray_serve_2.py
Outdated
def d(*args):
    return "HelloWorld"

val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
I wonder whether the serve deployment here is the same as the deployment in test_ray_serve_1.py. We can test this by changing return "HelloWorld" to return "123". If val is still equal to HelloWorld, the deployment is probably the same.
After I changed return HelloWorld to return 123 in test_ray_serve_2.py, val is still equal to HelloWorld. I will update the test scripts to remove this kind of ambiguity.
@kevin85421 Thanks for the notification. I updated the PR description to include this information.
668be8b to 8c346eb
Something interesting happened in my previous version of the test scripts. In tests/scripts/test_ray_serve_1.py, I first deployed a model to
This behavior seems a little weird to me. Is it expected? It may be solvable by #647. cc @simon-mo @sihanwang41. Thank you!
I asked @jasoonn to replace the deprecated API (e.g.
@kevin85421 if you think the PR looks good, feel free to approve and merge. We can note the use of deprecated APIs in a separate KubeRay issue if necessary, for later cleanup.
I'm going to merge this.
…ray_serve flaky (ray-project#650) Cleans up RayServe compatibility test. Co-authored-by: Ubuntu <azureuser@Ubuntu.pqxb1uggpgbehcpt4orv5untcb.rx.internal.cloudapp.net>
Why are these changes needed?
Misuse of Docker API
As shown in the following code segment, the existing ray-serve test in compatibility-test.py has the following pattern:
kuberay/tests/compatibility-test.py
Lines 174 to 267 in 3f7b34c
However, this is buggy because the data received from the socket includes STDIN, STDOUT, and STDERR. That is, the received data includes the text sent by the function sendall (STDIN) in the above example. Hence, the condition buf.decode().find('ready') != -1 will always be satisfied by L210, def ready(self):. In addition, STDOUT may also cause bugs, e.g. #617.
Solution
Check exit_code instead.
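The difference between the two checks can be sketched without Docker. The example below is an assumption-laden stand-in: it runs a hypothetical script in a child Python interpreter rather than in a container, and it simulates the echoed STDIN by concatenating the script source with the captured output. SCRIPT and run_script are illustrative names, not code from compatibility-test.py.

```python
import subprocess
import sys

# Hypothetical stand-in for a test script sent to the container. Its source
# contains the word 'ready', but it exits with a failure before printing it.
SCRIPT = """
import sys
def ready():
    return "ready"
sys.exit(1)  # simulate a failure before anything is reported
"""


def run_script(script):
    """Run `script` in a child interpreter and return (exit_code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


exit_code, stdout = run_script(SCRIPT)

# A stream that echoes STDIN contains the script source itself, so a
# substring search for 'ready' matches even though the script failed.
echoed_stream = SCRIPT + stdout
false_positive = echoed_stream.find("ready") != -1

# The exit code reflects what actually happened.
succeeded = exit_code == 0
```

Here the substring check reports a match while the exit code reports failure, which is the mismatch described above.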
Misunderstanding of Ray HA
Originally, the test was defined as follows. First, the test deploys a model on Ray Serve that simply returns its PID. Then, it gets the PID from the deployed model. Next, it kills the GCS server and waits for the new head pod to be ready. Finally, it connects to the deployed model again and compares the PIDs. Check this for more details.
kuberay/tests/compatibility-test.py
Lines 242 to 247 in 3f7b34c
Ideally, the same PIDs are expected. However, there is currently a bug: when the GCS server on the old head pod is killed, the head pod and all worker pods are recreated. Hence, the PIDs might differ, which causes the test to fail.
Solution
Design new test scripts for ray serve.
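The new script tests/scripts/test_ray_serve_2.py wraps its calls in retry_with_timeout, but the helper's definition is not shown in this thread. A minimal sketch of what such a helper might look like follows; the timeout_s and interval_s parameters and their defaults are assumptions, not the values used in the PR.

```python
import time


def retry_with_timeout(func, timeout_s=90.0, interval_s=1.0):
    """Call `func` repeatedly until it stops raising or `timeout_s` elapses.

    Hypothetical sketch; the helper in the PR may differ. Returns the value
    of the first successful call and re-raises the last error on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return func()
        except Exception as err:
            if time.monotonic() >= deadline:
                raise err
            time.sleep(interval_s)
```

Retrying matters here because the cluster may be briefly unreachable while the head pod is coming back, so a single failed connection attempt does not necessarily mean the test has failed.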
Explanations for some changes
Despite running a serve application in tests/scripts/test_ray_serve_1.py, the serve instance is gone after killing the gcs_server process. It reports the error:
ray.serve.exceptions.RayServeException: There is no instance running on this Ray cluster. Please call `serve.start(detached=True) to start one.
when I tried to query the application. Hence, the script needs to initialize the serve instance again, and the previously deployed application then appears in the newly initialized serve instance.
Related issue number
Closes #621
Checks