
[Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_ray_serve flaky #650

Merged: 5 commits into ray-project:master on Nov 3, 2022

Conversation

@jasoonn (Contributor) commented Oct 23, 2022

Why are these changes needed?

Misuse of Docker API

As shown in the following code segment, the existing ray-serve test in compatibility-test.py has the following pattern:

  • Create a container that launches a Python REPL process via the Docker API.
  • Attach a socket to the container with params={'stdin': 1, 'stream': 1, 'stdout': 1, 'stderr': 1}.
  • Use the socket to interact with the Python REPL.
  • Use a while loop to check the messages received from the socket.
```python
def test_ray_serve(self):
    client = docker.from_env()
    container = client.containers.run(ray_image, remove=True, detach=True, stdin_open=True, tty=True,
                                      network_mode='host', command=["/bin/sh", "-c", "python"])
    s = container.attach_socket(
        params={'stdin': 1, 'stream': 1, 'stdout': 1, 'stderr': 1})
    s._sock.setblocking(0)
    s._sock.sendall(b'''
import ray
import time
import ray.serve as serve
import os
import requests
from ray._private.test_utils import wait_for_condition

def retry_with_timeout(func, count=90):
    tmp = 0
    err = None
    while tmp < count:
        try:
            return func()
        except Exception as e:
            err = e
            tmp += 1
    assert err is not None
    raise err

ray.init(address='ray://127.0.0.1:10001')

@serve.deployment
def d(*args):
    return f"{os.getpid()}"

d.deploy()
pid1 = ray.get(d.get_handle().remote())
print('ready')
''')
    count = 0
    while count < 90:
        try:
            buf = s._sock.recv(4096)
            logger.info(buf.decode())
            if buf.decode().find('ready') != -1:
                break
        except Exception as e:
            pass
        time.sleep(1)
        count += 1
    if count >= 90:
        raise Exception('failed to run script')

    # kill the gcs on head node. If fate sharing is enabled
    # the whole head node pod will terminate.
    utils.shell_assert_success(
        'kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \\$2}") -- /bin/bash -c "ps aux | grep gcs_server | grep -v grep | awk \'{print \$2}\' | xargs kill"')
    # wait for new head node getting created
    time.sleep(10)
    # make sure the new head is ready
    utils.shell_assert_success(
        'kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s')

    s._sock.sendall(b'''
def get_new_value():
    return ray.get(d.get_handle().remote())
pid2 = retry_with_timeout(get_new_value)

if pid1 == pid2:
    print('successful: {} {}'.format(pid1, pid2))
    sys.exit(0)
else:
    print('failed: {} {}'.format(pid1, pid2))
    raise Exception('failed')
''')
    count = 0
    while count < 90:
        try:
            buf = s._sock.recv(4096)
            logger.info(buf.decode())
            if buf.decode().find('successful') != -1:
                break
            if buf.decode().find('failed') != -1:
                raise Exception('test failed {}'.format(buf.decode()))
        except Exception as e:
            pass
        time.sleep(1)
        count += 1
    if count >= 90:
        raise Exception('failed to run script')
    container.stop()
    client.close()
```

However, this is buggy because the data received from the socket includes STDIN, STDOUT, and STDERR. That is, the received data includes the script sent by `sendall` (STDIN) in the example above. Hence, the condition `buf.decode().find('ready') != -1` will always be fulfilled (by `def ready(self):` at L210). In addition, STDOUT may also cause some bugs, e.g. #617.

Solution

Check exit_code instead.
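For illustration, here is a minimal sketch of an exit-code-based check using the Docker SDK; the script name test_ray_serve_1.py, the ray_image variable, and the way the script gets into the container are assumptions for this sketch, not necessarily how the merged test does it:

```python
import docker

client = docker.from_env()
# Keep a long-running container alive so the test script can be exec'd in it.
# `ray_image` is the image variable used elsewhere in compatibility-test.py.
container = client.containers.run(
    ray_image, remove=True, detach=True, tty=True, network_mode='host',
    command=["/bin/sh", "-c", "tail -f /dev/null"])

# Assumes the test script has already been copied into the container.
# exec_run returns (exit_code, output); rely on the exit code instead of
# scraping 'ready'/'successful' strings from an attached socket.
exit_code, output = container.exec_run(["python", "test_ray_serve_1.py"])
print(output.decode())
assert exit_code == 0, "test script failed with exit code {}".format(exit_code)

container.stop()
client.close()
```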

Misunderstanding of Ray HA

Originally, the test was defined as follows. First, it deploys a model on Ray Serve that simply returns its PID. Then, it gets the PID from the deployed model. Next, it kills the GCS server and waits for the new head pod to become ready. Finally, it connects to the deployed model again and compares the two PIDs. Check this for more details.

```python
if pid1 == pid2:
    print('successful: {} {}'.format(pid1, pid2))
    sys.exit(0)
else:
    print('failed: {} {}'.format(pid1, pid2))
    raise Exception('failed')
```

Ideally, the two PIDs are expected to be the same. However, there is currently a bug: when the GCS server on the old head pod is killed, the head pod and all worker pods are recreated. Hence, the PID may differ, which causes the test to fail.

Solution

Design new test scripts for ray serve.

Explanations for some changes

  • Initialize a Serve instance again in tests/scripts/test_ray_serve_2.py
    Although a Serve application is deployed in tests/scripts/test_ray_serve_1.py, the Serve instance is gone after the gcs_server process is killed. Querying the application then reports the error ray.serve.exceptions.RayServeException: There is no instance running on this Ray cluster. Please call `serve.start(detached=True)` to start one. Hence, test_ray_serve_2.py needs to initialize the Serve instance again, and the previously deployed application will appear in the newly initialized Serve instance (see the sketch below).
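For orientation, here is a rough sketch of what tests/scripts/test_ray_serve_2.py does, reconstructed from the snippets quoted in the review comments below (the retry helper is simplified here); the merged file may differ in details:

```python
import sys
import time
import ray
import ray.serve as serve

def retry_with_timeout(func, count=90):
    # Retry `func` up to `count` times, raising the last error if it never succeeds.
    err = None
    for _ in range(count):
        try:
            return func()
        except Exception as e:
            err = e
            time.sleep(1)
    raise err

# Reconnect to the restarted cluster and re-initialize the Serve instance;
# the application deployed in test_ray_serve_1.py re-appears in the new instance.
retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1]))
retry_with_timeout(lambda: serve.start(detached=True))

@serve.deployment
def d(*args):
    return "HelloWorld"

# Query the deployment; the handle resolves to the application from test_ray_serve_1.py.
val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
print(val)  # expected: "HelloWorld"
```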

Related issue number

Closes #621

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 self-requested a review on October 23, 2022 04:12
@kevin85421 (Member) left a comment:

Thanks @jasoonn for your contribution!

Based on #619 (comment), worker pods being recreated is a bug, so identical PIDs are the expected behavior in the ideal case.

```python
    count += 1
if count >= 90:
    raise Exception('failed to run script')
time.sleep(180)
```
Member:
After #635 is merged, we can replace the sleep function with the new wait function.

Member:
#635 is merged.

Contributor Author:
Got it. I will replace the sleep function with the new wait function.

```python
from ray._private.test_utils import wait_for_condition

ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1])
serve.start(detached=True)
```
Member:
Based on the Ray Serve API doc, serve.start is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.

```python
@serve.deployment
def d(*args):
    return "HelloWorld"
d.deploy()
```
Member:
Based on the Ray Serve API doc, deploy() is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.

```python
def d(*args):
    return "HelloWorld"
d.deploy()
val = ray.get(d.get_handle().remote())
```
Member:
Based on the Ray Serve API doc, get_handle() is deprecated and may be removed in future Ray releases. Maybe we need to use other functions.
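For reference, a rough sketch of the non-deprecated pattern as of the Ray 2.0-era Serve API (this is an illustration, not code from this PR; later Ray releases may differ again):

```python
import ray
from ray import serve

@serve.deployment
def d(*args):
    return "HelloWorld"

# serve.run deploys the bound application and returns a handle,
# replacing the deprecated d.deploy() / d.get_handle() pattern.
handle = serve.run(d.bind())
val = ray.get(handle.remote())
assert val == "HelloWorld"
```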

```python
import ray.serve as serve
import os
import requests
from ray._private.test_utils import wait_for_condition
```
Member:
Some imported packages are unused in this script.

```python
    raise err

retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', namespace=sys.argv[1]))
retry_with_timeout(lambda: serve.start(detached=True))
```
Member:
DEPRECATED

```python
def d(*args):
    return "HelloWorld"

val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
```
Member:
DEPRECATED

```python
def d(*args):
    return "HelloWorld"

val = retry_with_timeout(lambda: ray.get(d.get_handle().remote()))
```
Member:
I am not sure whether this serve deployment is the same as the deployment in test_ray_serve_1.py. We can test it by updating return "HelloWorld" to return "123". If val is still equal to HelloWorld, the deployment is probably the same.

@jasoonn (Contributor Author) commented Oct 24, 2022:
After I updated return "HelloWorld" to return "123" in test_ray_serve_2.py, val was still equal to HelloWorld. I will update the test scripts to remove this kind of ambiguity.

@jasoonn (Contributor Author) commented Oct 23, 2022:

> Thanks @jasoonn for your contribution!
>
> Based on #619 (comment), worker pods being recreated is a bug, so identical PIDs are the expected behavior in the ideal case.

@kevin85421 Thanks for the notification. I have updated the PR description to reflect this information.
As for the deprecated APIs, I will try to substitute them with other APIs.

@jasoonn (Contributor Author) commented Oct 24, 2022:

  1. I updated tests/scripts/test_ray_serve_1.py to remove the deprecated API call serve.start(detached=True). However, the Serve instance is gone after the GCS server is killed, so tests/scripts/test_ray_serve_2.py still needs to call serve.start(detached=True) to initialize the Serve instance.
  2. For the get_handle() API in tests/scripts/test_ray_serve_2.py, it is possible to substitute the API call with an HTTP request to the Serve application (on port 8000 in the head pod). However, this requires extra setup to expose the head pod's port, or running the shell command kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \$2}") -- /bin/bash -c "wget 127.0.0.1:8000 -o wgetLog -O rayServeTest && cat rayServeTest" to get the output. For simplicity, I chose to use the get_handle API here; a rough sketch of the HTTP alternative is shown below.
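For illustration, the HTTP alternative mentioned in item 2 might look roughly like this, assuming the deployment named d is exposed under the default route prefix /d and that port 8000 on the head pod is reachable from where the script runs (which, as noted above, would require extra setup):

```python
import requests

# Hypothetical HTTP-based check that avoids the deprecated get_handle() API:
# query the Serve application over HTTP on the head pod.
resp = requests.get("http://127.0.0.1:8000/d", timeout=10)
assert resp.status_code == 200
print(resp.text)  # expected: "HelloWorld"
```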

@jasoonn (Contributor Author) commented Oct 25, 2022:

There is an interesting thing happening in my previous version of the test scripts. In tests/scripts/test_ray_serve_1.py, I first deployed a model that returns "HelloWorld". Then, in tests/scripts/test_ray_serve_2.py, I declared another model d and used d.get_handle() to get the handle of the deployed model. To my surprise, this returns the handle of the model from test_ray_serve_1.py instead of the model in test_ray_serve_2.py. To verify this, I updated return "HelloWorld" to return "123" in test_ray_serve_2.py, and the val I got in test_ray_serve_2.py was still equal to HelloWorld.

@kevin85421 (Member) commented:

> There is an interesting thing happening in my previous version of the test scripts. In tests/scripts/test_ray_serve_1.py, I first deployed a model that returns "HelloWorld". Then, in tests/scripts/test_ray_serve_2.py, I declared another model d and used d.get_handle() to get the handle of the deployed model. To my surprise, this returns the handle of the model from test_ray_serve_1.py instead of the model in test_ray_serve_2.py. To verify this, I updated return "HelloWorld" to return "123" in test_ray_serve_2.py, and the val I got in test_ray_serve_2.py was still equal to HelloWorld.

This behavior seems a little weird to me. Is it expected behavior? It might be solved by #647. cc @simon-mo @sihanwang41. Thank you!

@kevin85421 (Member) commented:

> 1. I updated tests/scripts/test_ray_serve_1.py to remove the deprecated API call serve.start(detached=True). However, the Serve instance is gone after the GCS server is killed, so tests/scripts/test_ray_serve_2.py still needs to call serve.start(detached=True) to initialize the Serve instance.
> 2. For the get_handle() API in tests/scripts/test_ray_serve_2.py, it is possible to substitute the API call with an HTTP request to the Serve application (on port 8000 in the head pod). However, this requires extra setup to expose the head pod's port, or running the shell command kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \$2}") -- /bin/bash -c "wget 127.0.0.1:8000 -o wgetLog -O rayServeTest && cat rayServeTest" to get the output. For simplicity, I chose to use the get_handle API here.

I asked @jasoonn to replace the deprecated APIs (e.g. serve.start, get_handle, deploy) with the new APIs. However, in this case, we cannot avoid using serve.start and get_handle. Is there any way to avoid them? Thanks! @simon-mo @sihanwang41

@DmitriGekhtman (Collaborator) commented:

@kevin85421 if you think the PR looks good, feel free to approve and merge.

We can note the use of deprecated APIs in a separate KubeRay issue if necessary, for later cleanup.
You can also open Ray issues for any unexpected Serve behavior.

@DmitriGekhtman added this to the v0.4.0 release milestone on Nov 3, 2022
@DmitriGekhtman (Collaborator) commented:

I'm going to merge this.
@kevin85421 should feel free to sync with @sihanwang41 and/or @simon-mo to discuss if there are Ray-Serve-related details to refine.

@DmitriGekhtman merged commit 1ab5a00 into ray-project:master on Nov 3, 2022