[Feature] Refactor test framework & test kuberay-operator chart with configuration framework (ray-project#759)

Refactors for the integration tests:

Test the operator chart: This PR uses the kuberay-operator Helm chart to install the KubeRay operator, so the chart itself is now exercised by the integration tests.
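As a minimal sketch only — the chart path, release name, and values are assumptions, and the real OperatorManager in framework/prototype.py may install the chart differently — installing the operator from the in-repo chart could look like this:

```python
import subprocess

# Hypothetical example: install the KubeRay operator from the local Helm chart,
# pinning the operator image that CI built for this commit.
subprocess.run(
    'helm install kuberay-operator helm-chart/kuberay-operator '
    '--set image.repository=kuberay/operator --set image.tag=nightly',
    shell=True, check=True,
)
```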

Refactor: class CONST and class KubernetesClusterManager should arguably be singletons. However, the singleton pattern is often discouraged, so we should weigh the trade-offs carefully before converting these two classes (a minimal sketch of the pattern follows).
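A minimal sketch of the __new__-based singleton pattern being discussed; this is illustrative only, and neither class is converted in this PR:

```python
class KubernetesClusterManager:
    """Illustrative singleton: every test shares one manager and its cached state."""
    _instance = None

    def __new__(cls):
        # Return the same instance on every construction attempt.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

assert KubernetesClusterManager() is KubernetesClusterManager()
```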

Refactor: Replace os.system with subprocess (a sketch of the replacement follows the quote). The following paragraph is from Python's official documentation for os.system:

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.
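A minimal sketch of the replacement, assuming a shell_subprocess_run-style helper like the one in framework/prototype.py (the exact signature there may differ):

```python
import subprocess

def shell_subprocess_run(cmd: str, check: bool = True) -> int:
    """Run a shell command; raise CalledProcessError on failure when check=True."""
    return subprocess.run(cmd, shell=True, check=check).returncode

# Replaces calls such as os.system('kubectl get pods -A'):
shell_subprocess_run('kubectl get pods -A')
shell_subprocess_run('kubectl wait --for=condition=ready pod --all --timeout=900s', check=False)
```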

Skip test_kill_head due to:

[Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed. ray-project#638
[Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed. ray-project#634
Refactor: Replace all existing Kubernetes API clients with the shared K8S_CLUSTER_MANAGER.
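For illustration only — the attribute and method names on K8S_CLUSTER_MANAGER are assumptions — the idea is to load kubeconfig once and hand out a cached client instead of each test constructing its own:

```python
from kubernetes import client, config

class KubernetesClusterManager:
    """Hypothetical sketch of a shared manager that caches Kubernetes API clients."""
    def __init__(self):
        self._core_v1_api = None

    def core_v1_api(self) -> client.CoreV1Api:
        if self._core_v1_api is None:
            config.load_kube_config()
            self._core_v1_api = client.CoreV1Api()
        return self._core_v1_api

K8S_CLUSTER_MANAGER = KubernetesClusterManager()

# Tests can then share the client, e.g.:
# headpods = K8S_CLUSTER_MANAGER.core_v1_api().list_namespaced_pod(
#     namespace='default', label_selector='ray.io/node-type=head')
```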

Refactor test_ray_serve_work and reduce its flakiness:

working_dir is out-of-date (see this comment for more details), but the tests sometimes pass anyway because of a flaw in the original test logic. Solution: update working_dir in ray-service.yaml.template.
To elaborate, the flaw in the test logic mentioned above is that it only checks the exit code of curl rather than STDOUT.
When Pods are READY and RUNNING, RayService still needs tens of seconds before it can serve requests. The time.sleep(60) call is a workaround and should be removed once [RayService] Track whether Serve app is ready before switching clusters ray-project#730 is merged.
Remove the NodePort service in RayServiceTestCase and use a curl Pod to communicate with Ray directly through the ClusterIP service. The original approach, a Docker container with network_mode='host' plus a NodePort service, was convoluted (see the sketch after this list).
Refactor: remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. Because the original buggy test logic only checked the exit code rather than the STDOUT of the curl commands, the different templates added nothing to RayServiceTestCase.
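A rough sketch of the new approach, with hypothetical Pod and Service names (the real check is implemented by CurlServiceRule in framework/prototype.py): send the request from an in-cluster curl Pod against the ClusterIP Service and compare STDOUT, not just the exit code.

```python
import subprocess

def curl_via_pod(service: str, port: int, payload: str) -> str:
    """Run curl inside a pre-created 'curl' Pod and return the response body."""
    cmd = (
        "kubectl exec curl -- "
        f"curl -s -X POST -H 'Content-Type: application/json' {service}:{port} -d '{payload}'"
    )
    return subprocess.check_output(cmd, shell=True).decode()

# Compare the response body instead of only checking that curl exited with 0.
# For the fruit example, ["MANGO", 2] is expected to return "6".
# assert curl_via_pod('rayservice-sample-serve-svc', 8000, '["MANGO", 2]') == '6'
```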

Refactor: Because the APIServer is not exercised by any test case, remove everything related to the APIServer Docker image from the compatibility test.
kevin85421 authored Dec 1, 2022
1 parent 5517de4 commit 4e8d8be
Showing 10 changed files with 301 additions and 362 deletions.
18 changes: 1 addition & 17 deletions .github/workflows/actions/compatibility/action.yaml
@@ -48,28 +48,12 @@ runs:
docker images ls -a
shell: bash

- name: Download Artifact Apiserver
uses: actions/download-artifact@v2
with:
name: apiserver_img
path: /tmp

- name: Load KubeRay Apiserver Docker Image
run: |
docker load --input /tmp/apiserver.tar
docker images ls -a
shell: bash

- name: Run ${{ inputs.ray_version }} compatibility test
# compatibility test depends on operator & apiserver images built in previous steps.
run: |
pushd manifests/base/
kustomize edit set image kuberay/operator=kuberay/operator:${{ steps.vars.outputs.sha_short }}
kustomize edit set image kuberay/apiserver=kuberay/apiserver:${{ steps.vars.outputs.sha_short }}
popd
echo "Using Ray image ${{ inputs.ray_version }}"
PYTHONPATH="./tests/" \
RAY_IMAGE="rayproject/ray:${{ inputs.ray_version }}" \
OPERATOR_IMAGE="kuberay/operator:${{ steps.vars.outputs.sha_short }}" \
APISERVER_IMAGE="kuberay/apiserver:${{ steps.vars.outputs.sha_short }}" python ./tests/compatibility-test.py
python ./tests/compatibility-test.py
shell: bash
18 changes: 1 addition & 17 deletions .github/workflows/actions/configuration/action.yaml
@@ -48,29 +48,13 @@ runs:
docker images ls -a
shell: bash

- name: Download Artifact Apiserver
uses: actions/download-artifact@v2
with:
name: apiserver_img
path: /tmp

- name: Load KubeRay Apiserver Docker Image
run: |
docker load --input /tmp/apiserver.tar
docker images ls -a
shell: bash

- name: Run configuration tests for sample YAML files.
# compatibility test depends on operator & apiserver images built in previous steps.
run: |
pushd manifests/base/
kustomize edit set image kuberay/operator=kuberay/operator:${{ steps.vars.outputs.sha_short }}
kustomize edit set image kuberay/apiserver=kuberay/apiserver:${{ steps.vars.outputs.sha_short }}
popd
echo "Using Ray image ${{ inputs.ray_version }}"
cd tests/framework
GITHUB_ACTIONS=true \
RAY_IMAGE="rayproject/ray:${{ inputs.ray_version }}" \
OPERATOR_IMAGE="kuberay/operator:${{ steps.vars.outputs.sha_short }}" \
APISERVER_IMAGE="kuberay/apiserver:${{ steps.vars.outputs.sha_short }}" python test_sample_raycluster_yamls.py
python test_sample_raycluster_yamls.py
shell: bash
6 changes: 3 additions & 3 deletions ray-operator/DEVELOPMENT.md
@@ -135,13 +135,13 @@ python3 ../scripts/rbac-check.py
We have some [end-to-end tests](https://github.com/ray-project/kuberay/blob/master/.github/workflows/actions/compatibility/action.yaml) on GitHub Actions.
These tests operate small Ray clusters running within a [kind](https://kind.sigs.k8s.io/) (Kubernetes-in-docker) environment. To run the tests yourself, follow these steps:

* Step1: Install related dependencies, including [kind](https://kind.sigs.k8s.io/), [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), and [kustomize](https://kustomize.io/).
* Step1: Install related dependencies, including [kind](https://kind.sigs.k8s.io/) and [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/).

* Step2: You must be in `/path/to/your/kuberay/`.
```bash
# [Usage]: RAY_IMAGE=$RAY_IMAGE OPERATOR_IMAGE=$OPERATOR_IMAGE APISERVER_IMAGE=$APISERVER_IMAGE python3 tests/compatibility-test.py
# [Usage]: RAY_IMAGE=$RAY_IMAGE OPERATOR_IMAGE=$OPERATOR_IMAGE python3 tests/compatibility-test.py
# These 3 environment variables are optional.
# [Example]:
RAY_IMAGE=rayproject/ray:2.0.0 OPERATOR_IMAGE=kuberay/operator:nightly APISERVER_IMAGE=kuberay/apiserver:nightly python3 tests/compatibility-test.py
RAY_IMAGE=rayproject/ray:2.0.0 OPERATOR_IMAGE=kuberay/operator:nightly python3 tests/compatibility-test.py
```

150 changes: 66 additions & 84 deletions tests/compatibility-test.py
@@ -8,7 +8,14 @@
import docker

from kuberay_utils import utils
from kubernetes import client, config
from framework.prototype import (
CONST,
CurlServiceRule,
K8S_CLUSTER_MANAGER,
OperatorManager,
RuleSet,
shell_subprocess_run
)

logger = logging.getLogger(__name__)
logging.basicConfig(
@@ -23,7 +30,6 @@
# Docker images
ray_image = 'rayproject/ray:1.9.0'
kuberay_operator_image = 'kuberay/operator:nightly'
kuberay_apiserver_image = 'kuberay/apiserver:nightly'


class BasicRayTestCase(unittest.TestCase):
@@ -36,12 +42,15 @@ def setUpClass(cls):
# from another local ray container. The local ray container
# outside Kind environment has the same ray version as the
# ray cluster running inside Kind environment.
utils.delete_cluster()
utils.create_cluster()
images = [ray_image, kuberay_operator_image, kuberay_apiserver_image]
utils.download_images(images)
utils.apply_kuberay_resources(images, kuberay_operator_image, kuberay_apiserver_image)
utils.create_kuberay_cluster(BasicRayTestCase.cluster_template_file,
K8S_CLUSTER_MANAGER.delete_kind_cluster()
K8S_CLUSTER_MANAGER.create_kind_cluster()
image_dict = {
CONST.RAY_IMAGE_KEY: ray_image,
CONST.OPERATOR_IMAGE_KEY: kuberay_operator_image
}
operator_manager = OperatorManager(image_dict)
operator_manager.prepare_operator()
utils.create_ray_cluster(BasicRayTestCase.cluster_template_file,
ray_version, ray_image)

def test_simple_code(self):
@@ -138,38 +147,42 @@ class RayFTTestCase(unittest.TestCase):

@classmethod
def setUpClass(cls):
if not utils.ray_ft_supported(ray_version):
raise unittest.SkipTest("ray ft is not supported")
utils.delete_cluster()
utils.create_cluster()
images = [ray_image, kuberay_operator_image, kuberay_apiserver_image]
utils.download_images(images)
utils.apply_kuberay_resources(images, kuberay_operator_image, kuberay_apiserver_image)
utils.create_kuberay_cluster(RayFTTestCase.cluster_template_file,
if not utils.is_feature_supported(ray_version, CONST.RAY_FT):
raise unittest.SkipTest(f"{CONST.RAY_FT} is not supported")
K8S_CLUSTER_MANAGER.delete_kind_cluster()
K8S_CLUSTER_MANAGER.create_kind_cluster()
image_dict = {
CONST.RAY_IMAGE_KEY: ray_image,
CONST.OPERATOR_IMAGE_KEY: kuberay_operator_image
}
operator_manager = OperatorManager(image_dict)
operator_manager.prepare_operator()
utils.create_ray_cluster(RayFTTestCase.cluster_template_file,
ray_version, ray_image)

@unittest.skip("Skip test_kill_head due to its flakiness.")
def test_kill_head(self):
# This test will delete head node and wait for a new replacement to
# come up.
utils.shell_assert_success(
shell_subprocess_run(
'kubectl delete pod $(kubectl get pods -A | grep -e "-head" | awk "{print \$2}")')

# wait for new head node to start
time.sleep(80)
utils.shell_assert_success('kubectl get pods -A')
shell_subprocess_run('kubectl get pods -A')

# make sure the new head is ready
# shell_assert_success('kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s')
# make sure both head and worker pods are ready
rtn = utils.shell_run(
'kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s')
rtn = shell_subprocess_run(
'kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s', check = False)
if rtn != 0:
utils.shell_run('kubectl get pods -A')
utils.shell_run(
shell_subprocess_run('kubectl get pods -A')
shell_subprocess_run(
'kubectl describe pod $(kubectl get pods | grep -e "-head" | awk "{print \$1}")')
utils.shell_run(
shell_subprocess_run(
'kubectl logs $(kubectl get pods | grep -e "-head" | awk "{print \$1}")')
utils.shell_run(
shell_subprocess_run(
'kubectl logs -n $(kubectl get pods -A | grep -e "-operator" | awk \'{print $1 " " $2}\')')
assert rtn == 0

@@ -188,13 +201,9 @@ def test_ray_serve(self):
raise Exception(f"There was an exception during the execution of test_ray_serve_1.py. The exit code is {exit_code}." +
"See above for command output. The output will be printed by the function exec_run_container.")

# Initialize k8s client
config.load_kube_config()
k8s_api = client.CoreV1Api()

# KubeRay only allows at most 1 head pod per RayCluster instance at the same time. In addition,
# if we have 0 head pods at this moment, it indicates that the head pod crashes unexpectedly.
headpods = utils.get_pod(k8s_api, namespace='default', label_selector='ray.io/node-type=head')
headpods = utils.get_pod(namespace='default', label_selector='ray.io/node-type=head')
assert(len(headpods.items) == 1)
old_head_pod = headpods.items[0]
old_head_pod_name = old_head_pod.metadata.name
@@ -203,10 +212,10 @@
# Kill the gcs_server process on head node. If fate sharing is enabled, the whole head node pod
# will terminate.
exec_command = ['pkill gcs_server']
utils.pod_exec_command(k8s_api, pod_name=old_head_pod_name, namespace='default', exec_command=exec_command)
utils.pod_exec_command(pod_name=old_head_pod_name, namespace='default', exec_command=exec_command)

# Waiting for all pods become ready and running.
utils.wait_for_new_head(k8s_api, old_head_pod_name, restart_count, 'default', timeout=300, retry_interval_ms=1000)
utils.wait_for_new_head(old_head_pod_name, restart_count, 'default', timeout=300, retry_interval_ms=1000)

# Try to connect to the deployed model again
utils.copy_to_container(container, 'tests/scripts', '/usr/local/', 'test_ray_serve_2.py')
@@ -234,13 +243,9 @@ def test_detached_actor(self):
raise Exception(f"There was an exception during the execution of test_detached_actor_1.py. The exit code is {exit_code}." +
"See above for command output. The output will be printed by the function exec_run_container.")

# Initialize k8s client
config.load_kube_config()
k8s_api = client.CoreV1Api()

# KubeRay only allows at most 1 head pod per RayCluster instance at the same time. In addition,
# if we have 0 head pods at this moment, it indicates that the head pod crashes unexpectedly.
headpods = utils.get_pod(k8s_api, namespace='default', label_selector='ray.io/node-type=head')
headpods = utils.get_pod(namespace='default', label_selector='ray.io/node-type=head')
assert(len(headpods.items) == 1)
old_head_pod = headpods.items[0]
old_head_pod_name = old_head_pod.metadata.name
@@ -249,10 +254,10 @@
# Kill the gcs_server process on head node. If fate sharing is enabled, the whole head node pod
# will terminate.
exec_command = ['pkill gcs_server']
utils.pod_exec_command(k8s_api, pod_name=old_head_pod_name, namespace='default', exec_command=exec_command)
utils.pod_exec_command(pod_name=old_head_pod_name, namespace='default', exec_command=exec_command)

# Waiting for all pods become ready and running.
utils.wait_for_new_head(k8s_api, old_head_pod_name, restart_count, 'default', timeout=300, retry_interval_ms=1000)
utils.wait_for_new_head(old_head_pod_name, restart_count, 'default', timeout=300, retry_interval_ms=1000)

# Try to connect to the detached actor again.
# [Note] When all pods become running and ready, the RayCluster still needs tens of seconds to relaunch actors. Hence,
@@ -264,75 +269,52 @@ def test_detached_actor(self):
raise Exception(f"There was an exception during the execution of test_detached_actor_2.py. The exit code is {exit_code}." +
"See above for command output. The output will be printed by the function exec_run_container.")

k8s_api.api_client.rest_client.pool_manager.clear()
k8s_api.api_client.close()
container.stop()
docker_client.close()

class RayServiceTestCase(unittest.TestCase):
"""Integration tests for RayService"""
service_template_file = 'tests/config/ray-service.yaml.template'
service_serve_update_template_file = 'tests/config/ray-service-serve-update.yaml.template'
service_cluster_update_template_file = 'tests/config/ray-service-cluster-update.yaml.template'

# The previous logic for testing updates was problematic.
# We need to test RayService updates.
@classmethod
def setUpClass(cls):
if not utils.ray_service_supported(ray_version):
raise unittest.SkipTest("ray service is not supported")
# Ray Service is running inside a local Kind environment.
# We use the Ray nightly version now.
# We wait for the serve service ready.
# The test will check the successful response from serve service.
utils.delete_cluster()
utils.create_cluster()
images = [ray_image, kuberay_operator_image, kuberay_apiserver_image]
utils.download_images(images)
utils.apply_kuberay_resources(images, kuberay_operator_image, kuberay_apiserver_image)
utils.create_kuberay_service(
RayServiceTestCase.service_template_file, ray_version, ray_image)
if not utils.is_feature_supported(ray_version, CONST.RAY_SERVICE):
raise unittest.SkipTest(f"{CONST.RAY_SERVICE} is not supported")
K8S_CLUSTER_MANAGER.delete_kind_cluster()
K8S_CLUSTER_MANAGER.create_kind_cluster()
image_dict = {
CONST.RAY_IMAGE_KEY: ray_image,
CONST.OPERATOR_IMAGE_KEY: kuberay_operator_image
}
operator_manager = OperatorManager(image_dict)
operator_manager.prepare_operator()

def test_ray_serve_work(self):
time.sleep(5)
curl_cmd = 'curl -X POST -H \'Content-Type: application/json\' localhost:8000 -d \'["MANGO", 2]\''
utils.wait_for_condition(
lambda: utils.shell_run(curl_cmd) == 0,
timeout=15,
)
utils.create_kuberay_service(
RayServiceTestCase.service_serve_update_template_file,
ray_version, ray_image)
curl_cmd = 'curl -X POST -H \'Content-Type: application/json\' localhost:8000 -d \'["MANGO", 2]\''
time.sleep(5)
utils.wait_for_condition(
lambda: utils.shell_run(curl_cmd) == 0,
timeout=60,
)
utils.create_kuberay_service(
RayServiceTestCase.service_cluster_update_template_file,
ray_version, ray_image)
time.sleep(5)
curl_cmd = 'curl -X POST -H \'Content-Type: application/json\' localhost:8000 -d \'["MANGO", 2]\''
utils.wait_for_condition(
lambda: utils.shell_run(curl_cmd) == 0,
timeout=180,
)

"""Create a RayService, send a request to RayService via `curl`, and compare the result."""
cr_event = utils.create_ray_service(
RayServiceTestCase.service_template_file, ray_version, ray_image)
# When Pods are READY and RUNNING, RayService still needs tens of seconds to be ready
# for serving requests. This `sleep` function is a workaround, and should be removed
# when https://github.com/ray-project/kuberay/pull/730 is merged.
time.sleep(60)
cr_event.rulesets = [RuleSet([CurlServiceRule()])]
cr_event.check_rule_sets()

def parse_environment():
global ray_version, ray_image, kuberay_operator_image, kuberay_apiserver_image
global ray_version, ray_image, kuberay_operator_image
for k, v in os.environ.items():
if k == 'RAY_IMAGE':
ray_image = v
ray_version = ray_image.split(':')[-1]
elif k == 'OPERATOR_IMAGE':
kuberay_operator_image = v
elif k == 'APISERVER_IMAGE':
kuberay_apiserver_image = v


if __name__ == '__main__':
parse_environment()
logger.info('Setting Ray image to: {}'.format(ray_image))
logger.info('Setting Ray version to: {}'.format(ray_version))
logger.info('Setting KubeRay operator image to: {}'.format(kuberay_operator_image))
logger.info('Setting KubeRay apiserver image to: {}'.format(kuberay_apiserver_image))
unittest.main(verbosity=2)
2 changes: 1 addition & 1 deletion tests/config/ray-service.yaml.template
@@ -8,7 +8,7 @@ spec:
serveConfig:
importPath: fruit.deployment_graph
runtimeEnv: |
working_dir: "https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip"
working_dir: "https://github.com/ray-project/test_dag/archive/41d09119cbdf8450599f993f51318e9e27c59098.zip"
deployments:
- name: MangoStand
numReplicas: 1
File renamed without changes.