Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Running ray[serve]==2.2.0 and kuberay==0.4.0
What Happened
Occasionally upon running a kubectl apply for a new cluster spec, the ray autoscaler container encounters 401 errors attempting to access the Kubernetes cluster API.
This doesn't seem tied to any particular config change; it appears to be a random occurrence, since it happened while I was editing an otherwise unused field (headGroupSpec.replicas).
The issue resolved after running another kubectl apply for the cluster (without changes) after some time.
While in this state, the autoscaler is unable to function and the head node enters a crash loop, which can result in reduced availability.
Example kubectl get pods status
NAME READY STATUS RESTARTS AGE
demo-ray-serve-raycluster-wslxb-head-spxb4 1/2 CrashLoopBackOff 7 (3m20s ago) 20m
What I Expected to Happen
Running kubectl apply with the Ray Serve cluster spec should apply a rolling update to the Ray cluster pods without ending in a crash loop or endpoint downtime.
Reproduction script
The autoscaler config closely follows the tutorial autoscaler configuration.
....
spec:
  ....
  rayClusterConfig:
    rayVersion: '2.2.0' # should match the Ray version in the image of the containers
    ###################### AutoScaling #################################
    # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
    # Ray autoscaler integration is supported only for Ray versions >= 1.11.0.
    # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
    enableInTreeAutoscaling: true
    # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
    # The example configuration shown below represents the DEFAULT values.
    # (You may delete autoscalerOptions if the defaults are suitable.)
    autoscalerOptions:
      # upscalingMode is "Conservative", "Default", or "Aggressive."
      # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
      # Default: Upscaling is not rate-limited.
      # Aggressive: An alias for Default; upscaling is not rate-limited.
      upscalingMode: Default
      # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
      idleTimeoutSeconds: 300
      # image optionally overrides the autoscaler's container image.
      # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
      # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
      ## image: "my-repo/my-custom-autoscaler-image:tag"
      # imagePullPolicy optionally overrides the autoscaler container's image pull policy.
      imagePullPolicy: IfNotPresent # Always
      # Optionally specify the autoscaler container's securityContext.
      securityContext: {}
      env: []
      envFrom: []
      # resources specifies optional resource request and limit overrides for the autoscaler container.
      # The default autoscaler resource limits and requests should be sufficient for production use-cases.
      # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
      resources:
        limits:
          cpu: "750m"
          memory: "512Mi"
        requests:
          cpu: "500m"
          memory: "512Mi"
Anything else
Autoscaler container logs during crash loop.
The Ray head is ready. Starting the autoscaler.
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2386, in main
return cli()
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2132, in kuberay_autoscaler
run_kuberay_autoscaler(cluster_name, cluster_namespace)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 63, in run_kuberay_autoscaler
retry_on_failure=False,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 501, in run
self._initialize_autoscaler()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 237, in _initialize_autoscaler
prom_metrics=self.prom_metrics,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 240, in __init__
self.reset(errors_fatal=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1097, in reset
raise e
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1014, in reset
new_config = self.config_reader()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
ray_cr = self._fetch_ray_cr_from_k8s_with_retries()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 79, in _fetch_ray_cr_from_k8s_with_retries
raise e from None
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 71, in _fetch_ray_cr_from_k8s_with_retries
return self._fetch_ray_cr_from_k8s()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 89, in _fetch_ray_cr_from_k8s
result.raise_for_status()
File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://kubernetes.default:443/apis/ray.io/v1alpha1/namespaces/ray-serve/rayclusters/demo-ray-serve-raycluster-wslxb
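For context, the failing call at the bottom of the traceback is an authenticated GET against the RayCluster custom resource endpoint, using the service account token mounted into the head pod. A minimal sketch of that request path (hypothetical helper names, not the actual Ray source; the real code lives in autoscaling_config.py) — if the API server rejects the mounted token, raise_for_status() surfaces exactly the 401 shown above:

```python
K8S_API = "https://kubernetes.default:443"
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"


def ray_cr_url(namespace: str, name: str) -> str:
    # Builds the same URL that appears in the 401 error above.
    return f"{K8S_API}/apis/ray.io/v1alpha1/namespaces/{namespace}/rayclusters/{name}"


def fetch_ray_cr(namespace: str, name: str) -> dict:
    import requests  # third-party HTTP client; the real autoscaler also uses requests

    # Read the in-cluster service account token mounted into the pod.
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    resp = requests.get(
        ray_cr_url(namespace, name),
        headers={"Authorization": f"Bearer {token}"},
        verify=CA_PATH,
    )
    # If the API server rejects the token, this raises
    # "401 Client Error: Unauthorized" as in the traceback.
    resp.raise_for_status()
    return resp.json()
```

Since the request itself is well-formed, the 401 points at the credential (a stale or rotated service account token, or RBAC briefly out of sync after the apply) rather than at the URL or the CR itself.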
Are you willing to submit a PR?
Yes I am willing to submit a PR!
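One possible direction for such a PR (purely a sketch under my assumptions, not a confirmed fix): treat 401s from the CR fetch as transient and retry with backoff, since a re-read of the mounted token may pick up a rotated credential. The helper below is hypothetical and takes any zero-argument fetch callable:

```python
import time


def fetch_with_retries(fetch, attempts=5, delay_s=2.0):
    """Retry a fetch callable when the API rejects the token (sketch).

    `fetch` is any zero-argument callable; exceptions whose message
    contains '401' are treated as transient and retried, anything
    else is re-raised immediately.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # in real code, catch requests.HTTPError
            if "401" not in str(exc):
                raise
            last_exc = exc
            # Linear backoff before the caller re-reads the mounted token.
            time.sleep(delay_s * (attempt + 1))
    raise last_exc
```

Wrapping the CR fetch this way would let the autoscaler ride out a short-lived 401 instead of crash-looping the head pod.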