
[Bug] 401 errors for cluster autoscaler in RayService after applying an update #924

Closed
1 of 2 tasks
bewestphal opened this issue Feb 23, 2023 · 3 comments
Labels: bug (Something isn't working), P1 (Issue that should be fixed within a few weeks), rayservice, serve

bewestphal commented Feb 23, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Running ray[serve]==2.2.0 and KubeRay 0.4.0.

What Happened

Occasionally upon running a kubectl apply for a new cluster spec, the ray autoscaler container encounters 401 errors attempting to access the Kubernetes cluster API.

This doesn't seem tied to any particular config change and looks more like a random occurrence, since it happened while I was editing an unused field (headGroupSpec.replicas).

The issue resolved after running another kubectl apply for the cluster (without changes) some time later.

While in this state, the autoscaler is unable to operate and the head node enters a crash loop, which can result in reduced availability.

Example kubectl get pods status

NAME                                                      READY   STATUS             RESTARTS        AGE
demo-ray-serve-raycluster-wslxb-head-spxb4                1/2     CrashLoopBackOff   7 (3m20s ago)   20m

What Expected to Happen

Running kubectl apply with the ray serve cluster spec should apply a rolling update to the ray cluster pods without ending in a crash loop or endpoint downtime.

Reproduction script

The autoscaler config closely follows the tutorial autoscaler example.

....
spec:
  ....
  rayClusterConfig:
    rayVersion: '2.2.0' # should match the Ray version in the image of the containers

    ###################### AutoScaling #################################
    # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
    # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
    # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
    enableInTreeAutoscaling: true
    # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
    # The example configuration shown below represents the DEFAULT values.
    # (You may delete autoscalerOptions if the defaults are suitable.)
    autoscalerOptions:
      # upscalingMode is "Conservative", "Default", or "Aggressive."
      # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
      # Default: Upscaling is not rate-limited.
      # Aggressive: An alias for Default; upscaling is not rate-limited.
      upscalingMode: Default
      # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
      idleTimeoutSeconds: 300
      # image optionally overrides the autoscaler's container image.
      # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
      # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
      ## image: "my-repo/my-custom-autoscaler-image:tag"
      # imagePullPolicy optionally overrides the autoscaler container's image pull policy.
      imagePullPolicy: IfNotPresent # Always
      # Optionally specify the autoscaler container's securityContext.
      securityContext: { }
      env: [ ]
      envFrom: [ ]
      # resources specifies optional resource request and limit overrides for the autoscaler container.
      # The default autoscaler resource limits and requests should be sufficient for production use-cases.
      # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
      resources:
        limits:
          cpu: "750m"
          memory: "512Mi"
        requests:
          cpu: "500m"
          memory: "512Mi"

Anything else

Autoscaler container logs during crash loop.

The Ray head is ready. Starting the autoscaler.
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2386, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2132, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 63, in run_kuberay_autoscaler
    retry_on_failure=False,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 501, in run
    self._initialize_autoscaler()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 237, in _initialize_autoscaler
    prom_metrics=self.prom_metrics,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 240, in __init__
    self.reset(errors_fatal=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1097, in reset
    raise e
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1014, in reset
    new_config = self.config_reader()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
    ray_cr = self._fetch_ray_cr_from_k8s_with_retries()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 79, in _fetch_ray_cr_from_k8s_with_retries
    raise e from None
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 71, in _fetch_ray_cr_from_k8s_with_retries
    return self._fetch_ray_cr_from_k8s()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 89, in _fetch_ray_cr_from_k8s
    result.raise_for_status()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://kubernetes.default:443/apis/ray.io/v1alpha1/namespaces/ray-serve/rayclusters/demo-ray-serve-raycluster-wslxb
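
For reference, the failing call can be reproduced by hand from the autoscaler container. The snippet below is a minimal sketch (not part of the original report) that issues the same GET with the service account credentials mounted into the pod; the cluster name and namespace are copied from the traceback above and would need to be adjusted for other clusters. While the autoscaler is crash looping, a 401 from this request points at the token itself, whereas a missing RBAC rule would normally show up as a 403.

# Minimal sketch (not from the original report): reproduce the request the
# autoscaler makes, using the service account credentials mounted into the pod.
# Cluster name and namespace are taken from the traceback above.
import requests

SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
with open(f"{SA_DIR}/token") as f:
    token = f.read().strip()

url = (
    "https://kubernetes.default:443/apis/ray.io/v1alpha1"
    "/namespaces/ray-serve/rayclusters/demo-ray-serve-raycluster-wslxb"
)
resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {token}"},
    verify=f"{SA_DIR}/ca.crt",
)
# 401 -> the token is not being accepted (authentication);
# 403 -> the token is valid but RBAC denies the request (authorization).
print(resp.status_code)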

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
bewestphal added the bug (Something isn't working) label on Feb 23, 2023
@DmitriGekhtman (Collaborator)

It could help to kubectl logs the operator to see if there's any funny business with service accounts.

But indeed, headGroupSpec.replicas is deprecated, and the only fields that support updates are replicas, minReplicas, and maxReplicas for the worker groups.
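
One way to sanity-check the service account from inside the autoscaler container is a SelfSubjectAccessReview. Below is a minimal sketch using the kubernetes Python client, assuming that package is available in the image, with the ray-serve namespace taken from the traceback in this issue.

# Sketch: ask the API server whether the autoscaler's own service account
# is allowed to "get" RayCluster objects. Assumes the `kubernetes` Python
# client is installed; namespace taken from the traceback in this issue.
from kubernetes import client, config

config.load_incluster_config()
review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            group="ray.io",
            resource="rayclusters",
            verb="get",
            namespace="ray-serve",
        )
    )
)
result = client.AuthorizationV1Api().create_self_subject_access_review(review)
# allowed=False would indicate an RBAC problem rather than a stale/invalid token.
print(result.status.allowed)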

bewestphal commented Feb 23, 2023

Here's a dump of some kuberay operator logs.
kuberay.log

I encountered this issue after my head node pod crashed; the issue left the pod in a crash loop state. Mentioned in the Serve Slack:
https://ray-distributed.slack.com/archives/CNCKBBRJL/p1677188591700029

The pod crash happened around 2023-02-23T21:01:07.00Z in the logs.

In this case, the issue resolved itself eventually without another kubectl apply.

@kevin85421 (Member)

Closed by #1128.
