
[Bug] 401 errors for cluster autoscaler in RayService after applying an update #924

Closed
1 of 2 tasks
bewestphal opened this issue Feb 23, 2023 · 3 comments
Labels: bug (Something isn't working), P1 (Issue that should be fixed within a few weeks), rayservice, serve

bewestphal commented Feb 23, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Running ray[serve]==2.2.0 and KubeRay 0.4.0.

What Happened

Occasionally upon running a kubectl apply for a new cluster spec, the ray autoscaler container encounters 401 errors attempting to access the Kubernetes cluster API.

This doesn't seem tied to any particular config change and looks more like a random occurrence, since it happened while I was editing an unused field (headGroupSpec.replicas).

The issue resolved after running another kubectl apply for the cluster (without changes) some time later.

While in this state, the autoscaler is unable to operate and the head node enters a crash loop, which can result in reduced availability.

Example kubectl get pods status

NAME                                                      READY   STATUS             RESTARTS        AGE
demo-ray-serve-raycluster-wslxb-head-spxb4                1/2     CrashLoopBackOff   7 (3m20s ago)   20m

What Expected to Happen

Running kubectl apply with the ray serve cluster spec should apply a rolling update to the ray cluster pods without ending in a crash loop or endpoint downtime.

Reproduction script

The autoscaler config closely follows the tutorial autoscaler example.

....
spec:
  ....
  rayClusterConfig:
    rayVersion: '2.2.0' # should match the Ray version in the image of the containers

    ###################### AutoScaling #################################
    # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
    # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
    # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
    enableInTreeAutoscaling: true
    # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
    # The example configuration shown below represents the DEFAULT values.
    # (You may delete autoscalerOptions if the defaults are suitable.)
    autoscalerOptions:
      # upscalingMode is "Conservative", "Default", or "Aggressive."
      # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
      # Default: Upscaling is not rate-limited.
      # Aggressive: An alias for Default; upscaling is not rate-limited.
      upscalingMode: Default
      # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
      idleTimeoutSeconds: 300
      # image optionally overrides the autoscaler's container image.
      # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
      # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
      ## image: "my-repo/my-custom-autoscaler-image:tag"
      # imagePullPolicy optionally overrides the autoscaler container's image pull policy.
      imagePullPolicy: IfNotPresent # Always
      # Optionally specify the autoscaler container's securityContext.
      securityContext: { }
      env: [ ]
      envFrom: [ ]
      # resources specifies optional resource request and limit overrides for the autoscaler container.
      # The default autoscaler resource limits and requests should be sufficient for production use-cases.
      # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
      resources:
        limits:
          cpu: "750m"
          memory: "512Mi"
        requests:
          cpu: "500m"
          memory: "512Mi"

Anything else

Autoscaler container logs during crash loop.

The Ray head is ready. Starting the autoscaler.
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2386, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2132, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 63, in run_kuberay_autoscaler
    retry_on_failure=False,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 501, in run
    self._initialize_autoscaler()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 237, in _initialize_autoscaler
    prom_metrics=self.prom_metrics,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 240, in __init__
    self.reset(errors_fatal=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1097, in reset
    raise e
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1014, in reset
    new_config = self.config_reader()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
    ray_cr = self._fetch_ray_cr_from_k8s_with_retries()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 79, in _fetch_ray_cr_from_k8s_with_retries
    raise e from None
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 71, in _fetch_ray_cr_from_k8s_with_retries
    return self._fetch_ray_cr_from_k8s()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 89, in _fetch_ray_cr_from_k8s
    result.raise_for_status()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://kubernetes.default:443/apis/ray.io/v1alpha1/namespaces/ray-serve/rayclusters/demo-ray-serve-raycluster-wslxb
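
For reference, the failing call can be reproduced by hand from the autoscaler container. The snippet below is a minimal sketch (not part of the original report) that issues the same GET with the service account credentials mounted into the pod; the cluster name and namespace are copied from the traceback above and would need to be adjusted for other clusters. While the autoscaler is crash looping, a 401 from this request points at the token itself, whereas a missing RBAC rule would normally show up as a 403.

# Minimal sketch (not from the original report): reproduce the request the
# autoscaler makes, using the service account credentials mounted into the pod.
# Cluster name and namespace are taken from the traceback above.
import requests

SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
with open(f"{SA_DIR}/token") as f:
    token = f.read().strip()

url = (
    "https://kubernetes.default:443/apis/ray.io/v1alpha1"
    "/namespaces/ray-serve/rayclusters/demo-ray-serve-raycluster-wslxb"
)
resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {token}"},
    verify=f"{SA_DIR}/ca.crt",
)
# 401 -> the token is not being accepted (authentication);
# 403 -> the token is valid but RBAC denies the request (authorization).
print(resp.status_code)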

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
bewestphal added the bug (Something isn't working) label on Feb 23, 2023
@DmitriGekhtman (Collaborator)

It could help to kubectl logs the operator to see if there's any funny business with service accounts.

But indeed, headGroupSpec.replicas is deprecated, and the only fields that support updates are replicas, minReplicas, and maxReplicas for the worker groups.
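
One way to sanity-check the service account from inside the autoscaler container is a SelfSubjectAccessReview. Below is a minimal sketch using the kubernetes Python client, assuming that package is available in the image, with the ray-serve namespace taken from the traceback in this issue.

# Sketch: ask the API server whether the autoscaler's own service account
# is allowed to "get" RayCluster objects. Assumes the `kubernetes` Python
# client is installed; namespace taken from the traceback in this issue.
from kubernetes import client, config

config.load_incluster_config()
review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            group="ray.io",
            resource="rayclusters",
            verb="get",
            namespace="ray-serve",
        )
    )
)
result = client.AuthorizationV1Api().create_self_subject_access_review(review)
# allowed=False would indicate an RBAC problem rather than a stale/invalid token.
print(result.status.allowed)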

bewestphal commented Feb 23, 2023

Here's a dump of some kuberay operator logs.
kuberay.log

I encountered this issue after my head node pod crashed; the issue left the pod in a crash loop state. Mentioned in the Serve Slack:
https://ray-distributed.slack.com/archives/CNCKBBRJL/p1677188591700029

The pod crash happened around 2023-02-23T21:01:07.00Z in the logs.

In this case, the issue resolved itself eventually without another kubectl apply.

@kevin85421 (Member)

Closed by #1128.
