failing to launch dask worker pods on AWS #870

Closed
scottyhq opened this issue Nov 9, 2020 · 7 comments · Fixed by #871

Comments

@scottyhq
Member

scottyhq commented Nov 9, 2020

The last update to prod on AWS is failing to launch dask workers. Not sure what is wrong, but this is the first time we've deployed to prod via GitHub Actions. Everything is still working fine on staging.
https://github.com/pangeo-data/pangeo-cloud-federation/runs/1375800740?check_suite_focus=true

On prod, dask-worker pods remain in the 'Pending' state with nothing in the logs. There are no error messages in the Jupyter notebook either. Only when digging into other dask-gateway related pods do I see some error messages. cc @TomAugspurger @rsignell-usgs

kubectl logs -n icesat2-prod traefik-icesat2-prod-dask-gateway-84bd7f7c7-97j5l

time="2020-11-09T18:42:03Z" level=error msg="Cannot create service: subset not found" ingress=dask-3982fed9bac34208997d8323f27de40b namespace=icesat2-prod serviceName=dask-3982fed9bac34208997d8323f27de40b servicePort=8786 providerName=kubernetescrd
time="2020-11-09T18:42:05Z" level=error msg="subset not found for icesat2-prod/dask-3982fed9bac34208997d8323f27de40b" namespace=icesat2-prod ingress=dask-3982fed9bac34208997d8323f27de40b providerName=kubernetescrd
time="2020-11-09T18:42:05Z" level=error msg="Cannot create service: subset not found" servicePort=8786 providerName=kubernetescrd ingress=dask-3982fed9bac34208997d8323f27de40b serviceName=dask-3982fed9bac34208997d8323f27de40b namespace=icesat2-prod
@scottyhq
Member Author

scottyhq commented Nov 9, 2020

Perhaps I need to re-apply the dask-gateway CRDs, but I thought these things were all wrapped up in the helm deploy step now...? #584 (comment)

@TomAugspurger
Member

TomAugspurger commented Nov 9, 2020 via email

@cspencerjones
Contributor

I am getting this too. Thanks for opening an issue.

from dask_gateway import Gateway
from dask.distributed import Client

gateway = Gateway()
cluster = gateway.new_cluster()

cluster.scale(10)
cluster

gives

Name: icesat2-prod.fd01ef4c1a934ac7a5e0347a75eb5abc

Dashboard: /services/dask-gateway/clusters/icesat2-prod.fd01ef4c1a934ac7a5e0347a75eb5abc/status

But no workers ever become available. Logs just say

dask_gateway.dask_cli - INFO - Requesting scale to 10 workers from 0

@scottyhq
Member Author

scottyhq commented Nov 9, 2020

everything is running the same image (pangeo/pangeo-notebook:2020.10.27 --> ecr.us-west-2.amazonaws.com/pangeo:b364ec4)

I'm not seeing anything obvious in any of the logs, but it looks like there is a typo here (it shouldn't have 'staging'). I'll see if that fixes things...

(from https://github.com/pangeo-data/pangeo-cloud-federation/pull/846/files)

worker:
  extraPodConfig:
    schedulerName: icesat2-prod-staging-user-scheduler
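
For anyone wanting to catch this kind of mismatch before deploying, here is a minimal sketch of a check. It assumes the user scheduler is named "<release>-user-scheduler" and that the release name on prod is "icesat2-prod" (both are assumptions, not confirmed in this thread). The function names and the dict layout are hypothetical; the dict just mirrors the worker/extraPodConfig snippet above.

```python
def expected_scheduler_name(release):
    # Assumption: the user scheduler follows the "<release>-user-scheduler" pattern.
    return f"{release}-user-scheduler"


def check_scheduler_names(release, values):
    """Return a list of mismatches between configured and expected schedulerName."""
    expected = expected_scheduler_name(release)
    problems = []
    for role in ("scheduler", "worker"):
        pod_config = values.get(role, {}).get("extraPodConfig", {})
        name = pod_config.get("schedulerName")
        # Only flag roles that set a schedulerName explicitly.
        if name is not None and name != expected:
            problems.append(f"{role}: schedulerName {name!r} does not match {expected!r}")
    return problems


# The prod config as quoted above, with the stray 'staging'.
prod_values = {
    "worker": {
        "extraPodConfig": {"schedulerName": "icesat2-prod-staging-user-scheduler"}
    }
}

for problem in check_scheduler_names("icesat2-prod", prod_values):
    print(problem)
```

Something like this could run in CI against each hub's config before the helm deploy step, so a copy-paste from staging gets flagged early.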

@scottyhq
Member Author

scottyhq commented Nov 9, 2020

Sure enough, it was that typo in the schedulerName. We're back up and running!
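
For future debugging: a pod whose spec.schedulerName points at a scheduler that isn't running is never picked up by any scheduler, which would explain why the workers sat in Pending with nothing in the logs. A rough sketch for spotting this, assuming you save the output of `kubectl get pods -n icesat2-prod -o json` and load it as a dict. The function name and the sample pod names below are made up for illustration.

```python
from collections import defaultdict


def pending_by_scheduler(pod_list):
    """Group the names of Pending pods by the scheduler they are waiting on."""
    pending = defaultdict(list)
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") == "Pending":
            # Pods that do not set schedulerName use the default scheduler.
            scheduler = pod.get("spec", {}).get("schedulerName", "default-scheduler")
            pending[scheduler].append(pod["metadata"]["name"])
    return dict(pending)


# Hypothetical sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "dask-worker-example-1"},
            "spec": {"schedulerName": "icesat2-prod-staging-user-scheduler"},
            "status": {"phase": "Pending"},
        },
        {
            "metadata": {"name": "dask-scheduler-example"},
            "spec": {"schedulerName": "icesat2-prod-user-scheduler"},
            "status": {"phase": "Running"},
        },
    ]
}

for scheduler, pods in pending_by_scheduler(sample).items():
    print(scheduler, pods)
```

If every Pending pod lists the same unexpected scheduler name, that name is a good first thing to compare against the schedulers actually deployed in the namespace.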

@cspencerjones
Contributor

Thank you!!!

@TomAugspurger
Member

Good catch, sorry about that!
