failing to launch dask worker pods on AWS #870

scottyhq · 2020-11-09T18:49:43Z

The latest update to prod on AWS is failing to launch dask workers. I'm not sure what is wrong, but this is the first time we've deployed to prod via GitHub Actions. Everything is still working fine on staging.
https://github.com/pangeo-data/pangeo-cloud-federation/runs/1375800740?check_suite_focus=true

On prod, dask-worker pods remain in the 'Pending' state with nothing in the logs. There are no error messages in the Jupyter notebook either. Only by digging into other dask-gateway related pods do I see some error messages. cc @TomAugspurger @rsignell-usgs

kubectl logs -n icesat2-prod traefik-icesat2-prod-dask-gateway-84bd7f7c7-97j5l

time="2020-11-09T18:42:03Z" level=error msg="Cannot create service: subset not found" ingress=dask-3982fed9bac34208997d8323f27de40b namespace=icesat2-prod serviceName=dask-3982fed9bac34208997d8323f27de40b servicePort=8786 providerName=kubernetescrd
time="2020-11-09T18:42:05Z" level=error msg="subset not found for icesat2-prod/dask-3982fed9bac34208997d8323f27de40b" namespace=icesat2-prod ingress=dask-3982fed9bac34208997d8323f27de40b providerName=kubernetescrd
time="2020-11-09T18:42:05Z" level=error msg="Cannot create service: subset not found" servicePort=8786 providerName=kubernetescrd ingress=dask-3982fed9bac34208997d8323f27de40b serviceName=dask-3982fed9bac34208997d8323f27de40b namespace=icesat2-prod


scottyhq · 2020-11-09T18:59:02Z

Perhaps I need to re-apply the dask-gateway CRDs, but I thought these were all wrapped up in the helm deploy step now...? #584 (comment)

TomAugspurger · 2020-11-09T19:56:58Z

Any logs in the dask-gateway API server or controller? What versions of dask-gateway are being used in the singleuser / worker images, and what version is on the cluster?


cspencerjones · 2020-11-09T21:05:54Z

I am getting this too. Thanks for opening an issue.

from dask_gateway import Gateway
from dask.distributed import Client

gateway = Gateway()
cluster = gateway.new_cluster()

cluster.scale(10)
cluster

gives

Name: icesat2-prod.fd01ef4c1a934ac7a5e0347a75eb5abc

Dashboard: /services/dask-gateway/clusters/icesat2-prod.fd01ef4c1a934ac7a5e0347a75eb5abc/status

But no workers ever become available. Logs just say

dask_gateway.dask_cli - INFO - Requesting scale to 10 workers from 0
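Because the scale request is accepted without any error, a notebook can sit indefinitely with zero workers. A minimal sketch of a polling helper that turns that silent hang into an explicit timeout is shown below. The helper name `wait_until_workers` and its parameters are hypothetical and not part of dask-gateway; it only assumes a callable that returns the current worker count.

```python
import time
from typing import Callable


def wait_until_workers(count_workers: Callable[[], int], expected: int,
                       timeout: float = 120.0, poll: float = 2.0) -> int:
    """Poll until at least `expected` workers are present, else raise TimeoutError.

    `count_workers` is any zero-argument callable returning the current
    number of workers (hypothetical helper, not a dask-gateway API).
    """
    deadline = time.monotonic() + timeout
    while True:
        current = count_workers()
        if current >= expected:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {current} of {expected} workers after {timeout:.0f}s; "
                "check whether the worker pods are stuck in Pending"
            )
        time.sleep(poll)


# Possible usage with the cluster created above (assumption, not run here):
#   client = cluster.get_client()
#   wait_until_workers(lambda: len(client.scheduler_info()["workers"]), expected=10)
```

With a helper like this, the notebook would raise after the timeout instead of appearing healthy while no workers arrive.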

scottyhq · 2020-11-09T21:37:10Z

everything is running the same image (pangeo/pangeo-notebook:2020.10.27 --> ecr.us-west-2.amazonaws.com/pangeo:b364ec4)

I'm not seeing anything obvious in any of the logs, but it looks like there is a typo here (the scheduler name shouldn't contain 'staging'). I'll see if that fixes things...

(from https://github.com/pangeo-data/pangeo-cloud-federation/pull/846/files)

pangeo-cloud-federation/deployments/icesat2/config/prod.yaml

Lines 41 to 43 in b2c04c5

    
    worker:
      extraPodConfig:
        schedulerName: icesat2-prod-staging-user-scheduler
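A typo like this could in principle be caught before deploying. The following is a rough sketch of such a check, assuming the prod config has already been loaded into a nested dict (for example with PyYAML's `yaml.safe_load`). The function names and the rule "no `schedulerName` in prod may contain `staging`" are illustrative assumptions, not something that exists in the repository.

```python
from typing import Any, Iterator, List, Tuple


def find_scheduler_names(config: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted path, value) for every `schedulerName` key in nested dicts/lists."""
    if isinstance(config, dict):
        for key, value in config.items():
            child = f"{path}.{key}" if path else str(key)
            if key == "schedulerName" and isinstance(value, str):
                yield child, value
            else:
                yield from find_scheduler_names(value, child)
    elif isinstance(config, list):
        for index, item in enumerate(config):
            yield from find_scheduler_names(item, f"{path}[{index}]")


def suspicious_scheduler_names(config: Any, forbidden: str = "staging") -> List[Tuple[str, str]]:
    """Return schedulerName entries whose value contains the forbidden substring."""
    return [(p, v) for p, v in find_scheduler_names(config) if forbidden in v]


# Mirrors only the three lines quoted above; the surrounding keys of prod.yaml are omitted.
prod_config = {
    "worker": {
        "extraPodConfig": {"schedulerName": "icesat2-prod-staging-user-scheduler"}
    }
}

for path, value in suspicious_scheduler_names(prod_config):
    print(f"{path}: {value!r} looks like a staging scheduler in a prod config")
```

A check along these lines could run as a step in the deploy workflow, failing the job when the returned list is non-empty.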

scottyhq · 2020-11-09T22:03:14Z

sure enough, it was that typo on the schedulerName. we're back up and running!
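For anyone hitting similar symptoms: a pod that requests a scheduler which is not running in the cluster is never scheduled, so it stays Pending without producing logs. As a hedged sketch, the function below groups Pending pods by `spec.schedulerName`, given the parsed output of `kubectl get pods -n <namespace> -o json`. The function name and the sample pod names are made up for illustration; only the JSON field names (`items`, `status.phase`, `spec.schedulerName`, `metadata.name`) come from the Kubernetes pod schema.

```python
from typing import Dict, List


def pending_pods_by_scheduler(pod_list: dict) -> Dict[str, List[str]]:
    """Group the names of Pending pods by the scheduler they request."""
    grouped: Dict[str, List[str]] = {}
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        # Pods that do not set schedulerName are handled by "default-scheduler".
        scheduler = pod.get("spec", {}).get("schedulerName", "default-scheduler")
        grouped.setdefault(scheduler, []).append(pod["metadata"]["name"])
    return grouped


# Hand-written sample in the shape of `kubectl get pods -o json`; pod names are invented.
sample = {
    "items": [
        {
            "metadata": {"name": "dask-worker-example-1"},
            "spec": {"schedulerName": "icesat2-prod-staging-user-scheduler"},
            "status": {"phase": "Pending"},
        },
        {
            "metadata": {"name": "dask-scheduler-example"},
            "spec": {"schedulerName": "default-scheduler"},
            "status": {"phase": "Running"},
        },
    ]
}

for scheduler, pods in pending_pods_by_scheduler(sample).items():
    print(f"{scheduler}: {len(pods)} pending pod(s): {', '.join(pods)}")
```

If every Pending pod points at the same scheduler name, comparing that name against the schedulers actually deployed in the cluster is a quick next step.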

cspencerjones · 2020-11-09T22:12:18Z

Thank you!!!

TomAugspurger · 2020-11-10T18:51:45Z

Good catch, sorry about that!

scottyhq mentioned this issue Nov 9, 2020

fix worker scheduler name typo #871

Merged

scottyhq closed this as completed in #871 Nov 9, 2020

scottyhq mentioned this issue Nov 9, 2020

AWS fix scheduler name and github actions #872

Merged

TomAugspurger mentioned this issue Nov 11, 2020

Documenting the current GCP deployment #874

Open
