Documenting the current GCP deployment #874

Open
TomAugspurger opened this issue Nov 11, 2020 · 0 comments

Hi all,

I'm still offline for a bit, but wanted to dump some thoughts on our current setup, as of 2020-11-11. This is primarily focused on the GCP deployment (https://us-central1-b.gcp.pangeo.io/, and http://staging.us-central1-b.gcp.pangeo.io). It's also mainly focused on how things are (especially how they differ from a "stock" JupyterHub / daskhub deployment) rather than how they should be.

The Hub is deployed through CI in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d7deb23224604150b5380946367d0d95d42e45cb/.circleci/config.yml.
The chart in pangeo-deploy is a small wrapper around daskhub, which wraps up Dask Gateway and JupyterHub.

We've customized a few things beyond the standard deployment.

Kubernetes Cluster

In theory the cluster target in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d7deb23224604150b5380946367d0d95d42e45cb/deployments/gcp-uscentral1b/Makefile controls the creation of the Kubernetes cluster (it may be out of date). The most notable things are

  1. A small (autoscaling) core pool for the hub, dask-gateway, and various other service pods (more later).
  2. Auto-provisioning, auto-scaling node pools for the rest. This uses GCP's node-pool auto-provisioning
    feature, where node pools are automatically created based on the Kubernetes taints / tolerations (e.g.
    it'll create a preemptible node pool for Dask workers, since we mark them as preemptible).
  3. A Kubernetes Service Account and a Google Service Account, pangeo, used for various things (e.g. the scratch bucket, more later).
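
To make the taint / toleration mechanism in item 2 a bit more concrete, here is a rough sketch of the kind of pod-spec fragment that asks for a preemptible node, written as a Python dict. The key `cloud.google.com/gke-preemptible` is the one GKE documents for preemptible nodes; whether our worker pods use exactly this selector and toleration is an assumption, and `preemptible_pod_fragment` is a hypothetical helper, not something from the deployment.

```python
# Hypothetical sketch: a pod-spec fragment asking for a preemptible node.
# The exact selector / toleration used by the real deployment is an assumption.
PREEMPTIBLE_KEY = "cloud.google.com/gke-preemptible"


def preemptible_pod_fragment():
    """Return the nodeSelector and tolerations for a preemptible-only pod."""
    return {
        # Only schedule onto nodes labelled as preemptible.
        "nodeSelector": {PREEMPTIBLE_KEY: "true"},
        # Tolerate the matching taint so the scheduler accepts those nodes.
        "tolerations": [
            {
                "key": PREEMPTIBLE_KEY,
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        ],
    }
```

With auto-provisioning enabled, a pending pod carrying a fragment like this is what would prompt GKE to create a matching preemptible pool.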

Otherwise, things probably follow zero-to-jupyterhub pretty closely.

Authentication

Like the AWS deployment, we use auth0 to authenticate users with the hubs after they fill out the form: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/.github/workflows/UpdateMembers.yml, https://github.com/pangeo-data/pangeo-cloud-federation/actions?query=workflow%3AUpdateMembers.

Images

The GCP deployment simply uses the Docker images from https://github.com/pangeo-data/pangeo-docker-images with no modifications.
We use dependabot to automatically update our pinned version as tags are pushed in pangeo-docker-images.

Testing

We have rudimentary integration tests as part of our CI/CD. #753 provides an overview.
The summary is that pushes to staging will

  1. Deploy the new changes
  2. Start a single-user server (for pangeo-bot; we manually created a token for it and stored it as a secret in CI)
  3. Copy a test.py file to the single-user pod and kubectl exec it
  4. Report the result
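
Steps 2 and 3 can be sketched roughly as below. This is not the actual CI script: the helper names, the `/tmp` destination, and the default `/hub/api` prefix are assumptions. It only builds the JupyterHub REST API request and the kubectl command lines, without sending or running anything.

```python
import urllib.request


def start_server_request(hub_url, username, token):
    """Build (but do not send) the JupyterHub REST API call that starts a server."""
    url = f"{hub_url.rstrip('/')}/hub/api/users/{username}/server"
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"token {token}"}
    )


def kubectl_commands(namespace, pod, script="test.py"):
    """Return the argument lists that copy the script into the pod and run it."""
    remote = f"/tmp/{script}"  # assumed destination inside the pod
    copy = ["kubectl", "-n", namespace, "cp", script, f"{pod}:{remote}"]
    run = ["kubectl", "-n", namespace, "exec", pod, "--", "python", remote]
    return copy, run
```

For example, `kubectl_commands("staging", "jupyter-pangeo-bot")` assumes the default zero-to-jupyterhub pod naming (`jupyter-{username}`) and a namespace called `staging`; the lists could then be passed to `subprocess.run(..., check=True)` so a failing test fails the CI job.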

This should be expanded in a few directions

  1. Run on prod too (this would have caught "failing to launch dask worker pods on AWS" #870, a prod-specific issue)
  2. Better test coverage: Right now we just ensure that we can create a Dask cluster.
  3. Rollback deployments where the tests fail?
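
As a reference point for the coverage item, a check in the spirit of "can we create a Dask cluster" might look like the sketch below. It is not the actual test.py. The gateway object is passed in so the function can be exercised without a live deployment; on the hub it would be a `dask_gateway.Gateway()` instance.

```python
def check_dask_cluster(gateway, n_workers=1):
    """Create a cluster, run one trivial task on it, and always clean up."""
    cluster = gateway.new_cluster()
    try:
        cluster.scale(n_workers)
        client = cluster.get_client()
        # A tiny round trip through the scheduler and a worker.
        result = client.submit(lambda x: x + 1, 1).result()
        if result != 2:
            raise RuntimeError(f"unexpected result from worker: {result!r}")
    finally:
        # Shut the cluster down even if the check fails.
        cluster.shutdown()
    return True
```

Submitting a task (rather than only creating the cluster) also exercises worker pod scheduling, which is the part that failed in #870.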

Scratch Bucket

Many workloads benefit from having some scratch space to write intermediate results. https://rechunker.readthedocs.io/en/latest/ is
a prime example. We don't want users writing large intermediates to their home directory. This is slow and expensive. So we've
provided them with the cloud-native alternative: a read / write bucket on GCS, pangeo-scratch.

This bucket is created with the scratch target in the Makefile, which uses lifecycle.json to specify that objects are automatically
deleted after 7 days.
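
The lifecycle.json file itself isn't reproduced here. As a sketch, assuming it follows the standard GCS lifecycle configuration format, a 7-day delete rule can be generated like this (`lifecycle_config` is a hypothetical helper):

```python
import json


def lifecycle_config(days=7):
    """Build a GCS lifecycle configuration that deletes objects older than `days`."""
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": days},  # object age in days
            }
        ]
    }


def lifecycle_json(days=7):
    """Serialize the configuration in the form written to lifecycle.json."""
    return json.dumps(lifecycle_config(days), indent=2)
```

A file with this content is the kind of input that `gsutil lifecycle set lifecycle.json gs://pangeo-scratch` expects.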

On GCP, we use Workload Identity for the Kubernetes pods. If the Kubernetes Service Account (KSA) is associated with a Google
Service Account (GSA), the pod is able to do things that the GSA can do. #610 (comment) has the hopefully up-to-date
commands used to associate the KSA with the GSA.

See #610 for background.

One notable downside is that this bucket is globally read/writable by everyone in the pangeo cluster. We set the
PANGEO_SCRATCH environment variable in the pangeo-docker-images to be equal to gs://pangeo-scratch/{JUPYTERHUB_USERNAME},
see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/common.yaml#L23.
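
From the user's side, a minimal sketch of building a per-user scratch location from that variable looks like the following. `scratch_path` is a hypothetical helper, and it assumes PANGEO_SCRATCH is already expanded to the user's prefix in the single-user environment.

```python
import os


def scratch_path(name, env=None):
    """Join `name` onto the per-user scratch prefix from PANGEO_SCRATCH."""
    env = os.environ if env is None else env
    base = env["PANGEO_SCRATCH"].rstrip("/")
    return f"{base}/{name.lstrip('/')}"
```

The resulting `gs://` URL can be handed to anything that understands GCS paths, for example `fsspec.get_mapper(scratch_path("intermediate.zarr"))` with gcsfs installed. Note that the per-user prefix is only a naming convention; it does not restrict access.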

Prometheus / Grafana

We have some monitoring of the clusters at http://grafana.us-central1-b.gcp.pangeo.io/grafana/.
We use prometheus to collect metrics from running pods / nodes, and grafana to visualize the metrics.
Finally, we provide an ingress to access the metrics over the internet. The metrics are public to read.

These are deployed separately from prod and staging, not as part of CI/CD, into the metrics namespace.
The pods are configured to squeeze into the core pool.

We ensure that the dask worker & scheduler pods export metrics, along with the JupyterHub username, by setting these
annotations and labels:

    # Fragment: `user` comes from the surrounding configuration code.
    extra_annotations = {
        "hub.jupyter.org/username": user.name,
        "prometheus.io/scrape": "true",
        "prometheus.io/port": "8787",
    }
    extra_labels = {
        "hub.jupyter.org/username": user.name,
    }
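
For context, here is a sketch of how such annotations and labels could be applied per user through a dask-gateway options handler. The wiring is an assumption rather than a copy of our config; the returned keys are, to my knowledge, the dask-gateway Kubernetes backend settings for extra pod annotations and labels.

```python
def option_handler(options, user):
    """Return per-user cluster config overrides (hypothetical handler)."""
    extra_annotations = {
        "hub.jupyter.org/username": user.name,
        "prometheus.io/scrape": "true",
        "prometheus.io/port": "8787",  # the Dask dashboard / metrics port
    }
    extra_labels = {
        "hub.jupyter.org/username": user.name,
    }
    # Apply the same metadata to scheduler and worker pods.
    return {
        "scheduler_extra_pod_annotations": extra_annotations,
        "worker_extra_pod_annotations": extra_annotations,
        "scheduler_extra_pod_labels": extra_labels,
        "worker_extra_pod_labels": extra_labels,
    }
```

In a dask-gateway config file, a handler like this would typically be registered with something like `c.Backend.cluster_options = Options(handler=option_handler)`.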

To configure the ingress, we reserve the static IP of the LoadBalancer in GCP, and then point a DNS entry to it (our DNS is through
Hurricane Electric).

MLFlow / Batch Workflows

There's an incomplete effort to add mlflow / general batch workflow support to our hubs. We have a simple helm chart
for mlflow in the mlflow directory. This has an mlflow deployment / service running MLFlow which is registered
as a JupyterHub service and is accessible at https://{HUB_URL}/services/mlflow/.

Additionally, we set the MLFLOW_TRACKING_URI on the singleuser pod so that users can easily log metrics / artifacts.
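
From a notebook, usage would look roughly like the sketch below. It assumes mlflow is installed in the single-user image; the parameter and metric names and values are made up for illustration. MLflow reads MLFLOW_TRACKING_URI from the environment on its own, so no explicit URI is passed.

```python
import os


def tracking_uri(env=None):
    """Return the configured tracking URI, or None if the variable is unset."""
    env = os.environ if env is None else env
    return env.get("MLFLOW_TRACKING_URI")


def log_example_run():
    """Log one illustrative parameter and metric to the configured server."""
    import mlflow  # imported lazily so the helper above works without mlflow

    with mlflow.start_run():
        mlflow.log_param("n_workers", 4)  # made-up example value
        mlflow.log_metric("elapsed_seconds", 12.5)  # made-up example value
```

Checking `tracking_uri()` first is a cheap way to tell whether a session is actually pointed at the hub's MLFlow service or would fall back to logging locally.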

See https://discourse.pangeo.io/t/pangeo-batch-workflows/804/7 and pangeo-data/pangeo#800 for more.

The biggest outstanding issues are probably around

  1. Image / environment handling (--no-conda kind of works, assuming the single-user env has all the needed packages)
  2. Lack of RBAC in the MLFlow UI (I think everyone can see (and maybe delete?) everyone else's experiments)

And lots of polish. It's not clear to me if we should continue to go down the MLFlow path, but it is an option.
