Deploying NASA Cryo cluster and hub #1768

Merged: sgibson91 merged 30 commits into 2i2c-org:master from nasa-cryo on Oct 21, 2022

Conversation

@sgibson91 (Member) commented Oct 13, 2022

Deployment-related changes in this PR

addresses #1702

  • Cluster files added:
    • Jsonnet files
    • SSH keys
    • Terraform files
    • cluster.yaml file
    • Deployer credentials
    • Grafana dashboards token
  • Helm config files added:
    • Support components
    • Common helm config to be shared between hubs
    • Hub configs for staging and prod
  • CI/CD files updated:
    • deploy grafana dashboards
    • deploy hubs
    • validate hubs

Other changes in this PR

  • Improvements to AWS deployment docs
  • Improvements to AWS terraform code and related docs
    • Output the correct command to link the deployment credentials to the cluster
  • Improvements to the AWS-related environment variables the deployer sets/unsets

@sgibson91 (Member, Author) commented Oct 13, 2022

Update: the issue described below was resolved by #1770


I'm at the point in the documentation where I need to use terraform to generate a CI/CD key: https://infrastructure.2i2c.org/en/latest/howto/operate/new-cluster/aws.html#create-account-with-finely-scoped-permissions-for-automatic-deployment However, AWS terraform now seems to be doing far more than just creating a CI/CD key, and I don't know where that is documented. (PR #1767 is my attempt to improve the documentation for this whole process, but I could use help filling in the parts I don't know.)

The .tfvars file generated by deployer generate-cluster is not complete: I was asked to provide db_instance_identifier on the command line. I set it in the tfvars file because I figured making conditional input variables is hard in terraform, and db_enabled should be set to false anyway according to this code block:

variable "db_enabled" {
default = false
type = bool
description = <<-EOT
Run a database for the hub with AWS RDS
EOT
}

But my terraform plan is failing while trying to set a root password for a MySQL database. As far as I know, I don't want this feature enabled, and it shouldn't be running!

$ tf plan -var-file=projects/nasa-cryo.tfvars -out=nasa-cryo-plan
data.aws_eks_cluster.cluster: Reading...
data.aws_partition.current: Reading...
data.aws_caller_identity.current: Reading...
data.aws_partition.current: Read complete after 0s [id=aws]
data.aws_caller_identity.current: Read complete after 0s [id=574251165169]
data.aws_eks_cluster.cluster: Read complete after 1s [id=nasa-cryo]
data.aws_security_group.cluster_nodes_shared_security_group: Reading...
data.aws_subnet.cluster_node_subnet: Reading...
data.aws_iam_policy_document.irsa_role_assume["staging"]: Reading...
data.aws_iam_policy_document.irsa_role_assume["prod"]: Reading...
data.aws_iam_policy_document.irsa_role_assume["prod"]: Read complete after 0s [id=1280574928]
data.aws_iam_policy_document.irsa_role_assume["staging"]: Read complete after 0s [id=3441862523]
data.aws_security_group.cluster_nodes_shared_security_group: Read complete after 0s [id=sg-094f411783a59613d]
data.aws_subnet.cluster_node_subnet: Read complete after 0s [id=subnet-0b5ab806454193c98]
╷
│ Error: Invalid index
│ 
│   on db.tf line 97, in provider "mysql":
│   97:   endpoint = aws_db_instance.db[0].endpoint
│     ├────────────────
│     │ aws_db_instance.db is empty tuple
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│ 
│   on db.tf line 99, in provider "mysql":
│   99:   username = aws_db_instance.db[0].username
│     ├────────────────
│     │ aws_db_instance.db is empty tuple
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│ 
│   on db.tf line 100, in provider "mysql":
│  100:   password = random_password.db_root_password[0].result
│     ├────────────────
│     │ random_password.db_root_password is empty tuple
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
╵

sgibson91 self-assigned this on Oct 13, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Oct 13, 2022
- Don't set up mysql provider when db is not enabled
- Make sure all db related resources are conditional on db being
  enabled
- Switch to a maintained fork of the mysql provider

Unblocks 2i2c-org#1768
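
For reference, the standard Terraform pattern for this class of bug is to guard every database resource with a count derived from the flag, and to avoid hard [0] indexing into tuples that may be empty. A minimal sketch assuming the variable names seen above; the real db.tf and the fix in the referenced commit may differ:

resource "random_password" "db_root_password" {
  # Created only when the database feature is turned on
  count  = var.db_enabled ? 1 : 0
  length = 16
}

resource "aws_db_instance" "db" {
  count               = var.db_enabled ? 1 : 0
  identifier          = var.db_instance_identifier
  engine              = "mysql"
  instance_class      = "db.t3.micro"
  allocated_storage   = 10
  username            = "root"
  password            = random_password.db_root_password[0].result
  skip_final_snapshot = true
}

provider "mysql" {
  # one() yields null instead of erroring on an empty tuple, so
  # `terraform plan` succeeds when db_enabled = false
  endpoint = one(aws_db_instance.db[*].endpoint)
  username = one(aws_db_instance.db[*].username)
  password = one(random_password.db_root_password[*].result)
}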
@sgibson91 (Member, Author)

Output of terraform plan:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
 <= read (data resources)

Terraform will perform the following actions:

  # data.aws_iam_policy_document.bucket_access["prod.scratch"] will be read during apply
  # (config refers to values not yet known)
 <= data "aws_iam_policy_document" "bucket_access" {
      + id   = (known after apply)
      + json = (known after apply)

      + statement {
          + actions   = [
              + "s3:*",
            ]
          + effect    = "Allow"
          + resources = [
              + (known after apply),
              + (known after apply),
            ]

          + principals {
              + identifiers = [
                  + (known after apply),
                ]
              + type        = "AWS"
            }
        }
    }

  # data.aws_iam_policy_document.bucket_access["staging.scratch-staging"] will be read during apply
  # (config refers to values not yet known)
 <= data "aws_iam_policy_document" "bucket_access" {
      + id   = (known after apply)
      + json = (known after apply)

      + statement {
          + actions   = [
              + "s3:*",
            ]
          + effect    = "Allow"
          + resources = [
              + (known after apply),
              + (known after apply),
            ]

          + principals {
              + identifiers = [
                  + (known after apply),
                ]
              + type        = "AWS"
            }
        }
    }

  # aws_efs_file_system.homedirs will be created
  + resource "aws_efs_file_system" "homedirs" {
      + arn                     = (known after apply)
      + availability_zone_id    = (known after apply)
      + availability_zone_name  = (known after apply)
      + creation_token          = (known after apply)
      + dns_name                = (known after apply)
      + encrypted               = (known after apply)
      + id                      = (known after apply)
      + kms_key_id              = (known after apply)
      + number_of_mount_targets = (known after apply)
      + owner_id                = (known after apply)
      + performance_mode        = (known after apply)
      + size_in_bytes           = (known after apply)
      + tags                    = {
          + "Name" = "hub-homedirs"
        }
      + tags_all                = {
          + "Name" = "hub-homedirs"
        }
      + throughput_mode         = "bursting"
    }

  # aws_efs_mount_target.homedirs will be created
  + resource "aws_efs_mount_target" "homedirs" {
      + availability_zone_id   = (known after apply)
      + availability_zone_name = (known after apply)
      + dns_name               = (known after apply)
      + file_system_arn        = (known after apply)
      + file_system_id         = (known after apply)
      + id                     = (known after apply)
      + ip_address             = (known after apply)
      + mount_target_dns_name  = (known after apply)
      + network_interface_id   = (known after apply)
      + owner_id               = (known after apply)
      + security_groups        = [
          + "sg-094f411783a59613d",
        ]
      + subnet_id              = "subnet-0b5ab806454193c98"
    }

  # aws_iam_access_key.continuous_deployer will be created
  + resource "aws_iam_access_key" "continuous_deployer" {
      + create_date                    = (known after apply)
      + encrypted_secret               = (known after apply)
      + encrypted_ses_smtp_password_v4 = (known after apply)
      + id                             = (known after apply)
      + key_fingerprint                = (known after apply)
      + secret                         = (sensitive value)
      + ses_smtp_password_v4           = (sensitive value)
      + status                         = "Active"
      + user                           = "hub-continuous-deployer"
    }

  # aws_iam_role.irsa_role["prod"] will be created
  + resource "aws_iam_role" "irsa_role" {
      + arn                   = (known after apply)
      + assume_role_policy    = jsonencode(
            {
              + Statement = [
                  + {
                      + Action    = "sts:AssumeRoleWithWebIdentity"
                      + Condition = {
                          + StringEquals = {
                              + "oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B:sub" = "system:serviceaccount:prod:user-sa"
                            }
                        }
                      + Effect    = "Allow"
                      + Principal = {
                          + Federated = "arn:aws:iam::574251165169:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B"
                        }
                      + Sid       = ""
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
      + create_date           = (known after apply)
      + force_detach_policies = false
      + id                    = (known after apply)
      + managed_policy_arns   = (known after apply)
      + max_session_duration  = 3600
      + name                  = "nasa-cryo-prod"
      + name_prefix           = (known after apply)
      + path                  = "/"
      + tags_all              = (known after apply)
      + unique_id             = (known after apply)

      + inline_policy {
          + name   = (known after apply)
          + policy = (known after apply)
        }
    }

  # aws_iam_role.irsa_role["staging"] will be created
  + resource "aws_iam_role" "irsa_role" {
      + arn                   = (known after apply)
      + assume_role_policy    = jsonencode(
            {
              + Statement = [
                  + {
                      + Action    = "sts:AssumeRoleWithWebIdentity"
                      + Condition = {
                          + StringEquals = {
                              + "oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B:sub" = "system:serviceaccount:staging:user-sa"
                            }
                        }
                      + Effect    = "Allow"
                      + Principal = {
                          + Federated = "arn:aws:iam::574251165169:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B"
                        }
                      + Sid       = ""
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
      + create_date           = (known after apply)
      + force_detach_policies = false
      + id                    = (known after apply)
      + managed_policy_arns   = (known after apply)
      + max_session_duration  = 3600
      + name                  = "nasa-cryo-staging"
      + name_prefix           = (known after apply)
      + path                  = "/"
      + tags_all              = (known after apply)
      + unique_id             = (known after apply)

      + inline_policy {
          + name   = (known after apply)
          + policy = (known after apply)
        }
    }

  # aws_iam_user.continuous_deployer will be created
  + resource "aws_iam_user" "continuous_deployer" {
      + arn           = (known after apply)
      + force_destroy = false
      + id            = (known after apply)
      + name          = "hub-continuous-deployer"
      + path          = "/"
      + tags_all      = (known after apply)
      + unique_id     = (known after apply)
    }

  # aws_iam_user_policy.continuous_deployer will be created
  + resource "aws_iam_user_policy" "continuous_deployer" {
      + id     = (known after apply)
      + name   = "eks-readonly"
      + policy = jsonencode(
            {
              + Statement = [
                  + {
                      + Action   = "eks:DescribeCluster"
                      + Effect   = "Allow"
                      + Resource = "*"
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
      + user   = "hub-continuous-deployer"
    }

  # aws_s3_bucket.user_buckets["scratch"] will be created
  + resource "aws_s3_bucket" "user_buckets" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = "nasa-cryo-scratch"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + cors_rule {
          + allowed_headers = (known after apply)
          + allowed_methods = (known after apply)
          + allowed_origins = (known after apply)
          + expose_headers  = (known after apply)
          + max_age_seconds = (known after apply)
        }

      + grant {
          + id          = (known after apply)
          + permissions = (known after apply)
          + type        = (known after apply)
          + uri         = (known after apply)
        }

      + lifecycle_rule {
          + abort_incomplete_multipart_upload_days = (known after apply)
          + enabled                                = (known after apply)
          + id                                     = (known after apply)
          + prefix                                 = (known after apply)
          + tags                                   = (known after apply)

          + expiration {
              + date                         = (known after apply)
              + days                         = (known after apply)
              + expired_object_delete_marker = (known after apply)
            }

          + noncurrent_version_expiration {
              + days = (known after apply)
            }

          + noncurrent_version_transition {
              + days          = (known after apply)
              + storage_class = (known after apply)
            }

          + transition {
              + date          = (known after apply)
              + days          = (known after apply)
              + storage_class = (known after apply)
            }
        }

      + logging {
          + target_bucket = (known after apply)
          + target_prefix = (known after apply)
        }

      + object_lock_configuration {
          + object_lock_enabled = (known after apply)

          + rule {
              + default_retention {
                  + days  = (known after apply)
                  + mode  = (known after apply)
                  + years = (known after apply)
                }
            }
        }

      + replication_configuration {
          + role = (known after apply)

          + rules {
              + delete_marker_replication_status = (known after apply)
              + id                               = (known after apply)
              + prefix                           = (known after apply)
              + priority                         = (known after apply)
              + status                           = (known after apply)

              + destination {
                  + account_id         = (known after apply)
                  + bucket             = (known after apply)
                  + replica_kms_key_id = (known after apply)
                  + storage_class      = (known after apply)

                  + access_control_translation {
                      + owner = (known after apply)
                    }

                  + metrics {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }

                  + replication_time {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }
                }

              + filter {
                  + prefix = (known after apply)
                  + tags   = (known after apply)
                }

              + source_selection_criteria {
                  + sse_kms_encrypted_objects {
                      + enabled = (known after apply)
                    }
                }
            }
        }

      + server_side_encryption_configuration {
          + rule {
              + bucket_key_enabled = (known after apply)

              + apply_server_side_encryption_by_default {
                  + kms_master_key_id = (known after apply)
                  + sse_algorithm     = (known after apply)
                }
            }
        }

      + versioning {
          + enabled    = (known after apply)
          + mfa_delete = (known after apply)
        }

      + website {
          + error_document           = (known after apply)
          + index_document           = (known after apply)
          + redirect_all_requests_to = (known after apply)
          + routing_rules            = (known after apply)
        }
    }

  # aws_s3_bucket.user_buckets["scratch-staging"] will be created
  + resource "aws_s3_bucket" "user_buckets" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = "nasa-cryo-scratch-staging"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + cors_rule {
          + allowed_headers = (known after apply)
          + allowed_methods = (known after apply)
          + allowed_origins = (known after apply)
          + expose_headers  = (known after apply)
          + max_age_seconds = (known after apply)
        }

      + grant {
          + id          = (known after apply)
          + permissions = (known after apply)
          + type        = (known after apply)
          + uri         = (known after apply)
        }

      + lifecycle_rule {
          + abort_incomplete_multipart_upload_days = (known after apply)
          + enabled                                = (known after apply)
          + id                                     = (known after apply)
          + prefix                                 = (known after apply)
          + tags                                   = (known after apply)

          + expiration {
              + date                         = (known after apply)
              + days                         = (known after apply)
              + expired_object_delete_marker = (known after apply)
            }

          + noncurrent_version_expiration {
              + days = (known after apply)
            }

          + noncurrent_version_transition {
              + days          = (known after apply)
              + storage_class = (known after apply)
            }

          + transition {
              + date          = (known after apply)
              + days          = (known after apply)
              + storage_class = (known after apply)
            }
        }

      + logging {
          + target_bucket = (known after apply)
          + target_prefix = (known after apply)
        }

      + object_lock_configuration {
          + object_lock_enabled = (known after apply)

          + rule {
              + default_retention {
                  + days  = (known after apply)
                  + mode  = (known after apply)
                  + years = (known after apply)
                }
            }
        }

      + replication_configuration {
          + role = (known after apply)

          + rules {
              + delete_marker_replication_status = (known after apply)
              + id                               = (known after apply)
              + prefix                           = (known after apply)
              + priority                         = (known after apply)
              + status                           = (known after apply)

              + destination {
                  + account_id         = (known after apply)
                  + bucket             = (known after apply)
                  + replica_kms_key_id = (known after apply)
                  + storage_class      = (known after apply)

                  + access_control_translation {
                      + owner = (known after apply)
                    }

                  + metrics {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }

                  + replication_time {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }
                }

              + filter {
                  + prefix = (known after apply)
                  + tags   = (known after apply)
                }

              + source_selection_criteria {
                  + sse_kms_encrypted_objects {
                      + enabled = (known after apply)
                    }
                }
            }
        }

      + server_side_encryption_configuration {
          + rule {
              + bucket_key_enabled = (known after apply)

              + apply_server_side_encryption_by_default {
                  + kms_master_key_id = (known after apply)
                  + sse_algorithm     = (known after apply)
                }
            }
        }

      + versioning {
          + enabled    = (known after apply)
          + mfa_delete = (known after apply)
        }

      + website {
          + error_document           = (known after apply)
          + index_document           = (known after apply)
          + redirect_all_requests_to = (known after apply)
          + routing_rules            = (known after apply)
        }
    }

  # aws_s3_bucket_lifecycle_configuration.user_bucket_expiry["scratch"] will be created
  + resource "aws_s3_bucket_lifecycle_configuration" "user_bucket_expiry" {
      + bucket = "nasa-cryo-scratch"
      + id     = (known after apply)

      + rule {
          + id     = "delete-after-expiry"
          + status = "Enabled"

          + expiration {
              + days                         = 7
              + expired_object_delete_marker = (known after apply)
            }
        }
    }

  # aws_s3_bucket_lifecycle_configuration.user_bucket_expiry["scratch-staging"] will be created
  + resource "aws_s3_bucket_lifecycle_configuration" "user_bucket_expiry" {
      + bucket = "nasa-cryo-scratch-staging"
      + id     = (known after apply)

      + rule {
          + id     = "delete-after-expiry"
          + status = "Enabled"

          + expiration {
              + days                         = 7
              + expired_object_delete_marker = (known after apply)
            }
        }
    }

  # aws_s3_bucket_policy.user_bucket_access["prod.scratch"] will be created
  + resource "aws_s3_bucket_policy" "user_bucket_access" {
      + bucket = (known after apply)
      + id     = (known after apply)
      + policy = (known after apply)
    }

  # aws_s3_bucket_policy.user_bucket_access["staging.scratch-staging"] will be created
  + resource "aws_s3_bucket_policy" "user_bucket_access" {
      + bucket = (known after apply)
      + id     = (known after apply)
      + policy = (known after apply)
    }

Plan: 13 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + continuous_deployer_creds = (sensitive value)
  + db_helm_config            = (sensitive value)
  + kubernetes_sa_annotations = {
      + prod    = (known after apply)
      + staging = (known after apply)
    }
  + nfs_server_dns            = (known after apply)

@sgibson91 (Member, Author) commented Oct 14, 2022

@yuvipanda @damianavila The next bug is that the deployer credentials created by terraform don't seem to work

$ python deployer use-cluster-credentials nasa-cryo

An error occurred (UnrecognizedClientException) when calling the DescribeCluster operation: The security token included in the request is invalid
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/__main__.py", line 7, in <module>
    cli.main()
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/cli.py", line 199, in main
    use_cluster_credentials(args.cluster_name)
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/deploy_actions.py", line 58, in use_cluster_credentials
    with cluster.auth():
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/cluster.py", line 29, in auth
    yield from self.auth_aws()
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/cluster.py", line 180, in auth_aws
    subprocess.check_call(
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['aws', 'eks', 'update-kubeconfig', '--name=nasa-cryo', '--region=us-west-2']' returned non-zero exit status 254.

I'm trying to deploy the support chart but can't.

I can successfully authenticate against the cluster myself, using environment variables generated by following these docs:

$ aws eks update-kubeconfig --name=nasa-cryo --region=us-west-2
Updated context arn:aws:eks:us-west-2:574251165169:cluster/nasa-cryo in /Users/sgibson/.kube/config

But whatever gets exported from terraform output -raw continuous_deployer_creds > ../../config/clusters/nasa-cryo/deployer-credentials.secret.json is not working
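
One quick way to narrow this down is to check whether the exported key pair is valid at all, independent of EKS. A debugging sketch; export the values from the creds file by hand:

$ export AWS_ACCESS_KEY_ID=AKIA...      # from deployer-credentials.secret.json
$ export AWS_SECRET_ACCESS_KEY=...      # from deployer-credentials.secret.json
$ aws sts get-caller-identity
# Expect an ARN like arn:aws:iam::574251165169:user/hub-continuous-deployer.
# An UnrecognizedClientException here means the key pair itself is invalid,
# whereas a failure only on eks:DescribeCluster would point at IAM policy
# or aws-auth instead.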

@GeorgianaElena (Member) commented Oct 14, 2022

@sgibson91, can you check whether the rows in the JSON that stores the creds use tabs instead of spaces?
I believe I once needed to change them manually.

Update:
Ah, and I believe the sops-encrypted output needs this indentation. It doesn't matter if you had the credentials indented with spaces originally, before encryption; I think sops is what turns them into tabs.

@sgibson91 (Member, Author) commented Oct 14, 2022

@sgibson91, can you check whether the rows in the JSON that stores the creds use tabs instead of spaces? I believe I once needed to change them manually.

Update: Ah, and I believe the sops-encrypted output needs this indentation. It doesn't matter if you had the credentials indented with spaces originally, before encryption; I think sops is what turns them into tabs.

This is all correct. But also, the deployer opens json files with the json library (instead of yaml) precisely because of the hard tabs. I will dig out the PR where Yuvi reintroduced this.

ETA: This commit from this PR.
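
For context, this works because JSON treats hard tabs as ordinary whitespace, while YAML forbids tabs in indentation, so a YAML loader chokes on sops' tab-indented output. A minimal illustration:

import json

# sops re-indents encrypted JSON with hard tabs; json.load() accepts tabs
# as insignificant whitespace, whereas yaml.safe_load() would reject them
# as invalid indentation.
with open("deployer-credentials.secret.json") as f:
    creds = json.load(f)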

@damianavila (Contributor)

But whatever gets exported from terraform output -raw continuous_deployer_creds > ../../config/clusters/nasa-cryo/deployer-credentials.secret.json is not working

AFAIR, the terraform pieces create the deployer IAM user but you still need to give it access manually: https://infrastructure.2i2c.org/en/latest/howto/operate/new-cluster/aws.html#grant-access-to-other-users. Maybe this is the underlying issue?
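
On EKS, granting that access means mapping the IAM user into the cluster's aws-auth ConfigMap. A sketch of one way to do it with eksctl, using names from this thread; the exact group and procedure in the 2i2c docs may differ:

$ eksctl create iamidentitymapping \
    --cluster nasa-cryo \
    --region us-west-2 \
    --arn arn:aws:iam::574251165169:user/hub-continuous-deployer \
    --username hub-continuous-deployer \
    --group system:masters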

@yuvipanda (Member)

+1 to what @damianavila linked to. Might be the cause. Unfortunately only you can do this, @sgibson91 because of https://infrastructure.2i2c.org/en/latest/howto/operate/new-cluster/aws.html?highlight=access#grant-access-to-other-users.

@sgibson91 (Member, Author) commented Oct 14, 2022

I did that though?

Edited to add: Just checked my command history and apparently I granted it access already

@yuvipanda (Member)

@sgibson91 ok that's super strange! I'll investigate. THANK YOU!

@sgibson91 (Member, Author)

[Screenshot: 2022-10-14 at 19:33:19]

@yuvipanda (Member)

@sgibson91 ah, my credentials had just expired, I think. I can now reproduce your error, getting:

An error occurred (UnrecognizedClientException) when calling the DescribeCluster operation: The security token included in the request is invalid

I'll look at that now.

@yuvipanda (Member)

@sgibson91 which terraform workspace is this in? I don't see a nasa-cryo one in:

terraform workspace list
  default
  2i2c-uk
  allen-swdb
  awi-ciroh
  callysto
* carbonplan-aws
  cloudbank
  justiceinnovationlab
  leap
  linked-earth
  m2lines
  meom-ige
  openscapes
  pilot-hubs
  uhackweeks
  utoronto

@sgibson91 (Member, Author)

@sgibson91 which terraform workspace is this in? I don't see a nasa-cryo one in:

Oh no, I fucked up and didn't create a new one! 🤦🏻‍♀️ It could be in default since I didn't run any workspace commands after initialising?
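
For the record, the commands to create and switch to a per-cluster workspace look like this (a sketch; recovering any state already applied in default is a separate step):

$ terraform workspace new nasa-cryo      # create and switch to the workspace
$ terraform workspace list               # the active workspace is starred
$ terraform plan -var-file=projects/nasa-cryo.tfvars -out=nasa-cryo-plan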

@sgibson91 (Member, Author)

Two hubs are now up and running, so I am marking this PR ready for review

sgibson91 marked this pull request as ready for review on October 19, 2022
sgibson91 requested a review from a team on October 19, 2022
@sgibson91 (Member, Author) commented Oct 19, 2022

For some reason, deployer run-hub-health-check fails for these hubs. The deployment service check can't create a user server:

------------------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------------------
Starting hub https://staging.cryointhecloud.2i2c.cloud health validation...
Running dask_test_notebook.ipynb test notebook...
Hub https://staging.cryointhecloud.2i2c.cloud not healthy! Stopping further deployments. Exception was jupyterhub server creation timeout=360 [s].
--------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------
ERROR    jhub_client.api:api.py:119 jupyterhub server creation timeout=360 [s]
============================================================================== short test summary info ==============================================================================
FAILED deployer/tests/test_hub_health.py::test_hub_healthy - TimeoutError: jupyterhub server creation timeout=360 [s]
1 failed in 364.77s (0:06:04)
Health check failed!

Applying the "usual" hack to fix it (below) didn't work

# Temporary fix for https://github.com/2i2c-org/infrastructure/issues/1611
# FIXME: Remove this once https://github.com/jupyterhub/kubespawner/pull/631 gets merged
user_options = None
if ("openscapes" in hub_url) or ("carbonplan" in hub_url):
    user_options = {"profile": "small", "image": "python"}

@sgibson91 (Member, Author) commented Oct 19, 2022

RE: #1768 (comment)

This isn't health-check related; I can't spawn a server either. Something about the cloud-user-sa in these logs:

[E 2022-10-19 14:23:30.543 JupyterHub pages:371] Previous spawn for sgibson91 failed: (403)
    Reason: error
    HTTP response headers: <CIMultiDictProxy('Audit-Id': '1db3b5c1-15ef-42b7-9846-0bc9dbb1e1b0', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '989e69a0-c16c-40f3-9d8c-b1547c6f3eb6', 'X-Kubernetes-Pf-Prioritylevel-Uid': '5b7375f9-48aa-4f11-a3a7-4b8261af8961', 'Date': 'Wed, 19 Oct 2022 14:23:30 GMT', 'Content-Length': '306')>
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"jupyter-sgibson91\" is forbidden: error looking up service account staging/cloud-user-sa: serviceaccount \"cloud-user-sa\" not found","reason":"Forbidden","details":{"name":"jupyter-sgibson91","kind":"pods"},"code":403}

ETA: I removed serviceAccountName from the common config, and now I'm allowed to attempt to spawn servers. This must have been a specialisation for the openscapes cluster that was copy-pasted; see the sketch below. (All the more reason for templates!)
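
The offending line was roughly the following, in the shared singleuser config (a reconstruction; the exact nesting in the common helm values is assumed):

jupyterhub:
  singleuser:
    # Removed: the cloud-user-sa ServiceAccount only exists on clusters
    # (openscapes, carbonplan) that pre-date the IRSA annotation approach,
    # so referencing it elsewhere makes pod creation fail with 403 Forbidden.
    serviceAccountName: cloud-user-sa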

I think this was copy-pasta from openscapes and was causing user servers
to not spawn
@sgibson91 (Member, Author)

Now my spawn is just hanging

[Screenshot: 2022-10-19 at 15:35:17]

$ k describe pod jupyter-sgibson91
[...]
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  14s (x8 over 7m25s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.

I think it might be trying to auto-scale but can't. I'm not seeing the kind of scaling-failed messages I'd expect, though.

[I 2022-10-19 14:28:17.997 JupyterHub app:3095] Private Hub API connect url http://hub:8081/hub/
[I 2022-10-19 14:28:17.997 JupyterHub app:3104] Starting managed service jupyterhub-idle-culler
[I 2022-10-19 14:28:17.997 JupyterHub service:385] Starting service 'jupyterhub-idle-culler': ['python3', '-m', 'jupyterhub_idle_culler', '--url=http://localhost:8081/hub/api', '--timeout=3600', '--cull-every=600', '--concurrency=10']
[I 2022-10-19 14:28:17.999 JupyterHub service:133] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10
[I 2022-10-19 14:28:18.004 JupyterHub app:3104] Starting managed service configurator at http://configurator:10101
[I 2022-10-19 14:28:18.004 JupyterHub service:385] Starting service 'configurator': ['python3', '-m', 'jupyterhub_configurator.app', '--Configurator.config_file=/usr/local/etc/jupyterhub-configurator/jupyterhub_configurator_config.py']
[I 2022-10-19 14:28:18.007 JupyterHub service:133] Spawning python3 -m jupyterhub_configurator.app --Configurator.config_file=/usr/local/etc/jupyterhub-configurator/jupyterhub_configurator_config.py
[I 2022-10-19 14:28:18.132 JupyterHub log:186] 200 GET /hub/api/ ([email protected]) 17.04ms
[I 2022-10-19 14:28:18.145 JupyterHub log:186] 200 GET /hub/api/users?state=[secret] ([email protected]) 11.70ms
[I 2022-10-19 14:28:19.025 JupyterHub app:3113] Adding external service dask-gateway at http://traefik-staging-dask-gateway.staging
[I 2022-10-19 14:28:19.028 JupyterHub app:3113] Adding external service hub-health
[I 2022-10-19 14:28:19.031 JupyterHub app:3162] JupyterHub is now running, internal Hub API at http://hub:8081/hub/
[I 2022-10-19 14:28:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.12ms
User sgibson91 is part of teams WhitakerLab:labmembers 2i2c-org:tech-team WhitakerLab:turingregcollabs
Allowing profile Small: m5.large for user sgibson91 based on team membership
Allowing profile Medium: m5.xlarge for user sgibson91 based on team membership
Allowing profile Large: m5.2xlarge for user sgibson91 based on team membership
Allowing profile Huge: m5.8xlarge for user sgibson91 based on team membership
[I 2022-10-19 14:29:13.554 JupyterHub log:186] 200 GET /hub/spawn/sgibson91 ([email protected]) 58.97ms
[I 2022-10-19 14:29:13.555 JupyterHub reflector:274] watching for pods with label selector='component=singleuser-server' in namespace staging
[I 2022-10-19 14:29:13.560 JupyterHub reflector:274] watching for events with field selector='involvedObject.kind=Pod' in namespace staging
[I 2022-10-19 14:29:15.693 JupyterHub provider:651] Creating oauth client jupyterhub-user-sgibson91
[I 2022-10-19 14:29:15.709 JupyterHub log:186] 302 POST /hub/spawn/sgibson91 -> /hub/spawn-pending/sgibson91 ([email protected]) 36.51ms
[I 2022-10-19 14:29:15.727 JupyterHub spawner:2469] Attempting to create pod jupyter-sgibson91, with timeout 3
[I 2022-10-19 14:29:15.884 JupyterHub pages:394] sgibson91 is pending spawn
[I 2022-10-19 14:29:15.888 JupyterHub log:186] 200 GET /hub/spawn-pending/sgibson91 ([email protected]) 6.81ms
[I 2022-10-19 14:29:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.34ms
[I 2022-10-19 14:30:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.33ms
[I 2022-10-19 14:31:16.105 JupyterHub log:186] 302 GET / -> /hub/ (@192.168.21.95) 0.68ms
[I 2022-10-19 14:31:16.171 JupyterHub log:186] 302 GET /hub/ -> /hub/login?next=%2Fhub%2F (@192.168.21.95) 0.92ms
[I 2022-10-19 14:31:16.252 JupyterHub log:186] 200 GET /hub/login?next=/hub/ (@192.168.21.95) 15.87ms
[I 2022-10-19 14:31:32.913 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 5.09ms
[I 2022-10-19 14:32:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.92ms
[I 2022-10-19 14:33:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 5.03ms

@sgibson91 (Member, Author)

Exactly which quotas are we supposed to request increases for?

[Screenshot: 2022-10-19 at 15:40:43]

@GeorgianaElena (Member)

Pinging @damianavila in case he might be able to help with ☝🏼

@yuvipanda (Member)

@sgibson91 @GeorgianaElena on EKS, the cluster autoscaler needs to be explicitly enabled by us in our support chart. I did so in fedcfb2 and it's on now!

The cloud-user-sa is also openscapes and carbonplan specific, and predates the work in https://infrastructure.2i2c.org/en/latest/howto/features/cloud-access.html that involves annotations and stuff.

We should definitely have templates here, including for the support charts; a sketch of the relevant values follows.
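
For anyone following along, enabling the autoscaler on EKS amounts to flipping it on in the support chart's values, roughly like this (key names assumed from the chart's cluster-autoscaler dependency):

# support chart values for an EKS cluster (key names assumed)
cluster-autoscaler:
  enabled: true
  autoDiscovery:
    clusterName: nasa-cryo   # must match the EKS cluster name
  awsRegion: us-west-2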

@yuvipanda (Member)

I opened #1800 to get rid of cloud-user-sa.

@yuvipanda (Member)

As for quotas, we need mostly EC2 quotas (https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas) for new nodes to come up. In particular:

All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests is what is used for dask instances (as they are spot instances), and Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances is what is used for notebook and core instances. The values are 'total CPUs', so bigger nodes consume more quota.
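
These quotas can also be inspected and raised from the CLI via the Service Quotas API. A sketch (the quota code below is an assumption; confirm it with the list command first):

# List the EC2 quotas whose names mention "Standard"
$ aws service-quotas list-service-quotas --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'Standard')].[QuotaName,QuotaCode,Value]" \
    --output table

# Request an increase; the value is total vCPUs, not instance count
$ aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --desired-value 128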

@sgibson91 (Member, Author)

Thanks @yuvipanda - I updated our docs around cluster-autoscaler and quotas as well. It's probably fine for now, and I'll give the flow a rethink when I begin tackling #1757

@yuvipanda (Member) left a review:

Yay awesome!

@sgibson91 (Member, Author)

Will hold off merging this until Tasha has set the CNAME they would like to use :)

@sgibson91 (Member, Author)

I updated the teams to match the capitalisation in the slugs, not the display names: #1702 (comment)

I also updated the domains so the hubs are available at the desired CNAMEs of the community.

Merging now!

sgibson91 merged commit 4496070 into 2i2c-org:master on Oct 21, 2022
sgibson91 deleted the nasa-cryo branch on October 21, 2022
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/3296485343
