Deploying NASA Cryo cluster and hub #1768

Merged: sgibson91 merged 30 commits into 2i2c-org:master from nasa-cryo on Oct 21, 2022

Conversation

@sgibson91 (Member) commented Oct 13, 2022

Deployment-related changes in this PR

addresses #1702

  • Cluster files added:
    • Jsonnet files
    • SSH keys
    • Terraform files
    • cluster.yaml file
    • Deployer credentials
    • Grafana dashboards token
  • Helm config files added:
    • Support components
    • Common helm config to be shared between hubs
    • Hub configs for staging and prod
  • CI/CD files updated:
    • deploy grafana dashboards
    • deploy hubs
    • validate hubs

Other changes in this PR

  • Improvements to AWS deployment docs
  • Improvements to AWS terraform code and related docs
    • Output the correct command to link the deployment credentials to the cluster
  • Improvements to the AWS-related environment variables the deployer sets/unsets

@sgibson91 (Member, Author) commented Oct 13, 2022

Update: the issue described below was resolved by #1770


I'm at the point in the documentation where I need to use terraform to generate a CI/CD key: https://infrastructure.2i2c.org/en/latest/howto/operate/new-cluster/aws.html#create-account-with-finely-scoped-permissions-for-automatic-deployment However, AWS terraform now seems to be doing far more than just creating a CI/CD key, and I don't know where that is documented. (PR #1767 is my attempt to improve the documentation for this whole process, but I could use help filling in the parts I don't know.)

The .tfvars file generated by deployer generate-cluster is not complete: I was asked to provide db_instance_identifier on the command line. I set it in the tfvars file because I figured making conditional input variables is hard in terraform, and db_enabled should be set to false anyway according to this code block:

variable "db_enabled" {
default = false
type = bool
description = <<-EOT
Run a database for the hub with AWS RDS
EOT
}

But my terraform plan is failing while trying to set a root password for a MySQL database. As far as I know, I don't want this feature enabled, and it shouldn't be running!

$ tf plan -var-file=projects/nasa-cryo.tfvars -out=nasa-cryo-plan
data.aws_eks_cluster.cluster: Reading...
data.aws_partition.current: Reading...
data.aws_caller_identity.current: Reading...
data.aws_partition.current: Read complete after 0s [id=aws]
data.aws_caller_identity.current: Read complete after 0s [id=574251165169]
data.aws_eks_cluster.cluster: Read complete after 1s [id=nasa-cryo]
data.aws_security_group.cluster_nodes_shared_security_group: Reading...
data.aws_subnet.cluster_node_subnet: Reading...
data.aws_iam_policy_document.irsa_role_assume["staging"]: Reading...
data.aws_iam_policy_document.irsa_role_assume["prod"]: Reading...
data.aws_iam_policy_document.irsa_role_assume["prod"]: Read complete after 0s [id=1280574928]
data.aws_iam_policy_document.irsa_role_assume["staging"]: Read complete after 0s [id=3441862523]
data.aws_security_group.cluster_nodes_shared_security_group: Read complete after 0s [id=sg-094f411783a59613d]
data.aws_subnet.cluster_node_subnet: Read complete after 0s [id=subnet-0b5ab806454193c98]
╷
│ Error: Invalid index
│ 
│   on db.tf line 97, in provider "mysql":
│   97:   endpoint = aws_db_instance.db[0].endpoint
│     ├────────────────
│     │ aws_db_instance.db is empty tuple
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│ 
│   on db.tf line 99, in provider "mysql":
│   99:   username = aws_db_instance.db[0].username
│     ├────────────────
│     │ aws_db_instance.db is empty tuple
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│ 
│   on db.tf line 100, in provider "mysql":
│  100:   password = random_password.db_root_password[0].result
│     ├────────────────
│     │ random_password.db_root_password is empty tuple
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
╵

sgibson91 self-assigned this on Oct 13, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Oct 13, 2022
- Don't set up mysql provider when db is not enabled
- Make sure all db related resources are conditional on db being
  enabled
- Switch to a maintained fork of the mysql provider

Unblocks 2i2c-org#1768
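
For reference, the standard Terraform pattern for this class of bug is to guard every database resource with a count derived from the flag, and to avoid hard [0] indexing into tuples that may be empty. A minimal sketch assuming the variable names seen above; the real db.tf and the fix in the referenced commit may differ:

resource "random_password" "db_root_password" {
  # Created only when the database feature is turned on
  count  = var.db_enabled ? 1 : 0
  length = 16
}

resource "aws_db_instance" "db" {
  count               = var.db_enabled ? 1 : 0
  identifier          = var.db_instance_identifier
  engine              = "mysql"
  instance_class      = "db.t3.micro"
  allocated_storage   = 10
  username            = "root"
  password            = random_password.db_root_password[0].result
  skip_final_snapshot = true
}

provider "mysql" {
  # one() yields null instead of erroring on an empty tuple, so
  # `terraform plan` succeeds when db_enabled = false
  endpoint = one(aws_db_instance.db[*].endpoint)
  username = one(aws_db_instance.db[*].username)
  password = one(random_password.db_root_password[*].result)
}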
@sgibson91 (Member, Author)

Output of terraform plan:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
 <= read (data resources)

Terraform will perform the following actions:

  # data.aws_iam_policy_document.bucket_access["prod.scratch"] will be read during apply
  # (config refers to values not yet known)
 <= data "aws_iam_policy_document" "bucket_access" {
      + id   = (known after apply)
      + json = (known after apply)

      + statement {
          + actions   = [
              + "s3:*",
            ]
          + effect    = "Allow"
          + resources = [
              + (known after apply),
              + (known after apply),
            ]

          + principals {
              + identifiers = [
                  + (known after apply),
                ]
              + type        = "AWS"
            }
        }
    }

  # data.aws_iam_policy_document.bucket_access["staging.scratch-staging"] will be read during apply
  # (config refers to values not yet known)
 <= data "aws_iam_policy_document" "bucket_access" {
      + id   = (known after apply)
      + json = (known after apply)

      + statement {
          + actions   = [
              + "s3:*",
            ]
          + effect    = "Allow"
          + resources = [
              + (known after apply),
              + (known after apply),
            ]

          + principals {
              + identifiers = [
                  + (known after apply),
                ]
              + type        = "AWS"
            }
        }
    }

  # aws_efs_file_system.homedirs will be created
  + resource "aws_efs_file_system" "homedirs" {
      + arn                     = (known after apply)
      + availability_zone_id    = (known after apply)
      + availability_zone_name  = (known after apply)
      + creation_token          = (known after apply)
      + dns_name                = (known after apply)
      + encrypted               = (known after apply)
      + id                      = (known after apply)
      + kms_key_id              = (known after apply)
      + number_of_mount_targets = (known after apply)
      + owner_id                = (known after apply)
      + performance_mode        = (known after apply)
      + size_in_bytes           = (known after apply)
      + tags                    = {
          + "Name" = "hub-homedirs"
        }
      + tags_all                = {
          + "Name" = "hub-homedirs"
        }
      + throughput_mode         = "bursting"
    }

  # aws_efs_mount_target.homedirs will be created
  + resource "aws_efs_mount_target" "homedirs" {
      + availability_zone_id   = (known after apply)
      + availability_zone_name = (known after apply)
      + dns_name               = (known after apply)
      + file_system_arn        = (known after apply)
      + file_system_id         = (known after apply)
      + id                     = (known after apply)
      + ip_address             = (known after apply)
      + mount_target_dns_name  = (known after apply)
      + network_interface_id   = (known after apply)
      + owner_id               = (known after apply)
      + security_groups        = [
          + "sg-094f411783a59613d",
        ]
      + subnet_id              = "subnet-0b5ab806454193c98"
    }

  # aws_iam_access_key.continuous_deployer will be created
  + resource "aws_iam_access_key" "continuous_deployer" {
      + create_date                    = (known after apply)
      + encrypted_secret               = (known after apply)
      + encrypted_ses_smtp_password_v4 = (known after apply)
      + id                             = (known after apply)
      + key_fingerprint                = (known after apply)
      + secret                         = (sensitive value)
      + ses_smtp_password_v4           = (sensitive value)
      + status                         = "Active"
      + user                           = "hub-continuous-deployer"
    }

  # aws_iam_role.irsa_role["prod"] will be created
  + resource "aws_iam_role" "irsa_role" {
      + arn                   = (known after apply)
      + assume_role_policy    = jsonencode(
            {
              + Statement = [
                  + {
                      + Action    = "sts:AssumeRoleWithWebIdentity"
                      + Condition = {
                          + StringEquals = {
                              + "oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B:sub" = "system:serviceaccount:prod:user-sa"
                            }
                        }
                      + Effect    = "Allow"
                      + Principal = {
                          + Federated = "arn:aws:iam::574251165169:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B"
                        }
                      + Sid       = ""
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
      + create_date           = (known after apply)
      + force_detach_policies = false
      + id                    = (known after apply)
      + managed_policy_arns   = (known after apply)
      + max_session_duration  = 3600
      + name                  = "nasa-cryo-prod"
      + name_prefix           = (known after apply)
      + path                  = "/"
      + tags_all              = (known after apply)
      + unique_id             = (known after apply)

      + inline_policy {
          + name   = (known after apply)
          + policy = (known after apply)
        }
    }

  # aws_iam_role.irsa_role["staging"] will be created
  + resource "aws_iam_role" "irsa_role" {
      + arn                   = (known after apply)
      + assume_role_policy    = jsonencode(
            {
              + Statement = [
                  + {
                      + Action    = "sts:AssumeRoleWithWebIdentity"
                      + Condition = {
                          + StringEquals = {
                              + "oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B:sub" = "system:serviceaccount:staging:user-sa"
                            }
                        }
                      + Effect    = "Allow"
                      + Principal = {
                          + Federated = "arn:aws:iam::574251165169:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/3D5422BE1F96E1111E1C0B882B64A16B"
                        }
                      + Sid       = ""
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
      + create_date           = (known after apply)
      + force_detach_policies = false
      + id                    = (known after apply)
      + managed_policy_arns   = (known after apply)
      + max_session_duration  = 3600
      + name                  = "nasa-cryo-staging"
      + name_prefix           = (known after apply)
      + path                  = "/"
      + tags_all              = (known after apply)
      + unique_id             = (known after apply)

      + inline_policy {
          + name   = (known after apply)
          + policy = (known after apply)
        }
    }

  # aws_iam_user.continuous_deployer will be created
  + resource "aws_iam_user" "continuous_deployer" {
      + arn           = (known after apply)
      + force_destroy = false
      + id            = (known after apply)
      + name          = "hub-continuous-deployer"
      + path          = "/"
      + tags_all      = (known after apply)
      + unique_id     = (known after apply)
    }

  # aws_iam_user_policy.continuous_deployer will be created
  + resource "aws_iam_user_policy" "continuous_deployer" {
      + id     = (known after apply)
      + name   = "eks-readonly"
      + policy = jsonencode(
            {
              + Statement = [
                  + {
                      + Action   = "eks:DescribeCluster"
                      + Effect   = "Allow"
                      + Resource = "*"
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
      + user   = "hub-continuous-deployer"
    }

  # aws_s3_bucket.user_buckets["scratch"] will be created
  + resource "aws_s3_bucket" "user_buckets" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = "nasa-cryo-scratch"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + cors_rule {
          + allowed_headers = (known after apply)
          + allowed_methods = (known after apply)
          + allowed_origins = (known after apply)
          + expose_headers  = (known after apply)
          + max_age_seconds = (known after apply)
        }

      + grant {
          + id          = (known after apply)
          + permissions = (known after apply)
          + type        = (known after apply)
          + uri         = (known after apply)
        }

      + lifecycle_rule {
          + abort_incomplete_multipart_upload_days = (known after apply)
          + enabled                                = (known after apply)
          + id                                     = (known after apply)
          + prefix                                 = (known after apply)
          + tags                                   = (known after apply)

          + expiration {
              + date                         = (known after apply)
              + days                         = (known after apply)
              + expired_object_delete_marker = (known after apply)
            }

          + noncurrent_version_expiration {
              + days = (known after apply)
            }

          + noncurrent_version_transition {
              + days          = (known after apply)
              + storage_class = (known after apply)
            }

          + transition {
              + date          = (known after apply)
              + days          = (known after apply)
              + storage_class = (known after apply)
            }
        }

      + logging {
          + target_bucket = (known after apply)
          + target_prefix = (known after apply)
        }

      + object_lock_configuration {
          + object_lock_enabled = (known after apply)

          + rule {
              + default_retention {
                  + days  = (known after apply)
                  + mode  = (known after apply)
                  + years = (known after apply)
                }
            }
        }

      + replication_configuration {
          + role = (known after apply)

          + rules {
              + delete_marker_replication_status = (known after apply)
              + id                               = (known after apply)
              + prefix                           = (known after apply)
              + priority                         = (known after apply)
              + status                           = (known after apply)

              + destination {
                  + account_id         = (known after apply)
                  + bucket             = (known after apply)
                  + replica_kms_key_id = (known after apply)
                  + storage_class      = (known after apply)

                  + access_control_translation {
                      + owner = (known after apply)
                    }

                  + metrics {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }

                  + replication_time {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }
                }

              + filter {
                  + prefix = (known after apply)
                  + tags   = (known after apply)
                }

              + source_selection_criteria {
                  + sse_kms_encrypted_objects {
                      + enabled = (known after apply)
                    }
                }
            }
        }

      + server_side_encryption_configuration {
          + rule {
              + bucket_key_enabled = (known after apply)

              + apply_server_side_encryption_by_default {
                  + kms_master_key_id = (known after apply)
                  + sse_algorithm     = (known after apply)
                }
            }
        }

      + versioning {
          + enabled    = (known after apply)
          + mfa_delete = (known after apply)
        }

      + website {
          + error_document           = (known after apply)
          + index_document           = (known after apply)
          + redirect_all_requests_to = (known after apply)
          + routing_rules            = (known after apply)
        }
    }

  # aws_s3_bucket.user_buckets["scratch-staging"] will be created
  + resource "aws_s3_bucket" "user_buckets" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = "nasa-cryo-scratch-staging"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + cors_rule {
          + allowed_headers = (known after apply)
          + allowed_methods = (known after apply)
          + allowed_origins = (known after apply)
          + expose_headers  = (known after apply)
          + max_age_seconds = (known after apply)
        }

      + grant {
          + id          = (known after apply)
          + permissions = (known after apply)
          + type        = (known after apply)
          + uri         = (known after apply)
        }

      + lifecycle_rule {
          + abort_incomplete_multipart_upload_days = (known after apply)
          + enabled                                = (known after apply)
          + id                                     = (known after apply)
          + prefix                                 = (known after apply)
          + tags                                   = (known after apply)

          + expiration {
              + date                         = (known after apply)
              + days                         = (known after apply)
              + expired_object_delete_marker = (known after apply)
            }

          + noncurrent_version_expiration {
              + days = (known after apply)
            }

          + noncurrent_version_transition {
              + days          = (known after apply)
              + storage_class = (known after apply)
            }

          + transition {
              + date          = (known after apply)
              + days          = (known after apply)
              + storage_class = (known after apply)
            }
        }

      + logging {
          + target_bucket = (known after apply)
          + target_prefix = (known after apply)
        }

      + object_lock_configuration {
          + object_lock_enabled = (known after apply)

          + rule {
              + default_retention {
                  + days  = (known after apply)
                  + mode  = (known after apply)
                  + years = (known after apply)
                }
            }
        }

      + replication_configuration {
          + role = (known after apply)

          + rules {
              + delete_marker_replication_status = (known after apply)
              + id                               = (known after apply)
              + prefix                           = (known after apply)
              + priority                         = (known after apply)
              + status                           = (known after apply)

              + destination {
                  + account_id         = (known after apply)
                  + bucket             = (known after apply)
                  + replica_kms_key_id = (known after apply)
                  + storage_class      = (known after apply)

                  + access_control_translation {
                      + owner = (known after apply)
                    }

                  + metrics {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }

                  + replication_time {
                      + minutes = (known after apply)
                      + status  = (known after apply)
                    }
                }

              + filter {
                  + prefix = (known after apply)
                  + tags   = (known after apply)
                }

              + source_selection_criteria {
                  + sse_kms_encrypted_objects {
                      + enabled = (known after apply)
                    }
                }
            }
        }

      + server_side_encryption_configuration {
          + rule {
              + bucket_key_enabled = (known after apply)

              + apply_server_side_encryption_by_default {
                  + kms_master_key_id = (known after apply)
                  + sse_algorithm     = (known after apply)
                }
            }
        }

      + versioning {
          + enabled    = (known after apply)
          + mfa_delete = (known after apply)
        }

      + website {
          + error_document           = (known after apply)
          + index_document           = (known after apply)
          + redirect_all_requests_to = (known after apply)
          + routing_rules            = (known after apply)
        }
    }

  # aws_s3_bucket_lifecycle_configuration.user_bucket_expiry["scratch"] will be created
  + resource "aws_s3_bucket_lifecycle_configuration" "user_bucket_expiry" {
      + bucket = "nasa-cryo-scratch"
      + id     = (known after apply)

      + rule {
          + id     = "delete-after-expiry"
          + status = "Enabled"

          + expiration {
              + days                         = 7
              + expired_object_delete_marker = (known after apply)
            }
        }
    }

  # aws_s3_bucket_lifecycle_configuration.user_bucket_expiry["scratch-staging"] will be created
  + resource "aws_s3_bucket_lifecycle_configuration" "user_bucket_expiry" {
      + bucket = "nasa-cryo-scratch-staging"
      + id     = (known after apply)

      + rule {
          + id     = "delete-after-expiry"
          + status = "Enabled"

          + expiration {
              + days                         = 7
              + expired_object_delete_marker = (known after apply)
            }
        }
    }

  # aws_s3_bucket_policy.user_bucket_access["prod.scratch"] will be created
  + resource "aws_s3_bucket_policy" "user_bucket_access" {
      + bucket = (known after apply)
      + id     = (known after apply)
      + policy = (known after apply)
    }

  # aws_s3_bucket_policy.user_bucket_access["staging.scratch-staging"] will be created
  + resource "aws_s3_bucket_policy" "user_bucket_access" {
      + bucket = (known after apply)
      + id     = (known after apply)
      + policy = (known after apply)
    }

Plan: 13 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + continuous_deployer_creds = (sensitive value)
  + db_helm_config            = (sensitive value)
  + kubernetes_sa_annotations = {
      + prod    = (known after apply)
      + staging = (known after apply)
    }
  + nfs_server_dns            = (known after apply)

@sgibson91 (Member, Author) commented Oct 14, 2022

@yuvipanda @damianavila The next bug is that the deployer credentials created by terraform don't seem to work

$ python deployer use-cluster-credentials nasa-cryo

An error occurred (UnrecognizedClientException) when calling the DescribeCluster operation: The security token included in the request is invalid
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/__main__.py", line 7, in <module>
    cli.main()
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/cli.py", line 199, in main
    use_cluster_credentials(args.cluster_name)
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/deploy_actions.py", line 58, in use_cluster_credentials
    with cluster.auth():
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/cluster.py", line 29, in auth
    yield from self.auth_aws()
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/cluster.py", line 180, in auth_aws
    subprocess.check_call(
  File "/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['aws', 'eks', 'update-kubeconfig', '--name=nasa-cryo', '--region=us-west-2']' returned non-zero exit status 254.

I'm trying to deploy the support chart but can't.

I can successfully authenticate against the cluster myself, using environment variables generated by following these docs:

$ aws eks update-kubeconfig --name=nasa-cryo --region=us-west-2
Updated context arn:aws:eks:us-west-2:574251165169:cluster/nasa-cryo in /Users/sgibson/.kube/config

But whatever gets exported from terraform output -raw continuous_deployer_creds > ../../config/clusters/nasa-cryo/deployer-credentials.secret.json is not working
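
One quick way to narrow this down is to check whether the exported key pair is valid at all, independent of EKS. A debugging sketch; export the values from the creds file by hand:

$ export AWS_ACCESS_KEY_ID=AKIA...      # from deployer-credentials.secret.json
$ export AWS_SECRET_ACCESS_KEY=...      # from deployer-credentials.secret.json
$ aws sts get-caller-identity
# Expect an ARN like arn:aws:iam::574251165169:user/hub-continuous-deployer.
# An UnrecognizedClientException here means the key pair itself is invalid,
# whereas a failure only on eks:DescribeCluster would point at IAM policy
# or aws-auth instead.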

@GeorgianaElena (Member) commented Oct 14, 2022

@sgibson91, can you check whether the rows in the JSON that stores the creds use tabs instead of spaces?
I believe I once needed to change them manually.

Update:
Ah, and I believe the sops-encrypted output needs this indentation. It doesn't matter if you had the credentials indented with spaces originally, before encryption; I think sops is what turns them into tabs.

@sgibson91 (Member, Author) commented Oct 14, 2022

@sgibson91, can you check whether the rows in the JSON that stores the creds use tabs instead of spaces? I believe I once needed to change them manually.

Update: Ah, and I believe the sops-encrypted output needs this indentation. It doesn't matter if you had the credentials indented with spaces originally, before encryption; I think sops is what turns them into tabs.

This is all correct. But also, the deployer opens json files with the json library (instead of yaml) precisely because of the hard tabs. I will dig out the PR where Yuvi reintroduced this.

ETA: This commit from this PR.
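
For context, this works because JSON treats hard tabs as ordinary whitespace, while YAML forbids tabs in indentation, so a YAML loader chokes on sops' tab-indented output. A minimal illustration:

import json

# sops re-indents encrypted JSON with hard tabs; json.load() accepts tabs
# as insignificant whitespace, whereas yaml.safe_load() would reject them
# as invalid indentation.
with open("deployer-credentials.secret.json") as f:
    creds = json.load(f)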

@damianavila (Contributor)

But whatever gets exported from terraform output -raw continuous_deployer_creds > ../../config/clusters/nasa-cryo/deployer-credentials.secret.json is not working

AFAIR, the terraform pieces create the deployer IAM user but you still need to give it access manually: https://infrastructure.2i2c.org/en/latest/howto/operate/new-cluster/aws.html#grant-access-to-other-users. Maybe this is the underlying issue?
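
On EKS, granting that access means mapping the IAM user into the cluster's aws-auth ConfigMap. A sketch of one way to do it with eksctl, using names from this thread; the exact group and procedure in the 2i2c docs may differ:

$ eksctl create iamidentitymapping \
    --cluster nasa-cryo \
    --region us-west-2 \
    --arn arn:aws:iam::574251165169:user/hub-continuous-deployer \
    --username hub-continuous-deployer \
    --group system:masters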

@yuvipanda (Member)

+1 to what @damianavila linked to. Might be the cause. Unfortunately only you can do this, @sgibson91 because of https://infrastructure.2i2c.org/en/latest/howto/operate/new-cluster/aws.html?highlight=access#grant-access-to-other-users.

@sgibson91 (Member, Author) commented Oct 14, 2022

I did that though?

Edited to add: Just checked my command history and apparently I granted it access already

@yuvipanda (Member)

@sgibson91 ok that's super strange! I'll investigate. THANK YOU!

@sgibson91 (Member, Author)

[Screenshot: 2022-10-14 at 19:33:19]

@yuvipanda (Member)

@sgibson91 ah, my credentials had just expired, I think. I can now reproduce your error, getting:

An error occurred (UnrecognizedClientException) when calling the DescribeCluster operation: The security token included in the request is invalid

I'll look at that now.

@yuvipanda (Member)

@sgibson91 which terraform workspace is this in? I don't see a nasa-cryo one in:

terraform workspace list
  default
  2i2c-uk
  allen-swdb
  awi-ciroh
  callysto
* carbonplan-aws
  cloudbank
  justiceinnovationlab
  leap
  linked-earth
  m2lines
  meom-ige
  openscapes
  pilot-hubs
  uhackweeks
  utoronto

@sgibson91 (Member, Author)

@sgibson91 which terraform workspace is this in? I don't see a nasa-cryo one in:

Oh no, I fucked up and didn't create a new one! 🤦🏻‍♀️ It could be in default since I didn't run any workspace commands after initialising?
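
For the record, the commands to create and switch to a per-cluster workspace look like this (a sketch; recovering any state already applied in default is a separate step):

$ terraform workspace new nasa-cryo      # create and switch to the workspace
$ terraform workspace list               # the active workspace is starred
$ terraform plan -var-file=projects/nasa-cryo.tfvars -out=nasa-cryo-plan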

@sgibson91 (Member, Author)

Two hubs are now up and running, so I am marking this PR ready for review

sgibson91 marked this pull request as ready for review on October 19, 2022
sgibson91 requested a review from a team on October 19, 2022
@sgibson91 (Member, Author) commented Oct 19, 2022

For some reason, deployer run-hub-health-check fails for these hubs. The deployment service check can't create a user server:

------------------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------------------
Starting hub https://staging.cryointhecloud.2i2c.cloud health validation...
Running dask_test_notebook.ipynb test notebook...
Hub https://staging.cryointhecloud.2i2c.cloud not healthy! Stopping further deployments. Exception was jupyterhub server creation timeout=360 [s].
--------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------
ERROR    jhub_client.api:api.py:119 jupyterhub server creation timeout=360 [s]
============================================================================== short test summary info ==============================================================================
FAILED deployer/tests/test_hub_health.py::test_hub_healthy - TimeoutError: jupyterhub server creation timeout=360 [s]
1 failed in 364.77s (0:06:04)
Health check failed!

Applying the "usual" hack to fix it (below) didn't work

# Temporary fix for https://github.com/2i2c-org/infrastructure/issues/1611
# FIXME: Remove this once https://github.com/jupyterhub/kubespawner/pull/631 gets merged
user_options = None
if ("openscapes" in hub_url) or ("carbonplan" in hub_url):
    user_options = {"profile": "small", "image": "python"}

@sgibson91 (Member, Author) commented Oct 19, 2022

RE: #1768 (comment)

This isn't health-check related; I can't spawn a server either. Something about the cloud-user-sa in these logs:

[E 2022-10-19 14:23:30.543 JupyterHub pages:371] Previous spawn for sgibson91 failed: (403)
    Reason: error
    HTTP response headers: <CIMultiDictProxy('Audit-Id': '1db3b5c1-15ef-42b7-9846-0bc9dbb1e1b0', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '989e69a0-c16c-40f3-9d8c-b1547c6f3eb6', 'X-Kubernetes-Pf-Prioritylevel-Uid': '5b7375f9-48aa-4f11-a3a7-4b8261af8961', 'Date': 'Wed, 19 Oct 2022 14:23:30 GMT', 'Content-Length': '306')>
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"jupyter-sgibson91\" is forbidden: error looking up service account staging/cloud-user-sa: serviceaccount \"cloud-user-sa\" not found","reason":"Forbidden","details":{"name":"jupyter-sgibson91","kind":"pods"},"code":403}

ETA: I removed serviceAccountName from the common config, and now I'm allowed to attempt to spawn servers. This must have been a specialisation for the openscapes cluster that was copy-pasted; see the sketch below. (All the more reason for templates!)
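
The offending line was roughly the following, in the shared singleuser config (a reconstruction; the exact nesting in the common helm values is assumed):

jupyterhub:
  singleuser:
    # Removed: the cloud-user-sa ServiceAccount only exists on clusters
    # (openscapes, carbonplan) that pre-date the IRSA annotation approach,
    # so referencing it elsewhere makes pod creation fail with 403 Forbidden.
    serviceAccountName: cloud-user-sa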

I think this was copy-pasta from openscapes and was causing user servers
to not spawn
@sgibson91 (Member, Author)

Now my spawn is just hanging

[Screenshot: 2022-10-19 at 15:35:17]

$ k describe pod jupyter-sgibson91
[...]
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  14s (x8 over 7m25s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.

I think it might be trying to auto-scale but can't. I'm not seeing the kind of scaling-failed messages I'd expect, though.

[I 2022-10-19 14:28:17.997 JupyterHub app:3095] Private Hub API connect url http://hub:8081/hub/
[I 2022-10-19 14:28:17.997 JupyterHub app:3104] Starting managed service jupyterhub-idle-culler
[I 2022-10-19 14:28:17.997 JupyterHub service:385] Starting service 'jupyterhub-idle-culler': ['python3', '-m', 'jupyterhub_idle_culler', '--url=http://localhost:8081/hub/api', '--timeout=3600', '--cull-every=600', '--concurrency=10']
[I 2022-10-19 14:28:17.999 JupyterHub service:133] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10
[I 2022-10-19 14:28:18.004 JupyterHub app:3104] Starting managed service configurator at http://configurator:10101
[I 2022-10-19 14:28:18.004 JupyterHub service:385] Starting service 'configurator': ['python3', '-m', 'jupyterhub_configurator.app', '--Configurator.config_file=/usr/local/etc/jupyterhub-configurator/jupyterhub_configurator_config.py']
[I 2022-10-19 14:28:18.007 JupyterHub service:133] Spawning python3 -m jupyterhub_configurator.app --Configurator.config_file=/usr/local/etc/jupyterhub-configurator/jupyterhub_configurator_config.py
[I 2022-10-19 14:28:18.132 JupyterHub log:186] 200 GET /hub/api/ ([email protected]) 17.04ms
[I 2022-10-19 14:28:18.145 JupyterHub log:186] 200 GET /hub/api/users?state=[secret] ([email protected]) 11.70ms
[I 2022-10-19 14:28:19.025 JupyterHub app:3113] Adding external service dask-gateway at http://traefik-staging-dask-gateway.staging
[I 2022-10-19 14:28:19.028 JupyterHub app:3113] Adding external service hub-health
[I 2022-10-19 14:28:19.031 JupyterHub app:3162] JupyterHub is now running, internal Hub API at http://hub:8081/hub/
[I 2022-10-19 14:28:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.12ms
User sgibson91 is part of teams WhitakerLab:labmembers 2i2c-org:tech-team WhitakerLab:turingregcollabs
Allowing profile Small: m5.large for user sgibson91 based on team membership
Allowing profile Medium: m5.xlarge for user sgibson91 based on team membership
Allowing profile Large: m5.2xlarge for user sgibson91 based on team membership
Allowing profile Huge: m5.8xlarge for user sgibson91 based on team membership
[I 2022-10-19 14:29:13.554 JupyterHub log:186] 200 GET /hub/spawn/sgibson91 ([email protected]) 58.97ms
[I 2022-10-19 14:29:13.555 JupyterHub reflector:274] watching for pods with label selector='component=singleuser-server' in namespace staging
[I 2022-10-19 14:29:13.560 JupyterHub reflector:274] watching for events with field selector='involvedObject.kind=Pod' in namespace staging
[I 2022-10-19 14:29:15.693 JupyterHub provider:651] Creating oauth client jupyterhub-user-sgibson91
[I 2022-10-19 14:29:15.709 JupyterHub log:186] 302 POST /hub/spawn/sgibson91 -> /hub/spawn-pending/sgibson91 ([email protected]) 36.51ms
[I 2022-10-19 14:29:15.727 JupyterHub spawner:2469] Attempting to create pod jupyter-sgibson91, with timeout 3
[I 2022-10-19 14:29:15.884 JupyterHub pages:394] sgibson91 is pending spawn
[I 2022-10-19 14:29:15.888 JupyterHub log:186] 200 GET /hub/spawn-pending/sgibson91 ([email protected]) 6.81ms
[I 2022-10-19 14:29:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.34ms
[I 2022-10-19 14:30:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.33ms
[I 2022-10-19 14:31:16.105 JupyterHub log:186] 302 GET / -> /hub/ (@192.168.21.95) 0.68ms
[I 2022-10-19 14:31:16.171 JupyterHub log:186] 302 GET /hub/ -> /hub/login?next=%2Fhub%2F (@192.168.21.95) 0.92ms
[I 2022-10-19 14:31:16.252 JupyterHub log:186] 200 GET /hub/login?next=/hub/ (@192.168.21.95) 15.87ms
[I 2022-10-19 14:31:32.913 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 5.09ms
[I 2022-10-19 14:32:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 4.92ms
[I 2022-10-19 14:33:32.912 JupyterHub log:186] 200 GET /hub/metrics (@192.168.12.26) 5.03ms

@sgibson91 (Member, Author)

Exactly which quotas are we supposed to request increases for?

[Screenshot: 2022-10-19 at 15:40:43]

@GeorgianaElena (Member)

Pinging @damianavila in case he might be able to help with ☝🏼

@yuvipanda (Member)

@sgibson91 @GeorgianaElena on EKS, the cluster autoscaler needs to be explicitly enabled by us in our support chart. I did so in fedcfb2 and it's on now!

The cloud-user-sa is also openscapes and carbonplan specific, and predates the work in https://infrastructure.2i2c.org/en/latest/howto/features/cloud-access.html that involves annotations and stuff.

We should definitely have templates here, including for the support charts; a sketch of the relevant values follows.
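
For anyone following along, enabling the autoscaler on EKS amounts to flipping it on in the support chart's values, roughly like this (key names assumed from the chart's cluster-autoscaler dependency):

# support chart values for an EKS cluster (key names assumed)
cluster-autoscaler:
  enabled: true
  autoDiscovery:
    clusterName: nasa-cryo   # must match the EKS cluster name
  awsRegion: us-west-2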

@yuvipanda (Member)

I opened #1800 to get rid of cloud-user-sa.

@yuvipanda (Member)

As for quotas, we need mostly EC2 quotas (https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas) for new nodes to come up. In particular:

All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests is what is used for dask instances (as they are spot instances), and Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances is what is used for notebook and core instances. The values are 'total CPUs', so bigger nodes consume more quota.
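
These quotas can also be inspected and raised from the CLI via the Service Quotas API. A sketch (the quota code below is an assumption; confirm it with the list command first):

# List the EC2 quotas whose names mention "Standard"
$ aws service-quotas list-service-quotas --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'Standard')].[QuotaName,QuotaCode,Value]" \
    --output table

# Request an increase; the value is total vCPUs, not instance count
$ aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --desired-value 128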

@sgibson91 (Member, Author)

Thanks @yuvipanda - I updated our docs around cluster-autoscaler and quotas as well. It's probably fine for now, and I'll give the flow a rethink when I begin tackling #1757

@yuvipanda (Member) left a review:

Yay awesome!

@sgibson91 (Member, Author)

Will hold off merging this until Tasha has set the CNAME they would like to use :)

@sgibson91 (Member, Author)

I updated the teams to match the capitalisation in the slugs, not the display names: #1702 (comment)

I also updated the domains so the hubs are available at the desired CNAMEs of the community.

Merging now!

sgibson91 merged commit 4496070 into 2i2c-org:master on Oct 21, 2022
sgibson91 deleted the nasa-cryo branch on October 21, 2022
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/3296485343
