From 5934da51705fa79f56db8c9ab187a84499fac108 Mon Sep 17 00:00:00 2001 From: Poornima Krishnasamy Date: Mon, 15 Jan 2024 12:02:15 +0000 Subject: [PATCH] Upgrade lychee link-checker (#5174) * Upgrade lychee link-checker * Fix links * Commit changes made by code formatters --------- Co-authored-by: github-actions[bot] --- .github/workflows/link-checker.yml | 6 ++- .ignore-links => .lycheeignore | 0 .../006-Use-github-as-user-directory.md | 4 +- .../012-One-cluster-for-dev-staging-prod.md | 16 ++++---- .../021-Multi-cluster.md | 38 +++++++++---------- architecture-decision-record/023-Logging.md | 24 ++++++------ .../034-EKS-Fargate.md | 10 ++--- ...add-new-receiver-alert-manager.html.md.erb | 4 +- .../add-nodes-to-the-eks-cluster.html.md.erb | 2 +- .../disaster-recovery-scenarios.html.md.erb | 4 +- runbooks/source/eks-cluster.html.md.erb | 2 +- runbooks/source/how-we-work.html.md.erb | 3 +- runbooks/source/joiners-guide.html.md.erb | 2 +- runbooks/source/leavers-guide.html.md.erb | 9 ++--- runbooks/source/on-call.html.md.erb | 2 +- runbooks/source/tips-and-tricks.html.md.erb | 5 +-- .../upgrade-terraform-version.html.md.erb | 7 ++-- 17 files changed, 66 insertions(+), 72 deletions(-) rename .ignore-links => .lycheeignore (100%) diff --git a/.github/workflows/link-checker.yml b/.github/workflows/link-checker.yml index 286656a5..f7cedc9b 100644 --- a/.github/workflows/link-checker.yml +++ b/.github/workflows/link-checker.yml @@ -2,6 +2,8 @@ name: Links on: repository_dispatch: + pull_request: + types: [opened, synchronize, reopened] workflow_dispatch: schedule: - cron: "00 18 * * *" @@ -13,9 +15,9 @@ jobs: - uses: actions/checkout@v2 - name: Link Checker - uses: lycheeverse/lychee-action@v1.0.9 + uses: lycheeverse/lychee-action@v1.9.1 with: - args: --verbose --no-progress **/*.md **/*.html **/*.erb --exclude-file .ignore-links --accept 200,429,403,400,301,302 --exclude-mail + args: --verbose --no-progress **/*.md **/*.html **/*.erb --accept 200,429,403,400,301,302,401 --exclude-mail env: GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}} diff --git a/.ignore-links b/.lycheeignore similarity index 100% rename from .ignore-links rename to .lycheeignore diff --git a/architecture-decision-record/006-Use-github-as-user-directory.md b/architecture-decision-record/006-Use-github-as-user-directory.md index 2f6ae99d..a34d1b8b 100644 --- a/architecture-decision-record/006-Use-github-as-user-directory.md +++ b/architecture-decision-record/006-Use-github-as-user-directory.md @@ -10,7 +10,7 @@ Date 09/04/2018 As part of our [planning principles](https://docs.google.com/document/d/1kHaghp-68ooK-NwxozYkScGZThYJVrdOGWf4_K8Wo6s/edit) we highlighted "Building in access control" as a key principle for planning our building our new cloud platform. -Making this work for the new cloud platform means implementing ways that our users — mainly developers — can access the various bits of the new infrastructure. This is likely to include access to Kubernetes (CLI and API), AWS (things like S3, RDS), GitHub, and any tooling we put on top of Kubernetes that users will access as part of running their apps (e.g. ELK, [Prometheus](https://github.com/ministryofjustice/cloud-platform/blob/master/architecture-decision-record/005-Use-Promethus-For-Monitoring.md), [Concourse](https://github.com/ministryofjustice/cloud-platform/blob/master/architecture-decision-record/003-Use-Concourse-CI.md)). 
+Making this work for the new cloud platform means implementing ways that our users — mainly developers — can access the various bits of the new infrastructure. This is likely to include access to Kubernetes (CLI and API), AWS (things like S3, RDS), GitHub, and any tooling we put on top of Kubernetes that users will access as part of running their apps (e.g. ELK, [Prometheus](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/026-Managed-Prometheus.md#choice-of-prometheus), [Concourse](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/003-Use-Concourse-CI.md)). At the current time there is no consistent access policy for tooling. We use a mixture of the Google domain, GitHub and AWS accounts to access and manage the various parts of our infrastructure. This makes it hard for users to make sure that they have the correct permissions to do what they need to do, resulting in lots of requests for permissions. It also makes it harder to manage the user lifecycle (adding, removing, updating user permissions) and to track exactly who has access to what. @@ -18,14 +18,12 @@ We are proposing that we aim for a "single sign on" approach where users can use The current most complete source of this information for people who will be the first users of the cloud platform is GitHub. So our proposal is to use GitHub as our initial user directory - authentication for the new services that we are building will be through GitHub. - ## Decision We will use GitHub as the identify provider for the cloud platform. We will design and build the new cloud platform with the assumption that users will login to all components using a single GitHub id. - ## Consequences We will define users and groups in GitHub and use GitHub's integration tools to provide access to other tools that require authentication. diff --git a/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md b/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md index d9ceeb77..19bc6721 100644 --- a/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md +++ b/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md @@ -8,7 +8,7 @@ Date: 01/06/2019 **June 2020 Update** The CP team is now in the habit of spinning up a [test cluster](https://github.com/ministryofjustice/cloud-platform-infrastructure/pull/742) to develop and test each change to the platform, before it is deployed to the main cluster (live). So although the main cluster still has dev/staging namespaces for service teams, this work is confined to their namespaces, and there's little concern that they would disrupt other namespaces. These user dev/staging namespaces could simply be seen as benefiting from the high service level offered for the cluster, due to it hosting the production namespaces. 
-**May 2021 Update** We're looking to move on from this ADR decision, and have different clusters for non-prod namespaces - see [021-Multi-cluster](021-Multi-cluster.html) +**May 2021 Update** We're looking to move on from this ADR decision, and have different clusters for non-prod namespaces - see [021-Multi-cluster](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/021-Multi-cluster.md) ## Context @@ -22,10 +22,10 @@ After consideration of the pros and cons of each approach we went with one clust Some important reasons behind this move were: -* A single k8s cluster can be made powerful enough to run all of our workloads -* Managing a single cluster keeps our operational overhead and costs to a minimum. -* Namespaces and RBAC keep different workloads isolated from each other. -* It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments +- A single k8s cluster can be made powerful enough to run all of our workloads +- Managing a single cluster keeps our operational overhead and costs to a minimum. +- Namespaces and RBAC keep different workloads isolated from each other. +- It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments To clarify the last point; to be useful, a development cluster must be as similar as possible to the production cluster. However, given multiple clusters, with different security and other constraints, some 'drift' is inevitable - e.g. the development cluster might be upgraded to a newer kubernetes version before staging and production, or it could have different connectivity into private networks, or different performance constraints from the production cluster. @@ -39,6 +39,6 @@ If namespace segregation is not sufficient for this, then the whole cloud platfo Having a single cluster to maintain works well for us. -* Service teams know that their development environments accurately reflect the production environments they will eventually create -* There is no duplication of effort, maintaining multiple, slightly different clusters -* All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies) +- Service teams know that their development environments accurately reflect the production environments they will eventually create +- There is no duplication of effort, maintaining multiple, slightly different clusters +- All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies) diff --git a/architecture-decision-record/021-Multi-cluster.md b/architecture-decision-record/021-Multi-cluster.md index 23abf688..b7b46621 100644 --- a/architecture-decision-record/021-Multi-cluster.md +++ b/architecture-decision-record/021-Multi-cluster.md @@ -8,27 +8,27 @@ Date: 2021-05-11 ## What’s proposed -We host user apps across *more than one* Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster *may* be further isolated by placing them in separate VPCs or separate AWS accounts. +We host user apps across _more than one_ Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster _may_ be further isolated by placing them in separate VPCs or separate AWS accounts. 
## Context -Service teams' apps currently run on [one Kubernetes cluster](012-One-cluster-for-dev-staging-prod.html). That includes their dev/staging/prod environments - they are not split off. The key reasoning was: +Service teams' apps currently run on [one Kubernetes cluster](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md). That includes their dev/staging/prod environments - they are not split off. The key reasoning was: -* Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments -* Maintaining clusters for each environment is a cost in effort -* You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod. +- Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments +- Maintaining clusters for each environment is a cost in effort +- You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod. (We also have clusters for other purposes: a 'management' cluster for Cloud Platform team's CI/CD and ephemeral 'test' clusters for the Cloud Platform team to test changes to the cluster.) However we have seen some problems with using one cluster, and advantages to moving to multi-cluster: -* Scaling limits -* Single point of failure -* Derisk upgrading of k8s -* Reduce blast radius for security -* Reduce blast radius of accidental deletion -* Pre-prod cluster -* Cattle not pets +- Scaling limits +- Single point of failure +- Derisk upgrading of k8s +- Reduce blast radius for security +- Reduce blast radius of accidental deletion +- Pre-prod cluster +- Cattle not pets ### Scaling limits @@ -40,11 +40,11 @@ Running everything on a single cluster is a 'single point of failure', which is Several elements in the cluster are a single point of failure: -* ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls)) -* external-dns -* cert manager -* kiam -* OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58)) +- ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls)) +- external-dns +- cert manager +- kiam +- OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58)) ### Derisk upgrading of k8s @@ -76,8 +76,8 @@ Multi-cluster will allow us to put pre-prod environments on a separate cluster t If we were to create a fresh cluster, and an app is moved onto it, then there are a lot of impacts: -* **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. 
This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl. -* **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that accessing our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated. +- **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl. +- **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that accessing our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated. ## Steps to achieve it diff --git a/architecture-decision-record/023-Logging.md b/architecture-decision-record/023-Logging.md index e56ad61e..1eb16149 100644 --- a/architecture-decision-record/023-Logging.md +++ b/architecture-decision-record/023-Logging.md @@ -12,12 +12,12 @@ Cloud Platform's existing strategy for logs has been to **centralize** them in a Concerns with existing ElasticSearch logging: -* ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes) -* CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs` -* Service teams have access to other teams' logs, which is a concern should personal information be inadvertantly logged -* Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES +- ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes) +- CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs` +- Service teams have access to other teams' logs, which is a concern should personal information be inadvertantly logged +- Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES -With these concerns in mind, and the [migration to EKS](022-EKS.html) meaning we'd need to reimplement log shipping, we reevaluate this strategy. +With these concerns in mind, and the [migration to EKS](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md) meaning we'd need to reimplement log shipping, we reevaluate this strategy. 
## User needs @@ -37,11 +37,11 @@ Rather than centralized logging in ES, we'll evaluate different logging solution **AWS services for logging** - with the cluster now in EKS, it wouldn't be too much of a leap to centralizing logs in CloudWatch and make use of the AWS managed tools. One one hand it's proprietary to AWS, so adds cost of switching away. But it might be preferable to the cost of running ES, and related tools like GuardDuty and Security Hub, with use across Modernization Platform, is attractive. -### Observing apps** +### Observing apps\*\* -* Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it *requires* heavy use of memory.) Could setup an instance per team. Need to evaluate how we'd integrate it, and usability. -* CloudWatch Logs - possible and low operational overhead - needs further evaluation. -* Sentry - Some teams have beeing using Sentry for logs, but [Sentry says themself it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging. +- Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it _requires_ heavy use of memory.) Could setup an instance per team. Need to evaluate how we'd integrate it, and usability. +- CloudWatch Logs - possible and low operational overhead - needs further evaluation. +- Sentry - Some teams have beeing using Sentry for logs, but [Sentry says themself it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging. ### Observing the platform @@ -53,9 +53,9 @@ TBD ### Security -* MLAP was designed for this, but it is stalled, so probably best to manage it ourselves. -* ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period. -* AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous. +- MLAP was designed for this, but it is stalled, so probably best to manage it ourselves. +- ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period. +- AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous. ## Next steps diff --git a/architecture-decision-record/034-EKS-Fargate.md b/architecture-decision-record/034-EKS-Fargate.md index 2a2060a3..21f306d6 100644 --- a/architecture-decision-record/034-EKS-Fargate.md +++ b/architecture-decision-record/034-EKS-Fargate.md @@ -14,8 +14,8 @@ Move from EKS managed nodes to EKS Fargate. This is really attractive because: -* to reduce our operational overhead -* improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container). +- to reduce our operational overhead +- improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container). 
However there’s plenty of things we’d need to tackle, to achieve this (copied from [ADR022 EKS - Fargate considerations](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md#future-fargate-considerations)): @@ -23,13 +23,13 @@ However there’s plenty of things we’d need to tackle, to achieve this (copie **Daemonset functionality** - needs replacement: -* fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch. -* prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network +- fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch. +- prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network **No EBS support** - Prometheus will run still in a managed node group. Likely other workloads too to consider. **How people check the status of their deployments** - to be investigated -**Ingress can't be nginx? - just the load balancer in front** - to be investigated. Would be fine with [ADR032 Managed ingress](032-Managed-ingress) +**Ingress can't be nginx? - just the load balancer in front** - to be investigated. Would be fine with AWS Managed Ingress If we don't use Fargate then we should take advantage of Spot instances for reduced costs. However Fargate is the priority, because the main driver here is engineer time, not EC2 cost. diff --git a/runbooks/source/add-new-receiver-alert-manager.html.md.erb b/runbooks/source/add-new-receiver-alert-manager.html.md.erb index cfc74af6..2707e713 100644 --- a/runbooks/source/add-new-receiver-alert-manager.html.md.erb +++ b/runbooks/source/add-new-receiver-alert-manager.html.md.erb @@ -1,7 +1,7 @@ --- title: Add a new Alertmanager receiver and a slack webhook weight: 85 -last_reviewed_on: 2023-11-20 +last_reviewed_on: 2024-01-12 review_in: 6 months --- @@ -22,7 +22,7 @@ You must have the below details from the development team. ## Creating a new receiver set -1. Fill in the template with the details provided from development team and add the array to [`terraform.tfvars`](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/cloud-platform-components/terraform.tfvars) file. +1. Fill in the template with the details provided from development team and add the array to [`terraform.tfvars`](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfvars) file. 
The `terraform.tfvars` file is encrypted so you have to `git-crypt unlock` to view the contents of the file. Check [git-crypt documentation in user guide](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/other-topics/git-crypt-setup.html#git-crypt) for more information on how to setup git-crypt. diff --git a/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb b/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb index 3fdbe1b4..5a21d28d 100644 --- a/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb +++ b/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb @@ -21,7 +21,7 @@ This can address the problem of CPU high usage/load ### Cluster configuration: -#### [cluster.tf](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/cloud-platform-eks/cluster.tf) +#### [cluster.tf](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf) Use diff --git a/runbooks/source/disaster-recovery-scenarios.html.md.erb b/runbooks/source/disaster-recovery-scenarios.html.md.erb index 2f2e791b..c17895ce 100644 --- a/runbooks/source/disaster-recovery-scenarios.html.md.erb +++ b/runbooks/source/disaster-recovery-scenarios.html.md.erb @@ -253,9 +253,7 @@ Plan: 7 to add, 0 to change, 0 to destroy. In this scenario, terraform state can be restored from the remote_state stored in the terraform backend S3 bucket. -For example [eks-components](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components) state is stored in "aws-accounts/cloud-platform-aws/vpc/eks/components" s3 bucket as defined [here-eks](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/main.tf/#L5-L14). - -or [kops-components](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/kops/components) state is stored in "aws-accounts/cloud-platform-aws/vpc/kops/components" s3 bucket as defined [here-kops](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/kops/components/main.tf/#L1-L11). +For example [eks/components](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components) state is stored in "aws-accounts/cloud-platform-aws/vpc/eks/components" s3 bucket as defined [here-eks](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/main.tf/#L5-L14). Access the S3 bucket where the effected terraform state is stored. From the list of terraform.tfstate file versions, identify the file before the state got removed and download as terraform.tfstate. Upload the file again, this will set uploaded file as latest version. diff --git a/runbooks/source/eks-cluster.html.md.erb b/runbooks/source/eks-cluster.html.md.erb index 5407b40e..e1f79a8a 100644 --- a/runbooks/source/eks-cluster.html.md.erb +++ b/runbooks/source/eks-cluster.html.md.erb @@ -32,7 +32,7 @@ Alternatively, using the `create-cluster` script. See the file [example.env.create-cluster](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/example.env.create-cluster) in the infrastructure repo. 
This shows examples of the environment variables which must be set in order to run the `create-cluster.rb` script to create a new cluster. -You can get the auth0 values from the `terraform-provider-auth0` application on [auth0](https://manage.auth0.com/dashboard/eu/justice-cloud-platform/applications). +You can get the auth0 values from the `terraform-provider-auth0` application on [justice-cloud-platform - auth0](https://auth0.com/docs/authenticate/login). or you can export these env vars in your shell: diff --git a/runbooks/source/how-we-work.html.md.erb b/runbooks/source/how-we-work.html.md.erb index ef4f825a..d1a24254 100644 --- a/runbooks/source/how-we-work.html.md.erb +++ b/runbooks/source/how-we-work.html.md.erb @@ -116,7 +116,8 @@ Instead, when not answering queries and reviewing PRs, the Hammer should work on Most of our user-facing documentation is in the [user guide], and documentation for the team is in the [runbooks] site. -There are also a lot of important `README.md` files like [this one](https://github.com/ministryofjustice/cloud-platform#ministry-of-justice-cloud-platform-master-repo), especially for our terraform modules. We also have code samples like [this](https://github.com/ministryofjustice/cloud-platform-terraform-rds-instance/blob/main/example/rds-postgresql.tf) for each of our terraform modules. +There are also a lot of important `README.md` files like [this one](https://github.com/ministryofjustice/cloud-platform#ministry-of-justice-cloud-platform-master-repo), especially for our terraform modules. +We also have code samples like [this](https://github.com/ministryofjustice/cloud-platform-terraform-rds-instance/blob/main/examples/rds-postgresql.tf) for each of our terraform modules. It is important to keep all of this up to date as the underlying code changes, so please remember to factor this in when estimating and working on tickets. diff --git a/runbooks/source/joiners-guide.html.md.erb b/runbooks/source/joiners-guide.html.md.erb index e7645b3a..520afdb4 100644 --- a/runbooks/source/joiners-guide.html.md.erb +++ b/runbooks/source/joiners-guide.html.md.erb @@ -64,7 +64,7 @@ git config --global init.templateDir ~/.git-templates/git-secrets * Github access - (to be done by new starter) * Explain github for RBAC -* Invite as an admin on [Auth0](https://manage.auth0.com/dashboard/eu/justice-cloud-platform/users) +* Invite as an admin on [justice-cloud-platform - Auth0](https://auth0.com/docs/authenticate/login) * (Switch to 'justice-cloud-platform tenant, then use the drop-down and select "Invite an admin") * New starter github user to be added to MOJ github organisation and WebOps team * Add to [MoJ 1Password](https://ministryofjustice.1password.eu/) diff --git a/runbooks/source/leavers-guide.html.md.erb b/runbooks/source/leavers-guide.html.md.erb index 3af1165b..d5afba5f 100644 --- a/runbooks/source/leavers-guide.html.md.erb +++ b/runbooks/source/leavers-guide.html.md.erb @@ -1,8 +1,7 @@ --- title: Leavers Guide weight: 9100 -last_reviewed_on: 2023-12-13 -review_in: 3 months +last_reviewed_on: 2024-01-12 --- # Leavers Guide @@ -32,7 +31,7 @@ When CP team members leave, follow this guide, and log completion in a ticket. #### Slack account deactivation - Cloud Platform maintain a list of webhooks for [Alertmanager Notifications](https://api.slack.com/apps/ABFSJLD8W/incoming-webhooks). When the slack account is deactivated, + Cloud Platform maintain a list of webhooks for Alertmanager Notifications - Incoming Webhooks. 
When the slack account is deactivated, these webhooks will still be active. Hence, no action is needed. Some apps that member installed which require member-specific permissions may be atomatically deactivated. @@ -64,9 +63,9 @@ Below are the list of 3rd party accounts that need to be removed when a member l 1. Request Password Management removal - [1Password](https://1password.com/) -2. [Auth0 justice-cloud-platform](https://manage.auth0.com/dashboard/eu/justice-cloud-platform/users) +2. [Auth0 justice-cloud-platform](https://auth0.com/docs/authenticate/login) -3. [Auth0 moj-cloud-platforms](https://manage.auth0.com/dashboard/eu/moj-cloud-platforms-dev/users) +3. [Auth0 moj-cloud-platforms](https://auth0.com/docs/authenticate/login) 4. [Pagerduty](https://moj-digital-tools.pagerduty.com/users) diff --git a/runbooks/source/on-call.html.md.erb b/runbooks/source/on-call.html.md.erb index 7611f201..803f842d 100644 --- a/runbooks/source/on-call.html.md.erb +++ b/runbooks/source/on-call.html.md.erb @@ -31,7 +31,7 @@ Cloud Platform team members provide support out of hours, as detailed in [Cloud 2. Get production access to supported services. 3. Get access to our on-call tools: * [Pingdom](https://my.pingdom.com/) - * [Pagerduty](https://moj-digital-tools.pagerduty.com/) (and configure your contact details and notifications, this is the single source of truth for when you are on call.) + * [Pagerduty- moj-digital-tools.pagerduty.com](https://identity.pagerduty.com/global/authn/authentication/PagerDutyGlobalLogin/subdomain) (and configure your contact details and notifications, this is the single source of truth for when you are on call.) * [AWS](https://mojdsd.signin.aws.amazon.com/) * the MOJDS VPN (and configure it to “send all traffic over VPN connection”) 4. Do a dry-run of an incident. diff --git a/runbooks/source/tips-and-tricks.html.md.erb b/runbooks/source/tips-and-tricks.html.md.erb index 0f1e346f..bab5944f 100644 --- a/runbooks/source/tips-and-tricks.html.md.erb +++ b/runbooks/source/tips-and-tricks.html.md.erb @@ -1,7 +1,7 @@ --- title: Tips and Tricks weight: 9200 -last_reviewed_on: 2024-05-21 +last_reviewed_on: 2024-01-12 review_in: 6 months --- @@ -84,9 +84,6 @@ Paste this into the search field on [Prometheus]: ``` max by(node) (max by(instance) (kubelet_running_pod_count{job="kubelet",metrics_path="/metrics"}) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) ``` -## Output all records from Route53 as a CSV file - -Use [this script](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/bin/route53-to-csv.rb) ## Add more RSS feeds to `#cloud-platform-rss` channel diff --git a/runbooks/source/upgrade-terraform-version.html.md.erb b/runbooks/source/upgrade-terraform-version.html.md.erb index 977bb4f5..03c15c20 100644 --- a/runbooks/source/upgrade-terraform-version.html.md.erb +++ b/runbooks/source/upgrade-terraform-version.html.md.erb @@ -118,7 +118,9 @@ When all namespaces in the cloud-platform-environments repository are using the - [Remove](https://github.com/ministryofjustice/cloud-platform-environments/commit/b11b0372fe71289e51739395664355014df0e655) the conditional logic in the apply library. 
### Infrastructure state files -The Infrastructure state we have in the Cloud Platform is structured in a tree related to its dependency, so for example, the [components](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/cloud-platform-components) state (in the output below) relies heavily on the directory above and so on. Here is a snapshot of how our directory looks but this is likely to change: +The Infrastructure state we have in the Cloud Platform is structured in a tree related to its dependency, +so for example, the [components](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components) state (in the output below) relies heavily on the directory above and so on. +Here is a snapshot of how our directory looks but this is likely to change: ``` aws-accounts @@ -128,7 +130,6 @@ aws-accounts │ ├── eks # Holding EKS, workspaces for individual clusters. │ │ └── components # EKS components. Workspaces for individual clusters │ └── kops # Holding KOPS, workspaces for individual clusters. -│ └── components # KOPS components. Workspaces for individual clusters ├── cloud-platform-dsd │ └── main.tf ├── cloud-platform-ephemeral-test @@ -136,8 +137,6 @@ aws-accounts │ └── vpc │ ├── eks │ │ └── components -│ └── kops -│ └── components └── README.md ```
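
Two of the procedures touched by this patch are easier to follow with a concrete command line. First, the link check run by the upgraded `lycheeverse/lychee-action@v1.9.1` workflow can be reproduced locally before pushing. This is a minimal sketch, assuming `lychee` is installed (for example via `brew install lychee` or `cargo install lychee`); the flags mirror the `args` in `.github/workflows/link-checker.yml`, and a `.lycheeignore` file in the repository root is picked up automatically, which is why the old `--exclude-file .ignore-links` argument is no longer passed.

```
# Run from the repository root; lychee reads .lycheeignore automatically.
lychee --verbose --no-progress \
  --accept 200,429,403,400,301,302,401 \
  --exclude-mail \
  '**/*.md' '**/*.html' '**/*.erb'
```

Second, the disaster-recovery scenario above restores a Terraform state file by promoting an older S3 object version ("identify the file before the state got removed ... upload the file again"). A hedged sketch of that flow with the AWS CLI follows; the bucket name and object key are placeholders, not the real backend values (check the `backend "s3"` block in the relevant `main.tf`, and note that Terraform workspaces add an `env:/<workspace>/` prefix to the key).

```
# List the versions of the state object and note the version-id to restore.
aws s3api list-object-versions \
  --bucket <terraform-state-bucket> \
  --prefix <path-to-components>/terraform.tfstate

# Download the chosen version to a local file...
aws s3api get-object \
  --bucket <terraform-state-bucket> \
  --key <path-to-components>/terraform.tfstate \
  --version-id <version-id> \
  terraform.tfstate

# ...and re-upload it, which makes it the latest version on a versioned bucket.
aws s3 cp terraform.tfstate \
  s3://<terraform-state-bucket>/<path-to-components>/terraform.tfstate
```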