From 2b85196f9f6c6568bd79d28f7e62f2be550bc07e Mon Sep 17 00:00:00 2001
From: Tim Cheung <152907271+timckt@users.noreply.github.com>
Date: Tue, 16 Jan 2024 15:38:16 +0000
Subject: [PATCH 1/3] update concourse runbook to remind teammates not to deploy the bootstrap pipeline in their test cluster

---
runbooks/source/add-concourse-to-cluster.html.md.erb | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/runbooks/source/add-concourse-to-cluster.html.md.erb b/runbooks/source/add-concourse-to-cluster.html.md.erb
index bd70381b..f22a5c27 100644
--- a/runbooks/source/add-concourse-to-cluster.html.md.erb
+++ b/runbooks/source/add-concourse-to-cluster.html.md.erb
@@ -50,7 +50,7 @@ terraform workspace select
terraform plan
terraform apply -target=module.concourse
```

-- Clone the concourse [repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse).
+- Clone the Concourse [repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse).

- Login to Concourse

@@ -77,6 +77,12 @@ Follow the URL this command outputs, choose to login with Username/Password, and

- Apply your pipeline

+Please do not deploy the bootstrap pipeline in the [Concourse repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse/tree/main/pipelines/manager/main) in your test cluster. It is
+intended for production-level deployments and may trigger false alarms in our Slack channel.
+
+To ensure an isolated testing environment, please create a new folder on your local machine and start with a simple pipeline. You may use [this link](https://concourse-ci.org/tutorial-hello-world.html) as a reference
+to deploy your first pipeline in your test cluster.
+
```
fly --target david-test1 set-pipeline \
--pipeline plan-pipeline \

From ed48ac83d6a566e42eec0fe0853f205eecfa23ad Mon Sep 17 00:00:00 2001
From: Tim Cheung <152907271+timckt@users.noreply.github.com>
Date: Tue, 16 Jan 2024 15:57:11 +0000
Subject: [PATCH 2/3] update concourse runbook to remind teammates not to deploy the bootstrap pipeline in their test cluster

---
runbooks/source/add-concourse-to-cluster.html.md.erb | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/runbooks/source/add-concourse-to-cluster.html.md.erb b/runbooks/source/add-concourse-to-cluster.html.md.erb
index f22a5c27..dc0cce85 100644
--- a/runbooks/source/add-concourse-to-cluster.html.md.erb
+++ b/runbooks/source/add-concourse-to-cluster.html.md.erb
@@ -77,11 +77,11 @@ Follow the URL this command outputs, choose to login with Username/Password, and

- Apply your pipeline

-Please do not deploy the bootstrap pipeline in the [Concourse repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse/tree/main/pipelines/manager/main) in your test cluster. It is
+Please do not deploy the bootstrap pipeline in the [Concourse repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse/tree/main/pipelines/manager/main) into your test cluster. It is
intended for production-level deployments and may trigger false alarms in our Slack channel.

To ensure an isolated testing environment, please create a new folder on your local machine and start with a simple pipeline. You may use [this link](https://concourse-ci.org/tutorial-hello-world.html) as a reference
-to deploy your first pipeline in your test cluster.
+to deploy your first pipeline into your test cluster, not the one under `manager/main`.
```
fly --target david-test1 set-pipeline \

From afd89e6ee12218c90092b36946daf31a9b527bff Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Tue, 16 Jan 2024 16:25:46 +0000
Subject: [PATCH 3/3] Commit changes made by code formatters

---
.../006-Use-github-as-user-directory.md | 2 --
.../012-One-cluster-for-dev-staging-prod.md | 14 ++++----
.../021-Multi-cluster.md | 36 +++++++++----------
architecture-decision-record/023-Logging.md | 22 ++++++------
.../034-EKS-Fargate.md | 8 ++---
.../add-concourse-to-cluster.html.md.erb | 6 ++--
6 files changed, 43 insertions(+), 45 deletions(-)

diff --git a/architecture-decision-record/006-Use-github-as-user-directory.md b/architecture-decision-record/006-Use-github-as-user-directory.md
index 2f6ae99d..894b09d7 100644
--- a/architecture-decision-record/006-Use-github-as-user-directory.md
+++ b/architecture-decision-record/006-Use-github-as-user-directory.md
@@ -18,14 +18,12 @@ We are proposing that we aim for a "single sign on" approach where users can use

The current most complete source of this information for people who will be the first users of the cloud platform is GitHub. So our proposal is to use GitHub as our initial user directory - authentication for the new services that we are building will be through GitHub.
-
## Decision

We will use GitHub as the identity provider for the cloud platform.

We will design and build the new cloud platform with the assumption that users will log in to all components using a single GitHub id.
-
## Consequences

We will define users and groups in GitHub and use GitHub's integration tools to provide access to other tools that require authentication.

diff --git a/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md b/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md
index d9ceeb77..2060e804 100644
--- a/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md
+++ b/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md
@@ -22,10 +22,10 @@ After consideration of the pros and cons of each approach we went with one clust

Some important reasons behind this move were:

-* A single k8s cluster can be made powerful enough to run all of our workloads
-* Managing a single cluster keeps our operational overhead and costs to a minimum.
-* Namespaces and RBAC keep different workloads isolated from each other.
-* It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments
+- A single k8s cluster can be made powerful enough to run all of our workloads
+- Managing a single cluster keeps our operational overhead and costs to a minimum.
+- Namespaces and RBAC keep different workloads isolated from each other.
+- It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments

To clarify the last point; to be useful, a development cluster must be as similar as possible to the production cluster. However, given multiple clusters, with different security and other constraints, some 'drift' is inevitable - e.g. the development cluster might be upgraded to a newer kubernetes version before staging and production, or it could have different connectivity into private networks, or different performance constraints from the production cluster.

@@ -39,6 +39,6 @@ If namespace segregation is not sufficient for this, then the whole cloud platfo

Having a single cluster to maintain works well for us.
-* Service teams know that their development environments accurately reflect the production environments they will eventually create
-* There is no duplication of effort, maintaining multiple, slightly different clusters
-* All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)
+- Service teams know that their development environments accurately reflect the production environments they will eventually create
+- There is no duplication of effort, maintaining multiple, slightly different clusters
+- All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)

diff --git a/architecture-decision-record/021-Multi-cluster.md b/architecture-decision-record/021-Multi-cluster.md
index 23abf688..4754b784 100644
--- a/architecture-decision-record/021-Multi-cluster.md
+++ b/architecture-decision-record/021-Multi-cluster.md
@@ -8,27 +8,27 @@ Date: 2021-05-11

## What’s proposed

-We host user apps across *more than one* Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster *may* be further isolated by placing them in separate VPCs or separate AWS accounts.
+We host user apps across _more than one_ Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster _may_ be further isolated by placing them in separate VPCs or separate AWS accounts.

## Context

Service teams' apps currently run on [one Kubernetes cluster](012-One-cluster-for-dev-staging-prod.html). That includes their dev/staging/prod environments - they are not split off. The key reasoning was:

-* Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
-* Maintaining clusters for each environment is a cost in effort
-* You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.
+- Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
+- Maintaining clusters for each environment is a cost in effort
+- You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.

(We also have clusters for other purposes: a 'management' cluster for Cloud Platform team's CI/CD and ephemeral 'test' clusters for the Cloud Platform team to test changes to the cluster.)
However we have seen some problems with using one cluster, and advantages to moving to multi-cluster:

-* Scaling limits
-* Single point of failure
-* Derisk upgrading of k8s
-* Reduce blast radius for security
-* Reduce blast radius of accidental deletion
-* Pre-prod cluster
-* Cattle not pets
+- Scaling limits
+- Single point of failure
+- Derisk upgrading of k8s
+- Reduce blast radius for security
+- Reduce blast radius of accidental deletion
+- Pre-prod cluster
+- Cattle not pets

### Scaling limits

@@ -40,11 +40,11 @@ Running everything on a single cluster is a 'single point of failure', which is

Several elements in the cluster are a single point of failure:

-* ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
-* external-dns
-* cert manager
-* kiam
-* OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))
+- ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
+- external-dns
+- cert manager
+- kiam
+- OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))

### Derisk upgrading of k8s

@@ -76,8 +76,8 @@ Multi-cluster will allow us to put pre-prod environments on a separate cluster t

If we were to create a fresh cluster, and an app is moved onto it, then there are a lot of impacts:

-* **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
-* **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that access our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.
+- **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
+- **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that access our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.
## Steps to achieve it

diff --git a/architecture-decision-record/023-Logging.md b/architecture-decision-record/023-Logging.md
index e56ad61e..ebb67245 100644
--- a/architecture-decision-record/023-Logging.md
+++ b/architecture-decision-record/023-Logging.md
@@ -12,10 +12,10 @@ Cloud Platform's existing strategy for logs has been to **centralize** them in a

Concerns with existing ElasticSearch logging:

-* ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
-* CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
-* Service teams have access to other teams' logs, which is a concern should personal information be inadvertently logged
-* Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES
+- ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
+- CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
+- Service teams have access to other teams' logs, which is a concern should personal information be inadvertently logged
+- Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES

With these concerns in mind, and the [migration to EKS](022-EKS.html) meaning we'd need to reimplement log shipping, we reevaluate this strategy.

@@ -37,11 +37,11 @@ Rather than centralized logging in ES, we'll evaluate different logging solution

**AWS services for logging** - with the cluster now in EKS, it wouldn't be too much of a leap to centralizing logs in CloudWatch and make use of the AWS managed tools. On one hand it's proprietary to AWS, so adds cost of switching away. But it might be preferable to the cost of running ES, and related tools like GuardDuty and Security Hub, with use across Modernization Platform, is attractive.

-### Observing apps**
+### Observing apps

-* Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it *requires* heavy use of memory.) Could set up an instance per team. Need to evaluate how we'd integrate it, and usability.
-* CloudWatch Logs - possible and low operational overhead - needs further evaluation.
-* Sentry - Some teams have been using Sentry for logs, but [Sentry itself says it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.
+- Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it _requires_ heavy use of memory.) Could set up an instance per team. Need to evaluate how we'd integrate it, and usability.
+- CloudWatch Logs - possible and low operational overhead - needs further evaluation.
+- Sentry - Some teams have been using Sentry for logs, but [Sentry itself says it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.

### Observing the platform

@@ -53,9 +53,9 @@ TBD

### Security

-* MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
-* ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
-* AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.
+- MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
+- ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
+- AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.

## Next steps

diff --git a/architecture-decision-record/034-EKS-Fargate.md b/architecture-decision-record/034-EKS-Fargate.md
index 2a2060a3..cfb4eac5 100644
--- a/architecture-decision-record/034-EKS-Fargate.md
+++ b/architecture-decision-record/034-EKS-Fargate.md
@@ -14,8 +14,8 @@ Move from EKS managed nodes to EKS Fargate.

This is really attractive because:

-* to reduce our operational overhead
-* improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).
+- to reduce our operational overhead
+- improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).

However there’s plenty of things we’d need to tackle, to achieve this (copied from [ADR022 EKS - Fargate considerations](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md#future-fargate-considerations)):

**Daemonset functionality** - needs replacement:

-* fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
-* prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network
+- fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
+- prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden.
However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network

**No EBS support** - Prometheus will still run in a managed node group. There are likely other workloads to consider too.

diff --git a/runbooks/source/add-concourse-to-cluster.html.md.erb b/runbooks/source/add-concourse-to-cluster.html.md.erb
index dc0cce85..278773a3 100644
--- a/runbooks/source/add-concourse-to-cluster.html.md.erb
+++ b/runbooks/source/add-concourse-to-cluster.html.md.erb
@@ -77,12 +77,12 @@ Follow the URL this command outputs, choose to login with Username/Password, and

- Apply your pipeline

-Please do not deploy the bootstrap pipeline in the [Concourse repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse/tree/main/pipelines/manager/main) into your test cluster. It is
+Please do not deploy the bootstrap pipeline in the [Concourse repository](https://github.com/ministryofjustice/cloud-platform-terraform-concourse/tree/main/pipelines/manager/main) into your test cluster. It is
intended for production-level deployments and may trigger false alarms in our Slack channel.

-To ensure an isolated testing environment, please create a new folder on your local machine and start with a simple pipeline. You may use [this link](https://concourse-ci.org/tutorial-hello-world.html) as a reference
+To ensure an isolated testing environment, please create a new folder on your local machine and start with a simple pipeline. You may use [this link](https://concourse-ci.org/tutorial-hello-world.html) as a reference
to deploy your first pipeline into your test cluster, not the one under `manager/main`.
-
+
```
fly --target david-test1 set-pipeline \
--pipeline plan-pipeline \
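
# A sketch of the kind of simple, self-contained pipeline the runbook text above recommends for a
# test cluster, based on the Concourse hello-world tutorial it links to. The fly target name
# (david-test1, reused from the example above), the folder and the file name are illustrative
# assumptions rather than fixed values.
mkdir -p ~/concourse-test && cd ~/concourse-test

# Write a minimal pipeline with one job that runs a single task and prints "Hello world!"
cat > hello-world.yml <<'EOF'
jobs:
  - name: hello-world-job
    plan:
      - task: hello-world-task
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: busybox
          run:
            path: echo
            args: ["Hello world!"]
EOF

# Set, unpause and trigger the pipeline against the test cluster's fly target
fly --target david-test1 set-pipeline --pipeline hello-world --config hello-world.yml
fly --target david-test1 unpause-pipeline --pipeline hello-world
fly --target david-test1 trigger-job --job hello-world/hello-world-job --watch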