Upgrade lychee link-checker (#5174)
* Upgrade lychee link-checker

* Fix links

* Commit changes made by code formatters

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 276f9bb commit 5934da5
Showing 17 changed files with 66 additions and 72 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/link-checker.yml
```diff
@@ -2,6 +2,8 @@ name: Links
 
 on:
   repository_dispatch:
+  pull_request:
+    types: [opened, synchronize, reopened]
   workflow_dispatch:
   schedule:
     - cron: "00 18 * * *"
@@ -13,9 +15,9 @@ jobs:
       - uses: actions/checkout@v2
 
       - name: Link Checker
-        uses: lycheeverse/lychee-action@v1.0.9
+        uses: lycheeverse/lychee-action@v1.9.1
         with:
-          args: --verbose --no-progress **/*.md **/*.html **/*.erb --exclude-file .ignore-links --accept 200,429,403,400,301,302 --exclude-mail
+          args: --verbose --no-progress **/*.md **/*.html **/*.erb --accept 200,429,403,400,301,302,401 --exclude-mail
         env:
           GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
```
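The dropped `--exclude-file .ignore-links` flag raises the question of where exclusions live after the upgrade. Recent lychee releases look for a `.lycheeignore` file in the working directory by default (worth verifying against the 1.9.x docs), and the CLI also accepts inline `--exclude` regex patterns. A hedged sketch of how an extra exclusion might be added back to the upgraded step — the `--exclude` pattern below is illustrative, not part of this commit:

```yaml
# Illustrative only: extends the upgraded step with an explicit exclusion.
# Exclusions can also live in a .lycheeignore file at the repository root.
- name: Link Checker
  uses: lycheeverse/lychee-action@v1.9.1
  with:
    args: >-
      --verbose --no-progress
      --accept 200,429,403,400,301,302,401
      --exclude-mail
      --exclude "^https://intranet\.example\.gov\.uk/"
      **/*.md **/*.html **/*.erb
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```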

File renamed without changes.
@@ -10,22 +10,20 @@ Date 09/04/2018

As part of our [planning principles](https://docs.google.com/document/d/1kHaghp-68ooK-NwxozYkScGZThYJVrdOGWf4_K8Wo6s/edit) we highlighted "Building in access control" as a key principle for planning and building our new cloud platform.

Making this work for the new cloud platform means implementing ways that our users &mdash; mainly developers &mdash; can access the various bits of the new infrastructure. This is likely to include access to Kubernetes (CLI and API), AWS (things like S3, RDS), GitHub, and any tooling we put on top of Kubernetes that users will access as part of running their apps (e.g. ELK, [Prometheus](https://github.com/ministryofjustice/cloud-platform/blob/master/architecture-decision-record/005-Use-Promethus-For-Monitoring.md), [Concourse](https://github.com/ministryofjustice/cloud-platform/blob/master/architecture-decision-record/003-Use-Concourse-CI.md)).
Making this work for the new cloud platform means implementing ways that our users &mdash; mainly developers &mdash; can access the various bits of the new infrastructure. This is likely to include access to Kubernetes (CLI and API), AWS (things like S3, RDS), GitHub, and any tooling we put on top of Kubernetes that users will access as part of running their apps (e.g. ELK, [Prometheus](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/026-Managed-Prometheus.md#choice-of-prometheus), [Concourse](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/003-Use-Concourse-CI.md)).

At the current time there is no consistent access policy for tooling. We use a mixture of the Google domain, GitHub and AWS accounts to access and manage the various parts of our infrastructure. This makes it hard for users to make sure that they have the correct permissions to do what they need to do, resulting in lots of requests for permissions. It also makes it harder to manage the user lifecycle (adding, removing, updating user permissions) and to track exactly who has access to what.

We are proposing that we aim for a "single sign on" approach where users can use a single logon to access different resources. For this we will need a directory where we can store users and their permissions, including what teams they belong to and what roles they have.

The current most complete source of this information for people who will be the first users of the cloud platform is GitHub. So our proposal is to use GitHub as our initial user directory - authentication for the new services that we are building will be through GitHub.


## Decision

We will use GitHub as the identity provider for the cloud platform.

We will design and build the new cloud platform with the assumption that users will log in to all components using a single GitHub ID.


## Consequences

We will define users and groups in GitHub and use GitHub's integration tools to provide access to other tools that require authentication.
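As a purely hypothetical illustration of what GitHub-backed single sign-on can look like in practice (this ADR does not name a specific broker), an OIDC provider such as Dex can sit between GitHub and the other tools, turning organisation and team membership into group claims:

```yaml
# Hypothetical Dex connector configuration: GitHub is the upstream identity
# provider, and org/team membership becomes OIDC group claims downstream.
connectors:
  - type: github
    id: github
    name: GitHub
    config:
      clientID: $GITHUB_CLIENT_ID            # OAuth app credentials
      clientSecret: $GITHUB_CLIENT_SECRET
      redirectURI: https://dex.example.com/callback   # placeholder host
      orgs:
        - name: ministryofjustice            # restrict sign-in to the org
          teams:
            - example-team                   # illustrative team name
```

Any OIDC-aware tool (the Kubernetes API server included) can then map those group claims to its own roles, so adding or removing someone from a GitHub team changes their access everywhere.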
@@ -8,7 +8,7 @@ Date: 01/06/2019

**June 2020 Update** The CP team is now in the habit of spinning up a [test cluster](https://github.com/ministryofjustice/cloud-platform-infrastructure/pull/742) to develop and test each change to the platform, before it is deployed to the main cluster (live). So although the main cluster still has dev/staging namespaces for service teams, this work is confined to their namespaces, and there's little concern that they would disrupt other namespaces. These user dev/staging namespaces could simply be seen as benefiting from the high service level offered for the cluster, due to it hosting the production namespaces.

**May 2021 Update** We're looking to move on from this ADR decision, and have different clusters for non-prod namespaces - see [021-Multi-cluster](021-Multi-cluster.html)
**May 2021 Update** We're looking to move on from this ADR decision, and have different clusters for non-prod namespaces - see [021-Multi-cluster](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/021-Multi-cluster.md)

## Context

@@ -22,10 +22,10 @@ After consideration of the pros and cons of each approach we went with one clust

Some important reasons behind this move were:

* A single k8s cluster can be made powerful enough to run all of our workloads
* Managing a single cluster keeps our operational overhead and costs to a minimum.
* Namespaces and RBAC keep different workloads isolated from each other.
* It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments
- A single k8s cluster can be made powerful enough to run all of our workloads
- Managing a single cluster keeps our operational overhead and costs to a minimum.
- Namespaces and RBAC keep different workloads isolated from each other.
- It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments

To clarify the last point: to be useful, a development cluster must be as similar as possible to the production cluster. However, given multiple clusters, with different security and other constraints, some 'drift' is inevitable - e.g. the development cluster might be upgraded to a newer kubernetes version before staging and production, or it could have different connectivity into private networks, or different performance constraints from the production cluster.
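To make the namespace/RBAC isolation point above more concrete, the kind of policy involved is a default-deny rule that stops traffic arriving from other teams' namespaces. A minimal sketch — names are illustrative, not taken from the platform:

```yaml
# Illustrative policy: pods in this namespace accept ingress only from
# pods in the same namespace, isolating the workload from other tenants.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only   # illustrative name
  namespace: team-a-dev             # illustrative namespace
spec:
  podSelector: {}                   # applies to every pod in the namespace
  ingress:
    - from:
        - podSelector: {}           # empty podSelector = any pod in this namespace
```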

@@ -39,6 +39,6 @@ If namespace segregation is not sufficient for this, then the whole cloud platfo

Having a single cluster to maintain works well for us.

* Service teams know that their development environments accurately reflect the production environments they will eventually create
* There is no duplication of effort, maintaining multiple, slightly different clusters
* All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)
- Service teams know that their development environments accurately reflect the production environments they will eventually create
- There is no duplication of effort, maintaining multiple, slightly different clusters
- All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)
38 changes: 19 additions & 19 deletions architecture-decision-record/021-Multi-cluster.md
@@ -8,27 +8,27 @@ Date: 2021-05-11

## What’s proposed

We host user apps across *more than one* Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster *may* be further isolated by placing them in separate VPCs or separate AWS accounts.
We host user apps across _more than one_ Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster _may_ be further isolated by placing them in separate VPCs or separate AWS accounts.

## Context

Service teams' apps currently run on [one Kubernetes cluster](012-One-cluster-for-dev-staging-prod.html). That includes their dev/staging/prod environments - they are not split off. The key reasoning was:
Service teams' apps currently run on [one Kubernetes cluster](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md). That includes their dev/staging/prod environments - they are not split off. The key reasoning was:

* Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
* Maintaining clusters for each environment is a cost in effort
* You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.
- Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
- Maintaining clusters for each environment is a cost in effort
- You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.

(We also have clusters for other purposes: a 'management' cluster for Cloud Platform team's CI/CD and ephemeral 'test' clusters for the Cloud Platform team to test changes to the cluster.)

However we have seen some problems with using one cluster, and advantages to moving to multi-cluster:

* Scaling limits
* Single point of failure
* Derisk upgrading of k8s
* Reduce blast radius for security
* Reduce blast radius of accidental deletion
* Pre-prod cluster
* Cattle not pets
- Scaling limits
- Single point of failure
- Derisk upgrading of k8s
- Reduce blast radius for security
- Reduce blast radius of accidental deletion
- Pre-prod cluster
- Cattle not pets

### Scaling limits

@@ -40,11 +40,11 @@ Running everything on a single cluster is a 'single point of failure', which is

Several elements in the cluster are a single point of failure:

* ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
* external-dns
* cert manager
* kiam
* OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))
- ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
- external-dns
- cert manager
- kiam
- OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))

### Derisk upgrading of k8s

@@ -76,8 +76,8 @@ Multi-cluster will allow us to put pre-prod environments on a separate cluster t

If we were to create a fresh cluster, and an app is moved onto it, then there are a lot of impacts:

* **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
* **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that access our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.
- **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
- **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that access our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.
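For readers who haven't hit the kubecfg problem before: both the cluster CA certificate and the user credential are embedded in each kubeconfig, so a rebuilt cluster invalidates both at once. A rough sketch of the fields involved (all values are placeholders):

```yaml
# Sketch only: the two kubeconfig values that change when a cluster is rebuilt.
apiVersion: v1
kind: Config
clusters:
  - name: live                                   # illustrative cluster name
    cluster:
      server: https://api.example-cluster.example.com
      certificate-authority-data: <regenerated-with-the-new-cluster>
users:
  - name: developer
    user:
      token: <must-be-re-issued-against-the-new-cluster>
```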

## Steps to achieve it

24 changes: 12 additions & 12 deletions architecture-decision-record/023-Logging.md
@@ -12,12 +12,12 @@ Cloud Platform's existing strategy for logs has been to **centralize** them in a

Concerns with existing ElasticSearch logging:

* ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
* CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
* Service teams have access to other teams' logs, which is a concern should personal information be inadvertently logged
* Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES
- ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
- CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
- Service teams have access to other teams' logs, which is a concern should personal information be inadvertently logged
- Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES

With these concerns in mind, and the [migration to EKS](022-EKS.html) meaning we'd need to reimplement log shipping, we reevaluate this strategy.
With these concerns in mind, and the [migration to EKS](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md) meaning we'd need to reimplement log shipping, we reevaluate this strategy.

## User needs

@@ -37,11 +37,11 @@ Rather than centralized logging in ES, we'll evaluate different logging solution

**AWS services for logging** - with the cluster now in EKS, it wouldn't be too much of a leap to centralize logs in CloudWatch and make use of the AWS managed tools. On one hand it's proprietary to AWS, so it adds a cost of switching away. But it might be preferable to the cost of running ES, and the integration with related tools like GuardDuty and Security Hub, which are in use across the Modernization Platform, is attractive.

### Observing apps**
### Observing apps\*\*

* Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it *requires* heavy use of memory.) Could set up an instance per team. Need to evaluate how we'd integrate it, and usability.
* CloudWatch Logs - possible and low operational overhead - needs further evaluation.
* Sentry - Some teams have been using Sentry for logs, but [Sentry itself says it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.
- Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it _requires_ heavy use of memory.) Could set up an instance per team. Need to evaluate how we'd integrate it, and usability.
- CloudWatch Logs - possible and low operational overhead - needs further evaluation.
- Sentry - Some teams have been using Sentry for logs, but [Sentry itself says it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.

### Observing the platform

@@ -53,9 +53,9 @@ TBD

### Security

* MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
* ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
* AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.
- MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
- ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
- AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.

## Next steps

10 changes: 5 additions & 5 deletions architecture-decision-record/034-EKS-Fargate.md
@@ -14,22 +14,22 @@ Move from EKS managed nodes to EKS Fargate.

This is really attractive because:

* to reduce our operational overhead
* improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).
- to reduce our operational overhead
- improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).

However there are plenty of things we’d need to tackle to achieve this (copied from [ADR022 EKS - Fargate considerations](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md#future-fargate-considerations)):

**Pod limits** - there is a quota limit of [500 Fargate pods per region per AWS Account](https://aws.amazon.com/about-aws/whats-new/2020/09/aws-fargate-increases-default-resource-count-service-quotas/) which could be an issue, considering we currently run ~2000 pods. We can request AWS raise the limit - not currently sure what scope there is. With Multi-cluster stage 5, the separation of loads into different AWS accounts will settle this issue.

**Daemonset functionality** - needs replacement:

* fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
* prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network
- fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
- prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network
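Expanding on the Fluent Bit point above: on Fargate there are no nodes to run a logging daemonset on, so EKS instead reads Fluent Bit output configuration from a ConfigMap in the reserved `aws-observability` namespace. A hedged sketch of what shipping to ElasticSearch might look like under that mechanism — the domain, index and region are placeholders:

```yaml
# Sketch of the EKS-on-Fargate log router config; values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name       es
        Match      *
        Host       search-example-logs.eu-west-2.es.amazonaws.com
        Port       443
        Index      fargate-app-logs
        AWS_Auth   On
        AWS_Region eu-west-2
        TLS        On
```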

**No EBS support** - Prometheus will still run in a managed node group. Likely other workloads to consider too.

**How people check the status of their deployments** - to be investigated

**Ingress can't be nginx? - just the load balancer in front** - to be investigated. Would be fine with [ADR032 Managed ingress](032-Managed-ingress)
**Ingress can't be nginx? - just the load balancer in front** - to be investigated. Would be fine with AWS Managed Ingress

If we don't use Fargate then we should take advantage of Spot instances for reduced costs. However Fargate is the priority, because the main driver here is engineer time, not EC2 cost.
