-
Notifications
You must be signed in to change notification settings - Fork 4
Postgres databases
We run a Postgres 15 database using Amazon RDS.
The RDS instance itself is configured in our Terraform code.
For Postgres users and extensions, these are configured using our DB setup script. This script is idempotent and is run by Concourse every time we deploy our Terraform code.
We recommend using pgcli as your database client. Run these commands in your terminal to install it:
brew tap dbcli/tap
brew install pgcli
To install the db-connect script, clone the notifications-aws repo and follow the install instructions.
You can then connect to the database with an appropriate role:
Developers should use this role by default. You'll only be able to read data, not write it. The postgres readonly
user is automatically assumed by using the readonly
AWS role.
gds aws notify-preview -- db-connect.sh notifydb -- pgcli
Developers should only use this role if you need to write data. When doing so, you should get any queries you run reviewed by another developer. The postgres write
user is automatically assumed by using the -admin
AWS role.
gds aws notify-preview-admin -- db-connect.sh notifydb -- pgcli
Developers should only use this role in emergencies, when no other role has sufficient permissions. An example would be for running a vacuum
statement.
gds aws notify-preview-admin -- db-connect.sh --role admin notifydb -- pgcli
We have 5 different Postgres users.
-
readonly
- Can only read data, can't write it. This user should only be used by developers running manual queries -
write
- Can read and write. This user should only be used by developers running manual queries. -
app-username-here
- This is used by all our apps (including our db-migration app which runs our schema changes) and can read and write data. We may decided to split up our apps postgres users in the future to give them more granular permissions. Developers shouldn't need to use this user. -
notify_schema_owner
- You don't log in as this user, but when you log in as thewrite
orapp-username-here
user you will automatically assume this role. This means that any schema changes done by thewrite
orapp-username-here
user will all be owned by a single owner -notify_schema_owner
. This avoids any problems with schema changes being owned by different Postgres users. -
notifydb
- AWS creates this user when it creates the database. We also call this theadmin
user. Developers can log in as this user if they need to, but should use thewrite
user instead if possible.
Apart from the notifydb
role, we create the rest ourselves using our DB setup script.
Logs are found in AWS Cloudwatch, in the /aws/rds/instance/notifydb/postgresql
log group.
You can query them by going to the log group and clicking 'Search all log streams'. This is because there are 4 log streams because RDS writes logs in parallel.
The general
postgres logs are logged.
All queries run by the readonly
, write
and "admin" (the notifydb
postgres) user are logged.
TODO: Document slow query logging
These are available in the AWS RDS dashboard in Grafana. You can also see the full set of RDS metrics in the AWS console.
We have enabled Performance Insights, which is a performance tuning and monitoring feature that helps you quickly assess the load on a database. It can show you why the bottlenecks on our database and the queries that may be causing them.
We have warning and critical level alerting for CPU, memory and database connections. Warning level alerts will be sent via email. Critical alerts will set off Pagerduty.
Notify backups are stored for 31 days. All transaction logs are kept over this time, meaning that you can restore the database to any point in time between 31 days and 5 minutes ago.
To restore a database you need to create a new RDS instance from a snapshot. This can be done using the AWS console or via the terraform.
To restore a database from the console you need to:
- Go to the RDS console
- Select the database you want to restore
- Click
Actions
->Restore to point in time
- Select 'Latest restorable time' or the custom time you want to restore to
- Enter a new DB instance identifier
- Change Availability Zone to
No Preference
- Review the rest of the settings to ensure they are the same as the original database
- Click
Restore DB Instance
To restore a database from terraform you need to:
- Go to the environment folder you want to restore. Open the
new-migration-resources.tf
file. - Copy the module "rds" block and paste it just below.
- Add/Replace the following config to the new module
module "rds_restore" {
...
db_identifier = "notifydb-restore"
dedicated_kms_key = false
external_kms_key = module.rds.kms_id
apply_immediately = true
restore_to_point_in_time = {
source_db_instance_identifier = "notifydb"
restore_time = "2023-10-02T10:00:00Z"
}
}
Adjust the 'restore_time' to meet your needs. For the latest use use_latest_restorable_time: true
.
Ensure you allow access from the bastion by adjusting the allowed_services_sg_ids in the bastion module.
module "bastion" {
...
allowed_services_sg_ids = [
...
{ "name" : "rds_restore", "sg_id" : module.rds_restore.db_security_group_id, "port" : 5432 },
]
}
- Use the concourse pipelines or run terraform manually to deploy.
This step is optional and should only be used in the worst case, such as when there is serious data loss or corruption in the database. In most cases you should be able to use pg_dump to extract the data you need from the restored database to restore to the original database.
Currently the application connects to the database via a CNAME. This means that we can change the CNAME to point to the new database and the application will automatically connect to the new database.
However, be warned that changing the DNS will not disconnect any existing connections to the old database. You'll need to ensure that all connections to the old database are closed before switching over. This can be achieved by stopping the old database. If you choose to stop the application instead, you will need to wait for the DNS TTL before starting the application again.
This can be updated via the aws console or via terraform.
To update via the aws console:
- Go to the RDS console
- Select the database you want to move away from
- Click
Actions
follwed byStop temporaily
(This is to ensure no existing connections are open) - Make a not of the database endpoint
- Go to the route53 console
- Select the hosted zone for the environment
- Select the
db
CNAME. - Change the value to the new database endpoint
- The application will automatically connect to the new database, after the dns ttl has expired
To update via terraform:
- Go to the environment folder you want to restore. Open the
new-migration-resources.tf
file. - Update the module "rds_route53" and update the record to point to the new module.rds_restore.host. E.g.
module "rds_route53" {
...
subdomain = "db"
records = [
{ "name" : "notify", "host" : module.rds_restore.host, "weight" : 100 }
]
}
- Update the output
security_group_ids
and refer to the new rds module. E.g.
output "security_group_ids" {
value = {
...
"rds" = module.rds_restore.db_security_group_id
}
}
- Run both the environment terraform and the application terraform, ideally via concourse.
If you have made changes via the aws console. All changes will need to be imported into the terraform state. This can be done via tfmigrate or manually.
To ensure our Disaster Recovery plan is effective, we have implemented a policy that restricts Admins from deleting RDS instances in the production account. This policy is in place to prevent accidental deletion of RDS instances by users, as well as to prevent compromised user accounts from deleting RDS instances. The policy is enforced at the top level AWS account, and you can find more information about it at the following link:
If you need to delete an RDS instance, you will need to either have this service control policy removed to delete the RDS instance.
AWS will force us to upgrade our Postgres minor version occasionally. We've done this once before so you can use the following as a guide (but is OK to deviate as you see fit).
- Change the value of
auto_minor_version_upgrade
value fromfalse
totrue
. Make sure this is just for the preview database. - Wait for the maintenance window to happen, confirm that the upgrade was successful and preview is still working.
- Change the value of
auto_minor_version_upgrade
back tofalse
for preview. Also change the RDS engine version to the new minor version for preview (otherwise Terraform will error).
This test will be useful for checking roughly how much downtime we might expect. Rather than waiting for the maintenance window to happen, we will manually trigger the upgrade using the AWS user interface whilst doing a basic load test to see what happens
- Start a low volume load test running against the notify API status endpoint (consider using something like
vegeta
from your local machine) - Manually click the upgrade button in AWS to upgrade staging
- Once the upgrade has finished, you can end the load test and see the results for how much downtime there was by looking at the error logs in Logit staging. We expect there to be less than a minute for minor upgrades. If there is more, then discuss with the team what to do
- Change the RDS engine version to the new minor version for staging (otherwise Terraform will error).
- If preview and staging have upgraded successfully, pick which week you want the upgrade to happen for production. It should be at least 2 weeks in the future so we can give our users at least 2 weeks notice. You should let the team know and anyone who is on support the night of the upgrade.
- Set up a scheduled maintenance on statuspage for the production upgrade. You should be able to copy the message from https://status.notifications.service.gov.uk/incidents/cw4c73pz8hd5. Automate as much of the scheduling as possible using statuspage so you don't need to manually update anything when the maintenance window starts. You should send notifications to announce the scheduled maintenance but the in progress and completed notifications are optional.
- A few days before the expected day of the production upgrade, check in the AWS interface that pending maintenance is set for the production database, and it is upgrading to the version you expect. Then you should change the value for
auto_minor_version_upgrade
to betrue
for production. - Production will then upgrade automatically in our maintenance window. We expect it to pagerduty the person on call who will need to make sure that it recovers successfully.
- Change the value of
auto_minor_version_upgrade
back tofalse
for production. Also change the RDS engine version to the new minor version for production (otherwise Terraform will error). This can be done the morning after the upgrade.
You might want to use a copy of one of our environment's databases locally, for example to test a database migration.
You can grab this data using pg_dump. This command takes at least half an hour to run.
gds aws notify-staging -- db-connect.sh notifydb -- pg_dump \
-Fc \
-f staging-db-dump \
--exclude-schema=hint_plan \
--exclude-schema=pglogical \
--exclude-table=notifications \
--exclude-table=notification_history \
--exclude-table=ft_billing \
--exclude-table=ft_notification_status
This grabs a copy of the tables, excluding the two schemas (which produce errors as pg_dump does not have permission to read them - they're related to plugins so not critical) and four mentioned tables (our four biggest tables, which make up 99% of the 440GB staging db size - without them, the dump file is 400mb).
Note that by excluding them, this includes the schema as well as the data! You'll need to run an appropriate
create table
command later if you want to use and repopulate those tables
You can restore it to a local copy by running:
psql -c "create database local_test"
pg_restore -v -d local_test staging-db-dump
export SQLALCHEMY_DATABASE_URI="postgresql://localhost/local_test"
# you can now run commands like `flask db upgrade` against this db