-
Notifications
You must be signed in to change notification settings - Fork 2
Database Backups and Restores
- The ability to restore data from a previous backup is a fundamental non-functional requirement for operational and security reasons.
- Backups should be performed at least daily (in production).
- Backups should be stored for at least a month (in production).
- A documented and tested process should exist to restore from a backup.
- It should be possible to restore from a point in time just before significant changes.
AWS RDS performs automated backups called snapshots on a daily basis. The timing when snapshots are created can be specified. This is normally scheduled overnight, after regular business operations have completed. Snapshots can be retained for up to 35 days before AWS removes them. Storage costs are incurred for keeping snapshots.
Snapshots can also be created manually at any time with the AWS administration console.
AWS Aurora supports point-in-time recovery (incremental backup) as per https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-pitr.html
Snapshots created at 0200 UTC (~7pm BC). Snapshots retained for 35 days. Enable "Copy tags to snapshots" setting.
A database snapshot is created as part of the CI/CD pipeline just prior to deploying and running flyway database migrations. See https://github.com/bcgov/nr-forests-access-management/blob/dev/server/flyway.tf. This snapshot is recreated on each pipeline deployment, so it only useful for recovery purposes until the next pipeline execution. These are not deleted automatically and must be deleted manually.
This triggered snapshot is strictly speaking not necessary since point-in-time recovery can be used. However, note that AWS warns that point-in-time recoveries can take several hours to complete depending on the volume of transaction logs. While FAM will likely have low volumes of transaction logs, this is an argument for keeping this explicit database snapshot as part of the CI/CD pipeline.
There are a number of challenges that the restore strategy must deal with:
- Restores from a backup must be done into a new database instance and cannot be done into the existing database
- The restore process must deal with the database proxy and database configuration information used by application components (lambda functions).
- Terraform state as part of our CI/CD pipeline is tracking the existing database - subsequent deployments need to work against the recovered database.
- Renaming a database instance causes an outage for that instance and can take some time (unlike cluster renames which are fast).
- The restore process does not create a multi-AZ database. There will be a single writer instance with no replication. When Terraform runs it will modify the infrastructure to be multi-AZ with a writer instance and an additional reader instance in a separate AZ.
The recommended approach is to first rename the existing database, then restore into a new cluster/instance named the same as the existing. Since renames cause outages and take some time, this is faster than restoring in a new instance and renaming both old and new. **WARNING: no terraform deployments (apply commands) can be performed while the renames are happening. **
An alternative approach is to restore into a new database, then rename the existing database, then rename the restored database to have the same name as the existing database. This has an extra rename so will extend the time, but only impacts users once (for both renames), in contrast with the prior approach which impacts users at the start for the first rename and at the end for the switch to the restored instance.
Another option considered is to simply restore into a new instance, then update terraform state to point to it. This might be the best option from the standpoint of minimizing outage duration (since no renames are required). It appears it is possible to update terraform state via https://www.terraform.io/cli/commands/state/mv, but this would take further investigation.
One rejected option is deleting the existing database, then restoring from backup into the same database instance name. This is strongly not recommended for several reasons:
-
Point-in-time recovery must be done from the existing instance.
-
Existing configuration settings are lost when the existing database is deleted. The restore process might need to check what existing settings were.
-
This is not a safe process, if something goes wrong with the restore process, there's no existing database to use.
-
Optional: stop the application to prevent new transactions from being created that won't be part of the restored database. For FAM this can be done by pulling up the lambda in AWS and choosing the "Throttle" button , which sets the reserved concurrency to zero and prevents any executions. After the restore process is complete this can be reversed by choosing the "Edit Concurrency" button and choosing the unreserved concurrency option.
-
Notify team to avoid any terraform deployments to the environment until the restore process is completed.
The restore process duration will depend on the size of the database backup and the # of transactions to apply (likely few in the case of FAM) based on how far away the point-in-time is from the applicable snapshot. Given that the various renames also take time (at least 5 - 10 minutes each), the overall restore process should be expected to take at least one hour in duration.
- Navigate to AWS RDS, Databases view. Select cluster to restore (e.g. famdb-cluster-first-griffon), choose Actions > Restore to point in time
- Choose point in time to restore from. (Choosing a time close to an existing snapshot will recover faster.)
- DB instance identifier: famdb-cluster-restore (perhaps with suffix using date of snapshot).
- For most options just accept defaults (this is an advantage of this approach via restoring straight from a snapshot).
- Instance configuration capacity range: Min 0.5 ACUs, Max 1 ACUs (match what is specified for the existing instance on the Configuration tab).
- Additional configuration:
- Initial database name: famdb
- DB cluster parameter group: choose same as existing e.g. "famdb-cluster-first-griffon-cluster-parameter-group"
- DB parameter group: choose same as existing e.g. "famdb-cluster-first-griffon-parameter-group"
This outlines the process to restore directly from a snapshot instead of doing a point-in-time recovery. There are more configuration settings to manually set up this way, but the restore can be faster because no transaction logs need to be reapplied.
Choose snapshot and select "Restore snapshot" from Action menu On Restore Snapshot screen choose settings same as existing Aurora instance (select Modify on existing instance to see current settings, and/or select Configuration tab)
- DB engine version: Aurora PosgreSQL (Compatible with PostgreSQL 13.6)
- DB cluster identifier: famdb-cluster-restore (perhaps with suffix using date of snapshot)
- VPC: accept default (e.g. Dev_vpc, will be only one choice)
- DB Subnet group: accept default
- Public access: no
- VPC security group: Choose existing - Data_sg
- Availability zone - No preference
- Instance configuration:
- DB instance class - Serverless v2
- Capacity range: Min 0.5 ACUs, Max 1 ACUs (match what is specified for the instance on the Configuration tab).
- Database authentication: leave both IAM and Kerberos authentication disabled
- Encryption: accept default (yes with default AWS KMS key)
- Additional configuration
- DB cluster parameter group: choose same as existing e.g. "famdb-cluster-first-griffon-cluster-parameter-group"
- DB parameter group: choose same as existing e.g. "famdb-cluster-first-griffon-parameter-group"
- Select "Restore DB Cluster"
Note: this step is not required as the Terraform will re-point the RDS Proxy target when it is re-run. The advantage here is that the Terraform run can effectively be delayed. This may be preferable as the Terraform run may cause downtime and the operations team may choose to run Terraform at a later time when the system is not in heavy use.
- After the new restored database cluster is running, modify the database proxy:
- Select RDS | Proxies, choose the proxy (e.g. famdb-cluster-first-griffon-fam-api-proxy-api),
- Scroll to target group section, select default target group, and select Edit
- Change Database to "famdb-cluster-restore" and select Save changes.ll
At this point lambdas using the proxy should be functional - e.g. through the application. (The deployment pipeline is not yet functional - terraform and the flyway lambda are still pointing at the existing database.)
Two clusters cannot have the same name in AWS. The restored cluster (and instances) need to match the name specified in the Terraform scripts in order to be aligned with the Infrastructure as Code specification. In order to achieve this:
- Rename original database cluster adding 'old-' as a prefix. Also remove deletion protection if it is enabled. Make sure to select "apply immediately".
- Rename each database instance of original cluster adding 'old-' as a prefix. Also remove deletion protection if it is enabled. This causes the instance to be rebooted and takes a while. Make sure to select "apply immediately".
- Rename the restored cluster to the original name of the original cluster. It needs to match the original name of the original cluster (specified by Terraform). If in doubt, check the config (prod config is at here on the main branch and the config item is 'famdb_cluster_name'). Make sure to select "apply immediately".
- Rename the database instance of the restored cluster to match the original. There will only be a single writer instance in the restored cluster. Rename it to "-one". Make sure to select "apply immediately".
- Stop the original database cluster (the cluster cannot be stopped earlier because renames can only be done on a running database).
The database does not yet align with the Terraform-specified infrastructure. Specifically, it is missing the multi-AZ configuration for the database instances. This is not necessarily a serious production issue, but has implications for availability (in the case of a disaster situation in the datacenter that houses the database instance itself). Optionally, create the reader instance manually (see below for instructions).
- Carefully identify the version of the GitHub repository that is currently in production. You do not want to introduce changes during this process. Refer to the history of deployments to the production environment. If you click on the latest successful deployment, the exact git commit that was deployed is displayed next to the GitHub user ID of the person who triggered the deploy. The goal is to re-run the pipeline from that commit.
- Identify the correct branch for the re-release. Unfortunately, the GitHub Actions pipeline and the Flyway Lambda function do not correctly run when you kick off a workflow pipeline from a tag. This is something we need to fix for better release management. As a workaround, we will have to create a branch for each release (and keep that branch from being updated accidentally). This means that if we are in production, we would have already deployed from that branch and we can re-run the workflow from the same branch (just make sure it is the right one as per step #1!).
- Refer to the detailed instructions in Release Management to run the GitHub Actions workflow to redeploy to production.
There are other small things that will be applied when the terraform apply is run. During testing, some of the differences found included:
- For database cluster
- Different cluster resource id
- Missing enabled cloudwatch log exports
- Wrong maintenance window
- Missing tag managed-by=terraform
- For database instance one/two:
- Different database instance resource id
- Wrong promotion tier (one instead of zero) - only happened for instance one
- Wrong maintenance window (sat 7:02-7:32 instead of sun 05-06)
- Missing tag managed-by=terraform
During testing, initial execution of terraform apply failed, apparently because the database cluster change was still happening when the database instance changes were requested (Error: updating RDS Cluster Instance (famdb-cluster-first-griffon-one): InvalidDBInstanceState: Database instance is not in available state. status code: 400, request id: ac53499c-427f-4aa6-b9d4-49687529d1aa). Database instance must be available to have changes made to it. The solution is to re-run the pipeline.
Note that when Terraform applies database instance changes, it does not pick the 'apply immediately' option, so the database changes are not necessarily in effect until either a reboot is done or the next scheduled maintenance window. Therefore, the last step is to trigger a reboot of each instance. "Apply immediately" is now in aurora-v2.tf so a manual reboot should not be necessary.
Cleanup step - once confirmed that operations are fully restored to normal, the old database cluster can be deleted.
If you really want a multi-AZ cluster and you don't want to run Terraform for some reason, you can create the reader instance manually. This is somewhat low-value as it really only supports disaster recovery (which we can realistically live without until the next Terraform apply).
Next step - Add a reader instance to the restored database. (AWS only allows restores to be done into a single new instance)
- Select the database cluster and choose Add reader
- For DB instance identifier specify the name of the existing instance replacing '-one' with '-two'.
- Under Additional Configuration under DB parameter group choose the cluster parameter group (e.g. famdb-cluster-first-griffon-parameter-group)
- Failover priority set to tier-0
- Set performance insights to same value as existing instance (currently disabled).
- Choose a different AZ if you want multi-AZ support
- If the original database had multiple reader instances then repeat
- Environment Management
- Release Management
- Creating a Release
- Database Backups and Restores
- OIDC Client Testing
- FAM Onboarding Ops Guide
- Setup AWS CloudWatch
- Setup AWS EC2 instance to connect to RDS Postgres Database
- Technical Troubleshooting
- Managing Terraform State
- Enable Cloudwatch Logs for API Gateway
- Update AWS CloudFront Certificate
- Verify IDIM BCeID Client SOAP Web Service