update terraform readme

pubpub · May 20, 2024 · dab3faf · dab3faf
1 parent 062b2f8
commit dab3faf
Showing 1 changed file with 55 additions and 13 deletions.
diff --git a/infrastructure/terraform/README.md b/infrastructure/terraform/README.md
@@ -6,27 +6,69 @@
 
 You must have some way of storing terraform state files.
 We use and recommend the s3 backend, but you can change
-that configuration. In order to keep this code generic, however,
-we omit the specific bucket config.
+that configuration. See `./environments` for examples of
+configuring the backend.
 
-You can fill in your own backend config in a file such as
-`demo-env.s3.tfbackend` and supply this config on command-line:
+We store our terraform state in an S3 bucket created by
+the `./environments/global_aws` directory, which has an
+interactive setup; see that README for more info.
 
-```bash
-terraform init -backend-config=demo-env.s3.tfbackend
-```
+## Change management and workflows
 
-This file is only needed for `terraform init`. Once that is done,
-you don't need to supply the backend config to `terraform plan/apply`.
-If you need to change the backend, update this file and `init` again.
+This code is here to make infrastructure declarative rather than imperative.
+It secondarily includes modularization to make it hard for configuration to drift
+between preproduction / production or open source deployments.
+These are two separate concerns.
 
-### Vars file
+Declarative code changes are still managed imperatively with `terraform apply`,
+which can be made partially or fully automatic.
 
-the module exposes its configuration area, but those configurations
-need to be supplied at plan/apply time using the flag `-var-file=demo-env.tfvars`.
+In general, production changes are applied manually after we are satisfied with
+preproduction, which may or may not be automatic. Developers should expect a flow like:
+
+1. make a change to shared module code
+2. make the matching change to configuration in ALL environment directories, so they can be reviewed together
+3. apply this new SHA to staging and do validation as desired
+4. apply this same SHA to production.
+
+Now there is no drift between code, staging, and production - we are converged.
+
+## Rollbacks
+
+Generally, rollbacks are done in emergencies and are done first in prod. (If done first in staging, this is
+really no different from a roll-forward.) Rollbacks are the only situation in which we should expect
+to deploy production from an off-main branch. Changes may be infrastructure or code. In the infrastructure workflow:
+
+1. make changes to terraform code that seem to fix the issue and apply them to production
+2. if they resolve the issue, figure out how they need to be applied to pre-prod for consistency and open a PR
+3. when this PR is merged, it deploys to pre-prod and we are converged.
+
+In general, code rollbacks can be done without a re-build by deploying an old SHA, but it is preferable,
+if there is time, to do a revert & roll-forward flow,
+because some operations (primarily database migrations) operate on assumptions of monotonic time. Additionally,
+this flow makes it easier for rollbacks to include reverts of specific changes in the middle of the commit history
+without reverting everything more recent.
+
+## Adding/updating variables and configuration
+
+Variables are the things that distinguish one environment from another. These include container variables and
+certain extra values such as infrastructure scaling / footprint parameters. There is a tradeoff between ease
+of configuration change and strength of guarantees given by similarity between staging and prod. First decide if
+your change should be applied identically to each environment, or warrants an increase in drift.
+
+To add a variable, modify some terraform resource that depends on it and then thread your way back up. The most
+common case will be to add an environment variable to a container, so we will use that as the example here:
+
+1. modify `modules/deployment/main.tf` to add a variable to the appropriate invocation of `container-generic`.
+1. modify `modules/deployment/variables.tf` to add the variable declaration. (This step is not needed if your new env var can be computed based on changes to the upstream infrastructure, such as a database URL.)
+1. modify each invocation in `environments/*/main.tf` to add this new variable.
+
+Proceed as above. Note that changes to task definitions (which include container configs) are not actually applied until you then trigger a new `deploy` using `act`/`mask` or the GitHub console.
 
 ## Adding secrets
 
+Secrets are a special variety of environment variable. The process is just like the above, but with an extra step after `terraform apply` and before `mask ecs deploy`:
+
 To provide secrets to ECS containers, you should put them in AWS Secrets Manager.
 To do this, replicate the setup in `modules/core-services/main.tf`: create a resource
 that declares the existence of the secret. Since the purpose of this model is to