Hackney Data Platform Infrastructure and Code
The Data Dictionary & Playbook can be found on the Document Site and its related GitHub repository.
We use Architecture Decision Records (ADRs) to document the architecture decisions that we make. They can be found in the Data Platform - Playbook.
We use Jupyter notebooks to prototype Glue jobs. These can be hosted either locally or in AWS SageMaker. The notebooks are stored in the Data Platform Notebooks GitHub repository.
The Terraform is deployed using GitHub Actions on every push to main, i.e. whenever a Pull Request is merged into main.
The terraform/core directory contains the majority of the infrastructure and, at the time of writing, has one GitHub Action to deploy to staging and one to deploy to production.
The terraform/etl directory contains the ETL and Glue related infrastructure and, at the time of writing, has one GitHub Action to deploy to staging and one to deploy to production.
The terraform/networking directory contains the networking aspect of the data platform (see Networking) and, at the time of writing, has one GitHub Action to deploy to staging and one to deploy to production.
The terraform/backend-setup directory is only for the deployment of Dev's bucket, so it does not need a GitHub Action. The Terraform state for this area is maintained in the repo and we run it locally.
- Create an `env.tfvars` file for local deployment. This can be done by running
  ```
  cp terraform/config/env.tfvars.example terraform/config/env.tfvars
  ```
  from the project root directory.
- Update the following required variables in the newly created file:
  - `environment` - The environment you're working in (this is normally `dev`)
  - `aws_api_account_id` - The API AWS account number to deploy the RDS Export Lambda to, i.e. the number after the #, beneath the public account name (for development purposes this is normally the account named DevScratch)
  - `aws_deploy_account_id` - The primary AWS account number to deploy to (for development purposes this is normally the account named DataPlatform-Development)
  - `aws_deploy_iam_role_name` - The role that will be used to deploy the infrastructure (for development purposes this is normally `LBH_Automation_Deployment_Role`)
  - `google_project_id` - The Google project to create service accounts in (for DevScratch this is `dataplatform-dev0`)
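Putting the variables above together, a completed `env.tfvars` might look like the following sketch. The account IDs shown are placeholders, not real values; substitute your actual account numbers:

```hcl
environment              = "dev"
aws_api_account_id       = "123456789012" # placeholder: DevScratch account number
aws_deploy_account_id    = "210987654321" # placeholder: DataPlatform-Development account number
aws_deploy_iam_role_name = "LBH_Automation_Deployment_Role"
google_project_id        = "dataplatform-dev0"
```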
- For local deployment, AWS needs an AWS CLI profile (assumed to be called `hackney-dataplatform-development`) in the profile configuration file (which can be set in `~/.aws/config`). Read the documentation on Named Profiles for more guidance on setting up AWS credentials and named profiles.
To set up this profile, you can use the AWS CLI using the following command:
$ aws configure sso
In the terminal, add the start URL and the region as below
SSO start URL [None]: https://hackney.awsapps.com/start
SSO Region [None]: eu-west-1
Your terminal should look like this:
Attempting to automatically open the SSO authorization page in your default browser.
If the browser does not open or you wish to use a different device to authorize this request, open the following URL:
https://device.sso.eu-west-1.amazonaws.com/
Then enter the code:
LDHD-CKXW
There are {number} AWS accounts available to you.
Below the final line there will be a list which can be scrolled through using the arrow keys. Select (with Enter) the account that corresponds to the `aws_api_account_id` used in step 2 above (normally DevScratch):
Using the account ID {number}
If you have access to multiple roles within that account, select the appropriate role (this is assumed to be `AWSAdministratorAccess` below), then press Enter to accept the default options for the CLI default client Region and output format, and enter `hackney-dataplatform-development` as the profile name:
Using the role name "AWSAdministratorAccess"
CLI default client Region [eu-west-2]:
CLI default output format [json]:
CLI profile name [AWSAdministratorAccess-{number}]:
hackney-dataplatform-development
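After the wizard completes, `~/.aws/config` should contain a profile block along these lines. This is an illustrative sketch: the account ID is a placeholder, and the exact keys written may vary with your AWS CLI version:

```ini
[profile hackney-dataplatform-development]
sso_start_url = https://hackney.awsapps.com/start
sso_region = eu-west-1
sso_account_id = 123456789012   ; placeholder: the DevScratch account number you selected
sso_role_name = AWSAdministratorAccess
region = eu-west-2
output = json
```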
Install AWS-Vault
Mac
$ brew install --cask aws-vault
Windows
- Install Chocolatey using the following guide: Link
- Install aws-vault using Chocolatey:
choco install aws-vault
Generate Credentials for Vault
$ aws-vault exec hackney-dataplatform-development -- aws sts get-caller-identity
Ensure that GNU Make is installed on your computer. The full commands for the instructions below can be found in `/terraform/core/Makefile` & `/terraform/etl/Makefile`.
Initialise the Project
- Before you run, ensure:
  - You remove `hackney-dataplatform-development` AWS credentials if they exist in your AWS credentials file
  - You are in the project's `terraform/core` directory
  - You remove the `.terraform` directory and the `.terraform.lock.hcl` file if they exist in the project's `terraform/core` & `terraform/etl` directories
- Please complete the below make command in both the `terraform/core` & `terraform/etl` directories:

$ make init
Initialise your Workspace (note capitalisation)
$ WORKSPACE={developer} make workspace-new
Select your Workspace (note capitalisation)
$ WORKSPACE={developer} make workspace-select
- Set up Google credentials:
  - Run `brew install --cask google-cloud-sdk` to install the Google Cloud SDK
  - Log in to Google Cloud by running `gcloud auth application-default login`
  - The full path of where the credentials file is saved will be displayed, for example `/Users/*/.config/gcloud/application_default_credentials.json`
  - Copy this file to the root of the project by running the following command in the root of the project:
    ```
    cp /Users/*/.config/gcloud/application_default_credentials.json ./google_service_account_creds.json
    ```
- Next run `make init` in the `/terraform/core` directory. This will initialise `terraform/core` using the AWS profile `hackney-dataplatform-development`. Before you run, ensure:
  - You remove `hackney-dataplatform-development` AWS credentials if they exist in your AWS credentials file
  - You remove the `.terraform` directory and the `.terraform.lock.hcl` file if they exist in the project's `terraform/core` directory
- Next run `make init` in the `/terraform/etl` directory. This will initialise `terraform/etl` using the AWS profile `hackney-dataplatform-development`. Before you run, ensure:
  - You remove `hackney-dataplatform-development` AWS credentials if they exist in your AWS credentials file
  - You remove the `.terraform` directory and the `.terraform.lock.hcl` file if they exist in the project's `terraform/etl` directory
In order to apply Terraform some dependencies must be in place first.
To build all necessary dependencies:
- Navigate to the `external-lib` folder and run `make all`
- Run `make package-helpers` in the project root directory
Please note there are other dependencies, such as Python and Maven, that must be in place in the development environment before the above commands can be executed successfully; these can be resolved separately as and when required.
After running, you can run `make plan`, `make apply` and `make destroy` to run the Terraform deploy/destroy commands with the development `env.tfvars` set for you.
As part of the Terraform deployment, we deploy out a VPC which connects to Hackney's Hub account. Hackney use a Hub & Spoke pattern for interconnecting their accounts and to provide internet access to those spoke accounts. External traffic is routed to the Hub account and scanned before being passed to the Spoke accounts. Outbound traffic follows the opposite flow.
Due to the setup described above, only a single VPC can be created within the AWS Development account. As a consequence, we have created a separate Terraform module which deploys the VPC out to each account. The VPC is then referenced by the main Terraform script by providing the VPC ID as an input variable.
For new developers coming onto the project, the VPC should already have been established and associated with the default workspace state stored in S3. The ID of the VPC has also already been included in the example env.tfvars file. Unless you intend to modify the network configuration, you should not need to modify or deploy the network to run your own personal development environment.
All resources in a given workspace can be destroyed with the `make destroy` command. However, some resources have deletion protection enabled. In order to destroy these resources you can remove those protections by replacing all `prevent_destroy = true` statements with `prevent_destroy = false`.
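For reference, the `prevent_destroy` flag lives inside a resource's `lifecycle` block. The resource below is an illustrative example, not a specific resource from this repo:

```hcl
resource "aws_s3_bucket" "example" {
  bucket = "example-protected-bucket"

  lifecycle {
    # Temporarily change to false to allow `make destroy` to remove this resource
    prevent_destroy = true
  }
}
```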
Please ensure you are in the correct workspace before deleting resources, and make sure you destroy the workspaces in the correct order since they have dependencies on each other: destroy ETL first and then Core. Please don't destroy Networking, since it uses the default/shared workspace and is used by all developers.
If you receive any timeout errors when destroying resources, you can simply run destroy again until it succeeds.