Dockerizing and streamlining #45
TangoYankee
started this conversation in Show and tell
@TylerMatteo, I am wondering how this lines up with your review of the data flow.
Description
The GitHub Action's dependency on an external database cluster is not technically necessary. Ubuntu runners support service containers within the workflow, which can include Postgres databases. Moving from an external database cluster to an Action service provides two key benefits.
Benefits
Simpler database state
First, each run of the data flow should be self-contained and deterministic; it is not meant to be a long-lived service. Self-containment is undermined by persisting the database across several runs. These concerns are partially alleviated by "state checks" that prevent errors from repeated operations, e.g. `CREATE TABLE IF NOT EXISTS`. However, there are countless ways for a database to get into a strange state, and debugging apps with complex state is a nightmare. If we instead leverage a database service within an action, the database is created and destroyed for each run, so we are assured a fresh state every time.
Simpler infrastructure maintenance
Second, it reduces infrastructure overhead: we eliminate the maintenance burden of an entire database cluster.
Example
I made a template workflow that leverages a database service. It is on the ty/docker-service-flow branch. Please ignore that it is triggered on push; that was a decision to help me test faster.
Action workflow
The workflow spins up a PostGIS database service. The database is accessible to the main runner through `localhost`/`127.0.0.1`. The updated flow diagram reflects the ephemeral nature of the databases. This does require switching from a macOS-based to an Ubuntu-based VM.
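For reference, here is a minimal sketch of what such a service block can look like. The image tag, credentials, and database name are illustrative assumptions; the real workflow is on the ty/docker-service-flow branch.

```yaml
# Sketch of a job with a PostGIS service container.
# Image tag, credentials, and database name are assumptions.
jobs:
  data-flow:
    runs-on: ubuntu-latest  # service containers require a Linux runner
    services:
      db:
        image: postgis/postgis:15-3.4
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: data
        ports:
          - 5432:5432  # reachable from the runner at localhost:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      # ... run the ETL against postgres://postgres:postgres@localhost:5432/data
```

Because the service container only lives for the duration of the job, every run starts from an empty database and nothing is torn down by hand afterwards.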
Local workflow
In addition to updating the GitHub Action, there is also a Dockerfile for a local runner environment. The Dockerfile starts from an Ubuntu image, with the intent of matching the local environment to the Action environment as closely as practical. However, the Action's Ubuntu VM comes with more pre-installed applications and different permission levels. Ultimately, the environments are similar but differ in subtle but important ways, complicating any direct translation from the local Docker container to the Action VM.
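As a rough sketch of the idea only — the actual Dockerfile and its package list live on the ty/docker-service-flow branch, and the tools below are assumptions about what the flow needs:

```dockerfile
# Hypothetical sketch of the local runner image; package names
# are assumptions, not the branch's actual dependency list.
FROM ubuntu:22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
       postgresql-client python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /data-flow
# scripts and code are mounted as volumes at runtime (see below),
# so they are not baked into the image here
```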
The data-flow and zoning-api compose files are updated to create a development environment that manages all three services. This is done by creating a `data` network in the zoning-api compose file that is accessible to the data-flow `runner` and `db` services (a sketch of this wiring appears below). To run the data flow, first set up the zoning-api to the point that its database has the required schema. Assuming these steps included running the zoning-api database with `docker compose up`, the `data` network will have been created automatically. We can then run `docker compose up` for the data-flow, which creates the `runner` and `db` services and joins them to the `data` network. We can run the ETL commands against the `runner` container using the syntax in the readme. Alternatively, you can exec into the running container with an interactive terminal and run commands without the `docker exec bash` prefix.

The volume configurations are also updated for local development. First, the database volume is removed from the compose file. As outlined above, state should not be persisted across multiple runs of the data flow. If we want to pause in the middle of a data run, we can use `docker compose stop`, which preserves the database state until we are ready to start working again. If we want to wipe the database and start over, we can run `docker compose down db` and then `docker compose up db`. This recreates the container and lets the data flow start fresh. It should also obviate the need for the setup_local_db.sh script to delete the volume. Second, the runner mounts volumes with the custom scripts and code, so updates are automatically shared with the runner environment without needing to rebuild the image.
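A minimal sketch of that wiring, assuming the service and network names above (the actual compose files live in the two repos):

```yaml
# zoning-api/docker-compose.yml (sketch): creates the shared network
services:
  db:
    image: postgres:15
    networks:
      - data
networks:
  data:
    name: data        # fixed name so other compose projects can find it

---
# data-flow/docker-compose.yml (sketch): joins the existing network
services:
  runner:
    build: .
    volumes:
      - ./:/data-flow # scripts and code mounted, so no image rebuild on edits
    networks:
      - data
  db:
    image: postgres:15  # no volume: state is wiped by `docker compose down db`
    networks:
      - data
networks:
  data:
    external: true      # created by the zoning-api compose
```

Declaring the network as `external` in the data-flow compose is what enforces the ordering described above: the zoning-api stack must be up (and have created `data`) before the data-flow services can start.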
Note on the db image
The local database image was using the PostGIS image, which can cause performance bottlenecks on ARM-based platforms. We solved this in zoning-api by switching to the standard Postgres image and installing PostGIS in the Dockerfile. I also applied this change to the example branch.
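A minimal sketch of that pattern; the Postgres version and PostGIS package name here are assumptions, not necessarily what zoning-api pins:

```dockerfile
# The official postgres image is multi-arch, so it runs natively on ARM
# hosts instead of under emulation; PostGIS is layered on top via apt.
FROM postgres:15
RUN apt-get update \
    && apt-get install -y --no-install-recommends postgresql-15-postgis-3 \
    && rm -rf /var/lib/apt/lists/*
# the extension still has to be enabled per database, e.g. via an
# init script that runs `CREATE EXTENSION postgis;`
```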