
serialization infrastructure and dev environment improvements (proposal #1) #248

Closed · guidorice opened this issue May 18, 2020 · 8 comments
Labels: enhancement (New feature or request)

@guidorice (Contributor) commented May 18, 2020

This issue recaps the planning call earlier today and lists the steps for moving the nasa-apt project from a prototype to maintainable, production-quality software.

Background

This is focused on the serialization pipeline (database -> latex -> pdf, html) and on improving the development environment. The current serialization pipeline and dev environment have some issues that need to be fixed, and those issues also make the project difficult to develop on:

  1. Serialization is triggered by an aws lambda that fires on s3 bucket file changes. I do not know whether this mechanism is supported by localstack, but it is definitely not implemented in the local development environment. This means that when running locally the UI cannot download pdf files and cannot get information about the status of serialization.
  2. In staging and production, the serialization happens every time a user views an ATBD page, which is wasteful. We want to serialize only when the document has changed, whether it's a content change, status change, or version change.
  3. The UI does not display errors to inform the user when serialization has failed. Only the editor/administrator should really care about this; for regular users it should just work as if the pdf output were static content.
  4. Serialization containers copy their source code scripts and templates from s3 buckets on every Docker ENTRYPOINT run. This is inefficient, repetitive, and not great from a security perspective. It is best practice to bake the scripts into each of the serialization Docker images at build time. It is worth noting that the current design was somewhat of a workaround: the LaTeX Docker image is ~1.3GB, which can be time consuming to build and push to ECR.
  5. Testing is needed. We want to ensure "serialization works accounting for edge cases, and that it works with the schema provided by the server" (from LaTeX template refinements #221)

Additional background is in Technical lessons from nasa-apt.

There are some improvements that could be made to the CI/CD and CloudFormation deployment scripts, and to the integration of the two; however, that is not covered in this issue.

Proposal

The key point is that we need to move the business logic for serializing ATBDs as close to the database as possible. It should not be triggered by the UI.

One proposal was to wrap the PostgREST service with a new API endpoint that would control the serialization process and also act as a facade to the PostgREST API. That would be an improvement, but I am concerned it adds complexity by introducing a new service plus maintaining a facade for the existing service.

The serialization business logic can be moved even closer to the database by using PostgreSQL's LISTEN/NOTIFY external notification feature. The pg-bridge container could invoke a webhook on a new service, with a payload of the ATBD id, each time the ATBD has been edited.

Data model addition

  • Add to the PostgREST data model a serialization table which contains several values: the SHA checksum for the related ATBD and the datetime when it was calculated. The SHA checksum is used as a path and key into the rest of the serialization process: both the UI and the serialization worker will use it to locate the latest serialization output and its status (pending, errored out, etc). A rough sketch of how the checksum could be derived follows this list.
  • Add a datetime column to the atbd_versions table. This should enable manual triggering of serialization by causing the checksum to change.
  • Add LISTEN/NOTIFY support to the schema.
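
For illustration only, here is a minimal Python sketch of how the checksum could be derived from the current ATBD state via the PostgREST API. The base URL, the atbds route, and the filter parameter are assumptions, not the actual schema:

```python
# Sketch only: derive a content checksum for one ATBD from the PostgREST API.
# The base URL, the "atbds" route, and the filter parameter are assumptions.
import hashlib
import json

import requests

POSTGREST_URL = "http://localhost:3000"  # assumed local PostgREST endpoint


def atbd_checksum(atbd_id: int) -> str:
    """Fetch the current ATBD state and return a deterministic SHA-256 hex digest."""
    resp = requests.get(f"{POSTGREST_URL}/atbds", params={"atbd_id": f"eq.{atbd_id}"})
    resp.raise_for_status()
    # Serialize with sorted keys so identical content always hashes identically.
    canonical = json.dumps(resp.json(), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```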

Serialization worker

Create a new docker container that implements a webhook endpoint (the receiver of pg-bridge calls) and also has access to the PostgREST API. The serialization webhook's responsibilities are:

  • calculates the SHA checksum for the current state of the ATBD, by querying the existing PostgREST API.
  • updates the serialization table with the SHA and datetime via the PostgREST API (note: pg-bridge and/or the webhook should ignore changes to the serialization table itself to avoid a loop).
  • writes the ATBD id and SHA checksum into a local work queue. A local Redis in-memory database would be a good fit for a work queue like this.
  • a worker process runs continually, polling the local Redis queue and performing the serialization by calling docker exec (or whatever the equivalent is in an ECS environment) to invoke the json -> latex -> html, pdf conversions. The outputs and the status of the pipeline are stored in s3 under the key of the SHA checksum. A rough sketch of this loop follows the list.
  • Garbage collector: by serializing upon every ATBD change, many outdated PDFs will get created and stored on s3. If we want to prevent people from bookmarking outdated PDFs, a GC worker could scan through the s3 bucket and delete old PDFs. If we want to preserve old versions, it would be fine to skip this, at the expense of more s3 storage.
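
A rough sketch of the worker loop, assuming a Redis list named serialization_queue, a job payload of {"atbd_id": ..., "sha": ...}, and a hypothetical docker exec invocation of the existing tex container (the actual container name and the ECS equivalent are still open questions):

```python
# Sketch only: poll a local Redis work queue and run the json -> latex ->
# html/pdf pipeline for each job. Queue name, job format, and the docker
# invocation below are assumptions for illustration.
import json
import subprocess

import redis

queue = redis.Redis(host="localhost", port=6379)

while True:
    # BLPOP blocks until a job arrives; the timeout lets the loop wake up periodically.
    item = queue.blpop("serialization_queue", timeout=30)
    if item is None:
        continue
    job = json.loads(item[1])  # e.g. {"atbd_id": 2, "sha": "9ad56ce0..."}
    try:
        # Hypothetical local invocation of the existing tex container; in an ECS
        # deployment this would become an `aws ecs run-task ... --overrides` call.
        subprocess.run(
            ["docker", "exec", "tex", "python", "serialize.py",
             str(job["atbd_id"]), job["sha"]],
            check=True,
        )
        status = {"status": "done"}
    except subprocess.CalledProcessError as err:
        status = {"status": "errored", "detail": str(err)}
    # In the real worker this status JSON would be written to s3 under the SHA
    # checksum key, alongside the pdf/html outputs.
    print(json.dumps({**job, **status}))
```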

How this proposal addresses problems 1-5 above

  1. Serialization is triggered by Postgres LISTEN/NOTIFY when the content changes, so it is much harder to mess up the business logic. We could automatically support changes coming from outside of the PostgREST API as well, e.g. batch ingest or merges of databases.
  2. For any SHA checksum the serialization will only happen once. Re-serialization can be triggered by updating the datetime in the atbd_versions table.
  3. The status and the errors are available in a json doc in the s3 bucket right alongside the serialization output files.
  4. Script files are built into the docker images, following best practices. Because we are supporting local development again, this is not an impediment to iterating quickly.
  5. Testing. By creating infrastructure for predictable serialization, with a clear path for propagating errors to s3 and to the UI, this proposal sets the stage for more unit testing and integration testing.
guidorice added the enhancement (New feature or request) label on May 18, 2020
guidorice self-assigned this on May 18, 2020
@guidorice (Contributor, Author) commented:

@danielfdsilva @sharkinsspatial @olafveerman please add your feedback to this. thanks

@guidorice (Contributor, Author) commented:

An additional requirement of the Serialization Worker is that it would need to handle missed events (e.g. when the listener client is disconnected):

> Limitations of Listen and Notify
>
> It is crucial that an application using the PostgreSQL notification capabilities is capable of missing events. Notifications are only sent to connected client connections.
>
> Any queueing mechanism requires that events accumulated when there's no worker connected are kept available until the next connection, and replication is a special case of event queueing. It is not possible to implement queueing correctly with the PostgreSQL listen/notify feature.
>
> A cache maintenance service really is the perfect use case for this functionality, because it's easy to reset the cache at service start or restart.

This means the Serialization Worker would need to, at startup, scan the database (the single source of truth) to find ATBDs which have not yet been serialized, or whose SHA checksum does not match the current version in the database. The same local work queue described above could be used to process through the backlog.
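
As an illustration, a sketch of that startup scan, reusing the hypothetical atbd_checksum helper and Redis queue from the sketches above; the atbd_versions and serialization routes and their column names are assumptions:

```python
# Sketch only: on startup, re-enqueue any ATBD whose stored checksum no longer
# matches the current database state (covers NOTIFY events missed while the
# listener was disconnected). Routes, column names, and helpers are assumptions.
import json

import redis
import requests

POSTGREST_URL = "http://localhost:3000"
queue = redis.Redis(host="localhost", port=6379)


def reconcile_backlog() -> None:
    versions = requests.get(f"{POSTGREST_URL}/atbd_versions").json()
    stored = {
        row["atbd_id"]: row["sha"]
        for row in requests.get(f"{POSTGREST_URL}/serialization").json()
    }
    for version in versions:
        current_sha = atbd_checksum(version["atbd_id"])  # helper from the checksum sketch
        if stored.get(version["atbd_id"]) != current_sha:
            queue.rpush(
                "serialization_queue",
                json.dumps({"atbd_id": version["atbd_id"], "sha": current_sha}),
            )
```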

@danielfdsilva (Collaborator) commented:

@guidorice fantastic issue, really. 👏
I do like the listen/notify idea; however, we do need to be wary of multiple database changes. The way the API is set up, every piece of the ATBD form is saved individually, and each save triggers a DB change. This would leave us with several useless pdfs.

Debouncing the requests and serializing after some time sounds like a good approach; however, it is difficult to know what this "time" is. Filling out the form can take quite some time, or be very quick if it's just a typo. I guess this could be part of the reason why the serialization is done on view.

Do you know what the cost of serialization is (money/time/resources)? If it is low enough maybe it is not a problem to serialize often and then clean up with a GC.

@guidorice (Contributor, Author) commented:

Notes from standup today:

re: debouncing: we agreed that debouncing the notifications before serializing is needed, because the ATBD editing form has many parts that are accessed in a wizard-style prev/next interaction, and it could be wasteful to continually serialize as the user goes through. The severity of this would depend on the cost/time to serialize, though. (A rough notification-coalescing sketch follows these notes.)

re: serialization cost/time: I am going to do a test case of a complex ATBD and get a ballpark estimate.

re: cleanup of old pdfs: we agreed that a garbage collection script is needed too, because when an ATBD is published it is no longer editable, and therefore there will be a 1:1 correspondence between an ATBD, its current version, and the PDF. So we only need to preserve the current PDF.
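
For illustration, a very rough sketch of one way the webhook could coalesce a burst of notifications for the same ATBD into a single job; the 30-second quiet period, queue name, and function names are all assumptions:

```python
# Sketch only: coalesce rapid-fire notifications per ATBD so that a burst of
# form saves produces one serialization job. The quiet period, queue name,
# and function names are assumptions for illustration.
import json
import threading
import time

import redis

QUIET_PERIOD = 30.0  # seconds without further edits before enqueueing (a guess)
queue = redis.Redis(host="localhost", port=6379)
last_seen = {}  # atbd_id -> time of most recent notification
lock = threading.Lock()


def on_notification(atbd_id):
    """Called by the webhook for every pg-bridge notification."""
    with lock:
        last_seen[atbd_id] = time.monotonic()


def flush_quiet_atbds():
    """Background loop: enqueue ATBDs that have been quiet for QUIET_PERIOD."""
    while True:
        time.sleep(5)
        now = time.monotonic()
        with lock:
            ready = [a for a, t in last_seen.items() if now - t >= QUIET_PERIOD]
            for atbd_id in ready:
                del last_seen[atbd_id]
                queue.rpush("serialization_queue", json.dumps({"atbd_id": atbd_id}))
```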

Also, Alyssa responded to my email inquiry, and she basically confirmed that 1) her mode of development was to run commands in a local docker container, e.g. pdflatex XYZ.tex, and 2) the serialization pipeline was never implemented end-to-end in the local development environment.

cc @danielfdsilva @sharkinsspatial

@guidorice (Contributor, Author) commented May 19, 2020:

Rough benchmarking results (of the old prototype serialization pipeline)

I locally modified ecs/tex/entrypoint.sh to 1) use bash and 2) report the $SECONDS variable for 3 different sections: all the initialization at the top, just the python script run, and the tail (copying to s3 to finish).

Example benchmark command (noting also that ecs/README.md has incorrect instructions re: parameters and environment variable names), with timing for the whole run command (run locally on a laptop):

time docker run \
-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
tex \
nasa-apt-staging-json \
ATBD_2v1_9ad56ce0-99da-11ea-b5d4-fbcf4afd022a.json \
nasa-apt-staging-atbd \
nasa-apt-staging-scripts
initialization section entrypoint.sh takes 00:00:06
serialize.py takes 00:00:01
finish of entrypoint.sh takes 00:00:02
...

real    0m9.182s

Conclusion

The python script (the actual serialization) takes only ~1 second. Everything else takes ~8 seconds. The s3 copying, pip install, etc. at the head of the script are going to get refactored out, so I do not see any problem with having this run frequently (at every edit of the ATBD).

Debouncing the serialization worker events from postgresql listen/notify is not a priority.

Assumption: the html and pdf serialization will take roughly the same time as the json -> latex serialization step.

cc @danielfdsilva

@guidorice (Contributor, Author) commented:

> performing the serialization by calling docker exec (or whatever the equivalent is in an ECS environment)

ECS ContainerDefinitions can have an entrypoint and command. In a CloudFormation deployment, then, the relevant command to execute a task in another Docker container would be something like:

aws ecs run-task ... --overrides '{"containerOverrides": [{"name": "whatever", "command": ["foo", "bar"]}]}' (source: stackoverflow)

This would unfortunately mean that local development using docker-compose has to use docker run or docker exec commands, while the aws CloudFormation deployment has to use aws ecs run-task scripts. Maintaining two sets of scripts is not ideal and adds complexity for testing as well, but the alternatives don't seem good either:

Possible alternatives:

  • Convert the ecs/[tex,pdf,html] docker containers to be http services instead of command entrypoints. I do not really like this idea: it adds overhead and introduces new long-running daemons plus the limitations of http, such as possible request timeouts.

  • Use the Amazon ECS CLI to manage the deployments; it works at a higher level than aws cloudformation deploy ... scripts. This is a tool I've never used, and as a high-level tool it's possible we would lose functionality that is in the current cloudformation.yaml.

> The Amazon ECS CLI supports Docker Compose, an open-source tool for defining and running multi-container applications. You can apply the same Compose definition used to define a multi-container application on your development machine as well as in production.

@sharkinsspatial if I can get your feedback/advice on the thread thus far, that would be much appreciated.
cc @danielfdsilva

guidorice changed the title from "serialization infrastructure and dev environment improvements" to "serialization infrastructure and dev environment improvements (proposal #1)" on May 20, 2020
@guidorice (Contributor, Author) commented May 20, 2020:

Summary of this proposal versus the current implementation

| Implementation | Caching of PDFs | Speed of serialization | Developer friendly |
| --- | --- | --- | --- |
| prototype (current in nasa-apt repo) | No cache (saved to s3 but always re-serialized) | Slow: staging takes ~20 seconds to generate a PDF even after the system is warmed up | Local dev environment is broken |
| Simpler proposal #250 | Cache published ATBDs only | Goal: fast, serialization to PDF takes < 1 second | Goal: support docker-compose locally with all functionality |
| #248 (this proposal) | Always cache, with smart cache invalidation | Goal: fast, serialization to PDF takes < 1 second | Goal: support docker-compose locally with all functionality |

cc @danielfdsilva

@guidorice (Contributor, Author) commented:

Superseded by #250
