serialization infrastructure and dev environment improvements (proposal #1) #248
Comments
@danielfdsilva @sharkinsspatial @olafveerman please add your feedback to this. Thanks.
An additional requirement of the Serialization Worker is that it would need to handle missed events (e.g. when the listener client is disconnected).
This means the Serialization Worker would need to, at startup, scan the database (the single source of truth) to find ATBDs which have not yet been serialized, or whose SHA checksum does not match the current version in the database. The same local work queue as described above could be used to process through the backlog.
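For illustration only, here is a minimal sketch of that startup catch-up scan, assuming a `serialization` table that records the SHA checksum of the last serialized copy of each ATBD version; the table and column names are placeholders, not the actual schema:

```python
# Sketch of the startup catch-up scan. Assumes a `serialization` table keyed
# by (atbd_id, atbd_version) holding the checksum of the last serialized copy;
# table/column names are placeholders.
import hashlib
import json

import psycopg2

def find_stale_atbds(conn):
    """Return ATBD versions that were never serialized, or whose current
    content no longer matches the recorded SHA checksum."""
    stale = []
    with conn.cursor() as cur:
        cur.execute("""
            SELECT v.atbd_id, v.atbd_version, row_to_json(v), s.sha
            FROM atbd_versions v
            LEFT JOIN serialization s
                   ON s.atbd_id = v.atbd_id AND s.atbd_version = v.atbd_version
        """)
        for atbd_id, version, doc, recorded_sha in cur:
            current_sha = hashlib.sha256(
                json.dumps(doc, sort_keys=True, default=str).encode()
            ).hexdigest()
            if recorded_sha is None or recorded_sha != current_sha:
                stale.append((atbd_id, version, current_sha))
    return stale

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=nasa_apt")  # connection string is illustrative
    for atbd_id, version, sha in find_stale_atbds(conn):
        print(f"enqueue serialization for ATBD {atbd_id} v{version} ({sha})")
```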
@guidorice fantastic issue, really. 👏 Debouncing the requests and serializing after some time sounds like a good approach; however, it is difficult to know what this "time" is. Filling out the form can take quite some time, or be very quick if it is just a typo. I guess this could be part of the reason why the serialization is done on view. Do you know what the cost of serialization is (money/time/resources)? If it is low enough, maybe it is not a problem to serialize often and then clean up with a GC.
Notes from standup today:

- re: debouncing: we agreed that debouncing the notifications before serializing is needed, because the ATBD editing form has many parts that are accessed in a wizard-style prev/next interaction, and it could be wasteful to continually serialize as the user goes through. The severity of this depends on the cost/time to serialize, though (a debounce sketch follows this comment).
- re: serialization cost/time: I am going to do a test case of a complex ATBD and get a ballpark estimate.
- re: cleanup of old pdfs: we agreed that a garbage collection script is needed too, because when an ATBD is published it is no longer editable, and therefore there will be a 1:1 correspondence between an ATBD, its current version, and the PDF. So we only need to preserve the current PDF.

Also, Alyssa responded to my email inquiry, and she basically just confirmed that 1) her mode of development was to run commands in a local docker container, e.g. […]
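A minimal sketch of the debouncing idea discussed above, with a per-ATBD timer that restarts on every notification; the 60-second window and the function names are placeholders:

```python
# Sketch of per-ATBD debouncing of edit notifications. The 60-second window
# is a placeholder; serialize_atbd() stands in for the real pipeline call.
import threading

DEBOUNCE_SECONDS = 60
_pending: dict[str, threading.Timer] = {}
_lock = threading.Lock()

def serialize_atbd(atbd_id: str) -> None:
    print(f"serializing ATBD {atbd_id}")  # real serialization would happen here

def on_atbd_edited(atbd_id: str) -> None:
    """Called for every notification; only the last edit in a burst
    actually triggers serialization."""
    with _lock:
        timer = _pending.pop(atbd_id, None)
        if timer is not None:
            timer.cancel()  # restart the countdown on each new edit
        timer = threading.Timer(DEBOUNCE_SECONDS, serialize_atbd, args=(atbd_id,))
        _pending[atbd_id] = timer
        timer.start()
```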
Rough benchmarking results (of the old prototype serialization pipeline)

I locally modified the pipeline to time the individual steps. Example benchmark command (noting also `time`):

```
time docker run \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  tex \
  nasa-apt-staging-json \
  ATBD_2v1_9ad56ce0-99da-11ea-b5d4-fbcf4afd022a.json \
  nasa-apt-staging-atbd \
  nasa-apt-staging-scripts
```
Conclusion

The python script (the actual serialization) takes only ~1 second. All the rest (the s3 copying, etc.) takes ~8 seconds. Debouncing the serialization worker events from postgresql listen/notify is therefore not a priority. Assumption: the html and pdf serialization will take roughly the same time as the json -> latex serialization step.
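For future measurements, the per-step breakdown could be captured inside the pipeline script itself; a minimal sketch (the step names and placeholder bodies are illustrative, not the actual pipeline code):

```python
# Sketch of per-step timing inside the serialization pipeline script.
# Step names are illustrative; the bodies are placeholders.
import time
from contextlib import contextmanager

@contextmanager
def timed(step: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{step}: {time.perf_counter() - start:.2f}s")

def run_pipeline(atbd_json_key: str) -> None:
    with timed("fetch json from s3"):
        pass  # s3 download would go here
    with timed("json -> latex (python script)"):
        pass  # ~1 second in the rough benchmark above
    with timed("latex -> pdf/html"):
        pass
    with timed("copy outputs to s3"):
        pass
```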
ECS ContainerDefinitions can have an entrypoint and command. In a cloudformation deployment, then, the relevant command to execute a task in another Docker container would look something like the sketch below.
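For example, with boto3 this could be a run_task call with a container command override. The cluster, task definition, subnet, and container names below are placeholders, and the command arguments simply mirror the benchmark invocation earlier in this thread:

```python
# Sketch of kicking off a one-off serialization task on ECS with a command
# override. Cluster, task definition, subnet, and container names are placeholders.
import boto3

ecs = boto3.client("ecs")

def run_serialization_task(json_key: str) -> str:
    response = ecs.run_task(
        cluster="nasa-apt-cluster",              # placeholder
        taskDefinition="nasa-apt-tex",           # placeholder
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-00000000"],  # placeholder
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": "tex",               # container name in the task definition
                    "command": [
                        "nasa-apt-staging-json",
                        json_key,
                        "nasa-apt-staging-atbd",
                        "nasa-apt-staging-scripts",
                    ],
                }
            ]
        },
    )
    return response["tasks"][0]["taskArn"]
```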
This would unfortunately mean local development using […]. Possible alternatives: […]
@sharkinsspatial if I can get your feedback/advice on the thread thus far, that would be much appreciated.
Summary of this proposal versus the current implementation
Superseded by #250
This issue recaps the planning call earlier today, and lists the steps forward for moving the nasa-apt project from a prototype to maintainable production quality.
Background
This is focused on the serialization pipeline (database -> latex -> pdf, html) and on improving the development environment. The current serialization pipeline and dev environment have some issues that need to be fixed, and those issues also make the project difficult to develop on:
The serialization scripts are not baked into the Docker images; instead they are fetched and passed to each container's `ENTRYPOINT` on every run. This is inefficient, repetitive, and not great from a security perspective. It is best practice to save the scripts into each of the serialization Docker images at build time. It is interesting to note that the reason for the current design was somewhat of a workaround: the LaTeX Docker image is ~1.3GB, which can be time consuming to build and push to ECR.

Additional background is in Technical lessons from nasa-apt.
There are some improvements that could be made to the CI/CD and Cloudformation deployment scripts, and to the integration of the two; however, that is not covered in this issue.
Proposal
The key thing is we need to move the business logic for serializing ATBDs as close to the database as possible. It should not be triggered by the UI.
One proposal was to wrap the PostgREST service with a new API endpoint that would control the serialization process and also act as a facade to the PostgREST API. That would be an improvement, but I am concerned it adds complexity by introducing a new service and by maintaining a facade for the existing one.
The serialization business logic can be moved even closer to the database by using PostgreSQL's LISTEN/NOTIFY for external notification. The pg-bridge container could invoke a webhook on a new service, with the ATBD id as payload, each time an ATBD has been edited.
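pg-bridge provides this listen-and-forward behaviour off the shelf; purely to illustrate the mechanism, a minimal Python equivalent of the loop is sketched below. The channel name, payload shape, and webhook URL are assumptions:

```python
# Sketch of the LISTEN/NOTIFY -> webhook pattern that pg-bridge implements.
# Channel name, payload shape, and webhook URL are assumptions.
import json
import select

import psycopg2
import requests

WEBHOOK_URL = "http://serialization-worker:8000/serialize"  # placeholder
CHANNEL = "atbd_updates"                                     # placeholder

conn = psycopg2.connect("dbname=nasa_apt")  # illustrative connection string
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(f"LISTEN {CHANNEL};")

while True:
    # Block until postgres has something to deliver on the channel.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timeout: keep listening
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        payload = json.loads(note.payload)  # e.g. {"atbd_id": 1, "atbd_version": 2}
        requests.post(WEBHOOK_URL, json=payload, timeout=10)
```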
Data model addition
- Add a new `serialization` table which contains several values: the SHA checksum for the related ATBD and the datetime when it was calculated. The SHA checksum is used as a path and key into the rest of the serialization process: both the UI and the serialization worker will use it to locate the latest serialization output and its status (pending, errored out, etc). A sketch of the key derivation follows this list.
- […] the `atbd_versions` table. This should enable the manual triggering of serialization by causing the checksum to change.
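To illustrate how the checksum could serve as a path and key, a minimal sketch; the hash input (a canonical JSON dump of the ATBD version row) and the key layout are assumptions, not a settled design:

```python
# Sketch of deriving the checksum and the s3 keys for serialization output.
# The hash input and the key layout are assumptions.
import hashlib
import json

def content_sha(atbd_version_row: dict) -> str:
    canonical = json.dumps(atbd_version_row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()

def output_keys(atbd_id: int, version: int, sha: str) -> dict:
    prefix = f"{atbd_id}/{version}/{sha}"
    return {
        "json": f"{prefix}/atbd.json",
        "latex": f"{prefix}/atbd.tex",
        "pdf": f"{prefix}/atbd.pdf",
        "html": f"{prefix}/atbd.html",
        "status": f"{prefix}/status.json",  # pending / errored-out marker
    }
```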
Serialization worker

Create a new docker container implementing a webhook endpoint (the receiver of `pg-bridge` calls) that also has access to the PostgREST API. The serialization webhook's responsibilities are (a rough sketch of such a handler follows the list):

- Updating the `serialization` table via the PostgREST API (note: `pg-bridge` and/or the webhook should ignore changes to the `serialization` table itself to avoid a loop).
- Using `docker exec` (or whatever the equivalent is in an ECS environment) to invoke the json -> latex -> html, pdf conversions. The outputs and the status of the pipeline will be stored in s3 under the key of the SHA checksum.
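As a rough sketch only, here is a minimal Flask handler tying these responsibilities together. The endpoint path, PostgREST filters, payload shape, and the helpers content_sha / run_serialization_task (collected from the sketches above into a hypothetical pipeline_helpers module) are all assumptions:

```python
# Sketch of the serialization webhook worker. Everything here is illustrative:
# the endpoint, the PostgREST filters, and the helper module are placeholders.
import requests
from flask import Flask, request, jsonify

# Hypothetical module collecting the earlier sketches in this thread.
from pipeline_helpers import content_sha, run_serialization_task

app = Flask(__name__)
POSTGREST_URL = "http://postgrest:3000"  # placeholder

@app.route("/serialize", methods=["POST"])
def serialize():
    event = request.get_json()
    atbd_id, version = event["atbd_id"], event["atbd_version"]

    # 1. Fetch the current ATBD version through the PostgREST API.
    row = requests.get(
        f"{POSTGREST_URL}/atbd_versions",
        params={"atbd_id": f"eq.{atbd_id}", "atbd_version": f"eq.{version}"},
        timeout=10,
    ).json()[0]

    # 2. Record the checksum and a pending status in the serialization table
    #    (pg-bridge / the webhook must ignore writes to this table).
    sha = content_sha(row)
    requests.post(
        f"{POSTGREST_URL}/serialization",
        json={"atbd_id": atbd_id, "atbd_version": version,
              "sha": sha, "status": "pending"},
        timeout=10,
    )

    # 3. Kick off the json -> latex -> html/pdf conversions as an ECS task;
    #    outputs land in s3 under the checksum-derived key.
    task_arn = run_serialization_task(f"ATBD_{atbd_id}v{version}_{sha}.json")
    return jsonify({"sha": sha, "task": task_arn}), 202

if __name__ == "__main__":
    app.run(port=8000)
```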
How the above problems 1-5 are addressed by this proposal

[…] the `atbd_versions` table.