The Document Reader stack lets you deploy a serverless infrastructure to build an OCR application that produces spoken text extracted from images provided through a REST API. It follows the producer/consumer schema presented here and uses several AWS services:
- AWS API Gateway, to provide pre-signed URLs to upload documents;
- AWS Lambda, to provide the backend computation;
- AWS Simple Storage Service, to provide storage for input and output;
- AWS Simple Queue Service, to decouple the uploading part from the processing one;
- AWS DynamoDB, to persist references of completed jobs;
- AWS Rekognition, to provide the OCR engine as a service and extract text from documents;
- AWS Polly, to provide text-to-speech functionality.
This stack inherits from the Producer/Consumer stack the same logic for requesting the URL and pushing the message: the extension is a Lambda that acts as a consumer of the uploaded object and, by using Rekognition and Polly, creates speech from the text recognized in the images. Below is a schema of the stack as is:
Another schema highlights the parts in common with the Producer/Consumer and the Upload Form stacks.
A blog post is available here.
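To give an idea of what the CDK code wires together, here is a minimal sketch of the stack described above. It is not the repository's code: the CDK v1-style module imports, construct names, asset path, and IAM actions are assumptions made for illustration.

```typescript
import cdk = require('@aws-cdk/core');
import s3 = require('@aws-cdk/aws-s3');
import s3n = require('@aws-cdk/aws-s3-notifications');
import sqs = require('@aws-cdk/aws-sqs');
import lambda = require('@aws-cdk/aws-lambda');
import dynamodb = require('@aws-cdk/aws-dynamodb');
import iam = require('@aws-cdk/aws-iam');
import { SqsEventSource } from '@aws-cdk/aws-lambda-event-sources';

export class DocumentReaderSketchStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Bucket holding both the uploaded images and the produced .mp3 files.
    const contentBucket = new s3.Bucket(this, 'ContentBucket');

    // Queue that decouples the upload from the processing step.
    const uploadsQueue = new sqs.Queue(this, 'UploadsQueue');

    // Every new object pushed to the bucket produces a message on the queue.
    contentBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.SqsDestination(uploadsQueue)
    );

    // Table keeping a reference of each completed job.
    const jobsTable = new dynamodb.Table(this, 'JobsTable', {
      partitionKey: { name: 'documentId', type: dynamodb.AttributeType.STRING },
    });

    // Consumer Lambda: reads the queue, calls Rekognition and Polly,
    // stores the result back to S3 and the job reference to DynamoDB.
    const consumer = new lambda.Function(this, 'ConsumerFunction', {
      runtime: lambda.Runtime.NODEJS_10_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('src/consumer'), // hypothetical asset path
      environment: {
        BUCKET_NAME: contentBucket.bucketName,
        TABLE_NAME: jobsTable.tableName,
      },
    });

    consumer.addEventSource(new SqsEventSource(uploadsQueue));
    contentBucket.grantReadWrite(consumer);
    jobsTable.grantWriteData(consumer);

    // Rekognition and Polly have no grant helpers, so attach a policy directly.
    consumer.addToRolePolicy(new iam.PolicyStatement({
      actions: ['rekognition:DetectText', 'polly:SynthesizeSpeech'],
      resources: ['*'],
    }));
  }
}
```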
To use or modify the stack, just clone the repository and move to the templates/document-reader folder starting from the root of the repository, like this:
git clone https://github.com/made2591/immutable.templates
cd immutable.templates/templates/document-reader
# start deploy (see later)
The user asks API Gateway (1) for a pre-signed URL to upload a document. API Gateway triggers a Lambda function (2) that invokes the getSignedUrl action through the S3 API (3) and returns the URL to API Gateway (4), which forwards it directly to the user (6). The user is now able to push the document to S3 with the provided URL (7). When the document is uploaded, S3 puts a message on an SQS queue (8). The consumer retrieves the reference to the uploaded document by polling the SQS queue (9). With this message, the consumer asks API Gateway for permission to retrieve the original document (10). Once the pre-signed URL is generated and sent back (11), it can safely retrieve the content of the document directly from S3 (12). The consumer Lambda provides the retrieved document (12) to the Rekognition service (13) and gets back the extracted text (14), if any. After that, it sends this text to Polly (15) and gets back an AudioStream (16) ready to be uploaded to S3. Before going ahead, it saves the references of the document, the extracted text, and the produced output to a DynamoDB table (17). Finally, it saves the AudioStream as an .mp3 file to S3 (18), where the document was originally stored by the user.
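As a rough illustration of steps (9)-(18), the consumer Lambda could be written along these lines with the Node.js aws-sdk. This is a simplified sketch, not the repository's handler: it passes the S3 reference straight to Rekognition instead of going through the pre-signed GET URL of steps (10)-(12), and the environment variable, table attributes, and Polly voice are hypothetical.

```typescript
import { SQSEvent } from 'aws-lambda';
import * as AWS from 'aws-sdk';

const rekognition = new AWS.Rekognition();
const polly = new AWS.Polly();
const s3 = new AWS.S3();
const dynamo = new AWS.DynamoDB.DocumentClient();

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    // The SQS message body carries the S3 event notification.
    const s3Event = JSON.parse(record.body);
    const bucket = s3Event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(
      s3Event.Records[0].s3.object.key.replace(/\+/g, ' ')
    );

    // (13)-(14) Extract text from the uploaded image with Rekognition.
    const detection = await rekognition
      .detectText({ Image: { S3Object: { Bucket: bucket, Name: key } } })
      .promise();
    const text = (detection.TextDetections || [])
      .filter((t) => t.Type === 'LINE')
      .map((t) => t.DetectedText ?? '')
      .join(' ');
    if (!text) continue;

    // (15)-(16) Turn the extracted text into an audio stream with Polly.
    const speech = await polly
      .synthesizeSpeech({ OutputFormat: 'mp3', Text: text, VoiceId: 'Joanna' })
      .promise();

    // (17) Persist the job reference to DynamoDB.
    await dynamo
      .put({
        TableName: process.env.TABLE_NAME!,
        Item: { documentId: key, extractedText: text, output: `${key}.mp3` },
      })
      .promise();

    // (18) Store the audio next to the original document in S3.
    await s3
      .putObject({
        Bucket: bucket,
        Key: `${key}.mp3`,
        Body: speech.AudioStream as Buffer,
        ContentType: 'audio/mpeg',
      })
      .promise();
  }
};
```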
The only tools needed are Node.js (≥ 8.11.x) and the AWS Cloud Development Kit (AWS CDK). You can install the CDK by running
$ npm i -g aws-cdk
Just as with any other CDK stack, these are the main commands that can help you with the most common actions:
- `npm run build` - compile TypeScript to JS
- `npm run watch` - watch for changes and compile
- `cdk deploy` - deploy this stack to your default AWS account/region
- `cdk diff` - compare the deployed stack with the current state
- `cdk synth` - emit the synthesized CloudFormation template
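For example, assuming your AWS credentials and default region are already configured, a first deployment usually chains these commands like this:

```bash
# compile the TypeScript sources, preview the changes, then deploy
npm run build
cdk diff
cdk deploy
```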
Please read CONTRIBUTING.md for details on how to contact me.
Almost all the stacks proposed in this repository, and their implementations, were discussed in depth with the people below:
- Matteo Madeddu - Design, Implementation - Github, LinkedIn
- Guido Nebiolo - Design, Implementation - Github, LinkedIn
Thank you for your interest!
This project is licensed under the MIT License - see the LICENSE.md file for details.
- Fix architecture schemas