AWS Transcribe automatically converts speech to text using Amazon's own trained Machine Learning model.
Using it, I created a project that generates and synchronizes subtitles for a given input video file.
This repo contains the Terraform templates needed to deploy the solution on AWS, as well as the code used by the Lambdas and the code run by the ECS task (on Fargate) from a home-made Docker image uploaded to ECR.
- Put a video file as input in an S3 folder.
- Get the result as a `.srt` file (see the example below).
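
As an example, assuming the deployed Bucket names are known (they are placeholders below, since they depend on your deployment), the round trip could be scripted with boto3. The exact result key under `results/` is also an assumption, as it depends on the input file name:

```python
import boto3

s3 = boto3.client("s3")

# Upload the input video under inputs/ of the app_bucket to start the pipeline.
s3.upload_file("video.mp4", "<app_bucket>", "inputs/video.mp4")

# Once the pipeline has finished, fetch the generated subtitles from results/.
s3.download_file("<transcribe_result_bucket>", "results/video.srt", "video.srt")
```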
- The user puts the input video file into the `app_bucket` under `inputs/`.
- This triggers the `input_to_sqs` Lambda, which sends the key path of the input file to the `sqs_input` queue.
- A message received in this queue triggers the `trigger_ecs_task` Lambda (sketched below). The function will:
  - read and parse the message from the SQS queue,
  - trigger an ECS task, passing it the values (key path and Bucket name) fetched from the SQS message.
- The ECS task downloads the input file to its local filesystem, extracts the sound from it and uploads the `.mp3` result under `tmp/` of the `app_bucket`.
- Once a file is put under `tmp/`, the `trigger_transcribe_job` Lambda starts the Transcribe job, sending it the key path of the extracted sound as well as the Bucket name.
- The Transcribe job starts with the arguments given to it (key path of the `.mp3` file and the Bucket name).
- Once the Transcribe job is done, its result is uploaded into the `transcribe_result_bucket`.
- This result needs to be parsed into the `.srt` format. This is the job of the `parse_transcribe_result` Lambda, which is triggered by a Bucket notification when a file is uploaded to the root of the `transcribe_result_bucket`.
- Finally, the parsed and synchronized `.srt` file for the uploaded input video is uploaded into the `transcribe_result_bucket` under `results/`.
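
To illustrate the ECS triggering step, the `trigger_ecs_task` Lambda could look roughly like the following sketch. The message shape, environment variable names and network settings are assumptions for illustration, not the exact code of this repo:

```python
import json
import os

import boto3

ecs = boto3.client("ecs")


def handler(event, context):
    # Each record is a message produced by the input_to_sqs Lambda.
    for record in event["Records"]:
        message = json.loads(record["body"])  # assumed shape: {"bucket": ..., "key": ...}

        ecs.run_task(
            cluster=os.environ["ECS_CLUSTER"],
            taskDefinition=os.environ["TASK_DEFINITION"],
            launchType="FARGATE",
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": os.environ["SUBNET_IDS"].split(","),
                    "assignPublicIp": "DISABLED",
                }
            },
            # Pass the Bucket name and key path to the container as environment variables.
            overrides={
                "containerOverrides": [
                    {
                        "name": os.environ["CONTAINER_NAME"],
                        "environment": [
                            {"name": "BUCKET", "value": message["bucket"]},
                            {"name": "KEY", "value": message["key"]},
                        ],
                    }
                ]
            },
        )
```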
- `./code`
  The code directory is composed of 3 sub-directories: docker, lambdas and local.
  - `/lambdas`
    This directory contains the Python code used by the AWS Lambdas.
  - `/local`
    This folder was my starting point, and was used to validate my initial idea.
    It contains the Python code to locally test the Transcribe job. It takes a video path as an input and calls the AWS API to receive the `.srt` final result.
    To use it, export your AWS profile into the shell, create an S3 Bucket, fill up `config.json` and execute the Transcribe job with `python3 transcribe.py`.
  - `/docker`
    This part contains the Python code used by the ECS task to extract the sound from the video (see the sketch after this overview). The Dockerfile is used to build the Docker container, which needs to be pushed to the ECR repo.
- `./infrastructure`
  This directory contains all the necessary templates and resources to deploy the infrastructure on AWS.
  - `/compositions`
    Logical units of Terraform code. Each part defines some Terraform `modules` which call a group of Terraform `resources` defined in ./infrastructure/resources. For example, in `buckets` we can find the code used to deploy each S3 Bucket used by the solution. Since I want all the Buckets to be encrypted, I can re-use the same `module` structure I defined for all of them.
  - `/ecs_definition`
    JSON templates defining the ECS task definition. This template is populated by the ecs_definition.tf Terraform template file.
  - `/policies`
    All the policies used by the different components. These policies are templated using the same technique as the one used in the `ecs_definition` module.
  - `/resources`
    Terraform `resources` logically grouped together and called from the ./compositions part. For example, an S3 Bucket is defined as a group of `aws_s3_bucket`, `aws_s3_bucket_policy` and `aws_s3_bucket_public_access_block` Terraform resources.
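
To give an idea of what the `/docker` part does, here is a minimal sketch of the extraction step. The environment variable names, ffmpeg options and key layout are assumptions for illustration, not the exact code of this repo:

```python
import os
import subprocess

import boto3

s3 = boto3.client("s3")


def main():
    bucket = os.environ["BUCKET"]
    key = os.environ["KEY"]  # e.g. inputs/video.mp4

    local_video = "/tmp/input_video"
    local_audio = "/tmp/extracted_audio.mp3"

    # Download the input video from the app_bucket to the task's local filesystem.
    s3.download_file(bucket, key, local_video)

    # Extract the audio track as an .mp3 file with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-i", local_video, "-vn", "-acodec", "libmp3lame", local_audio],
        check=True,
    )

    # Upload the result under tmp/ so that the trigger_transcribe_job Lambda fires.
    audio_key = "tmp/" + os.path.basename(key).rsplit(".", 1)[0] + ".mp3"
    s3.upload_file(local_audio, bucket, audio_key)


if __name__ == "__main__":
    main()
```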
- `cd infrastructure/compositions/terraform_backend`
- Comment the S3 part of the `providers.tf` file from `terraform_backend`:
  ```
  terraform {
    required_version = ">= 0.12"
    // backend "s3" {
    // }
  }
  ```
- `terraform init --backend-config=backend.config`
- `terraform plan`
- `terraform apply`. Optionally `terraform apply --auto-approve`.
- Uncomment the S3 part:
  ```
  terraform {
    required_version = ">= 0.12"
    backend "s3" {
    }
  }
  ```
- `terraform init --backend-config=backend.config` and type `yes` to copy the local state into the deployed remote state Bucket.
- Remove any `.tfstate` or `.tfstate.backup` file from the current dir.
- `cd infrastructure/compositions/networking`
- `terraform init --backend-config=backend.config`
- `terraform plan`
- `terraform apply`. Optionally `terraform apply --auto-approve`.
- Apply the same commands for `infrastructure/compositions/buckets` and `infrastructure/compositions/media_processing`.
- Build and upload to ECR the Docker image used by the ECS task. With fish shell:
  ```
  eval (aws ecr get-login --no-include-email --region <region>)
  docker build -t ecr_media_processing .
  docker tag ecr_media_processing:latest <account_id>.dkr.ecr.<region>.amazonaws.com/ecr_media_processing:latest
  docker push <account_id>.dkr.ecr.<region>.amazonaws.com/ecr_media_processing:latest
  ```
However, for words with very close pronunciations, the model may not be accurate enough (although it is constantly improving).
In the F.R.I.E.N.D.S extract I used as a test, Phoebe says:

> We went to a self-defense class

which is transcribed as:

> Way went to a self-defense class

However annoying, this can easily be fixed by editing the resulting `.srt` file with a simple text editor.
- **All-in with Lambda**
  My plan was to use only Lambda functions to do everything. I was quickly limited for the following reasons:
  - I needed to locally download the input video to extract the sound from it, and the `/tmp` storage of a Lambda is limited to 512MB.
  - There was a risk of a too-long processing time, which means the Lambda could have timed out.

  Because of these limitations, I decided to go for an ECS task running on Fargate.
- **AWS Transcribe tmp file**
  Transcribe creates a `.write_access_check_file.temp` at the root of the Bucket in which its end-result will be uploaded. This means that the `parse_transcribe_result` Lambda will be triggered by the creation of this file and will try to parse it, resulting in an error (since the Lambda expects a `.json` file resulting from the Transcribe job).
  The solution was to trigger this Lambda only when the file uploaded to the root of the Bucket ends with `.json` (using the `suffix` feature of the Bucket notification).
- **Transcribe and key path**
  My initial plan was to use only one Bucket for everything.
  However, Transcribe does not allow specifying a key path under which to upload its end-result (otherwise I would have used the already deployed `app_bucket` and uploaded the final result under something like `results/`). Only a Bucket can be specified in the Transcribe job, as illustrated below. I could have used the `app_bucket` and uploaded the results at its root, but I think this breaks the logic of having a dir-like structure in this Bucket.
  The solution I chose was to create another Bucket (`transcribe_result_bucket`) to hold the end-result of the Transcribe job.
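
For reference, this is roughly what a job submission looks like with boto3; the output destination is given only as a Bucket name. The job name, language code and URIs below are placeholders, not values taken from this repo:

```python
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="subtitles-job-example",  # must be unique per job
    LanguageCode="en-US",
    MediaFormat="mp3",
    Media={"MediaFileUri": "s3://<app_bucket>/tmp/extracted_audio.mp3"},
    # Only a destination Bucket can be given, not a key path inside it,
    # hence the dedicated transcribe_result_bucket.
    OutputBucketName="<transcribe_result_bucket>",
)
```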
- ECS on Fargate is suitable for this use-case because:
  - I do not need to manage the underlying instances.
  - I have 10GB of Docker layer storage, plus an additional 4GB for volume mounts, which is enough to download most input video files locally.
- The Lambda functions run inside private subnets and use VPC endpoints to reach the different services.
- The same goes for the ECS Cluster, which instead uses a NAT Gateway to pull the Docker container from ECR.
- If you want to test the solution by yourself, I added the `video.mp4` file which you can use as an input.
- The result from the Transcribe job can be found under `tmp_transcribe_result.json`.
- The parsed final result can be found under `result.srt` (a simplified sketch of the JSON-to-SRT conversion follows below).
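
Converting the Transcribe JSON into `.srt` essentially means grouping the timed words from `results.items` into numbered, timestamped blocks. The following is a simplified sketch of that idea, not the exact code of `parse_transcribe_result`; the fixed words-per-block grouping is an arbitrary choice:

```python
import json


def to_srt_timestamp(seconds: float) -> str:
    # SRT timestamps use the HH:MM:SS,mmm format.
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def transcribe_json_to_srt(path: str, words_per_block: int = 10) -> str:
    with open(path) as f:
        items = json.load(f)["results"]["items"]

    # Keep only the timed words; punctuation items carry no timestamps.
    words = [i for i in items if i["type"] == "pronunciation"]

    blocks = []
    for n, offset in enumerate(range(0, len(words), words_per_block), start=1):
        chunk = words[offset:offset + words_per_block]
        text = " ".join(w["alternatives"][0]["content"] for w in chunk)
        begin = to_srt_timestamp(float(chunk[0]["start_time"]))
        end = to_srt_timestamp(float(chunk[-1]["end_time"]))
        blocks.append(f"{n}\n{begin} --> {end}\n{text}\n")

    return "\n".join(blocks)


print(transcribe_json_to_srt("tmp_transcribe_result.json"))
```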
- Sharding with Kinesis.
- Adding a frontend.
These solutions might be implemented in the future in a private repo.