TODO
- A (dead-easy) data platform for you to deploy at no cost and play around with;
- Straightforward options and template parameters to integrate the platform with your existing AWS resources, including reusing your existing KMS keys, AWS Cognito UserPool, and AWS ApiGateway custom authorizers;
- By-the-book configurations of plenty of AWS services, including examples for:
  - AWS CloudWatch log groups, content-filtered event rules (EventBridge), and alarms;
  - KMS-encrypted S3 buckets configured with AWS CloudTrail logging and robust bucket policies;
  - KMS-encrypted standard and FIFO SQS queues, with producers and consumers;
  - Exotic AWS ApiGateway integrations (e.g. S3) and other hacks.
- AWS Cloudformation tips & tricks, including custom macros, resources, and nested templates;
- Fully serverless and cost-efficient architecture design;
- A highly available and scalable ingress system;
- As little code as possible;
- As agnostic as possible.
- AWS CLI (1.17.14+);
- Python (3.8.2+);
- boto3 (1.16.8+);
- AWS SAM CLI (0.47.0+).
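
Before going further, a quick version check can save a failed build later; a minimal sketch, assuming default installs on your PATH:

```sh
# Verify the local toolchain meets the minimum versions listed above.
aws --version                                         # aws-cli/1.17.14+
python3 --version                                     # Python 3.8.2+
python3 -c 'import boto3; print(boto3.__version__)'   # boto3 1.16.8+
sam --version                                         # SAM CLI, version 0.47.0+
```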
This project contains git submodules. Make sure to pull all submodules using `submodule update`:

```sh
git submodule update --init --recursive
```
The AWS SAM CLI has a known limitation: it cannot properly build AWS Lambdas (and other local file-system builds) in nested applications.
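
A common workaround, sketched below under the assumption that each nested application keeps its own `template.yaml` in its own directory (the layout here is hypothetical), is to build the nested applications individually before building the root template:

```sh
# Hypothetical layout: one directory per nested application, each with
# its own template.yaml. Build each nested app in place, then the root.
for dir in nested-apps/*/; do    # directory name is an assumption
    (cd "$dir" && sam build)
done
sam build                        # finally, build the root application
```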
- `CAPABILITY_NAMED_IAM`: these Cloudformation templates include resources that affect permissions in your AWS account (e.g. creating new AWS Identity and Access Management (IAM) users). You must explicitly acknowledge this by specifying this capability.
- `CAPABILITY_AUTO_EXPAND`: some of these Cloudformation templates contain macros and Cloudformation nested applications. Macros perform custom processing on templates. You must acknowledge this capability.
This assumes your AWS CLI environment is properly configured.
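
If in doubt, you can check which identity and region the CLI will use:

```sh
# Confirm the active credentials and default region before deploying.
aws sts get-caller-identity
aws configure get region
```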
From the project root directory, run the AWS SAM CLI to build and deploy the Cloudformation application. It is recommended to use the `--guided` option in order to configure the application deployment, including template parameters:

```sh
sam build
sam deploy --guided --capabilities "CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND"
```
You will be prompted with a selection menu, which generates a configuration recap as follows:

```
Deploying with following values
===============================
Stack name           : data-platform
Region               : eu-west-1
Confirm changeset    : True
Deployment s3 bucket : <your_cfn_deployment_bucket>
Capabilities         : ["CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND"]
Parameter overrides  : {"EnableEncryption": "True", "KmsKeyArn": "NONE", "EnableApiAuthorization": "True", "ApiAuthorizerArn": "NONE"}
Signing Profiles     : {}
```
Deploy the application to your AWS account by confirming the Cloudformation changeset.
The data platform combines a number of AWS services and can be integrated and used in a wide range of scenarios. The following examples show some ways to start playing around with some of its basic features, available out of the box.
If you've configured the data platform application to enable `ApiAuthorization` and create its own Cognito UserPool, you will need to create a user and request an authorized token in order to use the data platform ingress API.
You may do so programmatically using the AWS CLI, as follows:
Get your UserPool id. The data platform's Cognito UserPool should be named `data_platform-user_pool`:

```sh
USER_POOL_ID=`aws cognito-idp list-user-pools \
    --max-results 10 \
    --query 'UserPools[*].[Name, Id]' \
    --output 'text' | awk '/data_platform-user_pool/ { print $2 }'`
echo 'UserPool id:' $USER_POOL_ID
```
Get your UserPool client id. This is a client application declaration, created along with the user pool, that provides a way to generate the authentication tokens used to authorize a user. It has been configured to use a client secret, along with various auth flows:

```sh
USER_POOL_CLIENT_ID=`aws cognito-idp list-user-pool-clients \
    --user-pool-id $USER_POOL_ID \
    --query 'UserPoolClients[0].ClientId' \
    --output 'text'`
echo 'UserPool client id:' $USER_POOL_CLIENT_ID
```
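
If you want to confirm which auth flows the client allows, you can inspect its configuration (`ExplicitAuthFlows` is a standard field of the `describe-user-pool-client` output):

```sh
# Inspect the client configuration: which explicit auth flows are enabled.
aws cognito-idp describe-user-pool-client \
    --user-pool-id $USER_POOL_ID \
    --client-id $USER_POOL_CLIENT_ID \
    --query 'UserPoolClient.ExplicitAuthFlows'
```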
Get your UserPool client secret. This will be used to generate a secret hash that the UserPool client verifies when authenticating:

```sh
USER_POOL_CLIENT_SECRET=`aws cognito-idp describe-user-pool-client \
    --user-pool-id $USER_POOL_ID \
    --client-id $USER_POOL_CLIENT_ID \
    --query 'UserPoolClient.ClientSecret' \
    --output 'text'`
```
Create a new UserPool user. This requires setting a temporary password, which the user will have to change, as we will do later on:

```sh
USER_POOL_USER_NAME='[email protected]'
USER_POOL_USER_PASSWORD=`uuidgen`
aws cognito-idp admin-create-user \
    --user-pool-id $USER_POOL_ID \
    --username $USER_POOL_USER_NAME \
    --temporary-password $USER_POOL_USER_PASSWORD
```
Generate the UserPool user secret hash, as explained above:

```sh
USER_POOL_SECRET_HASH=`echo -n "$USER_POOL_USER_NAME$USER_POOL_CLIENT_ID" \
    | openssl dgst -sha256 -hmac "$USER_POOL_CLIENT_SECRET" -binary \
    | openssl base64`
```
Authenticate to the UserPool using all the credentials set above. Please note that the first authentication session will require an extra step, as the user will have to change their password:

```sh
USER_POOL_AUTH_SESSION=`aws cognito-idp admin-initiate-auth \
    --auth-flow 'ADMIN_NO_SRP_AUTH' \
    --user-pool-id $USER_POOL_ID \
    --client-id $USER_POOL_CLIENT_ID \
    --auth-parameters "USERNAME=$USER_POOL_USER_NAME,PASSWORD=$USER_POOL_USER_PASSWORD,SECRET_HASH=$USER_POOL_SECRET_HASH" \
    --query 'Session' \
    --output 'text'`
```
Set up a new password and get your authorization token:

```sh
USER_POOL_USER_PASSWORD_NEW=`uuidgen`
USER_POOL_AUTHORIZATION_TOKEN=`aws cognito-idp admin-respond-to-auth-challenge \
    --user-pool-id $USER_POOL_ID \
    --client-id $USER_POOL_CLIENT_ID \
    --session $USER_POOL_AUTH_SESSION \
    --challenge-name 'NEW_PASSWORD_REQUIRED' \
    --challenge-responses "NEW_PASSWORD=$USER_POOL_USER_PASSWORD_NEW,USERNAME=$USER_POOL_USER_NAME,SECRET_HASH=$USER_POOL_SECRET_HASH" \
    --query 'AuthenticationResult.IdToken' \
    --output 'text'`
echo 'UserPool id token:' $USER_POOL_AUTHORIZATION_TOKEN
```
Your user is now fully set up. Make sure to save the latest password you generated. Whenever you need a renewed authorization token, you may use the following command:

```sh
USER_POOL_AUTHORIZATION_TOKEN=`aws cognito-idp admin-initiate-auth \
    --auth-flow 'ADMIN_NO_SRP_AUTH' \
    --user-pool-id $USER_POOL_ID \
    --client-id $USER_POOL_CLIENT_ID \
    --auth-parameters "USERNAME=$USER_POOL_USER_NAME,PASSWORD=$USER_POOL_USER_PASSWORD_NEW,SECRET_HASH=$USER_POOL_SECRET_HASH" \
    --query 'AuthenticationResult.IdToken' \
    --output 'text'`
echo 'UserPool id token:' $USER_POOL_AUTHORIZATION_TOKEN
```
This example showed how to programmatically change a UserPool user's password using the AWS CLI and an admin account. It is your responsibility to provide your users with a proper way of authenticating.
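
As one hedged illustration (not part of the platform itself), a user holding their own credentials could rotate their password with the non-admin `change-password` call, which takes an access token rather than the id token queried above:

```sh
# Sketch only: fetch an access token for the user, then let them change
# their own password. Variable names below are illustrative.
USER_POOL_ACCESS_TOKEN=`aws cognito-idp admin-initiate-auth \
    --auth-flow 'ADMIN_NO_SRP_AUTH' \
    --user-pool-id $USER_POOL_ID \
    --client-id $USER_POOL_CLIENT_ID \
    --auth-parameters "USERNAME=$USER_POOL_USER_NAME,PASSWORD=$USER_POOL_USER_PASSWORD_NEW,SECRET_HASH=$USER_POOL_SECRET_HASH" \
    --query 'AuthenticationResult.AccessToken' \
    --output 'text'`
USER_POOL_USER_PASSWORD_SELF=`uuidgen`
aws cognito-idp change-password \
    --previous-password $USER_POOL_USER_PASSWORD_NEW \
    --proposed-password $USER_POOL_USER_PASSWORD_SELF \
    --access-token $USER_POOL_ACCESS_TOKEN
```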
The data platform bundles an ApiGateway endpoint which proxies and directly integrates with S3, providing a very reliable way of ingesting data at high speed and volume (please refer to the Ingress API README for more details).
You may do so programmatically using the AWS CLI, as follows:
Get your ingress API endpoint:

```sh
INGRESS_API_AWS_REGION=`aws configure get region`
INGRESS_API_ID=`aws apigateway get-rest-apis \
    --query 'items[*].[name,id]' \
    --output 'text' | awk '/data_platform-ingress/ { print $2 }'`
INGRESS_API_ENDPOINT="https://$INGRESS_API_ID.execute-api.$INGRESS_API_AWS_REGION.amazonaws.com/main"
echo 'Ingress API endpoint:' $INGRESS_API_ENDPOINT
```
You can publish JSON data to any table in your data platform using `POST /table/<tableName>/object`. Make sure the `Content-Type` header of your request is set. If you've enabled `ApiAuthorization`, make sure to pass the required authorization info to your API authorizer (if using the data platform's own Cognito UserPool, copy the token obtained above into the `Authorization` header). The following will POST a new document to the `myTable` table:

```sh
curl -d '{"key1":"value1", "key2":"value2"}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: $USER_POOL_AUTHORIZATION_TOKEN" \
    -X 'POST' \
    "$INGRESS_API_ENDPOINT/table/myTable/object"
```
If everything goes to plan, the API should return a `200 OK` response containing a JSON-encoded body as follows, meaning that the document has been properly copied to the ingress bucket and will be run through the platform:

```json
{
    "success": true,
    "data": {
        "id": "e7439ba1-e646-4aad-a18a-e7c381202721"
    }
}
```
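
If you want to capture the returned id for the tracking step below, you could pipe the response through `jq` (assuming it is installed); the variable name here is illustrative:

```sh
# Assumption: jq is available locally. POST the document and keep the id.
INGRESS_OBJECT_ID=`curl -s -d '{"key1":"value1", "key2":"value2"}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: $USER_POOL_AUTHORIZATION_TOKEN" \
    -X 'POST' \
    "$INGRESS_API_ENDPOINT/table/myTable/object" | jq -r '.data.id'`
echo 'Ingress object id:' $INGRESS_OBJECT_ID
```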
The `data.id` field corresponds to a unique AWS request id which allows you to track your file throughout the platform. As in the example above, checking the data platform's lake ingress S3 bucket, you will see that the document has been saved under the key `id=e7439ba1-e646-4aad-a18a-e7c381202721`, with its destination table and request time saved as object metadata:

```sh
INGRESS_AWS_ACCOUNT_ID=`aws sts get-caller-identity \
    --query 'Account' \
    --output 'text'`
aws s3api head-object \
    --bucket "$INGRESS_AWS_ACCOUNT_ID-lake-ingress" \
    --key 'id=e7439ba1-e646-4aad-a18a-e7c381202721' \
    --query '[ContentType,Metadata]'
```

```json
[
    "text/plain;base64",
    {
        "request-time": "27/Feb/2021:16:17:38 +0000",
        "table": "myTable"
    }
]
```
The document has been saved base64-encoded (see the decode sketch after this list). The lake functions will then take over and:
- Validate the document type;
- Normalize the document structure to be easily parsable in AWS Athena.
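
Since the stored body is base64-encoded, you can pull the object back down and decode it to confirm it round-tripped intact; a minimal sketch, assuming a GNU coreutils `base64`:

```sh
# Download the raw (base64-encoded) body, then decode it locally.
aws s3api get-object \
    --bucket "$INGRESS_AWS_ACCOUNT_ID-lake-ingress" \
    --key 'id=e7439ba1-e646-4aad-a18a-e7c381202721' \
    /tmp/ingress-object.b64 > /dev/null
base64 --decode /tmp/ingress-object.b64
```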
TODO
TODO
All contributions are welcome! Depending on your interest and skill, you can help build and maintain the different parts of the project:
- **Improve on the resources configuration**: make pull requests, and report bugs and misconfigurations of resources;
- **Start open discussions**: feel free to contact me, discuss questionable architecture design patterns, and share ideas on how this data platform framework could be improved.