
Real-Time Analytics with Spark Streaming Ver 1.3

This module provides a configurable set of AWS infrastructure for real-time analytics. The full set of AWS resources is documented below, but the data pipeline this module produces is built around three core services: Kinesis, Firehose, and S3.

How To Use This Module

To reference this module from your project, add a module block to your existing Terraform configuration.

For example, the following block shows the minimally required arguments, using the relative source path from the bundled examples; a sketch of pinning a released version such as 1.3.0 follows the block.

module "analytics" {
  source = "../../"

  project_name   = "example"

  log_group_name = aws_cloudwatch_log_group.project_log_group.name
  kinesis_data_producers = {
    aws = [aws_iam_role.producer.arn]
  }
  s3_data_consumers = {
    aws = [aws_iam_role.consumer.arn]
  }
}
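
To pin a released version such as 1.3.0 instead of using a relative path, the source can point at this repository with a ref. This is a sketch only: the <org>/<repo> placeholders and the v1.3.0 tag name are assumptions to be replaced with the real repository location and release tag.

module "analytics" {
  # Hypothetical repository location and tag; substitute the real values.
  source = "git::ssh://git@github.com/<org>/<repo>.git?ref=v1.3.0"

  project_name   = "example"

  log_group_name = aws_cloudwatch_log_group.project_log_group.name
  kinesis_data_producers = {
    aws = [aws_iam_role.producer.arn]
  }
  s3_data_consumers = {
    aws = [aws_iam_role.consumer.arn]
  }
}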

See examples for more, or the variable and output reference below.

How To Contribute

  1. Clone this repo: git clone git@github.com:<org>/<repo>
  2. Prepare the environment: cd <repo>; scripts/prep.sh
  3. Source the helper functions: . scripts/functions.sh

Provided functions

Function Description
dev-start Launches the development container interactively; useful if you want to plan or apply the examples.
dev-docs Uses terraform-docs to update documentation.
dev-fmt Runs terraform fmt.
dev-lint Uses tflint to lint the Terraform.
dev-test Runs terraform test.

Note

The only tooling required on a contributor machine is git and docker.

Module documentation

Requirements

The following requirements are needed by this module:

Providers

The following providers are used by this module:

Resources

The following resources are used by this module:

Required Inputs

The following input variables are required:

kinesis_data_producers

Description: Map of principals allowed to put records into the Kinesis stream

Example:

kinesis_data_producers = {
  aws = [
    "arn:aws:iam::123456789012:user/JohnDoe",
    "arn:aws:iam::123456789012:role/ec2_app/kinesis_role"

  ]
  federated = ["arn:aws:iam::123456789012:saml-provider/okta"]
}

Type:

object({
    aws       = optional(set(string))
    federated = optional(set(string))
  })

log_group_name

Description: The name of the log group in which to create the Firehose logging stream

Type: string
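
If you are creating the log group alongside the module, it might look like the following. This is a minimal sketch; the name and retention are illustrative values, not requirements of the module.

resource "aws_cloudwatch_log_group" "project_log_group" {
  # Any log group works; the module only needs its name.
  name              = "/analytics/example"
  retention_in_days = 30
}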

project_name

Description: Name of the project this module is included in; used when naming resources

Type: string

s3_data_consumers

Description: Map of principals allowed to read from the target S3 bucket

Example:

s3_data_consumers = {
  aws = [
    "arn:aws:iam::123456789012:user/JohnDoe",
    "arn:aws:iam::123456789012:role/ec2_app/kinesis_role"

  ]
  federated = ["arn:aws:iam::123456789012:saml-provider/okta"]
}

Type:

object({
    aws       = optional(set(string))
    federated = optional(set(string))
  })

Optional Inputs

The following input variables are optional (have default values):

Description: The number of incoming records during a 5-minute period above which the alarm should trigger

Type: number

Default: 1000

Description: The number of incoming records during a 5-minute period below which the alarm should trigger

Type: number

Default: 50

Description: The set of email addresses to subscribe to alarm notifications
Each will receive an initial email asking to confirm the subscription

Type: set(string)

Default: []

Description: Buffer incoming data for the specified period of time, in seconds, before delivering it to the destination
Note that both this and buffering_size may be set

Type: number

Default: 400

Description: Buffer incoming data to the specified size, in MBs, before delivering it to the destination
Note that both this and buffering_interval may be set

Type: number

Default: 10
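
Together the two buffering settings behave like standard Firehose buffering hints: delivery happens as soon as either the size or the interval condition is met. A minimal sketch, assuming the variables are named buffering_interval and buffering_size as the descriptions above suggest:

# Assumed variable names, inferred from the descriptions above.
# Deliver roughly every five minutes, or sooner once 64 MB has accumulated.
buffering_interval = 300
buffering_size     = 64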

Description: The ARN of a Lambda function to use if data transformation in Firehose is desired
To control which revision is executed, ensure you specify it in the ARN

Type: string

Default: ""

Description: Supply a configuration object if you wish to use dynamic partitioning
If dynamic partitioning is enabled, you must provide jq_metadata_query, s3_dynamic_prefix, and s3_error_prefix
See the AWS blog post and documentation for more information

Of particular note:

When you use the Data Transformation feature in Firehose, the deaggregation will be applied before the Data Transformation. Data coming into Firehose will be processed in the following order: Deaggregation → Data Transformation via Lambda → Partitioning Keys.

Type:

object({
    jq_metadata_query = optional(string, "")
    s3_dynamic_prefix = optional(string, "")
    s3_error_prefix   = optional(string, "")

    enable_newline_appending    = optional(bool, false)
    enable_record_deaggregation = optional(bool, false)
    record_deaggregation_config = optional(object({
      type      = optional(string, "JSON")
      delimiter = optional(string, "")
    }), {})
  })

Default: {}
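
As a sketch only, a configuration that partitions on a customer_id field in the incoming JSON might look like this, used alongside the enable flag described below. The jq query, prefixes, and field name are illustrative; the !{partitionKeyFromQuery:...} and !{firehose:error-output-type} expressions are standard Firehose prefix namespaces.

dynamic_partitioning_config = {
  # Extract the partition key from each JSON record.
  jq_metadata_query = "{customer_id: .customer_id}"

  # Write matched records under a per-customer prefix, and failures elsewhere.
  s3_dynamic_prefix = "data/customer_id=!{partitionKeyFromQuery:customer_id}/"
  s3_error_prefix   = "errors/!{firehose:error-output-type}/"

  enable_newline_appending    = true
  enable_record_deaggregation = false
}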

Description: Whether to enable Firehose's dynamic partitioning; dynamic_partitioning_config must be supplied if true

Changing this after creation will force destruction and recreation of the Firehose stream!

Type: bool

Default: false

Description: Current deployment environment name ('Dev', 'Test', or 'Prod')

Type: string

Default: "Dev"

Description: The KMS key to use for encryption
Note that this is required if either use_kms_for_kinesis or use_kms_for_s3 is true
Note also that the key policy will need to permit the Firehose role (firehose_role_arn) to perform certain actions:

  • kms:Decrypt if used for Kinesis
  • kms:GenerateDataKey if used for S3

Type: string

Default: ""

Description: The number of hours records remain accessible in the Kinesis stream

Type: number

Default: 24

Description: The number of shards that the Kinesis stream will use, ignored if stream_mode is 'ON_DEMAND'

Type: number

Default: 1

Description: The capacity mode for the Kinesis stream ('PROVISIONED', or 'ON_DEMAND')

Type: string

Default: "ON_DEMAND"

Description: Use KMS key provided in kms_key_id for encryption of data in kinesis

Type: bool

Default: false

Description: Use KMS key provided in kms_key_id for S3 encryption (This is required for cross-account access)

Type: bool

Default: false

Outputs

The following outputs are exported:

Description: The ARN of the Kinesis stream accepting source input for the analytics pipeline.
This will be needed to grant the appropriate permissions to data producers' roles.
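
For example, a producer role's policy might reference it as follows. The kinesis_stream_arn output name is an assumption (the exported name is not shown here), and aws_iam_role.producer stands in for whatever role your producers use.

data "aws_iam_policy_document" "producer" {
  statement {
    actions = [
      "kinesis:PutRecord",
      "kinesis:PutRecords",
      "kinesis:DescribeStreamSummary",
    ]
    # Assumed output name; check the module's outputs for the exact attribute.
    resources = [module.analytics.kinesis_stream_arn]
  }
}

resource "aws_iam_role_policy" "producer" {
  name   = "analytics-producer"
  role   = aws_iam_role.producer.id
  policy = data.aws_iam_policy_document.producer.json
}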

Description: Link to the created dashboard

Description: The ARN of the role Firehose will use.
This will be useful in resource policies such as:

  • Data transformation Lambda
  • KMS key policy

Description: The S3 bucket to which Firehose will write output

Future work

Tidying

  1. CI/CD for the module
  2. Module publishing
  3. Tool improvement
  4. Get linting, testing, formatting, etc. into a pre-commit hook
  5. Get tidiness tools working on examples as well
  6. Add Auto-versioning

Functional Improvements

  1. Clean up the handling of the dynamic partitioning configuration
  2. Support for conversion of data formats and schema based on existing Glue data
  3. Allow for more granular access control to the S3 bucket
  4. Additional modularity

Examples Improvements

  1. Examples' module source should be updated to point to the source repo (to prevent dev confusion)
  2. Examples should include additional comments
  3. Additional examples, particularly around dynamic partitioning
  4. Testing of the examples

Test improvements

  1. Switch over to using the Python tftest library, rather than the built-in framework
  2. Add tests that exercise more functionality, end to end
  3. Add tests verifying compatibility with additional tool versions
