
Real-Time Analytics with Spark Streaming Ver 1.3

This module provides a configurable set of AWS infrastructure for real-time analytics. The full set of AWS resources is documented below, but the data pipeline this module produces is built around three core services: Kinesis, Firehose, and S3.

How To Use This Module

To reference this module from your project, add a module block to your existing Terraform configuration.

For example, the following block shows the minimally required arguments, using the relative source path from the bundled examples; a sketch of pinning a released version such as 1.3.0 follows the block.

module "analytics" {
  source = "../../"

  project_name   = "example"

  log_group_name = aws_cloudwatch_log_group.project_log_group.name
  kinesis_data_producers = {
    aws = [aws_iam_role.producer.arn]
  }
  s3_data_consumers = {
    aws = [aws_iam_role.consumer.arn]
  }
}
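
To pin a released version such as 1.3.0 instead of using a relative path, the source can point at this repository with a ref. This is a sketch only: the <org>/<repo> placeholders and the v1.3.0 tag name are assumptions to be replaced with the real repository location and release tag.

module "analytics" {
  # Hypothetical repository location and tag; substitute the real values.
  source = "git::ssh://git@github.com/<org>/<repo>.git?ref=v1.3.0"

  project_name   = "example"

  log_group_name = aws_cloudwatch_log_group.project_log_group.name
  kinesis_data_producers = {
    aws = [aws_iam_role.producer.arn]
  }
  s3_data_consumers = {
    aws = [aws_iam_role.consumer.arn]
  }
}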

See examples for more, or the variable and output reference below.

How To Contribute

  1. Clone this repo: git clone git@github.com:<org>/<repo>
  2. Prepare the environment: cd <repo>; scripts/prep.sh
  3. Source the helper functions: . scripts/functions.sh

Provided functions

Function Description
dev-start Launches the development container interactively; useful if you want to plan or apply the examples.
dev-docs Uses terraform-docs to update documentation.
dev-fmt Runs terraform fmt.
dev-lint Uses tflint to lint the Terraform.
dev-test Runs terraform test.

Note

The only tooling required on a contributor machine is git and docker.

Module documentation

Requirements

The following requirements are needed by this module:

Providers

The following providers are used by this module:

Resources

The following resources are used by this module:

Required Inputs

The following input variables are required:

kinesis_data_producers

Description: Map of principals allowed to put records into the Kinesis stream

Example:

kinesis_data_producers = {
  aws = [
    "arn:aws:iam::123456789012:user/JohnDoe",
    "arn:aws:iam::123456789012:role/ec2_app/kinesis_role"

  ]
  federated = ["arn:aws:iam::123456789012:saml-provider/okta"]
}

Type:

object({
    aws       = optional(set(string))
    federated = optional(set(string))
  })

log_group_name

Description: The name of the log group in which to create the Firehose logging stream

Type: string
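
If you are creating the log group alongside the module, it might look like the following. This is a minimal sketch; the name and retention are illustrative values, not requirements of the module.

resource "aws_cloudwatch_log_group" "project_log_group" {
  # Any log group works; the module only needs its name.
  name              = "/analytics/example"
  retention_in_days = 30
}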

project_name

Description: Name of the project this module is included in; used when naming resources

Type: string

s3_data_consumers

Description: Map of principals allowed to read from the target S3 bucket

Example:

s3_data_consumers = {
  aws = [
    "arn:aws:iam::123456789012:user/JohnDoe",
    "arn:aws:iam::123456789012:role/ec2_app/kinesis_role"

  ]
  federated = ["arn:aws:iam::123456789012:saml-provider/okta"]
}

Type:

object({
    aws       = optional(set(string))
    federated = optional(set(string))
  })

Optional Inputs

The following input variables are optional (have default values):

Description: The number of incoming records during a 5-minute period above which the alarm should trigger

Type: number

Default: 1000

Description: The number of incoming records during a 5-minute period below which the alarm should trigger

Type: number

Default: 50

Description: The set of email addresses to subscribe to alarm notifications
Each will receive an initial email asking to confirm the subscription

Type: set(string)

Default: []

Description: Buffer incoming data for the specified period of time, in seconds, before delivering it to the destination
Note that both this and buffering_size may be set

Type: number

Default: 400

Description: Buffer incoming data to the specified size, in MBs, before delivering it to the destination
Note that both this and buffering_interval may be set

Type: number

Default: 10
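
Together the two buffering settings behave like standard Firehose buffering hints: delivery happens as soon as either the size or the interval condition is met. A minimal sketch, assuming the variables are named buffering_interval and buffering_size as the descriptions above suggest:

# Assumed variable names, inferred from the descriptions above.
# Deliver roughly every five minutes, or sooner once 64 MB has accumulated.
buffering_interval = 300
buffering_size     = 64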

Description: The ARN of a Lambda function to use if data transformation in Firehose is desired
To control which revision is executed, ensure you specify it in the ARN

Type: string

Default: ""

Description: Supply a configuration object if you wish to use dynamic partitioning
If dynamic partitioning is enabled, you must provide jq_metadata_query, s3_dynamic_prefix, and s3_error_prefix
See the AWS blog post and documentation for more information

Of particular note:

When you use the Data Transformation feature in Firehose, the deaggregation will be applied before the Data Transformation. Data coming into Firehose will be processed in the following order: Deaggregation → Data Transformation via Lambda → Partitioning Keys.

Type:

object({
    jq_metadata_query = optional(string, "")
    s3_dynamic_prefix = optional(string, "")
    s3_error_prefix   = optional(string, "")

    enable_newline_appending    = optional(bool, false)
    enable_record_deaggregation = optional(bool, false)
    record_deaggregation_config = optional(object({
      type      = optional(string, "JSON")
      delimiter = optional(string, "")
    }), {})
  })

Default: {}
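
As a sketch only, a configuration that partitions on a customer_id field in the incoming JSON might look like this, used alongside the enable flag described below. The jq query, prefixes, and field name are illustrative; the !{partitionKeyFromQuery:...} and !{firehose:error-output-type} expressions are standard Firehose prefix namespaces.

dynamic_partitioning_config = {
  # Extract the partition key from each JSON record.
  jq_metadata_query = "{customer_id: .customer_id}"

  # Write matched records under a per-customer prefix, and failures elsewhere.
  s3_dynamic_prefix = "data/customer_id=!{partitionKeyFromQuery:customer_id}/"
  s3_error_prefix   = "errors/!{firehose:error-output-type}/"

  enable_newline_appending    = true
  enable_record_deaggregation = false
}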

Description: Whether to enable Firehose's dynamic partitioning; dynamic_partitioning_config must be supplied if true

Changing this after creation will force destruction and recreation of the Firehose stream!

Type: bool

Default: false

Description: Current deployment environment name ('Dev', 'Test', or 'Prod')

Type: string

Default: "Dev"

Description: The KMS key to use for encryption
Note that this is required if either use_kms_for_kinesis or use_kms_for_s3 is true
Note also that the key policy will need to permit the Firehose role (firehose_role_arn) to perform certain actions:

  • kms:Decrypt if used for Kinesis
  • kms:GenerateDataKey if used for S3

Type: string

Default: ""

Description: The number of hours records remain accessible in the Kinesis stream

Type: number

Default: 24

Description: The number of shards that the Kinesis stream will use, ignored if stream_mode is 'ON_DEMAND'

Type: number

Default: 1

Description: The capacity mode for the Kinesis stream ('PROVISIONED', or 'ON_DEMAND')

Type: string

Default: "ON_DEMAND"

Description: Use KMS key provided in kms_key_id for encryption of data in kinesis

Type: bool

Default: false

Description: Use KMS key provided in kms_key_id for S3 encryption (This is required for cross-account access)

Type: bool

Default: false

Outputs

The following outputs are exported:

Description: The ARN of the Kinesis stream accepting source input for the analytics pipeline.
This will be needed to grant the appropriate permissions to data producers' roles.
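
For example, a producer role's policy might reference it as follows. The kinesis_stream_arn output name is an assumption (the exported name is not shown here), and aws_iam_role.producer stands in for whatever role your producers use.

data "aws_iam_policy_document" "producer" {
  statement {
    actions = [
      "kinesis:PutRecord",
      "kinesis:PutRecords",
      "kinesis:DescribeStreamSummary",
    ]
    # Assumed output name; check the module's outputs for the exact attribute.
    resources = [module.analytics.kinesis_stream_arn]
  }
}

resource "aws_iam_role_policy" "producer" {
  name   = "analytics-producer"
  role   = aws_iam_role.producer.id
  policy = data.aws_iam_policy_document.producer.json
}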

Description: Link to the created dashboard

Description: The ARN of the role Firehose will use.
This will be useful in resource policies such as:

  • Data transformation Lambda
  • KMS key policy

Description: The S3 bucket to which Firehose will write output

Future work

Tidying

  1. CI/CD for the module
  2. Module publishing
  3. Tool improvement
  4. Get linting, testing, formatting, etc. into a pre-commit hook
  5. Get tidiness tools working on examples as well
  6. Add Auto-versioning

Functional Improvements

  1. Clean up the handling of the dynamic partitioning configuration
  2. Support for conversion of data formats and schema based on existing Glue data
  3. Allow for more granular access control to the S3 bucket
  4. Additional modularity

Examples Improvements

  1. Examples' module source should be updated to point to the source repo (to prevent dev confusion)
  2. Examples should include additional comments
  3. Additional examples, particularly around dynamic partitioning
  4. Testing of the examples

Test improvements

  1. Switch over to using the Python tftest library, rather than the built-in framework
  2. Add tests that exercise more functionality, end to end
  3. Add tests verifying compatibility with additional tool versions
