Kinesis LZO S3 Sink Setup 0.5.0
🚧 The documentation for the latest version can be found on the Snowplow documentation site.
This documentation is for version 0.5.0 of the Kinesis LZO S3 Sink.
The Kinesis LZO S3 Sink reads records from an Amazon Kinesis stream, compresses them using splittable LZO or GZip, and writes them to S3.
It was created to store the Thrift records generated by the Scala Stream Collector in S3.
If it fails to process a record, it will write that record to a second Kinesis stream along with an error message.
To run the Kinesis LZO S3 Sink, you must first install the native LZO binaries. On Ubuntu, run:
$ sudo apt-get install lzop liblzo2-dev
See Hosted assets for the zipfile to download.
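For example, assuming you have the download URL for the 0.5.0 zipfile from the Hosted assets page (the URL and filename below are placeholders), fetching and unpacking it looks like this:
$ wget https://<hosted-assets-location>/snowplow_kinesis_s3_0.5.0.zip
$ unzip snowplow_kinesis_s3_0.5.0.zip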
Alternatively, you can build it from the source files. To do so, you will need Scala and sbt installed.
First, clone the kinesis-s3 repository:
$ git clone https://github.com/snowplow/kinesis-s3.git
Navigate into the Kinesis S3 folder:
$ cd kinesis-s3
Use sbt to resolve dependencies, compile the source, and build an assembled fat JAR file containing all dependencies:
$ sbt assembly
The jar file will be saved as snowplow-kinesis-s3-0.5.0 in the target/scala-2.12 subdirectory. It is now ready to be deployed.
The sink is configured using a HOCON file. These are the fields:
- aws.access-key and aws.secret-key: Change these to your AWS credentials. Alternatively, leave them as "default", in which case the DefaultAWSCredentialsProviderChain will be used.
- kinesis.in.stream-name: The name of the input Kinesis stream. This should be the stream to which you are writing records with the Scala Stream Collector.
- kinesis.in.initial-position: Where to start reading from the stream the first time the app is run. "TRIM_HORIZON" for as far back as possible, "LATEST" for as recent as possible.
- kinesis.in.max-records: The maximum number of records to read per GetRecords call.
- kinesis.out.stream-name: The name of the output Kinesis stream, to which records are sent if the compression process fails.
- kinesis.out.shards: If the output stream doesn't exist, create it with this many shards.
- kinesis.region: The Kinesis region name to use.
- kinesis.app-name: A unique identifier for the app, ensuring that if it is stopped and restarted, it will resume at the correct location in the stream.
- s3.endpoint: The AWS endpoint for the S3 bucket.
- s3.bucket: The name of the S3 bucket in which files are to be stored.
- s3.format: The format in which the app should write to S3 (lzo or gzip).
- s3.max-timeout: The maximum amount of time the app attempts to PUT to S3 before it kills itself.
- buffer.byte-limit: Whenever the total size of the buffered records exceeds this number of bytes, they will all be sent to S3.
- buffer.record-limit: Whenever the total number of buffered records exceeds this number, they will all be sent to S3.
- buffer.time-limit: If this length of time passes without the buffer being flushed, the buffer will be flushed.
You can also now include Snowplow Monitoring in the application. This is set up through a new section at the bottom of the config. You will need to amend:
- monitoring.snowplow.collector-uri: Insert your Snowplow collector URI here.
- monitoring.snowplow.app-id: The app-id used in decorating the events sent.
If you do not wish to include Snowplow Monitoring, remove the entire monitoring section from the config.
An example is available in the repo.
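For orientation, here is a minimal configuration sketch assembled from the fields described above. All values are illustrative placeholders, and the nesting should be checked against the example file in the repo, which is authoritative for version 0.5.0:
aws {
  access-key: "default"   # or your AWS access key
  secret-key: "default"   # or your AWS secret key
}
kinesis {
  in {
    stream-name: "snowplow-raw-good"   # the stream written to by the Scala Stream Collector
    initial-position: "TRIM_HORIZON"
    max-records: 10000
  }
  out {
    stream-name: "snowplow-raw-bad"    # records that fail compression are sent here
    shards: 1
  }
  region: "eu-west-1"
  app-name: "snowplow-lzo-s3-sink"
}
s3 {
  endpoint: "http://s3-eu-west-1.amazonaws.com"
  bucket: "my-raw-events-bucket"
  format: "lzo"
  max-timeout: 60000
}
buffer {
  byte-limit: 4500000
  record-limit: 500
  time-limit: 60000
}
monitoring {
  snowplow {
    collector-uri: "my-collector.example.com"
    app-id: "lzo-s3-sink"
  }
}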
Note that setting the "bucket" field to a nested bucket path (like "mybucket/myinnerbucket") may prevent the sink from working, throwing an exception such as:
com.amazonaws.services.s3.model.AmazonS3Exception: The bucket you are attempting to access must be addressed using the specified endpoint.
To get around this, include your bucket's S3 region in the endpoint field:
s3 {
endpoint: "http://s3-eu-west-1.amazonaws.com" # Rather than "http://s3.amazonaws.com"
bucket: "outer-bucket/inner-bucket"
}
The Kinesis S3 Sink is an executable jar file which should be runnable from any Unix-like shell environment. Simply provide the configuration file as a parameter:
$ ./snowplow-kinesis-s3-0.5.0 --config my.conf
This will start the process of reading events from Kinesis, compressing them, and writing them to S3.
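If you left aws.access-key and aws.secret-key as "default", the DefaultAWSCredentialsProviderChain will look for credentials in the standard locations (environment variables, the shared credentials file, or an instance profile). For example, to supply them via environment variables (placeholder values shown):
$ export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
$ export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxx
$ ./snowplow-kinesis-s3-0.5.0 --config my.conf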