This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Batch and realtime

Sam Ritchie edited this page Aug 26, 2013 · 4 revisions

Running in a hybrid batch/realtime mode requires a few special configuration parameters.

TODO: This section requires some cleanup to mesh with the new API.

Batcher and BatchID

Every job is required to declare a Batcher:

import com.twitter.summingbird.batch.Batcher

implicit val batcher = Batcher.ofHours(1)

Batching is the secret sauce that allows Summingbird’s client to merge the output of a Hadoop aggregation and a Storm aggregation. You shouldn’t have to think about batching when writing your Summingbird jobs, but it helps to know what’s going on behind the scenes.
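To make the idea of a batch concrete, here is a minimal sketch of how an hourly batcher might assign event timestamps to batches. `BatchID` and `FixedWidthBatcher` below are hypothetical stand-ins written for this page, not Summingbird's own classes, and the real `Batcher` may compute batch boundaries differently.

```scala
// Hypothetical stand-ins for illustration only; not Summingbird's implementation.
object BatchingSketch {
  final case class BatchID(id: Long) {
    def next: BatchID = BatchID(id + 1)
  }

  // Assigns an event time (milliseconds since the epoch) to a fixed-width batch.
  final case class FixedWidthBatcher(widthMillis: Long) {
    // floorDiv keeps timestamps before the epoch in the correct (negative) batch.
    def batchOf(timestampMillis: Long): BatchID =
      BatchID(Math.floorDiv(timestampMillis, widthMillis))

    // First instant (inclusive) that belongs to the given batch.
    def earliestTimeOf(batch: BatchID): Long = batch.id * widthMillis
  }

  // One-hour batches, mirroring Batcher.ofHours(1) in spirit.
  val hourly: FixedWidthBatcher = FixedWidthBatcher(60L * 60L * 1000L)
}
```

Under these assumptions, every event in the same clock hour lands in the same batch, and the first millisecond of the next hour starts a new one.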

A Hadoop job will run every time a new batch of data becomes available (in this case, one hour of new data). Alongside each key, the job stores metadata recording which batches have been aggregated into that key's partial value. So, instead of storing (K, V) pairs, Summingbird’s Scalding job stores (K, (BatchID, V)). The paired BatchID is the first batch that Scalding has NOT yet processed.
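The (K, (BatchID, V)) layout can be sketched with plain Scala collections. Everything here is an assumption made for illustration: `OfflineStore` and `mergeBatch` are hypothetical names, keys are strings, values are `Long` counts summed with `+`, and each run is assumed to rewrite every key. The actual Scalding job is not shown on this page.

```scala
// Hypothetical model of the batch store; not the Scalding job itself.
object BatchStoreSketch {
  final case class BatchID(id: Long) {
    def next: BatchID = BatchID(id + 1)
  }

  // key -> (first batch NOT yet processed, aggregate of all earlier batches)
  type OfflineStore = Map[String, (BatchID, Long)]

  // Fold one completed batch of per-key deltas into the store.
  // After the merge, every key points at the batch following the one just processed.
  def mergeBatch(
      store: OfflineStore,
      batch: BatchID,
      deltas: Map[String, Long]
  ): OfflineStore =
    (store.keySet ++ deltas.keySet).map { k =>
      val previous = store.get(k).map(_._2).getOrElse(0L)
      (k, (batch.next, previous + deltas.getOrElse(k, 0L)))
    }.toMap
}
```

Merging batch 4 into an empty store leaves each key paired with `BatchID(5)`, the first batch the offline side has not yet seen.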

The Storm job, on the other hand, breaks down the (K, V) pairs it receives into ((K, BatchID), V). When a new Scalding run completes and its output covers a particular batch, Storm’s data store can safely drop all partial aggregations that reference that batch.
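A rough sketch of the realtime side, under the same illustrative assumptions (string keys, `Long` counts, an in-memory `Map`). `OnlineStore`, `add`, and `expire` are hypothetical names, not part of Summingbird's Storm platform.

```scala
// Hypothetical model of the realtime store; not Summingbird's Storm code.
object OnlineStoreSketch {
  final case class BatchID(id: Long)

  // (key, batch) -> partial aggregate for that key within that batch
  type OnlineStore = Map[(String, BatchID), Long]

  // Add one incoming value to the partial aggregate of its (key, batch) slot.
  def add(store: OnlineStore, key: String, batch: BatchID, value: Long): OnlineStore = {
    val slot = (key, batch)
    store.updated(slot, store.getOrElse(slot, 0L) + value)
  }

  // Once the offline side reports `firstUnprocessed`, every older batch is
  // already folded into the offline value and its partials can be dropped.
  def expire(store: OnlineStore, firstUnprocessed: BatchID): OnlineStore =
    store.filter { case ((_, batch), _) => batch.id >= firstUnprocessed.id }
}
```

Keying by (K, BatchID) is what makes the cleanup cheap: expiring a batch never requires subtracting anything from a running total.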

When a client performs a lookup into a hybrid Summingbird system, it follows this algorithm:

  • Send a K to the key-value store holding Scalding data and get back (BatchID, V).
  • Use this batch and the current batch (based on the wall clock) to generate a sequence of missing batches.
  • For each batch in this sequence, query Storm’s store for (K, BatchID).
  • Sum all partial values and return the final value to the client.

If Scalding returns BatchID(5) and the wall clock states that the current batch is BatchID(7), the Summingbird client will fetch batches 5, 6, and 7 from Storm’s datastore and merge the three resulting partial values with the original value received from the key-value store holding Scalding data (Manhattan, in this example). If any of these lookups fails, the entire lookup fails, ensuring that the client only returns a fully merged value.
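The lookup steps and the BatchID(5) / BatchID(7) example above can be sketched as follows. This is not Summingbird's client code: `lookup` and its parameters are hypothetical, the two stores are modelled as plain functions, values are `Long` counts, and `None` stands for a failed lookup rather than an absent key.

```scala
// Hypothetical sketch of the merge-on-read algorithm; not the real client.
object HybridLookupSketch {
  final case class BatchID(id: Long)

  def lookup(
      key: String,
      offline: String => Option[(BatchID, Long)],   // Scalding-side store
      online: ((String, BatchID)) => Option[Long],  // Storm-side store
      currentBatch: BatchID                         // derived from the wall clock
  ): Option[Long] =
    offline(key).flatMap { case (firstMissing, base) =>
      // Batches the offline value does not cover yet, up to the current batch.
      val missing = (firstMissing.id to currentBatch.id).map(BatchID(_))
      // Sum the partials; a single failed lookup turns the whole result into None.
      missing.foldLeft(Option(base)) { (acc, batch) =>
        for (sum <- acc; partial <- online((key, batch))) yield sum + partial
      }
    }
}
```

With an offline value of `(BatchID(5), 100)`, online partials of 1, 2, and 3 for batches 5, 6, and 7, and a current batch of `BatchID(7)`, this sketch returns `Some(106)`. Removing the partial for batch 6 makes the same call return `None`.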