Henry Haiying Cai edited this page Feb 11, 2015 · 42 revisions

hcai: Is it possible to generate a table of contents for all the topics?

What is Gobblin?

Gobblin is a universal ingestion framework. Its goal is to pull data from any source into an arbitrary data store. One major use case for Gobblin is pulling data into Hadoop. Gobblin can pull data from file systems, SQL stores, and data exposed through a REST API. See the Gobblin Home page for more information.

What programming languages does Gobblin support?

Gobblin currently only supports Java.

hcai: Also mention the supported version of Java, and the supported version of Hadoop.

How do I run and schedule a Gobblin job?

Check out the [Deployment](Gobblin Deployment) page for information on how to run and schedule Gobblin jobs. Check out the [Configuration](Configuration Properties Glossary) page for information on how to set proper configuration properties for a job.

When running on Hadoop, each map task quickly reaches 100% completion, but then stalls for a long time. Why does this happen?

Gobblin currently uses Hadoop map tasks as containers for running Gobblin tasks. Each map task runs one or more Gobblin WorkUnits, and the progress of each WorkUnit is not hooked into the progress of the map task that runs it. Even though Hadoop reports 100% completion, Gobblin is still doing work. See the [Gobblin Deployment](Gobblin Deployment) page for more information.
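
The gap between reported and actual progress can be illustrated with a toy model. This is only a sketch, not Gobblin or Hadoop code: `ContainerProgressSketch` and its methods are hypothetical names, and it assumes that the container's reported progress reflects only how many of its inputs it has read, not how much work has finished.

```java
// Toy model only: hypothetical names, not Gobblin or Hadoop classes.
final class ContainerProgressSketch {
    private final int totalWorkUnits;
    private int workUnitsRead;
    private int workUnitsFinished;

    ContainerProgressSketch(int totalWorkUnits) {
        this.totalWorkUnits = totalWorkUnits;
    }

    // Assumption: the container counts an input as consumed once it is read.
    void readWorkUnit() {
        workUnitsRead++;
    }

    // The actual work completes later and does not feed reportedProgress().
    void finishWorkUnit() {
        workUnitsFinished++;
    }

    // What the container reports: the fraction of inputs read so far.
    double reportedProgress() {
        return (double) workUnitsRead / totalWorkUnits;
    }

    // What has really been done: the fraction of work units finished.
    double actualProgress() {
        return (double) workUnitsFinished / totalWorkUnits;
    }

    public static void main(String[] args) {
        ContainerProgressSketch task = new ContainerProgressSketch(4);

        // Reading the inputs is fast, so reported progress jumps to 100%.
        for (int i = 0; i < 4; i++) {
            task.readWorkUnit();
        }
        System.out.println("reported=" + task.reportedProgress()
                + " actual=" + task.actualProgress()); // reported=1.0 actual=0.0

        // The work itself finishes later, with no change in reported progress.
        for (int i = 0; i < 4; i++) {
            task.finishWorkUnit();
        }
        System.out.println("reported=" + task.reportedProgress()
                + " actual=" + task.actualProgress()); // reported=1.0 actual=1.0
    }
}
```

In this model the two numbers are tracked by separate counters, which is the sense in which one is "not hooked into" the other.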

Why does Gobblin on Hadoop stall for a long time between adding files to the DistributedCache and launching the actual job?

Gobblin takes all WorkUnits created by the Source class and serializes each one into a file on Hadoop. Each map task reads these files and deserializes them into Gobblin Tasks, which the map task then runs. The job stalls because Gobblin is writing all of these files to HDFS, which can take a while, especially if there are a lot of tasks to run. See the [Gobblin Deployment](Gobblin Deployment) page for more information.
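
The write-then-read pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Gobblin's implementation: `WorkUnitFilesSketch`, the `workunit-N.properties` file names, and the `source.table` key are all hypothetical, a work unit is modeled as a plain `java.util.Properties` object, and a local temporary directory stands in for HDFS.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Toy model only: hypothetical names, not Gobblin classes.
final class WorkUnitFilesSketch {

    // Launcher side: write one file per work unit before the job starts.
    // The time spent here grows with the number of work units.
    static List<Path> writeWorkUnits(Path dir, List<Properties> workUnits) throws IOException {
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < workUnits.size(); i++) {
            Path file = dir.resolve("workunit-" + i + ".properties");
            try (OutputStream out = Files.newOutputStream(file)) {
                workUnits.get(i).store(out, null);
            }
            files.add(file);
        }
        return files;
    }

    // Map-task side: read one work unit file back into memory.
    static Properties readWorkUnit(Path file) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // A local temp directory stands in for HDFS in this sketch.
        Path dir = Files.createTempDirectory("workunits");

        List<Properties> workUnits = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Properties workUnit = new Properties();
            workUnit.setProperty("source.table", "table_" + i); // hypothetical key
            workUnits.add(workUnit);
        }

        List<Path> files = writeWorkUnits(dir, workUnits);
        for (Path file : files) {
            Properties restored = readWorkUnit(file);
            System.out.println(file.getFileName() + " -> " + restored.getProperty("source.table"));
        }
    }
}
```

Because the launcher writes one file per work unit before any task can start, a Source that produces many WorkUnits pays that cost up front, which matches the stall described in the answer.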
