FAQs
Gobblin is a universal ingestion framework. Its goal is to pull data from any source into an arbitrary data store. One major use case for Gobblin is pulling data into Hadoop. Gobblin can pull data from file systems, SQL stores, and data that is exposed by a REST API. See the Gobblin Home page for more information.
Gobblin currently only supports Java 6 and up.
The machine that Gobblin is built on must have Java installed, and the `$JAVA_HOME` environment variable must be set.
Gobblin can run on both Hadoop 1.x and Hadoop 2.x. By default, Gobblin compiles against Hadoop 1.2.1, and it can be compiled against Hadoop 2.3.0 by running `./gradlew -PuseHadoop2 clean build`.
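As a minimal sketch of the Java prerequisite described above, the following shell function checks that `$JAVA_HOME` is set and contains a `bin/java` executable before a build is attempted. The function name `check_java_home` is hypothetical and not part of Gobblin.

```shell
#!/bin/sh
# Hypothetical helper (not part of Gobblin): verify the Java prerequisite.
check_java_home() {
  # JAVA_HOME must be set and non-empty
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  # A Java installation is expected to provide an executable bin/java
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "No executable found at $JAVA_HOME/bin/java" >&2
    return 1
  fi
  return 0
}
```

It could be chained in front of the build, for example `check_java_home && ./gradlew -PuseHadoop2 clean build`.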
Check out the [Deployment](Gobblin Deployment) page for information on how to run and schedule Gobblin jobs. Check out the [Configuration](Configuration Properties Glossary) page for information on how to set proper configuration properties for a job.
Sqoop's main focus is bulk import and export of data between relational databases and HDFS. It lacks the ETL functionality of data cleansing, data transformation, and data quality checks that Gobblin provides. Gobblin is also capable of pulling from any data source (e.g. file systems, RDBMS, REST APIs).
When running on Hadoop, each map task quickly reaches 100% completion, but then stalls for a long time. Why does this happen?
Gobblin currently uses Hadoop map tasks as a container for running Gobblin tasks. Each map task runs 1 or more Gobblin workunits, and the progress of each workunit is not hooked into the progress of each map task. Even though the Hadoop job reports 100% completion, Gobblin is still doing work. See the [Gobblin Deployment](Gobblin Deployment) page for more information.
Why does Gobblin on Hadoop stall for a long time between adding files to the DistributedCache and launching the actual job?
Gobblin takes all WorkUnits created by the Source class and serializes each one into a file on Hadoop. These files are read by each map task and deserialized into Gobblin Tasks, which the map task then runs. The job stalls because Gobblin is writing all of these files to HDFS, which can take a while, especially if there are a lot of tasks to run. See the [Gobblin Deployment](Gobblin Deployment) page for more information.
This error typically occurs due to Hadoop version conflicts. If Gobblin is compiled against a specific Hadoop version but then deployed on a different Hadoop version or installation, this error may be thrown. For example, if you simply compile Gobblin using `./gradlew clean build -PuseHadoop2` but deploy Gobblin to a cluster with CDH installed, you may hit this error.
It is important to realize that the `gobblin-dist.tar.gz` file produced by `./gradlew clean build` will include all the Hadoop jar dependencies, and if one follows the MR deployment guide, Gobblin will be launched with these dependencies on the classpath.
To fix this, take the following steps:
- Delete all the Hadoop jars from the Gobblin `lib` folder
- Ensure that the environment variable `HADOOP_CLASSPATH` is set and points to a directory containing the Hadoop libraries for the cluster
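The two steps above might be scripted roughly as follows. This is a sketch under assumptions: the `hadoop-*.jar` name pattern, the function name `remove_bundled_hadoop_jars`, and the example paths are illustrative and not taken from the Gobblin documentation. Check what the pattern matches in your distribution before deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: remove bundled Hadoop jars from a Gobblin lib folder.
# Assumption: the bundled Hadoop jars are named hadoop-*.jar.
remove_bundled_hadoop_jars() {
  lib_dir="$1"
  if [ ! -d "$lib_dir" ]; then
    echo "Not a directory: $lib_dir" >&2
    return 1
  fi
  # Print each jar as it is removed so the change can be reviewed
  for jar in "$lib_dir"/hadoop-*.jar; do
    [ -e "$jar" ] || continue   # the glob matched nothing
    echo "removing $jar"
    rm -f -- "$jar"
  done
  return 0
}

# Example usage (both paths are placeholders for your installation):
#   remove_bundled_hadoop_jars /path/to/gobblin-dist/lib
#   export HADOOP_CLASSPATH=/path/to/cluster/hadoop/lib
```

Printing each removed file keeps the operation auditable, which helps if a jar that merely matches the pattern turns out to be needed.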
Cloudera Distributed Hadoop (often abbreviated as CDH) is a popular Hadoop distribution. Typically, when running Gobblin on a CDH cluster, it is recommended that one also compile Gobblin against the same CDH version. Not doing so may cause unexpected runtime behavior. To compile against a specific CDH version, simply use the `hadoopVersion` parameter. For example, to compile against version `2.5.0-cdh5.3.0`, run `./gradlew clean build -PuseHadoop2 -PhadoopVersion=2.5.0-cdh5.3.0`.
In order for the above command to work, one may also need to add the following Gradle code to the list of repositories in each `build.gradle` file.
maven {
url "https://repository.cloudera.com/artifactory/cloudera-repos/"
}
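To avoid retyping the build flags used in this answer, the command line can be composed by a small helper. This is only a sketch: the function name `gobblin_build_cmd` is hypothetical, and it simply prints the two command forms already quoted on this page rather than running Gradle.

```shell
#!/bin/sh
# Hypothetical helper: print the Gradle command for a given Hadoop/CDH version.
# With no argument it prints the plain Hadoop 2 build command.
gobblin_build_cmd() {
  if [ -n "${1:-}" ]; then
    echo "./gradlew clean build -PuseHadoop2 -PhadoopVersion=$1"
  else
    echo "./gradlew clean build -PuseHadoop2"
  fi
}

gobblin_build_cmd 2.5.0-cdh5.3.0
# prints: ./gradlew clean build -PuseHadoop2 -PhadoopVersion=2.5.0-cdh5.3.0
```

Printing the command instead of executing it makes it easy to inspect the flags before starting a build.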