A quickstart demo to showcase Hudi functionalities using docker along with support for integration-tests #455
Conversation
@vinothchandar @n3nash: Rebased this branch against master. Tests pass locally. Please use this instead of vinothchandar#4.
@vinothchandar: The build is failing here:
[ERROR] Failed to execute goal on project hoodie-hadoop-base-docker: Could not resolve dependencies for project com.uber.hoodie:hoodie-hadoop-base-docker:pom:0.4.4-SNAPSHOT: Could not transfer artifact com.uber.hoodie:hoodie-hadoop-docker:jar:0.4.4-SNAPSHOT from/to Maven repository (https://central.maven.org/maven2/): Host name 'central.maven.org' does not match the certificate subject provided by the peer (CN=repo1.maven.org, O="Sonatype, Inc", L=Fulton, ST=MD, C=US) -> [Help 1]
docs/quickstart.md (Outdated)
Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction.

Take a look at the director `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
directory*
This is a huge help for understanding Hoodie, thanks!
docs/quickstart.md (Outdated)
# Schedule a compaction. This will use Spark Launcher to schedule compaction
hoodie:stock_ticks->compaction schedule
....
Compaction instance : 20180910234509
It seems `compaction schedule` is trying to kick off YARN instead of a Spark cluster. Here's a log from testing:
2018-09-15 03:59:11 INFO TimelineClientImpl:297 - Timeline service address: http://historyserver:8188/ws/v1/timeline/
2018-09-15 03:59:11 INFO RMProxy:98 - Connecting to ResourceManager at resourcemanager:8032
2018-09-15 03:59:11 INFO AHSProxy:42 - Connecting to Application History server at historyserver/172.22.0.7:10200
2018-09-15 04:14:10 ERROR SparkContext:91 - Error initializing SparkContext.
java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "resourcemanager":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
at sun.reflect.GeneratedConstructorAccessor20.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
Thanks @sungjuly . I have rebased and this issue should be fixed now.
docs/quickstart.md (Outdated)
@@ -14,11 +14,11 @@ Check out code and pull it into Intellij as a normal maven project.

Normally build the maven project, from command line
```
$ mvn clean install -DskipTests
$ mvn clean install -DskipTests -skipITs
```
`-skipITs` should be `-DskipITs`.
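For reference, a sketch of the corrected command, assuming the intent is to skip both unit and integration tests as the surrounding text says:
```
mvn clean install -DskipTests -DskipITs
```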
@sungjuly @n3nash: Thanks a lot for providing the feedback. I have revamped this PR and made a few improvements. Here is the list:
I have marked this PR as WIP since I need to showcase the incremental view as part of the demo. @n3nash @sungjuly: Can you please follow the quickstart steps and let me know your experience.
Force-pushed 3369065 to 0654237
```
cd docker
./setup_demo.sh
```
I've encountered this. It seems it's related to the docker version. Here's my env to help figure out the problem.
- docker-engine: 18.06.1-ce (there's no way to downgrade to 17.12 since the official site only supports the latest version for Mac)
- docker-compose: 1.22.0
➜ docker git:(docker) ✗ ./setup_demo.sh
Removing network compose_default
Creating network "compose_default" with the default driver
Creating kafkabroker ...
Creating zookeeper ... error
Creating hive-metastore-postgresql ...
Creating namenode ...
Creating kafkabroker ... error
ERROR: for zookeeper Cannot create container for service zookeeper: Conflict. The container name "/zookeeper" is already in use by container "8fb174b9eb8c40bdc20e0ed1b042a793fe49ec3dcea1f353c250b482c7c80019". You have to remove (or rename) that container to be able to reuse that name.
ERROR: for kafkabroker Cannot create container for service kafka: Conflict. The container name "/kafkabroker" is already in use by another container. You have to remove (or rename) that container to be able to reuse that name.
Creating hive-metastore-postgresql ... error
@sungjuly: ok, can you check `docker container ls` and see if there are stopped/running containers with the same names? If you find them, you can remove the containers using `docker rm`.
This is a one-time issue for those who tried this PR before. I have moved the compose script to a different file but the container names are the same. Removing the docker containers should hopefully fix the issue.
For a different reason, I reverted my docker (Mac) installation to "factory defaults" and that also fixed the problem.
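A rough sketch of that cleanup, assuming the container names shown in the error output above:
```
# list all containers, including stopped ones
docker container ls -a
# remove the stale demo containers by name so the names can be reused
docker rm zookeeper kafkabroker hive-metastore-postgresql namenode
```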
The docker env was fully cleaned up before I tested. Here's the output of `docker container ls`:
➜ docker git:(docker) ✗ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
It worked after resetting to factory defaults on docker! thank you! @bvaradar
@sungjuly: I am also online in the hoodie slack channel. Ping me there if you need a quicker reply. Thanks.
@bvaradar would you please share the link to the hoodie slack channel? I don't see any information. Thank you!
Never mind, I found this - #143 (comment)
huge thank you @bvaradar. It's super helpful to understand hoodie more!
docs/quickstart.md (Outdated)
# Execute the compaction
hoodie:stock_ticks->compaction run --compactionInstant 20180910234509 --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1
It would be great if you could add more description here: the `compactionInstant` value should be updated based on the previous results of `compactions show all`.
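A sketch of how the doc might make that dependency explicit (the instant value is just the example from the scheduling step above; the real value comes from the `compactions show all` output):
```
# List compactions and note the pending instant time
hoodie:stock_ticks->compactions show all
# Use that pending instant (e.g. 20180910234509) when running the compaction
hoodie:stock_ticks->compaction run --compactionInstant 20180910234509 --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1
```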
@vinothchandar @n3nash @sungjuly: Updated PR with the incremental view demo in quickstart. Ready for review.
Tests were failing because of a log-size limit issue. Fixed as part of PR #465.
services:
  namenode:
    image: varadarb/hudi-hadoop_2.8.4-namenode:latest
Is this correct? Should the username be here? @bvaradar
@n3nash: I am not aware of any headless docker hub account. I have created a ticket to provide more context and assigned it to @vinothchandar.
Once the ticket is resolved, we can replace the image locations in the Dockerfile, pom.xml and the compose scripts.
      - /tmp/hadoop_data:/hadoop/dfs/data
  historyserver:
    image: varadarb/hudi-hadoop_2.8.4-history:latest
same here
Force-pushed 13ff672 to 6899513
docs/quickstart.md (Outdated)
```
exit
```
#### Step 5 (b): Run Spark-SQl Queries
@sungjuly @vinothchandar @n3nash : added spark-sql example here
👍 I've tested the new scenarios. It worked properly! Separately creating MOR/COW tables is a good idea to explain the internals. As a user, thank you for your work!
+1 It is so much better to test stuff now.. :) Thank you for doing this, even as co-creator of the project :) ha ha.
Force-pushed 0cb91de to a634dc1
Works really well. Took me 20 mins to get containers up, from a decent internet connection in India.
docs/quickstart.md (Outdated)
2018-09-24 22:20:00 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-09-24 22:20:00 INFO SparkContext:54 - Successfully stopped SparkContext
# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer /var/hoodie/ws/hoodie-utilities/target/hoodie-utilities-0.4.4-SNAPSHOT.jar --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties
This is now hoodie-utilities-0.4.4-SNAPSHOT.jar with the new release; same comment on the other 3 occurrences as well. Can we name the jar differently inside the container, without the version, so that it will continue to work? (I understand it takes away knowing what version is being tested.) Thoughts?
Also, DeltaStreamer now takes a `--storage-type COPY_ON_WRITE/MERGE_ON_READ` argument which is required. We could use this and get rid of the steps to create the dataset manually via the CLI? DeltaStreamer will create the dataset if the basePath does not exist.
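A sketch of what the combined ingest step might look like under that suggestion (the `--storage-type` flag comes from this comment; the remaining arguments are copied from the quoted command above, so treat it as illustrative rather than the final doc text):
```
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  /var/hoodie/ws/hoodie-utilities/target/hoodie-utilities-0.4.4-SNAPSHOT.jar \
  --storage-type MERGE_ON_READ \
  --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_mor \
  --target-table stock_ticks_mor \
  --props /var/demo/config/kafka-source.properties
```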
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar: I have made changes to treat the hoodie-utilities jar in the same way as the other bundle jars. The utilities jar will now be part of the docker image with the version removed. It is also available via the alias $HUDI_UTILITIES_BUNDLE. I have updated the quickstart and uploaded newer docker images with the "latest" tag, so you should see the changes when you set up the demo again. The 2nd step of explicitly initializing Hudi datasets using the Hudi CLI is also removed.
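So, assuming the alias behaves as described, the command shape in the quickstart becomes version-independent, roughly:
```
# before: versioned jar path baked into the command
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer /var/hoodie/ws/hoodie-utilities/target/hoodie-utilities-0.4.4-SNAPSHOT.jar ...
# after: version-independent alias inside the container
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE ...
```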
@bvaradar per docker/compose#3574, I don't think `compose up` will pull in the latest images. I am going to remove all containers and retry. We probably need better support in the script for optionally also doing a `compose pull` before?
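Something along these lines in the setup script would cover it (a sketch; the actual compose file location and flags used by setup_demo.sh may differ):
```
# refresh images tagged "latest" before (re)creating containers
docker-compose pull
docker-compose up -d
```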
That worked @bvaradar. One more issue I found: between recreating containers, the path on the host machine /tmp/hadoop* needs to be blown away. Can we add a `rm -rf` before the `mkdir -p` in the setup script? Other than that, verified that building and deltastreamer work.
docs/quickstart.md (Outdated)
Lets run similar queries against M-O-R dataset. Lets look at both ReadOptimized and Realtime views supported by M-O-R dataset
# Run agains ReadOptimized View. Notice that the latest timestamp is 10:29
typo: against
Fixed.
docs/quickstart.md (Outdated)
1 row selected (6.326 seconds)
# Run agains Realtime View. Notice that the latest timestamp is again 10:29
typo: against
Fixed.
running in spark-sql
```
$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1 --packages com.databricks:spark-avro_2.11:4.0.0
```
add `docker exec -it adhoc-1 /bin/bash`?
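i.e., roughly this sequence, with the spark-shell invocation taken from the quoted doc above:
```
# from the host: open a shell inside the adhoc-1 container
docker exec -it adhoc-1 /bin/bash
# then, inside the container, launch the spark shell
$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1 --packages com.databricks:spark-avro_2.11:4.0.0
```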
Added.
docs/quickstart.md (Outdated)
#### Step 7(b): Run Spark SQL Queries
Running the same queries in Spark-SQl:
nit: SQL
Fixed.
Took a pass. Other changes seem ok. Let me know once you have fixed the minor issues I left in the comments.
hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/schema.avsc
# Kafka Source
#hoodie.deltastreamer.source.kafka.topic=uber_trips
nit: remove this line?
Done.
<packaging>maven-plugin</packaging>
<name>docker-maven-plugin</name>
<description>A maven plugin for docker</description>
<url>https://github.com/spotify/docker-maven-plugin</url>
is the plugin not published anywhere we can pull down from?
Good catch. This file was accidentally added.
@vinothchandar: Thanks a lot for the review comments. Incorporated them and updated the PR.
@bvaradar: reported two issues with respect to re-initing containers; otherwise it's good for merging. I will go ahead and do that.
@@ -0,0 +1,49 @@
[![Gitter chat](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/big-data-europe/Lobby)

# docker-hive
@bvaradar is this file meant to be checked in?
Missed this. I have removed it now.
@@ -0,0 +1,15 @@
# Create host mount directory and copy
mkdir -p /tmp/hadoop_name
Add a protective `rm -rf` here? Otherwise the NN will start in safe mode since it does not recognize the data.
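A sketch of the suggested guard, assuming the /tmp/hadoop_name and /tmp/hadoop_data host mount directories used by the compose file:
```
# clear stale HDFS name/data directories from previous runs so the NameNode does not start in safe mode
rm -rf /tmp/hadoop_name /tmp/hadoop_data
# recreate the host mount directories
mkdir -p /tmp/hadoop_name
mkdir -p /tmp/hadoop_data
```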
Once the docker containers are stopped, the files are deleted using `rm -rf`. I did not want to do the deletion before docker-compose down so that the docker containers can be shut down more cleanly. Will chat f2f to explain more.
@bvaradar actually let's do a quick sync before I merge this.
@vinothchandar: Changed the setup_demo script to first do docker-compose pull before docker-compose up, in order to pull the latest version of the docker images. Addressed the other comments.
Force-pushed 0aec4bd to 3f3f97b
…er integration tests. Docker images built with Hadoop 2.8.4, Hive 2.3.3 and Spark 2.3.1 and published to docker-hub. Look at the quickstart document for how to set up docker and run the demo.
Comes with foundations for adding docker integration tests. Docker images built with Hadoop 2.8.4, Hive 2.3.3 and Spark 2.3.1.
Demo using docker containers with documentation