Following these steps you can build and use the image to create Hadoop Single Node Cluster containers.
Either pull the prebuilt image (in that case, use its full name in place of hadoop in the commands below, or retag it with docker tag):
$ docker pull julienlau/hadoop-single-node-cluster:3.3.3
or clone the repository and build the image yourself:
$ git clone https://github.com/rancavil/hadoop-single-node-cluster.git
$ cd hadoop-single-node-cluster
$ docker build -t hadoop .
To create and run a container, execute the following command:
$ docker run --name <container-name> -p 9864:9864 -p 9870:9870 -p 8088:8088 -p 9000:9000 --hostname <your-hostname> hadoop
To detach from the container and keep it running as a daemon, use the standard Docker escape sequence Ctrl+p followed by Ctrl+q, or start the container directly with the -d option.
Replace <container-name> with a name of your choice and set <your-hostname> to your machine's IP address or hostname. You can use myhdfs as <your-hostname>.
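For example, using myhdfs for both the container name and the hostname:
$ docker run --name myhdfs -p 9864:9864 -p 9870:9870 -p 8088:8088 -p 9000:9000 --hostname myhdfs hadoop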
When you run the container, the docker-entrypoint.sh script is executed as the entrypoint; it creates and starts the Hadoop environment.
You should get the following prompt:
hduser@myhdfs:~$
To check that the Hadoop container is working:
- open the NameNode web UI in your browser: http://localhost:9870
- or copy the hdfs binary out of the container to use it from the host:
docker cp <container-name>:/home/hduser/hadoop-3.3.3/bin/hdfs .
and try a mkdir:
./hdfs dfs -mkdir hdfs://localhost:9000/tmp
Notice: if you want to use a port other than 9000, you must also adapt the file core-site.xml and rebuild the image, or redirect the port (to 19000, for example) by passing the option -p 19000:9000 to docker run.
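For example, assuming you copied the hdfs binary to the host as shown above, a redirected run could look like:
$ docker run --name myhdfs -p 9864:9864 -p 9870:9870 -p 8088:8088 -p 19000:9000 --hostname myhdfs hadoop
$ ./hdfs dfs -mkdir hdfs://localhost:19000/tmp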
Notice: hdfs-site.xml sets the following property, so don't use this configuration in a production environment:
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
Make the HDFS directories required to execute MapReduce jobs:
hduser@myhdfs:~$ hdfs dfs -mkdir /user
hduser@myhdfs:~$ hdfs dfs -mkdir /user/hduser
Copy the input files into the distributed filesystem:
hduser@myhdfs:~$ hdfs dfs -mkdir input
hduser@myhdfs:~$ hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml input
Run some of the examples provided:
hduser@myhdfs:~$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar grep input output 'dfs[a-z.]+'
2020-08-08 01:57:02,411 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-08-08 01:57:04,754 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-08-08 01:57:04,754 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-08-08 01:57:08,843 INFO input.FileInputFormat: Total input files to process : 10
..............
.............
............
File Input Format Counters
Bytes Read=175
File Output Format Counters
Bytes Written=47
Examine the output files from the distributed filesystem:
hduser@myhdfs:~$ hdfs dfs -ls output/
Found 2 items
-rw-r--r-- 1 hduser supergroup 0 2020-08-08 01:58 output/_SUCCESS
-rw-r--r-- 1 hduser supergroup 47 2020-08-08 01:58 output/part-r-00000
Checking the result using cat command on the distributed filesystem:
hduser@myhdfs:~$ hdfs dfs -cat output/*
1 dfsadmin
1 dfs.replication
1 dfs.permissions
To shut down gracefully and stop the container, execute the following commands:
hduser@myhdfs:~$ stop-dfs.sh
hduser@myhdfs:~$ stop-yarn.sh
After that, exit the container:
hduser@myhdfs:~$ exit
To restart the container and go back to the Hadoop environment, execute:
$ docker start -i <container-name>
This image does not use a volume, so data will not persist beyond the life of a container instance. To clean up the data, run:
$ docker stop <container-name> && docker rm <container-name>
To host it on a development Kubernetes cluster:
kubectl apply -f hdfs-deployment.yaml
kubectl expose deployment hdfs --type=NodePort --name=hdfs-service
kubectl get svc hdfs-service
# get map for port 9000 of service hdfs-service
kubectl get svc hdfs-service -o=jsonpath='{.spec.ports[?(@.port==9000)].nodePort}'
kubectl get svc --all-namespaces -o go-template='{{range .items}}{{ $save := . }}{{range.spec.ports}}{{if .nodePort}}{{$save.metadata.namespace}}{{"/"}}{{$save.metadata.name}}{{" - "}}{{.name}}{{": "}}{{.targetPort}}{{" -> "}}{{.nodePort}}{{"\n"}}{{end}}{{end}}{{end}}'
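As a sketch, assuming a node of the cluster is reachable at <node-ip> (a placeholder), you can point the copied hdfs binary at the reported NodePort:
$ NODEPORT=$(kubectl get svc hdfs-service -o=jsonpath='{.spec.ports[?(@.port==9000)].nodePort}')
$ ./hdfs dfs -ls hdfs://<node-ip>:$NODEPORT/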
At this point you should be able to run a TPCx-HS workload on Kubernetes using the spark-submit option --conf "spark.hadoop.fs.defaultFS=hdfs://hdfs-service.default.svc.cluster.local:9000/", for example with the following project: https://github.com/julienlau/tpcx-hs
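A minimal spark-submit sketch; the API server address, Spark image, main class, and jar path below are placeholders, not values taken from this repository:
$ spark-submit \
    --master k8s://https://<k8s-apiserver>:6443 \
    --deploy-mode cluster \
    --conf spark.kubernetes.container.image=<spark-image> \
    --conf "spark.hadoop.fs.defaultFS=hdfs://hdfs-service.default.svc.cluster.local:9000/" \
    --class <main-class> \
    <application-jar>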