Word count map-reduce in Python. Dockerized!

A toy map-reduce app that counts word occurrences in a file. It consists of split.py, map.py, and reduce.py, runs in containers as Docker Swarm jobs, and is orchestrated by the StackStorm workflow wordcount.yaml.
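
For orientation, the three stages can be sketched in memory as below. This is a minimal illustration, not the code in split.py, map.py, or reduce.py: the function names `split_lines`, `map_words`, and `reduce_counts` are hypothetical, and the real scripts read and write files whose intermediate format is not shown here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def split_lines(lines: List[str], parts: int) -> List[List[str]]:
    """Split: cut the input lines into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def map_words(chunk: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce: sum the emitted counts per word."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals


if __name__ == "__main__":
    lines = ["the cat sat", "the dog sat", "the end"]
    # Each chunk is mapped independently; the pairs are then concatenated.
    mapped = [pair for chunk in split_lines(lines, 2) for pair in map_words(chunk)]
    print(reduce_counts(mapped).most_common(2))  # [('the', 3), ('sat', 2)]
```

Because the per-word sums do not depend on how the lines were chunked, the map step can run on each chunk in a separate container.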

Usage

  1. Configure Swarm and StackStorm, per ../../README.md

  2. Build the containers and push them to the local repo. From the current directory, run:

    ./docker-build.sh
    
  3. With StackStorm installed and the pipeline pack set up (see step 1), run the pipeline workflow:

    st2 run -a pipeline.wordcount \
    input_file=/vagrant/share/loremipsum.txt result_filename=loremipsum.res \
    parallels=8 delay=10
    

Details

Test on a tiny file:

wordcount/map.py data/nory.txt data/out.txt && \
wordcount/reduce.py data/out.txt data/res.txt && \
cat data/res.txt

a 6
catholic 6
was 6
because 3
her 3
mother 3
and 2
father 2
nory 1
been 1
his 1
norys 1
or 1
had 1
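
The output above suggests that words are lowercased, punctuation is dropped (note `norys`), and results are listed by descending count. A rough sketch of that behavior follows; it is an assumption inferred from the sample output, not the actual map.py and reduce.py logic, `count_words` is a hypothetical name, and the order of words with equal counts may differ from the real scripts.

```python
import re
from collections import Counter
from typing import List, Tuple


def count_words(text: str) -> List[Tuple[str, int]]:
    """Lowercase, drop punctuation, count, and sort by descending count."""
    # Assumption: anything other than letters, digits, and whitespace is removed.
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return Counter(cleaned.split()).most_common()


if __name__ == "__main__":
    # Hypothetical sample sentence, not the contents of data/nory.txt.
    sample = "Nory's mother was a Catholic, and her mother was a Catholic."
    for word, n in count_words(sample):
        print(word, n)
```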

Now split a file in two, map each part, and reduce the combined map output:

# Split the file into two parts
wordcount/split.py data/loremipsum.txt 2 data/out
# Map two parts
for i in 1 2; do wordcount/map.py data/out.$i data/out.map.$i ; done
# Check what we got
ls data
# Combine map output together
cat data/out.map.* > data/map.out
# Reduce to results
wordcount/reduce.py data/map.out data/loremipsum.res
cat data/loremipsum.res
rm data/*out*

Running in Docker

SSH to any of the Docker boxes, e.g. ssh st2.my.dev. Move to the project directory (it is /vagrant/ in the dev environment; adjust accordingly).

Build the Docker images. From the current directory, run:


docker build -t split -f Split.Dockerfile .
docker build -t map -f Map.Dockerfile .
docker build -t reduce -f Reduce.Dockerfile .

Manually run the map-reduce with containerized functions:

# Split the input file into 2 chunks
docker run -it -v /vagrant/share/:/share split /share/loremipsum.txt 2 /share/out
# Run map on the two chunks
for i in 1 2; do docker run -it -v /vagrant/share:/share map /share/out.$i /share/map.out.$i; done
# Combine map output in one file
cat /vagrant/share/map.out.* > /vagrant/share/map.out
# Reduce to results
docker run -it -v /vagrant/share:/share reduce /share/map.out /share/loremipsum.res
# See the answer:
cat /vagrant/share/loremipsum.res
# Check that it's correct
diff /vagrant/share/loremipsum.res /vagrant/apps/wordcount/data/loremipsum.res
# Clean up intermediate files
rm /vagrant/share/*out*
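
`diff` compares the two result files line by line, so it also reports differences in line order. If words with equal counts come out in a different order between runs (an assumption, not something this README states), an order-insensitive comparison may be more convenient. A minimal sketch, assuming one `word count` pair per line as in the sample output above; `read_counts` and `same_counts` are hypothetical helpers, not part of the project:

```python
from typing import Dict


def read_counts(path: str) -> Dict[str, int]:
    """Parse a result file with one `word count` pair per line."""
    counts: Dict[str, int] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            word, n = line.rsplit(None, 1)  # the count is the last field
            counts[word] = int(n)
    return counts


def same_counts(path_a: str, path_b: str) -> bool:
    """True when both files hold the same counts, regardless of line order."""
    return read_counts(path_a) == read_counts(path_b)
```

For example, `same_counts("/vagrant/share/loremipsum.res", "/vagrant/apps/wordcount/data/loremipsum.res")` would compare the two files from the `diff` step; run it before the clean-up step if the intermediate files are needed.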

Unit tests

Install pytest and run it from the project root.
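
As an illustration of the kind of test pytest collects, here is a small sketch. It does not reproduce the project's actual tests: the file name is hypothetical, and `count_words` is a stand-in defined inline rather than imported from the wordcount scripts.

```python
# test_wordcount_sketch.py (hypothetical file name)
from collections import Counter


def count_words(text: str) -> Counter:
    """Stand-in for the code under test; not the project's actual function."""
    return Counter(text.lower().split())


def test_counts_repeated_words():
    assert count_words("a b a") == Counter({"a": 2, "b": 1})


def test_split_then_merge_matches_single_pass():
    # Counting two parts separately and summing must equal one pass over both.
    left, right = "a b a", "b c"
    merged = count_words(left) + count_words(right)
    assert merged == count_words(left + " " + right)
```

pytest discovers functions named `test_*` in files named `test_*.py`, so running `pytest` in the directory that holds such a file executes both checks.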