GOAL
To set up a basic prototype for distributed computing in Docker. If time permits, add a complex computing task.
INTRODUCTION
For Distributed (Scientific) Computing, scientists across the world have mostly relied on pre-configured VM images that let volunteers contribute to micro-processing tasks, in which raw data is received in chunks over the network and processed on the client.
But since the introduction of Docker, life has changed, and so have the performance benchmarks. We propose a system that uses the benefits of Docker to hopefully perform far better than what has been achieved so far with VMs. VMs carry a large start-up overhead compared to Docker containers. Moreover, every VM on a host OS runs a full guest OS of its own, whereas multiple Docker containers on that same machine share the host kernel. See the point? :)
-
Youtube Video Explaining this project:
-
Click here for Screencast for running just one script on client side.
INSTALLATION
-
Server side (src/server/):
-
Make sure that your 'src/server/' is up and running, either locally (for testing) or deployed elsewhere. If it is deployed elsewhere, provide its hostname/IP and port in the environment variables $DC_HOST and $DC_PORT (refer to the next major point, 'Client side', for the script that uses them).
-
Running the server requires MongoDB. Install it (refer to the install_mongo guide), then set the env variables
U_DB, U_USER and U_PASS
to the db name, its user and that user's password, as chosen while setting up MongoDB.
-
Ensure that you've installed the dependencies from
dockerComp/src/server/requirements.txt
-
To run the src/server, open up a terminal, go to dockerComp/src/server/ and run
$ ./start
This starts the server locally on your machine.
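The server-side steps above can be collected into one shell session. This is only a sketch under stated assumptions: the values given to U_DB, U_USER and U_PASS are placeholders, the helper names missing_vars and start_server are hypothetical, and installing requirements.txt with pip is assumed from the file name rather than confirmed by the project.

```shell
# Sketch of the server-side setup; every value below is a placeholder.

# MongoDB settings read by the server (variable names from this README).
export U_DB="dockercomp_db"      # placeholder database name
export U_USER="dockercomp_user"  # placeholder database user
export U_PASS="change-me"        # placeholder password

# Print the name of each required variable that is unset or empty.
missing_vars() {
    for name in U_DB U_USER U_PASS; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Install the dependencies and start the server (hypothetical helper).
# Call it from the directory that contains the dockerComp checkout.
start_server() {
    missing=$(missing_vars)
    if [ -n "$missing" ]; then
        echo "missing variables:" $missing >&2
        return 1
    fi
    # Subshell so the caller's working directory is left unchanged.
    (
        cd dockerComp/src/server || exit 1
        # Assumption: requirements.txt is a pip requirements file.
        pip install -r requirements.txt || exit 1
        ./start
    )
}
```

Calling start_server refuses to launch until all three variables are set, which catches a forgotten credential before the server tries to reach MongoDB.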
-
-
Client side (src/client/):
-
Note: The server (i.e., the machine responsible for distributing data to clients) has to be deployed somewhere, and its IP has to be provided in your configuration file. You may then distribute the script installer.sh along with the configuration to the clients.
-
Download This Script and run
$ ./installer.sh
[configure your server location for this script through $DC_HOST and $DC_PORT]
- Once installed, the daemon output would lie in $HOME/dockerComp/src/client/scripts/nohup.out and the daemon itself would live in $HOME/dockerComp/src/client/scripts/slave_manager. To kill the daemon, you need to run
$ kill -9 $(ps -e | grep slave_manager | awk -F' ' '{print $1}')
-
Should you need to remove all traces of dockerComp from your machine, just run the script 'cleanup' included in the root of this project's source code.
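For day-to-day handling of the client daemon described above, a couple of small helpers may be convenient. This is a sketch, not part of the project: the function names show_daemon_log and stop_daemon are hypothetical, the paths are the ones quoted in this README, and pgrep (from procps) is assumed to be installed. Unlike the one-liner above, stop_daemon sends SIGTERM first and only falls back to kill -9 if the process is still alive.

```shell
# Typical session (host and port values are placeholders):
#   export DC_HOST=203.0.113.10 DC_PORT=8000
#   ./installer.sh

# Paths taken from this README.
DC_CLIENT_DIR="$HOME/dockerComp/src/client/scripts"
DC_LOG="$DC_CLIENT_DIR/nohup.out"

# Show the last N lines (default 20) of the daemon output, if the log exists.
show_daemon_log() {
    if [ -f "$DC_LOG" ]; then
        tail -n "${1:-20}" "$DC_LOG"
    else
        echo "no log at $DC_LOG" >&2
        return 1
    fi
}

# Stop the daemon: try SIGTERM, wait briefly, then SIGKILL as a last resort.
# pgrep -x matches the exact process name, like the ps/grep pipeline above.
stop_daemon() {
    pids=$(pgrep -x slave_manager) || { echo "slave_manager is not running"; return 0; }
    kill $pids 2>/dev/null
    sleep 2
    pids=$(pgrep -x slave_manager) && kill -9 $pids 2>/dev/null
    return 0
}
```

Giving the process a chance to exit on SIGTERM lets it clean up, while the fallback keeps the behaviour of the original command.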
Cheers! :)
NOTES
-
Demo link to be updated soon.
-
In case you're curious how to go about running this from client side:
- So once the server is up and running, all one has to do is download and run installer.sh
-
Docker Image:
$ docker pull arcolife/docker_comp
(will be kept updated)
FAQ
Refer to the Wiki.
References:
- http://www.rightscale.com/blog/sites/default/files/docker-containers-vms.png
- http://en.wikipedia.org/wiki/Docker_%28software%29#cite_ref-3
To give you some context for this whole project, take a look at CernVM. It is a really awesome project, developed to help collect CERN's LHC data and perform data analysis on a volunteer's computer or even on commercial clouds. Just imagine if the whole process of using VMs were dockerized!
FEATURES
- Can be used for:
  - Image Processing
  - General Data Analysis
  - Scientific Computing
  - CrowdSourcing projects.
FUTURE GOALS
-
Make this a pluggable dockerized distributed computing tool, where you just have to include a computation task (say, map-reduce) and make it send data to clients. The app should be able to handle the rest.
-
Benchmark results and compare with existing methodologies.
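To make the idea of a pluggable computation task more concrete, here is a tiny map-reduce word count. It is purely illustrative and not taken from the project: the function names are hypothetical, and the split between a client-side map step and a server-side reduce step is an assumption about how such a task might be divided.

```shell
# Map step (would run on a client): count the words in one chunk read
# from stdin and emit "word count" pairs, one per line.
map_wordcount() {
    tr -s '[:space:]' '\n' | grep -v '^$' | sort | uniq -c | awk '{print $2, $1}'
}

# Reduce step (would run on the server): merge the pairs produced by all
# clients by summing the counts for each word.
reduce_wordcount() {
    awk '{total[$1] += $2} END {for (w in total) print w, total[w]}' | sort
}
```

For example, piping two chunks through map_wordcount separately and feeding both outputs into reduce_wordcount yields one combined count per word, which is the property a distributed task needs: partial results can be computed independently and merged afterwards.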
TESTS
-
From client side:
- A default connection establishment test is already included with the install scripts. To run it on its own, run
$ src/client/scripts/test_server_conn
(make sure the env vars DC_HOST and DC_PORT are set)
-
From server side:
- TBD
-
Workloads:
- Currently a simple task. TBD.
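The client-side connection check mentioned above can also be approximated by hand. The contents of test_server_conn are not shown here, so the following is only a sketch of an equivalent check: the function name check_server_conn is hypothetical, and it assumes that bash and the coreutils timeout command are available on the client.

```shell
# Return 0 if a TCP connection to $DC_HOST:$DC_PORT succeeds within the
# given number of seconds (default 5); return 2 if the variables are
# missing, and another non-zero status if the connection fails.
check_server_conn() {
    if [ -z "${DC_HOST:-}" ] || [ -z "${DC_PORT:-}" ]; then
        echo "DC_HOST and DC_PORT must be set" >&2
        return 2
    fi
    # bash's /dev/tcp pseudo-device opens a TCP socket; timeout bounds the wait.
    # Inside bash -c, $0 is the host and $1 is the port passed after the script.
    timeout "${1:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$DC_HOST" "$DC_PORT" 2>/dev/null
}
```

A typical use would be `check_server_conn && echo "server reachable"`. Note that this only confirms that the port accepts connections, not that the dockerComp server behind it is healthy.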
WORKFLOW
-
Server
-
Dashboard to Manage:
- No. of Clients (and # of containers per client)
- Resources allocated to the containers
-
Master app that manages data sent to each client and checks for integrity.
-
-
Client
- Installation of Docker
- Starting Containers
- Installation of Application inside the Container
- Connection Establishment with the Server.
- Scripts for the computation
- Error Reporting
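The workflow above says the master app checks the integrity of the data sent to each client, without saying how. One common way to do this is with per-chunk checksums, sketched below. This is an assumption, not the project's confirmed mechanism: the function names make_chunks and verify_chunks and the MANIFEST.sha256 file name are hypothetical, and GNU coreutils (split, sha256sum) is assumed.

```shell
# Server side: split a data file into fixed-size chunks and record a
# SHA-256 checksum for each chunk in a manifest.
#   $1 = input file, $2 = output directory, $3 = chunk size (default 1M)
make_chunks() {
    mkdir -p "$2" || return 1
    split -b "${3:-1M}" "$1" "$2/chunk_" || return 1
    ( cd "$2" && sha256sum chunk_* > MANIFEST.sha256 )
}

# Client side: verify every received chunk against the manifest.
# Returns non-zero if any chunk is missing or has been altered.
#   $1 = directory holding the chunks and MANIFEST.sha256
verify_chunks() {
    ( cd "$1" && sha256sum --quiet -c MANIFEST.sha256 )
}
```

The server would ship MANIFEST.sha256 alongside the chunks; a client that sees verify_chunks fail can report the error and request the affected chunk again instead of computing on corrupted data.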
REFERENCES
- https://github.com/cernvm
- http://en.wikipedia.org/wiki/List_of_distributed_computing_projects
- http://www.rightscale.com/blog/sites/default/files/docker-containers-vms.png
- http://www.psc.edu/science/
- http://pybossa.com/
- https://okfn.org/press/releases/crowdcrafting-putting-citizens-control-citizen-science/
- http://www.mediaagility.com/2014/docker-the-next-big-thing-on-cloud/
- http://cernvm.cern.ch/portal/
- http://www.nature.com/news/software-simplified-1.22059