This repository presents a lightweight demo of HTCondor cluster on GCP (Google Cloud Platform). Such a cluster can be applied to many fields' work. We do only stocks price simulation with random walk process and Monte-Carlo in this repository.
This repository got expired by an example from Google Cloud Solution (link: https://cloud.google.com/solutions/analyzing-portfolio-risk-using-htcondor-and-compute-engine), which deploys a high-throughput-computing (htc) cluster with HTCondor and do simulation for four stocks (AMZN, GOOG, FB, NLFX) with random walk process and Monte-Carlo (1000 simulations for each stock).
-
We would go through all companies listed on Nasdaq, the current amount is 8856 (updated per day). For each stock, it will do 1000 random-walk simulations, it says near nine million simulations all together. By my Macbook pro (2.8 GHz Core i7, 16 GB RAM), the whole process takes over three hours. We will see the HTC cluster on GCP does such work in just minutes.
-
The original example fixes size of the HTC-cluster as six virtual machines: one central manager, one jobs submitter and four workers. In this repository, users can set number of the machines. Though, please estimate cost of the machines. For example, I used 400 predefined n1-standard-1 virtual machines and 400 preemptible n1-standard-1 virtual machines. The cluster took around 3 minutes to do simulation for all 8851 companies listed on Nasdaq, current price for n1-standard-1 is 0.0475$ and 0.01$ per machine per hour for predefined and preemptible respectively, then cost for the 3 minutes work was 1.15$ (= 3 x 400 / 60 x (0.0475 + 0.01)), in addition, it costed money for used RAM of those machines.
-
We can speed further up by using more VMs and more advanced VMs. For example, it will run faster if it allocates n1-standard-4 instead of n1-standard-1. But the cost will fly up. Parallel computing helps. Since we use n1-standard-4 VMs, we should run computation in 4 parallel processes. In my recent test, it allocated eight n1-standard-4 VMs, did 4-processing in each VMs. It took only 290 seconds. While it took 360 seconds with 32 n1-standard-4 VMs but single processing per machine.
If you are interested in the comparison between HTC-cluster with and without parallel computing. Please go through this https://medium.com/@bo.huang_61578/high-through-put-cluster-with-parallel-computing-on-google-cloud-platform-5c2f54d3212e?source=friends_link&sk=91d07789afa1f5a0bf1fce264addd9a8
The above image shows the architecture of the HTC-cluster and its working process. Let's go through it block by block from left to right.
All relevant code and simulation results will be stored in Cloud Storage. We would build a bucket in Cloud Storage for this project. We would build reusable virtual machines images and leave them on Google Cloud Platform. It would be very easy to create the HTC-cluster when we need it, and destroy it after the work to stop money counting.
With the prepared virtual machine images and coding files, the HTConder-cluster can be created by one line of command. This cluster consists of one central manager machine (condor-master), one submitter machine (condor-submit) and several worker machines (condor-compute).
Once we trigger the cluster, submitter machine will distribute the jobs to worker machines and the manager would cover the scheduling things. If some of the workers are stuck or lie down, the relevant jobs will be rescheduled to other workers by the manager machine.
Each worker machine will run the assigned codes with assigned data, deliver results to Cloud Storage.
Now let's implement this repository step by step. The blocks in gray is what we will do with the Cloud Shell's terminal.
First of all, please make sure that you have created project and enabled billing on Google Cloud Platform.
We highly recommend to use GCP's Cloud Shell with Linux as your platform. Please install Cloud-SDK into your own Linux system if you would not use GCP's Cloud Shell. OSX is not a good choice because of the Makefile.
In the following steps, it assumes that we work on the Cloud Shell.
user_name@cloudshell:~ (your project)$ git clone https://github.com/BoHuang2018/hpc-htc-demo-stocks.git
Or,
user_name@cloudshell:~ (your project)$ git clone https://github.com/avalonsolutions/avalonx-stockpriceprediction.git
$ cd hpc-htc-demo-stocks
Or,
$ cd avalonx-stockpriceprediction
$ make upload bucketname=hpc-htc-demo-stocks
$ make createimages
$ make createcluster
Note the "properties" in the .yaml file (condor-cluster.yaml), we can increase the number of "count" (number of predefined virtual machines) and "pvmcount" (number of preemptible virtual machines) to hundreds and even thousands. Please estimate the cost before you use those big numbers.
Now we set 0 as 'count' and 8 as 'pvmcount', and the 'instancetype' is 'n1-standard-4'. It says we will use only 8 preemptible virtual machines in the type of n1-standard-4. The total number of virtual machines would be 10, because we need one for condor-master and one for condor-submit.
Now the high-throughput-computing cluster is ready, it's time to run it.
$ make ssh bucketname=hpc-htc-demo-stocks
$ gcloud compute ssh condor-submit --zone us-east1-b --command "python3 script_generator_runner.py --start_date=2017-06-01 --end_date=2019-06-01"
The terminal would print like
Submitting job(s)..........
There will be many '.' because there is a long sequence jobs what condor-submit needs to submit. After a while, the terminal print something about the jobs are finished and tell the total time consuming.
In your Storage's bucket, gs://hpc-htc-demo-stocks, there finds a new folder named as 'stock_simulation_based_on_2017-06-01_2019_06_01', which includes .csv files storing the simulation results.
If you are curious about what condor-submit do, you can load back to the condor-submit's terminal:
$ gcloud compute ssh condor-submit --zone us-east1-b
And use the HTCondor's function 'condor_q':
$ condor_q
Then it would list all jobs, tell you how many have be done, how many are on hold, and so on.
At the same time, there would finds files named like 'err.1' ... 'err.8800' ... 'out.', 'run.' which help to trackback and debug.
Please remember to delete all resource on GCP if you would not use the HTC-cluster for a while, or the billing from GCP will be heavy because the cluster using 34 virtual machines.
There are two ways to clean up
- Delete the whole project on the GCP console
- Destroy the HTC-cluster and delete the bucket in Cloud Storage (the bucket is not expensive) by GCP console
To destroy the cluster :
$ make destroycluster
The End.