- Assigned: January 19th
- Due: January 21st 8:40 AM
The goal of this assignment is to set up a Linux virtual machine and Postgres database running in Microsoft Azure.
Many of the assignments in this class will use Microsoft's cloud computing infrastructure. Using a cloud service like Microsoft (or Amazon, etc) makes it easy to share data, and run any number of virtual machines that are identical for all students in the class. For this homework, we will use a free instance, but we will provide educational credits for future assignments.
Caution: running services (e.g., virtual machines) on any cloud provider costs money or credits. It can be very easy to spend more than you anticipated by leaving your services running. Make it a habit to stop your services when not in use. If you run out of credits, there's not much we (the staff) can do.
Signup
Register for a free trial account
You may be asked to submit your credit card information for screening purposes, however you will not be charged as the service will stop your account if your trial credits are used up. Once the class registration has settled down, we will provide you with information to use the class's Azure credits.
Launch an instance
- Go to https://portal.azure.com/.
- Click "Resource groups" in the top left of the screen.
- Click "Add".
- Type any name (e.g. "w4111").
- Change "Resource group location" to "East US".
- Click Create.
- Click "+ New" in the top left of the screen.
- Click "Compute", then select Ubuntu Server 14.04 LTS under Featured Apps.
- Click Create (leave "deployment model" at the default: "Resource Manager").
- Click on "Basics".
- In "User name" type the name for your account on your own machine, and fill in a password (please use a "good" one).
- In "Resource group" click the arrow on the right hand side and select the group you created in step 5.
- Click OK at the bottom.
- In the "Size" section, choose "A1 Standard" on the right.
- In the "Settings" section click "OK" at the bottom.
- In the "Summary" section click "OK". This will display a message that it is creating your machine.
- After it starts, you will see a number in "Public IP" like 40.76.207.44, this is your machine's address.
SSH to Your Instance
Windows Users: You can use PuTTY to connect to your VM, or install Cygwin and follow the direction below.
Using a terminal program (e.g, Terminal on Mac, xterm on Athena, or Cygwin on Windows), type the following (replacing the words in <angle brackets>
with the values from above).
ssh <username>@<IP address>
It will first ask to verify the identity of the server with the following message:
The authenticity of host 'hellodb.cloudapp.net (168.61.188.128)' can't be established.
RSA key fingerprint is 30:6f:12:df:b0:a7:4f:cc:f6:ee:19:a6:11:e5:2a:e9.
Are you sure you want to continue connecting (yes/no)?
Type "yes" and press enter. You will only need to do this once. It will then ask for your password. After you enter it, you should see something like:
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.19.0-25-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information as of Wed Aug 19 04:28:23 UTC 2015
System load: 0.27 Processes: 107
Usage of /: 4.1% of 28.80GB Users logged in: 0
Memory usage: 4% IP address for eth0: 10.0.0.4
Swap usage: 0%
Graph this data and manage this system at:
https://landscape.canonical.com/
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
Last login: Wed Aug 19 04:28:23 2015 from columbia.edu
ej@hellodb:~$
Install additional software
Ensure the following packages are available using the Ubuntu package management tool apt-get.
To install a package, type:
sudo apt-get install <packagename1 packagename2 ...>
Use this command to install the following packages:
- postgresql-9.3
- postgresql-server-dev-9.3
- sqlite3
- git
- python-virtualenv
- python-dev
Setup Python:
Python uses its own package manager to install/update/remove packages. In general, the following installs python packages:
pip install <packagename>
Typically the package manager will require sudo
and install the packages in a global folder that affects everyone using your machine. This is bad hygiene because different python applications may use different versions of packages and it's easy to step on each other's toes.
We will use virtualenv
to create virtual environments that contain their own copies of python
and packages. When we work in a virtual environment, pip will install packages local to the environment rather than globbaly. You can read a detailed tutorial. We installed the virtualenv
command with apt-get above.
Lets setup your environment
-
Install some helper commands from the
virtualenvwrapper
package (this is the one time you should install globally)sudo pip install virtualenvwrapper
-
add the wrapper commands (you may add this line in
~/.bashrc
so it runs when you create a bash shell)source /usr/local/bin/virtualenvwrapper.sh
-
create a new environment (will create a folder
test/
in~/.virtualenvs/
)mkvirtualenv test
-
switch (activate) an environment by using
workon
workon test
-
switch out of an environment:
deactivate
Now let's install a set of useful packages into your environment:
- Activate your environment (see above)
- install the following packages using
pip
(see above)- flask
- psycopg2
- sqlalchemy
- click
- Deactivate and you're done
Sign up for GitHub
GitHub is a source control repository website that many in industry use to manage their projects.
In fact, this class is organized as repositories under the organization 4111: http://www.github.com/w4111.
Each assignment is managed as a separate repository.
As the course progresses, more repositories will be made available.
Checkout the homework 0 repository
To work on an assignment, it's a good idea to "fork" the repository into your own GitHub account and complete the assignment there. Let's try this with the homework 0 repository at https://github.com/w4111/hw0
-
To start, fork the repository
-
Clone the repository to your Azure VM
-
Once you modify your files, commit the changes.
-
If you want to share it with a teammate, or just want to make sure GitHub has a copy of your files in case the VM crashes, Push the changes to GitHub.
-
If your teammate pushed to the same repository on GitHub, pull the changes from GitHub by typing within the repository:
git pull
Let's make sure you have access to Python, sqlite3, and the git repository.
Python
Type python
and ensure that you see the following like (the Python version may be slightly different 2.7.X):
Python 2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Then try importing some modules from the packages we installed
>>> import flask
>>> import psycopg2
>>> import sqlalchemy
>>> import click
If that worked, push ctrl+d
to exit the prompt.
sqlite3
SQLite is an "embedded" SQL database (it doesn't depend on a dedicated server process; instead the client just manipulated a stored datbase file directly.)
To ensure it is installed, type sqlite3
and verify that you see the following:
SQLite version 3.7.15.2 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
If you do, press CTRL-D
to exit the prompt (or type .quit
).
PostgreSQL
We will use the PostgreSQL DBMS in this class. Check that the client program works:
psql --version
If it prints something like the following then it works
psql (PostgreSQL) 9.3.9
git repository
Type cat hw0/README.md
You should see the instructions for this hw fly by.
To complete this homework, go into your hw0/
directory. There should be a file called
iowa-liquor-sample.csv
. The state of Iowa released
a dataset containing all sales transactions at alcoholic beverage stores during 2014. We will use
this dataset for many assignments in this course. Since it contains over 3 million records, this is
a small sample.
Disclaimer: this course does not condone drinking, we are using this dataset because it is a common format for a sales transaction log in a silghtly more accessible domain than typical bank transactions
Write a python script that reads the file and computes the number of records (in this file, each line is a record) that contain the exact case insensitive phrase "single malt scotch". Ignore upper and lower casing, so "Single Malt Scotch", and "SINGLE Malt Scotch" all match, whereas "Single's Malty Scootch" does not.
We recommend writing your program on your local machine, in your favourite text editor (I use Sublime Text but there are many other good ones). You can then copy this file to your VM with scp: scp <filename.py> <username>@<ip address>:
. Each time you edit it, you need to copy it, which is a bit annoying, but does work.
Submit your assignment at http://goo.gl/forms/SQzmn4T4xd
Note that you must be logged in to Columbia's lionmail to submit, and we will only consider your first submission.
Whew, you're almost done! Go read the assigned readings.
While your free trial is pretty good, it's very easy to accidentally use up all of your credits (including those the course will provide you). To conserve your hours (and avoid wasting energy), make sure to turn off your machine whenever you are not using it.
- Go to the Azure Console.
- In the left panel, click on "Virtual Machines".
- Select your VM.
- Click "Stop" in the menu at the top of the screen, then click "Yes".
- To resume the VM, click "Start" in the menu.
- Before doing future assignments, you'll have to follow these instructions and choose "Start" to restart your instance. Note that Shutting down and Starting the instance will potentially change the "Public DNS" value for that instance, however the URL .cloudapp.net will stay the same.
NOTE: Turning off your machine from the command prompt does not work on Azure (this seems to be "by design" which seems ... strange).