Course preparation

In preparing for the course, we advise that you invest significant time in exercising your Python coding skills. Once classes start, we expect you to have a basic understanding of Python and some of its core data science packages.

We use the Python programming language (version 3.6) in this course, but the course is not about Python. It is about analytical tools and methods for working with social data. To that end, Python is just a tool, like the ability to write English if you are taking a poetry class. Learning Python from scratch as we go is therefore a strategy that will not work. Learning to code is hard and there is no way to avoid the effort. However, if you get started early, it will not only enhance your learning experience but also save you lots of time down the road. If you already know a scientific programming language like Matlab, R or Stata, you may have an easier time, but learning Python will still require effort.

To get started, head over to the preparation page for the SDS summer course. The teaching material from the summer course is available here, where we cover all the recommended knowledge and skills (see below). Some additional external resources are also listed below.

The most essential skills are basic Python and using Jupyter. We expect knowledge that covers basic data types, data structures and iterative procedures. For the first SDS summer lecture on this topic, see slides/notebook. Besides our summer course, there are many good Python introductions, for instance python-course.eu.
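As a rough yardstick for what "basic data types, data structures and iterative procedures" means here, the following minimal sketch should read easily before the course starts. All variable names and values are made up for illustration and are not taken from the course material.

```python
# Basic data types
course = "Social Data Science"   # str
n_weeks = 3                      # int
pass_rate = 0.85                 # float
is_ready = True                  # bool

# Core data structures
scores = [72, 88, 95, 61]                  # list: ordered and mutable
point = (55.68, 12.57)                     # tuple: ordered and immutable
packages = {"pandas", "numpy", "pandas"}   # set: duplicates are dropped
word_counts = {}                           # dict: maps keys to values

# Iteration with a for loop: count how often each word occurs
for word in "to be or not to be".split():
    word_counts[word] = word_counts.get(word, 0) + 1

# Iteration with a list comprehension: keep scores of at least 70
passed = [s for s in scores if s >= 70]

# Iteration with a while loop: accumulate a sum, then take the mean
total = 0
i = 0
while i < len(scores):
    total += scores[i]
    i += 1
mean_score = total / len(scores)
```

If each line is clear to you, including why `packages` ends up with two elements, you probably have the basics covered.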

The most relevant data science packages are pandas, seaborn, numpy, scipy and scikit-learn. It is not an absolute necessity that you know all the points below before the first class, but you are expected to fill in any gaps yourself during the first three weeks. The most relevant skills are:

Data structuring with pandas, e.g. input/output, basic arithmetic operations, using groupby and joining datasets. A great short overview can be found in Greg Reda's three-part intro here.
Plotting with seaborn using exploratory plots, e.g. distplot, barplot, regplot, pairplot. It may also be relevant to look at the underlying plotting framework, matplotlib.
Using scikit-learn to apply and evaluate basic machine learning methods, e.g. lasso and random forest with cross-validation.
If you have time, we encourage you to investigate numpy, the standard package for array and matrix computation in Python.
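To make the pandas item above concrete, here is a minimal sketch of the operations it lists: arithmetic on columns, groupby, joining datasets and input/output. The two tables, their column names and their values are hypothetical toy data, not course data.

```python
import io

import pandas as pd

# Hypothetical toy data: one row per person, and a lookup table of cities
people = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "city": ["Copenhagen", "Aarhus", "Copenhagen", "Odense"],
    "income": [320, 280, 410, 300],
})
cities = pd.DataFrame({
    "city": ["Copenhagen", "Aarhus"],
    "region": ["Hovedstaden", "Midtjylland"],
})

# Basic arithmetic: each person's share of total income
people["income_share"] = people["income"] / people["income"].sum()

# groupby: mean income per city
mean_income = people.groupby("city")["income"].mean()

# Joining datasets: a left join keeps every person, also those whose
# city is missing from the lookup table (their region becomes NaN)
merged = people.merge(cities, on="city", how="left")

# Input/output: write to CSV text and read it back
csv_text = people.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(csv_text))
```

The left join is chosen here so that no rows are silently dropped; with `how="inner"` the person from Odense would disappear from `merged`.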

For more info on these packages, check out their GitHub repos and have a look at the documentation.
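As an illustration of the scikit-learn skill listed above, the sketch below fits a lasso and a random forest and compares them with 5-fold cross-validation. The data is simulated, and the hyperparameters (`alpha=0.1`, `n_estimators=50`) are arbitrary choices for the example, not recommendations from the course.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Simulated data (an assumption for illustration): five features,
# of which only the first two affect the outcome
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "lasso": Lasso(alpha=0.1),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Evaluate each model by its mean out-of-sample R^2 across five folds
cv_scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[name] = fold_scores.mean()

for name, score in cv_scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Because the simulated relationship is linear, the lasso would be expected to do well here; on real social data the ranking of the two methods can differ.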

A good sanity check is that you are able to finish the following exercises with little effort:

basic Python and Jupyter (exercises, solutions)
the in-class pandas exercise from the summer course (exercises, solutions)
