1, 2 & 4 July 2019, Cambridge University Bioinformatics Training
Instructors: Hugo Tavares & Sandra Cortijo
Helper: Martin van Rongen
This is a general introduction to R for exploratory data analysis.
Our practicals will be very hands-on, focusing on learning the syntax you need to do exploratory data analysis in R, from data manipulation to visualisation. We will focus on tabular data, which is general enough to allow you to apply these skills to a wide range of problems. On the third day we will go through a more complex example using transcriptomic data.
Below, we provide links to detailed materials for your reference, many of which were developed by the Data Carpentry organisation.
If you have any questions please post a new issue on our GitHub repository.
All necessary software and data will be available on the training machines at the Bioinformatics Training Room (Craik-Marshall Building).
However, you are welcome to use your own laptop, in which case you need to:
- Download and install R (here)
- Download and install RStudio (here)
- Install the CRAN R packages tidyverse, corrplot, cowplot and ggfortify (open RStudio and go to Tools > Install Packages)
- Install the Bioconductor R package ComplexHeatmap (instructions here)
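If you prefer the R console to the RStudio menus, the package installation steps above can be sketched as follows. This is a minimal sketch and not part of the official course instructions: the helper names `missing_packages()` and `install_course_packages()` are hypothetical, and it assumes that ComplexHeatmap is installed through the BiocManager package, which is the usual route for Bioconductor packages.

```r
# CRAN packages listed in the setup instructions
cran_packages <- c("tidyverse", "corrplot", "cowplot", "ggfortify")

# Hypothetical helper: return the packages from `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Hypothetical helper: install whatever is still missing
install_course_packages <- function() {
  to_install <- missing_packages(cran_packages)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }

  # Assumption: ComplexHeatmap is installed via BiocManager
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
    BiocManager::install("ComplexHeatmap")
  }
}

# Only install when run from an interactive session (e.g. the RStudio console)
if (interactive()) {
  install_course_packages()
}
```

Afterwards, calling `missing_packages(cran_packages)` again should return an empty character vector if everything installed correctly.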
This lesson will cover the basics of using R with RStudio and how to produce a wide range of graphs for data visualisation.
- Introduction to RStudio
- Introduction to R
- Starting with data
- Data visualisation using ggplot2 (part I)
This lesson will cover some functions to effectively manipulate and summarise tabular data and we will learn more about data visualisation.
Digital data recording often starts with spreadsheet software (e.g. Excel). For an effective data analysis, it's crucial to start with a well-structured and well-formatted dataset. Because of this, we will have a brief discussion about common issues that should be considered when recording data.
Further reading:
- Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10
- Hadley Wickham (2013) Tidy Data, Journal of Statistical Software, 59:10
In this session we will apply the concepts learned so far to a worked example of an exploratory data analysis of transcriptomic data.
Further reading:
- Conesa et al. (2016) A survey of best practices for RNA-seq data analysis, Genome Biology 17, 13
- Jake Lever, Martin Krzywinski & Naomi Altman (2017) Principal component analysis, Nature Methods 14, 641–642
- Naomi Altman & Martin Krzywinski (2017) Clustering, Nature Methods 14, 545–546
- One page summary of functions
- Summary of R basics
- Summary of dplyr functions and their equivalent in base R
- Cheatsheets for dplyr, ggplot2 and more
- Data-to-Viz website with great tips for choosing the right graphs for your data
Reference books:
- Holmes S, Huber W, Modern Statistics for Modern Biology - covers many aspects of data analysis relevant for biology/bioinformatics from statistical modelling to image analysis.
- Peng R, Exploratory Data Analysis with R - a more general introduction to exploratory data analysis techniques.
- Grolemund G & Wickham H, R for Data Science - a good follow up from this course if you want to learn more about tidyverse packages.
- McElreath R, Statistical Rethinking - an introduction to statistical modelling and inference using R (a more advanced topic, but written in an accessible way to non-statisticians).
- Also see the lecture materials, which include access to the draft of the book's second edition.
- James G, Witten D, Hastie T & Tibshirani R, Introduction to Statistical Learning - an introductory book about machine learning using R (also an advanced topic).
- Also see this course material for a practical introduction to this topic.
Other courses at Cambridge:
- List of scheduled courses
- Some particular courses that might be of interest:
- Note that you do not need to attend the "Intro to R" courses, because we've already covered that material in this course.