-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnew-anno-prac.Rnw
557 lines (362 loc) · 27.8 KB
/
new-anno-prac.Rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
%\VignetteIndexEntry{Genomic Annotation Practical}
%\VignettePackage{GeneticsHTSCourse}
%\VignetteEngine{knitr::knitr}
% To compile this document
% library('knitr'); rm(list=ls()); knit('DESeq2.Rnw')
\documentclass[12pt]{article}
\newcommand{\usecase}{\textit{\textbf{Use Case: }}}
<<knitr, echo=FALSE, results="hide">>=
library("knitr")
opts_chunk$set(tidy=FALSE,dev="png",fig.show="hide",
fig.width=4,fig.height=4.5,
message=FALSE,eval=FALSE)
@
<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@
\title{Annotation and Visualisation of sequencing data in Bioconductor}
\author{Mark Dunning}
\date{Last modified: \today}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle
\tableofcontents
\section{Annotation using an online resource}
The \Biocpkg{biomaRt} package provides an interface to the \href{www.biomart.org}{{\color{blue}biomart}} website; allowing the data to be accessed in a consistent manner and without having write complex SQL queries or understand the underlying schema. The databases available through BioMart can be obtained with the \Rfunction{listMarts} function
\subsection{General BiomaRt Usage}
\usecase
Connect to the default version of Ensembl through BiomaRt. What version of the human genome does it provide?
<<>>=
library(biomaRt)
listMarts(host="www.ensembl.org")
ensemblLatest <- useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl")
datasets <- listDatasets(ensemblLatest)
datasets[which(datasets[,1] == "hsapiens_gene_ensembl"),]
@
Unfortunately, our RNA-seq was aligned and annotated against \textit{hg19} (a.k.a GRCh37). Different versions of biomaRt and genome version can have different attributes and different underlying annotation. Therefore, if we annotate using a conflicting version of Ensembl our results may be misleading.
\usecase Connect to the human genome in the February 2014 archive of biomaRt. Verify that hg19 (/GRCh37) is being accessed.
<<>>=
ensembl_75 = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="feb2014.archive.ensembl.org",
path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
datasets <- listDatasets(ensembl_75)
datasets[which(datasets[,1] == "hsapiens_gene_ensembl"),]
@
The function \Rfunction{getBM} is used to build-up queries in biomart. We must specify one or more \textit{attributes} that we want to retreive along with the \textit{filters} that we are using the query the database.
\subsection{Annotating the RNA-seq results}
Having now connected to a suitable database through \Biocpkg{biomaRt}, we will now proceed to use the online resources to annotate our RNA-seq results.
\usecase
Read the results of the edgeR analysis. If you have a set of results from the RNA-seq practical, feel free to use these. Otherwise, you can find a set of results in the Day2 course directory. Take just the top 100 results to make the queries run quicker
<<>>=
load("Day2//topHits_FWER_0.01.RData")
head(topHits)
topHits <- topHits[1:100,]
@
\bioccomment{Loading the object {\tt topHits\_FWER\_0.01.RData} creates an object called {\tt topHits} in your workspace.}
The \Robject{topHits} data frame provides us with a list of genes that are found to be \textit{statistically significant}. However, it is a separate investigation to determine if the genes are \textit{Biologically significant}.
We can use \Biocpkg{biomaRt} to annotate the results of our RNA-seq analysis. Firstly, we have to recognise that the first column of the \Robject{topHits} data frame are Entrez IDs. We can then use these IDs as filters to access the database. When we want to access the database, it is important to know the exact name that \Biocpkg{biomaRt} uses for the filter.
\usecase What is the name of the filter for Entrez gene ID?
\bioccomment{We can search for particular text in the data frame of attributes using \Rfunction{grep}, \Rfunction{match} etc}
<<>>=
flt <- listFilters(ensembl_75)
head(flt)
flt[grep("entrez",flt[,1]),]
@
Then it is a matter of deciding what attributes we want to retrieve from the database. Again, we need to know the names that \Biocpkg{biomaRt} will accept for the particular attributes that we want.
\usecase What are the attribure names for HGNC symbol, chromosome name and description?
<<>>=
attr <- listAttributes(ensembl_75)
attr[1:50,]
attr[grep("symbol",attr[,1]),]
@
We now have everything we need in order to make a biomaRt query.
\usecase
Annotate the top hits from the DEseq analysis with their gene symbol and description. Make sure that you also include Entrez ID in the list of attributes that are returned.
\warning{You will need an internet connection in order for this to work}
<<>>=
extraInfo <- getBM(attributes = c("entrezgene","hgnc_symbol", "description"),
filters = "entrezgene",
values = topHits[,1],
mart = ensembl_75)
head(extraInfo)
@
You may notice that the results retrieved from biomaRt may not be in the same order as the RNA-seq results that we are trying to annotate. The \Rfunction{merge} function is useful in joining two data frames together and making sure that the rows match-up. In addition to the two data frames we are trying to merge, we also need to have a common identifier in both data frames (in our case, the Entrez ID). The \Rcode{by.x} and \Rcode{by.y} arguments are used to specify the column that this identifier can be found in.
\usecase Create a merged table containing both edgeR results and the annotations.
<<>>=
annotatedHits <- merge(topHits, extraInfo, by.x = 1, by.y = 1,sort=FALSE)
head(annotatedHits)
@
The resulting table, \Robject{annotatedHits}, can now be used to ease the interpretation of the RNA-seq analysis and can be shared with collaborators.
\subsection{Retrieving Gene Sequences}
In this section we will explore how to retrieve sequences for our most differentially-expressed genes. Such a set of sequences could be used to discover motifs, for example.
\usecase Retrieve the chromosome names, start and end positions and strand for the top hits. Restrict the data to just chromsomes 1 to 22 and the sex chromosomes.
<<>>=
posInfo <- getBM(attributes = c("entrezgene", "chromosome_name","start_position"
,"end_position","strand"),
filters = "entrezgene",
values = topHits[,1],
mart = ensembl_75)
posInfo <- posInfo[posInfo[,2] %in% c(1:22,"X","Y"),]
@
You should be aware that plenty of functionality exists to manipulate and analyse intervals in the form of \Robject{GRanges} objects. You'll see that the chromosome names from \Biocpkg{biomaRt} are numerical values, whereas other packages in Bioconductor expect chromosome names with the \textit{chr} prefix. Also, the strand is not given in a standard format and we need to convert these into '+' or '-' characters.
\usecase Create a GenomicRanges representation of these gene coordinates. You will need to make sure that the sequence names and strand information is acceptable.
\bioccomment{The \Rfunction{ifelse} function is a useful way of populating a vector based on a logical test. i.e. each value in the output vector is determined by the result of the logical test.}
<<>>=
library(GenomicRanges)
strand = ifelse(posInfo[,5] == 1, "+", "-")
genePos <- GRanges(paste0("chr",posInfo[,2]),
IRanges(posInfo[,3],posInfo[,4],names=posInfo[,1]),strand)
@
\bioccomment{You may sometimes find the \Rfunction{with} function used in these situations. It allows the columns from a data frame to be accessed by name, so the code is a bit more elegant. The result should be the same however.}
<<>>=
genePos <- with(posInfo, GRanges(paste0("chr",chromosome_name),
IRanges(start_position,end_position,names=entrezgene),
strand))
@
\usecase Load the package that provides genome sequence for hg19 and get the DNA sequences for the top hits. Translate the sequences into amino acids
<<>>=
library(BSgenome.Hsapiens.UCSC.hg19)
hg19 <- BSgenome.Hsapiens.UCSC.hg19
myseqs <- getSeq(hg19,genePos)
myseqs
translate(myseqs)
@
The GRanges object that we created can be manipulated using the functionality from the \Biocpkg{IRanges} package. For instance we can resize, extend, collapse the intervals.
\usecase Get the sequences 100 base-pairs upstream of each gene and write these to a \textit{fasta} file
<<>>=
promSeqs <- getSeq(hg19, flank(genePos,100))
writeXStringSet(promSeqs , file = "topGenesPromoters.fa")
@
\section{Annotation using pre-built packages}
Organism-level packages provide an alternative to \Biocpkg{biomaRt} and permit annotation queries offline. These packages are re-built every 6 months as part of the Bioconductor development cycle and are version controlled. Therefore it is easy to keep track of which version of the package was used for a particular analysis. Each package is built around a central identifier (e.g. Entrez ID) and meta data surrounding the annotation sources used can be displayed once such an organism package has been installed and loaded.
The terminology of {\tt columns} and {\tt keys} is equivalent to {\tt attributes} and {\tt filters} used by \Biocpkg{biomaRt}. The valid columns and keys can be interrogated using \Rfunction{columns} and \Rfunction{keytypes}.
\usecase Load the annotation package for Humans and display the metadata describing how the package was created. Find out what information can be retrieved from the package and what types of key can be used.
\bioccomment{For covenience, we often assign a shorter name to the annotation package.}
<<>>=
library(org.Hs.eg.db)
hs <- org.Hs.eg.db
columns(hs)
keytypes(hs)
@
For completeness, we note how to add extra information to our top hits using the pre-built organism packages rather than biomaRt. The \Rfunction{select} function is used to make a query to the organism database package. We have to supply a set of valid keys and columns, in a similar manner to how we used \Biocpkg{biomaRt}.
\usecase Repeat the annotation steps from Section 1.2 using the \Biocannopkg{org.Hs.eg.db} organism package rather than biomaRt
<<>>=
extraInfoFromOrg <- select(org.Hs.eg.db, keys = topHits[,1],
keytype = "ENTREZID",columns = c("SYMBOL","GENENAME"))
head(extraInfoFromOrg)
extraInfoFromOrg <- extraInfoFromOrg[order(extraInfoFromOrg[,1]),]
head(extraInfoFromOrg)
@
\bioccomment{If you want to table in the same order as the biomaRt results, you need to re-order according to the first column}
In the analysis of this RNA-seq dataset, we allocated reads to genes. However, in this section we will explore the ways of counting reads that align to other genomic regions of interest. In this case we will use \textit{exons} and have to re-visit the orginal aligned reads.
In order to keep the computational requirements for the analysis down file we will restrict the analysis of genes that occur on chromosome 22. We therefore need a list of the IDs of all such genes. We can do this using the \Biocannopkg{org.Hs.eg.db} organism package.
\usecase
Use the appropriate organism package to retrieve the Entrez IDs of all genes located on Human chromosome 22.
<<>>=
chr22Genes <- select(hs, columns = "ENTREZID", keytype = "CHR", keys = "22")
head(chr22Genes)
chr22ID <- chr22Genes[, 2]
@
\bioccomment{You should see a warning message about location-based queries in the \Biocannopkg{org.Hs.eg.db} organism package being deprecated. This means such functionality will be disabled in future releases. In the following sections we will look at obtaining location information in a more-rigorous manner.}
\warning{If you get the error message:\\ \Rcode{unused arguments (columns = "ENTREZID", keytype = "CHR", keys = "22")} \\ when trying to run the \Rfunction{select}, it is because another R package has defined another function called \Rfunction{select} and over-written the function from \Rpackage{AnnotationDBI} that we intended to use. If this occurs the statement \\ \Rcode{select <- AnnotationDbi::select} \\ should allow you to run the line of code.}
\subsection{Transcript packages}
Bioconductor provide a number of pre-built packages that have a database of all transcript information for a particular organism. The package we are going to use is \Biocannopkg{TxDb.Hsapiens.UCSC.hg19.knownGene}. A full list of available transcript packages is available on the Bioconductor \href{http://tinyurl.com/l7hsus5}{{\color{blue}website}}. Look for the packages that start with {\tt TxDb.}.
As its name suggests, \Biocannopkg{TxDb.Hsapiens.UCSC.hg19.knownGene} was built from the UCSC tables for the hg19 genome. Convenient functions exist that can return the gene structure as a \Rclass{GRanges} object. One such function is \Rfunction{exonsBy} which returns a list of \Robject{GRanges} objects, each item in the list pertaining to a particular gene. Thus, the output of \Rfunction{exonsBy} can be subset in the usual manner ({\tt '[[}').
\bioccomment{transcripts and cds regions can be selected in a similar fashion. See the help pages of \Rfunction{intronsBy} and \Rfunction{cdsBy}}
\usecase
Get all the locations of exons on chromosome 22, grouped according to gene. How many Genes are there on chromosome 22?
<<>>=
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx <- TxDb.Hsapiens.UCSC.hg19.knownGene
exo <- exonsBy(tx, "gene")
exo
length(exo)
exo22 <- exo[names(exo) %in% chr22ID]
length(exo22)
@
\bioccomment{If you cannot find a TranscriptDb package that provides the annotation you need, you can use the \Rfunction{makeTxDbPackage} function to create one from the UCSC browser, biomaRt or another source. See the help page \Rcode{?makeTxDbPackage}for more details.}
\bioccomment{Make sure that you understand how the ranges in this exon list are accessed. HINT: Why does {\tt exo22[[1]]} work, but {\tt exo22[["1"]] not}}
\usecase
Import the aligned reads for sample "16N" using \Biocpkg{GenomicAlignments}. Restrict the query to just chromsome 22 reads.
<<>>=
library(GenomicAlignments)
bam <- readGAlignments(file="Day2/bam/16N_aligned.bam",
param=ScanBamParam(which = GRanges("chr22",IRanges(1,51304566))),
use.name=TRUE)
@
We are now in a position to begin overlapping reads with exons. You should have already seen the \Rfunction{findOverlaps} function that will report the amount of ovelap between two sets of ranges. We will start simple by just doing an overlap of a single gene to make sure that we understand the output
Before performing an overlap operation, it is worth checking that the {\tt seqLevels} of the bam file and ranges are compatible.
\usecase Use the \Rfunction{countOverlaps} function to count the number of reads covering each exon of Entrez ID "4627".
<<>>=
seqlevelsStyle(bam)
seqlevelsStyle(exo22)
counts <- countOverlaps(exo22[["4627"]], bam)
counts
@
\Rfunction{summarizeOverlaps} extends \Rfunction{findOverlaps} by providing options to resolve reads that overlap multiple features. Each read is counted a maximum of once. Different modes of counting are available. See the help page for \Rfunction{summarizeOverlaps} for more details.
\usecase
Repeat the counting for all genes on chromosome 22
<<>>=
so <- summarizeOverlaps(exo22, bam)
gCounts <- assays(so)$counts
dim(gCounts)
head(gCounts)
@
\section{Visualisation}
\subsection{Exporting tracks}
It is also possible to save the results of a Bioconductor analysis in a browser to enable interactive analysis and integration with other data types, or sharing with collaborators. For instance, we might want a browser track to indicate where our differentially-expressed genes are located. We shall use the \textit{bed} format to display these locations.
At the moment, we have a \Robject{GenomicFeatures} object that represents every exon. However, we do not need this level of granularity for the bed output.
\usecase Use the \Rfunction{range} function to obtain a single range for every gene
<<>>=
geneRegions <- unlist(range(exo))
@
\bioccomment{Here the \Rfunction{unlist} function collapses the object into a single range interval for each gene}
We can now select the names of the statistically significant genes from the edgeR output in the usual manner.
\usecase
Select all significant genes from the edgeR analysis (lets use a p-value of 0.01), and then get the genomic intervals that correspond to these genes.
<<>>=
load("Day2//topHits_FWER_0.01.RData")
sigResults <- topHits
head(sigResults)
sigGeneRegions <- geneRegions[na.omit(match(sigResults[,1],names(geneRegions)))]
@
\bioccomment{For some reason, some of the genes reported in the edgeR output do not have associated genomic features. Hence we have to discard such genes from the output by using \Rfunction{na.omit}}
Rather than just representing the genomic locations, the .bed format is also able to colour each range according to some property of the analysis (e.g. direction and magnitude of change) to help highlight particular regions of interest. A \textit{score} can also be displayed when a particular region is clicked-on. A useful propery of \Robject{GenomicRanges} is that we can attach metadata to each range using the \Rfunction{mcols} function. The metadata can be supplied in the form of a data frame.The \Biocpkg{rtracklayer} function can be used to import and export browsers tracks and can make use of these data when creating a bed file for visualisation.
\usecase
Store the adjusted p-values and log fold-change in the metadata of the ranges object for the significant genes
<<>>=
mcols(sigGeneRegions) <- sigResults[match(names(sigGeneRegions), sigResults[,1]),]
@
\usecase Tidy-up the GenomicRanges object so that only chromosomes 1 to 22 and sex chromosomes are included
<<>>=
seqlevels(sigGeneRegions)
sigGeneRegions <- keepSeqlevels(sigGeneRegions, paste0("chr", c(1:22,"X","Y")))
@
\bioccomment{Here, \Rfunction{keepSeqLevels} is used as a convenience to retain only the chromosomes that we are interested in}
We can create a colour scheme from the fold-change values. i.e. so that large positive differences are dark blue, and large negative changes are dark red (or vice-versa)
\usecase
Create a vector where the entries are either red or blue depending on whether each fold change is negative or positive
<<>>=
logfc <- sigGeneRegions$logFC
Col <- ifelse(logfc < 0, "darkred","darkblue")
@
The colours and score can be saved in the \Rclass{GRanges} object in \Rcode{score} and \Rcode{itemRgb} columns respectively and will be used to construct the track when exporting.
\usecase
Export the signifcant results from the DE analysis as a \textit{.bed} track using \Biocpkg{rtracklayer}. You can load the resulting file in IGV, if you wish.
<<>>=
mcols(sigGeneRegions)$score <- logfc
mcols(sigGeneRegions)$itemRgb <- Col
mcols(sigGeneRegions)$name <- paste0("GeneID:",names(sigGeneRegions))
library(rtracklayer)
export(sigGeneRegions , con = "topHits.bed")
@
\bioccomment{In IGV, you could navigate to {\tt chr18:52258390-52266724} (the top hit in the analysis) to verify that the data are displayed correctly.}
The export of other formats such as \textit{gff}, \textit{bedGraph}, \textit{wig} and \textit{bigwig} is also possible with \Biocpkg{rtracklayer}. We do not need to explicitly tell the function what format to export the data in, it automatically detects this from the file name.
\subsection{Brief introduction to ggplot2}
\CRANpkg{ggplot2} is a popular package that is capable of generating sophisticated graphics. The key concept is that a plot is generated by adding a series of \textit{layers} onto a dataset that define how the data are transformed and displayed. To construct a plot in \CRANpkg{ggplot2}, we must have our data in a \textit{data frame} and be able to define the characteristics (\textit{aesthetics}) of the plot from the variables in the data frame.
A \href{https://en.wikipedia.org/wiki/Volcano_plot_(statistics)}{{\color{blue}volcano plot}} is commonly used to identify changes in high-throughput studies. The magnitude of change is shown on the x-axis and significance on the y-axis. In gene-expression studies, the significance measure is usually the \textit{log-odds} score (commonly-denoted as 'B' in limma and related projects).
For these plots, we will load up the edgeR results for \textit{all} genes and use ggplot2 to signify which are the differentially expressed genes.
\usecase Load the complete table of edgeR results, and create another column to denote some measure of significance; with increasing values denoting increasing evidence for differential expression
<<>>=
load("Day2/edgeRAnalysis_ALLGENES.RData")
head(y)
y$significance <- -10*log10(y$FDR)
edgeRresults <- y
@
\usecase Make a volcano plot from the edgeR results using ggplot2. Select appropriate columns from the data frame to map to the x and y axis.
<<>>=
library(ggplot2)
ggplot(edgeRresults, aes(x = logFC,y=significance)) + geom_point()
@
\bioccomment{To construct the plot, we have to specify what variables are mapped to the aesthetics on the plot using the \Rfunction{aes} function. However, this is not enough to produce the plot. We also have to specify how the variables are arranged on the plot by picking an appropriate \textit{geom}. In this case we choose \Rfunction{geom\_point} for a scatter plot.}
A real advantage of \CRANpkg{ggplot2} is that we are able to use the variables in the data frame to set other properties of the plot, such as the colour, size and plotting characters. \CRANpkg{ggplot2} will take care of the colouring and create an appropriate legend.
\usecase Create a new variable in the edgeR results data frame to indicate whether each gene is statistically significant or not. Use this new variable to colour the points on the volcano plot
<<>>=
edgeRresults$DE <- edgeRresults$FDR < 0.01
ggplot(edgeRresults, aes(x = logFC,y=significance,col=DE)) + geom_point()
@
A disadvantage of \CRANpkg{ggplot2} is that is not immediately obvious how to change the appearance of the plot once it has been created; especially if one is familiar with \textit{base} graphics. An example of how to change the default colours that ggplot2 assigns is shown below.
\usecase Make an MA-plot of the edgeR results. Recall that this displays the average expression level on the x axis, and log fold-change on the y-axis. Modify the plot so that significant genes are shown in red.
<<>>=
ggplot(edgeRresults, aes(x = logCPM,y = logFC,col = DE)) + geom_point()
ggplot(edgeRresults, aes(x = logCPM,y = logFC,col = DE)) + geom_point(alpha=0.4) +
scale_color_manual(values=c("black","red"))
@
\usecase Modifty the aesthetics again so that the size of each point is relative to the level of statistical significance
<<>>=
ggplot(edgeRresults, aes(x = logCPM,y = logFC,col = DE,size=significance)) +
geom_point(alpha=0.4) +
scale_color_manual(values=c("black","red"))
@
More information on ggplot2 is covered in the \href{http://www.cookbook-r.com/Graphs/}{{\color{blue}R cookbook}}
\subsection{ggbio}
We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage of the \Robject{GenomicRanges} and \Robject{GenomicFeatures} object-types. In this section we will show a worked example of how to combine several types of genomic data on the same plot.
The documentation for \Biocpkg{ggbio} is very extensive and contains lots of examples.\newline
{\tt http://www.tengfei.name/ggbio/docs/}
The \Biocpkg{Gviz} package is another Bioconductor package that specialising in genomic visualisations, but we will not explore this package in the course.
The \textit{Manhattan} plot is a common way of visualising genome-wide results, and this is implemented as the \Rfunction{plotGrandLinear} function. We have to supply a value to display on the y-axis, typically this is some measure of significance. Changing this variable displayed on the y-axis is done by using the \Rfunction{aes} function, which is inherited from \CRANpkg{ggplot2}. The positioning of points on the x-axis is handled automatically by \Biocpkg{ggbio}, using the ranges information to get the genomic coordinates of the ranges of interest.
\usecase
Plot the genomic locations of the significant genes on a linear scale. Modify the colour scheme to show whether each gene is up- or down-regulated
<<>>=
library(ggbio)
plotGrandLinear(sigGeneRegions , aes(y = score))
mcols(sigGeneRegions)$Up <- logfc > 0
plotGrandLinear(sigGeneRegions, aes(y = logFC, col = Up))
@
A useful function within \Biocpkg{ggbio} is \Rfunction{autoplot}, which will construct an appropriate plot based on the object-type of the input. For example, \Biocpkg{ggbio} is able to plot the structure of genes according to a particular model represented by a \Robject{GenomicFeatures} object.
For this example, we are going to pick a gene on chromosome 22 that has most evidence for differential expression at the gene-level, and is more-expressed in normals than tumours.
\usecase What is the first gene in the list of differentially-expressed genes on chromosome 22 that is more highly-expressed in normals than tumours
<<>>=
myGene <- which(seqnames(sigGeneRegions)=="chr22" & mcols(sigGeneRegions)$logFC <0)[1]
sigGeneRegions[myGene]
@
\usecase What is the Entrez ID for the gene you have selected? Plot the gene structure of this gene
<<>>=
entrez <- sigGeneRegions$genes[myGene]
autoplot(tx, which = exo22[[entrez]])
@
We can even plot the location of sequencing reads if they have been imported using \Rfunction{readGAlignments} function (or similar).
\usecase
Plot the number of reads in this region surrounding the gene.
<<>>=
myreg <- flank(reduce(exo22[[entrez]]), 1000, both = T)
bamSubset <- bam[bam %over% myreg]
autoplot(bamSubset,which=myreg)
@
\usecase
Repeat the plot, but with a smoothed coverage
<<>>=
autoplot(bamSubset , stat = "coverage")
@
Like \CRANpkg{ggplot2}, \Biocpkg{ggbio} plots can be saved as objects that can later be modified, or combined together to form more complicated plots. If saved in this way, the plot will only be displayed on a plotting device when we query the object. This strategy is useful when we want to add a common element (such as an ideogram) to a plot composition and don't want to repeat the code to generate the plot every time.
\usecase Make an ideogram for Human chromosome 22 and save as an object.
<<>>=
idPlot <- plotIdeogram(genome = "hg19",subchr = "chr22")
idPlot
@
The plots produced by \Biocpkg{ggbio} (and indeed \Rpackage{ggplot2} on which the package is based) are not in standard R graphics format, so techniques like \Rcode{par(mfrow)} for arranging plots on the same page are not applicable. However, \Biocpkg{ggbio} allows the plots to be saved as \textit{objects}, which can later be arranged by specialist functions such as \Rfunction{tracks}. This function automatically aligns the x-axis of each plot to be on the same scale.
\usecase
Make a combined plot of the aligned reads and transcripts for the selected gene and comment on what you see. Also show the ideogram for chromosome 22
<<>>=
geneMod <- autoplot(tx, which = myreg)
reads1 <- autoplot(bamSubset, stat = "coverage")
tracks(idPlot,geneMod ,reads1)
@
In this case, it might be of interest to compare the coverage of several samples on the same plot. One way of doing this is suggested below. Essentially, we loop for a vector of file names, and calculate the coverage of the chosen area in turn and create a plot object. We can then use the \Rfunction{tracks} function on the individual plots that we have saved.
<<>>=
mytracks <- alist()
fls <- paste0("Day2/bam/",dir("Day2/bam/",pattern=".bam"))
fls <- fls[-grep("bai",fls)]
names(fls) <- basename(fls)
for(i in 1:length(fls)){
bam <- readGAlignments(file=fls[i],
param=ScanBamParam(which = GRanges("chr22",IRanges(1,51304566))),
use.name=TRUE)
bamSubset <- bam[bam %over% myreg]
mytracks[[i]] <-autoplot(bamSubset, stat="coverage") + ylim(0,20)
}
tracks(idPlot,geneMod,mytracks[[1]], mytracks[[2]],
mytracks[[3]],mytracks[[4]],mytracks[[5]],
mytracks[[6]])
@
\fixme{Doesn't seem to display a separate legend for each bam file at the moment}
\end{document}