Analysis of alternative polyadenylation (APA) from RNA-seq data (human and mouse). QAPA consists of two main components:
- Extraction and annotation of 3′ UTR sequences from gene models
- Calculation of relative usage of alternative 3′ UTR isoforms based on transcript-level abundance.
Note that QAPA itself does not perform transcript quantification. It relies on other tools such as Sailfish and Salmon.
QAPA consists of both Python and R scripts. A conda virtual environment
can be created using the provided environment.yml
file.
-
Clone the repo:
git clone https://github.com/morrislab/qapa.git cd qapa
-
(Optional) Install
mamba
for faster Conda managementconda install -c conda-forge mamba
-
Create the environment
mamba env create -f environment.yml conda activate qapa
-
Test that
qapa
command is availableqapa --help
QAPA has three sub-commands:
build
: Generate a 3′ UTR library from annotationsfasta
: Extract sequences for indexing by transcript quantification toolsquant
: Calculate relative usage of alternative 3′ UTR isoforms
Pre-compiled libraries for human and mouse are available for download below. Otherwise, continue reading to build from scratch.
Updated in v1.3.0, the following libraries are pre-compiled with Polyasite V2:
(Old) Versions v1.2.3 and earlier:
To run build
, gene and poly(A) annotation sources need to be prepared:
A. Gene annotation
-
Ensembl gene metadata table from Biomart.
Human and mouse tables are provided in the
examples
folder. To obtain a fresh copy, download a table of Ensembl Genes from Biomart with the following attributes:- Ensembl Gene ID
- Ensembl Transcript ID
- Gene Type
- Transcript Type
- Gene Name
Alternatively, download via MySQL (see here for more details):
mysql --user=anonymous --host=martdb.ensembl.org --port=5316 -A ensembl_mart_89 \ -e "select stable_id_1023 as 'Gene stable ID', stable_id_1066 as 'Transcript stable ID', \ biotype_1020 as 'Gene type', biotype_1064 as 'Transcript type', \ display_label_1074 as 'Gene name' from mmusculus_gene_ensembl__transcript__main" \ > ensembl_identifiers.txt
To change the species, replace the table name (e.g. for human, use
hsapiens_gene_ensembl__transcript__main
). -
GENCODE gene prediction annotation table
Download from UCSC Table Browser or alternatively via MySQL (see here for more details). For example, to download mm10 gene predictions:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \ -e "select * from wgEncodeGencodeBasicVM9" mm10 > gencode.basic.txt
Alternatively, if you are starting from a GTF/GFF file, you can convert it to genePred format using the UCSC tool
gtfToGenePred
:gtfToGenePred -genePredExt custom_genes.gtf custom_genes.genePred
Note that it is important to include the
-genePredExt
option!
B. Poly(A) site annotation (optional)
This step can be skipped, otherwise continue reading. Two options are available:
Option 1: standard approach using PolyASite and GENCODE poly(A) track (as described in the paper)
-
PolyASite database
Download BED files (human or mouse) from http://polyasite.unibas.ch/.
-
GENCODE poly(A) sites track
Download from UCSC Table Browser or alternatively via MySQL. For example, to download mm10 annotations:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \ -e "select chrom, txStart, txEnd, name2, score, strand \ from wgEncodeGencodePolyaVM9 where name2 = 'polyA_site'" -N mm10 \ > gencode.polyA_sites.bed
Option 2: use custom BED track of poly(A) sites
A custom BED file of poly(A) sites can be used to annotate 3′ UTRs. Each entry must contain the start (0-based) and end coordinate of a poly(A) site.
Once the data files have been prepared, we can then use build
to create the 3'
UTR library. See qapa build -h
for usage details. The following
describes several example use cases:
-
To extract 3′ UTRs from annotation, run:
qapa build --db ensembl_identifiers.txt -g gencode.polyA_sites.bed -p clusters.mm10.bed gencode.basic.txt > output_utrs.bed
-
If using a custom BED file, replace the
-g
and-p
options with-o
:qapa build --db ensembl_identifiers.txt -o custom_sites.bed gencode.basic.txt > output_utrs.bed
-
If using a custom genePred file converted from GTF, supply the file as in 1. (e.g. the first positional argument):
qapa build --db ensembl_identifiers.txt -o custom_sites.bed custom_genes.genePred > output_utrs.bed
-
If bypassing the poly(A) annotation step, include the
-N
option:qapa build -N --db ensembl_identifiers.txt gencode.basic.txt > output.utrs.bed
Results will be saved in the file output_utrs.bed
(default is STDOUT).
It is important that the sequence IDs are not modified as it will be parsed by
quant
below.
Additional notes:
- 3' UTRs that contain introns will be skipped.
- Chromosome names that contain underscores are currently not supported and will be skipped.
- Only genes with
Gene Type = 'protein_coding'
will be considered.
- Use
--debug
option to produce more verbose logging messages - Use
--save
and--temp <dir_path>
to save intermediate files generated byqapa build
.<dir_path>
should be a user accessible directory.
To extract sequences from the BED file prepared by build
, a reference genome in
FASTA format is required. e.g. http://hgdownload.soe.ucsc.edu/downloads.html.
Then, run the command:
qapa fasta -f genome.fa output_utrs.bed output_sequences.fa
Essentially fasta
is a wrapper that calls bedtools getfasta
. Note that
genome.fa
must be uncompressed. Sequences will be saved in
output_sequences.fa
.
Expression quantification of 3′ UTR isoforms must be carried out first using the
FASTA file prepared by fasta
as the index. For example, to index the sequences
using Salmon:
salmon index -t output_sequences.fa -i utr_library
Following expression quantification, QAPA expects the results to be located inside its own sub-directory. For example, typical Sailfish/Salmon results may appear with the following directory structure:
project/
|-- sample1/quant.sf
|-- sample2/quant.sf
|-- (etc.)
The quant
sub-command calls two R scripts:
create_merged_table.R
and compute_pau.R
. The first script combines the
quantifications from each sample into a single table. The second script computes
the relative proportion of each isoform in a gene, measured as Poly(A) Usage
(PAU) (compute_pau.R
).
qapa quant --db ensembl_identifiers.mm10.txt project/sample*/quant.sf > pau_results.txt
Results will be saved in the file pau_results.txt
(default is STDOUT).
For advanced usage, the R scripts can be run on its own. Run
create_merged_table.R -h
and compute_pau.R -h
for usage details.
The final output format is as follows:
Column | Description |
---|---|
APA_ID | unique identifier consisting of in the format <Ensembl Gene ID>_<number>_<(P|D|S)> , where P = proximal, D = distal, and S = single |
Transcript | one or more Ensembl Transcript IDs |
Gene | Ensembl Gene ID |
Gene_Name | gene symbol |
Chr | chromosome |
LastExon.Start | start coordinate of last exon |
LastExon.End | end coordinate of last exon |
Strand | + or - |
UTR3.Start | start coordinate of 3′ UTR |
UTR3.End | end coordinate of 3′ UTR |
Length | length of the 3′ UTR |
Num_Events | number of PAS per gene |
sample1.PAU | PAU estimate for sample1 |
sample2.PAU | PAU estimate for sample2 |
sample1.TPM | TPM estimate for sample1 |
sample2.TPM | TPM estimate for sample2 |
Ha, K.C.H., Blencowe, B.J., Morris, Q. (2018). QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biol. 19, 45. [link]