Skip to content

BuildingAnnotationFiles

Eaton edited this page Jan 15, 2019 · 1 revision

Splicing Annotations and Protein Associations

A number of annotation files are built prior to running AltAnalyze that are necessary for:

  1. Organizing exons and introns from discrete transcripts into consistently ordered sequence blocks (UCSCImport.py and EnsemblImport.py).

  2. Identifying which exons and introns align to alternative annotations (alignToKnownAlt.py and EnsemblImport.py).

  3. Identifying exons and junctions with likely constitutive annotations (EnsemblImport.py).

  4. Identifying which probe sets align to which exons and introns (ExonArrayEnsemblRules.py).

  5. Extracting out protein sequences with functional annotations (ExtractUniProtFunctAnnot.py and EnsemblSQL.py).

  6. Identifying features that overlap with microRNA binding sites (MatchMiRTargetPredictions.py and ExonSeqModule.py).

  7. Matching feature genomic coordinates to cDNA exon coordinates and identify the optimal matching and non-matching mRNA/protein for each feature (IdentifyAltIsoforms.py).

  8. Identifying features that overlap with protein domains and UniProt motifs (ExonAnalyze_module.py and FeatureAlignment.py)

These annotation files are necessary for all exon and junction analyses. Junction analyses further require:

  1. Matching reciprocal junctions to annotated exons or introns (JunctionArray.py, EnsemblImport.py and JunctionArrayEnsemblRules.py), creating a file analogous to (4) above.
  2. Matching reciprocal junction sequence to microRNA binding sites (JunctionSeqSearch.py), creating a file analogous to (6) above.
  3. Matching junction sequence to cDNA sequences (mRNASeqAlign.py), prior to identification of optimal matching and non-matching mRNA/proteins (IdentifyAltIsoforms.py).

With the creation/update of these files, the user is ready to perform alternative exon analyses for the selected species and platform. Since many of these analyses utilize genomic coordinate alignment as opposed to direct sequence comparison, it is import to ensure that all files were derived from the same genomic assembly.

Note: Although all necessary files are available with the AltAnalyze program at installation and some files can be updated automatically from the AltAnalyze server, users can use these programs to adjust the content of these files, use the output for alternative analyses, or create custom databases for currently unsupported species.

Building Ensembl-Exon Associations

Exon and Gene Arrays

The Affymetrix Exon 1.0 ST and Gene 1.0 ST arrays are provided with probe set sequence, transcript cluster and probe set genomic location from Affymetrix. Each of these annotations is used by AltAnalyze to provide gene, transcript and exon associations. Although transcript clusters represent putative genes, the AltAnalyze pipeline derives new gene associations to Ensembl genes, so that each probe set aligns to a single gene from a single gene database. This annotation schema further allows AltAnalyze to determine which probe sets align to defined exons regions (with external exon annotations), introns, and untranslated regions (UTR).

To begin this process, Ensembl exons (each with a unique ID) and their genomic location and transcript associations are downloaded for the most recent genomic assembly using the AltAnalyze EnsemblSQL.py module, which parses various files on the Ensembl FTP SQL database server to assemble the required fields. This file is saved to the directory “AltDatabase/ensembl/*species*/” with the filename “*species*_Ensembl_transcript-annotations.txt". Since Ensembl transcript associations are typically conservative, transcript associations are further augmented with exon-transcript structure data from the UCSC genome database, from the file “all_mrna.txt” (Downloads/*species*/Annotation database/all_mrna.txt.gz). This file encodes genomic coordinates for exons in each transcript similar to Ensembl. Transcript genomic coordinates and genomic strand data from UCSC is matched to Ensembl gene coordinates to identify genes that specifically overlap with Ensembl genes with the Python program UCSCImport.py. Unique transcripts, with distinct exon structures from Ensembl, are exported to the folder “AltDatabase/ucsc/*species*/" with the filename “*species*_UCSC_transcript_structure_filtered_mrna.txt”, with the same structure as the Ensembl_transcript-annotations file.

Once both transcript-structure files have been saved to the appropriate directory, ExonArrayAffyRules.py calls the program EnsemblImport.py to perform the following steps:

  1. Imports these two files, stores exon-transcript associations identify exon regions to exclude from further annotations. These are exons that signify intron-retention (overlapping with two adjacent spliced exons) and thus are excluded as valid exon IDs. These regions are also flagged as intron-retention regions for later probe set annotations.

  2. Assembles exons from all transcripts for a gene into discrete exon clusters. If an exon cluster contains multiple exons with distinct boundaries, the exon cluster is divided into regions that represent putative alternative splice sites (Figure 3.3). These splice sites are explicitly annotated downstream. Each exon cluster is ordered and number from the first to the last exon cluster (e.g., E1, E2, E3, E4, E5), composed of one or more regions. These exon cluster/region coordinates and annotations are stored in memory for downstream probe set alignment in the module ExonArrayAffyRules.py.

  3. Identifies alternative splicing events (cassette-exon inclusion, alternative 3’ or 5’ splice sites, alternative N-terminal and C-terminal exons, and combinations there in) for all Ensembl and UCSC transcripts by comparing exon cluster and region numbers for all pairs of exons in each transcript (see proceeding sections for more information). Alternative exons/exon-regions and corresponding exon-junctions are stored in memory for later probe set annotation and exported to summary files for creation of databases for the AltAnalyze Viewer (Figure 2.12).

  4. Constitutive exon regions are defined by counting the number of unique Ensembl and UCSC mRNA transcripts (based on structure) associated with each unique exon region. If multiple exon regions have the same number of transcript associations, then these are grouped. The grouped exon regions that contained the most transcripts for the gene are defined as constitutive exon regions. If only one exon region is considered constitutive then the grouped exon regions with the second highest ranking (based on number of associated transcripts) are included as constitutive (when there at least 3 ranks) (Figure 3.3).

Upon completion, ExonArrayAffyRules.py:

  1. Imports Affymetrix Exon 1.0 ST probe sets genomic locations and transcript cluster annotations from the Affymetrix probeset.csv annotation file (e.g., HuEx-1_0-st-v2.na23.hg18.probeset.csv). Note: prior to AltAnalzye 1.14, mRNA alignment count numbers were also downloaded from this file to deduce constitutive probe sets. Although transcript clusters will be disregarded at the end of the analysis, these are used initially to group probe sets.

  2. Transcript cluster genomic locations are matched to Ensembl genes genomic locations (gene start and stop) to identify single transcript clusters that align to only one Ensembl gene for the respective genomic strand. For transcript clusters aligning to more than one Ensembl gene, coordinates for each individual probe set are matched to aligning Ensembl genes, to identify unique matches. If multiple transcript clusters align to a single Ensembl gene, only probe sets with an Affymetrix annotated annotation corresponding to that Ensembl gene from the probeset.csv file, are stored as proper relationships. This ensures that if other genes, not annotated by Ensembl exist in the same genomic interval, that they will not be inaccurately combined with a nearby Ensembl gene. If multiple associations or other inconsistencies are found, probe set coordinates are matched directly to the exon cluster locations derived in EnsemblImport.py.

  3. Each probe sets is then aligned to exon clusters, regions, retained introns, constitutive exon regions and splicing annotations for that gene. In addition to splicing annotations from EnsemblImport.py, splicing annotations from the UCSC genome annotation file “knownAlt.txt” (found in the same server directory at UCSC as “all_mRNA.txt”) using the program alignToKnownAlt.py, are aligned to these probe sets. If a probe set does not align to an Ensemblmport.py defined exon or intron and is upstream of the first exon or downstream of the last exon, the probe set is assigned a UTR annotation (e.g., U1.1). All aligning probe sets are annotated based on the exon cluster number and the relative position of that probe set in the exon cluster, based on relative 5’ genomic start (e.g., E2.1). This can mean that probe set E2.1 actually aligns to the second exon cluster in that gene in any of the exon regions (not necessarily the first exon region), if it is the most 5’ aligning.

  4. These probe set annotations are exported to the directory “AltDatabase/*species*/exon” with the filename “*species*_Ensembl_probesets.txt”. Exon block and region annotations for each probe set are designated in the AltAnalyze result file.

Junction Arrays

For the exon-junction AltMouse array, the same process is applied to the highlighted critical exon(s) from all pre-determined reciprocal junction-pairs, exported by the program ExonAnnotate_module.py. A highlighted critical exon is an exon that is predicted to be regulated as the result of two alternative junctions. For example, if examining the exon-junctions E1-E2 and E1-E3, E2 would be the highlighted critical exon. Alternatively, for the mutually-exclusive splicing event E2-E4 and E1-E3, E2 and E3 would be considered to be the highlighted critical exons. To obtain the genomic locations of these exons, sequences for each are obtained from a static build of the mouse AltMerge program (March 2002) (ExonAnalyze_module.py) and searched for in FASTA formatted sequence obtained from BioMart for all Ensembl genes with an additional 2 kb upstream and downstream sequence (JunctionArray.py and EnsemblImport.py). For both the AltMouse and junction array, gene to Ensembl ID associations are obtained by comparing gene symbol names and assigned accession numbers (GenBank for AltMouse and Ensembl for JAY) in common, as opposed to coordinate comparisons.

For the later generation HJAY and MJAY junction arrays, a similar strategy is employed to the AltMouse to obtain genomic coordinates. First, probe set sequences are extracted (exons and junctions) and searched for in the same FASTA formatted sequence file downloaded for AltMouse probe set mapping. The same methods are employed for both AltMouse and JAY array to obtain probeset genomic coordinates. This allows for the export of an exon-coordinate file analogous to the exon probeset.csv file. As with the Affymetrix exon array coordinate file, these genomic positions are aligned to AltAnalyze annotated exon regions and annotations. For junction probe sets, the two exons that compose the junction are individually mapped to exon regions (Figure 3.1). For exon-junction alignments, if the 5’ junction exon aligns perfectly to the last few base pairs of the exon and the 3’ junction exon aligns perfectly to the first few bases, then the match is considered to be to the predicted junction. If the match is not perfect, then new exon IDs are assigned to the corresponding exons, reflecting the mismatching alignment genomic position.

Reciprocal probe sets for the HJAY and MJAY arrays are determined using two approaches: 1) AltAnalyze determined junction comparisons for all Ensembl and UCSC mRNA isoforms (Figure 3.1) and 2) comparison of Affymetrix assigned junctions exon cluster ID annotations to find junction ends with common exon cluster IDs. Approach 1 is computed using the function “getJunctionComparisonsFromExport" from JunctionArrayEnsemblRules.py and approach 2 is computed using the function “identifyJunctionComps" from JunctionArray.py. The resulting relationships are stored in “AltDatabase/EnsMart55/Mm/junction/Mm_junction_comps_updated.txt” (or relevant species and database version). This approach allows us to capture known alternative splicing events, based on mRNA evidence (approach 1) and putative alternative splicing events predicted by Affymetrix (approach 2). This does not currently capture all alternative splicing events and alternative exons, since some junctions do not share an exon cluster ID, but do reflect alternative modes of transcript regulation. To further detect such events, all exon and junction data is also stored for later individual probeset analysis using algorithms previously reserved for the exon and gene arrays (e.g., splicing-index, MiDAS, FIRMA).

Constitutive exon-junction probe sets are for the HJAY and MJAY arrays are currently determined by two methods: 1) determine if both of the exons in the junction are annotated as constitutive and 2) determine if the junction is annotated as a putative ‘alternative’ probe set according to the manufacturer annotation file (e.g., mjay.r2.annotation_map). If both lines of evidence indicate the probeset is constitutive, then the probeset is annotated accordingly. Probe set annotations are provided in “AltDatabase/EnsMart55/Hs/junction/Hs_Ensembl_probesets.txt” (modify based on species and Ensembl version). This functionality is executed in the JunctionArray.py function “identifyJunctionComps” and the JunctionArrayEnsemblRules.py function “importAndReformatEnsemblJunctionAnnotations”.

The highlighted critical exons associated with reciprocal junctions are used for annotation of the associated reciprocal exon(s). In fact, these critical exons are the ones examined for assigning an alternative exon annotation to reciprocal junctions (e.g, alternative cassette-exon), Ensembl genomic domain overlap and microRNA binding site targets. For Ensembl genomic domain overlap (FeatureAlignment.py), the exon genomic coordinates are used as opposed to the junction coordinates. When aligning identified miRNA target sequences, all Ensembl and UCSC identified exon sequences, identified during the database build procedure discussed in previous sections, are extracted from Ensembl chromosome FASTA files using the function getFullGeneSequences() in the module EnsemblSQL.py, stored in a temporary file and used as surrogates for exon array probe set sequences in JunctionSeqSearch.py.

RNA-Seq Analyses

The AltAnalyze gene database for RNASeq database is built using many of the same modules as those used for junction arrays. Instead of initially aligning known features to exons and junctions, as with the junction microarrays, all known Ensembl/UCSC extracted exons, junctions, reciprocal junctions, alternative exons, constitutive exons are exported by EnsemblImport.py. and stored in the format as probe set aligned annotations. For example, instead of the file Hs_Ensembl_probesets.txt, the files Hs_Ensembl_exons.txt and Hs_Ensembl_junctions.txt are exported to the RNASeq database directory. Rather than unique probe set identifiers, unique IDs for each exon and junction are stored in the format Ensembl-gene:ExonID and Ensembl-gene:5primeExonID-3primeExonID, respectively. The exon IDs are the exon region IDs in the format E1.3-E2.1. Associated Ensembl/UCSC exon annotations and genomic coordinates are stored with each exon and junction ID. These identifiers are treated the same as junction array probe set IDs and analyzed using the same modules outlined above for mRNA isoform alignment (mRNASeqAlign.py), optimal reciprocal protein isoform identification (IdentifyAltIsoforms.py), alternative domain (ExonAnalyze_module.py, FeatureAlignment.py) and miRNA binding site identification (MatchMiRTargetPredictions.py, JunctionSeqSearch.py). Just as with junction arrays, AltAnalyze exon region sequence and genomic locations, for the critical exons associated with reciprocal junctions, are used to identify overlapping miRNA targets and genomic overlapping InterPro domains. For single-junction analyses (e.g., splicing-index and MiDAS), these annotations are built separately using the same pipeline for individual junctions rather than reciprocal junctions, in a similar manner to junction arrays. These single junction annotations are stored to the sub-folder “junction” in the RNASeq directory.

When a users begins an AltAnalyze analysis on their experimental RNA-Seq input files, some of the above analyses are repeated to build an augmented, experiment specific set of AltAnalyze databases that include de novo identified junctions, exons, reciprocal junction-pairs and alternative exon annotations. Several files normally found in the AltAnalyze program directory, AltDatabase/EnsMart<version>/<species>/RNASeq folder are stored to this same path, in the user designated output directory. Thus, analyses can be easily repeated at any step using these compiled dataset specific annotations. Only identified exon and junction annotations are included in these databases to maximize performance. These analyses are directed by the module RNASeq.py. In this module, the functionalignExonsAndJunctionsToEnsembl() extracts genomic locations of the identified junction splice sites (5’ and 3’) and exons, aligns these to Ensembl junctions genes, exons and introns and known splice sites. If both junction splice-sites are found in the AltAnalyze database, those junction and associated exon entries are migrated to the user database. If the junctions or splice-sites are novel, new identifiers are created indicating which exon/intron region the splice-site is found in with the splice-site genomic location indicated in the exon/junction name (e.g., ENSMUSG00000019715:I2.1_29554605). If the two junction splice-sites align to two different genes, the junction is annotated as trans-splicing. For any novel identified exon junctions, reciprocal junction analysis is re-performed using the function compareJunctions() in EnsemblImport.py to identify novel reciprocal junctions to be used during ASPIRE or Linear Regression analyses. The number of novel junctions and trans-splicing events is reported during BED file import. When an exon coordinates and read counts are present, these coordinates are aligned to Ensembl genes, exon regions and introns. Novel exons that overlap with an Ensembl gene are included.

Extracting UniProt Protein Domain Annotations Overview

The UniProt protein database is a highly curated protein database that provides annotations for whole proteins as well as protein segments (protein features or domains). These protein feature annotations correspond to specific amino acid (AA) sequences that are annotated using a common vocabulary, including a class (feature key) and detailed description field. An example is the TCF7L1 protein, which has five annotated feature regions, ranging in size from 7 to 210 AA. One of these regions has the feature key annotation "DNA binding" and the description "HMG box". To utilize these annotations in AltAnalyze, these functional tags are extracted along with full protein sequence, and external annotations for each protein (e.g., Ensembl gene) from the "uniprot_sprot_*taxonomy*.dat" file using the ExtractUniProtFunctAnnot.py program. FTP file locations for the UniProt database file can be found in the file "Config/Default-file.csv" for each supported species. To improve Ensembl-UniProt annotations, these relationships are also downloaded from BioMart and stored in the folder "AltDatabase/uniprot/*species*" as "*species*_Ensembl-UniProt.txt", which are gathered at ExtractUniProtFunctAnnot.py runtime to include in the UniProt sequence annotation file. These files are saved to "AltDatabase/uniprot/*species*" as "uniprot_feature_file.txt" and "uniprot_sequence.txt".

Extracting Ensembl Protein Domain Annotations Overview

In addition to protein domains/features extracted from UniProt, protein features associated with specific Ensembl transcripts are extracted from the Ensembl database. One advantage of these annotations over UniProt, is that alternative exon changes that alter the sequence of a feature but not its inclusion will be reported as a gain and loss of the same feature, as opposed to just one with UniProt. This is because protein feature annotations in UniProt only typically exist for one isoform of a gene and thus, alteration of this feature in any way will result in this feature being called regulated. Although an Ensembl annotated feature with a reported gain and loss can be considered not changed at all, functional differences can exist do to a minor feature sequence change that would not be predicted if the gain and loss of the feature were not reported.

Three separate annotation files are built to provide feature sequences and descriptions, “Ensembl_Protein”, “Protein”, and “ProteinFeatures” files (“AltDatabase/ensembl/*species*”). The “ProteinFeatures” contains relative AA positions for protein features for all Ensembl protein IDs, genomic start and end locations along with InterPro annotations and IDs. Only those InterPro domains/features with an alignment e-score < 1 are stored for alignment to regulated exons. The “Protein” file contains AA sequences for each Ensembl protein. The “Ensembl_Protein " provides Ensembl gene, transcript and protein ID associations. Data for these files are downloaded and extracted using the previously mentioned EnsemblSQL.py module. The feature annotation source in these files is InterPro, which provides a description similar to UniProt. As an example, see this Ensembl gene, which has similar feature descriptions to UniProt for the same gene, TCF7L1.

Extracting microRNA Binding Annotations Overview

To examine the potential gain or loss of microRNA bindings sites as the direct result of exon-inclusion or exclusion, AltAnalyze requires putative microRNA sequences from multiple prediction algorithms. These binding site annotations are extracted from the following flat files:

  • miRNAMap predictions from multiple algorithms (TargetScan, RNAhybrid and miRanda) are downloaded from the miRNAMap ftp server for all species present. Target sequences from this database are linked to Ensembl transcript IDs.

  • TargetScan conserved predicted targets. UTR sequences (UTR_Sequences.txt) are matched to corresponding miRNA target coordinates (Predicted_Targets_Info.txt) for each species based on the taxonomic identifer. The primary gene ID, gene-symbol, is linked to Ensembl based on gene-symbol to Ensembl gene annotations (from several Ensembl builds – outdated symbols are often used).

  • miRanda human centric with multi-species alignment information was obtained from target predictions organized by Ensembl gene ID. A larger set of associations was also pulled from species-specific files, where gene symbol was related to Ensembl gene. Both files provided target microRNA sequence.

  • Sanger center (miRBase) sequence was provided as a custom (requested) dump of their version 5 target predictions, containing Ensembl gene IDs, microRNA names, and putative target sequences, specific for either mouse or human. Currently, rat has not been obtained.

  • PicTar conserved predicted targets (from Dog to Human)[Krek2005] were provided as supplementary data (Supplementary Table 3), with conservation in human, chimp, mouse, rat, and dog for a set of 168 microRNAs. For mouse, human gene symbols were searched for in the BioMart derived “Mm_ Ensembl_annotation.txt” table after converting these IDs to a mouse compatible format (e.g., TCF7L1 to Tcf7l1). The same was used for aligning to PicTar. The same strategy is used for rat.

Ensembl gene to microRNA name and sequence are stored for all prediction algorithm flat files and directly compared to find genes with one or more lines of microRNA binding site evidence using the program MatchMiRTargetPredictions.py. The flat file produced from this program (“combined_gene-target-sequences.txt”) was used by the program ExonSeqSearch.py to search for these putative microRNA binding site sequences among all probe sets from the “*species*_Ensembl_probeset.txt” file built by ExonArrayEnsemblRules.py and probe set sequence from the Affymetrix 1.0 ST probe set FASTA sequence file (Affymetrix) or the reciprocal junction critical exon sequence file (see Building Ensembl-Exon Associations). For gene arrays, NetAffx currently only provides probe sequences, thus, probe set sequences from the exon array are used for probe set found to overlap in genomic space. Two resulting files, one with any binding site predictions and another required to have evidence from at least two algorithms, are saved to “AltDatabase/*species*/*array_type*/” as “*species*_probe set_microRNAs_any.txt” and “*species*_probe set_microRNAs_multiple.txt”, respectively.

Inferring Protein-Exon Associations Overview

To obtain associations between specific exons, junctions or probe sets to proteins, the programs IdentifyAltIsoforms.py (exon and junction platforms) and ExonSeqModule.py (junction platforms only) were written. The program IdentifyAltIsoforms.py grabs all gene mRNA transcripts and associated exon genomic coordinates from Ensembl and UCSC and compares these to probe set coordinates (or critical exon for junction-sensitive platforms) to find pairs of transcripts (one containing the probe set or critical exon and the other not) that have the least number of differing exons (see Constitutive and Gene Expression Calculation). These transcript pairs are thus most likely to be similar with exception to the region containing the probe set or critical exon. Once these best matches are identified, corresponding protein sequences for each mRNA are downloaded from Ensembl using EnsemblSQL.py or are downloaded using NCBI webservices via BioPython’s Entrez function.

If protein sequences are unavailable for an mRNA accession, the mRNA sequence for that identifier is downloaded (using the above mentioned services) and translated using the custom IdentifyAltIsoforms.py function “BuildInSilicoTranslations”, which uses functions from the BioPython module to translate an mRNA based on all possible start and stop sites to identify the longest putative translation that also shares either the first or last 5 AA of its sequence with the N-terminus or C-terminus (respectively) of a UniProt protein.

While this same protocol is used for junction-sensitive platforms, prior to this analysis reciprocal junctions are mapped to all possible aligning and non-aligning transcripts through direct junction sequence comparison via mRNASeqAlign.py. This module takes the junction sequence for all reciprocal junction-pairs and searches for a match among mRNA transcript FASTA formatted sequences from Ensembl and UCSC mRNAs that correspond to that gene. Only a 100% sequence match is allowed, (matches may not occur do to polymorphisms between sequence sources and genomic assemblies). These associations are then used by IdentifyAltIsoforms.py, as described above.

At this point, only two mRNAs are matched toeach exon, junction, probe set or junction-pair, matching mRNAs. Next, differences in protein feature composition between these two proteins, alternative N or C-terminal sequences, coding sequence and protein length are assessed using ExonAnalyze_module.py and exported to text files (associations and protein sequences) for import when AltAnalyze is run. These two files have the suffix “exoncomp.txt”. Two analogous files with the suffix “seqcomp.txt”, are derived by identifying the best matching and non-matching mRNAs based on comparison of protein sequence as compared exon composition (e.g. least differences in protein domain composition), but are currently used only for internal analyses (contact the authors for the seqcomp files to replace exoncomp). The genomic locations of critical exons associated with these reciprocal junctions are used to identify direct and indirect overlapping InterPro domains from Ensembl in the function FeatureAlignment.py.

Required Files for Manual Update

For advanced users and developers that wish to build databases outside of AltAnalyze’s normal release cycle or build custom database installations, several built-in, automated tools are available to build AltAnalyze databases. These functions are accessible from command-line flags in AltAnalyze. In AltAnalyze 2.0, we have tried to increase ease and consistency in which these databases are built, by building new methods to automatically download and extract necessary external databases from resource FTP and HTTP servers, where accessible. Although several external databases are required for the above build strategies, nearly all of these can be automatically downloaded by AltAnalyze during the automated build processes. The exception to this rule is Affymetrix annotation files, which cannot be downloaded without logging into the Affymetrix support site and downloading these manually. These files include the Affymetrix probeset.csv annotation files and probeset FASTA sequence files. See Creating an AltAnalyze Database From Scratch for details on updating and customizing the AltAnalyze databases.

Clone this wiki locally