This repository contains sequence alignments of protein-encoding DNA from several species. The following alignments are included so far:
- Cancer genetic-susceptibility genes from 9 mammals.
- Biomembrane proteins: 9 genes from several species
- Populus tremula: 19 proteins
The sequence alignments of protein-encoding DNA are given in FASTA format (*.fasta and *.fas).
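
As a minimal sketch of how such a file could be loaded without extra dependencies, the Python snippet below parses FASTA text into a dictionary and checks that all sequences have the same length, as expected for an alignment. The function names `parse_fasta` and `is_aligned` and the toy records are illustrative assumptions, not part of this repository.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its sequence, joining wrapped lines."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def is_aligned(records):
    """True when every sequence has the same number of columns."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy alignment (hypothetical data); the second record is wrapped over two lines.
example = """>Human
ATGGCC---TTA
>Mouse
ATGGCA
GCTTTA
"""

records = parse_fasta(example)
print(sorted(records))
print(is_aligned(records))
```

For a file on disk, the same function can be applied to the file contents, for example `parse_fasta(open(path).read())`, where `path` points to one of the *.fasta or *.fas files.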
The concatenated multiple sequence alignment is provided for four cancer-related genes: ATM, BRCA1, BRCA2, and p53. The sequences are from nine placental mammals (Human, Chimpanzee, Gorilla, Rhesus_monkey, Dog, Horse, Cow, Mouse, and common rat) and one marsupial (Opossum).
In the sequence alignment, the genes are concatenated consecutively in the following order:
Gene | start | end |
---|---|---|
ATM | 1 | 3074 |
BRCA1 | 3075 | 4983 |
BRCA2 | 4984 | 8549 |
p53 | 8550 | 8946 |
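
The coordinates above can be used to cut the concatenated alignment back into per-gene alignments. The following Python sketch assumes the start and end values are 1-based, inclusive positions; the table does not state whether a position is a single alignment column or a codon, so the hypothetical parameter `columns_per_position` lets the caller choose (1 for columns, 3 for codons in a DNA alignment). The function name `split_by_gene` and the toy sequence are illustrative only.

```python
# Gene coordinates copied from the table (1-based, inclusive).
GENE_COORDS = {
    "ATM": (1, 3074),
    "BRCA1": (3075, 4983),
    "BRCA2": (4984, 8549),
    "p53": (8550, 8946),
}


def split_by_gene(records, coords=GENE_COORDS, columns_per_position=1):
    """Slice every aligned sequence into one sub-alignment per gene.

    records: dict mapping sequence name -> aligned sequence string.
    columns_per_position: how many string characters one table position spans
    (an assumption the caller must set; the source does not specify the unit).
    """
    k = columns_per_position
    return {
        gene: {name: seq[(start - 1) * k : end * k] for name, seq in records.items()}
        for gene, (start, end) in coords.items()
    }


# Toy sequence (hypothetical): each gene block is filled with a different letter.
toy = {"Human": "A" * 3074 + "C" * 1909 + "G" * 3566 + "T" * 397}
parts = split_by_gene(toy)
print({gene: len(block["Human"]) for gene, block in parts.items()})
```

The four blocks span 3074, 1909, 3566, and 397 positions, which together account for all 8946 positions of the concatenated alignment.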
Sequence alignments for the following biomembrane genes (from several species) are provided:
- mmc2: ADP/ATP translocase 1
- mmc4: intramembrane serine protease GlpG
- mmc6: potassium channel protein
- mmc8: ferrichrome outer membrane transporter
- mmc10: formate dehydrogenase, nitrate-inducible, iron-sulfur subunit
- mmc12: formate dehydrogenase-N subunit gamma
- mmc14: sodium/potassium-transporting ATPase subunit alpha-1 isoform a
- mmc16: sodium/potassium-transporting ATPase subunit beta-1
- mmc18: beta-2 adrenergic receptor
The alignments with the predicted ancestral states are provided as well. The ancestral states were estimated with the software MEGA6 (Tamura K., Stecher G., Peterson D., Filipski A., and Kumar S. (2013). MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Molecular Biology and Evolution 30: 2725-2729). The guide tree and a description of the algorithm used are also given.
The Populus tremula dataset of protein-encoding DNA sequences was derived in the study: Ingvarsson PK. Natural selection on synonymous and nonsynonymous mutations shapes patterns of polymorphism in Populus tremula. Mol Biol Evol, 2010, 27:650–60.
The authors deposited the dataset in the GenBank/EMBL databases (accession numbers EU752500–EU754117). Here, the multiple sequence alignments for the following proteins are provided:
- isolates: expressed protein genes
- putative protein gene
- ribonucleotide reductase beta subunit gene
- esterase lipase thioesterase gene
- aspartyl protease gene
- C-x8-C-x5-C-x3-H type Zn-finger gene
- casein kinase II regulatory subunit gene
- chalcone synthase gene
- cinnamyl alcohol dehydrogenase gene
- class Ib aminoacyl-tRNA synthetase gene
- class V aminotransferase gene
- cytochrome P450 gene
- G-D-S-L lipolytic enzyme gene
- heat shock protein Hsp20 gene
- isolate swl5-aut64 expressed protein gene
- NAC domain protein gene
- peptidase C1A papain gene
- serine threonine-specific protein phosphatase and bis(5-nucleosyl)-tetraphosphatase gene
- U5 snRNP-specific protein-like factor gene
Multiple sequence alignment of HIV ENV DNA sequences isolated from patients from 1997 to 2017.
The protein-coding DNA sequences were downloaded from the HIV sequence database at HIV DATABASES (https://www.hiv.lanl.gov/content/index). The Web Alignments were used; some of the files, but not all, were manually corrected.
The reference strain HXB2 is included in each FASTA file.
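
A quick way to confirm that a given file contains the reference strain is to scan its headers. The Python sketch below assumes that the reference record carries the string "HXB2" somewhere in its FASTA header; the function name `reference_headers` and the example headers are hypothetical and should be adapted to the actual files.

```python
def reference_headers(fasta_text, tag="HXB2"):
    """Return the FASTA headers (without '>') that contain the given tag."""
    return [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">") and tag in line
    ]


# Illustrative records only; real headers in the alignment files may differ.
example = """>B.FR.83.HXB2_LAI_IIIB_BRU.K03455
ATGAGAGTG
>patient_sample_1997
ATGAGAGTG
"""

print(reference_headers(example))
```

An empty list for a file would indicate that no header matched the tag, in which case the header naming used in that file should be checked by hand.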
Multiple sequence alignments of 1016 human CDS references to protein-coding regions from 16 human genomes:
- "GCA_000002115.2_genomic_genbank.fna.gz"
- "GCA_000002125.2_genomic_genbank.fna.gz"
- "GCA_000002135.3_genomic_genbank.fna.gz"
- "GCA_000212995.1_genomic_genbank.fna.gz"
- "GCA_000252825.1_genomic_genbank.fna.gz"
- "GCA_000306695.2_genomic_genbank.fna.gz"
- "GCA_000365445.1_genomic_genbank.fna.gz"
- "GCA_000442335.2_genomic_genbank.fna.gz"
- "GCA_001292825.2_genomic_genbank.fna.gz"
- "GCA_001524155.4_genomic_genbank.fna.gz"
- "GCA_001712695.1_genomic_genbank.fna.gz"
- "GCA_002077035.3_genomic_genbank.fna.gz"
- "GCA_002180035.3_genomic_genbank.fna.gz"
- "GCA_003634875.1_genomic_genbank.fna.gz"
- "GCA_009914755.1_genomic_genbank.fna.gz"
- "GCA_011064465.1_genomic_genbank.fna.gz"
Two subfolders, plus_strand and minus_strand, contain the alignments to the positive and negative strands, respectively. That is, the BLAST search was performed independently on the positive strand and on the negative strand. The lists of alignment FASTA files are given in the files named "cds_list.RData" and "cds_list_minus_strand", which can be read in R.
Multiple sequence alignments of 931 human CDS references to protein-coding regions from the NCBI non-redundant nucleotide database (09/16/2020).
The list of alignment FASTA files is given in the files named "cds_aligned_files.RData" and "cds_aligned_files.txt", which can be read in R.
Curated HIV-1 sequence alignments downloaded from the HIV sequence database.
The list of alignment FASTA files is given in the files named hiv1_aligned_files.RData and hiv1_aligned_files.txt.
The protein-coding DNA sequences, from all the HIV-1 genes, were isolated from patients, covering the years 2007 to 2018.
DNA sequence alignment of the protein-coding sequences from phospholipase B domain-containing 2 (PLBD2) carrying the footprint sequence motif recognized (targeted) by the RE1-Silencing Transcription factor (REST), also known as Neuron-Restrictive Silencer Factor (NRSF).
The aligned sequences are:
- NM_001159727.2:54-83_Homo_sapiens
- XM_019950502.2:111-140_Tursiops_truncatus
- XM_033408778.1:115-144_Orcinus_orca
- XM_032876216.1:91-120_Lontra_canadensis
- XM_032602682.1:115-144_Phocoena_sinus
- XM_032310131.1:107-136_Mustela_erminea
- LR738414.1:88042426-88042455_Lutra_lutra
- XM_030860450.1:112-141_Globicephala_melas
- XM_030821458.1:66-95_Nomascus_leucogenys
- XM_022550731.2:109-138_Delphinapterus_leucas
- XM_030336151.1:55-84_Lynx_canadensis
- XM_029922190.1:78-107_Suricata_suricatta
- XM_015152887.2:74-103_Macaca_mulatta
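
Each identifier above follows the pattern accession:start-end_Genus_species. As a minimal sketch, the Python snippet below splits such a label into its parts and computes the length of the aligned region, assuming the coordinates are 1-based and inclusive. The function name `parse_label` is a hypothetical helper, not part of this repository.

```python
import re

# accession ':' start '-' end '_' species (species words joined by underscores)
LABEL_RE = re.compile(r"^(?P<acc>[^:]+):(?P<start>\d+)-(?P<end>\d+)_(?P<species>.+)$")


def parse_label(label):
    """Split a label into (accession, start, end, species name)."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unexpected label format: {label!r}")
    return (
        match["acc"],
        int(match["start"]),
        int(match["end"]),
        match["species"].replace("_", " "),
    )


labels = [
    "NM_001159727.2:54-83_Homo_sapiens",
    "LR738414.1:88042426-88042455_Lutra_lutra",
    "XM_015152887.2:74-103_Macaca_mulatta",
]

for label in labels:
    acc, start, end, species = parse_label(label)
    # Region length under the 1-based inclusive assumption.
    print(acc, species, end - start + 1)
```

Under that assumption, every listed region spans 30 nucleotides.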