Welcome to the seqalignments repository

This repository contains sequence alignments of protein-encoding DNA from several species. The following alignments are included so far:

Cancer genetic-susceptibility genes from 9 mammals.
Biomembrane proteins: 9 genes from several species
Populus tremula: 19 proteins

Data format

The sequence alignments of protein-encoding DNA are given in FASTA format (*.fasta and *.fas).
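As a minimal sketch of how such files might be loaded without extra dependencies, the following reads FASTA text into a dictionary and checks that all sequences have the same length, as expected of an alignment. The function names `read_fasta` and `is_aligned` and the toy records are illustrative assumptions, not part of this repository.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped sequence lines.

    If a header occurs twice, the later record replaces the earlier one.
    """
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


def is_aligned(records: Dict[str, str]) -> bool:
    """Return True when every sequence has the same length (gaps included)."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy records for illustration only (not taken from the repository).
example = """>seq_a
ATGGCC---TAA
>seq_b
ATGGCCGGGTAA
""".splitlines()

records = read_fasta(example)
print(is_aligned(records))
```

To use it on a file, pass an open file handle to `read_fasta`, since a text file iterates line by line.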

1. Cancer genetic-susceptibility genes from 9 mammals.

The consecutive multiple sequence alignment is provided for four cancer-related genes: ATM, BRCA1, BRCA2, and P53. The sequences are from nine mammals (Human, Chimpanzee, Gorilla, Rhesus_monkey, Dog, Horse, Cow, Mouse, common rat) and a marsupial (Opossum).

In the sequence alignment, the genes are arranged consecutively in the following order:

Gene	start	end
ATM	1	3074
BRCA1	3075	4983
BRCA2	4984	8549
p53	8550	8946
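The coordinates above can be used to cut one gene out of a row of the concatenated alignment. The sketch below assumes the coordinates are 1-based and inclusive. The table does not state whether they count single alignment columns or codons, so the hypothetical `unit` parameter covers both readings (1 for columns, 3 for codons). The names `GENE_COORDS` and `gene_block` are illustrative only.

```python
# Coordinates copied from the table; assumed to be 1-based and inclusive.
GENE_COORDS = {
    "ATM": (1, 3074),
    "BRCA1": (3075, 4983),
    "BRCA2": (4984, 8549),
    "p53": (8550, 8946),
}


def gene_block(aligned_seq: str, gene: str, unit: int = 1) -> str:
    """Return the part of one aligned sequence that belongs to `gene`.

    unit=1 treats the coordinates as alignment columns,
    unit=3 treats them as codon positions (an assumption in either case).
    """
    start, end = GENE_COORDS[gene]
    return aligned_seq[(start - 1) * unit : end * unit]


# Toy row: each gene region is filled with a different letter.
toy_row = "A" * 3074 + "C" * 1909 + "G" * 3566 + "T" * 397
print(len(gene_block(toy_row, "BRCA1")))  # 1909
```

Applying `gene_block` to every sequence of the alignment yields a per-gene alignment with the same row order.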

2. Biomembrane proteins: 9 genes from several species

Sequence alignments for the following biomembrane genes (from several species) are provided:

mmc2: ADP/ATP translocase 1
mmc4: intramembrane serine protease GlpG
mmc6: potassium channel protein
mmc8: ferrichrome outer membrane transporter
mmc10: formate dehydrogenase, nitrate-inducible, iron-sulfur subunit
mmc12: formate dehydrogenase-N subunit gamma
mmc14: sodium/potassium-transporting ATPase subunit alpha-1 isoform a
mmc16: sodium/potassium-transporting ATPase subunit beta-1
mmc18: beta-2 adrenergic receptor

The alignments with the predicted ancestral states are provided as well. The ancestral states were estimated with the software MEGA6 (Tamura K., Stecher G., Peterson D., Filipski A., and Kumar S. (2013). MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Molecular Biology and Evolution 30: 2725-2729). The guide tree and a description of the algorithm used are also given.

3. Populus tremula: 19 proteins

The dataset of DNA protein-encoding genes was derived in the study: Ingvarsson PK. Natural selection on synonymous and nonsynonymous mutations shapes patterns of polymorphism in Populus tremula. Mol Biol Evol, 2010, 27:650–60.

The authors deposited the dataset in the GenBank/EMBL databases (accession numbers EU752500–EU754117). Here, the multiple sequence alignments for the following proteins are provided.

isolates: expressed protein genes
putative protein gene
ribonucleotide reductase beta subunit gene
esterase lipase thioesterase gene
aspartyl protease gene
C-x8-C-x5-C-x3-H type Zn-finger gene
casein kinase II regulatory subunit gene
chalcone synthase gene
cinnamyl alcohol dehydrogenase gene
class Ib aminoacyl-tRNA synthetase gene
class V aminotransferase gene
cytochrome P450 gene
G-D-S-L lipolytic enzyme gene
heat shock protein Hsp20 gene
isolate swl5-aut64 expressed protein gene
NAC domain protein gene
peptidase C1A papain gene
serine threonine-specific protein phosphatase and bis(5-nucleosyl)-tetraphosphatase gene
U5 snRNP-specific protein-like factor gene

4. Protein-coding DNA sequences of HIV1 ENV protein

Multiple sequence alignment of ENV DNA sequences isolated from patients from 1997 to 2017.

The protein-coding DNA sequences were downloaded from the HIV sequence database at HIV DATABASES (https://www.hiv.lanl.gov/content/index). The web alignments were used; some of the files, but not all, were manually corrected.

The reference strain HXB2 is included in each FASTA file.

5. Protein-coding DNA sequences from human genomes

Multiple sequence alignments of 1016 human CDS references to protein-coding regions from 16 human genomes:

"GCA_000002115.2_genomic_genbank.fna.gz"
"GCA_000002125.2_genomic_genbank.fna.gz"
"GCA_000002135.3_genomic_genbank.fna.gz"
"GCA_000212995.1_genomic_genbank.fna.gz"
"GCA_000252825.1_genomic_genbank.fna.gz"
"GCA_000306695.2_genomic_genbank.fna.gz"
"GCA_000365445.1_genomic_genbank.fna.gz"
"GCA_000442335.2_genomic_genbank.fna.gz"
"GCA_001292825.2_genomic_genbank.fna.gz"
"GCA_001524155.4_genomic_genbank.fna.gz"
"GCA_001712695.1_genomic_genbank.fna.gz"
"GCA_002077035.3_genomic_genbank.fna.gz"
"GCA_002180035.3_genomic_genbank.fna.gz"
"GCA_003634875.1_genomic_genbank.fna.gz"
"GCA_009914755.1_genomic_genbank.fna.gz"
"GCA_011064465.1_genomic_genbank.fna.gz"

Two subfolders are included, plus_strand and minus_strand, containing the alignments to the positive and negative strands, respectively. That is, the BLAST match was performed independently on the positive strand and on the negative strand. The lists of alignment FASTA files are given in the files named "cds_list.RData" and "cds_list_minus_strand", which can be read in R.

6. Alignments of human reference CDS to Protein-coding DNA sequences

Multiple sequence alignments of 931 human CDS references to protein-coding regions from the non-redundant nucleotide NCBI database (09/16/2020).

The list of alignment FASTA files is given in the files named "cds_aligned_files.RData" and "cds_aligned_files.txt", which can be read in R.

7. HIV-1 Protein-coding DNA sequences

Curated HIV-1 sequence alignments downloaded from the HIV sequence database.

The list of alignment FASTA files is given in the files named hiv1_aligned_files.RData and hiv1_aligned_files.txt.

The protein-coding DNA sequences, from all the HIV-1 genes, were isolated from patients, covering the years from 2007 to 2018.

8. Phospholipase B domain containing-2 (PLBD2)

DNA sequence alignment of the protein-coding sequences from phospholipase B domain containing-2 (PLBD2) carrying the footprint sequence motif recognized (targeted) by the RE1-Silencing Transcription factor (REST), also known as Neuron-Restrictive Silencer Factor (NRSF).

The aligned sequences are:

NM_001159727.2:54-83_Homo_sapiens
XM_019950502.2:111-140_Tursiops_truncatus
XM_033408778.1:115-144_Orcinus_orca
XM_032876216.1:91-120_Lontra_canadensis
XM_032602682.1:115-144_Phocoena_sinus
XM_032310131.1:107-136_Mustela_erminea
LR738414.1:88042426-88042455_Lutra_lutra
XM_030860450.1:112-141_Globicephala_melas
XM_030821458.1:66-95_Nomascus_leucogenys
XM_022550731.2:109-138_Delphinapterus_leucas
XM_030336151.1:55-84_Lynx_canadensis
XM_029922190.1:78-107_Suricata_suricatta
XM_015152887.2:74-103_Macaca_mulatta
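
The labels above appear to follow the pattern accession:start-end_Genus_species. As a minimal sketch under that assumption, a label can be split into its parts as follows. The names `SeqLabel` and `parse_label` are illustrative, and the coordinates are assumed to be 1-based and inclusive.

```python
import re
from typing import NamedTuple

# Assumed label layout: <accession>:<start>-<end>_<Genus>_<species>
LABEL_PATTERN = re.compile(
    r"^(?P<accession>[^:]+):(?P<start>\d+)-(?P<end>\d+)_(?P<species>.+)$"
)


class SeqLabel(NamedTuple):
    accession: str
    start: int
    end: int
    species: str

    @property
    def length(self) -> int:
        """Span length, assuming 1-based inclusive coordinates."""
        return self.end - self.start + 1


def parse_label(label: str) -> SeqLabel:
    """Split a sequence label into accession, coordinates, and species name."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognized label: {label!r}")
    return SeqLabel(
        accession=match["accession"],
        start=int(match["start"]),
        end=int(match["end"]),
        species=match["species"].replace("_", " "),
    )


label = parse_label("NM_001159727.2:54-83_Homo_sapiens")
print(label.species, label.length)  # Homo sapiens 30
```

Under this reading, every label in the list spans 30 nucleotides.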
