GitHub - LashaBuxo/Genome-Tools: research on specific human genes

Genome Working Tools

This Python project is a multi-functional bioinformatics tool that simplifies calculations on genome annotations provided by Ensembl or RefSeq (NCBI).

Instructions for use

The project requires the desired annotations (GFF format), sequence files (FASTA format), and any other files needed for the calculations to be imported into the project and declared in the worker_genome_values.py file.

The project supports the annotation and sequence file formats provided by RefSeq (NCBI) and Ensembl. In addition to adding the appropriate files, the name of the organism to which the files belong must also be declared in worker_genome_values.py.

Note: NCBI .gff files lack 5' UTR and 3' UTR features. To add these features, run the script add_ustrs_to_gff.py, then use any converter tool to change the text format to ANSI Unicode. add_ustrs_to_gff.py was written by David Managadze (who works at NCBI), and its use is suggested in NCBI's readme files.

Genome Load

from worker_genome import *

genome = GenomeWorker(SPECIES.Homo_sapiens,  # For which organism genome must be loaded?
                      ANNOTATIONS.ENSEMBL,  # Which annotation to use?
                      ANNOTATION_LOAD.GENES_AND_TRANSCRIPTS,  # What type of features are required from annotation?
                      SEQUENCE_LOAD.NOT_LOAD)  # Should sequence also be loaded?

Sequence Analyzer Usage

from worker_analyzer import *

analyzer = AnalyzerData()

analyzer.analyze_sequence_stats("CCAGCAGCAG",  # Nucleotide Sequence
                                1)  # ORF for sequence
print(analyzer.analyzed_peptide)  # Output: 'QQQ' 
analyzer.analyze_sequence_stats("ATGATG", 0)
print(analyzer.analyzed_peptide)  # Output: 'QQQMM' (peptides from successive calls accumulate)
print(analyzer.get_gc_content())  # Output: 0.64285714285
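
To make the reading-frame behaviour in the calls above concrete, here is a minimal standalone sketch. It is not the project's implementation: it assumes the second argument is a 0-based frame offset (0, 1 or 2), uses the standard genetic code, and the helper names translate and gc_content are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str, frame: int = 0) -> str:
    """Translate DNA starting at a 0-based frame offset (assumed meaning of 'ORF')."""
    sequence = sequence.upper()
    # Walk the sequence in steps of three; a trailing partial codon is ignored.
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    # Codons that are not in the table (e.g. containing N) are rendered as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def gc_content(sequence: str) -> float:
    """Fraction of G and C nucleotides (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


print(translate("CCAGCAGCAG", 1))   # QQQ
print(translate("ATGATG", 0))       # MM
print(gc_content("CCAGCAGCAG"))     # 0.7
```

Unlike AnalyzerData, this sketch keeps no state between calls, so each call reports only its own sequence.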

Example Usage

Count the protein-coding genes longer than 10,000 base pairs:

total_genes = 0
genes_greater_10000 = 0  # genes longer than 10,000 bp

for chr_id in range(1, genome.chromosomes_count() + 1):
    genes_cnt = genome.genes_count_on_chr(chr_id)
    total_genes += genes_cnt
    for i in range(0, genes_cnt):
        gene = genome.gene_by_indexes(chr_id, i)
        # Coordinates are inclusive, so gene length is end - start + 1
        genes_greater_10000 += 1 if gene.end - gene.start + 1 > 10000 else 0
print(f'{genes_greater_10000} genes from {total_genes}')

14227 genes from 19176

How many human RNA transcripts start with ATGGGG?

genome = GenomeWorker(SPECIES.Homo_sapiens, ANNOTATIONS.ENSEMBL,
                      ANNOTATION_LOAD.GENES_AND_TRANSCRIPTS_AND_CDS, SEQUENCE_LOAD.LOAD)

count = 0
for chr_id in range(1, genome.chromosomes_count() + 1):
    genes_cnt = genome.genes_count_on_chr(chr_id)
    for i in range(0, genes_cnt):
        gene = genome.gene_by_indexes(chr_id, i)
        transcript = genome.get_transcript_from_gene_by_criteria(gene.id, criteria=TRANSCRIPT_CRITERIA.LONGEST_CDS,
                                                                 tie_breaker_criteria=TRANSCRIPT_CRITERIA.RANDOM)
        seq = genome.retrieve_feature_sequence(chr_id, transcript)
        count += 1 if seq.startswith("ATGGGG") else 0
print(f'{count} transcripts start with ATGGGG')

16 transcripts start with ATGGGG
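
The example above picks one transcript per gene using the LONGEST_CDS criterion with a RANDOM tie-breaker. As a rough illustration of that selection rule only, the sketch below uses a hypothetical Transcript record and a hypothetical pick_longest_cds helper; how the project actually implements get_transcript_from_gene_by_criteria is not shown in this README.

```python
import random
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical stand-in for a transcript record; not the project's class."""
    id: str
    cds_length: int  # total coding-sequence length in base pairs


def pick_longest_cds(transcripts, rng=random):
    """Return the transcript with the longest CDS; ties are broken at random."""
    if not transcripts:
        return None
    longest = max(t.cds_length for t in transcripts)
    candidates = [t for t in transcripts if t.cds_length == longest]
    return rng.choice(candidates)


transcripts = [Transcript("T1", 900), Transcript("T2", 1500), Transcript("T3", 1200)]
print(pick_longest_cds(transcripts).id)  # T2
```

Passing a seeded random.Random instance as rng makes the tie-breaking reproducible, which can matter when counts such as the one above are compared between runs.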

The tool can also be used to outline gene GC content across all protein-coding genes: it calculates the GC content in specific regions of each gene and averages the values over all genes. Here, each region is divided into k=50 sub-regions.
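
The binning and averaging idea can be sketched as follows. This is an illustrative assumption about the procedure, not the project's code: the function names are hypothetical, each region is cut into k near-equal consecutive parts, and the per-part GC fractions are averaged across all input sequences.

```python
def gc_fraction(sequence):
    """Fraction of G/C bases in a sequence (0.0 if the sequence is empty)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


def gc_profile(sequence, k=50):
    """Split a sequence into k near-equal consecutive parts; return GC fraction per part."""
    n = len(sequence)
    # Integer boundaries so the k parts cover the whole sequence without overlap.
    bounds = [i * n // k for i in range(k + 1)]
    return [gc_fraction(sequence[bounds[i]:bounds[i + 1]]) for i in range(k)]


def average_profile(sequences, k=50):
    """Average the per-part GC fractions over a non-empty list of sequences."""
    profiles = [gc_profile(seq, k) for seq in sequences]
    return [sum(p[i] for p in profiles) / len(profiles) for i in range(k)]


regions = ["GGCCAATT", "GCGCGCGC"]
print(average_profile(regions, k=2))  # [1.0, 0.5]
```

Sequences shorter than k produce empty parts, which this sketch scores as 0.0; a real analysis would probably skip or otherwise handle such short regions.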

The organisms used in the analyses are: Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus, Rattus norvegicus, and Homo sapiens.

When Ensembl-annotated protein-coding genes are used:

[Figure: Comparative Gene Outline (Ensembl, k=50, procc=1000).png]

When RefSeq (NCBI)-annotated protein-coding genes are used:

[Figure: Comparative Gene Outline (NCBI, k=50, procc=1000).png]

3rd Party Resources

gffutils - We use gffutils for working with large GFF files

