Pediatric oncogenomics data analysis pipeline for Innovation Health Hub med tech development program

EdgarsLiepa/prevengs

PreveNGS

Pediatric oncogenomics data analysis pipeline for RTU Innovation Health Hub med tech development program.

THIS IS A DEVELOPMENT VERSION OF THE PIPELINE. IT IS NOT READY FOR PRODUCTION USE.

This code processes transcriptome data and generates a plot of the results. The output plot shows the expression levels of different genes.

Figure: transcriptome data flow (Transkriptoma_datu_plusma.jpg)

Description

dependency list.

Input Files:

  • HTSeq output file with counts for each gene - <SampleName>.htseq_counts.txt
    • 3 columns
      • Gene Name
      • Gene ID (will be filtered out)
      • Count
  • Reference lengths.
    • GTF file with gene lengths.
  • Metadata file.
    • SampleName,
    • attr_diagnosis_group (Solid Tumor, Hematologic Malignancy, etc.),
    • source (St. JUDE, BKUS)
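
The counts file layout listed above can be read with a few lines of Python. This is a minimal sketch, not the pipeline's own parser: it assumes the file is tab-separated with the columns in the order listed (gene name, gene ID, count), and it assumes HTSeq summary rows beginning with "__" should be skipped. The function name `read_htseq_counts` and the example rows are hypothetical.

```python
import csv
import io


def read_htseq_counts(handle):
    """Read a 3-column, tab-separated counts file into {gene_name: count}.

    Assumed column order: gene name, gene ID, count. The gene ID column is
    dropped, and HTSeq summary rows such as "__no_feature" are skipped.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row!r}")
        gene_name, _gene_id, count = row
        if gene_name.startswith("__"):
            continue
        counts[gene_name] = int(count)
    return counts


# Hypothetical example content, not real data.
example = io.StringIO(
    "TP53\tENSG00000141510.16\t120\n"
    "FLT3\tENSG00000122025.15\t45\n"
    "__no_feature\t\t300\n"
)
parsed = read_htseq_counts(example)
print(parsed)
```

Raising on an unexpected column count makes a malformed file fail early instead of producing a silently wrong reference table.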

Output:

  • Sample Reference table
  • PCA
  • Top10 genes plot.
  • TPM list if calculated.
  • QQplots.
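
For orientation, TPM (transcripts per million) is conventionally computed by dividing each count by the gene length in kilobases and then rescaling so that the values sum to one million. The sketch below shows that standard formula only; whether the pipeline computes TPM in exactly this way is an assumption, and `counts_to_tpm` and the example genes are hypothetical.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts to TPM given gene lengths in base pairs."""
    # Reads per kilobase: count divided by gene length expressed in kb.
    rpk = {gene: c / (lengths_bp[gene] / 1000.0) for gene, c in counts.items()}
    # Scaling factor so that all TPM values sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    if scale == 0:
        return {gene: 0.0 for gene in rpk}
    return {gene: value / scale for gene, value in rpk.items()}


# Hypothetical counts and lengths, not real data.
counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 3000}
tpm = counts_to_tpm(counts, lengths)
print(tpm)
```

In this example both genes have the same count per kilobase, so they receive the same TPM even though their raw counts differ.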

About The Project

Getting Started

Get docker image

The Docker image can be pulled from Docker Hub or built from the Dockerfile provided in the repository.

From Docker hub

 docker pull edgarsliepa/prevengs:latest

Build the docker image

To add dependencies, list Python packages in requirements.txt and add R library install code to requirements.R, which is executed while the image is built.

Clone this repository and update its submodules:

git clone https://github.com/EdgarsLiepa/prevengs.git
cd prevengs

# Update repository submodules.
git submodule update --init --recursive

The Docker image can be built from the Dockerfile provided in the repository. Run the build from the prevengs project directory:

docker build -t prevengs .

Build a multi-platform Docker image with buildx:

docker buildx build --platform linux/amd64,linux/arm64,linux/arm/v7 -t <username>/<image>:latest --push .

Get Human Release 31 (GRCh38.p12) reference gene annotations

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.basic.annotation.gtf.gz 

Run Pipeline with docker image

Create a reference Database for the pipeline

path_to_files can be specified in two ways:

  • A directory path, in which case all files in the directory are used.
  • A pattern that matches file names in a directory. Wrap the pattern in "" when specifying file names this way.

docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript create_reference_table.R <"path_to_files"> <"output_folder_path">
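
Quoting the pattern matters because an unquoted wildcard is expanded by the shell before the script sees it. The two input modes can be illustrated as follows. This is a Python sketch of the idea only, not the logic of create_reference_table.R, whose source is not shown here; `resolve_input_files` and the file names are hypothetical.

```python
import glob
import os
import tempfile


def resolve_input_files(path_to_files):
    """Return input files for a directory path or a file-name pattern."""
    if os.path.isdir(path_to_files):
        # Directory mode: take every regular file in the directory.
        return sorted(
            os.path.join(path_to_files, name)
            for name in os.listdir(path_to_files)
            if os.path.isfile(os.path.join(path_to_files, name))
        )
    # Pattern mode: expand the wildcard ourselves.
    return sorted(glob.glob(path_to_files))


# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.htseq_counts.txt", "b.htseq_counts.txt", "notes.md"):
        open(os.path.join(tmp, name), "w").close()
    by_directory = [os.path.basename(p) for p in resolve_input_files(tmp)]
    by_pattern = [
        os.path.basename(p)
        for p in resolve_input_files(os.path.join(tmp, "*counts.txt"))
    ]

print(by_directory)
print(by_pattern)
```

Directory mode picks up every file, including unrelated ones, so a pattern is the safer choice when the folder contains anything other than counts files.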

Add new samples to the reference table

Samples can be added to an existing reference table with the -add flag, followed by <"combined_sample_file"> <"path_to_files_to_add"> <"output_folder_path">.

docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript create_reference_table.R -add <"combined_sample_file"> <"path_to_files_to_add"> <"output_folder_path">

Run OutSingle

Input:

  • TSV feature table with counts for each gene from HTSeq.
  • output folder name

Output:

  • CSV file: combined p-value and z-score table for genes.

The program saves intermediate files in the folder where the input file is located. They are deleted after the pipeline completes.

docker run -v "$PWD":/usr/src/app -it --rm prevengs python3 run_outsingle.py <"path_to_files"> <"output_folder_path">
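
To make the z-score and p-value columns of the output easier to interpret, the sketch below shows how a z-score across samples maps to a two-sided p-value under a normal distribution. It is a simplified illustration under that assumption only. OutSingle's own method is more involved and is not reproduced here; `zscores_and_pvalues` and the input values are hypothetical.

```python
import math
import statistics


def zscores_and_pvalues(values):
    """Return (z, p) per value, with p a two-sided normal p-value."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        # No variation across samples: nothing can be called an outlier.
        return [(0.0, 1.0) for _ in values]
    results = []
    for value in values:
        z = (value - mean) / sd
        # Two-sided tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
        p = math.erfc(abs(z) / math.sqrt(2))
        results.append((z, p))
    return results


# Hypothetical expression values for one gene across three samples.
results = zscores_and_pvalues([1.0, 2.0, 3.0])
for z, p in results:
    print(f"z={z:+.2f}  p={p:.4f}")
```

A value at the mean has z = 0 and p = 1, while values further from the mean get larger |z| and smaller p.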

Run Differential expression analysis

Input:

  • TSV feature table with counts for each gene from HTSeq.
  • Metadata file.
  • Output folder name

Output:

  • CSV file: combined p-value and log2 fold change table for genes.

docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/dge.R <path to input file with ht_seq features> <path_to_metadata> <output_folder>
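
As a reading aid for the output table, a log2 fold change compares mean expression between two groups on a log2 scale. The sketch below is a simplified illustration and not what src/dge.R does; real differential expression tools also normalize counts and fit a statistical model to obtain p-values. The function name, the pseudocount of 1, and the example values are assumptions.

```python
import math


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 of mean(group_b) over mean(group_a), with a pseudocount."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    # The pseudocount avoids division by zero for unexpressed genes.
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


# Hypothetical counts for one gene in two diagnosis groups.
lfc = log2_fold_change([7, 7], [31, 31])
print(lfc)
```

A value of 2 means roughly fourfold higher expression in the second group; negative values mean lower expression.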

Run full pipeline script through the docker image

# Usage: Rscript src/pipeline.R <input_directory> <gtf_file> <output_folder> <metadata_file>

docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/pipeline.R data/BKUS_SAMPLES data/gencode.v31.chr_patch_hapl_scaff.annotation.gtf ./rez data/metadata_BKUS.tsv

Change $PWD to the path of the directory that contains the HTSeq files and scripts. The local directory is mounted into the Docker container at /usr/src/app.
This is needed so that the files can be accessed from inside the container.

Currently the pipeline script needs to be mounted as well.

Run the python script through the docker image

# Process transcriptome featureCounts.

# positional arguments:
#   counts_file  The featureCounts file to process.
#   gtf_file     The GTF file to use for gene length calculation.

# options:
#   -h, --help   show this help message and exit

docker run -v "$PWD":/usr/src/app -it --rm prevengs python3 src/script.py 'data/RNS_FLT3_156.F.fastq.genome.htseq_counts.txt' 'data/gencode.v31.chr_patch_hapl_scaff.annotation.gtf'
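
The gene length step can be pictured with the sketch below, which reads gene records from a GTF file and takes the length as the span of the gene feature. This is an assumption for illustration: src/script.py may define gene length differently (for example from exons), and `gene_lengths_from_gtf` and the example record are hypothetical.

```python
import io


def gene_lengths_from_gtf(handle):
    """Map gene_name to gene span length using 'gene' records in a GTF."""
    lengths = {}
    for line in handle:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])
        gene_name = None
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("gene_name "):
                gene_name = attr.split(" ", 1)[1].strip('"')
                break
        if gene_name is not None:
            # GTF coordinates are 1-based and inclusive.
            lengths[gene_name] = end - start + 1
    return lengths


# Hypothetical GTF content, not taken from the GENCODE file.
example_gtf = io.StringIO(
    "##description: example\n"
    'chr1\tHAVANA\tgene\t100\t1099\t.\t+\t.\t'
    'gene_id "ENSG_X.1"; gene_type "protein_coding"; gene_name "GENE_X";\n'
    'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\t'
    'gene_id "ENSG_X.1"; gene_name "GENE_X";\n'
)
gene_lengths = gene_lengths_from_gtf(example_gtf)
print(gene_lengths)
```

Only records whose feature type is "gene" are used here, so exon and transcript lines for the same gene do not overwrite the result.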

Running any R script through the docker image

Example: run the top5_boxplot R script through the Docker image.

docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/top5_boxplot.R data/ ./

Run the R PCA script through the docker image.

Usage: Rscript PCA_for_all_genes.R <input_directory> <output_file>

docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/PCA_for_all_genes.R data/ ./

Run the Docker container in interactive mode. This can be used to test or use the dependencies installed in the Docker image.

docker run -v "$PWD":/usr/src/app -it --rm prevengs

Run tests

To run the tests, run the following command:

docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript tests/dge_test.R

The testthat package is used to run the tests.

TODO

  • Add an exception for when a file in the HTSeq folder is not in the right format (create_reference_table.R)
    • Define the HTSeq file format.
    • BUG: breaks if HTSeq files are not named with *counts.txt
    • Possibly add pattern matching for files
  • Fix submodules.
  • Define which reference genome table to use.

Authors

  • Edgars Liepa
  • Ņikita Fomins
  • Pauls Daugulis
  • Agate Jarmakovica
  • Aivija Stugle
