Pediatric oncogenomics data analysis pipeline for the RTU Innovation Health Hub med tech development program.
THIS IS A DEVELOPMENT VERSION OF THE PIPELINE. IT IS NOT READY FOR PRODUCTION USE.
This code processes transcriptome data and generates a plot of the results. The output plot shows the expression levels of different genes.
dependency list.
Input Files:
- HT-Seq output file with counts for each gene - <SampleName>.htseq_counts.txt
- 3 columns
- Gene Name
- Gene ID (Will be filtered out)
- Count
- 3 columns
- Reference Lengths.
- GTF file with gene lengths.
- metadata file.
- SampleName,
- attr_diagnosis_group (Solid Tumor, Hematologic Malignancy, etc.),
- source (St. JUDE, BKUS)
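To illustrate the count file layout listed above, here is a minimal Python sketch that reads a three-column HT-Seq file and keeps only the gene name and the count. The tab delimiter, the column order, and the helper name `read_htseq_counts` are assumptions taken from the list above; this is not the pipeline's own code. Rows whose first field starts with `__` (HT-Seq summary rows such as `__no_feature`) are skipped.

```python
import csv
import io


def read_htseq_counts(handle):
    """Return {gene_name: count} from a 3-column, tab-separated HT-Seq file.

    Assumed column order (from the list above): gene name, gene ID, count.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row!r}")
        gene_name, _gene_id, count = row  # the gene ID column is dropped
        if gene_name.startswith("__"):
            continue  # HT-Seq summary rows such as __no_feature
        counts[gene_name] = int(count)
    return counts


# Hypothetical two-gene example standing in for <SampleName>.htseq_counts.txt
example = io.StringIO("TP53\tENSG00000141510\t120\nFLT3\tENSG00000122025\t45\n")
print(read_htseq_counts(example))
```

Raising an error on rows that do not have exactly three columns makes a malformed input file fail early instead of producing a silently truncated table.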
Output:
- Sample Reference table
- PCA
- Top10 genes plot.
- TPM list if calculated.
- QQplots.
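The TPM output mentioned above can be sketched as follows. This is the standard transcripts-per-million formula (count divided by gene length, rescaled so that all genes sum to one million); whether the pipeline computes it exactly this way is an assumption, and the function name `counts_to_tpm` is hypothetical.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts to TPM given gene lengths in base pairs."""
    # Reads per kilobase: normalise each count by its gene length.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Scale so that the values of all genes sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    if scale == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value / scale for gene, value in rpk.items()}


# Hypothetical counts and lengths for two genes
tpm = counts_to_tpm({"A": 10, "B": 20}, {"A": 1000, "B": 4000})
print(tpm)
```

Gene B has twice the reads of gene A but four times the length, so its TPM comes out at half that of gene A.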
The Docker image can be pulled from Docker Hub or built from the Dockerfile provided in the repository.
docker pull edgarsliepa/prevengs:latest
Pull this repository.
Add Python dependencies to requirements.txt. Add R library install code to requirements.R, which is executed while the image is built.
Clone this repository and update its submodules.
git clone https://github.com/EdgarsLiepa/prevengs.git
# Update repository submodules.
git submodule update --init --recursive
The Docker image can be built from the Dockerfile provided in the repository. Run the build from the prevengs project directory.
docker build -t prevengs .
Cross-compile the Docker image with buildx.
docker buildx build --platform linux/amd64,linux/arm64,linux/arm/v7 -t <username>/<image>:latest --push .
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.basic.annotation.gtf.gz
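As a rough illustration of how gene lengths could be derived from a GTF annotation such as the one downloaded above, the sketch below takes the span of each `gene` feature. This is an assumption: the pipeline may instead use exon-based lengths, and the function name `gene_lengths_from_gtf` is hypothetical.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_lengths_from_gtf(lines):
    """Return {gene_name: length_bp} using the span of each 'gene' feature."""
    lengths = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # keep only well-formed 'gene' records
        match = GENE_NAME.search(fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive.
            lengths[match.group(1)] = int(fields[4]) - int(fields[3]) + 1
    return lengths
```

The function accepts any iterable of lines, so the compressed download can be read directly with `gzip.open(path, "rt")` from the standard library.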
path_to_files can be specified in two ways:
- A directory path: all files in the directory are used.
- A pattern that matches file names in a directory. Enclose the pattern in quotes ("") when specifying file names with a pattern.
docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript create_reference_table.R <"path_to_files"> <"output_folder_path">
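The two ways of passing path_to_files can be sketched in Python as below. This is only an illustration of the idea, not the logic of create_reference_table.R, and `resolve_input_files` is a hypothetical helper. The quotes matter because an unquoted pattern is expanded by the shell before the script starts, so the script would receive several separate arguments instead of one pattern.

```python
import glob
import os


def resolve_input_files(path_to_files):
    """Return a sorted list of files for a directory path or a glob pattern."""
    if os.path.isdir(path_to_files):
        # Directory: take every entry inside it.
        candidates = [
            os.path.join(path_to_files, name)
            for name in os.listdir(path_to_files)
        ]
    else:
        # Otherwise treat the argument as a pattern, e.g. "data/*counts.txt".
        candidates = glob.glob(path_to_files)
    return sorted(path for path in candidates if os.path.isfile(path))
```

For example, `resolve_input_files("data/*counts.txt")` would return only the matching count files, while `resolve_input_files("data")` would return every file in the directory.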
Samples can be added to a database by using the -add
flag and specifying <"combined_sample_file"> <"path_to_files_to_add"> <"output_folder_path">.
docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript create_reference_table.R -add <"combined_sample_file"> <"path_to_files_to_add"> <"output_folder_path">
Input:
- tsv feature table with counts for each gene from htseq.
- output folder name
Output:
- CSV file: combined p-value and z-score table for genes.
The program saves intermediate files in the folder where the input file is located. They are deleted after the pipeline completes.
docker run -v "$PWD":/usr/src/app -it --rm prevengs python3 run_outsingle.py <"path_to_files"> <"output_folder_path">
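To show how the z-score and p-value columns of the output relate to each other, here is a simplified sketch for a single gene. It is not the algorithm that run_outsingle.py executes; it assumes log2-transformed counts, a plain per-gene z-score, and a two-sided p-value under a normal distribution. The function name `zscores_and_pvalues` is hypothetical.

```python
import math
import statistics


def zscores_and_pvalues(counts):
    """Return a list of (z, p) pairs for one gene's counts across samples."""
    logged = [math.log2(c + 1) for c in counts]  # +1 avoids log2(0)
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # needs at least two samples
    if sd == 0:
        return [(0.0, 1.0)] * len(counts)  # no variation, nothing is an outlier
    results = []
    for value in logged:
        z = (value - mean) / sd
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        results.append((z, p))
    return results


# Hypothetical counts: the last sample is far higher than the others
for z, p in zscores_and_pvalues([10, 12, 11, 9, 500]):
    print(round(z, 2), round(p, 3))
```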
Input:
- tsv feature table with counts for each gene from htseq.
- metadata file.
- output folder name
Output:
- CSV file: combined p-value and log2 fold change table for genes.
docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/dge.R <path to input file with ht_seq features> <path_to_metadata> <output_folder>
# Usage: Rscript pipeline.R <input_directory> <gtf_file> <output_folder> <metadata_file>
docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/pipeline.R data/BKUS_SAMPLES data/gencode.v31.chr_patch_hapl_scaff.annotation.gtf ./rez data/metadata_BKUS.tsv
Change $PWD to the path of the directory that contains the HT-Seq files and scripts. The local directory is mounted into the Docker container at /usr/src/app.
This is needed to access the files from inside the Docker container.
Currently the pipeline script needs to be mounted as well.
# Process transcriptome featureCounts.
# positional arguments:
# counts_file The featureCounts file to process.
# gtf_file The GTF file to use for gene length calculation.
# options:
# -h, --help show this help message and exit
docker run -v "$PWD":/usr/src/app -it --rm prevengs python3 src/script.py 'data/RNS_FLT3_156.F.fastq.genome.htseq_counts.txt' 'data/gencode.v31.chr_patch_hapl_scaff.annotation.gtf'
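The help text above corresponds to a command-line interface with two positional arguments. A minimal argparse sketch that would produce an equivalent interface is shown below; the function name `build_parser` is hypothetical, and the actual src/script.py may be organised differently.

```python
import argparse


def build_parser():
    """Build a parser with the two positional arguments from the help text."""
    parser = argparse.ArgumentParser(
        description="Process transcriptome featureCounts."
    )
    parser.add_argument("counts_file", help="The featureCounts file to process.")
    parser.add_argument(
        "gtf_file", help="The GTF file to use for gene length calculation."
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["sample.htseq_counts.txt", "annotation.gtf"])
print(args.counts_file, args.gtf_file)
```

The `-h, --help` option is added by argparse automatically, which matches the options section of the help text.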
Run the R top5_boxplot script through the Docker image.
docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/top5_boxplot.R data/ ./
Run the R PCA script through the docker image.
Usage: Rscript PCA_for_all_genes.R <input_directory> <output_file>
docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript src/PCA_for_all_genes.R data/ ./
Run the Docker container in interactive mode. This can be used to test or use the dependencies in the Docker image.
docker run -v "$PWD":/usr/src/app -it --rm prevengs
To run the tests, run the following command:
docker run -v "$PWD":/usr/src/app -it --rm prevengs Rscript tests/dge_test.R
The testthat package is used to run the tests.
- Add an exception when a file in the HT-Seq folder is not in the right format (create_reference_table.R)
- Define the HT-Seq file format.
- BUG: breaks if HT-Seq files are not named with *counts.txt
- Maybe add pattern matching for file names
- Fix submodules.
- Define which reference genome table to use.
- Edgars Liepa
- Ņikita Fomins
- Pauls Daugulis
- Agate Jarmakovica
- Aivija Stugle