Most dependencies for the pipeline are met by the two conda requirement files,
conda_protmap.yml and conda_protmap_R.yml.
The dependencies are tested in an Ubuntu Docker environment. To set up the Docker
environment, use the script ./setup_docker.sh. All further dependencies
are listed there. To meet the dependencies that are not covered by Docker,
adapt this script accordingly.
In the contribution we used all raw reads from the BioProject PRJNA655119
for the expression analysis:
https://www.ncbi.nlm.nih.gov/bioproject/?term=prjna655119
For the publication, the genomes of the following bacteria are downloaded.
This is handled by the script ./build_db/download_all_genomes.sh.
The short names are used throughout the scripts and should not be changed. The full names are as follows.
short name | scientific name | taxid | assembly id |
---|---|---|---|
anaero | Anaerostipes caccae DSM 14662 | 411490 | GCA_014131675.1 |
bact | Bacteroides thetaiotaomicron VPI5482 | 226186 | GCA_014131755.1 |
bifi | Bifidobacterium longum NCC2705 | 206672 | GCF_000007525.1 |
blautia | Blautia producta ATCC 27340 DSM 2950 | 1121114 | GCA_014131715.1 |
clostri | Clostridium butyricum DSM 10702 | 1316931 | GCA_014131795.1 |
ecoli | Escherichia coli str K12 substr MG1655 | 511145 | GCF_000005845.2 |
ery | Erysipelatoclostridium ramosum DSM 1402 | 445974 | GCA_014131695.1 |
lacto | Lactobacillus plantarum subsp plantarum ATCC 14917 JCM 1149 CGMCC 12437 | 525338 | GCA_014131735.1 |
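Because the short names are fixed identifiers, it can help to keep the table in one machine-readable place. The following is a minimal sketch, not part of the pipeline: the `Genome` tuple, the `GENOMES` dictionary and the `lookup` helper are hypothetical names introduced here, and only the values are taken from the table above.

```python
from typing import NamedTuple


class Genome(NamedTuple):
    """One row of the genome table (hypothetical helper type)."""
    scientific_name: str
    taxid: int
    assembly_id: str


# Values copied from the table above; the short names are the keys.
GENOMES = {
    "anaero": Genome("Anaerostipes caccae DSM 14662", 411490, "GCA_014131675.1"),
    "bact": Genome("Bacteroides thetaiotaomicron VPI5482", 226186, "GCA_014131755.1"),
    "bifi": Genome("Bifidobacterium longum NCC2705", 206672, "GCF_000007525.1"),
    "blautia": Genome("Blautia producta ATCC 27340 DSM 2950", 1121114, "GCA_014131715.1"),
    "clostri": Genome("Clostridium butyricum DSM 10702", 1316931, "GCA_014131795.1"),
    "ecoli": Genome("Escherichia coli str K12 substr MG1655", 511145, "GCF_000005845.2"),
    "ery": Genome("Erysipelatoclostridium ramosum DSM 1402", 445974, "GCA_014131695.1"),
    "lacto": Genome(
        "Lactobacillus plantarum subsp plantarum ATCC 14917 JCM 1149 CGMCC 12437",
        525338,
        "GCA_014131735.1",
    ),
}


def lookup(short_name: str) -> Genome:
    """Return the genome record for a short name, or fail with a clear message."""
    try:
        return GENOMES[short_name]
    except KeyError:
        known = ", ".join(sorted(GENOMES))
        raise KeyError(f"unknown short name {short_name!r}; expected one of: {known}") from None
```

For example, `lookup("ecoli").assembly_id` gives the assembly id listed for E. coli in the table.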
All major steps are covered in their own subdirectories.
- build_db:
  - Builds databases for comet, downloads genomes.
- comet:
  - MS data is gathered, PSMs are generated.
- transcritptom:
  - All scripts for the mapping of the transcriptomic reads are here.
- data_accumulation:
  - Most of the final analyses are done here.
- candidates:
  - The candidate selection and evaluation is done here.
- UCSC_track_tools:
  - The UCSC track hub is generated here.
- figure_plotting:
  - All scripts for plots that were automatically generated from data are here.
- start_anno_html:
  - The results for evidence of early annotated start sites are generated here.
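A quick way to confirm that a checkout is complete is to test for these subdirectories. This is only a sketch: `STEP_DIRS` and `missing_step_dirs` are hypothetical names, and the directory names are spelled exactly as in the list above.

```python
from pathlib import Path

# Subdirectory names as listed above (spelling kept as in the repository list).
STEP_DIRS = (
    "build_db",
    "comet",
    "transcritptom",
    "data_accumulation",
    "candidates",
    "UCSC_track_tools",
    "figure_plotting",
    "start_anno_html",
)


def missing_step_dirs(repo_root):
    """Return the step subdirectories that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in STEP_DIRS if not (root / name).is_dir()]
```

Calling `missing_step_dirs(".")` from the repository root should return an empty list if all step directories are present.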
The file parameters.json holds the parameters the scripts need to run and must be adapted for each system.
- session_id
  - track hub session id
- hub_id
  - hub id for the UCSC genome browser
- data_dir
  - dir to store all data; at least 1.5 TB is needed
- tmp_dir
  - dir for temporary files
- publication_dir
  - dir to output figures and infos
- ms_dir
  - dir of MS data; not needed if PRIDE is available
- chrome_bin
  - dir to the chrome binary; the version must be the correct one
- bin_path
  - dir where comet etc. is expected
- blastdb_dir
  - blast nt db
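Before starting a long run it can be useful to check that parameters.json contains every key described above. The sketch below assumes that the file is a flat JSON object keyed by these names, which the list suggests but does not state. `load_parameters`, `REQUIRED_KEYS` and `OPTIONAL_KEYS` are hypothetical names, and `ms_dir` is treated as optional because it is not needed when PRIDE is available.

```python
import json

# Keys described in the list above. Assumption: parameters.json is a flat JSON object.
REQUIRED_KEYS = (
    "session_id",
    "hub_id",
    "data_dir",
    "tmp_dir",
    "publication_dir",
    "chrome_bin",
    "bin_path",
    "blastdb_dir",
)
OPTIONAL_KEYS = ("ms_dir",)  # not needed if PRIDE is available


def load_parameters(path="parameters.json"):
    """Load the parameter file and fail early if a required key is missing."""
    with open(path, encoding="utf-8") as handle:
        params = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError("parameters file is missing: " + ", ".join(missing))
    return params
```

A call such as `load_parameters("parameters.json")["data_dir"]` then returns the configured data directory, or raises a `KeyError` that names the missing entries.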