#########################################################################################
TTTTTTTTTTTTTTTTTTTTTTT OOOOOOOOO BBBBBBBBBBBBBBBBB IIIIIIIIII
T:::::::::::::::::::::T OO:::::::::OO B::::::::::::::::B I::::::::I
T:::::::::::::::::::::T OO:::::::::::::OO B::::::BBBBBB:::::B I::::::::I
T:::::TT:::::::TT:::::T O:::::::OOO:::::::O BB:::::B B:::::B II::::::II
TTTTTT T:::::T TTTTTT O::::::O O::::::O B::::B B:::::B I::::I
T:::::T O:::::O O:::::O B::::B B:::::B I::::I
T:::::T O:::::O O:::::O B::::BBBBBB:::::B I::::I
T:::::T O:::::O O:::::O B:::::::::::::BB I::::I
T:::::T O:::::O O:::::O B::::BBBBBB:::::B I::::I
T:::::T O:::::O O:::::O B::::B B:::::B I::::I
T:::::T O:::::O O:::::O B::::B B:::::B I::::I
T:::::T O::::::O O::::::O B::::B B:::::B I::::I
TT:::::::TT O:::::::OOO:::::::O BB:::::BBBBBB::::::B II::::::II
T:::::::::T OO:::::::::::::OO B:::::::::::::::::B I::::::::I
T:::::::::T OO:::::::::OO B::::::::::::::::B I::::::::I
TTTTTTTTTTT OOOOOOOOO BBBBBBBBBBBBBBBBB IIIIIIIIII
TOBI: Tumor Only Boosting Identification of Driver Mutations
Tumor-Only Boosting Identification (TOBI) is a framework for unified germline and somatic
analysis using largely tumor-only samples. TOBI uses gradient boosting to learn
features of confirmed somatic variants from a small training set of tumor-normal samples,
then generates a classification model that identifies variants with somatic characteristics
in tumor-only samples.
First, WES files from tumor samples undergo variant calling, annotation, and filtering for
quality. TOBI then merges variants across multiple samples. In pre-processing, TOBI labels
variants from the training set as somatic ("som") or non-somatic ("non_som"). Finally, in
the machine learning step, TOBI generates and applies a somatic classifier.
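The labeling step described above can be illustrated with a minimal sketch. It assumes
each variant is identified by a (chromosome, position, ref, alt) tuple and that the
confirmed somatic variants from the tumor-normal samples are available as a set. The
function name, the data layout, and the example values are hypothetical; TOBI's own
pre-processing is done in R and may differ.

```python
# Hypothetical illustration of som / non_som labeling; not TOBI's actual code.

def label_variants(variants, confirmed_somatic):
    """Map each variant key to "som" or "non_som".

    variants: iterable of (chrom, pos, ref, alt) tuples from the training set.
    confirmed_somatic: set of tuples of the same shape that are confirmed somatic.
    """
    return {
        v: ("som" if v in confirmed_somatic else "non_som")
        for v in variants
    }


# Made-up variants, for illustration only.
training_variants = [
    ("chr1", 1000, "A", "T"),
    ("chr2", 2000, "C", "G"),
    ("chr3", 3000, "G", "A"),
]
confirmed = {("chr1", 1000, "A", "T")}

labels = label_variants(training_variants, confirmed)
for key in training_variants:
    print("{} {}".format(key, labels[key]))
```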
Ver. 1.2: April 12, 2016
cjmadubata & tchu modified from Alireza Roshan Ghias's code
(Ver. 1.1: Nov 07, 2014 https://github.com/alireza202/TOBI.git TOBI)
dependencies:
- Python 2.7.11
- Perl v5.10.1
- R v3.1.2
- Java 1.7.0_25
- samtools 0.1.19
- bcftools 0.1.19
- VCFtools v0.1.10.1
- snpEff v3.6 & dbNSFP (https://sites.google.com/site/jpopgen/dbNSFP)
- snpSift v3.6
#########################################################################################
### varCall_filtering ###
inputs at each step:
V (variant calling): indexed .bam files in a folder. Files must have .bam extension
and filename cannot start with a number.
A (annotation): .vcf files in a folder. Files must have .vcf extension and filename
cannot start with a number. If starting from this step, please format vcf to
match bcftools output.
F (filter): .vcf files in a folder. Files must have .vcf extension and filename
cannot start with a number.
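The naming rules above can be checked before launching a run. The following is a
minimal sketch, not part of TOBI: the function name and the example filenames are
hypothetical, and it only tests the two stated rules (required extension, no leading
digit).

```python
import os


def invalid_inputs(filenames, extension):
    """Return the filenames that break the input naming rules.

    A name is rejected if it lacks the given extension (".bam" or ".vcf")
    or if the file's base name starts with a digit.
    """
    bad = []
    for name in filenames:
        base = os.path.basename(name)
        wrong_ext = not base.endswith(extension)
        starts_with_digit = base[:1].isdigit()
        if wrong_ext or starts_with_digit:
            bad.append(name)
    return bad


print(invalid_inputs(["tumorA.bam", "01_tumor.bam", "tumorB.sam"], ".bam"))
# prints ['01_tumor.bam', 'tumorB.sam']
```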
usage: TOBIvaf.py [-h] [--inputdir INPUTDIR] [--output OUTPUT]
[--config CONFIG] [--steps STEPS] [--cluster {hpc,amazon}]
[--debug] [--cleanup] [--ref REF] [--start START]
[--end END] [--snpeff SNPEFF] [--annovcf ANNOVCF]
[--dbnsfp DBNSFP] [--vcftype {default,TCGA}]
[--mergename MERGENAME]
TOBIv1.2: Tumor Only Boosting Identification of Driver Mutations. All arguments
can be specified in a config file (see the included varCall.config file for an
example).
Arguments:
General Arguments:
-h, --help show this help message and exit
--inputdir INPUTDIR [REQUIRED] directory for bam/vcf files.
--output OUTPUT [REQUIRED] output directory.
--config CONFIG config file specifying command line arguments.
Arguments specified in the command line overwrite
config file arguments.
--steps STEPS [REQUIRED] Specify which steps of the pipeline to run.
V: variant calling, A: annotate, F: filter, M: merge.
e.g. --steps AF
--cluster {hpc,amazon}
[REQUIRED] Specify which cluster to run on. hpc: run
on an SGE hpc cluster; amazon: CURRENTLY UNIMPLEMENTED
--debug Debug/verbose flag. Default: False
--cleanup Delete temporary debug files. Default: True
VCF Step Arguments:
--ref REF [REQUIRED - VCF] Reference genome file.
--start START Start index used for testing. Will not work in config.
Default: 1
--end END End index used for testing. Will not work in config.
Default: 74
Annotation Step Arguments:
--snpeff SNPEFF [REQUIRED - ANNOTATE] Directory where snpEff is
--annovcf ANNOVCF [REQUIRED - ANNOTATE] A comma separated list of .vcf
files to annotate with.
--dbnsfp DBNSFP [REQUIRED - ANNOTATE] Path to dbNSFP file
Filter Step Arguments:
--vcftype {default,TCGA}
Specifies vcf type specifically for TCGA filtering
Merge Step Arguments:
--mergename MERGENAME
[REQUIRED - MERGE] Name for final merged file
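The help text above says that command-line arguments overwrite config-file arguments,
that --start and --end do not work in a config file, and that some arguments are
required only for particular steps. A minimal sketch of those rules follows. It is not
TOBIvaf.py's own logic: the helper names are hypothetical, the example paths are made
up, and the config file is assumed to be already parsed into a dict because its format
is not described here.

```python
# Hypothetical helpers; the option names mirror the TOBIvaf.py help text.
CLI_ONLY = {"start", "end"}  # help text: these will not work in a config file
ALWAYS_REQUIRED = ["inputdir", "output", "steps", "cluster"]
REQUIRED_BY_STEP = {
    "V": ["ref"],
    "A": ["snpeff", "annovcf", "dbnsfp"],
    "F": [],
    "M": ["mergename"],
}


def merge_options(config_values, cli_values):
    """Combine option dicts; values given on the command line win."""
    merged = {k: v for k, v in config_values.items() if k not in CLI_ONLY}
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


def missing_options(options):
    """Return the sorted names of required options that are not set."""
    steps = options.get("steps") or ""
    unknown = [s for s in steps if s not in REQUIRED_BY_STEP]
    if unknown:
        raise ValueError("unknown step letter(s): " + "".join(unknown))
    needed = list(ALWAYS_REQUIRED)
    for step in steps:
        needed.extend(REQUIRED_BY_STEP[step])
    return sorted(name for name in needed if not options.get(name))


# Made-up values: the config asks for all steps, the command line narrows to AF.
config = {"inputdir": "vcfs/", "output": "out/", "steps": "VAFM", "cluster": "hpc"}
cli = {"steps": "AF", "output": None}

options = merge_options(config, cli)
print(options["steps"])          # prints AF
print(missing_options(options))  # prints ['annovcf', 'dbnsfp', 'snpeff']
```

Checking for missing step-specific arguments up front avoids submitting a cluster job
that would fail partway through the pipeline.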
#########################################################################################
### machine_learning ###
Step 8. Pre-processing using R. Needs customization each time.
usage: TOBIml.py [-h] [--input INPUT] [--output OUTPUT] [--somatic SOMATIC]
[--log LOG] [--check_missed CHECK_MISSED] [--suffix SUFFIX]
[--vcftype {default,TCGA}] [--train_size TRAIN_SIZE]
[--verbose]
{preprocess,machinelearning}
TOBIv1.2: Tumor Only Boosting Identification of Driver Mutations. Machine
learning step.
positional arguments:
{preprocess,machinelearning}
preprocess: preprocessing step;
machinelearning: machine learning step
optional arguments:
-h, --help show this help message and exit
--input INPUT [REQUIRED] input file
--output OUTPUT [REQUIRED] output file for preprocess (PP), output folder for machinelearning (ML)
--somatic SOMATIC [REQUIRED] formatted file containing somatic variants
--log LOG Optional argument to specify a log to pipe stdout and
stderr to
--check_missed CHECK_MISSED
[PP ARG] checking which mutations in important genes
are missed by filtering
--suffix SUFFIX [ML ARG] a label specific to this particular run (e.g.
<date>_<disease>)
--vcftype {default,TCGA}
Specifies vcf type specifically for TCGA filtering
--train_size TRAIN_SIZE
[ML ARG] number of patients you want in the training
set.
--verbose verbose flag
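As a rough illustration of the usage above, the sketch below assembles argument lists
for the two TOBIml.py modes without running them. The helper name and every file name
are hypothetical, and feeding the preprocess output into the machinelearning step is an
assumption about how the two modes chain together. Only options that take a value are
handled; --verbose is left out.

```python
def tobiml_command(mode, input_path, output_path, somatic_path, **extra):
    """Build an argv list for TOBIml.py (hypothetical helper).

    mode: "preprocess" or "machinelearning", the positional argument.
    extra: optional valued options such as suffix or train_size.
    """
    if mode not in ("preprocess", "machinelearning"):
        raise ValueError("mode must be 'preprocess' or 'machinelearning'")
    argv = [
        "python", "TOBIml.py",
        "--input", input_path,
        "--output", output_path,
        "--somatic", somatic_path,
    ]
    for name in sorted(extra):
        argv.extend(["--" + name, str(extra[name])])
    argv.append(mode)
    return argv


# Made-up paths; the preprocess output is assumed to be the ML input.
pp_cmd = tobiml_command("preprocess", "merged.vcf", "preprocessed.txt", "somatic.txt")
ml_cmd = tobiml_command(
    "machinelearning", "preprocessed.txt", "ml_out", "somatic.txt",
    suffix="run1", train_size=20,
)
print(" ".join(pp_cmd))
print(" ".join(ml_cmd))
```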