MotifTools

A collection of python tools to evaluate a variant on TF motif

D-score: motif-breaking or motif-gaining power
B-score: burden score

Dependencies

Python 2.6 or higher (tested on Python 2.7.12)
BioPython 1.66 or higher
NumPy
BedTools v2.26 or higher

Config

Make a copy of config.ini from config-sample.ini
Specify path to bedtools
Specify path to JASPAR database pfm file (i.e., JASPAR TF profiles, 2016 core non-redundant vertebrates, can be downloaded here)
C_PRECISION tells how many digits to preserve when calculating P-value
C_PSEUDOCOUNTS is added to PFM to avoid division by zero
C_PVAL_THRESHOLD is P-value threshold for reporting TF motif significance on either reference or alternate genome. For example, suppose C_PVAL_THRESHOLD is 0.0001 and there is a variant that causes TF motif P-value to change from 0.001 to 0.1. Although it has D-score of 20, it does not get reported because neither TF motif on reference genome nor on alternate genome is significant. If a TF motif has P-value of 0.001 on reference genome and if a variant makes a TF motif to have P-value of 0.0001, it will be reported with D-score of -10.

Inputs

Peak BED file
Variant VCF file
Reference genome FASTA file
- For example, hg19 genome can be downloaded from here

PFM in Jaspar 2016 PFM format

For example, a file containing a single position frequency (count) matrix

>MA0491.1	JUND
A  [12389 15163     0     0 38710     0     0  1547 38710     0  5786 ]
C  [  759  4824     0     0     0 22857     0 37163     0  5136 14956 ]
G  [13640 18723     0 38710     0 14078     0     0     0  2301 12429 ]
T  [11922     0 38710     0     0  1775 38710     0     0 31273  5539 ]

Usage

python runDscore.py --name {"MCF-7_CTCF"} --bed {peak.bed} --pfm {CTCF.pfm} --vcf {variant.vcf} --ref {reference_genome.fa}
python runBscore.py --sample {"MCF-7"} --tf {"CTCF"} --bed {peak.bed} --vcf {variant.vcf} --ref {reference_genome.fa}

or

python runDscore.py -n {"MCF-7_CTCF"} -b {peak.bed} -p {CTCF.pfm} -v {variant.vcf} -r {reference_genome.fa}
python runBscore.py -s {"MCF-7"} -t {"CTCF"} -b {peak.bed} -v {variant.vcf} -r {reference_genome.fa}

Output

D-score output is saved in BED format (i.e., genomic coordinate is 0-based).

chr1	205304887	205304906	chr1:205304888-205304906_+	12.5378302556	+	3.20285835187e-06	5.74540645175e-05	TCCTCA[C]TAGGGGGCAGCA	TCCTCA[T]TAGGGGGCAGCA	15.122657723	9.53769650652	7
chr14	107239045	107239065	chr14:107239046-107239064_-	-0.857778555555	-	7.40751347621e-05	6.07987421972e-05	AGGACATGAGGTGGC[A]CTC	AGGACATGAGGTGGC[G]CTC	8.96464812219	9.41133123776	16
chr1	115499385	115499405	chr1:115499386-115499404_-	1.22679468983	-	6.97545074217e-05	9.25234344322e-05	tcagcactagaggatg[C]tc	tcagcactagaggatg[G]tc	9.10192102726	8.45112655346	17
chr15	79888612	79888632	chr15:79888613-79888631_-	-1.33949922329	-	8.03446164355e-06	5.90210402152e-06	CAGCCTCTAGGAGGAGCT[C]	CAGCCTCTAGGAGGAGCT[G]	13.5124960611	14.0679325841	19
chr20	57023063	57023083	chr20:57023064-57023082_-	-1.16307241472	-	6.53455936117e-05	4.99929847138e-05	CAGACAGCCGGGGGCAGG[G]	CAGACAGCCGGGGGCAGG[A]	9.24950881325	9.84527805058	19
chr1	201528952	201528971	chr1:201528953-201528971_+	12.2863623638	+	7.54283973947e-07	1.27694183902e-05	CCGCCACC[A]GAGGGTGCAG	CCGCCACC[G]GAGGGTGCAG	17.3776446873	12.6446297995	9
chr1	201528952	201528971	chr1:201528953-201528971_+	-4.59613738195	+	7.54283973947e-07	2.61770765064e-07	CCGCCACCAGAGGGTGC[A]G	CCGCCACCAGAGGGTGC[C]G	17.3776446873	18.8381967673	18
chr5	176541052	176541071	chr5:176541053-176541071_+	1.43157045253	+	8.56964798004e-05	0.000119157128211	TAG[T]CACACGGTGGCGACA	TAG[G]CACACGGTGGCGACA	8.62883752903	7.85123023661	4
chr17	55825082	55825102	chr17:55825083-55825101_-	-7.89814065592	-	8.06438401924e-06	1.30845000967e-06	AGACCAGTAGGTGTCA[T]CA	AGACCAGTAGGTGTCA[G]CA	13.5051877266	16.5578288634	17
chr17	53591549	53591569	chr17:53591550-53591568_-	3.72653409279	-	1.96814653464e-09	4.64206095785e-09	TGGCCACTAGATGGCACC[A]	TGGCCACTAGATGGCACC[G]	23.6278904949	23.0321212576	19

COLUMN 01: chromosome
COLUMN 02: start position of motif
COLUMN 03: end position of motif
COLUMN 04: 1-based genomic coordinate ID
COLUMN 05: D-score
COLUMN 06: strand
COLUMN 07: P-value on reference genome
COLUMN 08: P-value on alternate genome
COLUMN 09: motif sequence on reference genome
COLUMN 10: motif sequence on alternate genome
COLUMN 11: raw motif score of reference genome based on PWM (position weight matrix)
COLUMN 12: raw motif score of alternate genome based on PWM (position weight matrix)
COLUMN 13: variant position w.r.t. motif

How to interpret D-score

D-score is a motif "[D]isruptive score" of a variant. It is calculated by difference between P-value between reference genome and alternate genome.

D-score = [-10 * log(P-val_Ref)] - [-10 * log(P-val_Alt)]
D-score = -10 * log(P-val_Ref/P-val_Alt)

Positive D-score denotes a variant is decreasing the likelihood of TF to bind the motif (motif-break)
Negative D-score denotes a variant is increasing the likelihood of TF to bind the motif (motif-gain)

Future Improvements

Current version has been optimized for calculating D-score with equal nucleotide background (A:C:G:T=1:1:1:1). While changing the proportion of nucleotide background is supported in current version, it can be slow and precision maybe lost. Our internal test has indicated that changing the proportion of nucleotide background has very minimal impact on P-value. The future version should address this issue.
Motif logo of BEFORE and AFTER

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
.gitignore		.gitignore
MotifTools.png		MotifTools.png
README.md		README.md
__init__.py		__init__.py
bed2vcf.py		bed2vcf.py
callMotif.py		callMotif.py
config-example.ini		config-example.ini
procMotif.py		procMotif.py
runBscore.py		runBscore.py
runDenovo.py		runDenovo.py
runDscore.py		runDscore.py
runUtil.py		runUtil.py
vcf2bed.py		vcf2bed.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MotifTools

Dependencies

Config

Inputs

Usage

Output

How to interpret D-score

Future Improvements

About

Releases

Packages

Languages

hoondy/MotifTools

Folders and files

Latest commit

History

Repository files navigation

MotifTools

Dependencies

Config

Inputs

Usage

Output

How to interpret D-score

Future Improvements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages