
BINDetect

Mette Bentsen edited this page Sep 5, 2019 · 10 revisions

Input

  • --signals
    List of signal bigwigs with scores representing protein binding for each biological condition (higher scores = more evidence of binding). These can be, for example, coverage tracks or footprint scores calculated with TOBIAS FootprintScores.

  • --motifs
    File containing motifs in either PFM, JASPAR or MEME format. These are the motifs which will be used to scan for binding sites.

  • --genome
    The fasta file containing the full genome sequence for the given organism. The chromosome names and lengths must match those in the --signals bigwigs.

  • --peaks
    The peaks representing open chromatin regions.
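As a rough illustration of how the inputs fit together, the sketch below assembles an argument list from the flags named on this page. It is not taken from the TOBIAS source: the executable name `TOBIAS BINDetect`, the convention of passing several bigwigs space-separated after --signals (and likewise for --cond_names), and all file names are assumptions made for the example. The command is only printed, not run.

```python
import shlex


def build_bindetect_command(signals, motifs, genome, peaks, outdir, cond_names=None):
    """Assemble an argument list for a BINDetect run (nothing is executed)."""
    # Assumption: the tool is invoked as "TOBIAS BINDetect".
    cmd = ["TOBIAS", "BINDetect"]
    # Assumption: multiple bigwigs follow the flag, separated by spaces.
    cmd += ["--signals", *signals]
    cmd += ["--motifs", motifs, "--genome", genome, "--peaks", peaks]
    cmd += ["--outdir", outdir]
    if cond_names:
        # One condition name per signal file keeps the output columns unambiguous.
        if len(cond_names) != len(signals):
            raise ValueError("need exactly one condition name per signal file")
        cmd += ["--cond_names", *cond_names]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_bindetect_command(
    signals=["condA_footprints.bw", "condB_footprints.bw"],
    motifs="motifs.jaspar",
    genome="genome.fa",
    peaks="peaks.bed",
    outdir="bindetect_output",
    cond_names=["condA", "condB"],
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if any path contains spaces.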

Output

In the output folder given in --outdir, the following files and folder structure will be created:

  • <outdir>/<TF>/
    For each motif in --motifs, there will be a directory containing results from the scanning for this motif. The value of <TF> is given by the --naming parameter.

    • <outdir>/<TF>/<TF>_overview.{txt,xlsx}
      This is an overview of all motif occurrences in open regions (TFBS) for <TF>. The file exists in .txt (tab-delimited) and .xlsx format for easy filtering, sorting, etc.
    • <outdir>/<TF>/beds/
      The beds directory contains bed files for all sites as well as bound/unbound splits per condition. The _all file contains all scores from --signals, whereas the bound/unbound files contain only the score for the given condition in the last column. Values of <condition> are given by the --cond_names parameter.
      - <outdir>/<TF>/beds/<TF>_all.bed
      - <outdir>/<TF>/beds/<TF>_<condition>_bound.bed
      - <outdir>/<TF>/beds/<TF>_<condition>_unbound.bed
  • <outdir>/bindetect_results.{txt,xlsx}
    This file contains the results of the full BINDetect run. Each line is a TF and the columns are:

    • TF_name: Name as determined by --naming
    • total_tfbs: Number of binding sites found in the input --peaks
    • <condition>_bound: Number of sites predicted bound in the given condition. This is estimated independently per condition based on the distribution of scores, and therefore depends strongly on how well the bound/unbound threshold was set. As a result, a transcription factor can have more bound sites in condition1 than in condition2 and still have a negative <condition1>_<condition2>_change score, which would indicate more binding in condition2. In this case, the _change score is the more reliable metric.
    • <condition1>_<condition2>_change: The differential binding score for the TF between the two conditions. Negative values imply more binding in condition2.
    • <condition1>_<condition2>_pvalue: The p-value of the statistical test against a background model. This can be very small due to the large number of transcription factor binding sites found, so it should always be considered in combination with the <condition1>_<condition2>_change column.
  • <outdir>/bindetect_figures.pdf
    A multi-page PDF containing an overview of score-distributions for each condition as well as log2fc and bindetect volcano-plots for each condition-comparison.

  • <outdir>/TF_distance_matrix.txt
    Distance matrix used to cluster the transcription factors in the dendrograms of bindetect_figures.pdf. This is based on the overlap of individual transcription factor binding sites.
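The advice above on reading the _change and _pvalue columns of bindetect_results.txt together can be sketched as a small filter. This is a minimal example, not part of TOBIAS: it relies only on the column names described on this page, the thresholds are arbitrary placeholders, the sample rows are made up for illustration, and treating a positive change as "more bound in condition1" is an assumption inferred from the description of negative values.

```python
import csv
import io


def differential_tfs(handle, cond1, cond2, min_abs_change=0.2, max_pvalue=1e-3):
    """Return TFs whose change score AND p-value both pass a threshold.

    The default thresholds are placeholders, not recommended values.
    """
    change_col = f"{cond1}_{cond2}_change"
    pvalue_col = f"{cond1}_{cond2}_pvalue"
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        change = float(row[change_col])
        pvalue = float(row[pvalue_col])
        # A tiny p-value alone is not enough; require a sizeable change too.
        if abs(change) >= min_abs_change and pvalue <= max_pvalue:
            # Negative change = more binding in cond2 (per the column description);
            # the positive case is assumed to be the mirror image.
            more_bound_in = cond2 if change < 0 else cond1
            hits.append((row["TF_name"], change, pvalue, more_bound_in))
    # Largest absolute change first.
    return sorted(hits, key=lambda hit: abs(hit[1]), reverse=True)


# Made-up rows standing in for <outdir>/bindetect_results.txt.
sample = io.StringIO(
    "TF_name\ttotal_tfbs\tcondA_bound\tcondB_bound\tcondA_condB_change\tcondA_condB_pvalue\n"
    "TF1\t1000\t300\t250\t-0.45\t1e-20\n"
    "TF2\t800\t200\t210\t0.01\t1e-30\n"
    "TF3\t500\t100\t150\t0.30\t0.2\n"
)
hits = differential_tfs(sample, "condA", "condB")
for name, change, pvalue, more_bound_in in hits:
    print(name, change, pvalue, more_bound_in)
```

In the sample, TF1 has more bound sites in condA by count but a negative change score, so it is reported as more bound in condB. TF2 is dropped despite its very small p-value because its change score is close to zero, and TF3 is dropped because its p-value is too large. For a real run, the same function can be given an open file handle instead of the in-memory sample.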
