Makes metagene plots for bedgraphs over given regions in bed files for any organism. Regions can be continuous or spliced. Useful in analysis of ChIRP-seq, ChIP-seq, GRO-seq, ATAC-seq, iCLIP, ribosome profiling, RNA-seq, and other NGS datasets.
- Go to 'releases' above and download the latest tar.gz file. Unzip with `tar xvzf <metagene-maker-0.x.tar.gz>`.
- Alternatively, you can clone this git repository using `git clone`.
- Go into the folder: `cd <metagene-maker-0.x>`
- Make sure you have the needed dependencies (below). Install: `sudo python setup.py install`. If you do not have sudo privileges, run `python setup.py install --user` or `python setup.py install --prefix=<desired directory>`. Be sure that the python you use to run `setup.py` is version 2.7; scripts WILL NOT WORK with lower versions (2.4, 2.5).
- Make config file (see below)
- Ensure that you have a bedgraph for every sample you want to analyze.
- Ensure that you have properly formatted BED6/12 files for every region for which you want to build average profiles. You can make these with the included `extractTranscriptRegions` module (see below).
- Run: `metagene_maker <config file> <name> <outputDir>`, where `<config file>` is the configuration file you make using `example.conf` (provided) as the template. The example file is in the `test` folder. Instructions for making the configuration file are below. Run this either in `screen` or with `nohup`.
- Output: tab-delimited files for each region in a new `averages` folder in the user-provided output directory, as well as raw files named `allchr_sorted.txt` in each subfolder. The raw files contain binned profiles for each region and can be used for custom analysis.
usage: metagene_maker [-h] [-l binLength] [-p processors] [--sample] config_file prefix output_directory
example: metagene_maker -p 10 --sample config/test.txt M3_ChIP chip/
positional arguments: | explanation |
---|---|
config_file | required configuration file |
prefix | Prefix of output files |
output_directory | Directory where output folders will be written |
optional arguments: | explanation |
---|---|
-h, --help | show this help message and exit |
-l binLength | Bases per window when processing bedgraph. Default is 2,000,000. |
-p processors | Number of cores to use. Default is 4. |
--sample | Run subsampling to make metagenes more robust. |
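To make the argument tables above concrete, here is a minimal sketch of a parser that accepts the same interface. It is a hypothetical illustration built with Python's standard `argparse`, not the tool's actual source; the function name `build_parser` and the `dest` names are assumptions.

```python
import argparse

def build_parser():
    # Hypothetical re-creation of the documented interface, for illustration only.
    parser = argparse.ArgumentParser(prog="metagene_maker")
    parser.add_argument("-l", dest="binLength", metavar="binLength", type=int,
                        default=2000000,
                        help="Bases per window when processing bedgraph")
    parser.add_argument("-p", dest="processors", metavar="processors", type=int,
                        default=4, help="Number of cores to use")
    parser.add_argument("--sample", action="store_true",
                        help="Run subsampling to make metagenes more robust")
    parser.add_argument("config_file", help="required configuration file")
    parser.add_argument("prefix", help="Prefix of output files")
    parser.add_argument("output_directory",
                        help="Directory where output folders will be written")
    return parser

# Parse the example command line shown in the usage section.
args = build_parser().parse_args(
    ["-p", "10", "--sample", "config/test.txt", "M3_ChIP", "chip/"])
print(args.processors, args.sample, args.binLength)
```

Optional flags may come before the three positional arguments, as in the documented example; `-l` falls back to its default because it was not given.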
- Python (>=2.7)
- Numpy (a python module) (>=1.7)
- Pandas (a python module) (>=0.14)
At least 4 GB RAM if your largest bedgraph is 1 GB and you use 4 cores (empirical rule: running n cores on an m GB bedgraph needs about m × n GB of RAM).
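The empirical rule can be turned into a quick check before choosing a value for `-p`. This is a rough sketch of the arithmetic only; the function name `estimate_ram_gb` is hypothetical and the result is a rule-of-thumb estimate, not a measured requirement.

```python
def estimate_ram_gb(n_cores, largest_bedgraph_gb):
    """Rule-of-thumb RAM estimate: cores times size of the largest bedgraph (GB)."""
    return n_cores * largest_bedgraph_gb

# The case quoted above: 4 cores and a 1 GB bedgraph.
print(estimate_ram_gb(4, 1.0))  # 4.0
```

If the estimate exceeds the memory available, lowering the number of cores reduces it proportionally.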
You can supply your own BED6/12 files or use genome-wide BED files made using an included script, extractTranscriptRegions. You can start from either GTF files or files downloaded from UCSC as follows:
- GTF: download the GTF file into the desired directory.
- UCSC: from the UCSC Genome Browser, go to Table Browser and choose your organism and assembly. Choose "Genes and Gene Predictions" in 'group' and one of the gene tracks (we recommend UCSC Genes, Ensembl, or RefSeq).
- Choose 'selected fields from primary and related tables' for 'output format'.
- Columns MUST be selected in this order:
- name
- chrom
- strand (+/-)
- txStart
- txEnd
- cdsStart
- cdsEnd
- exonCount
- exonStarts
- exonEnds
- score
- name2
- Download the file.
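Because the column order matters, it can be worth checking the downloaded file's header before running the extraction. The sketch below assumes the header is a single tab-separated line that may start with `#` and may carry table prefixes (for example `hg19.ensGene.name`); those format details and the helper name `check_header` are assumptions, so adjust them to match your file.

```python
REQUIRED_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
]

def check_header(header_line):
    """Return (ok, short_names) for a tab-separated header line.

    Assumes an optional leading '#' and optional 'db.table.' prefixes.
    """
    fields = header_line.lstrip("#").strip().split("\t")
    # Keep only the last dot-separated component of each field name.
    short_names = [field.split(".")[-1] for field in fields]
    return short_names == REQUIRED_COLUMNS, short_names

good = "#" + "\t".join("hg19.ensGene." + c for c in REQUIRED_COLUMNS)
bad = "#" + "\t".join(["chrom", "name"] + REQUIRED_COLUMNS[2:])
print(check_header(good)[0])
print(check_header(bad)[0])
```

The first header passes and the second fails because `name` and `chrom` are swapped.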
Run `extractTranscriptRegions -i <gene_file.txt> -o <output_prefix> [--ucsc|--gtf]`. The output is a set of BED files for UTRs, CDSs, exons, introns, splice sites, TSSs, and TESs that can be used with metagene-maker.
folder: the name of the sample (should also be the name of the folder where sample-specific intermediate files will be made)
bedGraphLoc: absolute path to bedgraph
stranded: + if plus only, - if minus only, 0 if no strand information. IMPORTANT if your regions are also strand specific.
pairName: If a bedgraph is stranded, it must be part of a pair of bedgraphs (one + and one -) that share the same pairName.
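The pairing rule for stranded bedgraphs is easy to get wrong when there are many samples. The following sketch checks it on sample entries held as Python dictionaries; it does not read the actual config file format (see `example.conf` for that), and the function name `check_strand_pairs` and the sample values are hypothetical.

```python
from collections import defaultdict

def check_strand_pairs(samples):
    """Return a list of problems with the stranded / pairName settings.

    Each sample is a dict with the keys 'folder', 'stranded' and 'pairName'.
    """
    problems = []
    strands_by_pair = defaultdict(list)
    for sample in samples:
        strand = sample["stranded"]
        if strand in ("+", "-"):
            if not sample.get("pairName"):
                problems.append("%s: stranded but no pairName" % sample["folder"])
            else:
                strands_by_pair[sample["pairName"]].append(strand)
        elif strand != "0":
            problems.append("%s: stranded must be +, - or 0" % sample["folder"])
    # Every pair must contain exactly one + and one - bedgraph.
    for pair_name, strands in sorted(strands_by_pair.items()):
        if sorted(strands) != ["+", "-"]:
            problems.append("%s: needs exactly one + and one - bedgraph" % pair_name)
    return problems

samples = [
    {"folder": "chip_plus", "stranded": "+", "pairName": "chip"},
    {"folder": "chip_minus", "stranded": "-", "pairName": "chip"},
    {"folder": "input", "stranded": "0", "pairName": ""},
]
print(check_strand_pairs(samples))  # []
```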
regionType: name of region
fileLoc: absolute path to file specifying the regions of interest
limitSize: y if only regions >200bp and <200kb should be considered; n if no limitation
numBins: number of bins for the central region. Use 1 to get the average coverage across the entire region. To make plots, anywhere between 100 and 500 is sufficient.
sideExtension: number of nucleotides to extend on each side of the provided regions. Default is 0.
sideNumBins: number of bins for each of the side extensions
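To see how the three binning settings interact, the sketch below computes the total number of bins per profile and the width of each side bin. It assumes the final profile is the upstream extension, the central region, and the downstream extension placed side by side; that layout and the function name `bin_layout` are assumptions rather than documented behavior.

```python
def bin_layout(numBins, sideExtension=0, sideNumBins=0):
    """Summarize the bins implied by numBins, sideExtension and sideNumBins."""
    if sideExtension == 0:
        # No extension means there are no side bins to report.
        sideNumBins = 0
    total_bins = numBins + 2 * sideNumBins
    # Side bins have a fixed width; central bins scale with each region's length.
    side_bin_width = float(sideExtension) / sideNumBins if sideNumBins else None
    return {"total_bins": total_bins, "side_bin_width_nt": side_bin_width}

layout = bin_layout(numBins=200, sideExtension=1000, sideNumBins=50)
print(layout["total_bins"], layout["side_bin_width_nt"])  # 300 20.0
```

With `numBins` set to 1 and no extension, the layout reduces to a single bin holding the average coverage of the whole region.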
output directory: the user-provided output directory; contains all files generated by this pipeline
averages: contains one file for each region type; each file is a tab-delimited table (it can be opened in Excel) with graphable metagenes for each sample
<sample>: a folder for each sample, named as described in the config file; contains intermediate files described below
bedGraphByChr: bedgraphs split by chromosome
bins: in this folder, there are subfolders for each region type, containing profiles for each instance of the region, intermediate RData files, and metagene plots