RHiCDB is an open-source R package, based on the HiCDB method, that detects contact domain boundaries (CDBs) from Hi-C contact matrices. The RHiCDB function takes a raw or normalized contact matrix and outputs consistency-annotated CDBs or differential CDBs. The visHiCDB function takes a raw or normalized contact matrix together with HiCDB results and visualizes CDBs on a single Hi-C map or differential CDBs on two Hi-C maps. HiCDB is also available as a MATLAB implementation.
Here are the general features of HiCDB.
Here are the general steps we use to detect CDBs.
Install RHiCDB with devtools
#install.packages("devtools")
devtools::install_github("ChenFengling/RHiCDB")
Or download RHiCDB_1.0.tar.gz and install it in R.
install.packages("RHiCDB_1.0.tar.gz", repos = NULL, type = "source")
RHiCDB depends on pracma, limma, Matrix, gridExtra, rasterVis and lattice.
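If the dependencies are not already present, they can be installed beforehand. The sketch below is one possible way to do it: limma is distributed through Bioconductor, the other packages through CRAN. The helper names `missing_pkgs` and `install_rhicdb_deps` are hypothetical and are not part of RHiCDB.

```r
# Packages RHiCDB depends on, split by repository.
cran_deps <- c("pracma", "Matrix", "gridExtra", "rasterVis", "lattice")
bioc_deps <- c("limma")  # limma is distributed through Bioconductor

# Hypothetical helper: return the packages that are not installed yet.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install whatever is missing from CRAN and Bioconductor.
install_rhicdb_deps <- function() {
  need_cran <- missing_pkgs(cran_deps)
  if (length(need_cran) > 0) install.packages(need_cran)
  need_bioc <- missing_pkgs(bioc_deps)
  if (length(need_bioc) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(need_bioc)
  }
  invisible(c(need_cran, need_bioc))
}
```

Calling `install_rhicdb_deps()` once before installing RHiCDB installs only the packages that are missing.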
Download the test data set from https://github.com/ChenFengling/HiCDB/raw/master/testdata.tar.gz.
Extract testdata.tar.gz; you will find the dense-format Hi-C data of hESC (Dixon et al.) in a directory named 'h1_rep1/'.
tar -zxvf testdata.tar.gz
library('RHiCDB')
hicfile='h1_rep1/'
resolution=40000
chrsizes='hg19'
outdir='h1_rep1/'
RHiCDB(hicfile,resolution,chrsizes,ref='hg19',outdir=outdir)
This takes the intra-chromosome matrices ('chr1.matrix',...,'chr23.matrix') in 'h1_rep1/' as input, sets the resolution to 40000, chrsizes to 'hg19' and the CTCF motif ref to 'hg19', and outputs the contact domain boundaries.
hicfile='h1_rep1/chr17.matrix'
resolution=40000
outdir='h1_rep1'
CDBfile='h1_rep1/CDB.txt'
chr=17
startloc=67100000
endloc=71100000
visHiCDB(hicfile,CDBfile,resolution,chr,startloc,endloc,outdir)
You will get this output in 17_67100000_71100000_HiCmap.pdf. Each dot is a detected CDB (dark blue: consistently detected CDBs; light blue: other CDBs).
Because "chrX" is named "chr23" in the input and appears as "23" in the output CDB.txt file, you can use the following shell code to convert CDB.txt into a .bed file.
awk -v OFS="\t" '{ print "chr"$1,$2,$3,$4,$5}' CDB.txt >CDB.bed
sed -i 's/chr23/chrX/g' CDB.bed
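The same conversion can also be sketched in R. The helper `cdb_to_bed` below is a hypothetical name, not an RHiCDB function; it assumes the first column of CDB.txt holds the chromosome number and keeps the first five columns, mirroring the awk command above.

```r
# Hypothetical helper: convert the first five CDB columns to BED-like rows,
# prefixing the chromosome number with "chr" and renaming chr23 to chrX.
# `cdb` is a data frame whose first column holds the chromosome number.
cdb_to_bed <- function(cdb) {
  bed <- cdb[, 1:5]
  chrom <- paste0("chr", bed[[1]])
  chrom[chrom == "chr23"] <- "chrX"
  bed[[1]] <- chrom
  bed
}

# Usage, assuming CDB.txt is whitespace-delimited without a header line:
# cdb <- read.table("CDB.txt", header = FALSE)
# write.table(cdb_to_bed(cdb), "CDB.bed", sep = "\t", quote = FALSE,
#             row.names = FALSE, col.names = FALSE)
```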
hicfile: The directory containing all intra-chromosome matrices of a sample. Each intra-chromosome matrix must be named "chr+number.matrix" following the chromosome order, for example 'chr1.matrix','chr2.matrix',...,'chr23.matrix'. Because HiCDB matches "chr*.matrix" to recognize the Hi-C matrices, avoid using names of the form "chr*.matrix" for any other files. The intra-chromosome matrix can be in dense (an NxN matrix) or sparse (a Kx3 table, Rao et al.) format. hicfile should be set to 'SAMPLE_DIR' when option is 'singlemap', and to list('SAMPLE_DIR1','SAMPLE_DIR2') or list(c('SAMPLE1_rep1','SAMPLE1_rep2'),c('SAMPLE2_rep1','SAMPLE2_rep2')) when option is 'comparemap'. This is required.
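Before running RHiCDB it can help to confirm that a sample directory follows this naming scheme. The sketch below uses a hypothetical helper, `check_matrix_files`, which is not part of RHiCDB; it lists the files named "chr<number>.matrix" and reports any chromosome numbers missing from the sequence 1 to the largest number found.

```r
# Hypothetical helper: list the "chr<number>.matrix" files in a sample
# directory and report chromosome numbers missing from the sequence 1..max.
check_matrix_files <- function(sample_dir) {
  files <- list.files(sample_dir, pattern = "^chr[0-9]+\\.matrix$")
  nums <- sort(as.integer(sub("^chr([0-9]+)\\.matrix$", "\\1", files)))
  expected <- if (length(nums) > 0) seq_len(max(nums)) else integer(0)
  list(found = nums, missing = setdiff(expected, nums))
}
```

For example, `check_matrix_files('h1_rep1/')$missing` should be empty when every chromosome matrix up to the highest-numbered one is present.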
Dense format contains the contact frequencies of the Hi-C NxN matrix.
Sparse format (Rao et al.) has three fields: i, j, and M_i,j. (i and j are written as the left edge of the bin at a given resolution; for example, at 10 kb resolution, the entry in the first row and tenth column of the matrix corresponds to M_i,j with i=0, j=90000.) As the Hi-C matrix is symmetric, only the upper triangle of the matrix is saved in sparse format. An example follows:
i | j | M_i,j |
---|---|---|
50000 | 50000 | 1.0 |
60000 | 60000 | 1.0 |
540000 | 560000 | 1.0 |
560000 | 560000 | 59.0 |
560000 | 570000 | 1.0 |
560000 | 600000 | 1.0 |
560000 | 700000 | 1.0 |
690000 | 710000 | 1.0 |
700000 | 710000 | 1.0 |
710000 | 710000 | 66.0 |
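To make the relation between the two formats concrete, the sketch below expands a sparse table into a dense symmetric matrix. The helper `sparse_to_dense` is a hypothetical name and is not part of RHiCDB, which accepts either format directly; it assumes `tab` is a data frame with the three columns described above.

```r
# Hypothetical helper: convert a sparse (i, j, M_ij) table into a dense
# symmetric matrix. `tab` is a data frame with three columns: left edge of
# bin i, left edge of bin j, and the contact value.
sparse_to_dense <- function(tab, resolution, n_bins = NULL) {
  i <- tab[[1]] / resolution + 1   # 0-based coordinate -> 1-based bin index
  j <- tab[[2]] / resolution + 1
  if (is.null(n_bins)) n_bins <- max(i, j)
  m <- matrix(0, nrow = n_bins, ncol = n_bins)
  m[cbind(i, j)] <- tab[[3]]
  m[cbind(j, i)] <- tab[[3]]       # mirror the upper triangle
  m
}
```

At 10 kb resolution, the row `540000 | 560000 | 1.0` fills entries (55, 57) and (57, 55) of the dense matrix.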
resolution: resolution of Hi-C matrix. This is required.
chrsizes: Ordered chromosome sizes of the genome. Accepted values are ‘hg19’, ‘hg38’, ‘mm9’, ‘mm10’ or a custom chromosome size file, which can be generated by following the instructions in annotation/README.md. This is required.
ref: Set ref when you want to derive the cutoff from CTCF motif enrichment or when the option is 'comparemap'. Accepted values are ‘hg19’, ‘hg38’, ‘mm9’, ‘mm10’ or a custom motif locus file, which can be generated by following the instructions in annotation/README.md. Only ‘hg19’ and ‘hg38’ can be annotated with conservation. To decide the cutoff in other organisms, users can use the motif of another insulator as a reference instead of CTCF. In our experience, a reliable way to choose the cutoff in other organisms is to inspect the CDBs on the Hi-C map under several cutoffs. Because HiCDB provides visualizations of Hi-C maps with annotated CDBs and works well over a broad parameter range, this is not too hard. The current cutoffs for 40 kb and 10 kb human samples are approximately the half and third quantile of the total local maximum peaks, respectively.
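When trying several cutoffs by hand, one simple approach is to keep a fixed fraction of the strongest local maximum peaks. The sketch below is only an illustration under stated assumptions: `top_fraction_peaks` is a hypothetical helper, it takes a plain numeric vector of peak scores (the column layout of the peak file is not assumed here), it treats higher scores as stronger boundaries, and it reads "half" and "third" as keeping roughly the top half or top third of peaks. It is not the cutoff procedure RHiCDB itself applies.

```r
# Hypothetical helper: return the indices of the top `frac` of peaks by score.
# `scores` is a numeric vector of peak scores (assumed: higher = stronger).
top_fraction_peaks <- function(scores, frac) {
  stopifnot(frac > 0, frac <= 1)
  n_keep <- ceiling(length(scores) * frac)
  order(scores, decreasing = TRUE)[seq_len(n_keep)]
}
```

For example, `top_fraction_peaks(scores, 1/2)` and `top_fraction_peaks(scores, 1/3)` give two candidate peak sets that can then be compared on the Hi-C map.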
RHiCDB('sample1/',10000,chrsizes='custom_chrsizes.txt')
RHiCDB('sample1/',10000,chrsizes='custom_chrsizes.txt',outdir='sample1/outputs/')
RHiCDB('sample1/',10000,chrsizes='hg19',ref='hg19')
RHiCDB('sample1/',10000,chrsizes='custom_chrsizes.txt',ref='custom_motiflocs.txt')
RHiCDB(list('sample1','sample2'),10000,'hg19',ref='hg19')
RHiCDB(list(c("sample1_rep1","sample1_rep2"),c("sample2_rep1","sample2_rep2")),10000,'hg19',ref='hg19')
1. CDB.txt:
chr | start | end | LRI | avgRI | conserve_or_not | consistent_or_differential |
---|---|---|---|---|---|---|
19 | 53100000 | 53140000 | 0.394707211 | 0.647392804 | 0 | 1 |
16 | 5060000 | 5100000 | 0.342727704 | 0.663101081 | 1 | 1 |
19 | 19620000 | 19660000 | 0.329837698 | 0.609237673 | 1 | 0 |
2. localmax.txt: all the local maximum peaks detected before the cutoff decision. Users can choose a custom CDB cutoff based on this file.
3. EScurve.png: CTCF motif enrichment on ranked local maximum peaks.
These output files are written to the custom output directory if one is given, or otherwise to the default directory, namely the directory of the first sample.
4. aRI.txt: average RI score for each genomic bin.
5. LRI.txt: LRI score for each genomic bin.
hicfile: Hi-C matrix of the chromosome of interest.
CDBfile: path to the file storing the CDB locations, or a list of two such files when comparing two Hi-C maps. The CDB files should be formatted like the output files of RHiCDB.
resolution: resolution of Hi-C map.
chr, startloc, endloc: the chromosome and the start and end coordinates of the region to display on the Hi-C map.
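To get a feel for how large a region is at a given resolution, the sketch below maps genomic coordinates to bin indices. `region_bins` is a hypothetical helper, and the 1-based bin convention is an assumption for illustration; it does not describe how visHiCDB handles coordinates internally.

```r
# Hypothetical helper: 1-based indices of the bins that overlap the region
# from startloc to endloc, where bin k covers
# [(k - 1) * resolution, k * resolution).
region_bins <- function(startloc, endloc, resolution) {
  first <- floor(startloc / resolution) + 1
  last <- ceiling(endloc / resolution)
  seq(first, last)
}
```

Under this convention, `length(region_bins(25000000, 31150000, 40000))` gives the number of 40 kb bins that the region spans.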
visHiCDB('sample1/chr18.matrix','CDB1.txt',40000,18,25000000,31150000)
visHiCDB(list('sample1/chr18.matrix','sample2/chr18.matrix'),list('CDB1.txt','CDB2.txt'),40000,18,25000000,31150000)
HiCmap.pdf: a PDF containing a figure that shows CDBs on a single Hi-C map, or the different kinds of CDBs between two Hi-C maps.
These output files are written to the custom output directory if one is given, or otherwise to the default directory, namely the directory of the first sample.