diff --git a/data-representation.html b/data-representation.html
index 2086691..a8610ac 100644
--- a/data-representation.html
+++ b/data-representation.html
@@ -365,7 +365,7 @@
GRanges class
GRanges is a shorthand for GenomicRanges, a core class in Bioconductor. This class is primarily used to describe genomic ranges of any nature, e.g. sets of promoters, SNPs, chromatin loop anchors, ….
-The data structure has been published in the seminal 2015 publication by the Bioconductor team (Huber et al. (2015)).
Bioconductor team (Huber et al. (2015)).
GRanges fundamentals
The easiest way to generate a GRanges object is to coerce it from a vector of genomic coordinates in the UCSC format (e.g. "chr2:2004-4853"):
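As a minimal sketch of this coercion, assuming the GenomicRanges package is installed (the coordinates below are made up for illustration and are not taken from the book's dataset):

```r
library(GenomicRanges)

# Hypothetical coordinates in UCSC format ("chr:start-end"), for illustration only
coords <- c("chr2:2004-4853", "chr2:5000-6000", "chr3:100-200")

# Coerce the character vector into a GRanges object
gr <- as(coords, "GRanges")

# Inspect the result with standard accessor functions
seqnames(gr)
start(gr)
end(gr)
width(gr)
```

Each element of the character vector becomes one range, so the resulting object has the same length as the input vector.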
Note how close to a TSS the 8th peak was. It could be worth considering this as an overlap!
GInteractions class
-GRanges describe genomic ranges and hence are of general use to study 1D genome organization. To study chromatin interactions, we need a way to link pairs of GRanges. This is exactly what the GInteractions class does. This data structure is defined in the InteractionSet package and has been published in the 2016 paper by Lun et al. (Lun, Perry, and Ing-Simmons (2016)).
+GRanges describe genomic ranges and hence are of general use to study 1D genome organization. To study chromatin interactions, we need a way to link pairs of GRanges. This is exactly what the GInteractions class does. This data structure is defined in the InteractionSet package and has been published in the 2016 paper by Lun et al. (Lun et al. (2016)).
GInteractions object from scratch
coolf
## EH7702
-## "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752"
Similarly, example files are available for other file formats:
ContactFile slots
cf <- CoolFile(coolf)
cf
## CoolFile object
-## .mcool file: /github/home/.cache/R/ExperimentHub/1a594277bd62_7752
+## .mcool file: /github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752
## resolution: 1000
## pairs file:
## metadata(0):
@@ -1807,7 +1807,7 @@
hic
## `HiCExperiment` object with 8,757,906 contacts over 12,079 regions
## -------
-## fileName: "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752"
+## fileName: "/github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752"
## focus: "whole genome"
## resolutions(5): 1000 2000 4000 8000 16000
## active resolution: 1000
@@ -1849,7 +1849,7 @@
These pieces of information are called slots. They can be directly accessed using getter functions, bearing the same name as the slot.
fileName(hic)
-## [1] "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752"
+## [1] "/github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752"
focus(hic)
## NULL
@@ -1928,7 +1928,7 @@
hic
## `HiCExperiment` object with 13,681,280 contacts over 12,165 regions
## -------
-## fileName: "/github/home/.cache/R/ExperimentHub/1a5939a379f0_7836"
+## fileName: "/github/home/.cache/R/ExperimentHub/1a9a270f71fe_7836"
## focus: "whole genome"
## resolutions(5): 1000 2000 4000 8000 16000
## active resolution: 1000
@@ -2370,14 +2370,14 @@
yeast_hic
## `HiCExperiment` object with 8,757,906 contacts over 763 regions
## -------
-## fileName: "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752"
+## fileName: "/github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752"
## focus: "whole genome"
## resolutions(5): 1000 2000 4000 8000 16000
## active resolution: 16000
## interactions: 267709
## scores(2): count balanced
## topologicalFeatures: compartments(0) borders(0) loops(0) viewpoints(0) centromeres(16)
-## pairsFile: /github/home/.cache/R/ExperimentHub/1a594e4de0cf_7753
+## pairsFile: /github/home/.cache/R/ExperimentHub/1a9a1c034d7_7753
## metadata(3): ID org date
@@ -2693,8 +2693,8 @@
-More information about the conventions related to this text file are provided by the 4DN consortium, which originally formalized the specifications of this file format.
+More information about the conventions related to this text file are provided by the 4DN consortium, which originally formalized the specifications of this file format.
1.2.2 Binned contact matrix files
@@ -532,12 +533,12 @@
In this context, the regions.bed acts as a secondary “dictionary” describing the nature of i and j indices, i.e. the location of genomic bins.
1.2.2.2 Plain-text matrices: HiC-Pro style
-The HiC-Pro pipeline (Servant et al. (2015)) outputs 2 text files: a regions.bed file and a count.matrix file. They are generated by the exact process explained above.
+The HiC-Pro pipeline (Servant et al. (2015)) outputs 2 text files: a regions.bed file and a count.matrix file. They are generated by the exact process explained above.
Together, these two files can describe the interaction frequency between any pair of genomic loci. They are non-binarized text files, and as such are technically human-readable. However, it is relatively hard to get a grasp of these files compared to a plain .pairs file, as information regarding genomic bins and interaction frequencies is stored in separate files. Moreover, because they are non-binarized, these files often end up using a large amount of disk space and cannot be easily indexed. This prevents easy subsetting of the data stored in these files.
The .(m)cool and .hic file formats are two standards addressing these limitations.
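To make the "dictionary" role of regions.bed more concrete, here is a small base R sketch. It assumes a HiC-Pro-like layout in which regions.bed holds one bin per line (chromosome, start, end, bin index) and count.matrix holds triplets (i, j, count). The toy values and the column names are hypothetical and chosen only for illustration.

```r
# Toy regions.bed: one genomic bin per line (chrom, start, end, bin index).
# The values and column names are assumptions for illustration.
regions <- read.table(
    text = "chr1\t0\t1000\t1
chr1\t1000\t2000\t2
chr1\t2000\t3000\t3",
    sep = "\t", col.names = c("chrom", "start", "end", "id")
)

# Toy count.matrix: bin index i, bin index j, interaction count
counts <- read.table(
    text = "1\t1\t12
1\t2\t5
2\t3\t7",
    sep = "\t", col.names = c("i", "j", "count")
)

# Look up the genomic location of each i and j index in the regions table
first <- regions[match(counts$i, regions$id), c("chrom", "start", "end")]
second <- regions[match(counts$j, regions$id), c("chrom", "start", "end")]
names(first) <- paste0(names(first), "1")
names(second) <- paste0(names(second), "2")

# One row per pair of bins, with explicit coordinates and the count
contacts <- cbind(first, second, count = counts$count)
rownames(contacts) <- NULL
contacts
```

Neither file is interpretable on its own: the counts only refer to bin indices, and the coordinates only become available after the lookup in the regions table.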
1.2.2.3 .(m)cool matrices
-The .cool format has been formally defined in Abdennur and Mirny (2020) and is a particular type of HDF5 (Hierarchical Data Format) file. It is an indexed archive file storing rectangular tables called:
+The .cool format has been formally defined in Abdennur & Mirny (2019) and is a particular type of HDF5 (Hierarchical Data Format) file. It is an indexed archive file storing rectangular tables called:
- bins: containing the same information as the regions.bed file;
@@ -560,7 +561,7 @@
Moreover, parsing .cool files is possible using HDF standard APIs.
1.2.2.4 .hic matrices
-The .hic format is another type of binarized, indexed and highly-compressed file (Durand et al. (2016)). It can store virtually the same information than a .cool file. However, parsing .hic files is not as straightforward as .cool files, as it does not rely on a generic file standard. Still, the straw library has been implemented in several computing languages to facilitate parsing of .hic files (Durand et al. (2016)).
+The .hic format is another type of binarized, indexed and highly-compressed file (Durand et al. (2016)). It can store virtually the same information than a .cool file. However, parsing .hic files is not as straightforward as .cool files, as it does not rely on a generic file standard. Still, the straw library has been implemented in several computing languages to facilitate parsing of .hic files (Durand et al. (2016)).
1.3 Pre-processing Hi-C data
@@ -575,7 +576,7 @@
Normalization of contact matrix and multi-resolution matrix generation
-In practice, a minimal workflow to pre-process Hi-C data is the following (adapted from Open2C et al. (2023)):
+In practice, a minimal workflow to pre-process Hi-C data is the following (adapted from Open2C et al. (2023)):
## Note these fields have to be replaced by appropriate variables:
## <index>
@@ -596,9 +597,9 @@
nf-distiller: a combination of an aligner + pairtools + cooler
-HiC-pro (Servant et al. (2015))
+HiC-pro (Servant et al. (2015))
-Juicer (Durand et al. (2016))
+Juicer (Durand et al. (2016))
@@ -610,7 +611,7 @@
-For larger genomes (> 1Gb) with more than few tens of M of reads per fastq (e.g. > 100M), we recommend pre-processing data on an HPC cluster. Aligners, pairs processing and matrix binning can greatly benefit from parallelization over multiple CPUs (Open2C et al. (2023))).
+For larger genomes (> 1Gb) with more than few tens of M of reads per fastq (e.g. > 100M), we recommend pre-processing data on an HPC cluster. Aligners, pairs processing and matrix binning can greatly benefit from parallelization over multiple CPUs (Open2C et al. (2023))).
To scale up data pre-processing, we recommend relying on an efficient read mapper such as bwa, followed by pairs parsing, sorting and deduplication with pairtools and binning with cooler.
@@ -661,10 +662,10 @@
)
## HiCool :: Fetching bowtie genome index files from AWS iGenomes S3 bucket...
## HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
-## + /github/home/.cache/R/basilisk/1.13.1/0/bin/conda 'create' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.1/HiCool/1.1.0/env' 'python=3.7.12' '--quiet' '-c' 'conda-forge' '-c' 'bioconda'
-## + /github/home/.cache/R/basilisk/1.13.1/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.1/HiCool/1.1.0/env' 'python=3.7.12'
-## + /github/home/.cache/R/basilisk/1.13.1/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.1/HiCool/1.1.0/env' '-c' 'conda-forge' '-c' 'bioconda' 'python=3.7.12' 'python=3.7.12' 'bowtie2=2.5.0' 'samtools=1.16.1' 'hicstuff=3.1.5' 'chromosight=1.6.3' 'cooler=0.9.1'
-## HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpyLujmT/WL4DIE]...
+## + /github/home/.cache/R/basilisk/1.13.4/0/bin/conda 'create' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.4/HiCool/1.1.0/env' 'python=3.7.12' '--quiet' '-c' 'conda-forge' '-c' 'bioconda'
+## + /github/home/.cache/R/basilisk/1.13.4/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.4/HiCool/1.1.0/env' 'python=3.7.12'
+## + /github/home/.cache/R/basilisk/1.13.4/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.4/HiCool/1.1.0/env' '-c' 'conda-forge' '-c' 'bioconda' 'python=3.7.12' 'python=3.7.12' 'bowtie2=2.5.0' 'samtools=1.16.1' 'hicstuff=3.1.5' 'chromosight=1.6.3' 'cooler=0.9.1'
+## HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpIWmk55/WL4DIE]...
## HiCool :: Mapping fastq files...
## HiCool :: Removing unwanted chromosomes...
## HiCool :: Parsing pairs into .cool file...
@@ -674,12 +675,12 @@
## HiCool :: .fastq to .mcool processing done!
## HiCool :: Check ./HiCool/folder to find the generated files
## HiCool :: Generating HiCool report. This might take a while.
-## HiCool :: Report generated and available @ /__w/OHCA/OHCA/HiCool/148151d75a8_7833^mapped-R64-1-1^WL4DIE.html
+## HiCool :: Report generated and available @ /__w/OHCA/OHCA/HiCool/14976d56f7a_7833^mapped-R64-1-1^WL4DIE.html
## HiCool :: All processing successfully achieved. Congrats!
## CoolFile object
-## .mcool file: ./HiCool//matrices/148151d75a8_7833^mapped-R64-1-1^WL4DIE.mcool
+## .mcool file: ./HiCool//matrices/14976d56f7a_7833^mapped-R64-1-1^WL4DIE.mcool
## resolution: 4000
-## pairs file: ./HiCool//pairs/148151d75a8_7833^mapped-R64-1-1^WL4DIE.pairs
+## pairs file: ./HiCool//pairs/14976d56f7a_7833^mapped-R64-1-1^WL4DIE.pairs
## metadata(3): log args stats
@@ -708,16 +709,16 @@
fs::dir_tree('HiCool/')
## HiCool/
-## ├── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.html
+## ├── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.html
## ├── logs
-## │ └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.log
+## │ └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.log
## ├── matrices
-## │ └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.mcool
+## │ └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.mcool
## ├── pairs
-## │ └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.pairs
+## │ └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.pairs
## └── plots
-## ├── 148151d75a8_7833^mapped-R64-1-1^WL4DIE_event_distance.pdf
-## └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE_event_distribution.pdf
+## ├── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE_event_distance.pdf
+## └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE_event_distribution.pdf
- The *.pairs and *.mcool files are the pairs and contact matrix files, respectively. These are the output files the end-user is generally looking for.
@@ -739,36 +740,93 @@
All the files generated by a single HiCool pipeline execution contain the same 6-letter unique hash to make sure they are not overwritten if re-executing the same command.
Once Hi-C raw data has been transformed into a set of processed files, exploratory data analysis is typically conducted following two main routes:
+During the last decade, a number of softwares have been developed to unlock Hi-C data visualization and investigation. Here we provide a non-exhaustive list of notable tools developed throughout the recent years for downstream Hi-C analysis, selected from this longer list.
+2012-2015:
+2016-2019:
+2020-present:
+All references as well as many other softwares and references are available here.
-