From 6c8e64aee02ee74d369161f5784c15fe2f68ae05 Mon Sep 17 00:00:00 2001
From: js2264
Date: Thu, 19 Oct 2023 10:18:17 +0000
Subject: [PATCH] =?UTF-8?q?Deploying=20to=20bioc=5Fdevel=20from=20@=20js22?=
 =?UTF-8?q?64/OHCA@6b56ba4525afc3238e98ef3b7b76ee24becb4ef7=20=F0=9F=9A=80?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 data-representation.html | 42 +-
 disseminating.html | 8 +-
 index.html | 48 +-
 interactions-centric.html | 20 +-
 interoperability.html | 427 +++++++++---------
 matrix-centric.html | 34 +-
 .../figure-html/unnamed-chunk-10-1.png | Bin 604634 -> 604793 bytes
 .../figure-html/unnamed-chunk-13-1.png | Bin 69450 -> 69601 bytes
 .../figure-html/unnamed-chunk-16-1.png | Bin 777386 -> 777914 bytes
 parsing.html | 48 +-
 preamble.html | 16 +-
 principles.html | 148 ++++--
 search.json | 60 ++-
 sitemap.xml | 28 +-
 topological-features.html | 54 +--
 visualization.html | 12 +-
 .../figure-html/unnamed-chunk-10-1.png | Bin 363003 -> 363160 bytes
 .../figure-html/unnamed-chunk-11-1.png | Bin 384261 -> 384417 bytes
 .../figure-html/unnamed-chunk-14-1.png | Bin 363384 -> 363537 bytes
 .../figure-html/unnamed-chunk-17-1.png | Bin 48267 -> 48420 bytes
 .../figure-html/unnamed-chunk-4-1.png | Bin 350766 -> 350926 bytes
 .../figure-html/unnamed-chunk-5-1.png | Bin 197748 -> 197876 bytes
 .../figure-html/unnamed-chunk-7-1.png | Bin 392977 -> 393128 bytes
 .../figure-html/unnamed-chunk-8-1.png | Bin 1159778 -> 1159931 bytes
 .../figure-html/unnamed-chunk-8-2.png | Bin 988795 -> 988955 bytes
 .../figure-html/unnamed-chunk-9-1.png | Bin 269040 -> 269200 bytes
 workflow-chicken.html | 10 +-
 27 files changed, 509 insertions(+), 446 deletions(-)

diff --git a/data-representation.html b/data-representation.html
index 2086691..a8610ac 100644
--- data-representation.html
+++ data-representation.html
@@ -365,7 +365,7 @@

2.1 GRanges class

GRanges is a shorthand for GenomicRanges, a core class in Bioconductor. This class is primarily used to describe genomic ranges of any nature, e.g. sets of promoters, SNPs, or chromatin loop anchors.
-The data structure has been published in the seminal 2015 publication by the Bioconductor team (Huber et al. (2015)).

+The data structure has been published in the seminal 2015 publication by the Bioconductor team (Huber et al. (2015)).

2.1.1 GRanges fundamentals

The easiest way to generate a GRanges object is to coerce it from a vector of genomic coordinates in the UCSC format (e.g. "chr2:2004-4853"):

@@ -1099,7 +1099,7 @@

Note how close to a TSS the 8th peak is. It could be worth considering this as an overlap!

2.2 GInteractions class

-

GRanges describe genomic ranges and hence are of general use to study 1D genome organization. To study chromatin interactions, we need a way to link pairs of GRanges. This is exactly what the GInteractions class does. This data structure is defined in the InteractionSet package and has been published in the 2016 paper by Lun et al. (Lun, Perry, and Ing-Simmons (2016)).

+

GRanges describe genomic ranges and hence are of general use to study 1D genome organization. To study chromatin interactions, we need a way to link pairs of GRanges. This is exactly what the GInteractions class does. This data structure is defined in the InteractionSet package and has been published in the 2016 paper by Lun et al. (Lun et al. (2016)).

2.2.1 Building a GInteractions object from scratch

@@ -1597,7 +1597,7 @@

coolf
 ##                                                   EH7702 
-##  "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752"
+## "/github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752"

Similarly, example files are available for other file formats:

@@ -1667,7 +1667,7 @@

# ----- This creates a connection to a `.(m)cool` file (path stored in `coolf`)
 CoolFile(coolf)
 ##  CoolFile object
-##  .mcool file: /github/home/.cache/R/ExperimentHub/1a594277bd62_7752
+##  .mcool file: /github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752
 ##  resolution: 1000
 ##  pairs file:
 ##  metadata(0):
@@ -1675,7 +1675,7 @@

# ----- This creates a connection to a `.hic` file (path stored in `hicf`)
 HicFile(hicf)
 ##  HicFile object
-##  .hic file: /github/home/.cache/R/ExperimentHub/1a5939a379f0_7836
+##  .hic file: /github/home/.cache/R/ExperimentHub/1a9a270f71fe_7836
 ##  resolution: 1000
 ##  pairs file:
 ##  metadata(0):
@@ -1684,8 +1684,8 @@

HicproFile(hicpromatrixf, hicproregionsf)
 ##  HicproFile object
 ##  HiC-Pro files:
-##    $ matrix: /github/home/.cache/R/ExperimentHub/1a59dc812a9_7837
-##    $ regions: /github/home/.cache/R/ExperimentHub/1a591fa0216e_7838
+##    $ matrix: /github/home/.cache/R/ExperimentHub/1a9a6531ab2c_7837
+##    $ regions: /github/home/.cache/R/ExperimentHub/1a9a3c1fca84_7838
 ##  resolution: 1000
 ##  pairs file:
 ##  metadata(0):
@@ -1693,7 +1693,7 @@

# ----- This creates a connection to a pairs file
 PairsFile(pairsf)
 ##  PairsFile object
-##  resource: /github/home/.cache/R/ExperimentHub/1a594e4de0cf_7753

+## resource: /github/home/.cache/R/ExperimentHub/1a9a1c034d7_7753

2.3.3 ContactFile slots

@@ -1709,7 +1709,7 @@

cf <- CoolFile(coolf)
 cf
 ##  CoolFile object
-##  .mcool file: /github/home/.cache/R/ExperimentHub/1a594277bd62_7752 
+##  .mcool file: /github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752 
 ##  resolution: 1000 
 ##  pairs file: 
 ##  metadata(0):
@@ -1807,7 +1807,7 @@ 

hic
 ##  `HiCExperiment` object with 8,757,906 contacts over 12,079 regions
 ##  -------
-##  fileName: "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752"
+##  fileName: "/github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752"
 ##  focus: "whole genome"
 ##  resolutions(5): 1000 2000 4000 8000 16000
 ##  active resolution: 1000
@@ -1849,7 +1849,7 @@

These pieces of information are called slots. They can be directly accessed using getter functions, which bear the same name as the slot.

fileName(hic)
-##  [1] "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752"
+##  [1] "/github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752"
 
 focus(hic)
 ##  NULL
@@ -1928,7 +1928,7 @@ 

hic
 ##  `HiCExperiment` object with 13,681,280 contacts over 12,165 regions
 ##  -------
-##  fileName: "/github/home/.cache/R/ExperimentHub/1a5939a379f0_7836"
+##  fileName: "/github/home/.cache/R/ExperimentHub/1a9a270f71fe_7836"
 ##  focus: "whole genome"
 ##  resolutions(5): 1000 2000 4000 8000 16000
 ##  active resolution: 1000
@@ -2370,14 +2370,14 @@

yeast_hic
 ##  `HiCExperiment` object with 8,757,906 contacts over 763 regions 
 ##  -------
-##  fileName: "/github/home/.cache/R/ExperimentHub/1a594277bd62_7752" 
+##  fileName: "/github/home/.cache/R/ExperimentHub/1a9a4dc30249_7752" 
 ##  focus: "whole genome" 
 ##  resolutions(5): 1000 2000 4000 8000 16000
 ##  active resolution: 16000 
 ##  interactions: 267709 
 ##  scores(2): count balanced 
 ##  topologicalFeatures: compartments(0) borders(0) loops(0) viewpoints(0) centromeres(16) 
-##  pairsFile: /github/home/.cache/R/ExperimentHub/1a594e4de0cf_7753 
+##  pairsFile: /github/home/.cache/R/ExperimentHub/1a9a1c034d7_7753 
 ##  metadata(3): ID org date

@@ -2693,8 +2693,8 @@

pairsFile(yeast_hic) <- pairsf
 
 pairsFile(yeast_hic)
-##                                                   EH7703 
-##  "/github/home/.cache/R/ExperimentHub/1a594e4de0cf_7753"
+##                                                  EH7703 
+##  "/github/home/.cache/R/ExperimentHub/1a9a1c034d7_7753"
 
 readLines(pairsFile(yeast_hic), 25)
 ##   [1] "## pairs format v1.0"                                                             
@@ -2777,12 +2777,12 @@ 

References

-
+} @@ -291,6 +291,7 @@
  • 1.3.3 HiCool: hicstuff within R
  • +
  • 1.4 Exploratory data analysis of processed Hi-C files
  • References
  • @@ -328,14 +329,14 @@

    1.1 Experimental considerations

    1.1.1 Experimental approach

    -

    The Hi-C procedure (Lieberman-Aiden et al. (2009)) stems from the clever combination of high-throughput sequencing and Chromatin Conformation Capture (3C) experimental approach (Dekker et al. (2002)).
    +

    The Hi-C procedure (Lieberman-Aiden et al. (2009)) stems from the clever combination of high-throughput sequencing and the Chromosome Conformation Capture (3C) experimental approach (Dekker et al. (2002)).
    In Hi-C, chromatin is crosslinked within intact nuclei and enzymatically digested (usually with one or several restriction enzymes, but Hi-C variants using MNase or DNase exist). End-repair introduces biotinylated dNTPs and is followed by religation, which generates chimeric DNA fragments consisting of genomic loci originally lying in spatial proximity, usually crosslinked to a shared protein complex. After religation, DNA fragments are sheared, and biotin-containing fragments are pulled down and converted into a sequencing library.

    1.1.2 C variants

    -

    A number of C variants have been proposed since the publication of the original 3C method (reviewed by J. O. et al. (2017)), the main ones being Capture-C and ChIA-PET (see procedure below).

    +

    A number of C variants have been proposed since the publication of the original 3C method (reviewed by Davies et al. (2017)), the main ones being Capture-C and ChIA-PET (see procedure below).

    -

    Capture-C is useful to quantify interactions between a set of regulatory elements of interest. ChIA-PET, on the other hand, can identify interactions mediated by a specific protein of interest. Finally, an increasing number of Hi-C approaches rely on long-read sequencing (e.g. Deshpande et al. (2022), Tavares-Cadete et al. (2020)) to identify clusters of 3D contacts.

    +

    Capture-C is useful to quantify interactions between a set of regulatory elements of interest. ChIA-PET, on the other hand, can identify interactions mediated by a specific protein of interest. Finally, an increasing number of Hi-C approaches rely on long-read sequencing (e.g. Deshpande et al. (2022), Tavares-Cadete et al. (2020)) to identify clusters of 3D contacts.

    1.1.3 Sequencing

    Hi-C libraries are traditionally sequenced with short-read technology, and are by nature paired-end libraries. For this reason, the end result of the experimental side of Hi-C consists of two fastq files, each one containing sequences for one extremity of the DNA fragments purified during Hi-C. These are the two files we need to move on to the computational side of Hi-C.

    @@ -464,7 +465,7 @@

    -

    More information about the conventions related to this text file are provided by the 4DN consortium, which originally formalized the specifications of this file format.

    +

    More information about the conventions related to this text file is provided by the 4DN consortium, which originally formalized the specifications of this file format.
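
As an illustration of these conventions, here is a minimal base-R sketch that parses a few lines of `.pairs`-like text. The toy records and variable names are hypothetical; the column order (readID, chr1, pos1, chr2, pos2, strand1, strand2) is assumed from the 4DN specification, and real files may carry additional optional columns.

```r
# Toy .pairs content (hypothetical records, for illustration only)
pairs_txt <- c(
    "## pairs format v1.0",
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2",
    "read1\tII\t105\tII\t48548\t+\t-",
    "read2\tII\t113\tII\t45003\t-\t+",
    "read3\tII\t119\tIV\t687251\t+\t+"
)

# Header lines start with '#'; keep only the records
records <- pairs_txt[!startsWith(pairs_txt, "#")]

# Read the tab-separated records with explicit column names and types
pairs <- read.table(
    text = records, sep = "\t", comment.char = "", quote = "",
    col.names = c("readID", "chr1", "pos1", "chr2", "pos2", "strand1", "strand2"),
    colClasses = c("character", "character", "integer", "character",
                   "integer", "character", "character")
)

# Flag cis (same chromosome) pairs and compute their genomic distance
pairs$cis <- pairs$chr1 == pairs$chr2
pairs$distance <- ifelse(pairs$cis, abs(pairs$pos2 - pairs$pos1), NA_integer_)
pairs
```

Dropping the header before parsing keeps the sketch simple; a more careful reader would use the `#columns:` line to name the fields instead of hard-coding them.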

    1.2.2 Binned contact matrix files

    @@ -532,12 +533,12 @@

    In this context, the regions.bed acts as a secondary “dictionary” describing the nature of i and j indices, i.e. the location of genomic bins.
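
The dictionary lookup described above can be sketched in base R. The two toy tables below are hypothetical stand-ins for a regions file and a count matrix file (their column names are assumptions made for this example); matching the bin indices against the regions table recovers the genomic coordinates of each interaction.

```r
# Toy 'regions' table: one row per genomic bin, with its index (hypothetical values)
regions <- data.frame(
    chr = c("chr1", "chr1", "chr1", "chr2"),
    start = c(0L, 1000L, 2000L, 0L),
    end = c(1000L, 2000L, 3000L, 1000L),
    index = 1:4
)

# Toy sparse count matrix: (i, j) bin indices and the interaction count
counts <- data.frame(
    i = c(1L, 1L, 2L, 3L),
    j = c(1L, 3L, 4L, 3L),
    count = c(12L, 4L, 1L, 20L)
)

# Look up the coordinates of bins i and j in the regions 'dictionary'
first <- regions[match(counts$i, regions$index), c("chr", "start", "end")]
second <- regions[match(counts$j, regions$index), c("chr", "start", "end")]

# Assemble a human-readable table of interactions
interactions <- data.frame(
    bin1 = paste0(first$chr, ":", first$start, "-", first$end),
    bin2 = paste0(second$chr, ":", second$start, "-", second$end),
    count = counts$count
)
interactions
```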

    1.2.2.2 Plain-text matrices: HiC-Pro style

    -

    The HiC-Pro pipeline (Servant et al. (2015)) outputs 2 text files: a regions.bed file and a count.matrix file. They are generated by the exact process explained above.

    +

    The HiC-Pro pipeline (Servant et al. (2015)) outputs 2 text files: a regions.bed file and a count.matrix file. They are generated by the exact process explained above.

    Together, these two files can describe the interaction frequency between any pair of genomic loci. They are non-binarized text files, and as such are technically human-readable. However, it is relatively hard to get a grasp of these files compared to a plain .pairs file, as information regarding genomic bins and interaction frequencies is stored in separate files. Moreover, because they are non-binarized, these files often end up using a large amount of disk space and cannot be easily indexed. This prevents easy subsetting of the data stored in these files.

    .(m)cool and .hic file formats are two standards addressing these limitations.

    1.2.2.3 .(m)cool matrices

    -

    The .cool format has been formally defined in Abdennur and Mirny (2020) and is a particular type of HDF5 (Hierarchical Data Format) file. It is an indexed archive file storing rectangular tables called:

    +

    The .cool format has been formally defined in Abdennur & Mirny (2019) and is a particular type of HDF5 (Hierarchical Data Format) file. It is an indexed archive file storing rectangular tables called:

    • bins: containing the same information as the regions.bed file;
    • @@ -560,7 +561,7 @@

      Moreover, parsing .cool files is possible using HDF standard APIs.

    1.2.2.4 .hic matrices

    -

    The .hic format is another type of binarized, indexed and highly-compressed file (Durand et al. (2016)). It can store virtually the same information than a .cool file. However, parsing .hic files is not as straightforward as .cool files, as it does not rely on a generic file standard. Still, the straw library has been implemented in several computing languages to facilitate parsing of .hic files (Durand et al. (2016)).

    +

    The .hic format is another type of binarized, indexed and highly compressed file (Durand et al. (2016)). It can store virtually the same information as a .cool file. However, parsing .hic files is not as straightforward as parsing .cool files, as the format does not rely on a generic file standard. Still, the straw library has been implemented in several computing languages to facilitate parsing of .hic files (Durand et al. (2016)).

    1.3 Pre-processing Hi-C data

    @@ -575,7 +576,7 @@

  • Normalization of contact matrix and multi-resolution matrix generation
  • -

    In practice, a minimal workflow to pre-process Hi-C data is the following (adapted from Open2C et al. (2023)):

    +

    In practice, a minimal workflow to pre-process Hi-C data is the following (adapted from Open2C et al. (2023)):

    ## Note these fields have to be replaced by appropriate variables: 
     ##    <index>
    @@ -596,9 +597,9 @@ 

    nf-distiller: a combination of an aligner + pairtools + cooler
  • -HiC-pro (Servant et al. (2015))
  • +HiC-pro (Servant et al. (2015))
  • -Juicer (Durand et al. (2016))
  • +Juicer (Durand et al. (2016))
    @@ -610,7 +611,7 @@

    -

    For larger genomes (> 1Gb) with more than few tens of M of reads per fastq (e.g. > 100M), we recommend pre-processing data on an HPC cluster. Aligners, pairs processing and matrix binning can greatly benefit from parallelization over multiple CPUs (Open2C et al. (2023))).
    +

    For larger genomes (> 1 Gb) with more than a few tens of millions of reads per fastq file (e.g. > 100M), we recommend pre-processing data on an HPC cluster. Aligners, pairs processing and matrix binning can greatly benefit from parallelization over multiple CPUs (Open2C et al. (2023)).
    To scale up data pre-processing, we recommend relying on an efficient read mapper such as bwa, followed by pairs parsing, sorting and deduplication with pairtools and binning with cooler.

    @@ -661,10 +662,10 @@

    )
     ##  HiCool :: Fetching bowtie genome index files from AWS iGenomes S3 bucket...
     ##  HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
    -##  + /github/home/.cache/R/basilisk/1.13.1/0/bin/conda 'create' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.1/HiCool/1.1.0/env' 'python=3.7.12' '--quiet' '-c' 'conda-forge' '-c' 'bioconda'
    -##  + /github/home/.cache/R/basilisk/1.13.1/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.1/HiCool/1.1.0/env' 'python=3.7.12'
    -##  + /github/home/.cache/R/basilisk/1.13.1/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.1/HiCool/1.1.0/env' '-c' 'conda-forge' '-c' 'bioconda' 'python=3.7.12' 'python=3.7.12' 'bowtie2=2.5.0' 'samtools=1.16.1' 'hicstuff=3.1.5' 'chromosight=1.6.3' 'cooler=0.9.1'
    -##  HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpyLujmT/WL4DIE]...
    +##  + /github/home/.cache/R/basilisk/1.13.4/0/bin/conda 'create' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.4/HiCool/1.1.0/env' 'python=3.7.12' '--quiet' '-c' 'conda-forge' '-c' 'bioconda'
    +##  + /github/home/.cache/R/basilisk/1.13.4/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.4/HiCool/1.1.0/env' 'python=3.7.12'
    +##  + /github/home/.cache/R/basilisk/1.13.4/0/bin/conda 'install' '--yes' '--prefix' '/github/home/.cache/R/basilisk/1.13.4/HiCool/1.1.0/env' '-c' 'conda-forge' '-c' 'bioconda' 'python=3.7.12' 'python=3.7.12' 'bowtie2=2.5.0' 'samtools=1.16.1' 'hicstuff=3.1.5' 'chromosight=1.6.3' 'cooler=0.9.1'
    +##  HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpIWmk55/WL4DIE]...
     ##  HiCool :: Mapping fastq files...
     ##  HiCool :: Removing unwanted chromosomes...
     ##  HiCool :: Parsing pairs into .cool file...
    @@ -674,12 +675,12 @@

     ##  HiCool :: .fastq to .mcool processing done!
     ##  HiCool :: Check ./HiCool/folder to find the generated files
     ##  HiCool :: Generating HiCool report. This might take a while.
    -##  HiCool :: Report generated and available @ /__w/OHCA/OHCA/HiCool/148151d75a8_7833^mapped-R64-1-1^WL4DIE.html
    +##  HiCool :: Report generated and available @ /__w/OHCA/OHCA/HiCool/14976d56f7a_7833^mapped-R64-1-1^WL4DIE.html
     ##  HiCool :: All processing successfully achieved. Congrats!
     ##  CoolFile object
    -##  .mcool file: ./HiCool//matrices/148151d75a8_7833^mapped-R64-1-1^WL4DIE.mcool
    +##  .mcool file: ./HiCool//matrices/14976d56f7a_7833^mapped-R64-1-1^WL4DIE.mcool
     ##  resolution: 4000
    -##  pairs file: ./HiCool//pairs/148151d75a8_7833^mapped-R64-1-1^WL4DIE.pairs
    +##  pairs file: ./HiCool//pairs/14976d56f7a_7833^mapped-R64-1-1^WL4DIE.pairs
     ##  metadata(3): log args stats

    @@ -708,16 +709,16 @@

    fs::dir_tree('HiCool/')
     ##  HiCool/
    -##  ├── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.html
    +##  ├── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.html
     ##  ├── logs
    -##  │   └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.log
    +##  │   └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.log
     ##  ├── matrices
    -##  │   └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.mcool
    +##  │   └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.mcool
     ##  ├── pairs
    -##  │   └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE.pairs
    +##  │   └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE.pairs
     ##  └── plots
    -##      ├── 148151d75a8_7833^mapped-R64-1-1^WL4DIE_event_distance.pdf
    -##      └── 148151d75a8_7833^mapped-R64-1-1^WL4DIE_event_distribution.pdf
    +##      ├── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE_event_distance.pdf
    +##      └── 14976d56f7a_7833^mapped-R64-1-1^WL4DIE_event_distribution.pdf

    • The *.pairs and *.mcool files are the pairs and contact matrix files, respectively. These are the output files the end-user is generally looking for. @@ -739,36 +740,93 @@

      All the files generated by a single HiCool pipeline execution contain the same 6-letter unique hash to make sure they are not overwritten if re-executing the same command.

      +

    +1.4 Exploratory data analysis of processed Hi-C files

    +

    Once Hi-C raw data has been transformed into a set of processed files, exploratory data analysis is typically conducted following two main routes:

    +
      +
    • Data visualization;
    • +
    • Data investigation.
    • +
    +

    Over the last decade, a number of software tools have been developed for Hi-C data visualization and investigation. Here we provide a non-exhaustive list of notable tools developed in recent years for downstream Hi-C analysis, selected from this longer list.

    +
      +
    • +

      2012-2015:

      +
        +
      • HiTC (2012)
      • +
      • HiCCUPS (2014)
      • +
      • HiCseg (2014)
      • +
      • Fit-Hi-C (2014)
      • +
      • HiC-Pro (2015)
      • +
      • diffHic (2015)
      • +
      • cooltools (2015)
      • +
      • HiCUP (2015)
      • +
      • HiCPlotter (2015)
      • +
      • HiFive (2015)
      • +
      +
    • +
    • +

      2016-2019:

      +
        +
      • CHiCAGO (2016)
      • +
      • TADbit (2017)
      • +
      • HiCRep (2017)
      • +
      • HiC-DC (2017)
      • +
      • GoTHIC (2017)
      • +
      • HiCExplorer (2018)
      • +
      • Boost-HiC (2018)
      • +
      • HiCcompare (2018)
      • +
      • HiPiler (2018)
      • +
      • coolpuppy (2019)
      • +
      +
    • +
    • +

      2020-present:

      +
        +
      • Serpentine (2020)
      • +
      • CHESS (2020)
      • +
      • DeepHiC (2020)
      • +
      • Chromosight (2020)
      • +
      • Mustache (2020)
      • +
      • TADcompare (2020)
      • +
      • POSSUM (2021)
      • +
      • Calder (2021)
      • +
      • HICDCPlus (2021)
      • +
      • plotgardener (2021)
      • +
      • GENOVA (2021)
      • +
      +
    • +
    +

    References for these tools, as well as for many other software tools, are available here.

    -

    References

    -

    References

    +