FaceBase Bioinformatics Pipeline v1.0
Provide a common bioinformatics pipeline for RNA-Seq and ChIP-Seq data that allows cross-comparison of, and integration with, datasets across spokes.
- Hub: Rob Schuler, Alejandro Bugocov, Paul Thomas, Huaiyu Mi, Carl Kesselman, Cris Williams
  - Responsible for FaceBase Hub data model, data exchange format, data export/import tools, data curation and quality control of dataset submissions.
- Visel Lab: Axel Visel, Matt Blow, Diane Dickel, Remo Monti, Iros Barozzi, Cailyn Spurrell
  - Responsible for implementing the pipeline using the ENCODE Uniform Processing Pipelines on the DNAnexus cloud hosted service and for operating the pipeline to (re)process FaceBase 2 RNA-Seq and ChIP-Seq datasets.
This pipeline adopts (without modification) the ENCODE Uniform Processing Pipelines on the DNAnexus cloud hosted service. The workflows and reference data may be found on DNAnexus. Identifiers for pipelines/workflows and reference data on DNAnexus are given in the form `project-NNNN:[workflow|file]-MMMMM`.
- RNA-seq (Single-end) processing pipeline
  - ENCODE pipeline ID: ENCPL002LSE
  - DNAnexus pipeline name: ENCODE RNA-Seq (Long) Pipeline - 1 (single-end) replicate
  - DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zP00VBPQg1zzG2596vF9
- RNA-seq (Paired-end) processing pipeline
  - ENCODE pipeline ID: ENCPL002LPE
  - DNAnexus pipeline name: ENCODE RNA-Seq (Long) Pipeline - 1 (paired-end) replicate
  - DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zJ80VBPVFxf1KKXbbq4f
- ChIP-seq processing pipeline
  - ENCODE pipeline ID: ENCPL841HGV
  - DNAnexus pipeline name: ENCODE histone ChIP-seq Unary Control Unreplicated (specify reference)
  - DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F7KQY800VBPxvJ6y0ZZzKpYG
- ENCODE Uniform Processing Pipeline - mm10
  - mm10_male_M4_ERCC_starIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BZGy9600P1JfFx5G0Zv7bzfj
  - mm10_no_alt.chrom.sizes: project-BKpvFg00VBPV975PgJ6Q03v6:file-Bv77qQ00Qy5JK0kVkVxJ83q3
  - mm10_male_M4_ERCC_rsemIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BX3bGBj0FkxFVqpKyVjx070v
- ENCODE Uniform Processing Pipeline - hg19
  - hg19_male_v19_ERCC_starIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BZGykk00Q72fkFvf54vB0Zzj
  - male.hg19.chrom.sizes: project-BKpvFg00VBPV975PgJ6Q03v6:file-BP4pGg00Qy57pYPx35BQ02f3
  - hg19_male_v19_ERCC_rsemIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BV59gF80Bg5Y6qyGyvY37fXg
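The identifiers listed above all share the `project-NNNN:[workflow|file]-MMMMM` shape. As a minimal sketch (not part of the FaceBase management scripts), such a string could be split into its parts as follows. The names `DxRef` and `parse_dx_ref` are hypothetical, and the assumption that both ID portions are purely alphanumeric is inferred from the examples in this document.

```python
import re
from typing import NamedTuple

# Assumption: both ID portions are alphanumeric, as in the identifiers listed above.
_DX_REF = re.compile(r"^(project-[0-9A-Za-z]+):((workflow|file)-[0-9A-Za-z]+)$")


class DxRef(NamedTuple):
    """Hypothetical container for the parts of a DNAnexus reference."""
    project: str    # e.g. "project-BKpvFg00VBPV975PgJ6Q03v6"
    object_id: str  # e.g. "workflow-F8B6zP00VBPQg1zzG2596vF9"
    kind: str       # "workflow" or "file"


def parse_dx_ref(text: str) -> DxRef:
    """Split 'project-NNNN:[workflow|file]-MMMMM' into its components."""
    match = _DX_REF.match(text.strip())
    if match is None:
        raise ValueError(f"not a project:workflow/file reference: {text!r}")
    return DxRef(match.group(1), match.group(2), match.group(3))
```

For example, `parse_dx_ref("project-BKpvFg00VBPV975PgJ6Q03v6:file-Bv77qQ00Qy5JK0kVkVxJ83q3").kind` evaluates to `"file"`.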
Along with the raw sequencing data (fastq files), the following metadata are required as input for executing the workflows.
For RNA-Seq data processing the following metadata are required.
Field | Description |
---|---|
Filename | Filename of the raw sequencing data |
URL | Source URL for the data |
Experiment | Experiment (identified by RID) to which the replicate and therefore the data belong. |
Replicate | Replicate (identified by RID) to which the sequencing file belongs. |
Species | Mus musculus / Homo sapiens |
Paired | Paired-end / Single-end |
Strandedness | Stranded / Non-stranded |
For ChIP-Seq data processing the following metadata are required.
Field | Description |
---|---|
Filename | Filename of the raw sequencing data |
URL | Source URL for the data |
Experiment | Experiment (identified by RID) to which the replicate and therefore the data belong. |
Replicate | Replicate (identified by RID) to which the sequencing file belongs. |
Species | Mus musculus / Homo sapiens |
Paired | Paired-end / Single-end |
Target | Control / Histone / Transcription factor |
Control | If the target is not Control then an experiment record id (RID) is used here to reference the control data to be used by the workflow. |
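The rule in the last row (a non-control target must reference a control experiment) is easy to overlook, so here is a minimal sketch of a pre-submission check. It assumes each metadata record is a dictionary keyed by the field names in the table above; the actual key names in the exported JSON are not given in this document, and `chipseq_problems` is a hypothetical helper.

```python
from typing import Dict, List

# Assumption: record keys match the table's field names exactly.
REQUIRED_FIELDS = ("Filename", "URL", "Experiment", "Replicate", "Species", "Paired", "Target")
SPECIES = {"Mus musculus", "Homo sapiens"}
TARGETS = {"Control", "Histone", "Transcription factor"}


def chipseq_problems(record: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)]

    species = record.get("Species")
    if species and species not in SPECIES:
        problems.append(f"unsupported Species: {species}")

    target = record.get("Target")
    if target and target not in TARGETS:
        problems.append(f"unsupported Target: {target}")

    # A non-control target must point at the control experiment by RID.
    if target in TARGETS and target != "Control" and not record.get("Control"):
        problems.append("Control: an experiment RID is required when Target is not Control")

    return problems
```

The check deliberately only requires that `Paired` is present, since the exact spelling of its values in the catalog is not confirmed here.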
The following metadata are built into the management scripts and therefore do not need to be provided from the FaceBase data catalog.
Field | Description |
---|---|
Reference genome | Mouse: mm10; Human: hg19 (then perform liftover in parallel to hg38) |
Reference gene set | Latest ENCODE reference release |
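To illustrate how these built-in defaults combine with the per-file metadata, the following sketch maps an assay, species, and pairing flag to a workflow identifier and reference genome. The identifiers are copied from this document; the function `select_workflow`, the assay labels used as keys, and the boolean `paired` argument are hypothetical and do not describe how the private management scripts are actually written.

```python
from typing import Tuple

PROJECT = "project-BKpvFg00VBPV975PgJ6Q03v6"

# Workflow IDs as listed in this document.
RNA_SEQ_WORKFLOWS = {
    False: f"{PROJECT}:workflow-F8B6zP00VBPQg1zzG2596vF9",  # single-end
    True: f"{PROJECT}:workflow-F8B6zJ80VBPVFxf1KKXbbq4f",   # paired-end
}
# Only one ChIP-seq workflow (histone, unary control, unreplicated) is listed.
CHIP_SEQ_WORKFLOW = f"{PROJECT}:workflow-F7KQY800VBPxvJ6y0ZZzKpYG"

REFERENCE_GENOME = {"Mus musculus": "mm10", "Homo sapiens": "hg19"}


def select_workflow(assay: str, species: str, paired: bool) -> Tuple[str, str]:
    """Return (workflow identifier, reference genome) for one sequencing file."""
    if species not in REFERENCE_GENOME:
        raise ValueError(f"unsupported species: {species}")
    if assay == "RNA-Seq":
        workflow = RNA_SEQ_WORKFLOWS[paired]
    elif assay == "ChIP-Seq":
        workflow = CHIP_SEQ_WORKFLOW
    else:
        raise ValueError(f"unsupported assay: {assay}")
    return workflow, REFERENCE_GENOME[species]
```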
While the workflows and reference data are used without modification, the FaceBase Hub uses additional scripts to manage the process of extraction, job submission, and publication of results. These scripts are versioned in the FaceBase repository: https://github.com/informatics-isi-edu/facebase-dnanexus (private).
Data are exchanged between the Hub and the Pipeline by staging data on a pipeline management node (using the pipeline management software referenced above).
Data are exported from the Hub in the BDBag semantic bundling format, following the instructions for bulk download of raw sequencing data. The pipeline accepts pairs of raw sequencing data files in `fastq` (gzipped) format, which may be single-end or paired-end and, for RNA-seq, stranded or non-stranded.
The data export is structured as:
    {bundledir}/
      {dataset_RID}/
        {dataset_RID}-RNA-Seq.json    # RNA-seq metadata
        {dataset_RID}-ChIP-Seq.json   # ChIP-seq metadata
        {replicate_RID}/seq/
          {filename}.fastq.gz...      # data
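As a rough sketch of how a consumer might read this layout, the function below groups fastq files by replicate using only the `{replicate_RID}/seq/{filename}.fastq.gz` tail of each path, so it does not depend on how deeply the replicate directories are nested. The name `group_fastqs_by_replicate` is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def group_fastqs_by_replicate(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map each replicate RID to the sorted fastq filenames found under its seq/ directory."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        # Expect .../{replicate_RID}/seq/{filename}.fastq.gz; skip everything else.
        if not path.name.endswith(".fastq.gz") or path.parent.name != "seq":
            continue
        replicate_rid = path.parent.parent.name
        if replicate_rid:
            groups[replicate_rid].append(path.name)
    return {rid: sorted(names) for rid, names in groups.items()}
```

On disk, the input could be produced with something like `[p.as_posix() for p in pathlib.Path(bundledir).rglob("*.fastq.gz")]`.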
For import of data from the pipeline back to the FaceBase data catalog, data are organized in the following structure and `deriva-upload[-cli]` is used to extract metadata and re-integrate the data into the data catalog.
    {bundledir}/
      {pipeline_RID}/
        {replicate_RID}/
          proc/
            {mapping_assembly}/
              {filename.ext}...
For data processed by the FaceBase Hub v1.0, the `pipeline_RID` must be `1-3YNY`, which is the stable record identifier for the protocol record in the data catalog. The `replicate_RID` must match the input data `replicate_RID`. The `mapping_assembly` must be a supported reference genome using the UCSC identifier format (e.g., `mm10`, `hg19`).
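These three constraints can be checked before upload. The sketch below validates a path relative to `{bundledir}`; it assumes output files sit directly inside the `{mapping_assembly}` directory (no deeper nesting) and that the supported assemblies are exactly the two named in this document. `import_path_problems` is a hypothetical helper, not part of `deriva-upload[-cli]`.

```python
from pathlib import PurePosixPath
from typing import Collection, List

PIPELINE_RID = "1-3YNY"                # protocol record for pipeline v1.0
SUPPORTED_ASSEMBLIES = {"mm10", "hg19"}  # assumption: only the genomes named in this document


def import_path_problems(relpath: str, input_replicates: Collection[str]) -> List[str]:
    """Check one path (relative to the bundle directory) against the import layout rules."""
    parts = PurePosixPath(relpath).parts
    if len(parts) != 5 or parts[2] != "proc":
        return ["path does not match {pipeline_RID}/{replicate_RID}/proc/{mapping_assembly}/{filename.ext}"]

    pipeline_rid, replicate_rid, _proc, assembly, _filename = parts
    problems = []
    if pipeline_rid != PIPELINE_RID:
        problems.append(f"pipeline_RID must be {PIPELINE_RID}, got {pipeline_rid}")
    if replicate_rid not in input_replicates:
        problems.append(f"replicate_RID {replicate_rid} does not match any input replicate")
    if assembly not in SUPPORTED_ASSEMBLIES:
        problems.append(f"unsupported mapping_assembly: {assembly}")
    return problems
```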
The following output files are ingested from the pipeline: `bam` (and `bai`), `count`, various `tsv`, `fastqc`, `broadPeak`, `gappedPeak`, and `narrowPeak` data files.
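A simple way to pick out ingestible files is to classify them by name. The sketch below assumes the file type appears as a dot-separated token in the filename (for example `sample.bam.bai` or `peaks.narrowPeak.gz`); this document does not state the actual naming scheme, so treat the matching rule and the helper `output_kind` as illustrative only.

```python
from typing import Optional

INGESTED_KINDS = {"bam", "bai", "count", "tsv", "fastqc", "broadPeak", "gappedPeak", "narrowPeak"}


def output_kind(filename: str) -> Optional[str]:
    """Return the ingested file type named in the filename, or None if there is none."""
    tokens = filename.split(".")[1:]  # drop the base name, keep the suffix tokens
    # Scan from the right so that 'x.bam.bai' is classified as 'bai', not 'bam'.
    for token in reversed(tokens):
        if token in INGESTED_KINDS:
            return token
    return None
```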
As of early 2019, the pipeline was used to process approximately 528 RNA-Seq experiments and 130 ChIP-Seq experiments.