FaceBase Bioinformatics Pipeline v1.0

Overview

Provide a common bioinformatics pipeline for RNA-Seq and ChIP-Seq data that allows cross-comparison of, and integration with, datasets across spokes.

Contributors

Hub: Rob Schuler, Alejandro Bugocov, Paul Thomas, Huaiyu Mi, Carl Kesselman, Cris Williams
- Responsible for FaceBase Hub data model, data exchange format, data export/import tools, data curation and quality control of dataset submissions.
Visel Lab: Axel Visel, Matt Blow, Diane Dickel, Remo Monti, Iros Barozzi, Cailyn Spurrell
- Responsible for implementing the pipeline using the ENCODE Uniform Processing Pipelines on the DNAnexus cloud hosted service and for operating the pipeline to (re)process FaceBase 2 RNA-Seq and ChIP-Seq datasets.

Pipelines and Reference Data

This pipeline adopts (without modification) the ENCODE Uniform Processing Pipelines on the DNAnexus cloud hosted service. The workflows and reference data may be found on DNAnexus. Identifiers for pipelines/workflows and reference data on DNAnexus are given in the form project-NNNN:[workflow|file]-MMMMM.
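
As an illustration of the identifier form described above, the following minimal Python sketch splits a qualified identifier into its project and object parts. The helper name `parse_dx_ref` and the `DxRef` type are hypothetical, and the pattern assumes that the ID portions are alphanumeric, as they are in the identifiers listed on this page; it is not part of the FaceBase management scripts.

```python
import re
from typing import NamedTuple

# Pattern for the "project-NNNN:[workflow|file]-MMMMM" form.
# Assumption: ID portions are alphanumeric, as in the identifiers on this page.
_DX_REF = re.compile(r"^(project-[0-9A-Za-z]+):((workflow|file)-[0-9A-Za-z]+)$")


class DxRef(NamedTuple):
    project: str    # e.g. "project-BKpv..."
    object_id: str  # e.g. "workflow-F8B6..." or "file-Bv77..."
    kind: str       # "workflow" or "file"


def parse_dx_ref(ref: str) -> DxRef:
    """Split a qualified identifier into project, object ID, and kind."""
    match = _DX_REF.match(ref.strip())
    if match is None:
        raise ValueError(f"not a project:object identifier: {ref!r}")
    return DxRef(project=match.group(1), object_id=match.group(2), kind=match.group(3))
```

For example, `parse_dx_ref("project-BKpvFg00VBPV975PgJ6Q03v6:file-Bv77qQ00Qy5JK0kVkVxJ83q3").kind` evaluates to `"file"`.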

Data Processing Pipelines

RNA-seq (Single-end) processing pipeline
- ENCODE pipeline ID: ENCPL002LSE
- DNAnexus pipeline name: ENCODE RNA-Seq (Long) Pipeline - 1 (single-end) replicate
- DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zP00VBPQg1zzG2596vF9
RNA-seq (Paired-end) processing pipeline
- ENCODE pipeline ID: ENCPL002LPE
- DNAnexus pipeline name: ENCODE RNA-Seq (Long) Pipeline - 1 (paired-end) replicate
- DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zJ80VBPVFxf1KKXbbq4f
ChIP-seq processing pipeline
- ENCODE pipeline ID: ENCPL841HGV
- DNAnexus pipeline name: ENCODE histone ChIP-seq Unary Control Unreplicated (specify reference)
- DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F7KQY800VBPxvJ6y0ZZzKpYG

Reference Data

ENCODE Uniform Processing Pipeline - mm10
- mm10_male_M4_ERCC_starIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BZGy9600P1JfFx5G0Zv7bzfj
- mm10_no_alt.chrom.sizes: project-BKpvFg00VBPV975PgJ6Q03v6:file-Bv77qQ00Qy5JK0kVkVxJ83q3
- mm10_male_M4_ERCC_rsemIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BX3bGBj0FkxFVqpKyVjx070v
ENCODE Uniform Processing Pipeline - hg19
- hg19_male_v19_ERCC_starIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BZGykk00Q72fkFvf54vB0Zzj
- male.hg19.chrom.sizes: project-BKpvFg00VBPV975PgJ6Q03v6:file-BP4pGg00Qy57pYPx35BQ02f3
- hg19_male_v19_ERCC_rsemIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BV59gF80Bg5Y6qyGyvY37fXg

Metadata Requirements

Along with the raw sequencing data (fastq files), the following metadata are required as input for executing the workflows.

RNA-Seq Metadata Requirements

For RNA-Seq data processing the following metadata are required.

Field	Description
Filename	Filename of the raw sequencing data
URL	Source URL for the data
Experiment	Experiment (identified by RID) to which the replicate and therefore the data belong.
Replicate	Replicate (identified by RID) to which the sequencing file belongs.
Species	Mus musculus / Homo sapiens
Paired	Paired-end / Single-end
Strandedness	Stranded / Non-stranded
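
The Paired field determines which of the two RNA-seq workflows listed under Data Processing Pipelines applies. A minimal Python sketch of that selection follows. The function names are hypothetical, the accepted spellings of the field value are an assumption, and the actual management scripts (which are private) may make this choice differently.

```python
PROJECT = "project-BKpvFg00VBPV975PgJ6Q03v6"

# Workflow IDs copied from the "Data Processing Pipelines" section.
RNA_SEQ_WORKFLOWS = {
    False: f"{PROJECT}:workflow-F8B6zP00VBPQg1zzG2596vF9",  # single-end
    True: f"{PROJECT}:workflow-F8B6zJ80VBPVFxf1KKXbbq4f",   # paired-end
}


def is_paired(value: str) -> bool:
    """Interpret the 'Paired' field; hyphen and spelling variants are tolerated."""
    key = value.strip().lower().replace("-", " ")
    if key in {"paired end", "pair end"}:
        return True
    if key == "single end":
        return False
    raise ValueError(f"unrecognized Paired value: {value!r}")


def rna_seq_workflow(paired_value: str) -> str:
    """Return the qualified DNAnexus workflow ID for an RNA-seq replicate."""
    return RNA_SEQ_WORKFLOWS[is_paired(paired_value)]
```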

ChIP-Seq Metadata Requirements

For ChIP-Seq data processing the following metadata are required.

Field	Description
Filename	Filename of the raw sequencing data
URL	Source URL for the data
Experiment	Experiment (identified by RID) to which the replicate and therefore the data belong.
Replicate	Replicate (identified by RID) to which the sequencing file belongs.
Species	Mus musculus / Homo sapiens
Paired	Paired-end / Single-end
Target	Control / Histone / Transcription factor
Control	If the target is not `Control` then an experiment record id (RID) is used here to reference the control data to be used by the workflow.
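
The rule for the Control field can be checked before a job is submitted. The sketch below is a hedged illustration: it assumes each record is a Python dict keyed by the field names in the table above, which may not match the actual layout of the exported metadata files, and the function name is hypothetical.

```python
# Fields that every ChIP-Seq record must carry (Control is conditional).
REQUIRED_CHIP_FIELDS = (
    "Filename", "URL", "Experiment", "Replicate", "Species", "Paired", "Target",
)


def check_chip_seq_record(record: dict) -> list:
    """Return a list of problems found in one ChIP-Seq metadata record."""
    problems = []
    for field in REQUIRED_CHIP_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")
    # A non-control target must reference the experiment RID of its control.
    target = record.get("Target")
    if target and target != "Control" and not record.get("Control"):
        problems.append("non-control target requires a Control experiment RID")
    return problems
```

An empty list means the record satisfies the requirements in the table.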

Reference Metadata

The following metadata are built into the management scripts and therefore do not need to be provided from the FaceBase data catalog.

Field	Description
Reference genome	Mouse: mm10; Human: hg19 (with liftover to hg38 performed in parallel)
Reference gene set	Latest ENCODE reference release
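
Because the reference genome follows from the species, the reference files listed under Reference Data can be looked up from the Species field alone. The following Python sketch shows one way to express that lookup. The table layout, key names, and function name are assumptions made for illustration; the management scripts may organize this differently.

```python
PROJECT = "project-BKpvFg00VBPV975PgJ6Q03v6"

# File IDs copied from the "Reference Data" section; key names are illustrative.
REFERENCES = {
    "Mus musculus": {
        "assembly": "mm10",
        "star_index": f"{PROJECT}:file-BZGy9600P1JfFx5G0Zv7bzfj",
        "rsem_index": f"{PROJECT}:file-BX3bGBj0FkxFVqpKyVjx070v",
        "chrom_sizes": f"{PROJECT}:file-Bv77qQ00Qy5JK0kVkVxJ83q3",
    },
    "Homo sapiens": {
        "assembly": "hg19",
        "star_index": f"{PROJECT}:file-BZGykk00Q72fkFvf54vB0Zzj",
        "rsem_index": f"{PROJECT}:file-BV59gF80Bg5Y6qyGyvY37fXg",
        "chrom_sizes": f"{PROJECT}:file-BP4pGg00Qy57pYPx35BQ02f3",
    },
}


def reference_for(species: str) -> dict:
    """Return the assembly name and reference file IDs for a species."""
    try:
        return REFERENCES[species.strip()]
    except KeyError:
        raise ValueError(f"unsupported species: {species!r}") from None
```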

Pipeline Management

While the workflows and reference data are used without modification, the FaceBase Hub uses additional scripts to manage the process of extraction, job submission, and publication of results. These scripts are versioned in the FaceBase repository: https://github.com/informatics-isi-edu/facebase-dnanexus (private).

Data Exchange

Data are exchanged between the Hub and the Pipeline by staging data on a pipeline management node (using the pipeline management software referenced above).

Export to Pipeline

Data are exported from the Hub in a semantic bundling BDBag format, following the instructions for bulk download of raw sequencing data. The pipeline accepts pairs of raw sequencing data files in gzipped fastq format, which may be single-end or paired-end and, for RNA-seq, stranded or non-stranded.

The data export is structured as:

{bundledir}/
            {dataset_RID}/
                          {dataset_RID}-RNA-Seq.json   # RNA-seq metadata
                          {dataset_RID}-ChIP-Seq.json  # ChIP-seq metadata
                          {replicate_RID}/seq/
                                              {filename}.fastq.gz...  # data
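
Given the export layout above, the fastq files for each replicate can be enumerated with a single glob. This Python sketch is an illustration only: the function name is hypothetical, and it assumes the bundle has already been unpacked so that `bundledir` directly contains the `{dataset_RID}` directories.

```python
from collections import defaultdict
from pathlib import Path


def list_export_fastqs(bundledir) -> dict:
    """Map (dataset_RID, replicate_RID) to the sorted fastq.gz file names found."""
    found = defaultdict(list)
    # Layout: {bundledir}/{dataset_RID}/{replicate_RID}/seq/{filename}.fastq.gz
    for path in Path(bundledir).glob("*/*/seq/*.fastq.gz"):
        dataset_rid, replicate_rid = path.parts[-4], path.parts[-3]
        found[(dataset_rid, replicate_rid)].append(path.name)
    return {key: sorted(names) for key, names in found.items()}
```

The per-dataset JSON metadata files sit one level above the replicate directories, so the glob does not match them.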

Import from Pipeline

For import of data from the pipeline back to the FaceBase data catalog, data are organized in the following structure, and deriva-upload[-cli] is used to extract metadata and re-integrate the data into the data catalog.

{bundledir}/
            {pipeline_RID}/
                           {replicate_RID}/
                                           proc/
                                                {mapping_assembly}/
                                                                   {filename.ext}...

For data processed by the FaceBase Hub v1.0, the pipeline_RID must be 1-3YNY, which is the stable record identifier for the protocol record in the data catalog. The replicate_RID must match the replicate_RID of the input data. The mapping_assembly must be a supported reference genome in UCSC identifier format (e.g., mm10, hg19).
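
These three constraints can be checked on each output path before upload. The Python sketch below is a hedged example, not the Hub's actual validation: the function name is hypothetical, paths are assumed to be given relative to `bundledir`, and the set of supported assemblies contains only the two examples named above.

```python
from pathlib import PurePosixPath

PIPELINE_RID = "1-3YNY"  # protocol record for FaceBase Hub v1.0
# Assumption: only the two assemblies named in the text are treated as supported.
SUPPORTED_ASSEMBLIES = {"mm10", "hg19"}


def check_import_path(relpath: str, input_replicates: set):
    """Return None if relpath fits the import layout, else a description of the problem."""
    parts = PurePosixPath(relpath).parts
    # Layout: {pipeline_RID}/{replicate_RID}/proc/{mapping_assembly}/{filename.ext}
    if len(parts) != 5 or parts[2] != "proc":
        return "path does not match the import layout"
    pipeline_rid, replicate_rid, _, assembly, _ = parts
    if pipeline_rid != PIPELINE_RID:
        return f"unexpected pipeline_RID: {pipeline_rid}"
    if replicate_rid not in input_replicates:
        return f"replicate_RID not among the inputs: {replicate_rid}"
    if assembly not in SUPPORTED_ASSEMBLIES:
        return f"unsupported mapping_assembly: {assembly}"
    return None
```

Here `input_replicates` is the set of replicate RIDs that were exported to the pipeline, so an output directory for an unknown replicate is rejected.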

The following output files are ingested from the pipeline: bam (and bai), count, various tsv, fastqc, broadPeak, gappedPeak, and narrowPeak data files.

Processed files estimate

As of early 2019, the pipeline was used to process approximately 528 RNA-Seq experiments and 130 ChIP-Seq experiments.
