FaceBase Bioinformatics Pipeline v1.0

robes edited this page Sep 18, 2019 · 5 revisions

Overview

Provide a common bioinformatics pipeline for RNA-Seq and ChIP-Seq data that allows cross-comparison of, and integration with, datasets across spokes.

Contributors

  • Hub: Rob Schuler, Alejandro Bugocov, Paul Thomas, Huaiyu Mi, Carl Kesselman, Cris Williams
    • Responsible for FaceBase Hub data model, data exchange format, data export/import tools, data curation and quality control of dataset submissions.
  • Visel Lab: Axel Visel, Matt Blow, Diane Dickel, Remo Monti, Iros Barozzi, Cailyn Spurrell
    • Responsible for implementing the pipeline using the ENCODE Uniform Processing Pipelines on the DNAnexus cloud hosted service and for operating the pipeline to (re)process FaceBase 2 RNA-Seq and ChIP-Seq datasets.

Pipelines and Reference Data

This pipeline adopts (without modification) the ENCODE Uniform Processing Pipelines on the DNAnexus cloud hosted service. The workflows and reference data may be found on DNAnexus. Identifiers for pipelines/workflows and reference data on DNAnexus are given in the form project-NNNN:[workflow|file]-MMMMM.
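The identifier form described above can be split into its project and object parts with a small helper. This is a minimal sketch, not part of the FaceBase management scripts: the names `DxRef` and `parse_dx_ref` are hypothetical, and the pattern assumes the NNNN and MMMMM portions are alphanumeric, as they are in the identifiers listed on this page.

```python
import re
from typing import NamedTuple


class DxRef(NamedTuple):
    """A parsed project:object identifier (hypothetical helper type)."""
    project: str    # e.g. "project-BKpvFg00VBPV975PgJ6Q03v6"
    kind: str       # "workflow" or "file"
    object_id: str  # e.g. "workflow-F8B6zP00VBPQg1zzG2596vF9"


# Assumption: the ID portions are alphanumeric only.
_DX_REF = re.compile(r"^(project-[0-9A-Za-z]+):((workflow|file)-[0-9A-Za-z]+)$")


def parse_dx_ref(text: str) -> DxRef:
    """Split 'project-NNNN:[workflow|file]-MMMMM' into its parts."""
    match = _DX_REF.match(text.strip())
    if match is None:
        raise ValueError(f"not a project:object identifier: {text!r}")
    return DxRef(project=match.group(1), kind=match.group(3), object_id=match.group(2))


ref = parse_dx_ref(
    "project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zP00VBPQg1zzG2596vF9"
)
print(ref.kind, ref.object_id)
```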

Data Processing Pipelines

  • RNA-seq (Single-end) processing pipeline

    • ENCODE pipeline ID: ENCPL002LSE
    • DNAnexus pipeline name: ENCODE RNA-Seq (Long) Pipeline - 1 (single-end) replicate
    • DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zP00VBPQg1zzG2596vF9
  • RNA-seq (Paired-end) processing pipeline

    • ENCODE pipeline ID: ENCPL002LPE
    • DNAnexus pipeline name: ENCODE RNA-Seq (Long) Pipeline - 1 (paired-end) replicate
    • DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zJ80VBPVFxf1KKXbbq4f
  • ChIP-seq processing pipeline

    • ENCODE pipeline ID: ENCPL841HGV
    • DNAnexus pipeline name: ENCODE histone ChIP-seq Unary Control Unreplicated (specify reference)
    • DNAnexus pipeline ID: project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F7KQY800VBPxvJ6y0ZZzKpYG

Reference Data

  • ENCODE Uniform Processing Pipeline - mm10

    • mm10_male_M4_ERCC_starIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BZGy9600P1JfFx5G0Zv7bzfj
    • mm10_no_alt.chrom.sizes: project-BKpvFg00VBPV975PgJ6Q03v6:file-Bv77qQ00Qy5JK0kVkVxJ83q3
    • mm10_male_M4_ERCC_rsemIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BX3bGBj0FkxFVqpKyVjx070v
  • ENCODE Uniform Processing Pipeline - hg19

    • hg19_male_v19_ERCC_starIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BZGykk00Q72fkFvf54vB0Zzj
    • male.hg19.chrom.sizes: project-BKpvFg00VBPV975PgJ6Q03v6:file-BP4pGg00Qy57pYPx35BQ02f3
    • hg19_male_v19_ERCC_rsemIndex.tgz: project-BKpvFg00VBPV975PgJ6Q03v6:file-BV59gF80Bg5Y6qyGyvY37fXg

Metadata Requirements

Along with the raw sequencing data (fastq files), the following metadata are required as input for executing the workflows.

RNA-Seq Metadata Requirements

For RNA-Seq data processing the following metadata are required.

| Field | Description |
| --- | --- |
| Filename | Filename of the raw sequencing data |
| URL | Source URL for the data |
| Experiment | Experiment (identified by RID) to which the replicate and therefore the data belong. |
| Replicate | Replicate (identified by RID) to which the sequencing file belongs. |
| Species | Mus musculus / Homo sapiens |
| Paired | Pair-end / Single end |
| Strandedness | Stranded / Non-stranded |
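As an illustration of how the Paired field could drive the choice between the two RNA-seq workflows listed under Data Processing Pipelines, the sketch below maps a metadata record to a DNAnexus workflow identifier. The record layout (a dict keyed by the field names above) and the function names are assumptions; the actual management scripts may work differently.

```python
# Workflow identifiers copied from the Data Processing Pipelines list.
RNA_SEQ_WORKFLOWS = {
    "single": "project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zP00VBPQg1zzG2596vF9",
    "paired": "project-BKpvFg00VBPV975PgJ6Q03v6:workflow-F8B6zJ80VBPVFxf1KKXbbq4f",
}


def normalize_layout(paired_value: str) -> str:
    """Map values such as 'Pair-end' or 'Single end' to 'paired' or 'single'."""
    tokens = paired_value.lower().replace("-", " ").split()
    if tokens and tokens[0] in ("pair", "paired"):
        return "paired"
    if tokens and tokens[0] == "single":
        return "single"
    raise ValueError(f"unrecognized Paired value: {paired_value!r}")


def select_rna_seq_workflow(record: dict) -> str:
    """Return the workflow identifier for one RNA-Seq metadata record.

    Assumption: the record is a dict whose 'Paired' key holds the table value.
    """
    return RNA_SEQ_WORKFLOWS[normalize_layout(record["Paired"])]


print(select_rna_seq_workflow({"Paired": "Single end"}))
```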

ChIP-Seq Metadata Requirements

For ChIP-Seq data processing the following metadata are required.

| Field | Description |
| --- | --- |
| Filename | Filename of the raw sequencing data |
| URL | Source URL for the data |
| Experiment | Experiment (identified by RID) to which the replicate and therefore the data belong. |
| Replicate | Replicate (identified by RID) to which the sequencing file belongs. |
| Species | Mus musculus / Homo sapiens |
| Paired | Pair-end / Single end |
| Target | Control / Histone / Transcription factor |
| Control | If the target is not Control then an experiment record id (RID) is used here to reference the control data to be used by the workflow. |
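The conditional rule on the Control field can be checked before job submission. The following is a minimal sketch under stated assumptions: records are dicts keyed by the field names above, Target holds one of the three listed values verbatim, and `check_chip_seq_record` is a hypothetical name rather than a function from the management scripts.

```python
from typing import List

# Fields that must always be present, per the table above.
ALWAYS_REQUIRED = ("Filename", "URL", "Experiment", "Replicate", "Species", "Paired", "Target")
TARGETS = ("Control", "Histone", "Transcription factor")


def check_chip_seq_record(record: dict) -> List[str]:
    """Return a list of problems found in one ChIP-Seq metadata record."""
    problems = [f"missing {name}" for name in ALWAYS_REQUIRED if not record.get(name)]
    target = record.get("Target")
    if target and target not in TARGETS:
        problems.append(f"unknown Target: {target!r}")
    # A non-control experiment must reference its control experiment by RID.
    if target in TARGETS and target != "Control" and not record.get("Control"):
        problems.append("Control RID is required when Target is not Control")
    return problems
```

An empty list means the record satisfies the requirements in the table; a Histone or Transcription factor record without a Control RID yields one problem.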

Reference Metadata

The following metadata are built into the management scripts and therefore do not need to be provided from the FaceBase data catalog.

| Field | Description |
| --- | --- |
| Reference genome | Mouse: mm10; Human: hg19 (then perform liftover in parallel to hg38) |
| Reference gene set | Latest ENCODE reference release |
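Because the reference genome follows from the species, the built-in reference metadata can be pictured as a lookup table. The sketch below is one possible shape for it, using the file identifiers from the Reference Data section; the dictionary layout, key names, and `reference_for` are assumptions, not the structure used in the management scripts.

```python
PROJECT = "project-BKpvFg00VBPV975PgJ6Q03v6"

# Species -> reference genome and DNAnexus reference files (IDs from Reference Data).
REFERENCE_BY_SPECIES = {
    "Mus musculus": {
        "genome": "mm10",
        "star_index": f"{PROJECT}:file-BZGy9600P1JfFx5G0Zv7bzfj",
        "chrom_sizes": f"{PROJECT}:file-Bv77qQ00Qy5JK0kVkVxJ83q3",
        "rsem_index": f"{PROJECT}:file-BX3bGBj0FkxFVqpKyVjx070v",
    },
    "Homo sapiens": {
        "genome": "hg19",
        "star_index": f"{PROJECT}:file-BZGykk00Q72fkFvf54vB0Zzj",
        "chrom_sizes": f"{PROJECT}:file-BP4pGg00Qy57pYPx35BQ02f3",
        "rsem_index": f"{PROJECT}:file-BV59gF80Bg5Y6qyGyvY37fXg",
    },
}


def reference_for(species: str) -> dict:
    """Look up the reference entry for a Species value from the metadata."""
    try:
        return REFERENCE_BY_SPECIES[species]
    except KeyError:
        raise ValueError(f"no reference data configured for {species!r}") from None


print(reference_for("Mus musculus")["genome"])
```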

Pipeline Management

While the workflows and reference data are used without modification, the FaceBase Hub uses additional scripts to manage the process of extraction, job submission, and publication of results. These scripts are versioned in the FaceBase repository: https://github.com/informatics-isi-edu/facebase-dnanexus (private).

Data Exchange

Data are exchanged between the Hub and the Pipeline by staging data on a pipeline management node (using the pipeline management software referenced above).

Export to Pipeline

Data are exported from the Hub in BDBag format (a semantic bundling format), following the instructions for bulk download of raw sequencing data. The pipeline accepts pairs of raw sequencing data files in gzipped fastq format, which may be single-end or paired-end and, for RNA-seq, stranded or non-stranded.

The data export is structured as:

{bundledir}/
            {dataset_RID}/
                          {dataset_RID}-RNA-Seq.json   # RNA-seq metadata
                          {dataset_RID}-ChIP-Seq.json  # ChIP-seq metadata
                          {replicate_RID}/seq/
                                              {filename}.fastq.gz...  # data
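Given this layout, the fastq files of one dataset can be grouped by replicate with a short directory walk. This is a sketch only: `list_fastq_by_replicate` is a hypothetical helper, and it assumes the export has already been extracted so that `dataset_dir` points at a `{dataset_RID}` directory.

```python
from pathlib import Path
from typing import Dict, List


def list_fastq_by_replicate(dataset_dir: Path) -> Dict[str, List[Path]]:
    """Map each replicate RID to the sorted fastq.gz files under its seq/ directory."""
    files_by_replicate: Dict[str, List[Path]] = {}
    # Each replicate directory contains a seq/ subdirectory holding the data files.
    for seq_dir in sorted(dataset_dir.glob("*/seq")):
        if seq_dir.is_dir():
            replicate_rid = seq_dir.parent.name
            files_by_replicate[replicate_rid] = sorted(seq_dir.glob("*.fastq.gz"))
    return files_by_replicate
```

The metadata JSON files sit directly in the dataset directory, so they are not matched by the `*/seq` pattern and can be read separately.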

Import from Pipeline

To import data from the pipeline back into the FaceBase data catalog, the data are organized in the following structure, and deriva-upload[-cli] is used to extract metadata and re-integrate the data into the data catalog.

{bundledir}/
            {pipeline_RID}/
                           {replicate_RID}/
                                           proc/
                                                {mapping_assembly}/
                                                                   {filename.ext}...

For data processed by the FaceBase hub v1.0, the pipeline_RID must be 1-3YNY, which is the stable record identifier for the protocol record in the data catalog. The replicate_RID must match the input data replicate_RID. The mapping_assembly must be a supported reference genome using the UCSC identifier format (e.g., mm10, hg19).
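These three constraints can be checked on each output path before running the upload. The sketch below is illustrative and not part of deriva-upload: `check_import_path` is a hypothetical helper, paths are taken relative to `{bundledir}`, and the set of supported assemblies is assumed to be just the two examples given above.

```python
from pathlib import PurePosixPath
from typing import List, Set

PIPELINE_RID = "1-3YNY"                  # protocol record for pipeline v1.0
SUPPORTED_ASSEMBLIES = {"mm10", "hg19"}  # assumption: only the examples named above


def check_import_path(relative_path: str, input_replicates: Set[str]) -> List[str]:
    """Check '{pipeline_RID}/{replicate_RID}/proc/{mapping_assembly}/{filename.ext}'."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) != 5:
        return [f"expected 5 path components, got {len(parts)}"]
    pipeline_rid, replicate_rid, proc, assembly, _filename = parts
    problems = []
    if pipeline_rid != PIPELINE_RID:
        problems.append(f"pipeline_RID must be {PIPELINE_RID}, got {pipeline_rid}")
    if replicate_rid not in input_replicates:
        problems.append(f"replicate_RID {replicate_rid} does not match any input replicate")
    if proc != "proc":
        problems.append(f"third component must be 'proc', got {proc}")
    if assembly not in SUPPORTED_ASSEMBLIES:
        problems.append(f"unsupported mapping_assembly: {assembly}")
    return problems
```

Here `input_replicates` would hold the replicate RIDs from the corresponding export, so a mismatch is caught before any file is uploaded.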

The following output files are ingested from the pipeline: bam (and bai), count, various tsv, fastqc, broadPeak, gappedPeak, and narrowPeak data files.

Processed files estimate

As of early 2019, the pipeline was used to process approximately 528 RNA-Seq experiments and 130 ChIP-Seq experiments.