Skip to content

Latest commit

 

History

History
341 lines (247 loc) · 18.7 KB

README.md

File metadata and controls

341 lines (247 loc) · 18.7 KB

scATACpipe

Table of Contents

Introduction
Pipeline summary

Quick Start

An example using human genome with matched scRNA-seq data

An example using plant genome without matched scRNA-seq data

Documentation
Credits
Bug report/Support
Citations
Release notes

Introduction

scATACpipe is a bioinformatic pipeline for single-cell ATAC-seq (scATAC-seq) data analysis.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible.

The development of the pipeline is guided by nf-core TEMPLATE.

Pipeline Summary

The pipeline consists of 2 relevant parts: preprocessing (from fastq to fragment file) and downstream analysis. If fragment files are directly available, you can choose to skip preprocessing and run downstream analysis only.

For preprocessing, 3 alternative strategies are available that are implemented in 3 sub-workflows respectively, namely, PREPROCESS_DEFAULT, PREPROCESS_10XGENOMICS, and PREPROCESS_CHROMAP. Each of them supports various input types that are demonstrated in further detail below (also see usage).

For downstream analysis, we implemented DOWNSTREAM_ARCHR sub-workflow that integrates ArchR and other tools (e.g. AMULET for doublet detection).

Below is a simplified diagram to illustrate the design logic and functionalities of scATACpipe.

The main functionalities of each sub-workflow are summarized below. You can also refer to the output - Result folders for more details.

PREPROCESS_DEFAULT:

  1. Add barcodes to reads
  2. Correct barcodes (optional)
    • if false: skip barcode correction
    • if pheniqs or naive: also filter out non-cells using "inflection point" method
  3. Trim off adapters
  4. Mapping
    • download genome/annotation or use custom genome
    • build genome index if not supplied
  5. Filter BAM
  6. Remove PCR duplicates
  7. Quality control
  8. Generate fragment file, etc.

PREPROCESS_10XGENOMICS:

  1. Build 10XGENOMICS index if not supplied
    • download genome/annotation or use custom genome
  2. Execute cellranger_atac count command
  3. Extract fragments from valid cells according to filtered_peak_bc_matrix/barcodes.tsv

PREPROCESS_CHROMAP:

  1. Build Chromap index if not supplied
    • download genome/annotation or use custom genome
  2. Execute chromap --preset atac command
  3. Filter out non-cells

Note that no BAM file will be generated for PREPROCESS_CHROMAP option.

DOWNSTREAM_ARCHR:

  1. Build ArchR-compatible genome/annotation files if not natively supported (ArchR supports hg19, hg38, mm9, and mm10 as of 02/2022)
    • download genome/annotation if not supplied
    • build ArchR genome/gene annotation files if needed
  2. Perform downstream analysis with ArchR and generate various analytical plots
    • filter doublets (with ArchR built-in method or AMULET)
    • dimension reduction
    • batch effect correction
    • clustering
    • embedding
    • pseudo-bulk clustering
    • scRNAseq integration if supplied
    • marker gene detection
    • call peaks
    • marker peak detection
    • pairwise testing
    • motif enrichment
    • footprinting
    • coaccessibility, etc.

The pipeline also splits BED and/or BAM files according to ArchR clusterings and summarizes all results into a single MultiQC report for easy view.

Quick Start

  1. Install nextflow(>=21.10.0).

  2. For full reproducibility, install either Docker or Singularity (Apptainer) and specify -profile singularity or -profile docker accordingly when running the pipeline so that all dependencies are satisfied. Otherwise, all of the dependencies must be available locally on your PATH, which is likely not true!

  3. Download the pipeline:

git clone https://github.com/hukai916/scATACpipe.git
  1. Download a minimal test dataset:
    • The test_data1 is prepared by downsampling (5% and 10%) a dataset named "500 Peripheral blood mononuclear cells (PBMCs) from a healthy donor (Next GEM v1.1)" provided by 10xgenomics. Note that, in test_data1, I1 refers to index1, which is for sample demultiplexing and not relevant in our case; R1 refers to Read1; R2 refers to index2, which represents the cell barcode fastq; R3 refers to Read2.
cd scATACpipe
wget https://www.dropbox.com/s/uyiq18zk7dts9fx/test_data1.zip
unzip test_data1.zip
  1. Edit the replace_with_full_path in the assets/sample_sheet_test_data1.csv to use the actual full path.

  2. Test the pipeline with this minimal test_data1:

    • At least 8GB memory is recommended for test_data1.
    • By default, the local executor will be used (-profile local) meaning that all jobs will be executed on your local computer. Nextflow supports many other executors including SLURM, LSF, etc.. You can create a profile file to config which executor to use. Multiple profiles can be supplied with comma, e.g. -profile docker,lsf.
    • Please check nf-core/configs to see what other custom config files can be supplied.
  • Example run with Docker using local executor:
nextflow run main.nf -profile docker --preprocess default --outdir res_test_data1 --input_fastq assets/sample_sheet_test_data1.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens'
By executing the above command:
  - The `local executor` will be used.
  - `PREPROCESS_DEFAULT` will be used.
  - Output will be saved into `res_test_data1`.
  - Ensembl genome `homo_sapiens` will be downloaded and used as reference.
  • Example run with Singularity using LSF executor:
nextflow run main.nf -profile singularity,lsf --preprocess default --outdir res_test_data1 --input_fastq assets/sample_sheet_test_data1.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens'
  - By specifying `-profile lsf`, the `lsf` executor will be used for job submission.
  - By specifying `-profile singularity`, Singularity images will be downloaded and saved to `work/singularity` directory. It is recommended to config the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) settings to store the images in a central location.
  1. Run your own analysis:
  • A typical command:
nextflow run main.nf -profile <singularity/docker/lsf> --preprocess <default/10xgenomics/chromap> --outdir <path_to_result_dir> --input_fastq <path_to_samplesheet> --ref_fasta_ensembl <ENSEMBL_genome_name> --species_latin_name <e.g. 'homo sapiens'>
  • For help:
nextflow run main.nf --help

See documentation usage for all of the available options.

Web GUI

For easy generation of configuration file, we have implemented an interactive config generator.

It was implemented with pure HTML/JavaScript/CSS codes so that it can be used locally. To use it locally, simply click here to download the web_gui.

Then, then open it with your web browser by double clicking web_gui/index.html file or execute the following command in your terminal:

open web_gui/index.html

An example using human genome with matched scRNA-seq data

Commands and config

This section describes how the plots in the manuscript (to be added) were generated using scATACpipe. For comparison, the manuscript conducted 3 separate analyses, each using a different preprocessing strategy (default, 10xgenomics, chromap). Since the commands and preprocessed results are quite similar across the three methods, only the chromap option will be demonstrated here.

  1. The initial execution:
nextflow run main.nf -profile singularity,lsf --preprocess chromap --outdir ./results_chromap_initial --input_fastq ./assets/10X_human_scatac_fastq.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens' --archr_scrnaseq '/path/scRNA-Hematopoiesis-Granja-2019.rds' --archr_blacklist /home/hl84w/lucio_castilla/scATAC-seq/docs/hg38-blacklist.v2.bed.gz

Break down:

  • -profile singularity,lsf:

This option instructs scATACpipe to use Singularity containers and LSF as the executor. Multiple parameters are separated by commas. Since profile is pipeline-level flag, it is prefixed with a single dash (-). Module-level flags are prefixed with double dash (--).

  • --preprocess chromap:

This instructs scATACpipe to use Chromap preprocessing strategy.

  • --outdir ./results_chromap_initial:

Output will be saved into ./results_chromap_initial folder.

  • --input_fastq ./assets/10X_human_scatac_fastq.csv:

Please replace the /path/ in the 10X_human_scatac_fastq.csv with absolute paths. Details regarding the 6 samples can be found in the supplementary section of the paper. If you detect any outlier samples, you can remove them from the downstream analyses using the --filter_sample = 'PBMC_10K_C, PBMC_10K_X' flag.

  • --ref_fasta_ensembl homo_sapiens:

This specifies that the genome Homo Sapiens from ENSEMBLE will be used as reference. To view all supported genomes, check out nextflow run main.nf --support_genome.

  • --species_latin_name 'homo sapiens':

Simply the Latin name of the reference genome.

  • --archr_scrnaseq '/path/scRNA-Hematopoiesis-Granja-2019.rds'

Matching scRNA-seq data. Can ignore if not available. The example file can be downloaded here.

  • --archr_blacklist ./assets/hg38-blacklist.v2.bed.gz:

Blacklist to exclude for downstream analysis. Click here for other species.

Instead of passing each flag option via the command line, you can include them all in a configuration file and supply it with the -c option. Below is equivalent to above:

nextflow run main.nf -profile singularity,lsf -c ./conf/test_chromap_initial.config

What are inside test_chromap_initial.config:

params {
profile = 'singularity,lsf'
preprocess = 'chromap'
outdir = './results_chromap'
input_fastq = './assets/10X_human_scatac_fastq.csv'
ref_fasta_ensembl = 'homo_sapiens'
species_latin_name = 'homo sapiens'
archr_blacklist = '/home/hl84w/lucio_castilla/scATAC-seq/docs/hg38-blacklist.v2.bed.gz'
archr_scrnaseq = '/path/scRNA-Hematopoiesis-Granja-2019.rds'
}

Again, you have to replace /path/ with full absolute paths.

  1. The final execution:

After examining the results from the initial execution, we decided to remove the outlier clusters (C1, C6) from downstream analyses. These two clusters are considered problematic according to the following two plots:

  • The clustering heatmap plot from ./results_chromap_initial/archr_clustering/ folder: the cell proportions from PBMC_5K_N and PBMC_5K_V samples are unbalanced for C1, C6.

  • The marker gene heatmap plot from ./results_chromap_initial/archr_marker_gene_clusters/ folder: no distinct marker gene pattern detected in cluster C1, C6.

We used the following line to remove C1 and C6:

filter_seurat_harmony = 'C1, C6'

Also, we would like to perform constrained integration of scRNA-seq data in addition to the unconstrained integration. The following line was used to supply the grouping information:

archr_scrnaseq_grouplist = 'cTNK = c("19_CD8.N", "20_CD4.N1", "21_CD4.N2", "22_CD4.M", "23_CD8.EM", "24_CD8.CM", "25_NK"), cNonTNK = c("01_HSC", "03_Late.Eryth", "05_CMP.LMPP", "07_GMP", "08_GMP.Neut", "09_pDC", "11_CD14.Mono.1", "12_CD14.Mono.2", "16_Pre.B", "17_B")' // use full name to avoid double counting

To specify marker genes to plot, edit the following lines:

marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'
marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'
marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'
marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'

To specify a set of motifs for downstream analyses, edit the following lines:

motifs = 'default'
motifs = 'default'

To specify a set of motifs for footprinting analyses, edit the following lines:

motifs = 'default' // by default, first 3 will be analyzed
motifs = 'default' // by default, the first 3 will be plotted

We ended up with a final test_chromap_final.config:

params {
preprocess = 'chromap'
outdir = './results_chromap_final'
input_fastq = './assets/10X_human_scatac_fastq.csv'
ref_fasta_ensembl = 'homo_sapiens'
species_latin_name = 'homo sapiens'
archr_blacklist = '/home/hl84w/lucio_castilla/scATAC-seq/docs/hg38-blacklist.v2.bed.gz'
archr_scrnaseq = '/path/scRNA-Hematopoiesis-Granja-2019.rds'
// below are added after initial execution:
// for constrained integration:
archr_scrnaseq_grouplist = 'cTNK = c("19_CD8.N", "20_CD4.N1", "21_CD4.N2", "22_CD4.M", "23_CD8.EM", "24_CD8.CM", "25_NK"), cNonTNK = c("01_HSC", "03_Late.Eryth", "05_CMP.LMPP", "07_GMP", "08_GMP.Neut", "09_pDC", "11_CD14.Mono.1", "12_CD14.Mono.2", "16_Pre.B", "17_B")' // use full name to avoid double counting
// filter out undesired samples:
// filter_sample = 'PBMC_10K_C, PBMC_10K_X'
// filter out undesired clusters:
filter_seurat_harmony = 'C1, C6'
modules {
// specify a list of marker genes:
'archr_marker_gene_clusters' {
// args is for ArchR::getMarkerFeatures()
args = 'useMatrix = "GeneScoreMatrix", bias = c("TSSEnrichment", "log10(nFrags)"), testMethod = "wilcoxon"'
// getMarkers_cutoff is for ArchR::getMarkers()
getMarkers_cutoff = 'cutOff = "FDR <= 0.01 & Log2FC >= 1"'
// marker_genes is for plotting marker genes
marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'
// args2 is for visualizing embedding
args2 = 'colorBy = "GeneScoreMatrix", quantCut = c(0.01, 0.95)'
// args3 is for track plotting with ArchR::ArchRBrowser()
args3 = 'upstream = 50000, downstream = 50000'
}
'archr_marker_gene_clusters2' {
// args is for getMarkerFeatures
args = 'useMatrix = "GeneScoreMatrix", bias = c("TSSEnrichment", "log10(nFrags)"), testMethod = "wilcoxon"'
// marker_genes is for marker genes
marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'
// args2 is for visualizing embedding
args2 = 'colorBy = "GeneScoreMatrix", quantCut = c(0.01, 0.95)'
// args3 is for track plotting with ArchRBrowser
args3 = 'upstream = 50000, downstream = 50000'
}
'archr_scrnaseq_unconstrained' {
args = 'groupRNA = "BioClassification"' // alternative: cell_type
marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'
}
'archr_scrnaseq_constrained' {
marker_genes = 'CD33, CD86, FCGR1A, S100A8, S100A9, S100A12, FCGR1B, LYZ, FCGR2A, CD14, TLR1, TLR2, TLR4, TLR6, TLR7, TLR8, CCR2, CD1C, CST3, CD68, PTCRA, MS4A7, FCGR3A, CD4, ITGAM, LGALS3, FCER1G, S100A4, CD44, SELL, LTB, CCR7, CD27, CD3D, CD3E, CD3G, IL7R, CD28, CD8A, CD8B, FOXP3, IL2RA, KLRB1, GZMK, ALOX5AP, GNLY, NKG7, NCAM1, TRDC, GZMH, GZMB, KLRC2, KLRC1, CD79A, TCL1A, CD19, TLR10, IGHA1, IGHG1, TSPAN13, HLA-DRA, HLA-DOA, HLA-DQA1, MS4A1, FCER1A, IL1B, CD38, TYMS, PF4, PPBP'
}
// specify parameters for other modules:
// 'module_name' {
// 'param1_name' = value1
// 'param2_name' = value2
// }
}
}

The final execution command looks like below:

nextflow run main.nf -profile singularity,lsf -c ./conf/test_chromap_final.config -resume session_id

In order to skip already-performed analyses, you must provide the -resume session_id option. The corresponding session ID can be found using the nextflow log command. Please note that the --outdir directory is set to results_chromap_final.

When the final execution is complete, we can look at clustering heatmaps and marker gene heatmaps, and they both look good now:

  • The marker gene heatmap plot from ./results_chromap_final/archr_marker_gene_clusters/ folder:

  • The marker gene heatmap plot from ./results_chromap_final/archr_marker_gene_clusters/ folder:

Pipeline info

Upon pipeline execution completion, Nextflow will produce time and resource usage reports that are stored under pipeline_info:

An example using plant genome without matched scRNA-seq data

For this example (GSE155304), integrated analysis cannot be performed due to the lack of matched scRNA-seq data. Also note, for motifSet, when set to 'cisbp', only human and mouse are currently supported by ArchR. Therefore, for this dataset, we need to replace all occurrences of motifSet = "cisbp" to motifSet = "encode" in the ./conf/motif.config file.

  1. Command line used for default option:
nextflow run main.nf -c conf/test_default_plant.config -profile singularity,lsf

Results are stored here.

  1. Command used for chromap option:
nextflow run main.nf -c conf/test_chromap_plant.config -profile singularity,lsf

Results are stored here.

  1. Command used for 10xgenomics option:
nextflow run main.nf -c conf/test_10xgenomics_plant.config -profile singularity,lsf

Results are stored here.

Documentation

The scATACpipe workflow comes with documentation about the pipeline: usage and output.

Credits

scATACpipe was originally designed and written by Kai Hu, Haibo Liu, and Lihua Julie Zhu.

We thank the following people for their extensive assistance in the development of this pipeline: Nathan Lawson.

Bug report/Support

For help, bug report, or feature requests, the developers would prefer and appreciate that you create a GitHub issue by clicking here. If you would like to extend scATACpipe for your own good, feel free to fork the repo.

Citations

Please cite scATACpipe [to be added] if you use it for your research.

A Template of Method can be found here.

A complete list of references for the tools used by scATACpipe can be found here.

Release notes

v0.1.0
  • initial release
v0.1.1
  • add web_gui for easy generation of config file
  • add one more example for Arabidopsis Thaliana
  • minor improvements