Merge pull request #228 from ggabernet/docs
Improve docs
ggabernet authored Feb 13, 2023
2 parents 4b1301f + e5a1077 commit 853c9d8
Showing 4 changed files with 123 additions and 106 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -15,8 +15,8 @@
## Introduction

**nf-core/airrflow** is a bioinformatics best-practice pipeline to analyze B-cell or T-cell repertoire sequencing data. It makes use of the [Immcantation](https://immcantation.readthedocs.io)
toolset. The input data can be (a) targeted amplicon bulk sequencing data of the V, D, J and C regions
of the B/T-cell receptor with multiplex PCR or 5' RACE protocol or (b) assembled reads (bulk or single cell).
toolset. The input data can be targeted amplicon bulk sequencing data of the V, D, J and C regions
of the B/T-cell receptor with multiplex PCR or 5' RACE protocol, or assembled reads (bulk or single cell).

![nf-core/airrflow overview](docs/images/airrflow_workflow_overview.png)

@@ -26,14 +26,14 @@ On release, automated continuous integration tests run the pipeline on a full-si

## Pipeline summary

nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single cell targeted sequencing. Several protocols are supported, please see the [usage documentation](https://nf-co.re/airrflow/usage) for more details on the supported protocols.
nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single cell targeted sequencing data. Several protocols are supported, please see the [usage documentation](https://nf-co.re/airrflow/usage) for more details on the supported protocols.

![nf-core/airrflow overview](docs/images/metro-map-airrflow.png)

1. QC and sequence assembly (bulk only)

- Raw read quality control, adapter trimming and clipping (`Fastp`)
- Filtering sequences by sequencing quality (`pRESTO FilterSeq`).
- Raw read quality control, adapter trimming and clipping (`Fastp`).
- Filtering sequences by base quality (`pRESTO FilterSeq`).
- Mask amplicon primers (`pRESTO MaskPrimers`).
- Pair read mates (`pRESTO PairSeq`).
- For UMI-based sequencing:
@@ -45,7 +45,7 @@ nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single

2. V(D)J annotation and filtering (bulk and single-cell)

- Assigning gene segment alleles with `IgBlast` using the IMGT database (`Change-O AssignGenes`).
- Assigning gene segments with `IgBlast` using the IMGT database (`Change-O AssignGenes`).
- Annotate alignments in AIRR format (`Change-O MakeDB`)
- Filter by alignment quality (locus matching v_call chain, min 200 informative positions, max 10% N nucleotides)
- Filter productive sequences (`Change-O ParseDB split`)
@@ -66,7 +66,7 @@ nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single

4. Clonal analysis (bulk and single-cell)

- Find Hamming distance threshold for clone definition (`SHazaM`, `EnchantR`).
- Find threshold for clone definition (`SHazaM`, `EnchantR`).
- Create germlines and define clones, repertoire analysis (`Change-O`, `EnchantR`).
- Build lineage trees (`SCOPer`, `IgphyML`, `EnchantR`).

18 changes: 17 additions & 1 deletion conf/modules.config
@@ -89,6 +89,14 @@ process {
ext.args = '--quiet'
}

withName: 'MERGE_UMI' {
publishDir = [
[
enabled: false
]
]
}

// -----------------
// sequence assembly
// -----------------
@@ -264,6 +272,14 @@ process {
]
}

withName: 'UNZIP_DB' {
publishDir = [
[
enabled: false
]
]
}

withName: CHANGEO_CONVERTDB_FASTA_FROM_AIRR {
publishDir = [
path: { "${params.outdir}/vdj_annotation/convert-db/${meta.id}" },
@@ -442,7 +458,7 @@ process {

withName: PARSE_LOGS {
publishDir = [
path: { "${params.outdir}/parsed-logs" },
path: { "${params.outdir}/parsed_logs" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
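
As an illustrative aside that is not part of this commit: the `publishDir` settings changed above can typically be overridden from a user-supplied config. The sketch below assumes a custom config file passed to Nextflow with `-c`; the file name `custom.config` and the `merge_umi` output folder are hypothetical, while `params.outdir`, `params.publish_dir_mode` and the `saveAs` closure are taken from the hunks above.

```groovy
// custom.config (hypothetical file name), e.g. passed with: -c custom.config
process {
    // Re-enable publishing for MERGE_UMI, which this commit disables by default.
    withName: 'MERGE_UMI' {
        publishDir = [
            path: { "${params.outdir}/merge_umi" },   // hypothetical output folder
            mode: params.publish_dir_mode,
            // Same convention as PARSE_LOGS: returning null skips publishing versions.yml.
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
    }
}
```

Leaving the override out keeps the behaviour introduced here, where the intermediate `MERGE_UMI` and `UNZIP_DB` outputs are not copied to the results directory.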
91 changes: 48 additions & 43 deletions docs/output.md
@@ -10,39 +10,48 @@ The directories listed below will be created in the results directory after the

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

TODO: update this to add/remove lines

- [FastP](#fastp) - read quality control, adapter trimming and read clipping
- [pRESTO](#presto) - read pre-processing
- [Filter by sequence quality](#filter-by-sequence-quality) - filter sequences by quality
- [Mask primers](#mask-primers) - Masking primers
- [Pair mates](#pair-mates) - Pairing sequence mates.
- [QC and sequence assembly (bulk only)](#sequence-assembly)
- [FastP](#fastp) - read quality control, adapter trimming and read clipping.
- [Filter by sequence quality](#filter-by-sequence-quality) - filter sequences by base quality.
- [Mask primers](#mask-primers) - Mask amplicon primers.
- [Pair mates](#pair-mates) - Pair read mates.
- [Cluster sets](#cluster-sets) - Cluster sequences according to similarity.
- [Build consensus](#build-UMI-consensus) - Build consensus of sequences with the same UMI barcode.
- [Re-pair mates](#re-pair-mates) - Re-pairing sequence mates.
- [Assemble mates](#assemble-mates) - Assemble sequence mates.
- [Remove duplicates](#remove-duplicates) - Remove and annotate read duplicates.
- [Filter sequences for at least 2 representative](#filter-sequences-for-at-least-2-representative) - Remove sequences that do not have at least 2 duplicates.
- [FastQC](#fastqc) - read quality control post-assembly
- [Change-O](#change-o) - Assign genes and clonotyping
- [FastQC](#fastqc) - read quality control post-assembly
- [VDJ annotation](#vdj-annotation) - Assign genes and clonotyping
- [Convert to fasta](#convert-input-to-fasta-optional)
- [Assign genes with Igblast](#assign-genes-with-igblast)
- [Make database from assigned genes](#make-database-from-assigned-genes)
- [Quality filter alignments](#quality-filter-alignments)
- [Removal of non-productive sequences](#removal-of-non-productive-sequences)
- [Selection of IGH / TR sequences](#selection-of-IGH-/-TR-sequences)
- [Convert database to fasta](#convert-database-to-fasta)
- [Shazam](#shazam) - Genotyping and Clonal threshold
- [Genotyping and hamming distance threshold](#determining-hamming-distance-threshold)
- [Change-O define clones](#change-o-define-clones)
- [Define clones](#define-clones) - Defining clonal B-cell or T-cell groups
- [Reconstruct germlines](#reconstruct-germlines) - Reconstruct gene calls of germline sequences
- [Lineage reconstruction](#lineage-reconstruction) - Clonal lineage reconstruction.
- [Removal of sequences with junction length not multiple of 3](#removal-of-sequences-with-junction-length-not-multiple-of-3)
- [Annotate metadata](#annotate-metadata)
- [Bulk QC filtering](#bulk-qc-filtering)
- [Reconstruct germlines](#reconstruct-germlines)
- [Chimeric read filtering](#chimeric-read-filtering-optional)
- [Detect contamination](#detect-contamination-optional)
- [Collapse duplicates](#collapse-duplicates)
- [Single cell QC](#single-cell-qc)
- [Clonal analysis](#clonal-analysis)
- [Find clonal threshold](#find-clonal-threshold)
- [SCOPer define clones](#scoper-define-clones) - Defining clonal B-cell or T-cell groups
- [Dowser lineage reconstruction](#dowser-lineage-reconstruction) - Clonal lineage reconstruction.
- [Repertoire analysis](#repertoire-analysis) - Repertoire analysis and comparison.
- [Report file size](#report-file-size) - Log parsing.
- [Log parsing](#log-parsing) - Log parsing.
- [Databases](#databases)
- [MultiQC](#MultiQC) - MultiQC
- [Databases](#databases) - Downloaded databases.
- [MultiQC](#MultiQC) - MultiQC report.
- [Pipeline information](#pipeline-information) - Pipeline information

## Fastp
## Sequence assembly

> **NB:** If using the sans-UMI subworkflow by specifying `umi_length=0`, the presto directory ordering numbers will differ e.g., mate pair assembly results will be output to `presto/01-assemblepairs/<sampleID>` as this will be the first presto step.
### Fastp

<details markdown="1">
<summary>Output files</summary>
@@ -57,10 +66,6 @@ TODO: update this to add/remove lines

[fastp](https://doi.org/10.1093/bioinformatics/bty560) gives general quality metrics about your sequenced reads, as well as allows filtering reads by quality, trimming adapters and clipping reads at 5' or 3' ends. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [fastp documentation](https://github.com/OpenGene/fastp).

## presto

> **NB:** If using the sans-UMI subworkflow by specifying `umi_length=0`, the presto directory ordering numbers will differ e.g., mate pair assembly results will be output to `presto/01-assemblepairs/<sampleID>` as this will be the first presto step.
### Filter by sequence quality

<details markdown="1">
@@ -187,7 +192,7 @@ Remove duplicates using [CollapseSeq](https://presto.readthedocs.io/en/stable/to

Remove sequences that do not have at least 2 representatives using [SplitSeq](https://presto.readthedocs.io/en/stable/tools/SplitSeq.html) from the pRESTO Immcantation toolset.

## FastQC
### FastQC

<details markdown="1">
<summary>Output files</summary>
@@ -209,9 +214,9 @@ Remove sequences which do not have 2 representative using [SplitSeq](https://pre

> **NB:** Two sets of FastQC plots are displayed in the MultiQC report: first for the raw _untrimmed_ and unmated reads and secondly for the assembled and QC filtered reads (but before collapsing duplicates). They may contain adapter sequence and potentially regions with low quality.
## Change-O
## VDJ annotation

### Convert input to fasta, if needed
### Convert input to fasta (optional)

<details markdown="1">
<summary>Output files. Optional. </summary>
@@ -253,7 +258,7 @@ Assign genes with Igblast, using the IMGT database is performed by the [AssignGe

IgBLAST's results are parsed and standardized with [MakeDB](https://changeo.readthedocs.io/en/stable/examples/igblast.html#processing-the-output-of-igblast) to follow the [AIRR Community standards](https://docs.airr-community.org/en/stable/datarep/rearrangements.html) for rearrangement data.

### Quality filter sequences
### Quality filter alignments

<details markdown="1">
<summary>Output files</summary>
@@ -290,7 +295,7 @@ Non-functional sequences identified with IgBLAST are removed with [ParseDb](http

</details>

### Add metadata
### Annotate metadata

<details markdown="1">
<summary>Output files</summary>
@@ -301,7 +306,7 @@ Non-functional sequences identified with IgBLAST are removed with [ParseDb](http

</details>

## Shazam
## Bulk QC filtering

### Reconstruct germlines

@@ -317,7 +322,7 @@ Non-functional sequences identified with IgBLAST are removed with [ParseDb](http

Reconstructing the germline sequences with the [CreateGermlines](https://changeo.readthedocs.io/en/stable/tools/CreateGermlines.html#creategermlines) Immcantation tool.

### Chimera filter
### Chimeric read filtering (optional)

<details markdown="1">
<summary>Output files</summary>
@@ -333,7 +338,7 @@ Reconstructing the germline sequences with the [CreateGermlines](https://changeo
Mutation patterns in different window sizes are analyzed with functions from
the Immcantation R package [SHazaM](https://shazam.readthedocs.io/en/stable/).

### Detect contamination
### Detect contamination (optional)

<details markdown="1">
<summary>Output files. Optional. </summary>
@@ -361,7 +366,7 @@ This folder is generated when `detect_contamination` is set to `true`.

</details>

### Single cell QC
## Single cell QC

<details markdown="1">
<summary>Output files. </summary>
@@ -374,12 +379,14 @@ This folder is generated when `detect_contamination` is set to `true`.

</details>

### Determining hamming distance threshold
## Clonal analysis

### Find clonal threshold

<details markdown="1">
<summary>Output files</summary>

- `clonal_analysis/find-threshold/`
- `clonal_analysis/find_threshold/`
- `*log`: Log of the process that will be parsed to generate a report.
- `all_reps_threshold-mean.tsv`: Mean of all hamming distance thresholds of the
Junction regions as determined by Shazam.
@@ -390,9 +397,7 @@ This folder is generated when `detect_contamination` is set to `true`.

Determining the hamming distance threshold of the junction regions for clonal determination using [Shazam](https://shazam.readthedocs.io) when `clonal_threshold` is set to `auto`.
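
A minimal sketch of how this step might be configured, assuming parameters are set in a `params` block of a custom Nextflow config. The value `auto` comes from the sentence above; the fixed numeric alternative shown in the comment is an assumption and should be checked against the pipeline's parameter documentation.

```groovy
params {
    // 'auto' triggers the SHazaM-based threshold search described above.
    clonal_threshold = 'auto'

    // Assumption: a fixed numeric threshold could be supplied instead to skip the search.
    // clonal_threshold = 0.1
}
```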

## SCOPer define clones

### Define clones
### SCOPer define clones

<details markdown="1">
<summary>Output files</summary>
@@ -416,9 +421,7 @@ A similar output folder `clonal_analysis/define_clones/all_reps_clone_report` is

Assigning clones to the sequences obtained from IgBlast with the [scoper::hierarchicalClones](https://scoper.readthedocs.io/en/stable/topics/hierarchicalClones/) Immcantation tool.

#

## Lineage reconstruction
### Dowser Lineage reconstruction

<details markdown="1">
<summary>Output files</summary>
@@ -432,7 +435,7 @@ Assigning clones to the sequences obtained from IgBlast with the [scoper::hierar
Reconstructing clonal lineage with [IgPhyML](https://igphyml.readthedocs.io/en/stable/) and
[dowser](https://dowser.readthedocs.io/en/stable/topics/getTrees/) from the Immcantation toolset.

## Repertoire comparison
## Repertoire analysis

<details markdown="1">
<summary>Output files</summary>
@@ -448,7 +451,7 @@ Reconstructing clonal lineage with [IgPhyML](https://igphyml.readthedocs.io/en/s

Calculation of several repertoire characteristics (diversity, abundance, V gene usage) for comparison between subjects, time points and cell populations. An Rmarkdown report is generated with the [Alakazam R package](https://alakazam.readthedocs.io/en/stable/).

## Tracking number of reads
## Report file size

<details markdown="1">
<summary>Output files</summary>
@@ -476,6 +479,8 @@ Parsing the logs from the previous processes. Summary of the number of sequences

Copy of the IMGT database downloaded by the `fetch_databases` process, used for the gene assignment step.

If databases are provided with `--imgtdb_base` and `--igblast_base` this folder will not be present.
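
As a hedged illustration of the case described above, the two options could be set in a custom config instead of on the command line. The parameter names are taken from the text; both paths are hypothetical placeholders, and whether each should point to a directory or an archive is not stated here and should be confirmed in the usage documentation.

```groovy
params {
    // Hypothetical locations of pre-downloaded reference databases.
    imgtdb_base  = '/path/to/imgtdb_base'
    igblast_base = '/path/to/igblast_base'
}
```

With both values set, the `fetch_databases` step is not needed, which is why the `databases` folder would be absent from the results.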

## MultiQC

<details markdown="1">