nf-core · ggabernet · Jan 27, 2023 · Dec 21, 2022
diff --git a/README.md b/README.md
@@ -56,11 +56,14 @@ nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single
 3. QC filtering (bulk and single-cell)
 
 - Bulk sequencing filtering:
-  - Remove chimeric sequences (optional) (`EnchantR`)
+  - Remove chimeric sequences (optional) (`SHazaM`, `EnchantR`)
   - Detect cross-contamination (optional) (`EnchantR`)
-  - Collapse duplicates (`EnchantR`)
+  - Collapse duplicates (`Alakazam`, `EnchantR`)
 - Single-cell QC filtering (`EnchantR`)
-  - TODO: explain exactly what is done.
+  - Remove cells without heavy chains.
+  - Remove cells with multiple heavy chains.
+  - Remove sequences in different samples that share the same `cell_id` and nucleotide sequence.
+  - Modify `cell_id`s to ensure they are unique in the project.
 
 4. Clonal analysis (bulk and single-cell)
 

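The single-cell QC rules listed in the README hunk above can be sketched in plain Python. This is a minimal illustration, not EnchantR's implementation: the column names (`sample_id`, `cell_id`, `locus`, `sequence`), the set of heavy-chain loci, the order of the steps, and the sample-prefix scheme for unique cell IDs are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Assumption: loci treated as "heavy" chains (IGH for BCR; TRB/TRD for TCR).
HEAVY_LOCI = {"IGH", "TRB", "TRD"}


def single_cell_qc(rows):
    """Apply the four README rules to a list of dict rows (hypothetical schema)."""
    # Drop sequences whose (cell_id, sequence) pair occurs in more than one sample.
    samples_by_key = defaultdict(set)
    for r in rows:
        samples_by_key[(r["cell_id"], r["sequence"])].add(r["sample_id"])
    rows = [r for r in rows if len(samples_by_key[(r["cell_id"], r["sequence"])]) == 1]

    # Keep only cells with exactly one heavy chain: this removes cells with
    # no heavy chain and cells with multiple heavy chains in one pass.
    heavy_counts = Counter(
        (r["sample_id"], r["cell_id"]) for r in rows if r["locus"] in HEAVY_LOCI
    )
    rows = [r for r in rows if heavy_counts[(r["sample_id"], r["cell_id"])] == 1]

    # Make cell_id unique across the project by prefixing the sample (assumed scheme).
    return [{**r, "cell_id": f'{r["sample_id"]}_{r["cell_id"]}'} for r in rows]
```

Counting heavy chains per `(sample_id, cell_id)` rather than per `cell_id` alone keeps barcodes that happen to recur in different samples from being counted as one cell.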
diff --git a/conf/modules.config b/conf/modules.config
@@ -330,31 +330,31 @@ process {
 
     withName: CHANGEO_CREATEGERMLINES {
         publishDir = [
-            path: { "${params.outdir}/bulk-qc-filtering/01-create-germlines/${meta.id}" },
+            path: { "${params.outdir}/qc-filtering/bulk-qc-filtering/01-create-germlines/${meta.id}" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
         ]
     }
 
     withName: REMOVE_CHIMERIC {
         publishDir = [
-            path: { "${params.outdir}/bulk-qc-filtering/02-chimera-filter/${meta.id}" },
+            path: { "${params.outdir}/qc-filtering/bulk-qc-filtering/02-chimera-filter/${meta.id}" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
         ]
     }
 
     withName: DETECT_CONTAMINATION {
         publishDir = [
-            path: { "${params.outdir}/bulk-qc-filtering/03-detect_contamination" },
+            path: { "${params.outdir}/qc-filtering/bulk-qc-filtering/03-detect_contamination" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
         ]
     }
 
     withName: COLLAPSE_DUPLICATES {
         publishDir = [
-            path: { "${params.outdir}/bulk-qc-filtering/04-collapse-duplicates/${meta.id}" },
+            path: { "${params.outdir}/qc-filtering/bulk-qc-filtering/04-collapse-duplicates/${meta.id}" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
         ]

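For anyone with scripts that read the old output layout, the four `publishDir` changes in the hunk above amount to nesting `bulk-qc-filtering/` under a new `qc-filtering/` directory. The Python sketch below shows that mapping, together with a mirror of the `saveAs` closure that keeps `versions.yml` from being published. Both function names are hypothetical helpers for illustration and are not part of the pipeline.

```python
from pathlib import PurePosixPath
from typing import Optional


def new_output_path(old_relative_path: str) -> str:
    """Map a path relative to params.outdir from the old layout to the new one."""
    path = PurePosixPath(old_relative_path)
    # Only the bulk QC folders move in the hunk shown; other paths are returned unchanged.
    if path.parts and path.parts[0] == "bulk-qc-filtering":
        return str(PurePosixPath("qc-filtering") / path)
    return str(path)


def save_as(filename: str) -> Optional[str]:
    """Python mirror of the saveAs closure: versions.yml is not published."""
    return None if filename == "versions.yml" else filename
```

For example, `new_output_path("bulk-qc-filtering/02-chimera-filter/S1")` yields `qc-filtering/bulk-qc-filtering/02-chimera-filter/S1`, matching the new `REMOVE_CHIMERIC` path for a sample with `meta.id` equal to `S1`.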
diff --git a/docs/output.md b/docs/output.md
@@ -264,7 +264,7 @@ IgBLAST's results are parsed and standardized with [MakeDB](https://changeo.read
 
 </details>
 
-A table is generated that retains sequences with concordant locus in the  `v_call` and `locus` fields, with a `sequence_alignment` with a maximum of 10% of Ns and a length of at least 200 informative nucleotides (not `-`, `.` or `N`).
+A table is generated that retains sequences with concordant locus in the `v_call` and `locus` fields, with a `sequence_alignment` with a maximum of 10% of Ns and a length of at least 200 informative nucleotides (not `-`, `.` or `N`).
 
 ### Removal of non-productive sequences
 
@@ -308,10 +308,10 @@ Non-functional sequences identified with IgBLAST are removed with [ParseDb](http
 <details markdown="1">
 <summary>Output files</summary>
 
-- `bulk-qc-filtering/01-create-germlines/<sampleID>`
+- `qc-filtering/bulk-qc-filtering/01-create-germlines/<sampleID>`
   - `*log.txt`: Log of the process that will be parsed to generate a report.
   - `*germ-pass.tsv`: Rearrangement table in AIRR-C format with an additional
-     field with the reconstructed germline sequence for each sequence.
+    field with the reconstructed germline sequence for each sequence.
 
 </details>
 
@@ -322,10 +322,10 @@ Reconstructing the germline sequences with the [CreateGermlines](https://changeo
 <details markdown="1">
 <summary>Output files</summary>
 
-- `bulk-qc-filtering/02-chimera-filter/<sampleID>`
+- `qc-filtering/bulk-qc-filtering/02-chimera-filter/<sampleID>`
   - `*log.txt`: Log of the process that will be parsed to generate a report.
   - `*chimera-pass.tsv`: Rearrangement table in AIRR-C format sequences that
-     passed the chimera removal filter.
+    passed the chimera removal filter.
   - `<sampleID>_chimera_report`: Report with plots showing the mutation patterns
 
 </details>
@@ -338,10 +338,10 @@ the Immcantation R package [SHazaM](https://shazam.readthedocs.io/en/stable/).
 <details markdown="1">
 <summary>Output files. Optional. </summary>
 
-- `bulk-qc-filtering/03-detect_contamination`
+- `qc-filtering/bulk-qc-filtering/03-detect_contamination`
   - `*log.txt`: Log of the process that will be parsed to generate a report.
   - `*cont-flag.tsv`: Rearrangement table in AIRR-C format with sequences that
-     passed the chimera removal filter.
+    passed the chimera removal filter.
   - `all_reps_cont_report`: Report.
 
 </details>
@@ -353,11 +353,11 @@ This folder is generated when `detect_contamination` is set to `true`.
 <details markdown="1">
 <summary>Output files. </summary>
 
-- `bulk-qc-filtering/04-collapse-duplicates/<sampleID>`
+- `qc-filtering/bulk-qc-filtering/04-collapse-duplicates/<sampleID>`
   - `*log.txt`: Log of the process that will be parsed to generate a report.
   - `*collapse_report/`: Report.
     - `repertoires/*collapse-pass.tsv`: Rearrangement table in AIRR-C format with duplicated
-       sequences removed.
+      sequences removed.
 
 </details>
 
@@ -370,7 +370,7 @@ This folder is generated when `detect_contamination` is set to `true`.
   - `*log.txt`: Log of the process that will be parsed to generate a report.
   - `*all_reps_scqc_report/`: Report.
     - `*scqc-pass.tsv`: Rearrangement table in AIRR-C format with sequences that
-       passed the quality filtering.
+      passed the quality filtering.
 
 </details>
 
@@ -382,73 +382,84 @@ This folder is genereated when `detect_contamination` is set to `true`.
 - `clonal_analysis/find-threshold/`
   - `*log`: Log of the process that will be parsed to generate a report.
   - `all_reps_threshold-mean.tsv`: Mean of all hamming distance thresholds of the
-     Junction regions as determined by Shazam.
+    Junction regions as determined by Shazam.
   - `all_reps_threshold-summary.tsv`: Thresholds for each group of `--cloneby` samples.
   - `all_reps_dist_report`: Report
 
 </details>
 
 Determining the hamming distance threshold of the junction regions for clonal determination using [Shazam](https://shazam.readthedocs.io) when `clonal_threshold` is set to `auto`.
 
-## TODO updata scoper: Change-O define clones
+## SCOPer define clones
 
 ### Define clones
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `changeo/06-define_clones/<subjectID>`
-  - `tab`: Table in AIRR format containing the assigned gene information and an additional field with the clone id.
-
-</details>
-
-Assigning clones to the sequences obtained from IgBlast with the [DefineClones](https://changeo.readthedocs.io/en/version-0.4.5/tools/DefineClones.html?highlight=DefineClones) Immcantation tool.
-
-### Reconstruct germlines
-
-<details markdown="1">
-<summary>Output files</summary>
+- `clonal_analysis/define_clones/<subjectID>`
+  - `*log`: Log of the process that will be parsed to generate a report.
+  - `repertoires/<sampleID>_clone-pass.tsv`: Rearrangement tables in AIRR-C format with sequences that
+    passed the clonal assignment step. The field `clone_id` contains the clonal cluster identifiers.
+  - `tables/`: Table in AIRR format containing the assigned gene information and an additional field with the clone id.
+    - `clonal_abundance.tsv`
+    - `clonal_diversity.tsv`
+    - `clone_sizes_table.tsv`
+    - `num_clones_table_nosingle.tsv`
+    - `num_clones_table.tsv`
+  - `ggplots/`: Diversity and abundance plots as `ggplot` objects.
+  - `figures/`: Clone size, diversity and abundance `png` plots.
 
-- `changeo/07-create_germlines/<subjectID>`
-  - `tab`: Table in AIRR format contaning the assigned gene information and an additional field with the germline reconstructed gene calls.
+A similar output folder `clonal_analysis/define_clones/all_reps_clone_report` is generated for all data.
 
 </details>
 
-Reconstructing the germline sequences with the [CreateGermlines](https://changeo.readthedocs.io/en/version-0.4.5/tools/CreateGermlines.html#creategermlines) Immcantation tool.
+Assigning clones to the sequences obtained from IgBlast with the [scoper::hierarchicalClones](https://scoper.readthedocs.io/en/stable/topics/hierarchicalClones/) Immcantation tool.
+
+#
 
 ## Lineage reconstruction
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `lineage_reconstruction/`
-  - `tab`
-    - `Clones_table_patient.tsv`: contains a summary of the clones found for the patient, and the number of unique and total sequences identified in each clone.
-    - `Clones_table_patient_filtered_between_3_and_1000.tsv`: contains a summary of the clones found for the patient, and the number of unique and total sequences identified in each clone, filtered by clones of size between 3 and 1000, for which the lineages were reconstructed and the trees plotted.
-    - `xxx_germ-pass.tsv`: AIRR format table with all the sequences from a patient after the germline annotation step.
-  - `Clone_tree_plots`: Contains a rooted graphical representation of each of the clones, saved in pdf format.
-  - `Graphml_trees`: All lineage trees for the patient exported in a GraphML format: `All_graphs_patient.graphml`.
+- `clonal_analysis/dowser_lineages/`
+  - `<sampleID>*log`: Log of the process that will be parsed to generate a report.
+  - `<sampleID>_dowser_report`: Report.
 
 </details>
 
-Reconstructing clonal linage with the [Alakazam R package](https://alakazam.readthedocs.io/en/stable/) from the Immcantation toolset.
+Reconstructing clonal lineage with [IgPhyML](https://igphyml.readthedocs.io/en/stable/) and
+[dowser](https://dowser.readthedocs.io/en/stable/topics/getTrees/) from the Immcantation toolset.
 
 ## Repertoire comparison
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `repertoire_comparison/`
+- `repertoire_analysis/repertoire_comparison/`
   - `all_data.tsv`: AIRR format table containing the processed sequence information for all subjects.
   - `Abundance`: contains clonal abundance calculation plots and tables.
   - `Diversity`: contains diversity calculation plots and tables.
   - `V_family`: contains V gene and family distribution calculation plots and tables.
-- `Bcellmagic_report.html`: Contains the repertoire comparison results in an html report form: Abundance, Diversity, V gene usage tables and plots. Comparison between treatments and subjects.
+- `Airrflow_report.html`: Contains the repertoire comparison results as an HTML report: abundance, diversity and V gene usage tables and plots, with comparisons between treatments and subjects.
 
 </details>
 
 Calculation of several repertoire characteristics (diversity, abundance, V gene usage) for comparison between subjects, time points and cell populations. An Rmarkdown report is generated with the [Alakazam R package](https://alakazam.readthedocs.io/en/stable/).
 
+## Tracking number of reads
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `report_file_size/file_size_report`: Report summarizing the number of sequences after the most important pipeline steps.
+  - `tables/*tsv`: Tables with the number of sequences at each processing step.
+
+</details>
+
+Parsing the logs from the previous processes to summarize the number of sequences left after each of the most important pipeline steps.
+
 ## Log parsing
 
 <details markdown="1">