- use version 1.0

- improve changelog
- use fastp instead of cutadapt
- update cufflinks
- add multiQC even if no additional QC are performed.
- hide intermediate steps
- change some parameter names
- update README to better description of parameters
- add missing asserts in tests
galaxyproject · Nov 13, 2024 · a46b479
1 parent c4c76ef

Showing 4 changed files with 520 additions and 329 deletions.
diff --git a/workflows/transcriptomics/rnaseq-sr/CHANGELOG.md b/workflows/transcriptomics/rnaseq-sr/CHANGELOG.md
@@ -1,11 +1,17 @@
 # Changelog
 
-## [0.10] 2024-10-22
+## [1.0] 2024-10-22
 
-### Manual update
+### Changes in workflows
+- Add an optional subworkflow with more QC: FastQC, Picard, Read distribution on genomic features, gene body coverage, reads per chromosomes.
+- Add featureCounts as an alternative way to generate count files
+- Use fastp instead of cutadapt which uses pair overlap and allows to have optional adapter sequences
+
+### Tool update
+- `toolshed.g2.bx.psu.edu/repos/devteam/cufflinks/cufflinks/2.2.1.3` was updated to `toolshed.g2.bx.psu.edu/repos/devteam/cufflinks/cufflinks/2.2.1.4`
+
+### Test dataset
 - Using a new subsampled Yeast test data from Zenodo record https://zenodo.org/records/13987631
-- Added a subworkflow with MultiQC on FastQC, Cutadapt, STAR, featureCounts and Picard reports
-- Added featureCounts as an alternative way to generate count files
 
 ## [0.9] 2024-09-23
 

diff --git a/workflows/transcriptomics/rnaseq-sr/README.md b/workflows/transcriptomics/rnaseq-sr/README.md
@@ -2,9 +2,9 @@
 
 ## Inputs dataset
 
-- The workflow needs a list of datasets of fastqsanger.
-- As well as a gtf file with genes
-- Optional, but recommended: a gtf file with regions to exclude from normalization in Cufflinks.
+- Collection of FASTQ files: The workflow needs a list of datasets of fastqsanger.
+- GTF file of annotation: A gtf file with genes annotation.
+- GTF with regions to exclude from FPKM normalization with Cufflinks: Optional, but recommended. A gtf file with regions to exclude from normalization in Cufflinks.
 
   - For instance a gtf that masks chrM for the mm10 genome:
 
@@ -15,11 +15,13 @@ chrM	chrM_gene	exon	0	16299	.	-	.	gene_id "chrM_gene_minus"; transcript_id "chrM
 
 ## Inputs values
 
-- forward adapter sequence: this depends on the library preparation. Usually classical Illumina RNA libraries are Truseq and ISML (relatively new Illumina library) is Nextera. If you don't know, use FastQC to determine if it is Truseq or Nextera. If the read length is relatively short (50bp), there is probably no adapter so it will not impact your results.
-- reference_genome: this field will be adapted to the genomes available for STAR
-- strandedness: For stranded RNA, reverse means that the read is complementary to the coding sequence, forward means that the read is in the same orientation as the coding sequence. This will only count alignments that are compatible with your library preparation strategy. This is also used for the stranded coverage and for FPKM computation with cufflinks/StringTie.
-- cufflinks_FPKM: Whether you want to get FPKM with Cufflinks (pretty long)
-- stringtie_FPKM: Whether you want to get FPKM/TPM etc... with Stringtie.
+- Forward adapter (optional): If not provided, fastp will try to guess the adapter sequence from the data. Its sequences  depends on the library preparation. Usually classical Illumina RNA libraries are Truseq and ISML (relatively new Illumina library) is Nextera. If you don't know, use FastQC to determine if it is Truseq or Nextera. If the read length is relatively short (50bp), there is probably no adapter so it will not impact your results.
+- Generate additional QC reports: whether to compute additional QC: FastQC, Picard, Read distribution on genomic features, gene body coverage, reads per chromosomes.
+- Reference genome: this field will be adapted to the genomes available for STAR.
+- Strandedness: For stranded RNA, reverse means that the read is complementary to the coding sequence, forward means that the read is in the same orientation as the coding sequence. This will only count alignments that are compatible with your library preparation strategy. This is also used for the stranded coverage and for FPKM computation with cufflinks/StringTie.
+- Use featureCounts for generating count tables: Whether to use count tables from featureCounts instead of from STAR.
+- Compute Cufflinks FPKM: Whether you want to get FPKM with Cufflinks (pretty long).
+- Compute StringTie FPKM: Whether you want to get FPKM/TPM etc... with StringTie.
 
 ## Processing
 
@@ -41,6 +43,12 @@ chrM	chrM_gene	exon	0	16299	.	-	.	gene_id "chrM_gene_minus"; transcript_id "chrM
 
 ## Contribution
 
+### Version 0.1
+
 @lldelisle wrote the workflow and the tests.
 
 @nagoue updated the tools, made it work in usegalaxy.org, fixed some best practices.
+
+### Version 1.0
+
+@pavanvidem added the new features (featurecount + additional QC) and found a smaller test dataset.
diff --git a/workflows/transcriptomics/rnaseq-sr/rnaseq-sr-tests.yml b/workflows/transcriptomics/rnaseq-sr/rnaseq-sr-tests.yml
@@ -4,15 +4,15 @@
       class: File
       location: https://zenodo.org/records/13987631/files/Saccharomyces_cerevisiae.R64-1-1.113.gtf
       filetype: gtf
-    Collection paired FASTQ files:
+    Collection of FASTQ files:
       class: Collection
       collection_type: list
       elements:
       - class: File
         identifier: SRR5085167
         location: https://zenodo.org/records/13987631/files/SRR5085167_forward.fastqsanger.gz
     Forward adapter: AGATCGGAAGAG
-    Generate QC reports: true
+    Generate additional QC reports: true
     Reference genome: sacCer3
     Strandedness: stranded - forward
     Use featureCounts for generating count tables: true
@@ -23,54 +23,54 @@
     MultiQC stats:
       asserts:
         has_text_matching:
-            expression: "SRR5085167\t0.11[0-9]*\t18.14[0-9]*\t69.79[0-9]*\t0.37[0-9]*\t0.35[0-9]*\t94.81\t0.12[0-9]*\t34.32\t0.22[0-9]*\t37.78[0-9]*\t36.33[0-9]*\t46.0\t75.0\t75\t27.27[0-9]*\t0.39[0-9]*"
-    FeatureCounts Summary Table:
-      element_tests:
-        SRR5085167:
-            has_line:
-              line: "Assigned	115717"
+          expression: "SRR5085167\t0.11[0-9]*\t18.3[0-9]*\t69.6[0-9]*\t0.3[0-9]*\t0.3[0-9]*\t94.62\t0.12[0-9]*\t34.43\t0.2[0-9]*\t28.[0-9]*\t90.[0-9]*\t16.[0-9]*\t0.36[0-9]*\t43.[0-9]*\t91.[0-9]*\t70.[0-9]*\t36.[0-9]*\t46.0\t75.0\t75\t27.27[0-9]*\t0.39[0-9]*"
     Counts Table:
       element_tests:
         SRR5085167:
+          asserts:
             has_line:
-              line: "YAL038W	1813"
+              line: "YAL038W	1775"
     Mapped Reads:
       element_tests:
         SRR5085167:
-          has_size:
-            value: 56913572
-            delta: 2500000
+          asserts:
+            has_size:
+              value: 31570787
+              delta: 3000000
     Gene Abundance Estimates from StringTie:
       element_tests:
         SRR5085167:
           asserts:
             has_text_matching:
-              expression: "YAL038W\tCDC19\tchrI\t\\+\t71786\t73288\t57.46[0-9]*\t3549.28[0-9]*\t3066.13[0-9]*"
+              expression: "YAL038W\tCDC19\tchrI\t\\+\t71786\t73288\t57.[0-9]*\t3575.[0-9]*\t3084.[0-9]*"
     Genes Expression from Cufflinks:
       element_tests:
         SRR5085167:
           asserts:
             has_line:
-              line: "YAL038W	-	-	YAL038W	CDC19	-	chrI:71785-73288	-	-	3350.92	3139.33	3562.52	OK"
+              line: "YAL038W	-	-	YAL038W	CDC19	-	chrI:71785-73288	-	-	3375.85	3161.36	3590.33	OK"
     Transcripts Expression from Cufflinks:
       element_tests:
         SRR5085167:
           asserts:
             has_line:
-              line: "YAL038W_mRNA	-	-	YAL038W	CDC19	-	chrI:71785-73288	1503	57.4859	3350.92	3139.33	3562.52	OK"
+              line: "YAL038W_mRNA	-	-	YAL038W	CDC19	-	chrI:71785-73288	1503	57.5601	3375.85	3161.36	3590.33	OK"
     Stranded Coverage:
       element_tests:
         SRR5085167_forward:
-          has_size:
-            value: 635210
-            delta: 30000
+          asserts:
+            has_size:
+              value: 555489
+              delta: 50000
         SRR5085167_reverse:
-          has_size:
-            value: 618578
-            delta: 30000
+          asserts:
+            has_size:
+              value: 526952
+              delta: 50000
     Unstranded Coverage:
       element_tests:
         SRR5085167:
-          has_size:
-            value: 1140004
-            delta: 50000
+          asserts:
+            has_size:
+              value: 978542
+              delta: 90000
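
The updated tests rely on two kinds of assertions: `has_size` (a file size within `value` ± `delta`) and `has_text_matching` (a regular expression found somewhere in the output). The sketch below mimics both checks in plain Python to show why the tolerances and patterns were loosened. It is not Galaxy's implementation: the helper names `size_within` and `text_matches` are hypothetical, the use of `re.search` semantics is an assumption, and the sample output line is made up for illustration.

```python
import re


def size_within(actual_size: int, value: int, delta: int) -> bool:
    """Hypothetical stand-in for `has_size`: accept sizes in [value - delta, value + delta]."""
    return abs(actual_size - value) <= delta


def text_matches(text: str, expression: str) -> bool:
    """Hypothetical stand-in for `has_text_matching`: the regex may match anywhere in the text."""
    return re.search(expression, text) is not None


# Expression copied from the StringTie assertion in the new test file.
# In both YAML double-quoted strings and Python strings, "\t" is a tab and "\\+" is a literal "\+".
stringtie_expression = (
    "YAL038W\tCDC19\tchrI\t\\+\t71786\t73288\t57.[0-9]*\t3575.[0-9]*\t3084.[0-9]*"
)

# Made-up output line (illustrative values only, not real workflow output).
sample_line = "YAL038W\tCDC19\tchrI\t+\t71786\t73288\t57.5601\t3575.85\t3084.12"

# Line built from the values that the previous expression pinned down.
old_line = "YAL038W\tCDC19\tchrI\t+\t71786\t73288\t57.46\t3549.28\t3066.13"

if __name__ == "__main__":
    print(text_matches(sample_line, stringtie_expression))
    print(text_matches(old_line, stringtie_expression))
    # "Mapped Reads" assertion: value 31570787, delta 3000000.
    print(size_within(33_000_000, 31_570_787, 3_000_000))
    print(size_within(56_913_572, 31_570_787, 3_000_000))
```

Matching only the leading digits of each number (for example `3575.[0-9]*`) keeps the test stable against small floating-point differences between tool versions, while a wider `delta` does the same for compressed file sizes.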