Skip to content

Commit

Permalink
Merge pull request #544 from pavanvidem/rna-seq-star-fc-deseq
Browse files Browse the repository at this point in the history
Update PE RNA-seq workflow
  • Loading branch information
lldelisle authored Nov 13, 2024
2 parents 03cf296 + 93049d0 commit c3b1516
Show file tree
Hide file tree
Showing 5 changed files with 2,107 additions and 586 deletions.
2 changes: 2 additions & 0 deletions workflows/transcriptomics/rnaseq-pe/.dockstore.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,5 @@ workflows:
authors:
- name: Lucille Delisle
orcid: 0000-0002-1964-4960
- name: Pavankumar Videm
orcid: 0000-0002-5192-126X
13 changes: 13 additions & 0 deletions workflows/transcriptomics/rnaseq-pe/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,18 @@
# Changelog

## [1.0] 2024-09-23

### Changes in workflows
- Add an optional subworkflow with more QC: FastQC, Picard, Read distribution on genomic features, gene body coverage, reads per chromosomes.
- Add featureCounts as an alternative way to generate count files
- Use fastp instead of cutadapt which uses pair overlap and allows to have optional adapter sequences

### Tool update
- `toolshed.g2.bx.psu.edu/repos/devteam/cufflinks/cufflinks/2.2.1.3` was updated to `toolshed.g2.bx.psu.edu/repos/devteam/cufflinks/cufflinks/2.2.1.4`

### Test dataset
- Using a new subsampled Yeast test data from Zenodo record https://zenodo.org/records/13987631

## [0.9] 2024-09-23

### Automatic update
Expand Down
22 changes: 13 additions & 9 deletions workflows/transcriptomics/rnaseq-pe/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,9 @@

## Inputs dataset

- The workflow needs a list of dataset pairs of fastqsanger.
- As well as a gtf file with genes
- Optional, but recommended: a gtf file with regions to exclude from normalization in Cufflinks.
- Collection paired FASTQ files: The workflow needs a list of dataset pairs of fastqsanger.
- GTF file of annotation: A gtf file with genes annotation.
- GTF with regions to exclude from FPKM normalization with Cufflinks: Optional, but recommended. A gtf file with regions to exclude from normalization in Cufflinks.

- For instance a gtf that masks chrM for the mm10 genome:

Expand All @@ -15,19 +15,23 @@ chrM chrM_gene exon 0 16299 . - . gene_id "chrM_gene_minus"; transcript_id "chrM

## Inputs values

- adapter sequences: this depends on the library preparation. Usually classical Illumina RNA libraries are Truseq and ISML (relatively new Illumina library) is Nextera. If you don't know, use FastQC to determine if it is Truseq or Nextera. If the read length is relatively short (50bp), there is probably no adapter so it will not impact your results.
- reference_genome: this field will be adapted to the genomes available for STAR
- strandedness: For stranded RNA, reverse means that the first read in a pair is complementary to the coding sequence, forward means that the first read in a pair is in the same orientation as the coding sequence. This will only count alignments that are compatible with your library preparation strategy. This is also used for the stranded coverage and for FPKM computation with cufflinks/StringTie.
- cufflinks_FPKM: Whether you want to get FPKM with Cufflinks (pretty long)
- stringtie_FPKM: Whether you want to get FPKM/TPM etc... with StringTie.
- Forward and Reverse adapter (optional): By default, fastp will try to overlap both reads and will only use these sequences if R1/R2 are found not overlapped. Their sequences depends on the library preparation. Usually classical Illumina RNA libraries is Truseq and ISML (relatively new Illumina library) is Nextera.
- Generate additional QC reports: whether to compute additional QC: FastQC, Picard, Read distribution on genomic features, gene body coverage, reads per chromosomes.
- Reference genome: this field will be adapted to the genomes available for STAR.
- Strandedness: For stranded RNA, reverse means that the first read in a pair is complementary to the coding sequence, forward means that the first read in a pair is in the same orientation as the coding sequence. This will only count alignments that are compatible with your library preparation strategy. This is also used for the stranded coverage and for FPKM computation with cufflinks/StringTie.
- Use featureCounts for generating count tables: Whether to use count tables from featureCounts instead of from STAR.
- Compute Cufflinks FPKM: Whether you want to get FPKM with Cufflinks (pretty long).
- Compute StringTie FPKM: Whether you want to get FPKM/TPM etc... with StringTie.

## Processing

- The workflow will remove adapters and low quality bases and filter out any read smaller than 15bp.
- The filtered reads are mapped with STAR with ENCODE parameters (for long RNA-seq but I use it for short also). STAR is also used to count reads per gene and generate strand-specific normalized coverage (on uniquely mapped reads).
- Optionally featureCounts is used to generate count files when this option enabled.
- Optionally FastQC, Picard, read_distribution, geneBody_coverage, samtools idxstats, Picard are run to get additional QC.
- A multiQC is run to have an overview of the QC. This can also be used to get the strandedness.
- FPKM values for genes and transcripts are computed with cufflinks using correction for multi-mapped reads (this step is optionnal).
- FPKM/TPM values for genes are computed with StringTie.
- FPKM/TPM values for genes are computed with StringTie (this step is optional).
- The BAM is filtered to keep only uniquely mapped reads (tag NH:i:1).
- Unstranded coverage is computed with bedtools and normalized to the number of million uniquely mapped reads.
- The three coverage files are converted to bigwig.
Expand Down
133 changes: 61 additions & 72 deletions workflows/transcriptomics/rnaseq-pe/rnaseq-pe-tests.yml
Original file line number Diff line number Diff line change
@@ -1,99 +1,88 @@
- doc: Test outline for RNAseq_PE
job:
gtf:
GTF file of annotation:
class: File
location: https://zenodo.org/record/4541751/files/Drosophila_melanogaster.BDGP6.87.gtf
location: https://zenodo.org/records/13987631/files/Saccharomyces_cerevisiae.R64-1-1.113.gtf
filetype: gtf
PE fastq input:
Collection paired FASTQ files:
class: Collection
collection_type: list:paired
elements:
- class: Collection
type: paired
identifier: GSM461177
identifier: SRR5085167
elements:
- identifier: forward
class: File
location: https://zenodo.org/record/4541751/files/GSM461177_1_subsampled.fastqsanger
filetype: fastqsanger
- identifier: reverse
class: File
location: https://zenodo.org/record/4541751/files/GSM461177_2_subsampled.fastqsanger
filetype: fastqsanger
forward_adapter: GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
reverse_adapter: GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
reference_genome: dm6
strandedness: unstranded
cufflinks_FPKM: false
stringtie_FPKM: true
- class: File
identifier: forward
location: https://zenodo.org/records/13987631/files/SRR5085167_forward.fastqsanger.gz
- class: File
identifier: reverse
location: https://zenodo.org/records/13987631/files/SRR5085167_reverse.fastqsanger.gz
Forward adapter: AGATCGGAAGAG
Reverse adapter: GATCGTCGGACT
Generate additional QC reports: true
Reference genome: sacCer3
Use featureCounts for generating count tables: true
Strandedness: stranded - forward
Compute Cufflinks FPKM: true
GTF with regions to exclude from FPKM normalization with Cufflinks: null
Compute StringTie FPKM: true
outputs:
output_log:
MultiQC stats:
asserts:
- that: "has_text_matching"
expression: "SRR5085167\t0.10[0-9]*\t16.45[0-9]*\t43.38[0-9]*\t0.32[0-9]*\t0.30[0-9]*\t93.75\t0.11[0-9]*\t34.29\t0.19[0-9]*\t17.6[0-9]*\t91.8[0-9]*\t30.0[0-9]*\t0.64[0-9]*\t43.5[0-9]*\t81.0[0-9]*\t72.6[0-9]*\t"
- that: "has_text_matching"
expression: "SRR5085167_forward\t*36.33[0-9]*\t46.0\t75.0\t75\t27.27[0-9]*\t0.39[0-9]*"
- that: "has_text_matching"
expression: "SRR5085167_reverse\t*35.31[0-9]*\t46.0\t75.0\t75\t45.45[0-9]*\t0.39[0-9]*"
Stranded Coverage:
element_tests:
GSM461177:
SRR5085167_forward:
asserts:
- that: "has_text"
text: "Number of input reads |\t1032407"
- that: "has_text"
text: "Uniquely mapped reads number |\t854812"
- that: "has_text"
text: "Number of reads mapped to multiple loci |\t82072"
mapped-reads:
element_tests:
GSM461177:
has_size:
value: 635210
delta: 30000
SRR5085167_reverse:
asserts:
has_size:
value: 89048730
delta: 8000000
'MultiQC on input dataset(s): Stats':
asserts:
has_line:
line: "Sample STAR_mqc_generalstats_star_total_reads_1 STAR_mqc_generalstats_star_mapped_1 STAR_mqc_generalstats_star_mapped_percent_1 STAR_mqc_generalstats_star_uniquely_mapped_1 STAR_mqc_generalstats_star_uniquely_mapped_percent_1 STAR_mqc_generalstats_star_multimapped_1 Cutadapt_mqc_generalstats_cutadapt_percent_trimmed"
has_text_matching:
expression: "GSM461177\t1.0[0-9]*\t0.93[0-9]*\t90.[0-9]*\t0.8[0-9]*\t82.8[0-9]*\t0.08[0-9]*\t6.0[0-9]*"
MultiQC webpage:
asserts:
- that: "has_text"
text: "GSM461177"
- that: "has_text"
text: "<a href=\"#cutadapt_filtered_reads\" class=\"nav-l2\">Filtered Reads</a>"
- that: "has_text"
text: "<a href=\"#star\" class=\"nav-l1\">STAR</a>"
reads_per_gene from STAR:
value: 618578
delta: 30000
Gene Abundance Estimates from StringTie:
element_tests:
GSM461177:
SRR5085167:
asserts:
- that: "has_text"
text: "N_ambiguous 25107 5900 5518"
- that: "has_text"
text: "FBgn0010247 13 5 8"
HTS count like output:
has_text_matching:
expression: "YAL038W\tCDC19\tchrI\t\\+\t71786\t73288\t92.5[0-9]*\t3273.9[0-9]*\t3900.[0-9]*"
Unstranded Coverage:
element_tests:
GSM461177:
SRR5085167:
asserts:
has_text:
text: "FBgn0010247\t13"
both strands coverage:
has_size:
value: 1140004
delta: 50000
Mapped Reads:
element_tests:
GSM461177:
SRR5085167:
asserts:
has_size:
value: 9885639
delta: 900000
stranded coverage:
value: 56913572
delta: 2500000
Genes Expression from Cufflinks:
element_tests:
GSM461177_reverse:
SRR5085167:
asserts:
has_size:
value: 5920766
delta: 600000
GSM461177_forward:
has_line:
line: "YAL038W - - YAL038W CDC19 - chrI:71785-73288 - - 3437.1 3211.44 3662.76 OK"
Transcripts Expression from Cufflinks:
element_tests:
SRR5085167:
asserts:
has_size:
value: 5920766
delta: 600000
genes_expression_stringtie:
has_line:
line: "YAL038W_mRNA - - YAL038W CDC19 - chrI:71785-73288 1503 102.837 3437.1 3211.44 3662.76 OK"
Counts Table:
element_tests:
GSM461177:
SRR5085167:
asserts:
has_text:
text: "FBgn0031217\tCG11377\tchr2L\t+\t102380\t104142\t1.963814\t33.024075\t62.009155"
has_line:
line: "YAL038W 1591"
Loading

0 comments on commit c3b1516

Please sign in to comment.