Discussion: Best practices for RNA-seq analysis
Merging technical replicates before or after running the pipeline
Wald vs. LRT - What is the question?
Scaling continuous variables
Adding covariates, confounding, and batch correction
SVA/ComBat-seq vs. RUVSeq
Note: VST won't regress out batch or covariate effects when normalizing
Interaction terms vs. creating merged variables.
DE - pairwise vs. full models; pull up a favorite report to share
Which cutoffs to use? Both PADJ and LFC?
FA - which approaches, when and why?
Up/Down vs All
GSEA|ORA
For GSEA using only significant genes or everything?
Background gene set:
Non-zero rowsums across all samples
Non-zero rowsums across all samples and non-NA unadjusted p values
Non-zero rowsums across all samples and non-NA adjusted p values
Discussion: Best practices for RNA-seq analysis
Filtering genes by expression (considering groups or not). The issue with a lot of zeros.
Turn off DESeq2's independent filtering and apply our own filter? Probably not recommended for DESeq2. Current approach: rely on DESeq2's filter, but remove the filtered genes when returning normalized counts.
Filter on the number of samples per group that are not zero (when you have high dropout rates, e.g. in exome hybrid capture kits) -> one motivation is to reduce computational requirements for large projects?
Cook's cutoff does handle outliers; identify the affected genes by their NA values
DESeq2 has trouble if you have unbalanced groups, with the smaller group having expression and the larger group having low expression
Summary: apply DESeq2 without pre-filtering to identify DEGs; evaluate NAs to determine if pre-filtering is required
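If the NA check suggests that pre-filtering is needed after all, the per-group rule discussed above (number of non-zero samples per group) could be sketched as follows. This is a language-neutral illustration in plain Python, not DESeq2 code; the function names, the toy counts, and the `min_nonzero=2` threshold are assumptions for the example only.

```python
from collections import defaultdict


def keep_gene(counts, groups, min_nonzero=2):
    """True if at least one group has >= min_nonzero samples with a non-zero count."""
    nonzero_per_group = defaultdict(int)
    for count, group in zip(counts, groups):
        if count > 0:
            nonzero_per_group[group] += 1
    return any(n >= min_nonzero for n in nonzero_per_group.values())


def filter_genes(count_matrix, groups, min_nonzero=2):
    """count_matrix maps gene ID -> list of raw counts, in the same order as groups."""
    return {
        gene: counts
        for gene, counts in count_matrix.items()
        if keep_gene(counts, groups, min_nonzero)
    }


# Toy data (hypothetical): 3 control and 3 treated samples.
groups = ["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"]
count_matrix = {
    "geneA": [0, 0, 0, 12, 9, 15],  # expressed in one group only: kept
    "geneB": [0, 3, 0, 0, 0, 1],    # sporadic counts, no group reaches the threshold: dropped
    "geneC": [5, 7, 6, 4, 8, 5],    # expressed everywhere: kept
}
kept = filter_genes(count_matrix, groups, min_nonzero=2)
print(sorted(kept))
```

Filtering per group rather than across all samples keeps genes like geneA that are switched on in only one condition, which a global "non-zero in N samples" rule with a large N could discard.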
Checking variance of samples by group
If one group is different, what to do? See https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups
Summary: if one group has high variance, remove it from the full model for comparisons that do not include the high-variance group. Then subset the data for comparisons that do involve the high-variance group.
DE - contrast vs coefficient when shrinking (apeglm vs ashr)
Apeglm only works with coefficients
Need to use contrasts to get every comparison
ashr is used for models with many comparisons
Apeglm in simple cases
Summary: Use ashr for comparisons with many groups to be able to pull out all the contrasts; otherwise apeglm is fine. It shrinks less.
Proposal: take this out of the group meeting, discuss it with fewer people, and report back with the conclusion (small group meeting notes):
Discussion: Best practices for RNA-seq analysis
Wald vs. LRT - What is the question?
Using LRT for time series. Examples: 1) LRT with developmental timepoints to get genes, then Wald pairwise 2) LRT across timepoints and then cluster to find time-based pattern
Interaction terms vs. creating merged variables.
Mike Love recommends combination dummy variables instead of interaction terms: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions
One example: interaction between genotype and sex
Decision: use dummy variables as default
Possible rare exception: control for individual where individuals are in multiple categories. Talk to Elizabeth about this.
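To make the "combination dummy variable" decision concrete, here is a minimal sketch of how the merged variable and the resulting pairwise comparisons could be built, using the genotype-by-sex example. It is plain Python for illustration only; the sample records and the `combine_factors` helper are hypothetical. In DESeq2 the combined column would then be the single term of the design, as in the vignette section linked above.

```python
from itertools import combinations


def combine_factors(samples, factors, sep="_"):
    """Build one combined group label per sample, e.g. 'KO_F'."""
    return [sep.join(sample[f] for f in factors) for sample in samples]


# Hypothetical metadata: one record per sample.
samples = [
    {"id": "s1", "genotype": "WT", "sex": "F"},
    {"id": "s2", "genotype": "WT", "sex": "M"},
    {"id": "s3", "genotype": "KO", "sex": "F"},
    {"id": "s4", "genotype": "KO", "sex": "M"},
]

group = combine_factors(samples, ["genotype", "sex"])
levels = sorted(set(group))

# Every pairwise comparison between combined levels is now a simple contrast.
pairs = list(combinations(levels, 2))
for a, b in pairs:
    print(f"{a} vs {b}")
```

With a single merged factor, each comparison of interest (for example KO_F vs WT_F) is a direct two-level contrast, which is usually easier to explain than an interaction coefficient.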
Adding covariates, confounding, and batch correction
ComBat-seq (part of the sva package): returns corrected counts, removing the batch effects while retaining the structure of the data. Use it when you know what the covariate/batch is. Do not add the now-removed known covariates to the DESeq2 formula. Also, don't attempt to remove a biological effect (e.g. donor); this is not conceptually valid. Best for technical variation.
Not needed in template: limma::removeBatchEffect, which gives corrected expression values for visualization. Meeta used it once to show that including the batch variable in the model and using removeBatchEffect yielded the same DEGs.
RUVSeq: used when you don't know where the unwanted variation is coming from. The package utilizes dummy variable(s); 1-5 are used. Start with 1, look at the PCA, and decide if you want more separation. Add any newly created RUV variables to the DESeq2 formula.
Normalized matrix produced – only for visualization, not for input into DESeq2
Add to template: check correlation of dummy variables produced by ruvseq with existing covariates in metadata
Compare before and after batch correction; using ranks
Note: VST won't regress out batch or covariate effects when normalizing
Worth a read: https://academic.oup.com/biostatistics/article/24/3/635/6459158
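The template item above (check the correlation of the RUV variables with existing metadata covariates) could be approached with a rank correlation. The sketch below is plain Python with a hand-written Spearman coefficient; the factor values `w1`, the covariate names, and the numbers are made up for illustration, and this is not RUVSeq output.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # a constant variable has no defined correlation
    return cov / (sx * sy)


# Hypothetical values: one estimated factor and two known covariates.
w1 = [-0.8, -0.5, -0.6, 0.4, 0.7, 0.9]
covariates = {
    "batch": [0, 0, 0, 1, 1, 1],            # batch coded as 0/1
    "rin": [8.1, 7.9, 8.4, 8.0, 8.3, 7.8],  # RNA integrity number
}
for name, values in covariates.items():
    print(name, round(spearman(w1, values), 2))
```

A factor that tracks a known covariate closely (here, `batch`) suggests the unwanted variation already has a name in the metadata. The same rank-based comparison could be applied to gene rankings before and after batch correction.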
DE - pairwise vs. full models; pull up a favorite report to share
FA - which approaches, when and why? clusterProfiler, fgsea, topGO for non-model organisms.
Up/Down vs All
GSEA|ORA
For GSEA using only significant genes or everything? Answer: all genes with non-NA padj values. Heather to find documentation to support this.
For ORA, Hallmark and Reactome are the most promising. The GO database is more fruitful than the KEGG database, but be mindful of nested GO pathways. Can provide the client with a link to MSigDB and ask them which subsets they want.
Run all databases together or separate? More accurate statistics if together but more digestible for clients if separate
Background gene set for ORA:
Non-zero rowsums across all samples
Non-zero rowsums across all samples and non-NA unadjusted p values
Non-zero rowsums across all samples and non-NA adjusted p values (at least as strict as the previous option, because a gene cannot have a non-NA adjusted p value without a non-NA raw p value)
DO NOT use all genes in genome
Need code for querying Broad gene set database, ensure that everyone is using same downloaded version of gene sets
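As an illustration of why the background matters for ORA, the sketch below restricts both the significant genes and the gene set to the background before running a one-sided hypergeometric test. It is plain Python with toy values, not clusterProfiler code. The results table, the 0.05 cutoff, the gene set, and all function names are assumptions, and the background uses the strictest option listed above (non-zero rowsums and non-NA adjusted p values).

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N genes, K in the set, n drawn)."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def background_genes(results):
    """Genes with non-zero rowsums and a non-NA adjusted p value (None means NA)."""
    return {g for g, r in results.items() if r["rowsum"] > 0 and r["padj"] is not None}


def ora(sig_genes, gene_set, background):
    """Overlap size and enrichment p value, both restricted to the background."""
    sig = sig_genes & background
    gs = gene_set & background
    k = len(sig & gs)
    return k, hypergeom_upper_tail(k, len(gs), len(sig), len(background))


# Hypothetical DE results: g9 has all-zero counts, g10 was filtered (NA padj).
results = {
    "g1": {"rowsum": 120, "padj": 0.001},
    "g2": {"rowsum": 95, "padj": 0.010},
    "g3": {"rowsum": 60, "padj": 0.030},
    "g4": {"rowsum": 80, "padj": 0.400},
    "g5": {"rowsum": 70, "padj": 0.700},
    "g6": {"rowsum": 55, "padj": 0.900},
    "g7": {"rowsum": 40, "padj": 0.600},
    "g8": {"rowsum": 33, "padj": 0.800},
    "g9": {"rowsum": 0, "padj": None},
    "g10": {"rowsum": 5, "padj": None},
}
background = background_genes(results)
sig_genes = {g for g in background if results[g]["padj"] < 0.05}

# Hypothetical gene set; g9 and gX are not testable and drop out of the test.
gene_set = {"g1", "g2", "g4", "g9", "gX"}
overlap, p_value = ora(sig_genes, gene_set, background)
print(overlap, round(p_value, 4))
```

Using the whole genome as the background would inflate N with genes that could never have been called significant, which makes overlaps look more surprising than they are.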