Merge branch 'main' into mary-episode4-changes

carpentries-incubator · Apr 2, 2024 · 6e2ced4
2 parents 755a00a + 59e3ff8
Showing 15 changed files with 376 additions and 177 deletions.
diff --git a/CITATION b/CITATION
@@ -1 +1,2 @@
-FIXME: describe how to cite this lesson.
+O’Callaghan A, Robertson G, LLewellyn M, Becher H, Meynert A, Vallejos C, Ewing A. (2024). High dimensional statistics with R. https://github.com/
+carpentries-incubator/high-dimensional-stats-r.
diff --git a/README.md b/README.md
@@ -2,21 +2,7 @@
 
 [![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/)
 
-**Thanks for contributing to The Carpentries Incubator!**
-This repository provides a blank starting point for lessons to be developed
-here.
-
-A member of the [Carpentries Curriculum Team](https://carpentries.org/team/)
-will work with you to get your lesson listed on the
-[Community Developed Lessons page][community-lessons]
-and make sure you have everything you need to begin developing your new lesson.
-
-## What to do next
-
-Before you begin developing your new lesson,
-here are a few things we recommend you do:
-
-* [ ] [Add relevant topic tags to your lesson repository][cdh-topic-tags].
+This repository is part of The Carpentries Incubator, a place for The Carpentries community to collaboratively create, test, and improve lessons.
 
 ## Contributing
 
@@ -42,6 +28,10 @@ Look for the tag
 This indicates that the maintainers will welcome a pull request fixing this
 issue.
 
+## Reviews
+
+The lesson has been iteratively developed and improved. For information on the development process, reviews and feedback from instructors following teaching see [REVIEWS](reviews.md).
+
 ## Maintainer(s)
 
 Current maintainers of this lesson are

diff --git a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -142,18 +142,18 @@ of the challenges we are facing when working with high-dimensional data.
 > > 
 > > 
 > > ```{r dim-prostate, eval = FALSE}
-> > dim(prostate)   #print the number of rows and columns
+> > dim(prostate)    # print the number of rows and columns
 > > ```
 > >
 > > ```{r head-prostate, eval = FALSE}
-> > names(prostate) # examine the variable names
-> > head(prostate)   #print the first 6 rows
+> > names(prostate)  # examine the variable names
+> > head(prostate)   # print the first 6 rows
 > > ```
 > > 
-> > ```{r pairs-prostate}
-> > names(prostate)  #examine column names
+> > ```{r pairs-prostate, fig.cap="Pairwise plots of the 'prostate' dataset.", fig.alt="A set of pairwise scatterplots of variables in the 'prostate' dataset, namely lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa. The plots are shown in a grid."}
+> > names(prostate)  # examine column names
 > >
-> > pairs(prostate)  #plot each pair of variables against each other
+> > pairs(prostate)  # plot each pair of variables against each other
 > > ```
 > > The `pairs()` function plots relationships between each of the variables in
 > > the `prostate` dataset. This is possible for datasets with smaller numbers

diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd
@@ -94,7 +94,7 @@ methyl_mat <- assay(methylation)
 The distribution of these M-values looks like this:
 
 ```{r histx, fig.cap="Methylation levels are generally bimodally distributed.", fig.alt="Histogram of M-values for all features. The distribution appears to be bimodal, with a large number of unmethylated features as well as many methylated features, and many intermediate features."}
-hist(methyl_mat, breaks = "FD", xlab = "M-value")
+hist(methyl_mat, xlab = "M-value")
 ```
 
 You can see that there are two peaks in this distribution, corresponding
@@ -105,7 +105,11 @@ sample-level metadata we have relating to these data. In this case, the
 metadata, phenotypes, and groupings in the `colData` look like this for
 the first 6 samples:
 
-```{r datatable}
+```{r, eval=FALSE}
+head(colData(methylation))
+```
+
+```{r datatable, echo=FALSE}
 knitr::kable(head(colData(methylation)), row.names = FALSE)
 ```
 
@@ -1029,15 +1033,10 @@ conservative, especially with a lot of features!
 ```{r p-fwer, fig.cap="Bonferroni correction often produces very large p-values, especially with low sample sizes.", fig.alt="Plot of Bonferroni-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
 p_raw <- toptab_age$P.Value
 p_fwer <- p.adjust(p_raw, method = "bonferroni")
-library("ggplot2")
-ggplot() +
-    aes(p_raw, p_fwer) +
-    geom_point() +
-    scale_x_log10() + scale_y_log10() +
-    geom_abline(slope = 1, linetype = "dashed") +
-    geom_hline(yintercept = 0.05, linetype = "dashed", col = "red") +
-    geom_vline(xintercept = 0.05, linetype = "dashed", col = "red") +
-    labs(x = "Raw p-value", y = "Bonferroni p-value")
+plot(p_raw, p_fwer, pch = 16, log="xy")
+abline(0:1, lty = "dashed")
+abline(v = 0.05, lty = "dashed", col = "red")
+abline(h = 0.05, lty = "dashed", col = "red")
 ```
 
 You can see that the p-values are exactly one for the vast majority of
@@ -1090,7 +1089,7 @@ experiment over and over.
 > >          \frac{0.05}{100} = 0.0005
 > >     $$
 > >
-> > 2.  Trick question! We can't say what proportion of these genes are
+> > 2.  We can't say what proportion of these genes are
 > >     truly different. However, if we repeated this experiment and
 > >     statistical test over and over, on average 5% of the results
 > >     from each run would be false discoveries.
@@ -1100,25 +1099,17 @@ experiment over and over.
 > >
 > >     ```{r p-fdr, fig.cap="Benjamini-Hochberg correction is less conservative than Bonferroni", fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
 > >     p_fdr <- p.adjust(p_raw, method = "BH")
-> >     ggplot() +
-> >         aes(p_raw, p_fdr) +
-> >         geom_point() +
-> >         scale_x_log10() + scale_y_log10() +
-> >         geom_abline(slope = 1, linetype = "dashed") +
-> >         geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
-> >         geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
-> >         labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value")
+> >     plot(p_raw, p_fdr, pch = 16, log="xy")
+> >     abline(0:1, lty = "dashed")
+> >     abline(v = 0.05, lty = "dashed", col = "red")
+> >     abline(h = 0.05, lty = "dashed", col = "red")
 > >     ```
 > >
 > >     ```{r plot-fdr-fwer, fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against Bonferroni-adjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
-> >     ggplot() +
-> >         aes(p_fdr, p_fwer) +
-> >         geom_point() +
-> >         scale_x_log10() + scale_y_log10() +
-> >         geom_abline(slope = 1, linetype = "dashed") +
-> >         geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
-> >         geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
-> >         labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
+> >     plot(p_fwer, p_fdr, pch = 16, log="xy")
+> >     abline(0:1, lty = "dashed")
+> >     abline(v = 0.05, lty = "dashed", col = "red")
+> >     abline(h = 0.05, lty = "dashed", col = "red")
 > >     ```
 > >
 > {: .solution}
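
Reviewer note (not part of the commit): the base-graphics plots that replace the `ggplot2` versions above can be tried without the lesson data. The sketch below is only an illustration under stated assumptions: the p-values are simulated (hypothetical, not the `toptab_age` results), and the mix of 900 null and 100 non-null features is arbitrary. It applies the same `p.adjust()` calls and the same `plot()`/`abline()` pattern as the diff, and checks two properties that hold in general: Bonferroni multiplies each p-value by the number of tests (capped at 1), and Benjamini-Hochberg adjusted values are never larger than the Bonferroni ones.

```r
# Simulated p-values (an assumption for illustration, not lesson data):
# 900 null features (uniform) plus 100 features with signal (skewed towards 0).
set.seed(42)
p_raw <- c(runif(900), rbeta(100, 0.5, 20))

# Same adjustments as in the episode.
p_fwer <- p.adjust(p_raw, method = "bonferroni")
p_fdr  <- p.adjust(p_raw, method = "BH")

# Bonferroni is the raw p-value times the number of tests, capped at 1.
stopifnot(isTRUE(all.equal(p_fwer, pmin(1, p_raw * length(p_raw)))))

# Benjamini-Hochberg is never more conservative than Bonferroni.
stopifnot(all(p_fdr <= p_fwer + 1e-12))

# Number of features below 0.05 under each correction.
n_sig <- c(bonferroni = sum(p_fwer < 0.05), BH = sum(p_fdr < 0.05))
print(n_sig)

# Base-graphics version of the comparison plot, as in the diff.
plot(p_raw, p_fdr, pch = 16, log = "xy",
     xlab = "Raw p-value", ylab = "Benjamini-Hochberg p-value")
abline(0:1, lty = "dashed")                    # identity line (y = x)
abline(v = 0.05, lty = "dashed", col = "red")  # threshold on raw p-values
abline(h = 0.05, lty = "dashed", col = "red")  # threshold on adjusted p-values
```

One design point worth noting: with `log = "xy"`, `abline(0:1)` is drawn in the log-transformed coordinates by default, so it still marks the identity line. The `xlab`/`ylab` arguments are an addition here; the committed code relies on the default axis labels.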