diff --git a/CITATION b/CITATION index 56ece3c4..b020a963 100644 --- a/CITATION +++ b/CITATION @@ -1 +1,2 @@ -FIXME: describe how to cite this lesson. \ No newline at end of file +O’Callaghan A, Robertson G, Llewellyn M, Becher H, Meynert A, Vallejos C, Ewing A. (2024). High dimensional statistics with R. https://github.com/ +carpentries-incubator/high-dimensional-stats-r. diff --git a/README.md b/README.md index 3dcf224b..6bc3efbc 100644 --- a/README.md +++ b/README.md @@ -2,21 +2,7 @@ [![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) -**Thanks for contributing to The Carpentries Incubator!** -This repository provides a blank starting point for lessons to be developed -here. - -A member of the [Carpentries Curriculum Team](https://carpentries.org/team/) -will work with you to get your lesson listed on the -[Community Developed Lessons page][community-lessons] -and make sure you have everything you need to begin developing your new lesson. - -## What to do next - -Before you begin developing your new lesson, -here are a few things we recommend you do: - -* [ ] [Add relevant topic tags to your lesson repository][cdh-topic-tags]. +This repository is part of The Carpentries Incubator, a place for The Carpentries community to collaboratively create, test, and improve lessons. ## Contributing @@ -42,6 +28,10 @@ Look for the tag This indicates that the maintainers will welcome a pull request fixing this issue. +## Reviews + +The lesson has been iteratively developed and improved. For information on the development process, reviews, and instructor feedback following teaching, see [REVIEWS](reviews.md). 
+ ## Maintainer(s) Current maintainers of this lesson are diff --git a/_episodes_rmd/05-factor-analysis.Rmd b/_episodes_rmd/05-factor-analysis.Rmd index eca10e68..78d19295 100644 --- a/_episodes_rmd/05-factor-analysis.Rmd +++ b/_episodes_rmd/05-factor-analysis.Rmd @@ -80,29 +80,10 @@ components are ordered by the amount of variance they account for. # Prostate cancer patient data -The prostate dataset represents data from 97 men who have prostate cancer. -The data come from a study which examined the correlation between the level -of prostate specific antigen and a number of clinical measures in men who were -about to receive a radical prostatectomy. The data have 97 rows and 9 columns. +We revisit the prostate dataset of 97 men who have prostate cancer. Although not strictly a high-dimensional dataset, as with other episodes, we use this dataset to explore the method. - -Columns are: - - -- `lcavol`: log (cancer volume) -- `lweight`: log (prostate weight) -- `age`: age (years) -- `lbph`: log (benign prostatic hyperplasia amount) -- `svi`: seminal vesicle invasion -- `lcp`: log (capsular penetration); amount of spread of cancer in outer walls - of prostate -- `gleason`: [Gleason score](https://en.wikipedia.org/wiki/Gleason_grading_system) -- `pgg45`: percentage Gleason scores 4 or 5 -- `lpsa`: log (prostate specific antigen) - - In this example, we use the clinical variables to identify factors representing various clinical variables from prostate cancer patients. Two principal components have already been identified as explaining a large proportion diff --git a/_extras/data.md b/_extras/data.md new file mode 100644 index 00000000..c725628f --- /dev/null +++ b/_extras/data.md @@ -0,0 +1,88 @@ +--- +title: "Data" +--- + +# Prostate cancer data +[Source](https://search.r-project.org/CRAN/refmans/bayesQR/html/Prostate.html) + +Prostate specific antigen values and clinical measures for 97 patients hospitalised for a radical prostatectomy. 
Prostate specimens underwent histological and morphometric analysis. The column names refer to + +- lcavol: log(cancer volume) +- lweight: log(prostate weight) +- age: age +- lbph: log(benign prostatic hyperplasia amount) +- svi: seminal vesicle invasion +- lcp: log(capsular penetration) +- gleason: Gleason score +- pgg45: percentage Gleason scores 4 or 5 +- lpsa: log(prostate specific antigen) + +# Methylation data + +[Source](https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.Blood.EPIC.html) + +Illumina Human Methylation data from EPIC on sorted peripheral adult blood cell populations. The data record DNA methylation assays for each individual, which measure, for many sites in the genome, the proportion of DNA that carries a methyl mark (a chemical modification that does not alter the DNA sequence). The methylation assays are recorded as normalised methylation levels (M-values), where negative values correspond to unmethylated DNA and positive values correspond to methylated DNA. The data object also contains phenotypic metadata for each individual such as age and BMI. Precisely, the data object contains: + +- assay(data): normalised methylation levels +- colData(data): individual-level information + - Sample_Well: sample well + - Sample_Name: name of sample + - purity: sample cell purity + - Sex: sex + - Age: age in years + - weight_kg: weight in kilograms + - height_m: height in metres + - bmi: BMI + - bmi_clas: BMI class + - Ethnicity_wide: ethnicity, wide class + - Ethnic_self: ethnicity, self-identified + - smoker: yes/no indicator of smoker status + - Array: type of array from the EPIC array library + - Slide: slide identifier + +# Horvath data + +[Source](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014821#s5) + +Methylation markers across different age groups. The CpGmarker variable used in this lesson contains CpG site encodings. 
+ +# Breast cancer gene expression data + +[Source](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2990) + +Gene expression data showing microarray results for different probes used to examine gene expression profiles in 91 different breast cancer patient samples, and metadata for the sampled patients. + +- assay(data): gene expression data for each individual +- colData(data): individual-level information + - Study: study identifier + - Age: age in years + - Distant.RFS: indicator of distant relapse free survival + - ER: estrogen receptor positive or negative status + - GGI: gene expression grade index + - Grade: histologic grade + - Size: tumour size in cm + - Time.RFS: time between the date of surgery and diagnosis of relapse (time in relapse free survival, RFS) + +# Single-cell RNA sequencing data + +[Source](https://pubmed.ncbi.nlm.nih.gov/25700174/) + +Gene expression measurements for over 9000 genes in over 3000 mouse cortex and hippocampus cells. These data are an excerpt of the original source. + +- assay(data): gene expression data +- colData(data): individual cell-level information + - tissue: tissue type + - group #: group number + - total mRNA mol: total number of observed mRNA molecules corresponding to this cell's unique barcode identifier + - well: the well that this cell's cDNA was stored in during processing + - sex: sex of the donor animal + - age: age of the donor animal + - diameter: estimated cell diameter + - cell_id: cell identifier + - level1class: a cluster label identified using a mix of computational techniques and manual annotation + - level2class: a cluster label identified using a mix of computational techniques and manual annotation + - sizeFactor: estimated size factor for scaling normalisation, calculated using, e.g., **`scran`**. 
+ + +{% include links.md %} + diff --git a/reference.md b/reference.md index f7cdcb6c..24bac376 100644 --- a/reference.md +++ b/reference.md @@ -2,7 +2,6 @@ layout: reference --- -## Glossary {% include links.md %} diff --git a/reviews.md b/reviews.md new file mode 100644 index 00000000..c95990d1 --- /dev/null +++ b/reviews.md @@ -0,0 +1,137 @@ +# Reviews +The purpose of this document is to summarise and track how the lesson has developed in response to peer reviews, feedback from instructors and Carpentries advice. We also detail the main changes that still need to be made and thus define a roadmap to publication. + +Note that the lesson has been developed over around 3 years and iteratively improved. This document only highlights reviews contributed by reviewers external to the main authors, except following rounds of teaching. Details of other improvements can be found throughout the repository and the list of authors is given in [AUTHORS](AUTHORS). + +Thank you to our reviewers and instructors for their feedback. If you would like to submit a review or pull request, please see our [Contribution Guide](https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/main/CONTRIBUTING.md) for more information. + +## Peer reviews +**Review by Emma Rand on Episode 1: Introduction to high-dimensional data ([#39](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/39))** + +The reviewer liked this episode as an introduction to the course, particularly that high-dimensional data were defined explicitly with examples, that important points were reiterated in the text, and that the motivation for using alternative methods when considering high-dimensional data was given. The comments pertained to the entire episode, with the big changes relating to expanding the questions or solutions for the challenges, formatting inline code and explaining the reasons for package use. 
+ +☑ Changes were made in line with all the suggestions exactly, and are itemised in the issue and the associated points in [#64](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/64). + +**Review by Emma Rand on Episode 2: Regression with many outcomes ([#47](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/47))** + +The reviewer particularly liked that this episode demonstrates why we need alternative approaches to regression for high-dimensional data and the multiple testing section. Although many comments were given, the reviewer highlighted that the episode was long, that new concepts should be removed from Challenge 1 and that the smoking model figure should be corrected. The review also highlighted issues with the remote theme. + +☑ Changes were made in line with the suggestions exactly, including reducing the length of the lesson. The changes are itemised in the issue, the associated points in [#64](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/64) and the current version of the lesson. + +**Review by Emma Rand on Episode 3: Regularised regression ([#49](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/49))** + +The reviewer liked that the episode emphasises a genuine understanding of the methods. Amongst the full review comments, the reviewer commented that the episode is long and suggested some sections to remove. The reviewer also suggested several points that could be expanded to improve the use of statistical 'jargon' and drawing links between jargon to make the episode more approachable to a biological sciences audience. + +☑ All the suggested changes were made and are detailed in the issue. 
+ + +**Review by Christie Barron on Episode 5: Factor analysis ([#53](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/53))** + +Christie commented that it may be useful to discuss confirmatory factor analysis in addition to exploratory factor analysis to clarify that this is another approach that can be used. In addition, approaches to factor enumeration could be discussed and R packages that make factor analysis easier. + +☑ All of the suggested changes were made and are detailed in the issue and in commits 14584c8 and 3419337. + +**Review by Mary Llewellyn on Episode 1: Introduction to high-dimensional data ([#112](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/112))** + +The reviewer liked that this episode struck a good balance between motivating the course clearly whilst avoiding cognitive overload. The reviewer suggested changes largely related to adding signposting, foreshadowing to motivate the entire lesson from the start and re-ordering paragraphs. The reviewer also suggested minor wording changes to, for example, clarify the difference between the "Challenges" (exercises) and "challenges" in the general sense and to ensure learners, especially independent learners, had completed the setup instructions. + +☑ All the suggested changes were made, detailed in the issue. From the discussions following this review, we have also clarified the definition of high-dimensional data and plan to set up a data description page [#132](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/132). + + + +**Review by Mary Llewellyn on Episode 2: Regression with many outcomes ([#114](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/114))** + +The reviewer wrote that they really liked the episode and believed it's really valuable to explore many outcomes as well as many predictors. 
They had a few queries on the episode and suggested mainly that some of the more complex programming concepts could be removed to avoid cognitive overload, and clarification about the motivation of the episode as avoiding data dredging. + +☑ All of the suggested changes were made and are detailed in the issue. + +**Review by Mary Llewellyn on Episode 3: Regularised regression ([#115](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/115))** + +The reviewer liked the episode and commented that, although it's long, it makes challenging ideas approachable. The suggestions largely related to how regularisation is motivated and linking ideas to the previous lesson, signposting, how singularities are described and the placement of the linear regression section within the episode. Further adjustments were recommended for independent learners. + +☑ All of the suggested changes were made, detailed in the issue. + + +**Review by Mary Llewellyn on Episode 4: Principal Component Analysis ([#117](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/117))** + +The reviewer really liked the way that PCA is presented practically. The main comments related to clarifying the motivation for various parts of the episode, moving discussion of advantages and disadvantages to the end of the episode, signposting, making it clearer when examples are demonstrative and streamlining package use. + +☑ All of the suggested changes were made and are detailed in the issue. Various additional changes were made following the comments (detailed in the issue), including refining the number of PCA packages used to one, simplifying the scree plots and adding further detail to the code comments. 
+ +**Review by Mary Llewellyn on Episode 5: Factor analysis ([#118](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/118))** + +The reviewer thought that this episode was well-balanced with the previous episode and had a few suggestions to differentiate between factor analysis and PCA, how latent variables are defined, signposting, moving discussion of advantages and disadvantages to the end of the episode, some wording around the hypothesis tests, and some adaptations for the individual learner. + +☑ All of the suggested changes were made, detailed in the issue. Additional changes were made following the comments with respect to removing discussion of the rotations to reduce the likelihood of cognitive overload. + +**Review by Mary Llewellyn on Episode 6: K-means ([#119](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/119))** + +The reviewer really liked that this episode builds gradually from an initial example and stated that this makes the narrative very clear. Most of the suggestions were with respect to wording, minor re-ordering of sections, signposting and differentiating K-means from the methods already introduced. + +☑ All of the suggested changes were made and are detailed in the issue. + +**Review by Mary Llewellyn on Episode 7: Hierarchical clustering ([#120](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/120))** + +The reviewer liked how this episode built on the K-means clustering of the previous episode and the use of visualisation to illustrate the concepts in the episode. The suggestions related to adding further motivation for the episode, structural re-ordering, annotating plots and code, and signposting. + +☑ All of the suggested changes were made and are detailed in the issue. 
+ + + +## Instructor feedback +**Feedback from teaching 21st October 2021 ([#33](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/33))** + +Overall, the learners liked that the practical examples were clear and easily understood by biologists, the slides were informative and well-presented, and they liked how useful the lesson is. They particularly noted that they liked the pace and depth of the first two episodes and the visualisations in episode 7. + +There were some issues with equation rendering in Chrome and some learners found that the pace could be a little faster in places. Episode-specific comments noted that episode 3 was too theoretical, episodes 4 and 5 could contain more code comments for learners looking back on the course and episodes 6 and 7 could contain more examples and give an overview of the general steps of each method/when each is useful. + +☑ The lesson has been iteratively improved over time. As such, episode 3 is now presented much more practically (fewer mathematical expressions, existing theoretical concepts are more clearly and practically motivated, additional content such as Bayesian methods has been removed to focus on the concepts already introduced). Episodes 4 and 5 have almost completely changed and the code is commented and motivated much more clearly. Episodes 6 and 7 now motivate the practical uses of each method more clearly. From teaching, the course was also improved by shortening the introduction to focus only on the difficulties of high-dimensional data, episode 3 was presented from a more practical perspective and episode 5 was made more detailed. 
+ +**Additional changes following teaching from February-June 2022 ([#52](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/33), [#57](https://github.com/carpentries-incubator/high-dimensional-stats-r/pull/57), [#63](https://github.com/carpentries-incubator/high-dimensional-stats-r/pull/63), [#64](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/64), [#76](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/76))** + +Several other changes following notes and feedback from teaching are detailed in these issues. + +**Feedback from Edward Wallace from teaching September 2022 ([#86](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/86), [#88](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/88))** + +The learners found the lessons relevant to their work, particularly episodes 4 and 5, which they thought were explained really well, were practical and easy to follow, and introduced concepts they found important to their work at a level that was understandable to them. They said that the way these episodes were presented helped them to fill the gaps in their understanding from practical implementation. They also particularly liked the coding and visualisation in episode 3. + +Many of the comments related to timings (allowing more time) and clarifying wording. + +☑ Changes were made in response to most comments in [#86](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/86), [#88](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/88) and [#167](https://github.com/carpentries-incubator/high-dimensional-stats-r/pull/167), and any remaining changes are evident in the current lesson. + +**Feedback from February 2024 teaching ([#145](https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/145))** + +The instructors liked teaching the course and found it fun to teach. 
Comments largely related to timing adjustments, explaining packages, adjusting the way the factor analysis episode is presented and some structural changes to the first three episodes. + +☑ Changes in response to this feedback are documented in the issue. + +## Carpentries-specific + +☑ The lesson has been developed using The Carpentries template. As such, a number of requirements are fulfilled: + +- Alt text and captions complete in line with [The Carpentries guide](https://carpentries.org/blog/2022/11/accathon/). +- Conforms to [The Carpentries Code of Conduct](https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html). +- Tested that the lesson is appropriate for the target audience identified, is accurate, descriptive and easy to understand and is structured to manage cognitive load. +- Does not use dismissive language. +- All lesson tools are open source and the data sets are accessible. +- Tools and data checked for CC0 license compatibility. +- Data sets are representative of data typically encountered in the domain. +- Tested that the example tasks and narrative of the lesson are appropriate and realistic. +- Tested that the solutions to all exercises are accurate and sufficiently explained, and that the tasks and formats are appropriate for the expected experience level of the target audience. +- Exercises are designed with diagnostic power. +- The learning objectives are clear, descriptive and measurable, and focus on the skills being taught and not the functions/tools e.g. "filter the rows of a data frame based on the contents of one or more columns," rather than "use the filter function on a data frame." +- The target audience identified for the lesson is specific and realistic. +- Tested that the list of required prior skills and/or knowledge is complete and accurate. +- The setup and installation instructions are complete, accurate, and easy to follow. 
+- It has been taught at least two times by Instructors who had not been heavily involved in the development of the +lesson before that point. +- Checked that the lesson includes exercises in a variety of formats. +- The example data sets are described. +- Key terms are contained in the internal glossary in the form of key points. +- All lesson and episode objectives are assessed by exercises or another opportunity for formative assessment. +- The lesson does not make use of superfluous data sets. + + +☒ The points still to be addressed are: + +- Conversion to The Carpentries Workbench