
Commit

Merge branch 'main' into mary-suggestions-tasks28plus-ep4
mallewellyn authored Mar 25, 2024
2 parents 5844f4b + c6e2ed0 commit 30b2846
Showing 5 changed files with 276 additions and 212 deletions.
44 changes: 22 additions & 22 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -124,11 +124,11 @@ of the challenges we are facing when working with high-dimensional data.
> encountered when working with many features in a high-dimensional data set.
>
> First, make sure you have completed the setup instructions [here](https://carpentries-incubator.github.io/high-dimensional-stats-r/setup.html).
> Next, let's Load the `Prostate` dataset as follows:
> Next, let's load the `prostate` dataset as follows:
>
> ```{r prostate}
> library("here")
> Prostate <- readRDS(here("data/prostate.rds"))
> prostate <- readRDS(here("data/prostate.rds"))
> ```
>
> Examine the dataset (in which each row represents a single patient) to:
@@ -142,21 +142,21 @@ of the challenges we are facing when working with high-dimensional data.
> >
> >
> > ```{r dim-prostate, eval = FALSE}
> > dim(Prostate) #print the number of rows and columns
> > dim(prostate) #print the number of rows and columns
> > ```
> >
> > ```{r head-prostate, eval = FALSE}
> > names(Prostate) # examine the variable names
> > head(Prostate) #print the first 6 rows
> > names(prostate) # examine the variable names
> > head(prostate) #print the first 6 rows
> > ```
> >
> > ```{r pairs-prostate}
> > names(Prostate) #examine column names
> > names(prostate) #examine column names
> >
> > pairs(Prostate) #plot each pair of variables against each other
> > pairs(prostate) #plot each pair of variables against each other
> > ```
> > The `pairs()` function plots relationships between each of the variables in
> > the `Prostate` dataset. This is possible for datasets with smaller numbers
> > the `prostate` dataset. This is possible for datasets with smaller numbers
> > of variables, but for datasets in which $p$ is larger it becomes difficult
> > (and time consuming) to visualise relationships between all variables in the
> > dataset. Even where visualisation is possible, fitting models to datasets
@@ -211,7 +211,7 @@ explore why high correlations might be an issue in a Challenge.
> ## Challenge 3
>
> Use the `cor()` function to examine correlations between all variables in the
> `Prostate` dataset. Are some pairs of variables highly correlated using a threshold of
> `prostate` dataset. Are some pairs of variables highly correlated using a threshold of
> 0.75 for the correlation coefficients?
>
> Use the `lm()` function to fit univariate regression models to predict patient
@@ -224,11 +224,11 @@ explore why high correlations might be an issue in a Challenge.
>
> > ## Solution
> >
> > Create a correlation matrix of all variables in the Prostate dataset
> > Create a correlation matrix of all variables in the `prostate` dataset
> >
> > ```{r cor-prostate}
> > cor(Prostate)
> > round(cor(Prostate), 2) # rounding helps to visualise the correlations
> > cor(prostate)
> > round(cor(prostate), 2) # rounding helps to visualise the correlations
> > ```
> >
> > As seen above, some variables are highly correlated. In particular, the
@@ -238,15 +238,15 @@ explore why high correlations might be an issue in a Challenge.
> > as predictors.
> >
> > ```{r univariate-prostate}
> > model1 <- lm(age ~ gleason, data = Prostate)
> > model2 <- lm(age ~ pgg45, data = Prostate)
> > model_gleason <- lm(age ~ gleason, data = prostate)
> > model_pgg45 <- lm(age ~ pgg45, data = prostate)
> > ```
> >
> > Check which covariates have a significant effect
> >
> > ```{r summary-prostate}
> > summary(model1)
> > summary(model2)
> > summary(model_gleason)
> > summary(model_pgg45)
> > ```
> >
> > Based on these results we conclude that both `gleason` and `pgg45` have a
@@ -257,8 +257,8 @@ explore why high correlations might be an issue in a Challenge.
> > as predictors
> >
> > ```{r multivariate-prostate}
> > model3 <- lm(age ~ gleason + pgg45, data = Prostate)
> > summary(model3)
> > model_multivar <- lm(age ~ gleason + pgg45, data = prostate)
> > summary(model_multivar)
> > ```
> >
> > Although `gleason` and `pgg45` have statistically significant univariate effects,
@@ -298,7 +298,7 @@ In this course, we will cover four methods that help in dealing with high-dimens
(3) dimensionality reduction, and (4) clustering. Here are some examples of when each of
these approaches may be used:
(1) Regression with numerous outcomes refers to situations in which there are
1. Regression with numerous outcomes refers to situations in which there are
many variables of a similar kind (expression values for many genes, methylation
levels for many sites in the genome) and when one is interested in assessing
whether these variables are associated with a specific covariate of interest,
@@ -308,7 +308,7 @@ predictor) could be fitted independently. In the context of high-dimensional
molecular data, typical examples are *differential gene expression* analyses.
We will explore this type of analysis in the *Regression with many outcomes* episode.
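The per-feature regression idea described above can be sketched in base R. This is an illustrative sketch using simulated data, not the expression dataset the lesson itself uses; all variable names here are made up:

```r
# Sketch: one univariate regression per feature (e.g. per gene), each
# testing association with a single covariate of interest (here, age).
set.seed(42)
n_samples  <- 20
n_features <- 100
# Simulated "expression" matrix: one row per feature, one column per sample
expr <- matrix(rnorm(n_samples * n_features), nrow = n_features)
age  <- rnorm(n_samples, mean = 50, sd = 10)

# Fit lm() independently for each feature and collect the age p-values
pvals <- apply(expr, 1, function(y) {
  fit <- lm(y ~ age)
  summary(fit)$coefficients["age", "Pr(>|t|)"]
})
head(pvals)
```

In practice, dedicated packages (covered later in the lesson) fit these many models far more efficiently than a loop over `lm()`, and also share information across features.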
(2) Regularisation (also known as *regularised regression* or *penalised regression*)
2. Regularisation (also known as *regularised regression* or *penalised regression*)
is typically used to fit regression models when there is a single outcome
variable of interest, but the number of potential predictors is large, e.g.
there are more predictors than observations. Regularisation can help to prevent
@@ -318,14 +318,14 @@ been often used when building *epigenetic clocks*, where methylation values
across several thousands of genomic sites are used to predict chronological age.
We will explore this in more detail in the *Regularised regression* episode.
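To see why a penalty helps when there are more predictors than observations, here is a minimal sketch of ridge regression via its closed-form solution, on simulated data. This is illustrative only (in practice one would use a package such as glmnet, as the episode does):

```r
# Sketch: ridge regression with p > n, where ordinary least squares fails.
set.seed(1)
n <- 20   # observations
p <- 50   # predictors (more predictors than observations)
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

# Penalised estimate: (X'X + lambda * I)^(-1) X'y
lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)

# Without the penalty, t(X) %*% X is rank-deficient here (rank <= n < p),
# so solve() would fail and the usual least-squares estimate does not exist.
length(beta_ridge)
```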
(3) Dimensionality reduction is commonly used on high-dimensional datasets for
3. Dimensionality reduction is commonly used on high-dimensional datasets for
data exploration or as a preprocessing step prior to other downstream analyses.
For instance, a low-dimensional visualisation of a gene expression dataset may
be used to inform *quality control* steps (e.g. are there any anomalous samples?).
This course contains two episodes that explore dimensionality reduction
techniques: *Principal component analysis* and *Factor analysis*.
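A dimensionality-reduction step of this kind can be sketched with base R's `prcomp()`. The built-in `iris` measurements stand in for a gene expression matrix here, purely for illustration:

```r
# Sketch: PCA as a low-dimensional summary of a wider data matrix.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance explained by each principal component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 2)

# The first two components could then be plotted for quality-control-style
# checks (e.g. spotting anomalous samples):
# plot(pca$x[, 1:2], col = iris$Species)
```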
(4) Clustering methods can be used to identify potential grouping patterns
4. Clustering methods can be used to identify potential grouping patterns
within a dataset. A popular example is the *identification of distinct cell types*
through clustering cells with similar gene expression patterns. The *K-means*
episode will explore a specific method to perform clustering analysis.
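Clustering of this kind can be sketched with base R's `kmeans()`; again the `iris` measurements stand in for per-cell gene expression profiles, purely as an illustration:

```r
# Sketch: k-means on scaled measurements, asking for 3 clusters.
set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Cross-tabulate the recovered clusters against the known species labels
table(km$cluster, iris$Species)
```

Choosing the number of clusters (here fixed at 3) is itself a non-trivial question, which the *K-means* episode discusses.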
5 changes: 2 additions & 3 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -621,7 +621,7 @@ head(design_age)
> that minimises the differences between outcome values and those values
> predicted by using the covariates (or predictor variables). But how do we get
> from a set of predictors and regression coefficients to predicted values? This
> is done via matrix multipliciation. The matrix of predictors is (matrix)
> is done via matrix multiplication. The matrix of predictors is (matrix)
> multiplied by the vector of coefficients. That matrix is called the
> **model matrix** (or design matrix). It has one row for each observation and
> one column for each predictor plus (by default) one additional column of ones
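The model-matrix idea in the callout above can be sketched on a small made-up data frame (the names and coefficients below are hypothetical, not the lesson's data):

```r
# Sketch: from a model matrix and coefficients to predicted values.
df <- data.frame(age = c(25, 40, 60), smoke = c(0, 1, 1))
X <- model.matrix(~ age + smoke, data = df)
X  # first column is the column of ones (the intercept)

beta <- c(1.0, 0.02, -0.5)   # hypothetical coefficient vector
fitted_vals <- X %*% beta    # matrix multiplication gives the predictions
fitted_vals
```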
@@ -669,8 +669,7 @@ of the input matrix.

```{r ebayes-toptab}
toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age))
orderEffSize <- rev(order(abs(toptab_age$logFC))) # order by effect size (absolute log-fold change)
head(toptab_age[orderEffSize, ])
head(toptab_age)
```

The output of `topTable` includes the coefficient, here termed a log