Update 01-introduction-to-high-dimensional-data.Rmd #67

Merged
merged 13 commits into from
Jun 7, 2022
182 changes: 102 additions & 80 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -37,30 +37,34 @@ knitr_fig_path("01-")

# What are high-dimensional data?

*High-dimensional data* are defined as data in which the number of features
in the data, $p$, are equal or larger than the number of observations (or data
points), $n$. Unlike for *low-dimensional data* in which the number of observations,
$n$, far outnumbers the number of features, $p$, analyses of high-dimensional data requires
consideration of potential problems that come from having a large number of
features.
*High-dimensional data* are defined as data in which the number of features (variables observed),
$p$, is close to or larger than the number of observations (or data points), $n$.
The opposite is *low-dimensional data* in which the number of observations,
$n$, far outnumbers the number of features, $p$. A related concept is *wide data*, which
refers to data with numerous features irrespective of the number of observations (similarly,
*tall data* is often used to denote data with a large number of observations).
Analyses of high-dimensional data require consideration of potential problems that
come from having more features than observations.


High-dimensional data have become more common in many scientific fields as new
automated data collection techniques have been developed. More and more datsets
have a large number of features (or variables) and some have as many features as
there are rows in the dataset. Datasets in which $p$>=$n$ are becoming more
common. Such datasets pose a challenge for data analysis as standard methods of
analysis, such as linear regression, are no longer appropriate.
automated data collection techniques have been developed. More and more datasets
have a large number of features and some have as many features as there are rows
in the dataset. Datasets in which $p \geq n$ are becoming more common. Such datasets
pose a challenge for data analysis as standard methods of analysis, such as linear
regression, are no longer appropriate.
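
To make the problem with linear regression concrete, here is a minimal sketch on simulated data (the object names, the sizes `n = 10` and `p = 15`, and the seed are assumptions chosen for illustration, not part of the lesson data). With more features than observations, `lm()` cannot estimate every coefficient and the fitted line passes through every point exactly.

```r
# Simulated example: more features (p) than observations (n).
set.seed(42)
n <- 10
p <- 15

x <- matrix(rnorm(n * p), nrow = n, ncol = p)  # n rows, p feature columns
y <- rnorm(n)                                  # response is pure noise
dat <- data.frame(y = y, x)                    # feature columns are named X1..X15

fit <- lm(y ~ ., data = dat)

# Coefficients that could not be estimated are returned as NA.
n_estimated <- sum(!is.na(coef(fit)))
n_estimated

# The residuals are numerically zero: a "perfect" fit to pure noise.
max_abs_resid <- max(abs(residuals(fit)))
max_abs_resid
```

At most `n` coefficients (including the intercept) can be estimated here, so the remaining ones are reported as `NA`, and the apparently perfect fit tells us nothing about how the model would behave on new data.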

High-dimensional datasets are common in the biological sciences. Subjects like
genomics and medical sciences often use both large (in terms of $n$) and wide
genomics and medical sciences often use both tall (in terms of $n$) and wide

(in terms of $p$) datasets that can be difficult to analyse or visualise using
standard statistical tools. An example of high-dimensional data in biological
sciences may include data collected from hospital patients recording symptoms,
blood test results, behaviours, and general health, resulting in datasets with
large numbers of features. Researchers often want to relate these features to
specific patient outcomes (e.g. survival, length of time spent in hospital).
An example of what high-dimensional data might look like in a biomedical study
is shown in Figure 1.
is shown in the figure below.

```{r table-intro, echo = FALSE}
knitr::include_graphics("../fig/intro-table.png")
@@ -90,7 +94,11 @@ knitr::include_graphics("../fig/intro-table.png")
>
> > ## Solution
> >
> > 2 and 4
> > 1. No. The number of observations (100 patients) is far greater than the number of features (3).
> > 2. Yes, this is an example of high-dimensional data. There are only 100 observations but 200,000+3 features.
> > 3. No. There are many more observations (200 patients) than features (5).
> > 4. Yes. There is only one observation of more than 20,000 features.
> {: .solution}
{: .challenge}

@@ -125,11 +133,17 @@ of the challenges we are facing when working with high-dimensional data.

> ## Challenge 2
>
> Load the `Prostate` dataset from the `lasso2` package and examine the column
> names.
> Load the `Prostate` dataset from the **`lasso2`** package.
> Although technically not a high-dimensional dataset, the `Prostate` data
> will allow us to explore the problems encountered when working with many features.
>
> Examine the dataset (in which each row represents a single patient) and plot
> relationships between the variables using the `pairs` function. Why does it
> Examine the dataset (in which each row represents a single patient) to:
>
> a) Determine how many observations ($n$) and features ($p$) are available (hint: see the `dim()` function)
> b) Examine what variables were measured (hint: see the `names()` and `head()` functions)
> c) Plot the relationship between the variables (hint: see the `pairs()` function).
>
> Why does it become more difficult to plot relationships between pairs of variables with
> increasing numbers of variables? Discuss in groups.
>
@@ -140,8 +154,14 @@ of the challenges we are facing when working with high-dimensional data.
> > data(Prostate) #load the Prostate dataset
> > ```
> >
> > ```{r view-prostate, eval = FALSE}
> > View(Prostate) #view the dataset
> > ```{r dim-prostate, eval = FALSE}
> > dim(Prostate) #print the number of rows and columns
> > ```
> >
> > ```{r head-prostate, eval = FALSE}
> > names(Prostate) # examine the variable names
> > head(Prostate) #print the first 6 rows
> > ```
> >
> > ```{r pairs-prostate}
@@ -154,20 +174,22 @@ of the challenges we are facing when working with high-dimensional data.
> > of variables, but for datasets in which $p$ is larger it becomes difficult
> > (and time consuming) to visualise relationships between all variables in the
> > dataset. Even where visualisation is possible, fitting models to datasets
> > with large numbers of variables is difficult due to the potential for
> > with many variables is difficult due to the potential for
> > overfitting and difficulties in identifying a response variable.
> >
> {: .solution}
{: .challenge}
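
As a rough illustration of why pairwise plots stop scaling, the number of distinct variable pairs is $p(p-1)/2$. The sketch below tabulates this for a few values of $p$ (the specific values are assumptions picked for illustration only).

```r
# Number of distinct pairs of variables for a few illustrative values of p.
p_values <- c(5, 10, 100, 20000)
n_pairs <- choose(p_values, 2)   # p * (p - 1) / 2

pair_table <- data.frame(p = p_values, pairs = n_pairs)
pair_table
```

Because the count grows roughly with the square of $p$, a scatterplot matrix that is readable for a handful of variables quickly becomes impractical to draw or inspect.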

Imagine we are carrying out least squares regression on a dataset with 25
observations. Fitting a best fit line through these data produces a plot shown
in Figure 2a.
in the left-hand panel of the figure below.

However, imagine a situation in which the number of observations and features in a dataset are almost equal.

In that situation the effective number of

However, imagine a situation in which the ratio of observations to features in
a dataset is almost equal. In that situation the effective number of
observations per feature is low. The result of fitting a best fit line through
few observations can be seen in Figure 2b.
few observations can be seen in the right-hand panel below.
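
The effect of having few observations per feature can also be sketched numerically. In the simulated example below (all names, sizes and the seed are assumptions for illustration), the response depends on a single feature, yet adding 20 pure-noise features to a model fitted on 25 observations raises $R^2$ while leaving almost no residual degrees of freedom.

```r
# Simulated example: 25 observations, one informative feature, 20 noise features.
set.seed(1)
n <- 25
x1 <- rnorm(n)
y <- 2 * x1 + rnorm(n)                    # y depends on x1 only
noise <- matrix(rnorm(n * 20), nrow = n)  # 20 features unrelated to y
dat <- data.frame(y = y, x1 = x1, noise)  # noise columns are named X1..X20

fit_small <- lm(y ~ x1, data = dat)       # 1 feature
fit_large <- lm(y ~ ., data = dat)        # 21 features

# R-squared cannot decrease when features are added, even useless ones.
r2_small <- summary(fit_small)$r.squared
r2_large <- summary(fit_large)$r.squared
c(small = r2_small, large = r2_large)

# Residual degrees of freedom: observations left over after estimating coefficients.
df_small <- df.residual(fit_small)
df_large <- df.residual(fit_large)
c(small = df_small, large = df_large)
```

The larger model looks better on the data it was fitted to, but that improvement comes from fitting noise, which is the overfitting problem described above.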

```{r intro-figure, echo = FALSE}
knitr::include_graphics("../fig/intro-scatterplot.png")
@@ -189,7 +211,7 @@ in these datasets makes high correlations between variables more likely.

> ## Challenge 3
>
> Use the `cor` function to examine correlations between all variables in the
> Use the `cor()` function to examine correlations between all variables in the
> Prostate dataset. Are some variables highly correlated (i.e. correlation
> coefficients > 0.6)? Fit a multiple linear regression model predicting patient age
> using all variables in the Prostate dataset.
@@ -309,60 +331,60 @@ plot(xgroups, col = selected, pch = 19)
{: .challenge}


# Using Bioconductor to access high-dimensional data in the biosciences

In this workshop, we will look at statistical methods that can be used to
visualise and analyse high-dimensional biological data using packages available
from Bioconductor, open source software for analysing high throughput genomic
data. Bioconductor contains useful packages and example datasets as shown on the
website [https://www.bioconductor.org/](https://www.bioconductor.org/).

Bioconductor packages can be installed and used in `R` using the `BiocManager`
package. Let's install the `minfi` package from Bioconductor (a package for
analysing Illumina Infinium DNA methylation arrays).

```{r libminfi}
library("minfi")
```

```{r vigminfi, eval=FALSE}
browseVignettes("minfi")
```

We can explore these packages by browsing the vignettes provided in
Bioconductor. Bioconductor has various packages that can be used to load and
examine datasets in `R` that have been made available in Bioconductor, usually
along with an associated paper or package.

Next, we load the `methylation` dataset which represents data collected using
Illumina Infinium methylation arrays which are used to examine methylation
across the human genome. These data include information collected from the
assay as well as associated metadata from individuals from whom samples were
taken.

```{r libsload}
library("minfi")
library("here")
library("ComplexHeatmap")

methylation <- readRDS(here("data/methylation.rds"))
head(colData(methylation))

methyl_mat <- t(assay(methylation))
## calculate correlations between cells in matrix
cor_mat <- cor(methyl_mat)
```

```{r view-cor, eval=FALSE}
View(cor_mat[1:100, ])
```

The `assay` function creates a matrix-like object where rows represent probes
for genes and columns represent samples. We calculate correlations between
features in the `methylation` dataset and examine the first 100 cells of this
matrix. The size of the dataset makes it difficult to examine in full, a
common challenge in analysing high-dimensional genomics data.

> ## Using Bioconductor to access high-dimensional data in the biosciences
>
> In this workshop, we will look at statistical methods that can be used to
> visualise and analyse high-dimensional biological data using packages available
> from Bioconductor, open source software for analysing high throughput genomic
> data. Bioconductor contains useful packages and example datasets as shown on the
> website [https://www.bioconductor.org/](https://www.bioconductor.org/).
>
> Bioconductor packages can be installed and used in `R` using the **`BiocManager`**
> package. Let's install the **`minfi`** package from Bioconductor (a package for
> analysing Illumina Infinium DNA methylation arrays).
>
> ```{r libminfi}
> library("minfi")
> ```
>
> ```{r vigminfi, eval=FALSE}
> browseVignettes("minfi")
> ```
>
> We can explore these packages by browsing the vignettes provided in
> Bioconductor. Bioconductor has various packages that can be used to load and
> examine datasets in `R` that have been made available in Bioconductor, usually
> along with an associated paper or package.
>
> Next, we load the `methylation` dataset which represents data collected using
> Illumina Infinium methylation arrays which are used to examine methylation
> across the human genome. These data include information collected from the
> assay as well as associated metadata from individuals from whom samples were
> taken.
>
> ```{r libsload}
> library("minfi")
> library("here")
> library("ComplexHeatmap")
>
> methylation <- readRDS(here("data/methylation.rds"))
> head(colData(methylation))
>
> methyl_mat <- t(assay(methylation))
> ## calculate correlations between features (the columns of methyl_mat)
> cor_mat <- cor(methyl_mat)
> ```
>
> ```{r view-cor, eval=FALSE}
> cor_mat[1:10, 1:10] # print the top-left corner of the correlation matrix
> ```
>
> The `assay()` function creates a matrix-like object where rows represent probes
> for genes and columns represent samples. We calculate correlations between
> features in the `methylation` dataset and examine the first 100 cells of this
> matrix. The size of the dataset makes it difficult to examine in full, a
> common challenge in analysing high-dimensional genomics data.
{: .callout}
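
To see why the correlation matrix becomes hard to inspect, the following sketch repeats the same steps on a small simulated matrix laid out like an assay (features in rows, samples in columns). The object names, the sizes and the seed are assumptions for illustration; no Bioconductor packages or lesson data are needed to run it.

```r
# Simulated stand-in for an assay matrix: rows are features, columns are samples.
set.seed(123)
n_samples <- 30
n_features <- 500
sim_assay <- matrix(rnorm(n_features * n_samples),
                    nrow = n_features, ncol = n_samples)

# Transpose so that samples are rows and features are columns,
# then compute correlations between features.
sim_mat <- t(sim_assay)
sim_cor <- cor(sim_mat)

dim(sim_cor)                                # one row and one column per feature
round(sim_cor[1:5, 1:5], 2)                 # print only the top-left corner

n_unique_pairs <- choose(n_features, 2)     # distinct feature pairs
n_unique_pairs
```

Even with only 500 simulated features the matrix holds well over a hundred thousand distinct pairwise correlations, so printing a small corner is usually the only practical way to look at it directly.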

# Further reading
