Merge pull request #167 from mallewellyn/mary-review-changes
Initial changes in response to instructor feedback
alanocallaghan authored Mar 25, 2024
2 parents e57d5d4 + 62e191b commit 1c9ce40
Showing 3 changed files with 12 additions and 15 deletions.
6 changes: 5 additions & 1 deletion _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -114,7 +114,7 @@ high-dimensional datasets it can also be difficult to identify a single response
variable, making standard data exploration and analysis techniques less useful.

Let's have a look at a simple dataset with lots of features to understand some
-of the challenges we are facing when working with high-dimensional data.
+of the challenges we are facing when working with high-dimensional data.


> ## Challenge 2
@@ -166,6 +166,10 @@ of the challenges we are facing when working with high-dimensional data.
> {: .solution}
{: .challenge}
+Note that function documentation and information on function arguments will be useful throughout
+this lesson. We can access these easily in R by running `?` followed by the function name.
+For example, the documentation for the `dim` function can be accessed by running `?dim`.
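
As a minimal sketch of looking up documentation from the R console, the calls below use only base R. The small matrix `m` is made up purely for illustration and is not part of the lesson data.

```r
# Open the help page for the base R function dim()
?dim

# help() is the equivalent function form of the ? shortcut
help("dim")

# args() prints a function's arguments without opening the help page
args(dim)

# dim() returns the number of rows and columns of an object
m <- matrix(1:6, nrow = 2, ncol = 3)  # toy 2 x 3 matrix (illustrative only)
dim(m)
```

The `?` shortcut is convenient interactively, while `help()` is easier to use inside scripts because the topic is passed as an ordinary character string.
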
> ## Locating data with R - the **`here`** package
>
> It is often desirable to access external datasets from inside R and to write
19 changes: 6 additions & 13 deletions _episodes_rmd/04-principal-component-analysis.Rmd
@@ -76,31 +76,24 @@ resulting principal component could also be used as an effect in further analysi
> hospital with infectious respiratory disease. They would like to determine
> whether length of stay in hospital differs in patients with different
> respiratory diseases.
-> 2. An online retailer has collected data on user interactions with its online
-> app and has information on the number of times each user interacted with
-> the app, what products they viewed per interaction, and the type and cost
-> of these products. The retailer would like to use this information to
-> predict whether or not a user will be interested in a new product.
-> 3. A scientist has assayed gene expression levels in 1000 cancer patients and
+> 2. A scientist has assayed gene expression levels in 1000 cancer patients and
> has data from probes targeting different genes in tumour samples from
> patients. She would like to create new variables representing relative
> abundance of different groups of genes to i) find out if genes form
> subgroups based on biological function and ii) use these new variables
> in a linear regression examining how gene expression varies with disease
> severity.
-> 4. All of the above.
+> 3. Both of the above.
>
> > ## Solution
> >
> >
> > In the first case, a regression model would be more suitable; perhaps a
> > survival model.
-> > In the second, again a regression model, likely linear or logistic, would
-> > be more suitable.
-> > In the third example, PCA can help to identify modules of correlated
+> > In the second example, PCA can help to identify modules of correlated
> > features that explain a large amount of variation within the data.
> >
-> > Therefore the answer here is 3.
+> > Therefore the answer here is 2.
> {: .solution}
{: .challenge}
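
The gene expression scenario in the challenge above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the lesson's own analysis: the expression matrix and severity scores are simulated, the object names (`expr`, `severity`, `scores`) are hypothetical, and regressing the first component on severity is only one possible way to set up the regression.

```r
set.seed(42)

# Simulated stand-in for an expression matrix: 100 patients x 20 genes
n_patients <- 100
n_genes <- 20
expr <- matrix(rnorm(n_patients * n_genes), nrow = n_patients, ncol = n_genes)
colnames(expr) <- paste0("gene", seq_len(n_genes))

# Simulated disease severity score, one value per patient
severity <- rnorm(n_patients)

# PCA with patients in rows and genes in columns, standardising each gene
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of variance explained by the first five components
summary(pca)$importance["Proportion of Variance", 1:5]

# Loadings show how strongly each gene contributes to each component
head(pca$rotation[, 1:2])

# Relate the first component's scores to severity in a linear regression
scores <- data.frame(PC1 = pca$x[, "PC1"], severity = severity)
fit <- lm(PC1 ~ severity, data = scores)
summary(fit)
```

Because the data here are random noise, no component should explain much variance and the regression should show no real relationship. With real expression data, genes that load heavily on the same component are candidates for a shared biological function.
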

@@ -241,8 +234,8 @@ deviation of 1.
> > It also won't affect how quickly the output will be calculated, whether
> > continuous and categorical variables are present or not.
> >
-> > It is done to ensure that all features have equal weighting in the resulting
-> > PCs.
+> > It is done to ensure that features with different ranges of values
+> > have equal weighting in the resulting PCs (point 2).
> >
> > 2. You may not want to standardise datasets which contain continuous variables
> > all measured on the same scale (e.g. gene expression data or RNA sequencing
2 changes: 1 addition & 1 deletion _episodes_rmd/05-factor-analysis.Rmd
@@ -265,7 +265,7 @@ text(
> > biologically, as we would expect prostate enlargement to be associated
> > with greater weight. The groupings of lcavol, lcp, and lpsa also make
> > sense biologically, as larger cancer volume may be expected to be
-> > associated with greater cancer spead and therefore higher PSA in the blood.
+> > associated with greater cancer spread and therefore higher PSA in the blood.
> {: .solution}
{: .challenge}
