Feedback from September 2022 delivery #88

Closed
ewallace opened this issue Sep 29, 2022 · 4 comments


@ewallace
Contributor

ewallace commented Sep 29, 2022

DRAFT TO BE UPDATED AFTER DAY 4 - saved here to get started, currently updated to day 3.

EdCarp delivery 2022-09-27 to 2022-09-30, with instructors @hannesbecher, @luciewoellenstein44, @ewallace.
https://edcarp.github.io/2022-09-27_ed-dash_high-dim-stats/

Collaborative document:
https://pad.carpentries.org/2022-09-27_ed-dash_high-dim-stats

Overall the delivery went very well: good material, and happy and engaged students.

Day 1 - Introduction, Regression with many features

Learner feedback

Please list 1 thing that you liked or found particularly useful

  • Well, all this is exactly what I need right now for my work. So, it was all very useful. (Very useful help on model.matrix, thank you!) (Pete)
  • It's nice to go through every function/word in R and know what they mean all the time.
  • Very helpful, especially in explaining what each part of the function actually means
  • Great learning experience
  • Very useful and insightful first day!
  • I really appreciate getting a chance to go through the code step by step. It's useful to be able to hear what it is exactly, and how it works.

Please list another thing that you found less useful, or that could be improved

  • While this is out of your control, moving between windows and internet tabs on a small screen takes a little time, so from time to time I missed something. (Pete) +1
  • Sometimes it is hard to read the material in time for the group sessions. +1
  • maybe more breaks so people could catch up +1
  • Perhaps a glossary/definitions of the functions used would be helpful, in case you miss anything that was said
  • I spent a bit of time trying to find the column header for the smoking exercise! Should have checked the question first, but I didn't and wasted loads of time trying to figure out it was $smoking.

Instructor feedback

Day 2 - Regularised regression

Learner feedback

Please list 1 thing that you liked or found particularly useful

  • The detailed explanation of regression models, from ridge to lasso and elastic net: it is just fantastic to know how those algorithms relate to each other. I have been using them for many years and never understood the links. The coding and visualisation of the results are really helpful.
  • The depth of the models and the background was great. +1
  • Very happy with the explanations of how the maths works. Also it's great to be finally able to make a predictive model, even if it was very simple.
  • Increased my understanding of regression, but it was a tough day! Lots to take in.

Please list another thing that you found less useful, or that could be improved

  • There were a few times I was confused by the R syntax being used, mainly because I am not used to it. Are there any supporting documents that could be displayed for some of the exercises to help us solve the tasks?
  • Although in contradiction to my "positive" comment, it was heavy going :) - although I did enjoy it. The material is there for us to go back over. +1
  • I think it's good to do some examples. I struggled to keep up at times and got a little lost. I think this is just my lack of familiarity. Maybe a little slower would be good.
  • I found it tough going, and there was a lot of detail. Felt a bit out of my depth at times, but I did learn a bit more.

Instructor feedback

Learners had several questions about extra arguments in calls to lm(), glmnet(), and so on; see the etherpad for day 2. These should give clues about places to simplify:

  • Why as.data.frame? Comparing simplerfit_horvath <- lm(train_age ~ train_mat) to the example
    fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
  • What does the -1 do to the methyl_mat matrix in k-fold cross validation? (in lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1))

Day 3 - Principal component analyses, Factor analysis

Learner feedback

Please list 1 thing that you liked or found particularly useful

  • I thought this lesson was explained really well! I finally understand what these two models do. The run-through in R was in depth and really helpful.
  • Very practical lesson! Easy to follow.
  • For me this was great as I kind of do this sort of thing anyway. So, to actually be taught it filled in some gaps. The course material, as every day, is excellent. Very detailed. Course delivery excellent too.
  • perfect level for me today - I've used PCA in genetics to look for relatedness, so had a bit of understanding into how it works, but didn't know how to use it on non-genetic data. Really helpful, and I get it now!! I can see how to use it in my research.
  • likewise - only ever used PCA for pop genomics as a bit of a black box so great to develop my understanding. v interested in factor analysis
  • Fantastic, I can see the material here coming in very good use!
  • very interesting and detailed explanation of PCA and factor analysis, love it

Please list another thing that you found less useful, or that could be improved

  • difficult one: Maybe the time for coding could be expanded slightly?
  • I am curious about factor analysis and would be great to discuss it more

Instructor feedback

PCA (Episode 4)

  • Really nice introductory explanations.
  • Episode 4 PCA, Challenge 1, example 2 is ambiguous as it could be interpreted as PCA-appropriate. Could that be clarified or discussed?

An online retailer has collected data on user interactions with its online app and has information on the number of times each user interacted with the app, what products they viewed per interaction, and the type and cost of these products. The retailer would like to use this information to predict whether or not a user will be interested in a new product.

  • For Challenge 2, some of the students said it "seems like a trick question".
  • Loadings are introduced approximately three times, but only explained later in the lesson. Could that be rationalised so the term is introduced strongly once? Understanding the loadings helps in understanding how PCs are calculated, and that could come before deciding how many PCs to keep.
  • The difference between the base-plot style used earlier and the ggplot2-based style used later is striking and perhaps distracting. For example, one biplot looks very different from another biplot. This could also make the code fragile for learners, since in the same lesson biplot refers to both PCAtools::biplot and stats::biplot.
  • Are the labels in the biplot needed in the PCAtools/microarray example? They seem like unnecessary and distracting information here, given we are not going to explain GSMxxxxx or 211122_s_at. They are also hard to read: too small and/or overlapping, and they trigger ggrepel error messages.
  • This lesson introduced me to the terms "screeplot" and "biplot"; I didn't have special names for them before. Maybe an extra sentence of explanation for each would be helpful.
  • "Remove the lower 20% of PCs with lower variance" was unclear to learners.
  • In some code snippets, comments placed after the code appear after the output instead of next to the code they refer to. It might be more helpful to move the comments immediately before the line of code they describe.
  • plotloadings was unclear to instructors and to learners. We wondered how the included variables are chosen, and whether it is important to include them. Reading ?plotloadings, it says that the rangeRetain argument gives a "Cut-off value for retaining variables" in terms of the "top/bottom fraction of the loadings range". I (Edward) find that unintuitive; for example, there are still many points within a 1e-5 fraction of the loadings range: plotloadings(pc, labSize = 3, rangeRetain = 1e-5). A short sketch follows this list.
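
A minimal sketch of the rangeRetain behaviour mentioned in the last item, using simulated data (the matrix, feature, and sample names here are hypothetical, not the lesson's microarray data):

    # Minimal sketch with simulated data; names are hypothetical, not the lesson's.
    # Per ?plotloadings, rangeRetain is a cut-off on the top/bottom fraction of
    # each component's loadings range; smaller values are documented to retain
    # fewer variables, though as noted above the behaviour can be surprising.
    library(PCAtools)
    set.seed(1)
    mat <- matrix(rnorm(100 * 20), nrow = 100,
                  dimnames = list(paste0("feature", 1:100), paste0("sample", 1:20)))
    pc <- pca(mat)                                      # variables in rows, samples in columns
    plotloadings(pc, labSize = 3, rangeRetain = 0.5)    # retains many variables
    plotloadings(pc, labSize = 3, rangeRetain = 0.01)   # should retain far fewer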

Factor analysis (Episode 5)

  • There's some confusion about the difference between PCA and FA. The current introduction says "we introduce another method", "Factor analysis is used to identify latent features in a dataset from among a set of original variables ... FA does this in a similar way to PCA", and "Unlike with PCA, researchers using FA have to specify the number of latent variables.". Overall this gives the impression of "similar but different" and doesn't explain well either why you'd need to learn both or the ideas underlying the difference (see the sketch after this list).
  • Some online materials give clearer PCA vs FA explanations, e.g. https://towardsdatascience.com/what-is-the-difference-between-pca-and-factor-analysis-5362ef6fa6f9 and https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/203-30.pdf
  • Still, the learners seemed very happy. That seems to reflect the hands-on approach of the lesson, which they can follow along with, and the fact that it is less mathy than the previous episodes.
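
To make the contrast above concrete, here is a minimal sketch comparing base R's prcomp() and factanal() on simulated data; the variable names and the two-factor structure are hypothetical, not taken from the lesson:

    # Minimal sketch with simulated data; names and structure are hypothetical.
    # prcomp() returns all principal components and the number to keep is chosen
    # afterwards; factanal() needs the number of latent factors up front.
    set.seed(1)
    f <- matrix(rnorm(200 * 2), ncol = 2)                  # two latent factors
    load_mat <- matrix(runif(6 * 2, 0.4, 0.9), ncol = 2)   # factor loadings
    x <- f %*% t(load_mat) + matrix(rnorm(200 * 6, sd = 0.5), ncol = 6)
    colnames(x) <- paste0("v", 1:6)

    pca_fit <- prcomp(x, scale. = TRUE)   # decide how many PCs to keep after fitting
    fa_fit <- factanal(x, factors = 2)    # number of factors fixed in the call
    summary(pca_fit)
    print(fa_fit$loadings)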

Day 4 - K-means clustering, Hierarchical clustering

Learner feedback

Instructor feedback

@alanocallaghan
Collaborator

alanocallaghan commented Sep 29, 2022

I don't know if these are rhetorical, but

Why as.data.frame? Comparing simplerfit_horvath <- lm(train_age ~ train_mat) to the example fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))

The second example preserves the variable names as-is, so when you use predict with newdata it doesn't throw a warning. We should probably work with a data frame from the start there.
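
A minimal sketch of that point, using hypothetical simulated data rather than the lesson's methylation matrix:

    # Minimal sketch with simulated data; names are hypothetical, not the lesson's.
    # With a matrix on the right-hand side, the model has a single term
    # "train_mat", so predict() cannot match columns supplied via newdata.
    # With a data frame, each column becomes its own term and newdata works.
    set.seed(1)
    train_mat <- matrix(rnorm(100 * 5), ncol = 5,
                        dimnames = list(NULL, paste0("cpg", 1:5)))
    train_age <- rnorm(100)

    fit_matrix <- lm(train_age ~ train_mat)                       # terms: train_matcpg1, ...
    fit_df <- lm(train_age ~ ., data = as.data.frame(train_mat))  # terms: cpg1, ..., cpg5

    new_obs <- as.data.frame(matrix(rnorm(5), ncol = 5,
                                    dimnames = list(NULL, paste0("cpg", 1:5))))
    predict(fit_df, newdata = new_obs)      # matches by column name, no warning
    predict(fit_matrix, newdata = new_obs)  # warns and effectively returns fitted values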

What does the -1 do to the methyl_mat matrix in k-fold cross validation? (in lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1))

I'm not 100% sure, but presumably this is removing the intercept column, as glmnet automatically adds one. Again, it would probably be better to set up the data so the code is similar across the lm and glmnet calls, although I think that's actually rather difficult.
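
A minimal sketch of what the [, -1] subsetting itself does, on a hypothetical toy matrix (the intercept-column interpretation is, as above, a guess):

    # Minimal sketch with a toy matrix; names are hypothetical, not the lesson's.
    # A negative column index drops that column, so methyl_mat[, -1] is methyl_mat
    # without its first column. If that first column were an all-ones intercept
    # column, dropping it avoids duplicating the intercept that cv.glmnet() adds
    # by default (intercept = TRUE).
    library(glmnet)
    set.seed(1)
    methyl_mat <- cbind(intercept = 1, matrix(rnorm(50 * 20), ncol = 20))
    age <- rnorm(50)
    dim(methyl_mat)        # 50 21
    dim(methyl_mat[, -1])  # 50 20: first column dropped
    lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)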

@ewallace
Contributor Author

ewallace commented Sep 29, 2022

@alanocallaghan thanks, it wasn't rhetorical and sorry for being unclear. I agree that it would be helpful to either set up the code to be more similar, or to explain the details.

@alanocallaghan
Collaborator

The first is mentioned, with a fuller explanation, in issue #52.

@hannesbecher
Collaborator

Many of these are now implemented. Others have become obsolete due to restructuring.
