From 9c269c467abb54d3158a723b3e5b8f6f2d5fc852 Mon Sep 17 00:00:00 2001 From: air Date: Tue, 19 Dec 2023 09:50:18 -0600 Subject: [PATCH] Add files via upload --- .../videos/day6/stat220_regression_prelab.qmd | 134 ++++++++++++++++++ 1 file changed, 134 insertions(+) create mode 100644 content/page/videos/day6/stat220_regression_prelab.qmd diff --git a/content/page/videos/day6/stat220_regression_prelab.qmd b/content/page/videos/day6/stat220_regression_prelab.qmd new file mode 100644 index 0000000..e63bce9 --- /dev/null +++ b/content/page/videos/day6/stat220_regression_prelab.qmd @@ -0,0 +1,134 @@ +--- +title: 'Correlation and Linear regression pre-lab' +author: "Professor Amelia McNamara" +format: + html: + toc: true + toc-depth: 2 + toc-location: left + code-tools: true +editor: visual +--- + + + +Download + + +https://github.com/tidy-MN/R-camp-penguins/raw/main/content/page/videos/day6/stat220_regression_prelab.qmd + +## Setup and packages + +We use three packages in this course: `Lock5Data`, `tidyverse` and `infer`. To load a package, you use the `library()` function, wrapped around the name of a package. I've put the code to load one package into the chunk below. Add the other two you need. + +```{r} +#| label: install-load-packages +#| warning: false + +# Install if needed +#install.packages("Lock5Data") +#install.packages("infer") + +# Load +library(Lock5Data) +library(tidyverse) +library(infer) +``` + +## Load the data + +Let's load the example dataset I have for this course, `GSS_clean.csv`. It should be inside the `data` folder in your RStudio Cloud. We'll use the `read_csv()` function to read in the data. + +```{r} +#| label: data-load + +GSS <- read_csv("https://tidy-mn.github.io/R-camp-penguins/data/GSS_clean.csv") +``` + +## Data visualization for two quantitative variables + +Our main data visualization for two quantitative variables is a scatterplot. + +To make our visualization, we will use the `ggplot()` function, with our data inside, and then use a `+` to add on a `geom_` to tell R what kind of plot we want. In this case, we want a `geom_point()` to make a scatterplot. Inside the `geom_` function, we use the `aes()` function and tell R how to map between variables in our dataset and variables in the plot. Here's an example: + +```{r} +#| label: scatterplot + +ggplot(GSS) + + geom_point(aes(x = highest_year_school_completed_spouse, y = highest_year_of_school_completed)) +``` + +As with our plots in the last lab, we get a warning message about missing values being removed. + +Does this plot show a positive association, negative association, or no association? If it shows an association, how strong would you say it is? + +## Summary statistics for two quantitative variables + +One way to quantify the strength and direction of the relationship is with correlation. To find this, we use the `cor()` function inside our `summarize()` in a data pipeline. + +```{r} +GSS %>% + summarize(correlation = cor(y = highest_year_of_school_completed, + x = highest_year_school_completed_spouse)) +``` + +Much like with our other summary statistics, this gives an `NA` value, because there are `NA` values in one or more of the variables in the dataset. We can use `drop_na()` on those two variables, + +```{r} +GSS %>% + drop_na(highest_year_of_school_completed, highest_year_school_completed_spouse) %>% + summarize(correlation = cor( + y = highest_year_of_school_completed, + x = highest_year_school_completed_spouse)) +``` + +## Linear models + +Another way to study the relationship between two quantitative variables is to fit a linear model. R has a function that does least squares regression, called `lm()` for linear model. This uses the formula syntax, which we're not focusing on in this class. The syntax requires you to write $y~x$ to specify your response and explanatory variable, with a "tilde" in between. (The tilde is up above your Tab key on the keyboard, along with the "tick" \`\``). You can run the`lm()\` function in your Console or RMarkdown document and get a little information about your model, + +```{r} +lm(highest_year_of_school_completed ~ highest_year_school_completed_spouse, data = GSS) +``` + +but, it's better practice to assign your model object to a name, so you can refer back to it later, and get more information about it. Recall that the assignment operator in R is `<-`. I usually call my models `m1`, `m2`, etc., but you could think of a better name to use (like `schoolmodel` or similar) to remind you what it's about. + +```{r} +m1 <- lm(highest_year_of_school_completed ~ highest_year_school_completed_spouse, data = GSS) +``` + +When we run this code, nothing prints out. But, a new object should appear in our RStudio Environment pane. Now, we can run R functions on that model object. The most useful one is `summary()`, + +```{r} +m1 %>% + summary() +``` + +The same model coefficients are shown here, but there is also a lot more information about things like the $R^2$ value. + +### Interpreting coefficients + +Let's interpret our model coefficients. Here are the generic sentences I prefer: + +**Intercept** If the value of \[explanatory variable\] was zero, our model would predict \[response variable\] to be \[intercept value\]. + +**Slope** For a one-\[unit\] increase in \[explanatory variable\], our model would predict a \[slope value\]-\[unit\] \[increase/decrease\] in \[response variable\]. + +Let's apply that to this model: + +## Predicting values + +Models are also useful for prediction. There are more programmatic ways to do prediction in R, but for now I recommend you just use R as a big calculator. Let's see what the model would predict for a person whose spouse had completed 12 years of school (high school): + +```{r} + +``` + +Does that make sense? + +## Residuals + +We are also interested in whether our model overpredicts or underpredicts certain points. We can compute a residual, which is $observed - expected$ or $y_i-\hat{y}_i$. Look at row 21 in the dataset. It represents a 34-year-old person with one child whose spouse completed 12 years of school. What is the observed value of `highest_year_of_school_completed` for this person? What is their residual? + +```{r} + +```