---
title: "DATA 605 Week 11"
author: "Mike Silva"
date: "10 April 2019"
output: 
  rmdformats::readthedown:
    highlight: kate
---

## Introduction

As part of the Fundamentals of Computational Mathematics coursework at the CUNY School of Professional Studies, we needed to complete a simple linear regression on data of our choice.  I chose to look at the relationship between total arcade revenue and the number of computer science doctorates awarded.

## Data Acquisition

The data was scraped from [Tyler Vigen's site](http://www.tylervigen.com/view_correlation?id=97).  The total arcade revenue figures originally come from the U.S. Census Bureau.  The computer science doctorate counts originally come from the National Science Foundation.

```{r message=FALSE}
library(rvest)
library(stringr)
library(dplyr)
library(tidyr)
library(tibble)

df <- read_html("http://www.tylervigen.com/view_correlation?id=97") %>%
  html_nodes(xpath = "//table[@class='alldata']") %>%
  html_table(fill = TRUE) %>%
  as.data.frame() %>%
  select(-X12) %>% # Drop empty column
  slice(1:3) %>% # Keep the first 3 rows
  mutate(X1 = ifelse(X1 == "", "Year", str_extract(X1, "(.*?)\\s?\\(.*?\\)"))) %>% # Cleanup first column
  mutate(X1 = str_remove(X1, " \\(.*\\)")) %>% # More first column cleanup
  mutate(X1 = str_replace_all(X1, " ", ".")) %>% # Last bit of first column cleanup
  rownames_to_column %>% # Transpose dataframe
  gather(var, value, -rowname) %>% 
  spread(rowname, value) %>%
  select(-var) # Drop unneeded column

# Pull name from first row
names(df) <- df[1,]

df <- df %>%
  slice(2:11) %>% # Drop first row
  mutate(Total.revenue.generated.by.arcades = gsub(",", "", Total.revenue.generated.by.arcades)) %>% # Remove the commas
  mutate(Computer.science.doctorates.awarded = gsub(",", "", Computer.science.doctorates.awarded)) %>% # Remove the commas
  mutate_if(is.character, as.numeric) # Convert to numeric
```
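
The transpose step in the pipeline above (`rownames_to_column`, `gather`, `spread`) can be hard to follow on the scraped table.  The following is a minimal sketch of the same idea on a small hypothetical table.  The objects `toy_wide` and `toy_long`, the column `Metric.A`, and all values are made up for illustration and are not the scraped data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the scraped table: one row per variable,
# one column per year, everything stored as text.
toy_wide <- data.frame(
  X1 = c("Year", "Metric.A"),
  X2 = c("2000", "1,196"),
  X3 = c("2001", "1,176"),
  stringsAsFactors = FALSE
)

toy_long <- toy_wide %>%
  rownames_to_column() %>%          # Keep the original row index ("1", "2")
  gather(var, value, -rowname) %>%  # One row per (original row, original column)
  spread(rowname, value) %>%        # Original rows become columns
  select(-var)                      # Drop the old column labels (X1, X2, X3)

# The first row now holds the variable names
names(toy_long) <- unlist(toy_long[1, ])

toy_long <- toy_long %>%
  slice(-1) %>%                                   # Drop the header row
  mutate(Metric.A = gsub(",", "", Metric.A)) %>%  # Remove thousands separators
  mutate_if(is.character, as.numeric)             # Convert text to numbers
```

One caveat with this approach: `spread` orders the new columns by the row index as text, so it is only safe when the table has fewer than ten rows before transposing, as is the case after `slice(1:3)` above.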

The end result is the following data:

```{r echo=FALSE, message=FALSE}
library(kableExtra)
df %>%
  rename("Computer Science Doctorates Awarded" = Computer.science.doctorates.awarded) %>%
  rename("Total Revenue Generated by Arcades" = Total.revenue.generated.by.arcades) %>%
  kable() %>%
  kable_styling()
```

## Model Building

Now that the data is cleaned up, we will fit a line to it, with arcade revenue as the response and doctorates awarded as the predictor.

```{r}
model <- lm(Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded, data = df)
```
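
For a single predictor, the least-squares line has a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the point of means.  The sketch below checks this against `lm` on made-up numbers.  The helper `ols_by_hand` and the `toy_` vectors are hypothetical names introduced here, not part of the original analysis.

```r
# Closed-form least-squares estimates for y = intercept + slope * x
ols_by_hand <- function(x, y) {
  slope <- cov(x, y) / var(x)
  intercept <- mean(y) - slope * mean(x)
  c(intercept = intercept, slope = slope)
}

# Made-up illustration data (not the arcade data)
toy_x <- c(1, 2, 3, 4, 5)
toy_y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

by_hand <- ols_by_hand(toy_x, toy_y)
by_lm <- coef(lm(toy_y ~ toy_x))

# Both approaches should agree up to floating point error
all.equal(unname(by_hand), unname(by_lm))
```

Calling `ols_by_hand(df$Computer.science.doctorates.awarded, df$Total.revenue.generated.by.arcades)` should likewise reproduce `coef(model)`.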

## Model Evaluation

```{r echo = FALSE}
library(ggplot2)
ggplot(df, aes(Computer.science.doctorates.awarded, Total.revenue.generated.by.arcades)) +
  geom_point() +
  geom_abline(slope = model$coefficients[2], intercept = model$coefficients[1], color="red") +
  ylab("Total Revenue Generated by Arcades (millions $)") +
  xlab("Computer Science Doctorates Awarded")
```


```{r comment=NA}
summary(model)
```
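
The `Multiple R-squared` reported by `summary` is one minus the ratio of the residual sum of squares to the total sum of squares.  In a simple linear regression with an intercept it also equals the squared Pearson correlation between the two variables.  Here is a minimal sketch of both calculations on made-up data; all `toy_` names and values are hypothetical.

```r
# Made-up illustration data (not the arcade data)
toy_x <- c(1, 2, 3, 4, 5)
toy_y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
toy_model <- lm(toy_y ~ toy_x)

# R-squared from its definition
ss_res <- sum(residuals(toy_model)^2)    # Variation left unexplained
ss_tot <- sum((toy_y - mean(toy_y))^2)   # Total variation in the response
r2_by_hand <- 1 - ss_res / ss_tot

# R-squared as the squared correlation (single-predictor case only)
r2_from_cor <- cor(toy_x, toy_y)^2

# R-squared as reported by summary()
r2_from_summary <- summary(toy_model)$r.squared

c(r2_by_hand, r2_from_cor, r2_from_summary)
```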

### Residual Analysis

```{r}
plot(model)
```
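
By default `plot` on an `lm` object draws four diagnostics: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.  Some of what these plots show can also be checked numerically.  The sketch below uses made-up data, and the cutoff of 2 for standardized residuals is a common rule of thumb assumed here, not something taken from the original analysis.

```r
# Made-up illustration data (not the arcade data)
toy_x <- c(1, 2, 3, 4, 5, 6, 7, 8)
toy_y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.8, 14.3, 15.9)
toy_model <- lm(toy_y ~ toy_x)

res <- residuals(toy_model)
std_res <- rstandard(toy_model)  # Residuals scaled by their estimated standard deviation

# With an intercept in the model, least-squares residuals sum to zero
# and are uncorrelated with the fitted values (up to floating point error).
sum(res)
cor(res, fitted(toy_model))

# Observations whose standardized residual exceeds the assumed cutoff
flagged <- which(abs(std_res) > 2)
flagged
```

The same calls work on `model`, which gives a numeric companion to the plots above.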

## Conclusion

This model explains almost all of the variability in arcade revenue: the R-squared is about 0.97.  The residual plots do not raise any concerns, and the Q-Q plot is fairly linear.  The coefficients are statistically significant.  The only problem with this model is that the relationship is a [spurious correlation](http://www.tylervigen.com/view_correlation?id=97): a strong fit does not show that one variable drives the other.
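
One common way a high R-squared can arise between unrelated variables is a shared trend over time.  This is offered as an illustration, not as a claim about the mechanism behind the arcade data.  The sketch below simulates two hypothetical series whose noise is independent but which both drift upward, then compares the fit before and after removing the time trend.  All `toy_` names, the seed, and the parameter values are arbitrary assumptions.

```r
set.seed(605)  # Arbitrary seed so the sketch is reproducible

toy_year <- 2000:2009

# Two hypothetical series that each trend upward over time,
# with noise drawn independently of each other.
toy_a <- 100 + 5 * (toy_year - 2000) + rnorm(10, sd = 1)
toy_b <- 50 + 3 * (toy_year - 2000) + rnorm(10, sd = 1)

# Regressing one series on the other ignores the shared time trend
toy_r2 <- summary(lm(toy_a ~ toy_b))$r.squared

# After removing the linear time trend from each series,
# only the independent noise is left to be correlated.
toy_a_detrended <- residuals(lm(toy_a ~ toy_year))
toy_b_detrended <- residuals(lm(toy_b ~ toy_year))
toy_r2_detrended <- cor(toy_a_detrended, toy_b_detrended)^2

c(raw = toy_r2, detrended = toy_r2_detrended)
```

The raw R-squared is typically close to 1 in this setup even though neither series influences the other, while the detrended value is usually much lower.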