-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path36-rvest.Rmd
158 lines (104 loc) · 7.76 KB
/
36-rvest.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
# Intro to rvest
All the way back in Chapter 2, we used Google Sheets and importHTML to get our own data out of a website. For me, that's a lot of pointing and clicking and copying and pasting. R has a library that can automate the harvesting of data from HTML on the internet. It's called `rvest`.
Let's grab [a simple, basic HTML table from College Football Stats](http://www.cfbstats.com/2021/leader/national/team/offense/split01/category09/sort01.html). There's nothing particularly strange about this table -- it's simply formatted and easy to scrape.
First we'll need some libraries. We're going to use a library called `rvest`, which you can get by running `install.packages('rvest')` in the console.
```{r, message=FALSE, warning=FALSE}
library(rvest)
library(tidyverse)
```
The rvest package has functions that make fetching, reading and parsing HTML simple. The first thing we need to do is specify a url that we're going to scrape.
```{r}
scoringoffenseurl <- "http://www.cfbstats.com/2021/leader/national/team/offense/split01/category09/sort01.html"
```
Now, the most difficult part of scraping data from any website is knowing what exact HTML tag you need to grab. In this case, we want a `<table>` tag that has all of our data table in it. But how do you tell R which one that is? Well, it's easy, once you know what to do. But it's not simple. So I've made a short video to show you how to find it.
<iframe width="560" height="315" src="https://www.youtube.com/embed/kYkSE3zWa9Y" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
When you have simple tables, the code is very simple. You create a variable to receive the data, then pass it the url, read the html that was fetched, find the node you need using your XPath value you just copied and you tell rvest that it's a table.
```{r}
scoringoffense <- scoringoffenseurl %>%
read_html() %>%
html_nodes(xpath = '//*[@id="content"]/div[2]/table') %>%
html_table()
```
What we get from this is ... not a dataframe. It's a list with one element in it, which just so happens to be our dataframe. When you get this, the solution is simple: just overwrite the variable you created with the first list element.
```{r}
scoringoffense <- scoringoffense[[1]]
```
And what do we have?
```{r}
head(scoringoffense)
```
We have data, ready for analysis.
## A slightly more complicated example
What if we want more than one year in our dataframe?
This is a common problem. What if we want to look at every scoring offense going back several years? The website has them going back to 2009. How can we combine them?
First, we should note, that the data does not have anything in it to indicate what year it comes from. So we're going to have to add that. And we're going to have to figure out a way to stack two dataframes on top of each other.
So let's grab 2020.
```{r}
scoringoffenseurl20 <- "http://www.cfbstats.com/2020/leader/national/team/offense/split01/category09/sort01.html"
scoringoffense20 <- scoringoffenseurl20 %>%
read_html() %>%
html_nodes(xpath = '//*[@id="content"]/div[2]/table') %>%
html_table()
scoringoffense20 <- scoringoffense20[[1]]
```
First, how are we going to know, in the data, which year our data is from? We can use mutate.
```{r error=TRUE}
scoringoffense20 <- scoringoffense %>% mutate(YEAR = 2020)
```
Uh oh. Error. What does it say? It's ... not clear, but a hint is that our first column doesn't have a name. Each column must be named. If you look at our data in the environment tab in the upper right corner, you'll see that indeed, the first column has no name. It's the FBS rank of each team. So we can fix that and mutate in the same step. We'll do that using `rename` and since the field doesn't have a name to rename it, we'll use a position argument. We'll say rename column 1 as Rank.
```{r}
scoringoffense21 <- scoringoffense %>% rename(Rank = 1) %>% mutate(YEAR = 2021)
scoringoffense20 <- scoringoffense20 %>% rename(Rank = 1) %>% mutate(YEAR = 2020)
```
And now, to combine the two tables together length-wise -- we need to make long data -- we'll use a dpylr function called `bind_rows`. The good thing is bind_rows is simple.
```{r}
combined <- bind_rows(scoringoffense21, scoringoffense20)
```
Note in the environment tab we now have a data frame called combined that has 257 observations -- which just so happens to be what 127 from 2020 and 130 from 2021 add up to.
```{r}
head(combined)
```
## An even more complicated example
What do you do when the table has non-standard headers?
Unfortunately, non-standard means there's no one way to do it -- it's going to depend on the table and the headers. But here's one idea: Don't try to make it work.
I'll explain.
Let's try to get [season team stats from Sports Reference](https://www.sports-reference.com/cbb/seasons/2020-school-stats.html). If you look at that page, you'll see the problem right away -- the headers span two rows, and they repeat. That's going to be all kinds of no good. You can't import that. Dataframes must have names all in one row. If you have two-line headers, you have a problem you have to fix before you can do anything else with it.
First we'll grab the page.
```{r}
url <- "https://www.sports-reference.com/cbb/seasons/2020-school-stats.html"
```
Now, similar to our example above, we'll read the html, use XPath to find the table, and then read that table with a directive passed to it setting the header to FALSE. That tells rvest that there isn't a header row. Just import it as data.
```{r}
stats <- url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="basic_school_stats"]') %>%
html_table(header=FALSE)
```
What we get back is a list of one element (similar to above). So let's pop it out into a data frame.
```{r}
stats <- stats[[1]]
```
And we'll take a look at what we have.
```{r}
head(stats)
```
So, that's not ideal. We have headers and data mixed together, and our columns are named X1 to X38. Also note: They're all character fields. Because the headers are interspersed with data, it all gets called character data. So we've got to first rename each field.
```{r}
stats <- stats %>% rename(Rank=X1, School=X2, Games=X3, OverallWins=X4, OverallLosses=X5, WinPct=X6, OverallSRS=X7, OverallSOS=X8, Blank1=X9, ConferenceWins=X10, ConferenceLosses=X11, Blank2=X12, HomeWins=X13, HomeLosses=X14, Blank3=X15, AwayWins=X16, AwayLosses=X17, Blank4=X18, ForPoints=X19, OppPoints=X20, Blank5=X21, Minutes=X22, FieldGoalsMade=X23, FieldGoalsAttempted=X24, FieldGoalPCT=X25, ThreePointMade=X26, ThreePointAttempts=X27, ThreePointPct=X28, FreeThrowsMade=X29, FreeThrowsAttempted=X30, FreeThrowPCT=X31, OffensiveRebounds=X32, TotalRebounds=X33, Assists=X34, Steals=X35, Blocks=X36, Turnovers=X37, PersonalFouls=X38)
```
Now we have to get rid of those headers interspersed in the data. We can do that with filter that say keep all the stuff that isn't this.
```{r}
stats <- stats %>% filter(Rank != "Rk" & Games != "Overall")
```
And finally, we need to change the file type of all the fields that need it. We're going to use a clever little trick, which goes like this: We're going to use `mutate_at`, which means mutate these fields. The pattern for `mutate_at` is `mutate_at` these variables and do this thing to them. But instead of specifying which of 38 variables we're going to mutate, we're going to specify the one we don't want to change, which is the name of the school. And we just want to convert them to numeric, which is simple. Here's what it looks like:
```{r}
stats <- stats %>% mutate_at(vars(-School), as.numeric)
```
One last thing: Who needs columns called Blank1, Blank2, Blank3, etc?
```{r}
stats <- stats %>% select(-starts_with("Blank"))
```
And just like that, we have a method for getting up to the minute season stats for every team in Division I.
```{r}
head(stats)
```