# Our World in Data (parsing)
## Practice parsing
Let's practice with a small file. Make sure you are in a **screen**.
1. Make a directory under your home directory called "parse".
2. Go into that directory.
3. Copy the file "pandemics.csv" from this directory: "/home/data/nise" to the directory you just made.
4. Take a look at the file.
5. Sort by year and put it into a new file called "sort.pandemics.csv"
Note that you will need to change the delimiter to a comma (-t), tell it to sort numerically (-n), and tell it which column to sort on (-k): sort -t, -n -k3 pandemics.csv > sort.pandemics.csv
6. Pull out all the instances that mention plague
Note that grep is case sensitive by default, so you might want to use the -i flag to make it case insensitive in case the capitalization is inconsistent
7. Pull out all the instances of the flu
8. Let's count the number of each type of pathogen (column 2). You could do this with a series of grep commands but with a big file you might not know all the possible pathogens and it would get tedious. So let's do this in a step-wise fashion and make sure each piece is as expected, then pipe it into the next step.\
a. First pull out column 2 using awk or cut\
b. Get the unique values (make sure you "sort" before you "uniq" or you will only collapse adjacent identical values; see the short demonstration after this list).\
c. Get the count (hint: add -c to the uniq step)
9. Grab and count the organisms for the instances that mention plague
Hint: It is the same as #8 except you need to grab the plague lines first. When you put a command after a pipe, it reads the output of the previous command, so only give the file name to the first command.
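As promised in 8b, here is a tiny demonstration (with made-up values) of why the sort has to come before uniq: uniq only collapses adjacent identical lines.
```
printf "flu\nplague\nflu\n" | uniq -c          # the two flu lines are not adjacent, so both are reported separately
printf "flu\nplague\nflu\n" | sort | uniq -c   # after sorting they are adjacent, so flu gets a count of 2
```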
<details>
<summary>Click for All Answers</summary>
Note that there are often several ways to do things in linux and not all methods are shown.
```
1. mkdir ~/parse
2. cd ~/parse
3. cp /home/data/nise/pandemics.csv . [Don't forget the space and period at the end; the period means "copy it here, keeping the same name".]
4. cat pandemics.csv [you could also use more, less, head, tail, or other commands]
5. sort -t, -n -k3 pandemics.csv > sort.pandemics.csv
6. grep -i plague sort.pandemics.csv
7. grep -i flu sort.pandemics.csv
8. awk -F, '{print $2}' sort.pandemics.csv | sort | uniq -c
[Note that you can also use: "cut -d, -f2 pandemics.csv" for the first step]
9. grep -i plague sort.pandemics.csv | awk -F, '{print $2}' | sort | uniq -c
```
</details>
## Our World in Data
Now let's use the data from "Our World In Data". For things that are new, we have added some in-line answers, but before you reveal an answer, try it yourself first using the hints provided.
10. Make sure you are still in the parse directory.
11. Link to the file "owid-covid-data.csv". It is in this directory: "/home/data/nise/".
<details>
<summary>Click for Answer</summary>
```
ln -s /home/data/nise/owid-covid-data.csv .
```
</details>
\
12. How many rows are in the file? Use wc -l (wc = word count, -l = lines)
<details>
<summary>Click for Answer</summary>
```
wc -l owid-covid-data.csv
```
</details>
\
13. How many columns? There is a special variable, NF, in awk, which prints the number of fields
Note: this will print the number of fields in each row. You can hit ctrl-c if you don't want to go through the whole file
<details>
<summary>Click for Answer</summary>
```
awk -F, '{print NF}' owid-covid-data.csv
```
</details>
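If you would rather see the field count just once instead of once per row, one small variation (same idea, just limited to the header line) is:
```
head -1 owid-covid-data.csv | awk -F, '{print NF}'
```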
\
14. Let's look at hospital_beds_per_thousand (column 60). We also need to get the location (country) from column 3 and the date from column 4. We'll pipe it into less so that we can scroll through it (to get out of less, hit "q").
<details>
<summary>Click for Answer</summary>
```
cut -d, -f 3,4,60 owid-covid-data.csv | less
```
</details>
\
15. There are lots of dates for each country. Let's do the same search but limit it to early in the pandemic (2020-03-11), which is the day that the World Health Organization declared COVID-19 to be a pandemic. Put it into a file called "beds.2020-03-11.csv" so we don't have to keep generating it.
16. What country had the least beds per thousand on 2020-03-11 and which had the most? Hint: Sort by the number of beds and pipe it into less.
Note: Some lines have missing data and will sort at the top of the file so you will have to scroll down.
17. How many beds per thousand were there in the United States on 2020-03-11?
Hint: There is a "United States" and a "United States Virgin Islands". To avoid getting the latter, have the match end with a comma. Also, require grep to start the match at the beginning of the line (^). It is good practice to be specific when using grep.
<details>
<summary>Click for Answer</summary>
```
grep "^United States," beds.2020-03-11.csv
```
</details>
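To see why the anchors in the hint matter, compare the two searches below: the unanchored pattern may also pick up the United States Virgin Islands row, which is exactly what we are trying to avoid.
```
grep "United States" beds.2020-03-11.csv     # may also match United States Virgin Islands
grep "^United States," beds.2020-03-11.csv   # matches only the United States line
```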
\
18. How many beds per thousand were there in your three countries on 2020-03-11?
If your countries don't have data on that date, go back to the original file and see if you can find data for any date.
Hint: we'll use the extended version of grep, which allows us to search for all 3 at once. Use the -E flag and put the things you are searching for in double quotes, separated by pipes (|).
<details>
<summary>Click for Answer</summary>
```
grep -E "^Sweden,|^Norway,|^Denmark," beds.2020-03-11.csv
```
</details>
\
19. Create a file that has only the United States data for beds with the same three columns we have been using. Call it usbeds.csv. We'll use this file for the questions below.
20. Get the number of beds in the United States for each day in 2020. Note: Since these numbers don't change, this is kind of silly but it will give you the skills to grab other variables across a date range.
21. Get the number of beds in the United States for the first COVID wave (March through September 2020).
Hint: Use sed to remove the dashes, then awk (don't forget -F,) with $2>=20200301&&$2<=20200930.
<details>
<summary>Click for Answer</summary>
```
sed 's/-//g' usbeds.csv | awk -F, '$2>=20200301&&$2<=20200930{print}'
```
</details>
\
22. BONUS: Get the average number of beds in the United States for the first COVID wave (March through September 2020). This is a tough one. See if you can understand the code.
<details>
<summary>Click for Answer</summary>
```
sed 's/-//g' usbeds.csv | awk -F, '$2>=20200301&&$2<=20200930{SUM+=$3;CNT+=1}END{print SUM/CNT}'
```
</details>
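If that one-liner is hard to read, here is the same awk program written out over several lines with comments; it is just a restatement of the answer above, not a different method.
```
sed 's/-//g' usbeds.csv | awk -F, '
  $2>=20200301 && $2<=20200930 {   # keep only rows whose date falls in the first wave
    SUM += $3                      # add the beds-per-thousand value (column 3) to a running total
    CNT += 1                       # count how many rows were included
  }
  END { print SUM/CNT }            # after the last row, print the average
'
```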
\
<details>
<summary>Click for All Answers</summary>
```
10. pwd [If you aren't in that directory you can: "cd ~/parse"]
11. ln -s /home/data/nise/owid-covid-data.csv .
12. wc -l owid-covid-data.csv
13. awk -F, '{print NF}' owid-covid-data.csv
14. cut -d, -f 3,4,60 owid-covid-data.csv | less
15. cut -d, -f 3,4,60 owid-covid-data.csv | grep 2020-03-11 > beds.2020-03-11.csv
16. sort -t, -n -k3 beds.2020-03-11.csv | less
Least: Mali,2020-03-11,0.1
Most: Monaco,2020-03-11,13.8
17. grep "^United States," beds.2020-03-11.csv
United States,2020-03-11,2.77
18. grep -E "^Sweden,|^Norway,|^Denmark," beds.2020-03-11.csv
Denmark,2020-03-11,2.5
Norway,2020-03-11,3.6
Sweden,2020-03-11,2.22
19. cut -d, -f 3,4,60 owid-covid-data.csv | grep "^United States," > usbeds.csv
20. grep 2020 usbeds.csv
21. sed 's/-//g' usbeds.csv | awk -F, '$2>=20200301&&$2<=20200930{print}'
22. sed 's/-//g' usbeds.csv | awk -F, '$2>=20200301&&$2<=20200930{SUM+=$3;CNT+=1}END{print SUM/CNT}'
```
</details>
## Practice with your 3 countries
Now practice on your own. Choose another variable and see if there are any interesting differences between your 3 countries. Explore at least 3 other variables.
Hint: You can look at the variables by printing the first line of the file with head.
```
head -1 owid-covid-data.csv
```
Hint: If you prefer to have them each on their own line you can use sed to replace the commas with hard returns. Make sure to use the global "g" at the end so that every comma is replaced and not just the first one on each line.
```
head -1 owid-covid-data.csv | sed 's/,/\n/g'
```
Hint: awk will also print out the row number for you (which corresponds to the column number for that variable in the original file). NR is a special awk variable that holds the current row number.
```
head -1 owid-covid-data.csv | sed 's/,/\n/g' | awk '{print NR "\t" $1}'
```
## Country files
Make one file each for Denmark, Norway, and Sweden that we'll use in future chapters. We'll get the following columns:\
location (column 3)\
date (column 4)\
total_cases (column 5)\
total_deaths (column 8)\
total_cases_per_million (column 11)\
total_deaths_per_million (column 14)\
icu_patients (column 18)\
icu_patients_per_million (column 19)\
people_fully_vaccinated (column 37)\
people_fully_vaccinated_per_hundred (column 43)\
An example is below.
```
grep "Norway" /home/data/nise/owid-covid-data.csv | cut -d, -f 3,4,5,8,11,14,18,19,37,43 > Norwaydata.csv
```
Make a similar file for each of your three countries.
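If you would rather not type the command three times, one optional approach (a sketch assuming your three countries are Denmark, Norway, and Sweden, as in the examples above) is a small bash loop:
```
for country in Denmark Norway Sweden; do
  # same grep and cut as the Norway example above, run once per country
  grep "$country" /home/data/nise/owid-covid-data.csv | cut -d, -f 3,4,5,8,11,14,18,19,37,43 > "${country}data.csv"
done
```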
## Jeopardy
https://jeopardylabs.com/play/linux-practicum
<!--I suggest that you don’t make this timed; we want them to think through this rather than hurrying through it. All answers have at least 2 ways to do things, so you could have the team whose turn it is make a guess and then the other team could try to propose a different way. If you want to give them access to the files, you could send them into quick breakout rooms to discuss and test. Or you could have them put something in the chat (might still want breakout rooms so they can discuss?) and you can test it. Either way, they need to know some things about the files (see below).
This uses the pandemics.csv in /home/jm/linux_practice (and also copied to their home directories in an earlier exercise), though there are screenshots in the jeopardy that are sufficient.
It also uses 2 new files that are in /home/jm/jeopardy/ (influenzaA_sequences.fa some_pandemic_movies.txt)
Genetic distance of omicron (variants)
For some_pandemic_movies.txt they need to know that it is a tab-delimited file containing the movie name, year of release, and type of pathogen. At a minimum, show them a head so that they realize that there are spaces in the movie titles, which affects awk because it splits on whitespace by default.
For influenzaA_sequences.fa, they need to know that the header is pipe-delimited: Accession, Genbank title, Organism Name, Length, Country, Collection Date, Host. So, give them the file to work with or show them a head of the headers. Influenza A affects humans, birds, and other animals, while influenza B is generally just humans.
-->