---
title: "01 - First baby steps: R, Importing Data, and Manipulating Data"
output:
  html_notebook:
    toc: true
    toc_float: true
---
# What is R?
R is a ‘programming environment for statistics and graphics’
* Does basically everything, can also be extended
* It’s the default when statisticians implement new methods
* Free, open-source
But;
* Steeper learning curve than e.g. Excel, Stata
* Command-line driven (programming, not drop-down menus)
* Gives only what you ask for!
# Base R vs tidyverse
You know how when you get a new smartphone, it comes with an email and calendar app...but they're not the greatest? I usually download the Google Calendar and Gmail apps on my phone because, even though they technically do the same thing, they do it better. R is similar in this way.
When you downloaded R, it came with capabilities to import, analyze, and export data.
But since R's creation, users have created `packages` which act like plug-ins or addons or apps. These add or improve the functionality of R. We'll be using a suite of packages called the `tidyverse` that tries to make R more straightforward for beginners.
The tidyverse has two main goals:
* Work with tidy (not messy) data
* Make code more human readable
Each package within the tidyverse is meant to do a particular thing, but each ultimately goes back to those two goals. We'll be using three packages in the tidyverse, called `readr` (for reading in data into R), `dplyr` (for manipulating tidy data in R), and `ggplot2` (for visualizing tidy data in R).
# What is RStudio?
Often, learning a programming language is made worse by an unintuitive and unhelpful user interface. For our workshop, we will be using RStudio, a graphical user interface (front-end) for R that is slightly more user-friendly than ‘Classic’ R’s GUI.
# The console window
There are some useful features available in the console:
* Use the UP arrow to see past commands you have typed
* Help window appears as you type to help you complete your thought
* Tab autocomplete
+ When just typing, use tab autocomplete to fill in object names for you
+ When inside quotes `' '`, you can press tab to help you spell folder names correctly
# Trying out the Console
We’ll use the ‘Console’ window first – as a (fancy!) calculator
```{r}
2+2
# [1] 4
2^5+7
# [1] 39
2^(5+7)
# [1] 4096
exp(pi)-pi
# [1] 19.9991
log(20+pi)
# [1] 3.141632
0.05/1E6 # a comment; note 1E6 = 1,000,000
# [1] 5e-08
```
* All common math functions are available; parentheses (round brackets) work as per high school math
* Try to get used to bracket matching. A ‘+’ prompt means the line isn’t finished – hit Escape to get out, then try again.
We can also compare things in R using **operators**.
We can see if things are equal by using two equal signs `==`. To see if something is NOT equal, we use `!=`. On its own, the exclamation mark `!` means "not" and flips `TRUE` to `FALSE` (and vice versa).
```{r}
3 == 2
# [1] FALSE
3-1 == 2
# [1] TRUE
TRUE == TRUE
# [1] TRUE
TRUE == FALSE
# [1] FALSE
'a' == 'a'
# [1] TRUE
'abc' != 'ABC'
# [1] TRUE
!is.na(NA)
# [1] FALSE
```
**Note:** We can represent missing data with `NA`, and use a function/command called `is.na()` to ask if data is missing. More on this later.
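As a small sketch of the difference between `!=` and `!` (the values are only for illustration): `!=` compares two things, while `!` flips a single `TRUE`/`FALSE`. Comparing anything with `NA` gives `NA`, which is why we ask about missing data with `is.na()` rather than `==`.

```{r}
3 != 2            # "is 3 not equal to 2?"
# [1] TRUE
!(3 == 2)         # the same question, asked by flipping the result of ==
# [1] TRUE
NA == NA          # comparing with missing data gives NA, not TRUE
# [1] NA
is.na(NA)         # so we use is.na() to test for missing values
# [1] TRUE
```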
We can use greater than `>` or less than `<` signs as you would expect.
```{r}
300 > 200
# [1] TRUE
0 > 999
# [1] FALSE
```
# MULTIPLE CHOICE
Which of the following will NOT return TRUE?
A. FALSE == FALSE
B. 10-5 == sqrt(25)
C. TRUE > FALSE
D. 'a' > 'b'
[Answer](exercises/01_ImportingData_Answers.Rmd)
# Storing Data
We can quickly make comparisons, but we usually want to do things more sophisticated than that. For example, instead of typing `"This is an important string that we want to do analysis on."` into the console over and over again, we might want to give it a shorter name and then reference it later.
```{r}
x <- "This is an important string that we want to do analysis on."
```
This shows up in the Environment tab in RStudio. This is very useful, because now when we want to print out this string, we can just type `x` into the Console.
```{r}
x
# [1] "This is an important string that we want to do analysis on."
```
The console is most useful for quick calculations or code chunks. But what happens when you want to remember what you coded yesterday, a year ago, or a decade ago? R won't necessarily save everything you've done forever in the Environment tab (and we wouldn't want it to!).
## Using the script window
While fine for occasional use, entering every command ‘by hand’ is error-prone, and quickly gets tedious. A much better approach is to use a Script window – open one with Ctrl-Shift-N, or the drop-down menus;
* Opens a nice editor, enables saving code (.R extension)
* Run current line (or selected lines) with Ctrl-Enter, or Ctrl-R
**An important notice:** From now on, we assume you are using a script editor.
* First-time users tend to be reluctant to switch! – but it’s worth it, ask any experienced user
* Scripts make it easy to run slightly modified code, without re-typing everything – remember to save them as you work
* Also remember the Escape key, if e.g. your bracket-matching goes wrong
For a very few jobs, e.g. changing directories, we’ll still use drop-down menus. But commands are available, for all tasks.
## Where are we?
We can save our scripts wherever we want, but it makes it easier if we set a working directory in R. This makes it easier to find files, and also can make research more reproducible because it gives you the ability to share data structure with a collaborator.
Before we can set the working directory, we need to know where we are on our computer right now. Just like the command line's `pwd` command, R has a command called `getwd()`. Notice that it returns the absolute path to your current working directory (which may or may not be your home directory).
```{r}
getwd()
# [1] "C:/Users/gaugustus/Documents/Rdocs/"
```
You can point to files from anywhere on the computer RELATIVE to your current position. If you need to change this working directory, such as to go into the new `r-intro-20170825-master` folder you got from GitHub, you can do so with `setwd()`. Let's try this. Make sure you put the path in quotes.
You can use tab complete in RStudio, so once you open the quotes, press tab to see all the files and directories listed for you. If you type a letter, that list will shorten. **Note:** You can also use the Files tab in RStudio. Your home directory can be found by clicking the `Home` button.
```{r}
setwd("r-intro-20170825-master")
getwd()
# [1] "C:/Users/gaugustus/Documents/Rdocs/r-intro-20170825-master"
```
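Here is a minimal sketch of moving to another folder and coming back. It uses R's temporary folder (`tempdir()`) as a stand-in destination, simply because that folder exists on every computer; the name `old.wd` is an illustrative choice, not something from the workshop files.

```{r}
old.wd <- getwd()        # remember where we started
setwd(tempdir())         # move into R's temporary folder
getwd()                  # now reports the temporary folder
setwd("..")              # ".." means "one folder up", relative to where we are
setwd(old.wd)            # go back to where we started
getwd() == old.wd
# [1] TRUE
```

Saving the starting folder first means you can always return to it, however far you wander.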
# Storing Data (cont'd)
R stores data (and everything else) as objects. New objects are created when we assign them values;
```{r}
x <- 3
y <- 2 # now check the Environment window
x+y
# [1] 5
```
# MULTIPLE CHOICE
What is the output when we execute the following code?
x <- 3
y <- 2
y <- 17.4
x+y
A. [1] 3 2 17.4
B. [1] 22.4
C. [1] 20.4
D. [1] 5
[Answer](exercises/01_ImportingData_Answers.Rmd)
Assigning new values to existing objects over-writes the old version – and be aware there is no Ctrl-Z ‘undo’;
```{r}
y <- 17.4 # check the Environment window again
x+y
# [1] 20.4
```
* Anything after a hash (#) is ignored – e.g. comments
* Spaces don’t matter outside of quotes (except inside the `<-` symbol, which must be typed without a space between `<` and `-`)
* Capital letters do matter
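The last two points can be sketched with a few throwaway values (the numbers are made up):

```{r}
x <- 3        # assignment: x is now 3
x < - 3       # with a space inside the arrow, R asks "is x less than -3?"
# [1] FALSE
x<-5          # no spaces at all still assigns: x is now 5
X <- 10       # capital X is a different object from x
x + X
# [1] 15
```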
# How to name my data
What’s a good name for my new object?
* Something memorable (!) and not easily-confused with other objects, e.g. X isn’t a good choice if you already have x
* Names must start with a letter or period ("."), after that any letter, number, period or underscore is okay
* Avoid other characters; they get interpreted as math ("-", "*") or are hard to read (" ") so should not be used in names
* Avoid names of existing functions – e.g. summary. Some one-letter choices (c, C, F, t, T and S) are already used by R, it’s best to avoid these too
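As a rough sketch of how to check a candidate name, base R's `exists()` tells you whether a name is already in use, and `make.names()` shows what R would turn an invalid name into. The example names below are made up.

```{r}
exists("summary")          # this name is already taken by a function
# [1] TRUE
make.names("life exp")     # spaces are not allowed, so R swaps in a period
# [1] "life.exp"
make.names("2nd.try")      # names cannot start with a number
# [1] "X2nd.try"
life.exp.months <- 12 * 59.5   # a clear, valid name
```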
# Reading in Data
## Base R
First, let's see how we can read in data using base R, using the `read.csv()` command:
```{r}
gapminder.base <- read.csv(file = "datasets/gapminder.txt", header=TRUE, sep = "\t", stringsAsFactors = FALSE)
```
## readr package
The package within `tidyverse` for reading in data is called `readr`. It solves some of the issues with `read.csv` that are beyond the scope of this course. Let's give it a try! Because this package has been developed by RStudio staff, it's integrated into your RStudio installation.
To import a dataset, follow pop-ups from the Environment tab;
Import Dataset > From CSV...
Once you've decided on your options, you'll see the code at the bottom right tells you how you can code that yourself:
```{r}
library(readr)
gapminder <- read_delim(file = "datasets/gapminder.txt",
                        delim = "\t", escape_double = FALSE, trim_ws = TRUE)
```
By default, you'll see that a `library` command is set. The `library` command allows us to add on to the basic features of R (also called base R). In other words, we can add functionality to make our lives easier. In this case, we are getting the `readr` package. This is part of a suite of packages called the `tidyverse`.
More on those options;
* Name: Name of the data frame object that will store the whole dataset
* file: where is the file located? Absolute or relative path
* First row as names: Does your first line have the names of the columns?
* Delimiter: What separates column values? Tabs, commas, white space
* Skip: Do you need to skip any lines at the top?
* Trim whitespace: If there are extra spaces, get rid of them
The defaults are sensible, but R assumes you know what your data should look like – and whether it has named columns, row names etc. No software is smart enough to cope with every format that might be used by you/your colleagues to store data.
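To see these options in action without needing any particular file, here is a sketch that writes a tiny made-up file and reads it back. It uses base R's `read.delim()` so it runs without extra packages; the base R argument names (`skip`, `header`, `sep`, `strip.white`) differ slightly from the `readr` ones, and `toy.file` and `toy` are illustrative names.

```{r}
# Write a tiny tab-separated file with one junk line at the top (made-up data)
toy.file <- tempfile(fileext = ".txt")
writeLines(c("exported on some date",          # a line we want to skip
             "country\tyear\tlifeExp",         # first real row: column names
             "Kenya \t1952\t42.27",            # note the stray space after Kenya
             "Gabon\t1952\t37.003"),
           toy.file)

toy <- read.delim(toy.file,
                  skip = 1,                    # Skip: ignore the first line
                  header = TRUE,               # First row as names
                  sep = "\t",                  # Delimiter: tab
                  strip.white = TRUE,          # Trim whitespace
                  stringsAsFactors = FALSE)
str(toy)
```

If any of these options is wrong for your file (for example, forgetting `skip`), the result usually looks odd in the data viewer, which is your cue to adjust them.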
**Note:** There is also a way to input Excel files, using a package called `readxl`, also from the tidyverse.
After successfully reading in the data;
* The environment now includes a `gapminder` object – or whatever you called the data read from file
* A copy of the data can be examined in the Excel-like data viewer – if it looks weird, find out why & fix it!
## Are these really the same?
We check to see if every value is equal between the dataset we read in using `readr` and the one we read in using base `R`, and see that they are the same.
```{r}
gapminder == gapminder.base
# country continent year lifeExp pop gdpPercap
# [1,] TRUE TRUE TRUE TRUE TRUE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
# [4,] TRUE TRUE TRUE TRUE TRUE TRUE
# [5,] TRUE TRUE TRUE TRUE TRUE TRUE
# [6,] TRUE TRUE TRUE TRUE TRUE TRUE
# [7,] TRUE TRUE TRUE TRUE TRUE TRUE
# [8,] TRUE TRUE TRUE TRUE TRUE TRUE
# [ reached getOption("max.print") -- omitted #### rows ]
```
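Printing thousands of `TRUE`s is hard to scan. As a sketch, wrapping the comparison in `all()` collapses it to a single answer, and `which()` points at any cells that differ. The two small data frames `a` and `b` below are made up for illustration.

```{r}
a <- data.frame(country = c("Kenya", "Gabon"), year = c(1952, 1952),
                stringsAsFactors = FALSE)
b <- a                      # an exact copy
b$year[2] <- 1957           # ...with one value changed

all(a == a)                 # every cell equal?
# [1] TRUE
all(a == b)                 # at least one cell differs
# [1] FALSE
which(a != b, arr.ind = TRUE)   # row and column of each difference
```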
# What can I do with my data?
To operate on data, type commands in the Console window, just like our earlier calculator-style approach;
```{r}
summary(gapminder)
#    country           continent              year         lifeExp           pop
#  Length:1704        Length:1704        Min.   :1952   Min.   :23.60   Min.   :6.001e+04
#  Class :character   Class :character   1st Qu.:1966   1st Qu.:48.20   1st Qu.:2.794e+06
#  Mode  :character   Mode  :character   Median :1980   Median :60.71   Median :7.024e+06
#                                        Mean   :1980   Mean   :59.47   Mean   :2.960e+07
#                                        3rd Qu.:1993   3rd Qu.:70.85   3rd Qu.:1.959e+07
#                                        Max.   :2007   Max.   :82.60   Max.   :1.319e+09
str(gapminder)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1704 obs. of 6 variables:
# $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
# $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
# $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
```
* summary() summarizes the object and provides basic summary statistics for each column within your data
* str() tells us the structure of an object (i.e., its dimensions/size and the class of each data column)
We can also use these commands on any object – e.g. the single numbers we created earlier (try it!)
There are also commands to get these statistics alone. For this we use the `$` symbol to tell R which column we are interested in.
```{r}
min(gapminder$lifeExp)
# [1] 23.599
median(gapminder$lifeExp)
# [1] 60.7125
max(gapminder$lifeExp)
# [1] 82.603
```
These are called FUNCTIONS, and are used to do a particular task on a set of data. Here we are accessing columns by using the dollar sign. We are telling R that we are only interested in one column.
We can also do more sophisticated things with these commands. Let's try a simple plot:
```{r}
plot(gapminder$lifeExp, gapminder$gdpPercap)
```
# Data Frames
The `gapminder` data we just imported is in an object called a Data Frame. A data frame holds data in a table format, like what you might be used to in Excel. A "tidy" data frame has columns that each represent a variable and rows which hold one observation.
As we saw before, individual columns in data frames are identified using the `$` symbol – just seen in the str() output.
Think of $ as ‘apostrophe-S’, i.e. gapminder`’S` lifeExp
New columns are created when you assign their values – here containing the life expectancy in months instead of years;
```{r}
gapminder$lifeExpMonths <- gapminder$lifeExp*12
str(gapminder)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1704 obs. of 7 variables:
# $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
# $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
# $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num 28.8 30.3 32 34 36.1 ...
# $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
# $ gdpPercap : num 779 821 853 836 740 ...
# $ lifeExpMonths: num 346 364 384 408 433 ...
summary(gapminder$lifeExpMonths)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 283.2 578.4 728.5 713.7 850.1 991.2
```
* Assigning values to existing columns over-writes existing values – again, with no warning
* With e.g. gapminder$newcolumn <- 0, the new column has every entry zero; R recycles this single value, for every entry
* It’s unusual to delete columns... but if you must; use `gapminder$lifeExpMonths <- NULL`
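These three behaviours can be tried safely on a small made-up data frame (here called `toy`, an illustrative name) before touching real data:

```{r}
toy <- data.frame(lifeExp = c(28.8, 30.3, 32.0))   # made-up three-row example
toy$lifeExpMonths <- toy$lifeExp * 12   # new column, computed row by row
toy$flag <- 0                            # a single value is recycled to every row
toy$flag
# [1] 0 0 0
toy$lifeExp <- toy$lifeExp + 1           # overwrites the old values, no warning
toy$flag <- NULL                         # removes the column again
names(toy)
# [1] "lifeExp"       "lifeExpMonths"
```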
Other functions useful for summarizing data frames, and their columns;
```{r}
names(gapminder)
# [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
# [7] "lifeExpMonths"
dim(gapminder) # dim is short for dimension
# [1] 1704 7
length(gapminder$lifeExp) # how many rows in our dataset?
# [1] 1704
min(gapminder$lifeExp)
# [1] 23.599
max(gapminder$lifeExp)
# [1] 82.603
range(gapminder$lifeExp)
# [1] 23.599 82.603
mean(gapminder$lifeExp)
# [1] 59.47444
sd(gapminder$lifeExp) # sd is short for standard deviation
# [1] 12.91711
median(gapminder$lifeExp)
# [1] 60.7125
median(gapminder.base$li) # base data frames match partial column names (hard to debug later); tibbles from readr do not
# [1] 60.7125
```
# EXERCISE
Import the gapminder data frame again.
Use `str()` to look at the structure of the dataframe and `summary()` to get information about the variables.
* What are its columns?
* How many rows and columns are there?
* What is the earliest year in the `year` column?
* What is the average life expectancy?
* What is the largest population?
[Answers](exercises/01_ImportingData_Answers.Rmd)
```{r}
gapminder <- read_delim("datasets/02_gapminder.txt",
"\t", escape_double = FALSE, trim_ws = TRUE)
str(gapminder)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1704 obs. of 6 variables:
# $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
# $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
# $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num 28.8 30.3 32 34 36.1 ...
# $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
# $ gdpPercap: num 779 821 853 836 740 ...
dim(gapminder)
```
# Subsetting
## Base R
Suppose we were interested in the life expectancy (i.e. 4th column) for Afghanistan in the years 1952, 1962, and 1972 (i.e. rows 1, 3, and 5). How do we select these multiple elements?
```{r}
gapminder[c(1, 3, 5), 4]
# A tibble: 3 × 1
# lifeExp
# <dbl>
# 1 28.801
# 2 31.997
# 3 36.088 # check these against data view
```
But what is `c(1,3,5)`? It’s a vector of numbers – `c()` is for combine;
```{r}
length(c(1, 3, 5))
# [1] 3
str(c(1, 3, 5))
# num [1:3] 1 3 5
```
We can select these rows and all the columns;
```{r}
gapminder[c(1, 3, 5),]
# A tibble: 3 × 6
# country continent year lifeExp pop gdpPercap
# <chr> <chr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Afghanistan Asia 1962 31.997 10267083 853.1007
# 3 Afghanistan Asia 1972 36.088 13079460 739.9811
```
A very useful special form of vector;
```{r}
1:10
# [1] 1 2 3 4 5 6 7 8 9 10
6:2
# [1] 6 5 4 3 2
-1:-3
# [1] -1 -2 -3
```
R expects you to know this shorthand – see e.g. its use of `1:3` in the output from `str()` above. For a ‘rectangular’ selection of rows and columns;
```{r}
gapminder[20:22, 3:4]
# A tibble: 3 x 2
# year lifeExp
# <int> <dbl>
# 1 1987 72.000
# 2 1992 71.581
# 3 1997 72.950
```
Negative values correspond to dropping those rows/columns;
```{r}
gapminder[-3:-1704,] # everything but the first two rows will be dropped
# A tibble: 2 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453 345.612
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530 363.984
```
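One thing to watch with negative ranges is where the minus sign applies. The sketch below uses a short made-up vector `v`; the same rules hold for the row and column positions of a data frame.

```{r}
v <- c(10, 20, 30, 40, 50)   # made-up vector
v[-1:-3]      # drop elements 1 to 3
# [1] 40 50
v[-(1:3)]     # the same thing, written with parentheses
# [1] 40 50
-1:3          # careful: without parentheses this runs from -1 UP to 3
# [1] -1  0  1  2  3
```

Because `-1:3` mixes negative and positive positions, using it inside square brackets gives an error rather than dropping anything.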
As well as storing numbers and character strings (like "United States", "Canada") R can also store logicals – `TRUE` and `FALSE`.
To make a new vector, with elements that are `TRUE` if life expectancy is above 71.5 and FALSE otherwise;
```{r}
is.above.avg <- gapminder$lifeExp > 71.5
```
Let's see how many of the total were TRUE and how many were FALSE using the table() function.
The table() function will create a count table from a vector of categorical data.
```{r}
table(is.above.avg)
# is.above.avg
# FALSE TRUE
# 1329 375
```
Which countries and during what years were these? (And what was the avg. life expectancy?)
```{r}
gapminder[is.above.avg,] # just the rows for which is.above.avg is TRUE
# A tibble: 375 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Albania Europe 1987 72.000 3075321 3738.933 864.000
# 2 Albania Europe 1992 71.581 3326498 2497.438 858.972
# 3 Albania Europe 1997 72.950 3428038 3193.055 875.400
# 4 Albania Europe 2002 75.651 3508512 4604.212 907.812
# 5 Albania Europe 2007 76.423 3600523 5937.030 917.076
# 6 Algeria Africa 2007 72.301 33333216 6223.367 867.612
# 7 Argentina Americas 1992 71.868 33958947 9308.419 862.416
# 8 Argentina Americas 1997 73.275 36203463 10967.282 879.300
# 9 Argentina Americas 2002 74.340 38331121 8797.641 892.080
# 10 Argentina Americas 2007 75.320 40301927 12779.380 903.840
gapminder[is.above.avg,4] # combining TRUE/FALSE (rows) and numbers (columns)
# A tibble: 375 x 1
# lifeExp
# <dbl>
# 1 72.000
# 2 71.581
# 3 72.950
# 4 75.651
# 5 76.423
# 6 72.301
# 7 71.868
# 8 73.275
# 9 74.340
# 10 75.320
```
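Besides `table()`, a quick way to count a logical vector is to add it up, because R treats `TRUE` as 1 and `FALSE` as 0. A sketch with made-up values (`life.exp` and `is.high` are illustrative names):

```{r}
life.exp <- c(28.8, 72.0, 75.7, 60.2)   # made-up values
is.high <- life.exp > 71.5
is.high
# [1] FALSE  TRUE  TRUE FALSE
sum(is.high)     # TRUE counts as 1, FALSE as 0: how many are TRUE?
# [1] 2
mean(is.high)    # proportion that are TRUE
# [1] 0.5
```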
One final method... for now!
Instead of specifying rows/columns of interest by number, or through vectors of `TRUE`s/`FALSE`s, we can also just give the names – as character strings, or vectors of character strings.
```{r}
gapminder[,'lifeExp']
# A tibble: 1,704 x 1
# lifeExp
# <dbl>
# 1 28.801
# 2 30.332
# 3 31.997
# 4 34.020
# 5 36.088
# 6 38.438
# 7 39.854
# 8 40.822
# 9 41.674
# 10 41.763
# # ... with 1,694 more rows
gapminder[gapminder$country == 'Gabon',c("lifeExp","gdpPercap")]
# A tibble: 12 x 2
# lifeExp gdpPercap
# <dbl> <dbl>
# 1 37.003 4293.476
# 2 38.999 4976.198
# 3 40.489 6631.459
# 4 44.598 8358.762
# 5 48.690 11401.948
# 6 52.790 21745.573
# 7 56.564 15113.362
# 8 60.190 11864.408
# 9 61.366 13522.158
# 10 60.461 14722.842
# 11 56.761 12521.714
# 12 56.735 13206.485
gapminder[gapminder$country == 'Gabon',4] # okay to mix & match
# A tibble: 12 x 1
# lifeExp
# <dbl>
# 1 37.003
# 2 38.999
# 3 40.489
# 4 44.598
# 5 48.690
# 6 52.790
# 7 56.564
# 8 60.190
# 9 61.366
# 10 60.461
# 11 56.761
# 12 56.735
```
This is more typing than the other options, but is much easier to debug/reuse.
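The outputs above are tibbles because `readr` returns tibbles, which always keep their table shape. A base data frame (such as `gapminder.base`) behaves a little differently when you pick a single column, as this sketch with a made-up two-row data frame shows:

```{r}
toy <- data.frame(country = c("Kenya", "Gabon"),
                  lifeExp = c(42.27, 37.003),
                  stringsAsFactors = FALSE)    # made-up two-row example
toy[, "lifeExp"]                  # base data frame: one column comes back as a plain vector
# [1] 42.270 37.003
toy[, "lifeExp", drop = FALSE]    # drop = FALSE keeps it as a one-column data frame
toy[toy$country == "Gabon", c("country", "lifeExp")]   # rows by condition, columns by name
```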
## dplyr
Remember how we mentioned earlier that data should be "tidy"? That is, each column holds one variable and each row holds one observation. The `tidyverse` has a package to help us manipulate data in a tidy way: `dplyr`.
If you haven't already, install dplyr:
```{r}
#install.packages("dplyr")
```
Don't forget to load the package so we can use its functionality
```{r}
library(dplyr)
```
dplyr works by piping commands, like you learned to do in the command line. Instead of the pipe `|`, we use `%>%`.
```{r}
gapminder %>% select(lifeExp) %>% min()
# [1] 23.599
min(gapminder$lifeExp)
# [1] 23.599
```
An important difference between `dplyr` and base R is that when we refer to columns by name, we don't need to enclose the names in quotation marks as we did above (e.g. `gapminder[,'lifeExp']`).
dplyr also comes with ways to subset our data.
If we only want to choose one column, we use `select`:
```{r}
# select(gapminder, lifeExp)
```
The above is valid code, but because we can pipe commands, it's best practice to just pipe all functions.
```{r}
gapminder %>% select(lifeExp)
# A tibble: 1,704 x 1
# lifeExp
# <dbl>
# 1 28.801
# 2 30.332
# 3 31.997
# 4 34.020
# 5 36.088
# 6 38.438
# 7 39.854
# 8 40.822
# 9 41.674
# 10 41.763
# # ... with 1,694 more rows
```
If we want to make a new column, use `mutate`. Don't forget we have to assign the result if we want to keep the changes.
```{r}
gapminder <- gapminder %>% mutate(NewColumn = lifeExp * 12)
gapminder
# A tibble: 1,704 x 8
# country continent year lifeExp pop gdpPercap lifeExpMonths NewColumn
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453 345.612 345.612
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530 363.984 363.984
# 3 Afghanistan Asia 1962 31.997 10267083 853.1007 383.964 383.964
# 4 Afghanistan Asia 1967 34.020 11537966 836.1971 408.240 408.240
# 5 Afghanistan Asia 1972 36.088 13079460 739.9811 433.056 433.056
# 6 Afghanistan Asia 1977 38.438 14880372 786.1134 461.256 461.256
# 7 Afghanistan Asia 1982 39.854 12881816 978.0114 478.248 478.248
# 8 Afghanistan Asia 1987 40.822 13867957 852.3959 489.864 489.864
# 9 Afghanistan Asia 1992 41.674 16317921 649.3414 500.088 500.088
# 10 Afghanistan Asia 1997 41.763 22227415 635.3414 501.156 501.156
# ... with 1,694 more rows
```
If we want to select all columns except 1, we can do that with the `-` operator. Remember that if we want to save anything we are doing, we must store it in a variable.
```{r}
gapminder <- gapminder %>% select(-NewColumn)
gapminder
# A tibble: 1,704 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453 345.612
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530 363.984
# 3 Afghanistan Asia 1962 31.997 10267083 853.1007 383.964
# 4 Afghanistan Asia 1967 34.020 11537966 836.1971 408.240
# 5 Afghanistan Asia 1972 36.088 13079460 739.9811 433.056
# 6 Afghanistan Asia 1977 38.438 14880372 786.1134 461.256
# 7 Afghanistan Asia 1982 39.854 12881816 978.0114 478.248
# 8 Afghanistan Asia 1987 40.822 13867957 852.3959 489.864
# 9 Afghanistan Asia 1992 41.674 16317921 649.3414 500.088
# 10 Afghanistan Asia 1997 41.763 22227415 635.3414 501.156
# ... with 1,694 more rows
```
Now what about subsetting rows? For this we use the `filter` command:
```{r}
gapminder %>% filter(lifeExp > 71.5)
# A tibble: 375 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Albania Europe 1987 72.000 3075321 3738.933 864.000
# 2 Albania Europe 1992 71.581 3326498 2497.438 858.972
# 3 Albania Europe 1997 72.950 3428038 3193.055 875.400
# 4 Albania Europe 2002 75.651 3508512 4604.212 907.812
# 5 Albania Europe 2007 76.423 3600523 5937.030 917.076
# 6 Algeria Africa 2007 72.301 33333216 6223.367 867.612
# 7 Argentina Americas 1992 71.868 33958947 9308.419 862.416
# 8 Argentina Americas 1997 73.275 36203463 10967.282 879.300
# 9 Argentina Americas 2002 74.340 38331121 8797.641 892.080
# 10 Argentina Americas 2007 75.320 40301927 12779.380 903.840
# ... with 365 more rows
```
We can pipe several commands, just like with the command line:
```{r}
gapminder %>% select(lifeExp, country) %>% filter(lifeExp > 71.5) %>% mutate(lifeExpdays = lifeExp * 365)
# A tibble: 375 x 3
# lifeExp country lifeExpdays
# <dbl> <chr> <dbl>
# 1 72.000 Albania 26280.00
# 2 71.581 Albania 26127.07
# 3 72.950 Albania 26626.75
# 4 75.651 Albania 27612.61
# 5 76.423 Albania 27894.40
# 6 72.301 Algeria 26389.87
# 7 71.868 Argentina 26231.82
# 8 73.275 Argentina 26745.38
# 9 74.340 Argentina 27134.10
# 10 75.320 Argentina 27491.80
# ... with 365 more rows
```
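To connect the two styles, here is a sketch that does the same job once with a `dplyr` pipeline and once with base R, on a made-up three-row data frame. It assumes `dplyr` is installed; `toy`, `piped` and `by.hand` are illustrative names.

```{r}
library(dplyr)

toy <- data.frame(country = c("Albania", "Kenya", "Algeria"),
                  lifeExp = c(72.0, 54.1, 72.3),
                  stringsAsFactors = FALSE)    # made-up three-row example

# dplyr: read left to right, one step per line
piped <- toy %>%
  filter(lifeExp > 71.5) %>%
  mutate(lifeExpdays = lifeExp * 365)

# base R: the same steps, written out one at a time
by.hand <- toy[toy$lifeExp > 71.5, ]
by.hand$lifeExpdays <- by.hand$lifeExp * 365

piped$lifeExpdays == by.hand$lifeExpdays
# [1] TRUE TRUE
```

Both versions give the same numbers; the piped version avoids repeating the data frame's name at every step.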
We can also use outside information to help subset data.
```{r}
two.countries <- c('Kenya', 'Gibon')
gapminder %>% filter(country %in% two.countries)
# A tibble: 12 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Kenya Africa 1952 42.270 6464046 853.5409 507.240
# 2 Kenya Africa 1957 44.686 7454779 944.4383 536.232
# 3 Kenya Africa 1962 47.949 8678557 896.9664 575.388
# 4 Kenya Africa 1967 50.654 10191512 1056.7365 607.848
# 5 Kenya Africa 1972 53.559 12044785 1222.3600 642.708
# 6 Kenya Africa 1977 56.155 14500404 1267.6132 673.860
# 7 Kenya Africa 1982 58.766 17661452 1348.2258 705.192
# 8 Kenya Africa 1987 59.339 21198082 1361.9369 712.068
# 9 Kenya Africa 1992 59.285 25020539 1341.9217 711.420
# 10 Kenya Africa 1997 54.407 28263827 1360.4850 652.884
# 11 Kenya Africa 2002 50.992 31386842 1287.5147 611.904
# 12 Kenya Africa 2007 54.110 35610177 1463.2493 649.320
```
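`%in%` checks every value in the `country` column against the character strings in the `two.countries` vector and returns `TRUE` wherever it finds one of them. Note that 'Gibon' matches no country in the dataset (the spelling in the data is 'Gabon'), so only the 12 rows for Kenya come back; `%in%` gives no warning about values that match nothing.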
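A minimal sketch of what `%in%` returns, using short made-up vectors (`countries` and `wanted` are illustrative names). Flipping the two sides around is a handy way to spot a misspelled entry:

```{r}
countries <- c("Kenya", "Gabon", "Kenya", "Chad")   # made-up column of names
wanted <- c("Kenya", "Gibon")                        # "Gibon" is misspelled on purpose

countries %in% wanted      # one TRUE/FALSE per element of countries
# [1]  TRUE FALSE  TRUE FALSE
wanted %in% countries      # flip it around to spot values that match nothing
# [1]  TRUE FALSE
```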
# EXERCISE
Reimport the `gapminder` dataframe:
Create a new dataframe that contains only country names, years, and life expectancies of the `gapminder` dataset. Use this new dataframe to calculate the minimum & maximum life expectancies.
[Answers](exercises/01_ImportingData_Answers.Rmd)
# Quitting R
When you’re finished with RStudio;
* Ctrl-Q, or the drop-down menus, or entering q() at the command line all start the exit process
* You will be asked “Save workspace image to ~/.RData?”
+ No/Don’t Save: nothing is saved, and is not available when you re-start. This is recommended, because you will do different things in each session
+ Yes: Everything in memory is stored in R’s internal format (.RData) and will be available when you re-start RStudio
+ Cancel: don’t quit, go back
* Writing about what you did (output from a script) often takes much longer than re-running that script’s analyses – so often, a ‘commented’ script is all the R you need to store
**Note:** To get rid of objects in your current session, use `rm()`, e.g. `rm(is.above.avg, new_gapminder, x, y)` ... or RStudio’s ‘broom’ button on the Environment tab.
# Summary
* In RStudio, read in data from the pop-up menu in the Environment window (or Tools menu)
* Data frames store data; can have many of these objects – and multiple other objects, too
* Identify vectors with $, subsets with square brackets
* Many useful summary functions are available, with sensible names
* Scripts are an important drudgery-avoidance tool!
References:
1. Lectures from Ken Rice at University of Washington, Summer Institute for Statistical Genetics - http://faculty.washington.edu/kenrice/rintro/indexSEA15.shtml
2. Scripts & Exercise from Asher Haug-Baltzell - https://github.com/asherkhb/intro_r