# Getting and Cleaning Data {#rprog2}
``` {r ch3-pkgs, echo=FALSE, message=FALSE}
library(tidyverse)
library(bookdown)
library(knitr)
```
## Ch. 3 Objectives
This chapter is designed around the following learning objectives. Upon
completing this chapter, you should be able to:
- Recognize what a flat file is and how it differs from data stored in a binary
file format
- Distinguish between delimited and fixed width formats for flat files
- Identify the delimiter in a delimited file
- Describe a working directory
- Demonstrate how to read in different types of flat files
- Demonstrate how to read in a few types of binary files (e.g., Matlab, Excel)
- Recognize the difference between relative and absolute file pathnames
- Describe the basics of your computer's directory structure
- Reference files in your directory structure using relative and absolute pathnames
- Apply the basic `dplyr` functions (e.g., `rename()`, `select()`, `mutate()`,
`slice()`, `filter()`, and `arrange()`) to work with data in a dataframe object
- Define a logical operator and know the R syntax for common logical operators
- Apply logical operators in conjunction with `dplyr`'s `filter()` function to
create subsets of a dataframe based on logical conditions
- Apply a sequence of `dplyr` functions to a dataframe using piping (`%>%`)
- Create R Markdown documents and describe their basic content and function
## Overview
There are four basic steps you will often repeat as you prepare to analyze data
in R:
1. Identify the location of the data. If it's on your computer, which
directory? If it's online, what link?
2. Read data into R (e.g., using a function like `read_delim()` or `read_csv()`
from the `readr` package) using the file path you figured out in step 1
3. Check to make sure the data came in correctly using functions like `dim()`, `head()`, `tail()`, `str()`, and/or `glimpse()`.
4. Clean the data up by removing missing (or nonsense) values, renaming or reclassifying variables, performing units conversions, or other actions that
support a streamlined data analysis.
In this chapter, I'll go over the basics for each of these steps and dive a bit
deeper into some related topics you should learn now to make your life easier
as you get started using R for data analysis.
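As a rough sketch of these four steps, the code below writes a tiny made-up file to a temporary location and then reads, checks, and cleans it. The file contents, the column names (`site`, `temp_f`), and the use of `-999` as a missing-value code are all assumptions invented for illustration; your own data will look different.

```r
library(readr)
library(dplyr)

# Step 1: identify the location of the data
# (a temporary file with made-up values stands in for a real data file)
path <- tempfile(fileext = ".csv")
writeLines(c("site,temp_f", "A,68", "B,-999", "C,77"), path)

# Step 2: read the data into R
temps <- read_csv(file = path, col_types = cols())

# Step 3: check that the data came in correctly
dim(temps)
head(temps)

# Step 4: clean the data (drop the nonsense value, convert units)
temps_clean <- temps %>%
  filter(temp_f != -999) %>%
  mutate(temp_c = (temp_f - 32) * 5 / 9)
temps_clean
```

The cleaning functions used in step 4 are covered in more detail later in this chapter.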
## Reading data into R
Data comes in files of all shapes and sizes. R has the capability to import
data from many file types and locations, even proprietary files for other
software. Here are some of the types of data files that R can read and work with:
- Flat files (more about these soon)
- Files from other software packages such as MATLAB or Excel
- Tables on webpages (e.g., the table on Ebola outbreaks near the end of [this
Wikipedia
page](http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa))
- Data in a database (e.g., MySQL, Oracle)
- Data in JSON and XML formats
- Really crazy data formats used in other disciplines (e.g., TDMS files from LabView, netCDF files from
climate research, MRI data stored in Analyze, NIfTI, and DICOM formats)
- Geographic shapefiles
- Data through Application Programming Interfaces (APIs; most websites use APIs
to ask you for input and then use that input to direct new information back to
you)
Often, it is possible to import and wrangle extremely messy data by using
functions like `scan()` and `readLines()` to read the data in a line at a time,
and then using regular expressions to clean up the data as it gets imported. For
this course, however, we will begin with less challenging file formats (and
degrees of messiness).
### Reading local flat files
Much of the data that you will want to read in will be in **flat files** that
are stored locally (i.e., on your computer's hard drive). A *flat file* is
basically a file that you can open using a text editor. The most
common type you'll work with is probably the comma-separated file, often with a
`.csv` or `.txt` file extension. Most flat files come in two general
categories:
1. Fixed width files, and
2. Delimited files, which include:
    - ".csv": Comma-separated values
    - ".tab", ".tsv": Tab-separated values
    - Other possible delimiters: colon, semicolon, pipe ("|")
*Fixed-width files* are files where a column always has the same width, for all
the rows in the column. These tend to look very neat and easy-to-read when you
open them in a text editor. For example, the first few rows of a fixed-width
file might look like this:
```
Course                          Number  Day    Time
Thermodynamics                  337     M/W/F  9:00-9:50
Aerosol Physics and Technology  577     M/W/F  10:00-10:50
```
Fixed-width files used to be very popular, and they make it easier to look at
data when you open the file in a text editor. Now, it's rare to just use a text
editor to open a file and check out the data. Also, these files can be a bit of
a pain to read into R and other programs because you sometimes have to specify
the length of each column. You may come across a fixed-width file every now and
then, particularly when working with older data, so it's useful to be able to
recognize one and to know how to import it.
*Delimited files* use some **delimiter** such as a comma or tab to separate
each column value within a row. The first few rows of a delimited file might
look like this:
```
Course, Number, Day, Time
"Thermodynamics", 337, "M/W/F", "9:00-9:50"
"Aerosol Physics and Technology", 577, "M/W/F", "10:00-10:50"
```
Delimited files are very easy to read into R. You just need to be able to
figure out what character is used as a delimiter and specify that to R in the
function call to read in the data.
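If you are not sure which delimiter a file uses, one approach is to print the first couple of lines as raw text before importing. The sketch below assumes a made-up, pipe-delimited version of the course table, written to a temporary file purely for illustration.

```r
library(readr)

# Write a small, made-up pipe-delimited file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("Course|Number|Day|Time",
             "Thermodynamics|337|M/W/F|9:00-9:50",
             "Aerosol Physics and Technology|577|M/W/F|10:00-10:50"),
           path)

# Peek at the raw text to spot the delimiter
readLines(con = path, n = 2)

# Tell R which delimiter to use when reading the file
courses <- read_delim(file = path, delim = "|", col_types = cols())
courses
```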
These flat files can have a number of different file extensions. The most
generic is `.txt`, but they will also have ones more specific to their format,
like `.csv` for a comma-delimited file (.csv stands for
**"comma-separated values"**), or `.fwf` for a fixed-width file.
R can read in data from both fixed-width and delimited flat files. The only
catch is that you need to tell R a bit more about the format of the flat file,
including whether it is fixed-width or delimited. If the file is fixed-width,
you will usually have to provide R with information about each column (see `read_fwf()` for details). If the file is delimited, you'll need to tell R which delimiter, such as comma or tab, is being used.
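For a fixed-width file, a minimal sketch with `read_fwf()` might look like the following. The column widths and column names here are assumptions chosen to fit a made-up version of the course table; `sprintf()` is used only to build example lines with consistent padding.

```r
library(readr)

# Build two made-up fixed-width lines: each column is padded to a set width
lines <- sprintf("%-31s%-4s%-6s%-11s",
                 c("Thermodynamics", "Aerosol Physics and Technology"),
                 c("337", "577"),
                 c("M/W/F", "M/W/F"),
                 c("9:00-9:50", "10:00-10:50"))
path <- tempfile(fileext = ".fwf")
writeLines(lines, path)

# Describe each column's width and name, then read the file
courses <- read_fwf(file = path,
                    col_positions = fwf_widths(widths = c(31, 4, 6, 11),
                                               col_names = c("course", "number",
                                                             "day", "time")),
                    col_types = cols())
courses
```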
The `read_delim()` family of functions is used to read delimited flat files into
R. These functions come from the `readr` package, which we will use extensively
in this course. All members of the `read_delim()` family do the same basic thing:
import flat files into a `tibble`. The major difference is what defaults each
function has for the delimiter (`delim`). Members of the `read_delim()` family include:
Function | Delimiter
--------------- | ------------
`read_csv()` | comma
`read_csv2()` | semi-colon
`read_table2()` | whitespace
`read_tsv()` | tab
You can use `read_delim()` to read in any delimited file, regardless of the
delimiter; however, you will need to specify the type of delimiter using the
`delim` argument. If you remember the more specialized function call (e.g.,
`read_csv()` for a comma-delimited file), you can save yourself some typing.
For example, to read in the Ebola data file, which is comma-delimited,
you could either use `read_delim()` with a `delim` argument specified or use
`read_csv()`, in which case you don't have to specify `delim`:
```{r ebola-delim, message=FALSE}
library(package = "readr")
# The following two calls do the same thing
ebola <- readr::read_delim(file = "data/country_timeseries.csv", delim = ",")
```
```{r ebola-csv}
ebola <- readr::read_csv(file = "data/country_timeseries.csv")
```
```{block, type='rmdtip'}
The message that R prints after this call ("Parsed with column specification:
... ") lets you know what classes were used for each column. This function
tries to guess the appropriate class and typically gets it right. You can
suppress the message using the `col_types = cols()` argument, or by adjusting
the code chunk options in an R Markdown. If `readr` doesn't correctly assign
some of the column classes, you can use the `type_convert()` function for R to
guess again after you've tweaked the formats of the rogue columns.
```
This family of functions has a few other helpful options you can specify. For
example, if you want to skip the first few lines of a file before you start
reading in the data, you can use the `skip` argument to set the number of lines to skip.
If you only want to read in a few lines of the data, you can use the `n_max`
option. For example, if you have a really large file, and you want to save time
by only reading in the first ten lines, as you figure out what other optional
arguments to use in `read_delim()` for that file, you could include the option
`n_max = 10`. Here is a table of some of the most useful options common to the
`read_delim()` family of functions:
Option | Description
------- | -----------
`skip` | How many lines of the start of the file should you skip?
`col_names` | Use the column names provided or define your own names?
`col_types` | What are the column types (e.g., chr, num, int, logi)?
`n_max` | How many rows do you want to read in?
`na` | How are missing values coded?
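To see several of these options working together, here is a small sketch. The file contents (two comment lines, then a header and four rows) and the `-999` missing-value code are made up for illustration.

```r
library(readr)

# A made-up file: two lines of notes, then the header and data
path <- tempfile(fileext = ".csv")
writeLines(c("# exported from a logger",
             "# units: degrees C",
             "site,temp",
             "A,21.5",
             "B,-999",
             "C,19.0",
             "D,22.1"),
           path)

temps <- read_csv(file = path,
                  skip = 2,        # skip the two lines of notes
                  n_max = 3,       # read only the first three data rows
                  na = "-999",     # treat -999 as a missing value
                  col_types = cols(site = col_character(),
                                   temp = col_double()))
temps
```

Setting `col_types` explicitly, as done here, also keeps `readr` from having to guess the class of each column.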
```{block, type='rmdnote'}
Remember that you can always find out more about a function by looking at its
help file. For example, check out `?read_delim` and `?read_fwf` (note the lack
of parentheses). You can also use the help files to determine the default
values of arguments for each function.
```
So far, we've only looked at functions from the `readr` package for reading in
data files. There is a similar family of functions available in base R, the
`read.table()` family of functions. The `readr` family of functions is very
similar to the base R `read.table()` functions, but has some more sensible
defaults. Compared to the `read.table()` function family, the `readr`
functions are:
- Faster; show progress bar of data import
- Work better with large datasets
- Have more sensible defaults (e.g., characters default to characters, not
factors)
I recommend that you always use the `readr` functions rather than their base R
alternatives, given these advantages; however, you are likely to come across
code with these base R functions, so it is helpful to be aware of them.
Functions in the `read.table` family include:
- `read.csv()`
- `read.delim()`
- `read.table()`
- `read.fwf()`
Note: these base R functions use periods (`read.`) whereas the `readr` functions
use underscores (`read_`).
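As a quick illustration of the two families side by side, the sketch below reads the same made-up file with `read.csv()` and with `read_csv()`. The file contents are an assumption for illustration only.

```r
library(readr)

# A small, made-up comma-delimited file
path <- tempfile(fileext = ".csv")
writeLines(c("course,number", "Thermodynamics,337"), path)

# Base R version (note the period in the function name)
base_df <- read.csv(file = path)

# readr version (note the underscore in the function name)
tidy_df <- read_csv(file = path, col_types = cols())

# Both are dataframes, but the readr version is also a tibble
class(base_df)
class(tidy_df)
```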
```{block, type='rmdnote'}
The `readr` package is a member of the `tidyverse` suite of R packages. The
*tidyverse* describes an evolving collection of R packages with a common
philosophy and approach, and they are unquestionably changing the way people
code in R. Many of these R packages were developed in part or full by Hadley
Wickham and others at RStudio. Many of these packages are less than ten years
old but have been rapidly adopted by the R community. As a result, newer
examples of R code will often look very different from the code in older R
scripts, including examples in books that are more than a few years old. In
this course, I'll focus on `tidyverse` functions when possible, but I do put in
details about base R equivalent functions or processes at some points. This
will help you interpret older code. You can download all the `tidyverse`
packages at the same time with `install.packages("tidyverse")` and make all the
`tidyverse` functions available for use with `library("tidyverse")`.
```
### Reading in other file types
Later in the course, we'll talk about how to open a variety of other file types
in R. You might find it immediately useful to be able to read in files from
other statistical programs.
There are two "tidyverse" packages, `readxl` and `haven`, that help with this.
They allow you to read in files from the following formats:
```{r read-table, echo=FALSE}
read_funcs <- data.frame(file_type = c("Excel",
"SAS",
"SPSS",
"Stata"),
func = c("`read_excel()`",
"`read_sas()`",
"`read_spss()`",
"`read_stata()`"),
pkg = c("`readxl`",
"`haven`",
"`haven`",
"`haven`"))
knitr::kable(read_funcs, col.names = c("File type", "Function", "Package"))
```
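As a brief sketch of the Excel case, the code below reads an example workbook that, to my knowledge, ships with the `readxl` package and is located with its `readxl_example()` helper. If that example file is not available in your installed version, substitute the path to any Excel file of your own.

```r
library(readxl)

# Path to an example workbook bundled with readxl (assumed to be installed)
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook, then read the first sheet
excel_sheets(path = xlsx_path)
datasets <- read_excel(path = xlsx_path)
head(datasets)
```

The `haven` functions listed in the table follow the same pattern: give the function a file path and it returns a dataframe.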
## Directories and pathnames
### Directory structure
So far, we've only looked at reading in files that are located in your current
working directory. For example, if you're working in an R Project, by default
the project will open with that directory as the working directory, so you can
read files that are saved in that project's main directory using only the file
name as a reference.
You will often want to read in files that are located somewhere else on
your computer, or even files that are saved on another computer or posted
online. Doing this is very similar to reading in a file that is in your current
working directory; the only difference is that you need to give R some
directions so it can find the file.
The most common case will be reading in files in a subdirectory of your current
working directory. For example, you may have created a "data" subdirectory in
one of your R Projects directories to keep all the project's data files in the
same place while keeping the structure of the main directory fairly clean. In
this case, you'll need to direct R into that subdirectory when you want to read
one of those files.
To understand how to give R these directions, you need to have some
understanding of the directory structure of your computer. It seems a bit of a
pain and a bit complex to have to think about computer directory structure in
the "basics" part of this class, but this structure is not terribly complex
once you get the idea of it. There are a couple of very good reasons why it's
worth learning now.
First, many of the most frustrating errors you get when you start using R
trace back to misunderstanding directories and filepaths. For example, when you try
to read a file into R using only the filename, and that file is not in your
current working directory, you will get an error like:
```
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'Ex.csv': No such file or directory
```
This error is especially frustrating when you're new to R because it happens at
the very beginning of your analysis---you can't even import the data. Also, if
you don't understand a bit about working directories and how R looks for the
file you're asking it to find, you'd have no idea where to start to fix this
error. Second, once you understand how to use pathnames, especially relative
pathnames, to tell R how to find a file that is in a directory other than your
working directory, you will be able to organize all of your files for a project
in a much cleaner way. For example, you can create a directory for your
project, then create one subdirectory to store all of your R scripts, and
another to store all of your data, and so on. This can help you keep very
complex projects more structured and easier to navigate.
Your computer organizes files through a collection of directories. Chances are,
you are fairly used to working with these in your daily life already, although
you may call them "folders" rather than "directories". For example, you've
probably created new directories to store data files and Word documents for a
specific project.
Figure \@ref(fig:filedirstructure) gives an example file directory structure
for a hypothetical computer. Directories are shown in blue, and files in green.
```{r filedirstructure, echo=FALSE, out.width="600pt", fig.align="center", fig.cap="An example of file directory structure."}
knitr::include_graphics("figures/FileDirectoryStructure.png")
```
Notice a few interesting things from Figure \@ref(fig:filedirstructure). First,
you might notice the structure includes a few of the directories that you use a
lot on your own computer, like `Desktop`, `Documents`, and `Downloads`. Next,
the directory at the very top is the computer's root directory, `/`. For a PC,
the root directory might be something like `C:`. For Unix and Macs, it's usually
`/`. Finally, if you look closely, you'll notice that it's possible to have
different files in different locations of the directory structure with the same
file name. For example, in the figure, there are files named `heat_mort.csv` in
both the `CourseText` directory and in the `example_data` directory. These are
two different files with different contents, but they can have the same name as
long as they're in different directories. This fact---that you can have files
with the same name in different places---should help you appreciate how useful
it is that R requires you to give very clear directions to describe exactly
which file you want R to read in, if you aren't importing something in your
current working directory.
You will have a home directory somewhere near the top of your structure,
although it's likely not your root directory. In the hypothetical computer in
Figure \@ref(fig:filedirstructure), the home directory is
`/Users/brookeanderson`. I'll describe just a bit later how you can figure out
what your own home directory is on your own computer.
### Working directory
When you run R, it's always running from within some working directory, which
will be one of the directories somewhere in your computer's directory
structure. At any time, you can figure out which directory R is working in by
running the command `getwd()` (short for "get working directory"). For example,
my R session is currently running in the following directory:
```{r getwd}
getwd()
```
This means that, for my current R session, R is working in the
`edar_coursebook` subdirectory of my `johnvolckens` directory (home directory).
There are a few general rules for which working directory R selects when
you open an R session. These are not absolute rules, but they're generally
true. If you have R closed, and you open it by double-clicking on an R script,
then R will generally open with, as its working directory, the directory in
which that script is stored. This is often a very convenient convention,
because often any of the data you'll import for that script is somewhere
near where the script file is saved in the directory structure. If you open R
by double-clicking on the R icon in "Applications" (or the start menu on a
PC), R will start in its default working directory. You can find out what this
is, or change it, in RStudio's "Preferences". Finally, if you open an R
Project, R will start in that project's working directory---where the `.Rproj`
file for the project is stored. This is one of the reasons why we always create
a new R Project when starting a data analysis - RStudio projects remember where
to look!
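A few base R functions are handy for checking where R is working and what it can see from there. This is only a sketch; the `data` subdirectory checked at the end is an assumption about how your project might be organized.

```r
# Where is R currently working?
wd <- getwd()
wd

# Just the name of the working directory, without the full path
basename(wd)

# Which files and subdirectories can R see from here?
list.files(path = ".")

# Is there a "data" subdirectory in the working directory?
has_data_dir <- dir.exists("data")
has_data_dir
```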
### File and directory pathnames
Once you get a picture of how your directories and files are organized, you can
use pathnames, either absolute or relative, to read in files from different
directories outside your current working directory. Pathnames are the
directions for getting to a directory or file stored on your computer.
When you want to reference a directory or file, you can use one of two types of
pathnames:
- *Relative pathname*: How to get to the file or directory from your current working directory
- *Absolute pathname*: How to get to the file or directory from anywhere on the computer
Absolute pathnames are a bit more straightforward conceptually because they
don't depend on your current working directory; however, they're also a lot
longer to write and very inconvenient if you'll be sharing some of your code
with other people who might try to run it on their own computers. I'll explain
this second point a bit more later in this section.
I **strongly advise against the use of absolute pathnames** because of the
aforementioned collaborative issue, but I will include some details here
nonetheless. *Absolute pathnames* give the full directions to a directory or
file, starting all the way at the root directory. For example, the
`daily_show_guests.csv` file in the `data` directory has the absolute pathname:
```
"/Users/johnvolckens/Teaching/DataSci/edar_coursebook/data/daily_show_guests.csv"
```
You can use this absolute pathname to read this file in using any of the
`readr` functions to read in data. This absolute pathname will *always* work,
regardless of your current working directory, because it gives directions from
the root. In other words, it will always be clear to R exactly what file you're
talking about. Here's the code to use to read that file in using the
`read_csv()` function with the file's absolute pathname:
```{r daily-abs, eval=FALSE}
daily_show <- readr::read_csv(file = "/Users/johnvolckens/Teaching/DataSci/edar_coursebook/data/daily_show_guests.csv", skip = 4)
```
The *relative pathname*, on the other hand, gives R the directions for how to
get to a directory or file from the current working directory. If the file or
directory you're looking for is pretty close to your current working directory
in your directory structure, then a relative pathname can be a much shorter way
to tell R how to get to the file than an absolute pathname. But, the relative
pathname depends on your current working directory---the relative pathname that
works perfectly when you're working in one directory will not work at all once
you move into a different working directory.
As an example of a relative pathname, say your working directory is
`edar_coursebook` and you want to read in the `daily_show_guests.csv` file
in the `data` directory (one of the `edar_coursebook` subdirectories). To get from `edar_coursebook` to that file,
you'd need to look in the subdirectory `data`, where you could find
`daily_show_guests.csv`. Therefore, the relative pathname from your working directory would be:
```
"data/daily_show_guests.csv"
```
You can use this relative pathname to tell R where to find and read in the
file:
```{r daily-rel, eval=FALSE}
daily_show <- readr::read_csv("data/daily_show_guests.csv")
```
While this pathname is much shorter than the absolute pathname, it is important to remember that if you are working in a different working directory, this relative pathname would no longer work.
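One way to see how the two kinds of pathnames relate is to build the absolute pathname from the working directory and the relative pathname. In this sketch, `file.path()` simply joins the pieces with a `/`; the file does not need to exist for the code to run.

```r
# The relative pathname, starting from the current working directory
rel_path <- file.path("data", "daily_show_guests.csv")
rel_path

# The matching absolute pathname: working directory + relative pathname
abs_path <- file.path(getwd(), rel_path)
abs_path

# The relative pathname only points to a real file if this is TRUE
file.exists(rel_path)
```

If you change the working directory, `rel_path` stays the same but `abs_path` changes, which is exactly why a relative pathname can stop working.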
There are a few abbreviations that can be really useful for pathnames:
```{r paths, echo=FALSE}
dirpath_shortcuts <- data.frame(abbr = c("`~`", "`.`", "`..`", "`../..`"),
meaning = c("Home directory",
"Current working directory",
"One directory up from current working directory",
"Two directories up from current working directory"))
knitr::kable(dirpath_shortcuts, col.names = c("Shorthand", "Meaning"))
```
These can help you keep pathnames shorter and also help you move "up-and-over" to get to a file or directory that's not on the direct path below your current working directory.
For example, my home directory is `/Users/johnvolckens`. You can use the
`list.files()` function to list all the files in a directory. If I wanted to list all the files in my `Downloads` directory, which is a direct sub-directory of my home directory, I could use:
```
list.files("~/Downloads")
```
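If you are curious what these abbreviations stand for on your own computer, base R can expand them into full pathnames. This is a small sketch; the output will differ from one computer to the next.

```r
# What does "~" (the home directory) expand to?
home <- path.expand("~")
home

# Full pathnames for "." (working directory) and ".." (one directory up)
here <- normalizePath(".")
parent <- normalizePath("..")
c(here = here, parent = parent)
```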
As a second example, say I was working in the working directory `CourseText`
(see Figure \@ref(fig:filedirstructure)), but I wanted to read in the `heat_mort.csv` file that's in the `example_data`
directory, rather than the one in the `CourseText` directory. I can use the
`..` abbreviation to tell R to look up one directory from the current working
directory, and then down within a subdirectory of that. The relative pathname
in this case is:
```
"../Week2_Aug31/example_data/heat_mort.csv"
```
The `../` tells R to look one directory up from the working directory (the
directory that is one level above the current directory is also known as the
**parent directory**), which in this case is `RCourseFall2015`, and then
down within that directory to `Week2_Aug31`, then to `example_data`, and then
to look within that directory for the file `heat_mort.csv`.
The code to read this file while R is working in the `CourseText`
directory would be:
```
heat_mort <- read_csv("../Week2_Aug31/example_data/heat_mort.csv")
```
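To try out the `..` abbreviation without touching your own files, you can recreate a piece of the example directory structure in a temporary location. This is a sketch under stated assumptions: the directory names follow the figure, but the contents of `heat_mort.csv` are made up.

```r
# Recreate part of the example structure inside a temporary directory
root <- file.path(tempdir(), "RCourseFall2015")
dir.create(file.path(root, "CourseText"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "Week2_Aug31", "example_data"),
           recursive = TRUE, showWarnings = FALSE)
writeLines(c("date,deaths", "2015-08-31,3"),   # made-up contents
           file.path(root, "Week2_Aug31", "example_data", "heat_mort.csv"))

# Move into CourseText, remembering the previous working directory
old_wd <- setwd(file.path(root, "CourseText"))

# Up one directory, then down into Week2_Aug31/example_data
found <- file.exists("../Week2_Aug31/example_data/heat_mort.csv")
found

# Go back to where we started
setwd(old_wd)
```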
Relative pathnames "break" as soon as you try them from a different working
directory---this fact might make it seem like you would never want to use
relative pathnames, and would always want to use absolute ones instead, even if
they're longer. If that were the only consideration (length of the pathname),
then perhaps that would be true. However, as you do more and more in R, there
will likely be many occasions when you want to use relative pathnames instead.
They are particularly useful if you ever want to share a whole directory, with
all subdirectories, with a collaborator. In that case, if you've used relative
pathnames, all the code should work fine for the person you share with, even
though they're running it on their own computer. Conversely, if you'd used
absolute pathnames, none of them would work on another computer, because the
"top" of the directory structure (i.e., for me, `/Users/johnvolckens/`)
will definitely be different for your collaborator's computer than it is
for yours.
If you're getting errors reading in files, and you think it's related to the
relative pathname you're using, it's often helpful to use `list.files()` to
make sure the file you're trying to load is in the directory guided by the
relative pathname. The `list.files()` function is very useful because it
returns a character vector of filenames (and paths, if desired). Once you have a
vector of filenames you can do things like ask logical questions (does this file
exist?), or count the number of files, or pass a relative path to a new
function...
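Here is a short sketch of those uses of `list.files()`. The directory and its three empty files are created in a temporary location just for this example, and the file names are made up.

```r
# Set up a temporary directory with a few empty, made-up files
data_dir <- file.path(tempdir(), "list_files_demo")
dir.create(data_dir, showWarnings = FALSE)
file.create(file.path(data_dir, c("a.csv", "b.csv", "notes.txt")))

# List only the files whose names end in ".csv"
csv_files <- list.files(path = data_dir, pattern = "\\.csv$")
csv_files

# Count the files, or ask whether a particular file is there
length(csv_files)
"a.csv" %in% csv_files

# Ask for pathnames rather than bare file names
csv_paths <- list.files(path = data_dir, pattern = "\\.csv$",
                        full.names = TRUE)
csv_paths
```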
### Tangent: `paste`
This is a good opportunity to explain how to use some functions that can be
very helpful when you're using relative or absolute pathnames: `paste()` and
`paste0()`. It's important that you understand that you can save a pathname
(absolute or relative) as an R object and then use that R object in calls to
later functions like `list.files()` and `read_csv()`. For example, to use the
absolute pathname to read the `heat_mort.csv` file in the `CourseText`
directory, you could run:
```
my_file <- "/Users/brookeanderson/Desktop/RCourseFall2015/CourseText/heat_mort.csv"
heat_mort <- read_csv(file = my_file)
```
You'll notice from this code that the pathname to get to a directory or file
can sometimes become ungainly and long. To keep your code cleaner, you can
address this by using the `paste` or `paste0` functions. These functions come
in handy in a lot of other applications, too, but this is a good place to
introduce them.
The `paste()` function is very straightforward. It takes, as inputs, a series
of different character strings you want to join together, and it pastes them
together in a single character string. (As a note, this means that your
resulting vector will only be one element long for basic uses of `paste()`,
while the inputs will be several different character strings.) You separate all
the different things you want to paste together using commas in the
function call. For example:
```{r paste-days}
paste("Sunday", "Monday", "Tuesday")
length(x = c("Sunday", "Monday", "Tuesday"))
length(x = paste("Sunday", "Monday", "Tuesday"))
```
The `paste()` function has an option called `sep = `. This tells R what you
want to use to separate the values you're pasting together in the output. The
default is for R to use a space, as shown in the example above. To change the
separator, you can change this option, and you can put in just about anything
you want. For example, if you wanted to paste all the values together without
spaces, you could use `sep = ""`:
```{r paste-days-sep}
paste("Sunday", "Monday", "Tuesday", sep = "")
```
As a shortcut, instead of using the `sep = ""` option, you could achieve the
same thing using the `paste0` function. This function is almost exactly like
`paste`, but it uses `""` (i.e., no space) as the separator between
values by default:
```{r paste0-days}
paste0("Sunday", "Monday", "Tuesday")
```
With pathnames, you will usually not want spaces. Therefore, you could think
about using `paste0()` to write an object with the pathname you want to
ultimately use in commands like `list.files()` and `setwd()`. This will allow
you to keep your code cleaner, since you can now divide long pathnames over
multiple lines:
```
my_file <- paste0("/Users/brookeanderson/Desktop/",
"RCourseFall2015/CourseText/heat_mort.csv")
heat_mort <- read_csv(file = my_file)
```
You will end up using `paste()` and `paste0()` for many other applications, but
pathnames are a good first example for getting a feel for these functions.
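As one more sketch of this idea, `paste0()` also works element by element when one of its inputs is a vector, which can be handy for building several pathnames at once. The directory and file names below are placeholders made up for illustration, not files that come with the course:

```{r paste0-paths-sketch}
# Hypothetical directory and file names, used only for illustration
example_dir <- "data/"
example_files <- c("first_file.csv", "second_file.csv")

# paste0() recycles the single directory string across both file names
example_paths <- paste0(example_dir, example_files)
example_paths
length(x = example_paths)
```

In this case the result has one element per file name, because the inputs were not all single character strings.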
### Reading online flat files
So far, I've only shown you how to import data from files that are saved to
your computer. R can also read in data directly from the web. If a flat file is
posted online, you can read it into R in almost exactly the same way that you
would read in a local file. The only difference is that you will use the file's
URL instead of a local file path for the `file` argument.
With the `read_*` family of functions, you can do this both for flat files from
a non-secure webpage (i.e., one that starts with `http`) and for files from a
secure webpage (i.e., one that starts with `https`), including GitHub and
Dropbox.
For example, to read in data from this
[GitHub repository of Ebola data](https://raw.githubusercontent.com/cmrivers/ebola/master/country_timeseries.csv){target="_blank"}, you can run:
```{r ebola-url, message=FALSE}
url <- paste0("https://raw.githubusercontent.com/cmrivers/",
"ebola/master/country_timeseries.csv")
ebola <- readr::read_csv(file = url)
dplyr::slice(.data = dplyr::select(.data = ebola, 1:3), 1:3)
```
## Data cleaning
Once you have loaded data into R, you'll likely need to clean it up a little
before you're ready to analyze it. Here, I'll go over the first steps of how to
do that with functions from `dplyr`, another package in the tidyverse. Here are
some of the most common data-cleaning tasks, along with the corresponding
`dplyr` function for each:
```{r dplyr-verbs, echo=FALSE}
library(package = "tibble")
dc_func <- tibble(task = c("Renaming columns",
"Filtering to certain rows",
"Selecting certain columns",
"Adding or changing columns"),
func = c("`rename()`",
"`filter()`",
"`select()`",
"`mutate()`"))
knitr::kable(dc_func, col.names = c("Task", "`dplyr` function"))
```
In this section, I describe how to do each of these four tasks. For the
examples in this section, I use example data listing guests to the Daily Show.
To follow along with these examples, you'll want to load that data, as well as
load the `dplyr` package. Install it using `install.packages()` if you have not
done so already.
```{r daily-load, message=FALSE}
library("dplyr")
daily_show <- readr::read_csv(file = "data/daily_show_guests.csv", skip = 4)
```
I've used this data in previous examples, but as a reminder, here's what it
looks like:
```{r daily-head}
head(x = daily_show)
```
### Renaming columns
A first step is often renaming the columns of the dataframe. It can be hard to
work with a column name that:
- is long
- includes spaces or other special characters
- includes uppercase letters
You can check out the column names for a dataframe using the `colnames()`
function, with the dataframe object as the argument. Several of the column
names in `daily_show` have some of these issues:
```{r daily-col}
colnames(x = daily_show)
```
To rename these columns, use `rename()`. The basic syntax is:
```{r rename-generic, eval=FALSE}
## generic code; will not run
dplyr::rename(.data = dataframe,
new_column_name_1 = old_column_name_1,
new_column_name_2 = old_column_name_2)
```
The first argument is the dataframe for which you'd like to rename columns.
Then you list each pair of new and old column names (in that order) for each
of the columns you want to rename. To rename columns in the `daily_show` data
using `rename()`, for example, you would run:
```{r daily-rename}
daily_show <- dplyr::rename(.data = daily_show,
year = YEAR,
job = GoogleKnowlege_Occupation,
date = Show,
category = Group,
guest_name = Raw_Guest_List)
head(x = daily_show, 3)
```
```{block, type='rmdwarning'}
Many of the functions in tidyverse packages, including those in `dplyr`,
provide exceptions to the general rule about quotation marks. Unfortunately,
this may make it a bit hard to learn when to use quotation marks. One way to
think about this, which is a bit of an oversimplification but can help as
you're learning, is to assume that anytime you're using a `dplyr` function,
every column in the dataframe you're working with has been loaded to your R
session as its own object, which means you don't need to use quotation marks
around column names, at least most of the time.
```
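To make this rule of thumb concrete, here is a minimal sketch that selects the same column once with `dplyr` (no quotation marks) and once with base R square bracket indexing (with quotation marks). The `toy_guests` data frame is made up for illustration and is not part of the course data:

```{r quote-rule-sketch}
# Made-up toy data, for illustration only
toy_guests <- tibble::tibble(guest_name = c("Guest A", "Guest B"),
                             category = c("Science", "Acting"))

# dplyr: the column name is written without quotation marks
dplyr_result <- dplyr::select(.data = toy_guests, category)

# Base R square bracket indexing: the column name is a quoted character string
base_result <- toy_guests[ , "category"]

colnames(x = dplyr_result)
colnames(x = base_result)
```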
If you have been paying close attention to the code snippets, you may have
noticed the last bit of code included both the package name and the function
name, separated by two colons, as in `dplyr::rename()`. This
`package_name::function_name()` syntax is used for the sake of being
**explicit**, because (as you may have guessed) some R packages use the same
name for functions that do entirely different things! For example, both the
base R `stats` package and the `dplyr` package have a function called
`filter()`: the former is used to apply linear filters to time-series objects,
and the latter is used to pick out rows from a data frame. When two loaded
packages contain functions with the same name, R will default to using the
function from the most recently loaded package (and send you a message stating
as much). This can be tricky business when your R session has many packages
loaded, which is why it never hurts to be **explicit** in your function calls.
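As a small sketch of why being explicit matters, the two `filter()` functions can be called side by side on made-up numbers. The object names here are hypothetical and used only for this illustration:

```{r filter-namespace-sketch}
# Made-up numbers, for illustration only
toy_values <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter; here, a 3-point moving average
moving_avg <- stats::filter(x = toy_values, filter = rep(1 / 3, times = 3))
moving_avg

# dplyr::filter() keeps the rows of a data frame that meet a condition
toy_df <- tibble::tibble(value = toy_values)
large_values <- dplyr::filter(.data = toy_df, value > 3)
large_values
```

With the `package::function()` form, each call does what you intend regardless of the order in which packages were loaded.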
### Selecting columns
Next, you may want to select only some columns of the dataframe. You can use
the `select()` function from `dplyr` to subset the dataframe to certain
columns. The basic structure of this command is:
```{r select-generic, eval=FALSE}
## generic code; will not run
dplyr::select(.data = dataframe, column_name_1, column_name_2, ...)
```
In this call, you first specify the dataframe to use and then list all of the
column names to include in the output dataframe, with commas between each
column name. For example, to select all columns in `daily_show` except `year`
(since that information is already included in `date`), run:
```{r select-daily}
dplyr::select(.data = daily_show, job, date, category, guest_name)
```
```{block, type='rmdwarning'}
Don't forget that, if you want to change column names in the saved object, you
must reassign the object to be the output of `rename()`. If you run one of
these cleaning functions without reassigning the object, R will print out the
result, but the object itself won't change. You can take advantage of this, as
I've done in this example, to look at the result of applying a function to a
dataframe without changing the original dataframe. This can be helpful as
you're figuring out how to write your code.
```
The `select()` function also provides some time-saving tools. In the last
example, we wanted all the columns except one. Instead of writing out all the
columns we want, we can use `-` with only the columns we don't want to
save time (notice the object reassignment/override):
```{r select-inverse-daily}
daily_show <- dplyr::select(.data = daily_show, -year)
head(x = daily_show, n = 3)
```
### Add or change columns
You can change a column or add a new column using the `mutate()` function from
the `dplyr` package. That function has the syntax:
```{r mutate-generic, eval=FALSE}
# generic code; will not run
dplyr::mutate(.data = dataframe,
changed_column = function(changed_column),
new_column = function(other arguments))
```
For example, the `job` column in `daily_show` sometimes uses upper case and
sometimes does not. This call uses the `unique()` function to list only unique
values in this column:
```{r unique-job-daily}
head(x = unique(x = daily_show$job), n = 10)
```
To make all the observations in the `job` column lowercase, use the
`str_to_lower()` function from the `stringr` package within a `mutate()`
function:
```{r mutate-str-daily}
library(package = "stringr")
mutate(.data = daily_show,
job = str_to_lower(string = job))
```
We will take a deeper dive into strings and the `stringr` package
[later on](#rprog3).
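The example above changes an existing column. As a minimal sketch of the other use shown in the generic syntax, the call below changes one column and adds a new one in the same `mutate()` call. The `toy_jobs` data and the `name_length` column are made up for illustration:

```{r mutate-new-column-sketch}
# Made-up toy data, for illustration only
toy_jobs <- tibble::tibble(guest_name = c("Guest A", "Guest BB"),
                           job = c("Physicist", "actor"))

toy_jobs <- dplyr::mutate(.data = toy_jobs,
                          # change an existing column
                          job = stringr::str_to_lower(string = job),
                          # add a new column with the number of characters in each name
                          name_length = stringr::str_length(string = guest_name))
toy_jobs
```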
### Filtering to certain rows
Next, you might want to filter the dataset to certain rows. For example, you
might want to get a dataset with only the guests from 2015, or only guests who
are scientists.
You can use the `filter()` function from `dplyr` to filter a dataframe down to
a subset of rows. The syntax is:
```{r filter-generic, eval=FALSE}
## generic code; will not run
filter(.data = dataframe, logical expression)
```
The `logical expression` in this call gives the condition that a row must meet
to be included in the output data frame. For example, if you want to create a
data frame that only includes guests who were scientists, you can run:
```{r filter-daily}
scientists <- filter(.data = daily_show,
category == "Science")
head(x = scientists)
```
To build a logical expression to use in `filter`, you'll need to know some of
R's logical operators. Some commonly used ones are:
Operator | Meaning | Example
--------- | ------- | ---------------------------------
`==` | equals | `category == "Acting"`
`!=` | does not equal | `category != "Comedy"`
`%in%` | match; contains the following | `category %in% c("Academic", "Science")`
`is.na()` | is missing | `is.na(job)`
`!is.na()`| is not missing | `!is.na(job)`
`&` | and | `year == 2015 & category == "Academic"`
`|` | or | `year == 2015 | category == "Academic"`
We'll use these logical operators and expressions a lot more as the course
continues, so they're worth memorizing.
```{block, type='rmdwarning'}
Two common mistakes with logical operators are: (1) using `=` instead of `==` to
check if two values are equal; and (2) using `== NA` instead of `is.na()` to
check for missing observations.
```
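Here is a short sketch of a few of these operators in action, including the `== NA` mistake. The `toy_filter` data frame is made up for illustration:

```{r filter-logical-sketch}
# Made-up toy data with one missing job, for illustration only
toy_filter <- tibble::tibble(
  category = c("Science", "Acting", "Academic", "Comedy"),
  job = c("physicist", NA, "professor", "comedian"))

# %in% keeps rows whose category matches any value in the vector
in_rows <- dplyr::filter(.data = toy_filter,
                         category %in% c("Academic", "Science"))
nrow(x = in_rows)

# Mistake: a comparison with `== NA` always gives NA, so no rows are kept
wrong_rows <- dplyr::filter(.data = toy_filter, job == NA)
nrow(x = wrong_rows)

# Correct: is.na() returns TRUE for the missing value
missing_rows <- dplyr::filter(.data = toy_filter, is.na(job))
nrow(x = missing_rows)
```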
### Base R equivalents to `dplyr` functions
Just so you know, all of these `dplyr` functions have alternatives, either
functions or processes, in base R:
```{r dplyr-vs-base, echo=FALSE}
dplyr_vs_base <- data.frame(dplyr = c("`rename()`",
"`select()`",
"`filter()`",
"`mutate()`",
"`slice()`"),
base = c("Reassign `colnames()`",
"Square bracket indexing",
"`subset()`",
"Use `$` to change or create columns",
"Square bracket indexing with row numbers"))
knitr::kable(dplyr_vs_base, col.names = c("`dplyr`",
"Base R equivalent"))
```
You will see these alternatives used in older code examples. Some of these
`dplyr` functions also have variants for particular data wrangling needs. For
example, `slice()` has variants such as `slice_max()` and `slice_min()`, which
extract the rows with the largest and smallest values, respectively, of the
variable given in the `order_by` argument, with the number of rows to keep set
by the `n` argument.
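As a minimal sketch of these two variants, the code below pulls the top two and bottom one rows from a made-up data frame. The `toy_scores` data and its `appearances` column are hypothetical:

```{r slice-max-min-sketch}
# Made-up toy data, for illustration only
toy_scores <- tibble::tibble(guest_name = c("Guest A", "Guest B",
                                            "Guest C", "Guest D"),
                             appearances = c(3, 10, 1, 7))

# Rows with the two largest values of `appearances`
top_two <- dplyr::slice_max(.data = toy_scores,
                            order_by = appearances, n = 2)
top_two

# Row with the single smallest value of `appearances`
bottom_one <- dplyr::slice_min(.data = toy_scores,
                               order_by = appearances, n = 1)
bottom_one
```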
## Merging Data Frames
Many data analysis exercises will require you to combine data from different sources into a single object. Thus, it's worthwhile to understand how R can be used to merge together two or more data frames.
Merging data frames is generally done in one of two ways, depending on *how those data frames are similar*: **row binding** or **column binding**. Below, I refer to `dplyr` functions, but, as usual, there are base R functions (`rbind()` and `cbind()`) that work, too!
### Row Binding
*Row binding* can be performed whenever the two (or more) data frames have column variables in common. Figure \@ref(fig:row-binding) shows this process graphically. Since the `x` and `y` data frames have identical column variables, the rows of data (i.e., what's under the column names) can be "stacked" on top of each other to create a single data frame.
This is accomplished in the example using `dplyr::bind_rows(x, y)`.
```{r row-binding, echo=FALSE, out.width="600pt", fig.align="center", fig.cap="Row binding can occur when data frames x and y share the column variables."}
knitr::include_graphics("./images/bind_rows.png")
```
Note that if the `x` and `y` data frames have column variables that are *not shared*, those variables will be carried forward but the observations will contain `NA`. Therefore, you should always check your resultant data frame for completeness using functions like `complete.cases()` or a combination of `is.na()` with `sum()` or `which()`.
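The sketch below illustrates this behavior with two made-up data frames, one of which has an extra column. The names `bind_x`, `bind_y`, and `var_b` are hypothetical and chosen only for this example:

```{r bind-rows-na-sketch}
# Made-up toy data; bind_y has a column (var_b) that bind_x lacks
bind_x <- tibble::tibble(id = c(1, 2), var_a = c("a", "b"))
bind_y <- tibble::tibble(id = c(3, 4), var_a = c("c", "d"),
                         var_b = c(10, 20))

stacked <- dplyr::bind_rows(bind_x, bind_y)
stacked

# Count missing values in the column that was not shared
sum(is.na(stacked$var_b))

# Flag which rows have no missing values in any column
complete.cases(stacked)
```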
### Column Binding (_join)
Data frames can also be merged when row observations are shared, such that you end up merging column variables together from two objects. In this case, we would **join** the two data frames using a function like `dplyr::left_join()`. To do this, we must specify one or more key variables that identify the row observations common to the two data frames. Once the rows are "lined up", we can paste the new column variables into a combined data frame. This is shown schematically in Figure \@ref(fig:left-join), where the matching rows are specified using the argument `by = "var_a"` within the join function.
```{r join-function, eval=FALSE}
# generic code for example; will not run
new.dataframe <- left_join(x, y, by = "var_a")
```
```{r left-join, echo=FALSE, out.width="600pt", fig.align="center", fig.cap="Column binding, or joins, can occur when data frames x and y share the same row observations."}
knitr::include_graphics("./images/left_join.png")
```
The `dplyr` package features a number of mutating join functions (e.g., `left_join()`, `right_join()`, `inner_join()`, `full_join()`) that add columns from data frame `y` to data frame `x`, once you specify how to match rows using the `by = ` argument.
* `inner_join()`: includes only rows that have a match in both x and y.
* `left_join()`: includes all rows in x (and y rows only if they match).
* `right_join()`: includes all rows in y (and x rows only if they match).
* `full_join()`: includes all rows in x or y.
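To see how the join types differ, here is a minimal sketch that applies each one to the same pair of made-up data frames and compares the number of rows returned. The names `join_x`, `join_y`, and the `id` key are hypothetical:

```{r join-types-sketch}
# Made-up toy data: ids 2 and 3 appear in both data frames
join_x <- tibble::tibble(id = c(1, 2, 3), var_a = c("a1", "a2", "a3"))
join_y <- tibble::tibble(id = c(2, 3, 4), var_b = c("b2", "b3", "b4"))

# Only ids found in both data frames (2 and 3)
inner_rows <- dplyr::inner_join(x = join_x, y = join_y, by = "id")
# All ids from join_x (1, 2, 3); var_b is NA for id 1
left_rows <- dplyr::left_join(x = join_x, y = join_y, by = "id")
# All ids from join_y (2, 3, 4); var_a is NA for id 4
right_rows <- dplyr::right_join(x = join_x, y = join_y, by = "id")
# All ids from either data frame (1, 2, 3, 4)
full_rows <- dplyr::full_join(x = join_x, y = join_y, by = "id")

c(inner = nrow(inner_rows), left = nrow(left_rows),
  right = nrow(right_rows), full = nrow(full_rows))
```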
Note that if your matching key (`by = `) does not uniquely identify row observations (for example, if there were two different `"John"` entries in a variable called `first.name` in one of the data frames), then R will create duplicate row entries that account for every possible combination of the `John` observations in `x` with the `John` observations in `y`. One way to check for this is to look at the number of rows, using `nrow()`, of the resultant (merged) data frame. In most cases, it should have the same number of rows as the starting data frame, contingent on which mutating join function you call. Another way is to look for duplicated values of your key variable using the `duplicated()` function.
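The sketch below shows one way these checks might look in practice, using made-up data in which the key value `"John"` appears twice in the second data frame. The names `dup_x`, `dup_y`, and the `age` and `city` columns are hypothetical:

```{r join-duplicate-key-sketch}
# Made-up toy data: "John" appears once in dup_x but twice in dup_y
dup_x <- tibble::tibble(first.name = c("John", "Maria"),
                        age = c(40, 35))
dup_y <- tibble::tibble(first.name = c("John", "John", "Maria"),
                        city = c("Denver", "Boston", "Austin"))

dup_joined <- dplyr::left_join(x = dup_x, y = dup_y, by = "first.name")

# The joined data frame has more rows than the starting data frame
nrow(x = dup_x)
nrow(x = dup_joined)

# Count repeated key values in the second data frame
sum(duplicated(dup_y$first.name))
```

A row count that grows after a `left_join()` is a hint that the key is not unique in `y`.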
## Piping
So far, I've shown how to use these `dplyr` functions one at a time to clean up
the data, reassigning the dataframe object at each step; however, there's a
trick called "piping" (with `%>%`) that will let you complete multiple data
wrangling steps at once.
If you look at the format of these `dplyr` functions, you'll notice that they
all take a dataframe as their first argument:
```{r dplyr-generic, eval=FALSE}
# generic code; will not run
rename(.data = dataframe,
new_column_name_1 = old_column_name_1,
new_column_name_2 = old_column_name_2)
select(.data = dataframe,
column_name_1, column_name_2)
filter(.data = dataframe,
logical expression)
mutate(.data = dataframe,
changed_column = function(changed_column),
new_column = function(other arguments))
```
Without piping, you have to reassign the dataframe object at each step of this
cleaning if you want the changes saved in the object:
```{r daily-sep-clean, eval=FALSE, message=FALSE}
daily_show <- read_csv(file = "data/daily_show_guests.csv",
skip = 4)
daily_show <- rename(.data = daily_show,
job = GoogleKnowlege_Occupation,
date = Show,
category = Group,
guest_name = Raw_Guest_List)
daily_show <- select(.data = daily_show,
-YEAR)
daily_show <- mutate(.data = daily_show,
job = str_to_lower(job))
daily_show <- filter(.data = daily_show,
category == "Science")
```
Piping lets you streamline this process. It can be used with any function that
inputs a dataframe (or vector) as its first argument. The `%>%` operator *pipes*
the object on the left-hand-side of the pipe (`%>%`) into the function on the
right-hand-side (immediately after the pipe). With piping, therefore, all of the
data cleaning steps shown above would look like:
```{r daily-pipe-clean, message=FALSE}
daily_show <- readr::read_csv(file = "data/daily_show_guests.csv",
skip = 4) %>%
dplyr::rename(job = GoogleKnowlege_Occupation,
date = Show,
category = Group,
guest_name = Raw_Guest_List) %>%
dplyr::select(-YEAR) %>%
dplyr::mutate(job = str_to_lower(job)) %>%
dplyr::filter(category == "Science")
```
Notice that, when piping a data frame, the first argument (name of the data
frame) is excluded from all function calls that follow a pipe. This is because
piping sends the dataframe from the last step into each of the following
functions as the dataframe argument. Remember: Order matters in a data wrangling
pipeline. For example, if you remove a column in an early line of code in the
pipeline but then reference that column name later, R will throw an error.
You can use selective highlighting to run one line at a time to see how the
dataframe changes in real-time as you move through successive pipes.
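To show what the pipe is doing behind the scenes, this minimal sketch runs the same two steps once as nested function calls and once as a pipeline, then compares the results. The `pipe_toy` data frame is made up for illustration:

```{r pipe-equivalence-sketch, message=FALSE}
library("dplyr")

# Made-up toy data, for illustration only
pipe_toy <- tibble::tibble(category = c("Science", "Acting", "Science"),
                           job = c("Physicist", "Actor", "CHEMIST"))

# Without the pipe: the output of mutate() is the first argument of filter()
nested_result <- dplyr::filter(
  .data = dplyr::mutate(.data = pipe_toy,
                        job = stringr::str_to_lower(string = job)),
  category == "Science")

# With the pipe: the same steps, read from top to bottom
piped_result <- pipe_toy %>%
  dplyr::mutate(job = stringr::str_to_lower(string = job)) %>%
  dplyr::filter(category == "Science")

# Check that both approaches give the same data frame
all.equal(nested_result, piped_result)
```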
``` {block, type='rmdnote'}
Piping with `%>%` should only be used when you want to perform successive data
wrangling steps on a **single object**. Each pipe operation should be followed
by a new line, as shown above. Creating a new line after each pipe step aids
readability of the pipe, since each new action occurs on a new line
of code. Also, if a single pipe function contains multiple arguments, consider
putting each argument on a separate line, too (also shown in the code snippet
above).
```
## Markdowns
A ***markdown*** is a file format designed for the internet. Markdown files
allow you to enter plain text into a file, format that text, and embed
code/images/data into the file (everything you are reading in this coursebook
was written and created with markdown files).
Markdown files are versatile because:
* Markdowns can be rendered into html, pdf, and doc files easily. Thus,
markdown files can be turned into websites, email messages, reports, blogs, textbooks, and other forms of media without worry;