# Strings, Dates, and Tidying {#rprog3}
```{r rprog2-1, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center')
library(tidyverse)
library(lubridate)
library(kableExtra)
```
```{r load_daily_show, echo=FALSE, message=FALSE}
daily_show <- read_csv(file = "data/daily_show_guests.csv", skip = 4)
daily_show <- rename(.data = daily_show,
year = YEAR,
job = GoogleKnowlege_Occupation,
date = Show,
category = Group,
guest_name = Raw_Guest_List)
```
## Chapter 6 Objectives
This Chapter is designed around the following learning objectives. Upon
completing this Chapter, you should be able to:
- Define the meaning of "strings" and "date time" objects in R
- Manipulate character strings using the `stringr` and `tidyr` packages of functions
- Parse strings using regular expressions ("regex")
- Describe how R stores a POSIXct date and time object internally
- Convert a character vector to a date format using functions from the
`lubridate` R package
- Extract information from a date object (e.g., month, year, day of week) using `lubridate` functions
- Search, organize, and visualize data that are linked to date objects
- Apply functions from the `dplyr` and `tidyr` packages to make dataframes "tidy"
## Strings {#strings}
***Strings*** are a form of character data like "John", or "blue", or "John's
sample 8021A turned blue".
**Strings are defined in R using quotes `" "`** and stored as `character`
vectors; they often show up in data analysis in one of two ways:
1. As ***metadata***. Metadata means: "data that describe other data".
A *readme.txt* file is metadata; notes and code comments are metadata. All of
these types of data usually come in the form of strings and are included
**with the data you are analyzing** but not **in the dataset** itself.
2. As ***vectorized data***. In R programming, *"vectorized"* means: stored
as a column of data. Examples of vectorized strings that you might find
include things like: "participant names", or "survey responses to question 1",
or "mode of failure". The example below creates three different string vectors
in R. *Note:* You can check the type of object you've created using the
`class()` or `typeof()` functions.
```{r strings-1}
# examples of vectorized string data
names_respond <- c("Ahmed",
"Josh",
"Mateo",
"William",
"Ali",
"Wei",
"Steve-O",
"John")
q1_responses <- c("Because you told me to do it.",
"It seemed like the right thing to do at the time.",
"Because I had been over-served.",
"I don't know. I just did it.",
"I got caught up in the heat of the moment.",
"I was given an opportunity. I took my shot.",
"I plead the 5th.",
"I could ask you the same question.")
failure_mode <- c("fracture",
"yielding",
"deflection",
"fatigue",
"creep")
# proof of vector type
class(names_respond)
```
The first step in analyzing a string is to parse it.
**To parse means to examine the individual components.** For example, when you
read this sentence you parse the words and then assign meaning to those
words based on your memory, your understanding of grammar, and the context in
which those words occur. Context is often critical to understanding because the
meaning of words can change from one context to the next (i.e., whether you are
reading an instruction manual, a text message, a novel, or a warrant for your
arrest). Strings can be challenging to analyze because computers are built on
logical operations and mathematics; strings are neither of those. Computers
have fantastic memory, are OK at grammar, and are comically poor at
contextualization. Taken together, this means that strings can be challenging
(but not impossible) to analyze using computers.
```{block, type="rmdnote"}
Are you active on social media platforms like Instagram or Twitter? You can bet that a computer program has downloaded and parsed all of your posts, each one as a string. You can learn a lot about a person (and their buying habits) from what they post online!
```
In this chapter, we will introduce a few simple string functions from base R
and the `stringr` package. We will also introduce the concept of
**regular expressions** as a means to perform more advanced string
manipulation.
```{r parse-comic, echo=FALSE}
knitr::include_graphics("./images/parse_comic.png")
```
### String detect, match, subset
One of the simplest string operations is to search whether a string contains a
pattern of interest. The `stringr` package (part of the
[Tidyverse](https://stringr.tidyverse.org/){target="_blank"}) was developed
to simplify the
analysis of strings. Most of the functions in `stringr` begin with `str_` and
end with a specific function name. A full list of functions is provided
[here](https://stringr.tidyverse.org/reference/index.html){target="_blank"}. Some examples:
1. **`str_detect()`** returns a vector of logical values (TRUE/FALSE) indicating whether the pattern was detected within each string searched. The function takes two arguments, the `string` to be searched and the `pattern` for which to search. Let's search for the pattern `"Josh"` in the character vector of strings, `names_respond`, that we created above:
```{r string-2}
stringr::str_detect(string = names_respond,
pattern = "Josh")
```
As expected, only one string in the vector produced a match.
An added benefit of logical functions like `str_detect()` is that return values
of `TRUE` are coded as 1 and `FALSE` as 0. Thus, if we `sum()` the result of
the `str_detect()` search, we will get the cumulative number of matches to
`"Josh"` from within our data.
```{r string-2a}
stringr::str_detect(string = names_respond,
pattern = "Josh") %>%
sum()
```
In other words, logical functions like `str_detect()` allow us to do math on
string data! For example, we can now calculate the proportion of `"Josh"`
entries within our sample:
```{r string-2b}
stringr::str_detect(string = names_respond,
pattern = "Josh") %>%
sum() / length(names_respond)
```
2. **`str_extract()`** takes the same arguments as `str_detect()` but returns a vector of the matched values (by string index). By "matched values", I mean only the portion of the string for which the search created a match.
```{r string-3}
stringr::str_extract(string = names_respond,
pattern = "Jo")
```
3. **`str_subset()`** returns only the entries that were matched (i.e., if a match was detected, then the entire string that was matched is returned). If we subset our short list of names to the pattern of letters `"li"`, we get:
```{r string-4}
stringr::str_subset(string = names_respond,
pattern = "li")
```
To note, there are base R versions of all these `stringr` functions. Most are
performed with the `grep` family of functions. The term *"grep"* is short
for **<u>G</u>lobal <u>R</u>egular <u>E</u>xpression <u>P</u>rint** (more on
*regular expressions* below). Many "old-school" coders use this family of
functions, therefore, you will encounter them in the wild, so it's worth
knowing about them.
```{r string-table, echo=FALSE}
string_table <- tibble::tibble(
stringr_funcs = c("str_detect(x, pattern)",
"str_match(x, pattern)",
"str_subset(x, pattern)"),
base_funcs = c("grepl(pattern, x)",
"regexec(pattern, x) + regmatches()",
"grep(pattern, x, value = TRUE)")
)
knitr::kable(string_table)
```
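For instance, here is a quick sketch of the base R equivalents at work, using the `names_respond` vector defined earlier (results match their `stringr` counterparts):

```{r grep-base-example}
# base R equivalent of str_detect(): returns a logical vector
grepl(pattern = "Josh", x = names_respond)
# base R equivalent of str_subset(): returns the matched strings themselves
grep(pattern = "li", x = names_respond, value = TRUE)
```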
### Regular Expressions
Before going much farther, we should spend some time discussing
***regular expressions*** or **regex** for short. When we pass a `pattern`
argument to a function like `str_detect()`, the function treats that argument
like a "regex". Up until this point, I have only passed simple character
strings as `pattern` arguments (i.e., `pattern = "Josh"`). In reality, we can
create much more advanced search criteria using **regex** syntax within our
search patterns.
```{block, type="rmdnote"}
A **regular expression** is a sequence of characters that define a search pattern to be implemented on a string.
```
In the R programming language, regular expressions follow the POSIX 1003.2
standard; regex can have different syntax based on the underlying standard.
Regex are created by including search syntax (i.e., symbols that communicate
search parameters) within your quoted string. For example, square brackets `[]`
in a search pattern indicate a search for *any* of the characters within the
brackets; conversely, to match *all* the characters you simply include them in
quotes. A key strength of regex patterns is that they allow you to use logical
and conditional relations. For example, the following text search patterns can
be coded as regex:
| Desired Search Pattern | Regex in R |
|:------------------------|:------------|
| *any letter followed by the numbers 3, 4, or 5* | "[:alpha:][345]" |
| *strings that start with 'ID' and are followed by 4 numbers* | "^ID[:digit:]{4}" |
One challenging aspect of string searching in R, however, is that certain
"special characters" like the quote `"` and the backslash `\` symbol must be
explicitly identified within the string in order to be interpreted by R
correctly. To identify these *special characters* in a string, you need to
***"escape"*** that character using a backslash `\`. Thus, if you want to
search for a quote symbol, you would type in `\"`. Whenever a regex requires
the use of a `\`, you have to identify it within a string as `\\`. The table
below shows some basic regex syntax and how they would be implemented as a
search pattern in R.
```{r regex-1, echo=FALSE}
table_regex <- tibble::tibble(
regex_symbols = c("\\\\d",
"[abc]",
"[a-z]",
"[^abc]",
"(abc)",
"^b",
"b$",
"a|b"),
match_examples = c("Any numeric digit",
"matches a, b, or c",
"matches every character between a and z",
"matches anything except a, b, or c",
"creates a \"capture group\" whereby abc must occur together",
"starts with: look for \"b\" at the start of a string",
"ends with: look for \"b\" at the end of a string",
"match a or b"),
example_code = c("\"\\\\\\\\d\" or \"[:digit:]\"",
"\"[abc]\"",
"\"[a-z]\"",
"\"[^abc]\"",
"\"(abc)\"",
"\"\\^b\"",
"\"b\\$\"",
"\"a|b\"")
)
knitr::kable(table_regex,
align = "c",
col.names = c("Regex syntax","String to be matched", "Example in R"),
caption = "Basic Regex Search Syntax and Example Implementation in R")
```
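To see a couple of these patterns in action, here is a small sketch using a made-up vector of sample IDs (the IDs are hypothetical, invented only for illustration):

```{r regex-try}
# hypothetical sample IDs: only strings starting with "ID" plus 4 digits match
ids <- c("ID1234", "ID0042", "XX1234", "ID12")
stringr::str_detect(string = ids, pattern = "^ID[:digit:]{4}")
```

The pattern `"^ID[:digit:]{4}"` anchors the match to the start of the string (`^`), requires the literal characters "ID", and then exactly four digits.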
**Regex** sequences have seemingly no end of sophistication and nuance; you
could spend dozens of hours learning to use them and hundreds more learning to
master them. We will only introduce basic concepts here. More in-depth
introductions to regex syntax and usage can be found on [Hadley Wickham's R course](https://r4ds.hadley.nz/strings.html){target="_blank"},
on the `stringr`
[cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf){target="_blank"}
developed by RStudio, and through practice with, my personal favorite, a game
of [Regex Golf](https://alf.nu/RegexGolf){target="_blank"}.
### String split, replace
`str_split()` will split a string into two (or more) pieces when a match is
detected. The string will always be split at the first match and again at each
additional match location, unless you specify that only a finite number of `n`
matches should occur. A couple points to note:
- `str_split()` splits on each side of the match; the matched part of the string
is *not* included in the output.
- because both sides of the split string are returned, your output will take
the form of a `list`.
```{r string-5}
stringr::str_split(string = names_respond, pattern = "t")
```
`str_replace()` searches for a match and then replaces the matched value
with a new string of your choosing. The function takes three arguments: the
`string` to be searched, the `pattern` to match, and the `replacement`
string to be inserted. Let's replace the first period detected in each of
the `q1_responses` strings with a question mark. Both `.` and `?` are
*special characters* so we need to *"escape"* each of these symbols with two
back-slashes `\\`.
```{r string-6}
stringr::str_replace(string = q1_responses,
pattern = "\\.",
replacement = "\\?")
```
The `str_replace()` family of functions is useful for cleaning up misspellings
(or other unwanted language) in strings. Note, however, that `str_replace()`
will normally replace only one instance of a match; use `str_replace_all()` if
you plan to encounter multiple matches that need replacing.
```{block, type='rmdtip'}
The `pattern =` argument for matching can be a single string or a vector of
strings. In the latter case, you can define a vector of *keywords* that you
might be searching for across sentences. In that case, use
`pattern = c("pattern_1", "pattern_2", "pattern_3", ...etc)`.
```
## Dates and Date-times
Working with dates and times can be challenging. This section begins with a
discussion of how base R handles dates and times, since there is a ton of code
out there that utilizes these older functions. We will then quickly transition
to the `lubridate` family of functions (part of the Tidyverse) because of their
versatility and ease-of-use.
### Dates and Times in base R
Dates and times in base R all proceed from an *"epoch"* or *time origin*. In
R, the *epoch* or "dawn of time" occurred at midnight on January 1^st^, 1970.
For the sake of the R programming world, the concept of time started at that
precise moment and has moved forward ever since. To note: R can handle
date-times before 1/1/1970; it just treats them as negative values!
To see a date-time object, you can tell R to give you the current "System Time"
by calling the `Sys.time()` function.
```{r sys-time}
Sys.time()
```
As you can see, we got back the date, time, and timezone used by my computer
(*whenever I last ran this code in `bookdown`*). If you want to see how this
time is stored in R internally, you can use `unclass()`, which returns an
object value with its class attributes removed. When we wrap `unclass()`
around `Sys.time()`, we will see the number of seconds that have occurred
between the epoch of 1/1/1970 and right now:
```{r unclass-time}
unclass(Sys.time())
```
That's a lot of seconds. How many years is that?
Just divide that number by [60s/min $\cdot$ 60min/hr $\cdot$ 24hr/d $\cdot$
365d/yr] => `r unclass(Sys.time())/60/60/24/365` years.
This calculation ignores leap years, but you get the point...
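If you would rather not track the unit conversions yourself, base R's `difftime()` does the arithmetic for you (a sketch; the exact value depends on when you run it):

```{r difftime-example}
# days elapsed between the epoch and right now, computed directly
difftime(Sys.time(),
         as.POSIXct("1970-01-01 00:00:00", tz = "UTC"),
         units = "days")
```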
### Date-time formats
Note that the `Sys.time()` function provided the date in a
***"year-month-day"*** format and the time in an ***"hour-minute-second"***
format: `r Sys.time()`.
Not everyone uses this exact ordering when they record dates and times, which
is one of the reasons working with dates and times can be tricky. You probably
have little difficulty recognizing the following date-time objects as
equivalent but, for some computer programs, not so much:
```{r date-times-1, echo=FALSE}
date_time <- tibble::tibble(examples = c("12/1/99 8:46 PM",
"1-Dec-1999 20:46 UTC",
"December 1st, 1999, 20:46:00"))
knitr::kable(date_time, col.names = NULL,
caption = "Date-time objects come in different forms") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = "striped")
```
```{block, type="rmdnote"}
You will often see time followed by **"UTC"**, which stands for
*"Universal Time, Coordinated"*. UTC is preferred by programmers because it
doesn't have a timezone and it doesn't follow *Daylight Savings Time*
conventions. Daylight savings is the bane of many coders.
In practice, UTC is the same time as GMT (Greenwich Mean Time, pronounced
"gren-itch") but with an important distinction. GMT is one of the many
[time-zones](https://wikipedia.org/wiki/List_of_tz_database_time_zones){target="_blank"} laid out across Earth's longitude, whereas, **UTC has no timezone**;
UTC is the same time for everyone, everywhere.
In Colorado, we are UTC-6 hours during *daylight savings* (March-Nov) and UTC-7
during *standard time* (Nov-March). This means that most Coloradoans eat
dinner at 12am UTC (6pm MDT).
```
### Date-time classes in R
R has several classes of date-time objects, none of which are easy to remember:
1. **`POSIXct`** - stored as the time, in seconds, between the `epoch` of 1970-01-01 00:00:00 UTC and the date-time object in question.
* the 'ct' stands for *"calendar time"*; the object represents "continuous seconds from origin";
* A `POSIXct` object is a single numeric vector, thus useful for efficient computing.
2. **`POSIXlt`** - stored as a list of date-time objects.
* the 'lt' stands for *"local time"*, though it is handy to remember it as *"list time"*.
* A `POSIXlt` list contains the following elements:
* *sec* as 0–61 seconds
* *min* as 0–59 minutes
* *hour* as 0–23 hours
* *mday* as 1–31 day of the month
* *mon* as 0–11 months after the first of the year
* *year* as Years since 1900
* *wday* as 0–6 day of the week, starting on Sunday
* *yday* as 0–365 days of the year.
* *isdst* as a flag for Daylight savings time. Positive if in force, zero if not, negative if unknown.
3. **`POSIXt`** - this is a virtual class. `POSIXt` (without the "l") is an internal way for R to convert between `POSIXct` and `POSIXlt` date-time objects.
* Think of the `POSIXt` as a way for R to perform operations/conversions between a `POSIXct` and `POSIXlt` object without throwing an error your way.
As a reminder, here are some common **object classes** in R:
```{r vect-classes, echo=FALSE, warning=FALSE}
vect_classes <- tibble::tibble(classes = c("`character`",
"`numeric`",
"`factor`",
"`Date`",
"`logical`",
"`date-time`"),
examples = c("\"Chemistry\", \"Physics\", \"Mathematics\"",
"10, 20, 30, 40",
"Male [underlying number: 1], Female [2]",
"\"2010-01-01\" [underlying number: 14,610]",
"TRUE, FALSE",
"\"2020-06-23 11:05:20 MDT\""))
knitr::kable(vect_classes, col.names = c("Class", "Example")) %>%
kable_styling(full_width = F, bootstrap_options = "basic")
```
To discover the class of a vector (including a column in a dataframe---remember
each column can be thought of as a vector), you can use `class()`:
```{r date-1, warning = FALSE}
class(Sys.time())
```
Both the `POSIXct` and `POSIXlt` class of objects return the same value to the
user; the difference is really in how these classes store date-time objects
internally. To examine them, you can coerce `Sys.time()` into each of the two
classes using `as.POSIXct` and `as.POSIXlt` functions and then examine their
attributes.
```{r ct_attr}
time_now_ct <- as.POSIXct(Sys.time())
unclass(time_now_ct)
```
```{r lt_attr}
time_now_lt <- as.POSIXlt(Sys.time())
str(unclass(time_now_lt)) # the `str()` function makes the output more compact
```
It's easy to see why the `POSIXct` object class is more computationally
efficient, but it's also nice to see all the date-time information packed into
the `POSIXlt`. This is why R keeps a key to unlock both using `POSIXt`.
As my father used to say: clear as mud?
### Reading and classifying date-times
Oftentimes, when data is read into R, there are column elements that contain
date and time information. These dates and times are often interpreted by R
as *character* vectors, which means they have lost their relational attributes.
For example, you cannot subtract "Monday 08:00" from "Wednesday 12:00" and get
"2 days 4 hours". If we want to analyze dates and times in a relational way,
we need to instruct R to recognize these as date-time objects (i.e., as either
the `POSIXct` or `POSIXlt` class). Thus, to convert a character vector into
date or date-time object requires a change of that vector's class.
Date-time elements can be tricky to work with for a few reasons:
1. Different programs store and handle dates and times in different ways
2. The existence of time zones means that date-time values can change with location
3. Date-time strings can be separated with spaces, colons, commas, slashes, dashes, or a mix of all those together (see Table \@ref(tab:date-times-1))
The base R function to convert between `character` classes and `date-time`
classes is the function `strptime()`, which is short for
*"**str**ing **p**arse into date-**time**"*. I mention this function not because
I encourage you to use it but because I want you to be able to recognize it. The
function has over 39 conversion specifications that it can take as arguments.
That is to say, this function is not simple to master. If you are a glutton for
punishment, I invite you to read the R Documentation with `?strptime`.
Here are a few base R functions for working with date-time objects that are
worth knowing:
```{r base-r-times, echo=FALSE}
base_r_times <- tibble::tibble(functions = c("Sys.Date()",
"Sys.time()",
"Sys.timezone()",
"as.POSIXct()",
"as.POSIXlt()",
"strptime()"),
returned = c("Current system date",
"Current system date-Time",
"Current system timezone",
"date-time object of class POSIXct",
"date-time object of class POSIXlt",
"date-time object of class POSIXlt"),
examples = c("\"2020-06-23\"",
"\"2020-06-23 11:05:20 MDT\"",
"\"America/Denver\"",
"\"2020-06-23 11:05:20 MDT\"",
"\"2020-06-23 11:05:20 MDT\"",
"\"2020-06-23 11:05:20 MDT\"")
)
knitr::kable(base_r_times, col.names = c("{base} R Function",
"Value Returned",
"Example"),
caption = "Basic Date-time functions") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = "striped")
```
## `lubridate` {#lubridate}
The `lubridate` package was developed specifically to make it easier to work
with date-time objects. You can find out more information on `lubridate`
[here](https://lubridate.tidyverse.org/){target="_blank"}.
### Parsing functions in `lubridate`
One of the best aspects of `lubridate` is its ability to parse date-time objects
with simplicity and ease; the `lubridate` parsing functions are designed as
"named-to-order". Let me explain:
> <span style="color: blue;"> **Parse**: ***to break apart and analyze the individual components*** of something, like a character string. </span>
* If a character vector is written in "**y**ear-**m**onth-**d**ay" format (e.g., `"2020-Dec-18"`), then the `lubridate` function to convert that vector is `ymd()`.
* If a character vector is written in "**d**ay-**m**onth-**y**ear" format (e.g., `"18-Dec-2020"`), then the `lubridate` function to convert that vector is `dmy()`. Try it out:
```{r lubridate-1}
# create a character vector
date_old <- "2020-Dec-18"
# prove it's a character class
class(date_old)
# convert it to a `Date` class with `ymd()`
date_new <- lubridate::ymd(date_old)
# prove it worked
class(date_new)
```
That little conversion exercise may not have blown you away, but watch what
happens when I feed the following set of wacky character vectors into that
same `lubridate` parsing function, `ymd()`:
``` {r lubridate-2}
messy_dates <- c("2020------Dec the 12",
"20.-.12.-.12",
"2020aaa12aaa12",
"20,12,12",
"2020x12-12",
"2020 .... 12 ...... 12",
"'20.December-12")
ymd(messy_dates)
```
Boom. That's right, the `ymd()` parsing function figured them all out correctly
with almost no effort on your part. But wait, there's more!
The `lubridate` package contains parsing functions for almost any order you can
imagine.
```{r lubridate-table1, echo=FALSE}
lubridate_parsers <- tibble::tibble(functions = c("`ymd()`",
"`mdy()`",
"`dmy()`"),
formats = c("year-month-day",
"month-day-year",
"day-month-year")
)
knitr::kable(lubridate_parsers,
col.names = c("Parsing Function", "Format to Convert")) %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = "basic")
```
And if you need to parse a time component, simply add a combination of `_hms` to
the function call to parse time in "hours-minutes-seconds" format. Some
additional examples of how you would parse time that followed from a `ymd`
format:
```{r lubridate-table2, echo=FALSE}
lubridate_parsers2 <- tibble::tibble(functions = c("`ymd_h()`",
"`mdy_hm()`",
"`dmy_hms()`"),
formats = c("year-month-day_hours",
"year-month-day_hours-minutes",
"year-month-day_hours-minutes-seconds")
)
knitr::kable(lubridate_parsers2,
col.names = c("Parsing Function", "Format to Convert")) %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = "basic")
```
The beauty of the `lubridate` parsers is that they do the hard work of cleaning
up the character vector, regardless of separators or delimiters within each
string, and return either a `Date` or `Date-time` object class.
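As a quick sketch with a time component (the timestamps below are made up; note the mixed separators):

```{r ymd-hms-example}
# three equivalent timestamps in inconsistent formats
messy_stamps <- c("2020-12-18 08:30:00",
                  "2020/Dec/18 8:30:00",
                  "20201218 083000")
lubridate::ymd_hms(messy_stamps)
```

All three parse to the same `POSIXct` date-time (in UTC, the `lubridate` default when no timezone is supplied).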
### Date-time manipulation with `lubridate`
To convert the `date` column in the `daily_show` data into a Date class, you can
run:
```{r}
library(package = "lubridate")
# check the class of the 'date' column before mutating it
class(x = daily_show$date)
daily_show <- mutate(.data = daily_show,
date = mdy(date))
head(x = daily_show, n = 3)
# check the class of the 'date' column after mutating it
class(x = daily_show$date)
```
Once you have an object in the `Date` class, you can do things like plot by
date, calculate the range of dates, and calculate the total number of days the
dataset covers:
```{r eval=FALSE}
# report the min and max dates
range(daily_show$date)
# calculate the duration from first to last date using base
max(daily_show$date) - min(daily_show$date)
```
We could have used these `lubridate` functions when first loading and cleaning
`daily_show`, using the following pipe chain:
```{r message=FALSE}
daily_show <- readr::read_csv(file = "data/daily_show_guests.csv",
skip = 4) %>%
dplyr::rename(job = GoogleKnowlege_Occupation,
date = Show,
category = Group,
guest_name = Raw_Guest_List) %>%
dplyr::select(-YEAR) %>%
dplyr::mutate(date = lubridate::mdy(date)) %>%
dplyr::filter(category == "Science")
# show first two rows of dataframe
head(x = daily_show, n = 2)
```
The `lubridate` package also includes functions to pull out certain elements of
a date, including:
- `wday()` return the day of the week pertaining to a Date object
- `mday()` return the day of the month pertaining to a Date object
- `yday()` return the day of the year pertaining to a Date object
- `month()` return the month pertaining to a Date object
- `quarter()` return the quarter of the year pertaining to a Date object
- `year()` return the year pertaining to a date object
For example, we could use `wday()` to create a new column with the weekday of
each show:
```{r}
daily_show %>%
dplyr::mutate(show_day = lubridate::wday(x = date,
label = TRUE)) %>%
dplyr::select(date, show_day, guest_name) %>%
dplyr::slice(1:5)
```
```{block, type = "rmdwarning"}
R functions tend to use the timezone of **YOUR** computer's operating system by
default---or UTC, or GMT. You need to be careful when working with dates and
times to either specify the time zone or convince yourself the default behavior
works for your application. To see or set your computer's time zone, use the `Sys.timezone()` function from `{base}`.
```
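For instance, the same character string parses to two different instants depending on the `tz =` argument you supply (a sketch):

```{r tz-example}
# same clock time, two different instants in time
lubridate::ymd_hms("2020-06-23 11:05:20")                        # defaults to UTC
lubridate::ymd_hms("2020-06-23 11:05:20", tz = "America/Denver") # Mountain time
```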
## Tidy Data
***"Tidy Data"*** is a philosophy for how to arrange your data in an intuitive,
rectangular fashion, commonly in a `dataframe`. In his
[2014 paper](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf){target="_blank"},
Hadley Wickham states,
*"tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning)."*
The definition of a tidy dataset is straightforward:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
The first two rules, *each variable forms a column* and
*each observation forms a row*, are relatively easy to implement.
In fact, most of the dataframes that we have used so far follow this convention.
The trick is to note that some column variables might look *independent* from
one another, when, in fact, the columns represent a single variable broken out
by some hierarchical, or nesting, structure.
For example, the (untidy) table below (\@ref(tab:grades-untidy)) shows exam
grades for three students. Table \@ref(tab:grades-untidy) is untidy because
the variable depicting the exam score shows up in three different columns.
Another way to characterize the *untidiness* of this table is that the three
exam columns (Exam1, Exam2, Exam3) together encode a single variable (the exam
number) that is not represented by a column of its own.
```{r grades-untidy, echo=FALSE}
set.seed(14)
grades_untidy <- tibble::tibble(
Name = c("Harry", "Ron", "Hermione"),
Exam1_Score = c(85, 81, 95),
Exam2_Score = c(79, 75, 97),
Exam3_Score = c(88, 89, 99)
)
grades_tab1 <- knitr::kable(grades_untidy,
caption = "An Untidy Table of Exam Scores.")
kableExtra::kable_styling(grades_tab1,
bootstrap_options = "condensed",
position = "center",
full_width = FALSE)
```
Thus, to tidy table \@ref(tab:grades-untidy), we need to create a variable
called `Exam` and move all the scores into a new column vector named `Scores`.
This effectively makes the data "long" rather than "wide".
```{r grades-tidy, echo=FALSE}
set.seed(14)
grades_tidy <- tibble::tibble(
Name = rep(c("Harry", "Ron", "Hermione"), each = 3),
Exam = rep(1:3, 3),
Scores = c(85, 81, 95, 79, 75, 97, 88, 89, 99)
)
grades_tab2 <- knitr::kable(grades_tidy, caption = "A Tidy Table of Exam Scores")
kableExtra::kable_styling(grades_tab2,
bootstrap_options = "condensed",
position = "center",
full_width = FALSE)
```
For our work, one of the goals of data cleaning is to create datasets that
follow this *tidy* convention. Fortunately, the `tidyr` package contains a
useful function called `pivot_longer()` for re-arranging dataframes from
untidy (typically "wide") to tidy ("long").
`pivot_longer()` is designed to "lengthen" a data frame by
*"increasing the number of rows and decreasing the number of columns"*.
This function creates two new columns as output and requires four arguments:
- `data =` the dataframe to be lengthened
- `cols =` a list of the columns that should be combined together (forming a
single variable to be represented in a new, single column)
- `names_to =` the name of a new output column that will contain the **names**
of the old columns being combined.
- `values_to =` the name of the other new output column, which will contain the
**values** from within the old columns being combined.
```{block, type="rmdnote"}
The `pivot_longer()` function always creates two new columns:
- one column contains `names_to = ` information.
- the other column contains the `values_to = ` data.
```
In the case of Table \@ref(tab:grades-untidy), we use `cols = ` to combine
`Exam1_Score`, `Exam2_Score`, and `Exam3_Score` into a column named by
`names_to = "Exam"`, and the data held in those three columns move into a
column named by `values_to = "Scores"`. We also add a call to `dplyr::mutate()`
and `dplyr::case_when()` to convert the strings to numbers.
The `case_when()` function allows you to introduce multiple "if/else" statements
into a vectorized operation, such as within this call to `dplyr::mutate()`. The
function takes a series of logical tests as *cases*, with each case seeking a
"match"; once a "match" occurs, an outcome instruction is followed. In the code
chunk below, the left-hand side of each "case" evaluates whether each entry in
the `Exam` column equals a particular string (like `"Exam1_Score"`). The
right-hand side of each case, separated by a *tilde* `~`, contains the
`mutate()` instruction. In plain speech, the function reads:
> If the entry matches "Exam1_Score", replace it with 1; else, if it matches "Exam2_Score",
replace it with 2; else, if it matches "Exam3_Score", replace it with 3.
```{r pivot-longer}
grades_untidy <- tibble::tibble(
Name = c("Harry", "Ron", "Hermione"),
Exam1_Score = c(85, 81, 95),
Exam2_Score = c(79, 75, 97),
Exam3_Score = c(88, 89, 99)
)
tidygrades <- tidyr::pivot_longer(data = grades_untidy,
cols = Exam1_Score:Exam3_Score,
names_to = "Exam",
values_to = "Scores") %>%
dplyr::mutate(Exam = dplyr::case_when(
Exam == "Exam1_Score" ~ 1,
Exam == "Exam2_Score" ~ 2,
Exam == "Exam3_Score" ~ 3
))
```
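If you would rather not enumerate every column name, the same conversion can be
sketched with `readr::parse_number()`, which extracts the first number found in
each string (so `"Exam1_Score"` becomes 1):

```{r pivot-longer-alt}
# same lengthening as above, but parse the exam number directly
# from the old column names instead of matching each case by hand
tidygrades_alt <- tidyr::pivot_longer(data = grades_untidy,
                                      cols = Exam1_Score:Exam3_Score,
                                      names_to = "Exam",
                                      values_to = "Scores") %>%
  dplyr::mutate(Exam = readr::parse_number(Exam))
```

This produces the same tidy table and scales to any number of exam columns.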
``` {r grades-tab3, echo=FALSE}
grades_tab3 <- knitr::kable(tidygrades,
caption = "Pivot longer applied to untidy data")
kableExtra::kable_styling(grades_tab3,
bootstrap_options = "condensed",
position = "center",
full_width = FALSE)
```
Now these data are tidy, wherein each variable forms a column, each row
forms an observation, and the table is specific to one observational unit.
This may seem obvious, but most data analysis problems occur because of poor
data management techniques. Remember: What might seem useful at the time---like
a color-coded Excel spreadsheet---is a pain for computers and not reproducible.
Do yourself a favor and begin with the end in mind: *tidy data*.
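For completeness, `tidyr::pivot_wider()` reverses the operation whenever a wide
layout is genuinely needed (say, for a display table); a minimal sketch using
the tidy grades data from above:

```{r pivot-wider-sketch}
# pivot_wider() is the inverse of pivot_longer():
# names_from supplies the new column names, values_from fills the cells
grades_wide <- tidyr::pivot_wider(tidygrades,
                                  names_from = Exam,
                                  values_from = Scores,
                                  names_prefix = "Exam")
```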
## Chapter 6 Exercises
The following exercises use data on tweets from United States senators, processed by
[FiveThirtyEight](https://github.com/fivethirtyeight/data/tree/master/twitter-ratio){target="_blank"}.
### Set 1: Data import and manipulation
Create a pipeline that
- imports the raw data (.csv) containing tweets from all US senators,
- filters the raw data to only Colorado ("CO") senators,
- creates **4 new variables** within a single call to the appropriate `dplyr` function, using four distinct `lubridate` functions chosen to match the format of the date or date-time variable:
- `date` from `created_at`
- `hour_co` from `date`, also setting the "America/Denver" time zone with `lubridate::with_tz()`
- `month` from `date`
- `week` from `date`, and
- retains all variables **except** `url`, `created_at`, and `bioguide_id`
```{r tweet-data, message=FALSE}
# pipeline to import, filter, manipulate datetime vars, select vars
tweets_co <- readr::read_csv("data/senators.csv") %>%
dplyr::filter(state == "CO") %>% # filter to Colorado senators
dplyr::mutate(date = lubridate::mdy_hm(created_at), # parse timestamp into a datetime object
hour_co = lubridate::hour(lubridate::with_tz(date, # extract the hour after shifting
tzone = "America/Denver")), # to the America/Denver time zone
month = lubridate::month(date), # create var of month from timestamp
week = lubridate::week(date)) %>% # create var of week from timestamp
dplyr::select(-(c(url, created_at, bioguide_id))) # retain all vars except 3
```
### Set 2: Counting
1. Which Colorado senators are included in the data?
```{r co-senators}
# find unique strings within user variable
# if you convert to factor, you can also find `levels()`
unique(tweets_co$user)
```
2. Which Colorado senator tweets the most?
```{r tweet-count}
# compare number of tweets by senator; remember to always ungroup grouped data!
tweets_co %>%
dplyr::group_by(user) %>%
dplyr::tally() %>%
dplyr::ungroup()
```
### Set 3: Weekly tweets
Create a time series plot that indicates tweet activity on a weekly basis since
2011 for the Colorado senators. You will need to first create a new data frame
that contains new time-date variables based on the (a) total time and (b) weeks
since data collection started.
```{r time-vars}
# create new time-based variables
tweets_co_wk <- tweets_co %>%
dplyr::mutate(time_since = lubridate::as.duration(date - min(date)), # create duration variable, time since first datum
week = round(as.numeric(time_since, "weeks"), 0)) %>% # duration rounded to nearest cumulative "weeks"
dplyr::group_by(user, week) %>% # by week per user
dplyr::tally() %>% # tweet count
dplyr::ungroup() # remember to ungroup grouped data
```
```{r tweet-time-series}
# plot time series of tweets per week per senator since 2011
ggplot2::ggplot(data = tweets_co_wk,
                mapping = ggplot2::aes(x = week, y = n, color = user)) +
  ggplot2::geom_line() +
  ggplot2::theme_minimal() +
  ggplot2::labs(title = "Number of tweets per week by Colorado senator since 2011",
                subtitle = "Gardner joined Twitter later but tweets more (and cyclically)")
```
### Are tweets correlated in time?
We arbitrarily selected "weekly number of tweets" for each senator, and it's possible that tweet volume in a prior week correlates with tweet volume in a subsequent week (i.e., autocorrelation). This calls for an autocorrelation plot! We'll create one for Senator Gardner.
```{r tweet-pacf}
# create a data frame with only Gardner's tweets
tweets_gardner <- tweets_co_wk %>%
  dplyr::filter(user == "SenCoryGardner")
# create a partial autocorrelation plot of tweets by week
pacf(tweets_gardner$n, lag.max = 60) # go out more than 52 weeks to check for annual correlation
```
It appears, as one might expect, that tweets are correlated in time. Most of the correlation happens in the first three lags, so maybe these data would be better suited for a monthly average...
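To follow up on that hunch, one sketch (assuming the `tweets_co` data frame
created earlier) is to bin tweets by calendar month with
`lubridate::floor_date()` and count tweets per bin, then repeat the
autocorrelation check on the monthly series:

```{r tweets-monthly}
# collapse tweets to one count per senator per calendar month;
# floor_date() rounds each datetime down to the first of its month
tweets_co_mo <- tweets_co %>%
  dplyr::mutate(month_bin = lubridate::floor_date(date, unit = "month")) %>%
  dplyr::count(user, month_bin)
```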
### Set 4: CSU vs. CU
1. How many times are strings related to CSU mentioned by each senator by year?
This might include things like "CSU", "colostate", "Colostate", "Colorado State
U", "Rams", etc. You will need to use a few `dplyr` functions, one `lubridate`
function, and one `stringr` function within the pipeline.
```{r tweet-csu}
# determine number of csu-related tweets by senator per year
tweets_co %>%
dplyr::group_by(user, lubridate::year(date)) %>%
dplyr::filter(stringr::str_detect(text, "CSU|colostate|Colostate|Colorado State U|RAMS|Rams|csu")) %>%
dplyr::tally() %>%
dplyr::ungroup()
```
2. What is the content of the tweets extracted in the previous question? You
will need to use another `stringr` function. You can use the same text strings
as before.
```{r tweet-csu-content}
# extract csu-related tweet content
stringr::str_subset(tweets_co$text,
"CSU|colostate|Colostate|Colorado State U|RAMS|Rams|csu")
```
3. Which Colorado senator tweets the most about the University of Colorado? Use
a similar approach to Question 1 above, with strings related to CU, such as
"Buffs" or "University of Colorado".
```{r tweet-cu}
# number of cu-related tweets by senator per year
tweets_co %>%
dplyr::group_by(user, lubridate::year(date)) %>%
dplyr::filter(lubridate::year(date) >= 2013) %>%
dplyr::filter(stringr::str_detect(text,
"CU|Buffs|buffs|University of Colorado")) %>%
dplyr::tally() %>%
dplyr::ungroup()
```
## Chapter 6 Homework
You will continue to use the [FiveThirtyEight](https://github.com/fivethirtyeight/data/tree/master/twitter-ratio){target="_blank"} Twitter data for homework.
Download the csv from Canvas containing Colorado senators' tweets from 2011 to
2017 (not the full data file used in the above exercises) and the R Markdown
template. Your knitted submission is due at the start of the first class of the
next chapter.