# Vectors
Vectors (similar to single-type arrays in other languages) are ordered collections of simple types, usually numerics, integers, characters, or logicals. We can create vectors using the `c()` function (for concatenate), which takes as parameters the elements to put into the vector:
<pre id=part3-03-concat
class="language-r
line-numbers
linkable-line-numbers">
<code>
samples <- c(3.2, 4.7, -3.5) # 3-element numeric vector
</code></pre>
The `c()` function can take other vectors as parameters, too--it will “deconstruct” all subvectors and return one large vector, rather than a vector of vectors.
<pre id=part3-03-c-deconstruct
class="language-r
line-numbers
linkable-line-numbers">
<code>
samples2 <- c(20.4, samples, 37.6) # 5-element numeric vector
print(samples2) # prints [1] 20.4 3.2 4.7 -3.5 37.6
</code></pre>
We can extract individual elements from a vector using `[]` syntax; though note that, unlike many other languages,
the first element is at index 1.
<pre id=part3-03-bracket-index-single
class="language-r
line-numbers
linkable-line-numbers">
<code>
second_sample <- samples2[2] # numeric 3.2
</code></pre>
The `length()` function returns the number of elements of a vector (or similar types, like lists, which we’ll cover later) as an integer:
<pre id=part3-03-length-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
num_samples <- length(samples2) # integer 5
</code></pre>
We can use this to extract the last element of a vector, for example.
<pre id=part3-03-length-get-last
class="language-r
line-numbers
linkable-line-numbers">
<code>
last_sample <- samples2[num_samples] # numeric 37.6
# OR
last_sample <- samples2[length(samples2)] # numeric 37.6
</code></pre>
### No "Naked Data": Vectors Have (a) Class {-}
So far in our discussion of R’s data types, we’ve been making a simplification, or at least we’ve been leaving something out. Even individual values like the numeric `4.6` are actually vectors of length one. Which is to say, `gc_content <- 0.34` is equivalent to `gc_content <- c(0.34)`, and in both cases, `length(gc_content)` will return `1`, which itself is a vector of length one. This applies to numerics, integers, logicals, and character types. Thus, at least compared to other languages, R has no “naked data”; the vector is the most basic unit of data that R has. This is slightly more confusing for character types than others, as each individual element is a string of characters of any length (including potentially the “empty” string `""`).
<div class="fig center" style="width: 100%">
<img src="images/part3-03-vectors.Rmd.images/III.3_6_r_22_2_char_vec.png" />
</div>
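A quick way to convince ourselves of this is to ask R directly. The sketch below uses `identical()` and `nchar()`, neither of which has been introduced in the text so far, and the variable `seqs` is made up for illustration.

```r
gc_content <- 0.34
# a "single" value and a one-element vector are the same thing
print(identical(gc_content, c(0.34)))   # [1] TRUE
print(length(gc_content))               # [1] 1

# for character vectors, length() counts elements, not letters
seqs <- c("ACGT", "")
print(length(seqs))                     # [1] 2
print(nchar(seqs))                      # [1] 4 0
```

Here `length()` reports the number of strings in the vector, while `nchar()` reports the number of characters in each string, including `0` for the empty string.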
This explains quite a lot about R, including some curiosities such as why `print(gc_content)` prints `[1] 0.34`. This output is indicating that `gc_content` is a vector, the first element of which is `0.34`. Consider the `seq()` function, which returns a vector of numerics; it takes three parameters:^[Most R functions take a large number of parameters, but many of them are optional. In the next chapter, we’ll see what such optional parameters look like, and how to get an extensive list of all the parameters that built-in R functions can take.] (1) the number at which to start, (2) the number at which to end, and (3) the step size.
<pre id=part3-03-seq-example
class="language-r
line-numbers
linkable-line-numbers">
<code>
range <- seq(1, 20, 0.5)
print(range)
</code></pre>
When we print the result, we’ll get output like the following, where the list of numbers is formatted such that it spans the width of the output window.
<pre id=part3-03-seq-example-out
class="language-txt
line-numbers
linkable-line-numbers
no-whitespace-normalization">
<code> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
[16] 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5
[31] 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0
</code></pre>
The numbers in brackets indicate that the first element of the printed vector is `1.0`, the sixteenth element is `8.5`, and the thirty-first element is `16.0`.
By the way, to produce a sequence of integers (rather than numerics), the step-size argument can be left off, as in `seq(1, 20)`. This is equivalent to a commonly seen shorthand, `1:20`.
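As a small check of the claim about integers versus numerics, we can inspect the class of each sequence. This is only a sketch; the variable names `int_seq` and `half_steps` are illustrative.

```r
int_seq <- seq(1, 20)
print(class(int_seq))              # [1] "integer"
print(identical(int_seq, 1:20))    # [1] TRUE

half_steps <- seq(1, 20, 0.5)
print(class(half_steps))           # [1] "numeric"
print(length(half_steps))          # [1] 39
```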
If all of our integers, logicals, and so on are actually vectors, and we can tell their type by running the `class()` function on them, then vectors must be the things that we are examining the class of. So, what if we attempt to mix types within a vector, for example, by including an integer with some logicals?
<pre id=part3-03-vec-class-mix1
class="language-r
line-numbers
linkable-line-numbers">
<code>
mix <- c(TRUE, FALSE, as.integer(20))
</code></pre>
Running `print(class(mix))` will result in `"integer"`. In fact, if we attempt to print out mix with `print(mix)`, we’d find that the logicals have been converted into integers!
<pre id=part3-03-vec-class-mix1-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[1] 1 0 20
</code></pre>
R has chosen to convert `TRUE` into `1` and `FALSE` into `0`; these are standard binary values for true and false, whereas there is no standard logical value for a given integer. Similarly, if a numeric is added, everything is converted to numeric.
<pre id=part3-03-vec-class-mix2
class="language-r
line-numbers
linkable-line-numbers">
<code>
mix <- c(TRUE, FALSE, as.integer(20), 3.5)
print(class(mix)) # [1] "numeric"
print(mix) # [1] 1.0 0.0 20.0 3.5
</code></pre>
And if a character string is added, everything is converted into a character string (with `3.5` becoming `"3.5"`, `TRUE` becoming `"TRUE"`, and so on).
<pre id=part3-03-vec-class-mix3
class="language-r
line-numbers
linkable-line-numbers">
<code>
mix <- c(TRUE, FALSE, as.integer(20), 3.5, "A")
print(class(mix)) # [1] "character"
print(mix) # [1] "TRUE" "FALSE" "20" "3.5" "A"
</code></pre>
In summary, vectors are the most basic unit of data in R, and they cannot mix types—R will autoconvert any mixed types in a single vector to a “lowest common denominator,” in the order of logical (most specific), integer, numeric, character (most general). This can sometimes result in difficult-to-find bugs, particularly when reading data from a file. If a file has a column of what appears to be numbers, but a single element cannot be interpreted as a number, the entire vector may be converted to a character type with no warning as the file is read in. We’ll discuss reading data in from text files after examining vectors and their properties.
### Subsetting Vectors, Selective Replacement {-}
Consider the fact that we can use `[]` syntax to extract single elements from vectors:
<pre id=part3-03-selective-rep-indexing-single
class="language-r
line-numbers
linkable-line-numbers">
<code>
numbers <- c(10, 20, 30, 40, 50)
second_el <- numbers[2] # 20
</code></pre>
Based on the above, we know that the `20` extracted is a vector of length one. The `2` used in the brackets is also a vector of length one; thus the line above is equivalent to `second_el <- numbers[c(2)]`. Does this mean that we can use longer vectors for extracting elements? Yes!
<pre id=part3-03-selective-rep-indexing-multi
class="language-r
line-numbers
linkable-line-numbers">
<code>
subvector <- numbers[c(3,2)]
print(subvector) # [1] 30 20
</code></pre>
In fact, the extracted elements were even placed in the resulting two-element vector in the order in which they were extracted (the third element followed by the second element). We can use a similar syntax to selectively replace elements by specific indices in vectors.
<pre id=part3-03-selective-rep-multi1
class="language-r
line-numbers
linkable-line-numbers">
<code>
numbers[c(3,2)] <- c(35, 25)
print(numbers) # [1] 10 25 35 40 50
</code></pre>
*Selective replacement* is the process of replacing selected elements of a vector (or similar structure) by specifying which elements to replace with `[]` indexing syntax combined with assignment `<-`.^[The term “selective replacement” is not widely used outside of this book. In some situations, the term “conditional replacement” is used, but we wanted to define some concrete terminology to capture the entirety of the idea.]
R vectors (and many other data container types) can be named, that is, associated with a character vector of the same length. We can set and subsequently get this names vector using the `names()` function, but the syntax is a little odd.
<pre id=part3-03-named-vec1
class="language-r
line-numbers
linkable-line-numbers">
<code>
# create vector
scores <- c(89, 94, 73)
# set names for the elements
names(scores) <- c("Student A", "Student B", "Student C")
print("Printing the vector:")
print(scores)
print("Printing the names:")
names_scores <- names(scores)
print(names_scores)
</code></pre>
Named vectors, when printed, display their names as well. The result from above:
<pre id=part3-03-named-vec1-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[1] "Printing the vector:"
Student A Student B Student C
89 94 73
[1] "Printing the names:"
[1] "Student A" "Student B" "Student C"
</code></pre>
Named vectors may not seem that helpful now, but the concept will be quite useful later. Named vectors give us another way to subset and selectively replace in vectors: by name.
<pre id=part3-03-named-vec-selection
class="language-r
line-numbers
linkable-line-numbers">
<code>
ca_scores <- scores[c("Student C", "Student A")] # 2 element vector: 73 89
# OR
ca_names <- c("Student C", "Student A")
ca_scores <- scores[ca_names]
scores[c("Student A", "Student C")] <- c(93, 84)
print(scores)
</code></pre>
Although R doesn’t enforce it, the names should be unique to avoid confusion when selecting or selectively replacing this way. Having updated Student A’s and Student C’s scores, the change is reflected in the output:
<pre id=part3-03-named-vec-selection-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
Student A Student B Student C
93 94 84
</code></pre>
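To see why unique names matter, consider a hypothetical vector `dup_scores` in which a name is repeated. The sketch below also uses `anyDuplicated()`, a base R function not otherwise covered here, as one possible way to detect the problem.

```r
dup_scores <- c(89, 94, 73)
names(dup_scores) <- c("Student A", "Student A", "Student C")

# selecting by a repeated name returns only the first match
first_match <- dup_scores["Student A"]
print(first_match)

# anyDuplicated() returns 0 when all names are unique
has_dups <- anyDuplicated(names(dup_scores)) > 0
print(has_dups)                            # [1] TRUE
```

The second `"Student A"` entry is silently unreachable by name, which is exactly the kind of confusion that unique names avoid.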
There’s one final and extremely powerful way of subsetting and selectively replacing in a vector: by logical vector. By indexing with a vector of logicals of the same length as the vector to be indexed, we can extract only those elements where the logical vector has a `TRUE` value.
<pre id=part3-03-logical-selection
class="language-r
line-numbers
linkable-line-numbers">
<code>
select_vec <- c(TRUE, FALSE, TRUE)
ac_scores <- scores[select_vec] # 2 element vector: 93 84
# OR
ac_scores <- scores[c(TRUE, FALSE, TRUE)]
</code></pre>
While indexing by index number and by name allows us to extract elements in any given order, indexing by logical doesn’t afford us this possibility.
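The difference in ordering can be seen by selecting the same two elements all three ways. This is a minimal sketch using a separate, illustrative vector `quiz` so that `scores` is left untouched.

```r
quiz <- c(89, 94, 73)
names(quiz) <- c("Student A", "Student B", "Student C")

by_index   <- quiz[c(3, 1)]                      # Student C, then Student A
by_name    <- quiz[c("Student C", "Student A")]  # same order as requested
by_logical <- quiz[c(TRUE, FALSE, TRUE)]         # always Student A, then Student C

print(names(by_index))    # [1] "Student C" "Student A"
print(names(by_logical))  # [1] "Student A" "Student C"
```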
We can perform selective replacement this way as well; let’s suppose Students A and C retake their quizzes and moderately improve their scores.
<pre id=part3-03-logical-replacement
class="language-r
line-numbers
linkable-line-numbers">
<code>
scores[c(TRUE, FALSE, TRUE)] <- c(94, 86)
print(scores)
</code></pre>
And the printed output:
<pre id=part3-03-logical-replacement-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
Student A Student B Student C
94 94 86
</code></pre>
In this case, the length of the replacement vector (`c(94, 86)`) is equal to the number of `TRUE` values in the indexing vector (`c(TRUE, FALSE, TRUE)`); we’ll explore whether this is a requirement below.
In summary, we have three important ways of indexing into/selecting from/selectively replacing in vectors:
1. by index number vector,
2. by character vector (if the vector is named), and
3. by logical vector.
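The three approaches can be placed side by side on a single vector. The vector `depth` and its names below are invented purely for illustration.

```r
depth <- c(10, 20, 30, 40)
names(depth) <- c("s1", "s2", "s3", "s4")

depth[c(1, 4)] <- c(11, 41)                 # 1. by index number vector
depth[c("s2")] <- 22                        # 2. by character vector (names)
depth[c(FALSE, FALSE, TRUE, FALSE)] <- 33   # 3. by logical vector
print(depth)
```

After these three replacements, every element has been updated exactly once, each by a different indexing method.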
### Vectorized Operations, `NA` Values {-}
If vectors are the most basic unit of data in R, all of the functions and operators we’ve been working with—`as.numeric()`, `*`, and even comparisons like `>`—implicitly work over entire vectors.
<pre id=part3-03-as-numeric-vectorized
class="language-r
line-numbers
linkable-line-numbers">
<code>
numeric_chars <- c("6", "3.7", "9b3x")
numerics <- as.numeric(numeric_chars)
print(numerics) # [1] 6.0 3.7 NA
</code></pre>
In this example, each element of the character vector has been converted, so that `class(numerics)` would return `"numeric"`. The final character string, `"9b3x"`, cannot be reasonably converted to a numeric type, and so it has been replaced by `NA`. When this happens, the interpreter produces a warning message: `NAs introduced by coercion`.
`NA` is a special value in R that indicates either missing data or a failed computation of some type (as in attempting to convert `"9b3x"` to a numeric). Most operations involving `NA` values return `NA` values; for example, `NA + 3` returns `NA`, and many functions that operate on entire vectors return an `NA` if any element is `NA`. A canonical example is the `mean()` function.
<pre id=part3-03-mean-na-example
class="language-r
line-numbers
linkable-line-numbers">
<code>
ave <- mean(numerics)
print(ave) # [1] NA
</code></pre>
Such functions often include an optional parameter that we can give, `na.rm = TRUE`, specifying that `NA` values should be removed before the function is run.
<pre id=part3-03-mean-na-example-narm-true
class="language-r
line-numbers
linkable-line-numbers">
<code>
ave <- mean(numerics, na.rm = TRUE)
print(ave) # [1] 4.85
</code></pre>
While this is convenient, there is also a general way to remove `NA` values from any vector (see the `is.na()` function below).
Other special values in R include `NaN`, for “Not a Number,” returned by calculations such as the square root of -1, `sqrt(-1)`, and `Inf` for “Infinity,” returned by calculations such as `1/0`. (`Inf/Inf`, by the way, returns `NaN`.)
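These special values can be produced and tested for directly. The sketch below uses `is.nan()` and `is.infinite()`, which are base R functions not otherwise discussed in this chapter, and wraps `sqrt(-1)` in `suppressWarnings()` because that call also emits a warning.

```r
not_a_number <- suppressWarnings(sqrt(-1))
infinite <- 1 / 0

print(not_a_number)            # [1] NaN
print(infinite)                # [1] Inf
print(is.nan(Inf / Inf))       # [1] TRUE
print(is.infinite(-1 / 0))     # [1] TRUE

# NaN counts as missing for is.na(), but NA is not NaN
print(is.na(not_a_number))     # [1] TRUE
print(is.nan(NA))              # [1] FALSE
```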
Returning to the concept of vectorized operations, simple arithmetic operations such as `+`, `*`, `/`, `-`, `^` (exponent), and `%%` (modulus) are vectorized as well, meaning that an expression like `3 * 7` is equivalent to `c(3) * c(7)`. When the vectors are longer than a single element, the operation is done on an element-by-element basis.
<pre id=part3-03-vectorized-mult-ex
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(10, 20, 30, 40)
mult <- c(1, 2, 3, 4)
result <- values * mult # 4 element vector: 10 40 90 160
</code></pre>
<div class="fig right" style="width: 20%; margin-left: 20px">
<img src="images/part3-03-vectors.Rmd.images/III.3_27_vectorized_multiplication.png" />
</div>
If we consider the `*` operator, it takes two inputs (numeric or integer) and returns an output (numeric or integer) for each pair from the vectors. This is quite similar to the comparison `>`, which takes two inputs (numeric or integer or character) and returns a logical.
<pre id=part3-03-vectorized-gt-ex
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(10, 20, 30, 40)
comparison_values <- c(25, 10, 25, 35)
result <- values > comparison_values # 4 element vector: FALSE TRUE TRUE TRUE
</code></pre>
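The other arithmetic operators behave the same way. As a brief sketch (the right-hand vectors are arbitrary values chosen for illustration):

```r
values <- c(10, 20, 30, 40)

powers     <- values ^ c(1, 2, 1, 2)    # 10 400 30 1600
ratios     <- values / c(2, 4, 5, 8)    # 5 5 6 5
remainders <- values %% c(3, 3, 7, 7)   # 1 2 2 5
```

In each case the first element of the left vector is paired with the first element of the right vector, the second with the second, and so on.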
### Vector Recycling {-}
###### {- #vector_recycling}
What happens if we try to multiply two vectors that aren’t the same length? It turns out that the shorter of the two will be reused as needed, in a process known as *vector recycling*, or the reuse of the shorter vector in a vectorized operation.
<pre id=part3-03-vector-recyc-1
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(10, 20, 30, 40)
mult <- c(10, -10)
result <- values * mult # 4 element vector: 100 -200 300 -400
</code></pre>
This works well when working with vectors of length one against longer vectors, because the length-one vector will be recycled as needed.
<pre id=part3-03-vector-recyc-2
class="language-r
line-numbers
linkable-line-numbers">
<code>
result <- values * 2 # same as values * c(2)
print(result) # [1] 20 40 60 80
</code></pre>
If the length of the longer vector is not a multiple of the length of the shorter, however, the last recycle will go only partway through.
<pre id=part3-03-vector-recyc-3
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(3, 5, 7)
mult <- c(10, -10)
result <- values * mult # 3 element vector: 30 -50 70
</code></pre>
When this happens, the interpreter prints a warning: `longer object length is not a multiple of shorter object length`. There are few situations where this type of partial recycling is not an accident, and it should be avoided.
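If we want to guard against accidental partial recycling, one possible approach is to compare the lengths before operating. The sketch below also uses `tryCatch()` and `conditionMessage()`, base R tools not covered in this chapter, simply to capture the warning text for inspection.

```r
values <- c(3, 5, 7)
mult <- c(10, -10)

# tryCatch() intercepts the warning so we can inspect its message
msg <- tryCatch(
  {
    values * mult
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
print(msg)

# a simple guard: recycling is complete only when the lengths divide evenly
lengths_ok <- length(values) %% length(mult) == 0
print(lengths_ok)   # [1] FALSE
```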
Vector recycling also applies to selective replacement; for example, we can selectively replace four elements of a vector with elements from a two-element vector:
<pre id=part3-03-vector-recyc-4
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(10, 20, 30, 40, 50, 60)
values[c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)] <- c(5, -5)
print(values) # [1] 5 -5 30 5 50 -5
</code></pre>
More often we’ll selectively replace elements of a vector with a length-one vector.
<pre id=part3-03-vector-recyc-5
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(10, 20, 30, 40, 50, 60)
values[c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)] <- 0 # same as ... <- c(0)
print(values) # [1] 0 0 30 0 50 0
</code></pre>
These concepts, when combined with vector indexing of various kinds, are quite powerful. Consider that an expression like `values > 35` is itself vectorized, with the shorter vector (holding just `35`) being recycled such that what is returned is a logical vector with `TRUE` values where the elements of `values` are greater than `35`. We could use this vector as an indexing vector for selective replacement if we wish.
<pre id=part3-03-vector-recyc-replacement
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(10, 20, 30, 40, 50, 60)
select_vec <- values > 35 # FALSE FALSE FALSE TRUE TRUE TRUE
values[select_vec] <- 0
print(values) # [1] 10 20 30 0 0 0
</code></pre>
More succinctly, rather than create a temporary variable for `select_vec`, we can place the expression `values > 35` directly within the brackets.
<pre id=part3-03-vector-recyc-replacement-2
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(10, 20, 30, 40, 50, 60)
values[values > 35] <- 0
print(values) # [1] 10 20 30 0 0 0
</code></pre>
Similarly, we could use the result of something like `mean(values)` to replace all elements of a vector greater than the mean with `0` easily, no matter the order of the elements!
<pre id=part3-03-vector-recyc-replacement-3
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(30, 10, 60, 20, 40, 50)
values[values > mean(values)] <- 0
print(values) # [1] 30 10 0 20 0 0
</code></pre>
More often, we’ll want to extract such values using logical selection.
<pre id=part3-03-vector-recyc-extraction-mean
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(30, 10, 60, 20, 40, 50)
gt_mean <- values[values > mean(values)]
print(gt_mean) # [1] 60 40 50
</code></pre>
These sorts of vectorized selections, especially when combined with logical vectors, are a powerful and important part of R, so study them until you are confident with the technique.
<div class="exercises">
#### Exercises {-}
<!-- weird, something about bookdown or knitr doesn't like `r` as a variable name in inline code e.g. `rels <- seq(1, 30, 0.3)`
hell I can't even put it in a comment (that should be just r int he e.g.) -->
1. Suppose we have `els` as a range of numbers from 1 to 30 in steps of 0.3; `els <- seq(1, 30, 0.3)`. Using just the `as.integer()` function, logical indexing, and comparisons like `>`, generate a sequence `els_decimals` that contains all values of `els` that are not round integers. (That is, it should contain all values of `els` except `1.0`, `4.0`, `7.0`, and so on. Since `els` has 97 elements, 10 of which are round integers, there should be 87 of them.)
2. We briefly mentioned the `%%`, or “modulus,” operator, which returns the remainder of a number after integer division (e.g., `4 %% 3 == 1` and `4 %% 4 == 0`; it is also vectorized). Given any vector `els`, for example `els <- seq(1, 30, 0.3)`, produce a vector `els_every_other` that contains every other element of `els`. You will likely want to use `%%`, the `==` equality comparison, and you might also want to use `seq()` to generate a vector of indices of the same length as `els`. Do the same again, but modify the code to extract every third element of `els` into a vector called `els_every_third`.
3. From chapter 27, “Variables and Data,” we know that comparisons like `==`, `!=`, `>=` are available as well. Further, we know that `!` negates the values of a logical vector, while `&` combines two logical vectors with “and,” and `|` combines two logical vectors with “or.” Use these, along with the `%%` operator discussed above, to produce a vector `div_3_4` of all integers between `1` and `1,000` (inclusive) that are evenly divisible by `3` and evenly divisible by `4`. (There are 83 of them.) Create another, `not_div_5_6`, of numbers that are not evenly divisible by `5` or `6`. (There are 667 of them. For example, `1,000` should not be included because it is divisible by `5`, and `18` should not be included because it is divisible by `6`, but `34` should be because it is divisible by neither.)
</div>
### Common Vector Functions {-}
As vectors (specifically numeric vectors) are so ubiquitous, R has dozens (hundreds, actually) of functions that do useful things with them. While we can’t cover all of them, we can quickly cover a few that will be important in future chapters.
First, we’ve already seen the `seq()` and `length()` functions; the former generates a numeric vector comprising a sequence of numbers, and the latter returns the length of a vector as a single-element integer vector.
<pre id=part3-03-range-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
range <- seq(0, 7, 0.2) # 0.0 0.2 0.4 ... 7.0
len_range <- length(range) # 36
</code></pre>
Presented without an example, `mean()`, `sd()`, and `median()` return the mean, standard deviation, and median of a numeric vector, respectively (provided that none of the input elements are `NA`, though all three accept the `na.rm = TRUE` parameter). Generalizing `median()`, the `quantile()` function returns the Yth percentile of a vector, or multiple percentiles if the second argument has more than one element.
<pre id=part3-03-quantile-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
quantiles_range <- quantile(range, c(0.25, 0.5, 0.75))
print(quantiles_range)
</code></pre>
The output is a named numeric vector:
<pre id=part3-03-quantile-func-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
25% 50% 75%
1.75 3.50 5.25
</code></pre>
The `unique()` function removes duplicates in a vector, leaving the remaining elements in order of their first occurrence, and the `rev()` function reverses a vector.
<pre id=part3-03-unique-rev-funcs
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(20, 40, 30, 20, 10, 50, 10)
values_uniq <- unique(values) # 20 40 30 10 50
rev_uniq <- rev(values_uniq) # 50 10 30 40 20
</code></pre>
There is the `sort()` function, which sorts a vector (in natural order for numerics and integers, and lexicographic (dictionary) order for character vectors). Perhaps more interesting is the `order()` function, which returns an integer vector of indices describing which element of the original vector belongs in each position of the sorted result.
<pre id=part3-03-order-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
order_rev_uniq <- order(rev_uniq) # 2 5 3 4 1
</code></pre>
In this example, the order vector, `2 5 3 4 1`, indicates that the second element of `rev_uniq` would come first, followed by the fifth, and so on. Thus we could produce a sorted version of `rev_uniq` with `rev_uniq[order_rev_uniq]` (by virtue of vectors’ index-based selection), or more succinctly with `rev_uniq[order(rev_uniq)]`.
<div class="fig center" style="width: 60%">
<img src="images/part3-03-vectors.Rmd.images/III.3_43_order_example.png" />
</div>
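We can confirm that indexing by the result of `order()` reproduces what `sort()` returns. The sketch below redefines `rev_uniq` so that it runs on its own; the `decreasing = TRUE` parameter of `order()` is not discussed above, but it is a standard way to reverse the direction.

```r
rev_uniq <- c(50, 10, 30, 40, 20)
ord <- order(rev_uniq)                            # 2 5 3 4 1
print(identical(rev_uniq[ord], sort(rev_uniq)))   # [1] TRUE

# decreasing = TRUE reverses the direction of the ordering
ord_desc <- order(rev_uniq, decreasing = TRUE)    # 1 4 3 5 2
print(rev_uniq[ord_desc])                         # [1] 50 40 30 20 10
```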
Importantly, this allows us to rearrange multiple vectors with a common order determined by a single one. For example, given two vectors, `id` and `score`, which are related element-wise, we might decide to rearrange both sets in alphabetical order for `id`.
<pre id=part3-03-order-func-example
class="language-r
line-numbers
linkable-line-numbers">
<code>
id <- c("cc4", "aa6", "bb3")
score <- c(20.05, 35.62, 42.71)
id_sorted <- id[order(id)]
score_sorted <- score[order(id)]
print(id_sorted) # [1] "aa6" "bb3" "cc4"
print(score_sorted) # [1] 35.62 42.71 20.05
</code></pre>
The `sample()` function returns a random sample of a given size from a vector, drawn either with or without replacement as specified with the `replace =` parameter (`FALSE` is the default if unspecified).
<pre id=part3-03-sample-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
values <- c(5, 10, 15, 20, 25, 30)
sample_1 <- sample(values, 3, replace = FALSE) # e.g. 15 5 30
sample_2 <- sample(values, 3, replace = TRUE) # e.g. 15 30 15
</code></pre>
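Because the results are random, they will differ from run to run. If repeatable output is needed, one common approach (not covered in the text above) is to call `set.seed()` before sampling; the seed value `42` below is arbitrary.

```r
values <- c(5, 10, 15, 20, 25, 30)

set.seed(42)   # the seed value 42 is arbitrary
draw_a <- sample(values, 3)
set.seed(42)
draw_b <- sample(values, 3)
print(identical(draw_a, draw_b))       # [1] TRUE

# without replacement, no element can be drawn twice
print(anyDuplicated(draw_a) == 0)      # [1] TRUE
```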
The `rep()` function repeats a vector to produce a longer vector. We can repeat in an element-by-element fashion, or over the whole vector, depending on whether the `each =` parameter is used or not.
<pre id=part3-03-rep-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
count <- c(1, 2)
count_rep1 <- rep(count, 3) # 1 2 1 2 1 2
count_rep2 <- rep(count, each = 3) # 1 1 1 2 2 2
</code></pre>
Last (but not least) for this discussion is the `is.na()` function: given a vector with elements that are possibly `NA` values, it returns a logical vector whose elements are `TRUE` at the indices where the original was `NA`, allowing us to easily identify which elements of a vector are `NA` and remove them.
<pre id=part3-03-isna-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
values_char <- c("5.7", "4.3", "a9b3", "2.4")
values <- as.numeric(values_char) # 5.7 4.3 NA 2.4
values_na <- is.na(values) # FALSE FALSE TRUE FALSE
values_no_nas <- values[!values_na] # 5.7 4.3 2.4
# OR
values_no_nas <- values[!is.na(values)] # 5.7 4.3 2.4
</code></pre>
Notice the use of the exclamation point in the above to negate the logical vector returned by `is.na()`.
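Removing `NA` values by hand in this way should give the same answer as the `na.rm = TRUE` parameter seen earlier. The following sketch checks this for `mean()`, and also counts the missing values by summing the logical vector (each `TRUE` is treated as `1`); the variable names are illustrative.

```r
values <- c(5.7, 4.3, NA, 2.4)

mean_manual <- mean(values[!is.na(values)])
mean_narm   <- mean(values, na.rm = TRUE)
print(isTRUE(all.equal(mean_manual, mean_narm)))  # [1] TRUE

# sum() over a logical vector counts the TRUE values
num_missing <- sum(is.na(values))
print(num_missing)                                # [1] 1
```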
### Generating Random Data {-}
R excels at working with probability distributions, including generating random samples from them. Many distributions are supported, including the Normal (Gaussian), Log-Normal, Exponential, Gamma, Student’s t, and so on. Here we’ll just look at generating samples from a few for use in future examples.
First, the `rnorm()` function generates a numeric vector of a given length sampled from the Normal distribution with specified mean (with `mean =`) and standard deviation (with `sd =`).
<pre id=part3-03-rnorm-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
sample_norm <- rnorm(5, mean = 6, sd = 2) # e.g. 7.07 2.4 4.5 6.2 5.1
</code></pre>
Similarly, the `runif()` function samples from a uniform distribution limited by a minimum and maximum value.
<pre id=part3-03-runif-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
sample_unif <- runif(5, min = 2, max = 6) # e.g. 2.1 4.06 2.48 4.67 5.80
</code></pre>
The `rexp()` function generates data from an Exponential distribution with a given “rate” parameter, controlling the rate of decay of the density function (the mean of large samples will approach `1.0/rate`).
<pre id=part3-03-rexp-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
sample_exp <- rexp(5, rate = 1.5) # e.g. 0.24 0.50 0.01 0.55 0.30
</code></pre>
<div class="fig center" style="width: 80%">
<img src="images/part3-03-vectors.Rmd.images/III.3_51_distribution_shapes.png" />
</div>
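The relationship between the rate and the mean of an Exponential sample can be checked empirically. This is only a rough sketch: the sample size of 100,000 and the seed are arbitrary choices, and the sample mean will be close to, but not exactly, `1.0/rate`.

```r
set.seed(1)                       # arbitrary seed, for repeatability
rate <- 1.5
big_sample <- rexp(100000, rate = rate)

print(mean(big_sample))           # close to 1.0 / rate
print(1.0 / rate)                 # [1] 0.6666667
```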
R includes a large number of statistical tests, though we won’t be covering much in the way of statistics other than a few driving examples. The `t.test()` function runs a two-sided Student’s t-test comparing the means of two vectors (by default the Welch variant, which does not assume equal variances). What is returned is a more complex data type with class `"htest"`.
<pre id=part3-03-ttest-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
sample_1 <- rnorm(100, mean = 10, sd = 4)
sample_2 <- rnorm(100, mean = 12, sd = 4)
ttest_result <- t.test(sample_1, sample_2)
print(class(ttest_result)) # [1] "htest"
print(ttest_result)
</code></pre>
When printed, this complex data type formats itself into nice, human-readable output:
<pre id=part3-03-ttest-func-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
Welch Two Sample t-test
data: sample_1 and sample_2
t = -2.6847, df = 193.503, p-value = 0.007889
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.690577 -0.411598
sample estimates:
mean of x mean of y
10.03711 11.58819
</code></pre>
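The printed report is meant for reading, but the individual numbers can also be pulled out for use in later code. The following is a sketch only: it relies on an `"htest"` result storing its pieces under names such as `p.value`, `conf.int`, and `estimate`, and on the `$` syntax for accessing named pieces. The seed is an arbitrary assumption.

```r
set.seed(7)  # arbitrary seed
sample_1 <- rnorm(100, mean = 10, sd = 4)
sample_2 <- rnorm(100, mean = 12, sd = 4)
ttest_result <- t.test(sample_1, sample_2)

print(names(ttest_result))      # names of the stored components
p_val <- ttest_result$p.value   # a single number
conf <- ttest_result$conf.int   # two-element numeric vector
means <- ttest_result$estimate  # the two sample means
print(p_val)
```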
### Reading and Writing Tabular Data, Wrapping Long Lines {-}
Before we go much further, we’re going to want to be able to import data into our R programs from external files (which we’ll assume to be rows and columns of data in text files). We’ll do this with `read.table()`, and the result will be a type of data known as a “data frame” (or `data.frame` in code). We’ll cover the nuances of data frames later, but note for now that they can be thought of as a collection of vectors (of equal length), one for each column in the table.
As an example, let’s suppose we have a tab-separated text file in our present working directory called `states.txt`.^[When running on the command line, the present working directory is inherited from the shell. In RStudio, the present working directory is set to the “project” directory if the file is part of a project folder. In either case, it is possible to change the working directory from within R using the `setwd()` function, as in `setwd("/home/username/rproject")` in Unix/Linux and `setwd("C:/Documents and Settings/username/My Documents/rproject")` in Windows. It is also possible to specify file names by absolute path, as in `/home/username/rproject/states.txt`, no matter the present working directory.] Each row represents one of the US states along with information on population, per capita income, murder rate (per 100,000), percentage of high school graduates, and region (all measured in the 1970s). The first row contains a “header” line with column names.
<pre id=part3-03-states-txt-example
class="language-txt
line-numbers
linkable-line-numbers">
<code>
name population income murder hs_grad region
Alabama 3615 3624 15.1 41.3 South
Alaska 365 6315 11.3 66.7 West
Arizona 2212 4530 7.8 58.1 West
Arkansas 2110 3378 10.1 39.9 South
California 21198 5114 10.3 62.6 West
Colorado 2541 4884 6.8 63.9 West
...
</code></pre>
Later in the file, someone has decided to annotate Michigan’s line, indicating it as the “mitten” state:
<pre id=part3-03-states-txt-example-comment
class="language-txt
line-numbers
linkable-line-numbers">
<code>
...
Massachusetts 5814 4755 3.3 58.5 Northeast
Michigan 9111 4751 11.1 52.8 North Central # mitten
Minnesota 3921 4675 2.3 57.6 North Central
...
</code></pre>
Like most functions, `read.table()` takes many potential parameters (more than 20, in fact), but most of them have reasonable defaults. Still, there are five or so that we will commonly need to set, so a call to `read.table()` often results in a long line of code. Fortunately, the R interpreter allows us to break a long line over multiple lines, so long as each line ends on a character that doesn’t complete the expression (so the interpreter knows it needs to keep reading the following lines before executing them). Common choices are the comma and the plus sign. When we do wrap a long line in this way, it’s customary to indent the following lines to show visually that they continue the expression.
<pre id=part3-03-read-table-func
class="language-r
line-numbers
linkable-line-numbers">
<code>
states <- read.table(file = "states.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE,
comment.char = "#")
</code></pre>
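The wrapping rule can also be seen in isolation with simple arithmetic. This sketch (the variable names are made up for illustration) contrasts a line that ends on a plus sign with one that doesn’t:

```r
# Each of the first two lines ends with "+", so the expression is incomplete
# and the interpreter keeps reading.
total <- 1 + 2 +
  3 + 4 +
  5
print(total)    # [1] 15

# Pitfall: here the first line is already a complete expression, so the
# interpreter runs it by itself; the second line is a separate expression.
partial <- 1 + 2
  + 3 + 4
print(partial)  # [1] 3
```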
When reading `states.txt`, the `file =` parameter specifies the file name to be read, while `header = TRUE` indicates to the interpreter that the first line in the file gives the column names (without it, the column names will be `"V1"`, `"V2"`, `"V3"` and so on). The `sep = "\t"` parameter indicates that tab characters are used to separate the columns in the file (the default is any whitespace), and `comment.char = "#"` indicates that `#` characters and anything after them should be ignored while reading the file (which is appropriate, as evidenced by the `# mitten` annotation in the file). The `stringsAsFactors = FALSE` parameter is more cryptic: it tells the interpreter to leave the character-vector columns (like `region` in this example) as character vectors, rather than convert them to the more sophisticated factor data type (to be covered in later chapters). Since R 4.0.0 this is the default behavior, but setting it explicitly keeps the code working the same way on older versions.
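To experiment with these parameters without the real `states.txt` at hand, we can build a tiny stand-in file. The sketch below assumes a temporary file created with `tempfile()` and `writeLines()`, and a few rows copied from the listing above; the names `tmp_file` and `small_states` are for illustration only.

```r
# Build a tiny tab-separated file in a temporary location (assumed sample
# data, copied from the rows shown above).
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("name\tpopulation\tincome\tmurder\ths_grad\tregion",
             "Alabama\t3615\t3624\t15.1\t41.3\tSouth",
             "Alaska\t365\t6315\t11.3\t66.7\tWest",
             "Michigan\t9111\t4751\t11.1\t52.8\tNorth Central # mitten"),
           tmp_file)

small_states <- read.table(file = tmp_file,
                           header = TRUE,
                           sep = "\t",
                           stringsAsFactors = FALSE,
                           comment.char = "#")

print(colnames(small_states))          # taken from the header line
print(class(small_states$population))  # numbers are detected automatically
print(class(small_states$region))      # text stays a character vector
```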
At this point, the `states` variable contains the data frame holding the columns (vectors) of data. We can print it with `print(states)`, but the result is quite a lot of output:
<pre id=part3-03-print-states-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
name population income murder hs_grad region
1 Alabama 3615 3624 15.1 41.3 South
2 Alaska 365 6315 11.3 66.7 West
3 Arizona 2212 4530 7.8 58.1 West
4 Arkansas 2110 3378 10.1 39.9 South
5 California 21198 5114 10.3 62.6 West
6 Colorado 2541 4884 6.8 63.9 West
...
</code></pre>
It might make more sense to extract just the first 10 rows of data and print them, which we can do with the `head()` function (`head()` can also extract just the first few elements of a long vector).
<pre id=part3-03-states-head
class="language-r
line-numbers
linkable-line-numbers">
<code>
first_10 <- head(states, n = 10)
print(first_10)
# OR
print(head(states, n = 10))
</code></pre>
The functions `nrow()` and `ncol()` return the number of rows and columns of a data frame, respectively (these are preferred over `length()`, which for a data frame returns the number of columns); the `dim()` function returns a two-element vector with the number of rows (at index 1) and the number of columns (at index 2).
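These size functions are easy to try on a small, hand-built data frame. In this sketch the toy data frame is created directly with `data.frame()` (an assumption for illustration, since we have otherwise only created data frames by reading files):

```r
toy_df <- data.frame(name = c("Alabama", "Alaska", "Arizona"),
                     income = c(3624, 6315, 4530),
                     stringsAsFactors = FALSE)

print(nrow(toy_df))     # [1] 3
print(ncol(toy_df))     # [1] 2
print(dim(toy_df))      # [1] 3 2
print(length(toy_df))   # [1] 2 (columns, not rows)
print(head(toy_df, n = 2))
```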
As mentioned previously, individual columns of a data frame are (almost always) vectors. To access one of these individual vectors, we can use a special `$` syntax, with the column name following the `$`.
<pre id=part3-03-states-dollar-syntax
class="language-r
line-numbers
linkable-line-numbers">
<code>
incomes <- states$"income"
print(incomes) # [1] 3624 6315 4530 3378 5114 4884 ...
</code></pre>
So long as the column name is sufficiently simple (in particular, so long as it doesn’t have any spaces), then the quote marks around the column name can be (and often are) omitted.
<pre id=part3-03-states-dollar-syntax-noquotes
class="language-r
line-numbers
linkable-line-numbers">
<code>
incomes <- states$income
print(incomes) # [1] 3624 6315 4530 3378 5114 4884 ...
</code></pre>
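For contrast, here is a sketch of a column name that does need the quotes. The data frame and its `"per capita"` column are hypothetical, built with `data.frame()` and `check.names = FALSE` so that the space in the name is kept.

```r
odd_df <- data.frame("per capita" = c(3624, 6315),
                     region = c("South", "West"),
                     check.names = FALSE,   # keep the space in the column name
                     stringsAsFactors = FALSE)

print(odd_df$region)         # simple name: no quotes needed
print(odd_df$"per capita")   # name with a space: quotes required
```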
Although this syntax can be used to extract a column from a data frame as a vector, note that it refers to the vector within the data frame as well. In a sense, `states$income` *is* the vector stored in the `states` data frame. Thus we can use techniques like [selective replacement]() to work with them just like any other vectors. Here, we’ll replace all instances of “North Central” in the `states$region` vector with just the term “Central,” effectively renaming the region.^[If you have any familiarity with R, you might have run across the `attach()` function, which takes a data frame and results in the creation of a separate vector for each column. Generally, “disassembling” a data frame this way is a bad idea—after all, the columns of a data frame are usually associated with each other for a reason! Further, this function results in the creation of many variables with names based on the column names of the data frame. Because these names aren’t clearly delimited in the code, it’s easy to create hard-to-find bugs and mix up columns from multiple data frames this way.]
<pre id=part3-03-states-rename-region
class="language-r
line-numbers
linkable-line-numbers">
<code>
nrth_cntrl_logical <- states$region == "North Central" # Logical vector
states$region[nrth_cntrl_logical] <- "Central" # Selective replacement
# OR
states$region[states$region == "North Central"] <- "Central"
</code></pre>
Writing a data frame to a tab-separated file is accomplished with the `write.table()` function.^[There are also more specialized functions for both reading and writing tabular data, such as `read.csv()` and `write.csv()`. We’ve focused on `read.table()` and `write.table()` because they are flexible enough to read and write tables in a variety of formats, including comma separated, tab separated, and so on.] As with `read.table()`, `write.table()` can take quite a few parameters, most of which have reasonable defaults. But there are six or so we’ll want to set more often than others. Let’s write the modified `states` data frame to a file called `states_modified.txt` as a tab-separated file.
<pre id=part3-03-write-table
class="language-r
line-numbers
linkable-line-numbers">
<code>
write.table(states,
file = "states_modified.txt",
quote = FALSE,
sep = "\t",
row.names = FALSE,
col.names = TRUE)
</code></pre>
The first two parameters here are the data frame to write and the file name to write to. The `quote = FALSE` parameter specifies that quotation marks shouldn’t be written around character types in the output (so the `name` column will have entries like `Alabama` and `Alaska` rather than `"Alabama"` and `"Alaska"`). The `sep = "\t"` indicates that tabs should separate the columns, while `row.names = FALSE` indicates that row names should not be written (because they don’t contain any meaningful information for this data frame), and `col.names = TRUE` indicates that we do want the column names output to the first line of the file as a “header” line.
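A quick way to check what these parameters produce is a round trip: write a small data frame, look at the raw lines of the file, and read it back in. This is a sketch under assumptions: the toy data frame, the temporary file from `tempfile()`, and the use of `readLines()` to view the file are all for illustration.

```r
toy_df <- data.frame(name = c("Alabama", "Michigan"),
                     population = c(3615, 9111),
                     region = c("South", "Central"),
                     stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".txt")
write.table(toy_df,
            file = out_file,
            quote = FALSE,
            sep = "\t",
            row.names = FALSE,
            col.names = TRUE)

# The raw lines of the file: a header line, then one line per row.
print(readLines(out_file))

round_trip <- read.table(out_file,
                         header = TRUE,
                         sep = "\t",
                         stringsAsFactors = FALSE)
print(all(round_trip$name == toy_df$name))   # [1] TRUE
```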
<div class="callout-box">
#### R and the Unix/Linux Command Line {-}
In chapter 26, “[An Introduction](),” we mentioned that R scripts can be run from the command line by using the `#!/usr/bin/env Rscript` executable environment. (Older versions of R required the user to run a command like `R CMD BATCH scriptname.R`, but today using `Rscript` is preferred.) We devoted more discussion to interfacing Python with the command line environment than we will R, partially because R isn’t as frequently used that way, but also because it’s quite easy.
When using `read.table()`, for example, data can be read from standard input by using the file name `"stdin"`. Anything that is printed from an R script goes to standard output by default. Because R does a fair amount of formatting when printing, however, it is often more convenient to print data frames using `write.table()` specifying `file = ""`.
Finally, to get command line parameters into an R script as a character vector, the line `args <- commandArgs(trailingOnly = TRUE)` will do the trick. Here’s a simple script that will read a table on standard input, write it to standard output, and also read and print out any command line arguments:
<pre id=part3-03-rscript-example
class="language-r
line-numbers
linkable-line-numbers">
<code>
#!/usr/bin/env Rscript
# read args from command-line params
args <- commandArgs(trailingOnly = TRUE)
print(args)
# read data frame from stdin
input_df <- read.table("stdin",
header = FALSE,
stringsAsFactors = FALSE)
# write data frame to stdout
write.table(input_df,
file = "",
row.names = FALSE,
col.names = FALSE,
sep = "\t")
</code></pre>
Try making this script executable on the command line, and running it on `p450s_blastp_yeast_top1.txt` with something like `cat p450s_blastp_yeast_top1.txt | ./stdin_stdout_ex.R arg1 'arg 2'`.
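Note that command line arguments always arrive as a character vector, so numbers need converting before use. In the sketch below, a hard-coded vector stands in for `commandArgs(trailingOnly = TRUE)`, and the script name and argument values are hypothetical.

```r
# Stand-in for commandArgs(trailingOnly = TRUE) when a script is run as
#   ./some_script.R 5 2.5     (hypothetical script name and arguments)
args <- c("5", "2.5")

print(class(args))                   # [1] "character"
n_samples <- as.integer(args[1])     # convert text to an integer
rate <- as.numeric(args[2])          # convert text to a number
print(rexp(n_samples, rate = rate))  # five random values
```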
</div>
<div class="exercises">
#### Exercises {-}
1. Suppose we have any odd-length numeric vector (e.g., `sample <- c(3.2, 5.1, 2.5, 1.6, 7.9)` or `sample <- runif(25, min = 0, max = 1)`). Write some lines of code that result in printing the median of the vector, without using the `median()` or `quantile()` functions. You might find the `length()` and `as.integer()` functions to be helpful.
2. If `sample` is a sample from an exponential distribution, for example, `sample <- rexp(1000, rate = 1.5)`, then the median of the sample is generally smaller than the mean. Generate a vector, `between_median_mean`, that contains all values of `sample` that are larger than (or equal to) the median of the sample, and less than (or equal to) the mean of the sample.
3. Read in the [`states.txt`]() file into a data frame as described. Extract a numeric vector called `murder_lowincome` containing murder rates for just those states with per capita incomes less than the median per capita income (you can use the `median()` function this time). Similarly, extract a vector called `murder_highincome` containing murder rates for just those states with greater than (or equal to) the median per capita income. Run a two-sample `t.test()` to determine whether the mean murder rates are different between these two groups.
4. Let `states` be the state information data frame described above. Describe what the various operations below do in terms of indexing, selective replacement, vector recycling, and the types of data involved (e.g., numeric vectors and logical vectors). To get you started, the first line adds a new column to the `states` data frame called `"newpop"` that contains the same information as the `"population"` column.
<pre id=part3-03-describe-ops-exercise
class="language-r
line-numbers
linkable-line-numbers">
<code>
states$newpop <- states$population
highmurder <- states$murder >= median(states$murder)
states$newpop[highmurder] <- states$population[highmurder] * 0.9
states$newpop[!highmurder] <- states$population[!highmurder] * 1.1
</code></pre>
5. Determine the number of unique regions that are listed in the `states` data frame. Determine the number of unique regions represented by states with greater than the median income.
6. What does the `sum()` function report for a numeric vector `c(2, 3, 0, 1, 0, 2)`? How about for `c(1, 0, 0, 1, 1, 0)`? And, finally, how about for the logical vector `c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)`? How could the `sum()` function thus be useful in a logical context?
</div>