-
Notifications
You must be signed in to change notification settings - Fork 83
/
Copy path08_Probability.Rmd
928 lines (693 loc) · 31.8 KB
/
08_Probability.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
```{r, echo=FALSE}
library(knitr)
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
lines <- options$output.lines
if (is.null(lines)) {
return(hook_output(x, options)) # pass to default hook
}
x <- unlist(strsplit(x, "\n"))
more <- "etc ..."
if (length(lines) == 1) {
if (length(x) > lines) { # truncate the output, but add ....
x <- c(head(x, lines), more)
}
} else {
x <- c(more, x[lines], more)
}
# paste these lines together
x <- paste(c(x, ""), collapse = "\n")
hook_output(x, options)
})
```
# Probability {#Probability}
Introduction {-#intro-Probability}
------------
Probability theory is the foundation of statistics, and R has plenty of
machinery for working with probability, probability distributions, and
random variables. The recipes in this chapter show you how to calculate
probabilities from quantiles, calculate quantiles from probabilities,
generate random variables drawn from distributions, plot distributions,
and so forth.
### Names of Distributions {-}
R has an abbreviated name for every probability distribution. This name
is used to identify the functions associated with the distribution. For
example, the name of the Normal distribution is “norm,” which is the
root of the function names listed in Table \@ref(tab:tbl-profunctions).
Function Purpose
---------- ------------------------------
`dnorm` Normal density
`pnorm` Normal distribution function
`qnorm` Normal quantile function
`rnorm` Normal random variates
Table: (\#tab:tbl-profunctions) Normal distribution functions
Table \@ref(tab:tbl-Discrete) describes some common discrete
distributions, and Table \@ref(tab:tbl-Continuous) describes several
common continuous distributions.
Table: (\#tab:tbl-Discrete) Common discrete distributions
|Discrete distribution|R name|Parameters |
|---|---|---|
| Binomial| `binom` | `n` = number of trials; `p` = probability of success for one trial |
|Geometric|`geom`|`p` = probability of success for one trial|
|Hypergeometric|`hyper`|`m` = number of white balls in urn; `n` = number of black balls in urn; `k` = number of balls drawn from urn|
|Negative binomial (NegBinomial)|`nbinom`|`size` = number of successful trials; either `prob` = probability of successful trial or `mu` = mean|
|Poisson|`pois`|`lambda` = mean|
Table: (\#tab:tbl-Continuous) Common continuous distributions
|Continuous distribution|R name|Parameters|
|--|--|--|
| Beta| `beta` | `shape1`; `shape2` |
|Cauchy|`cauchy`|`location`; `scale`|
|Chi-squared (Chisquare)|`chisq`|`df` = degrees of freedom|
|Exponential|`exp`|`rate`|
|F|`f`|`df1` and `df2` = degrees of freedom|
|Gamma|`gamma`|`rate`; either `rate` or `scale`|
|Log-normal (Lognormal)|`lnorm`|`meanlog` = mean on logarithmic scale; `sdlog` = standard deviation on logarithmic scale|
|Logistic|`logis`|`location`; `scale`|
|Normal|`norm`|`mean`; `sd` = standard deviation|
|Student’s *t* (TDist)|`t`|`df` = degrees of freedom|
|Uniform|`unif`|`min` = lower limit; `max` = upper limit|
|Weibull|`weibull`|`shape`; `scale`|
|Wilcoxon|`wilcox`|`m` = number of observations in first sample; `n` = number of observations in second sample|
> **Warning**
>
> All distribution-related functions require distributional parameters,
> such as `size` and `prob` for the binomial or `prob` for the
> geometric. The big “gotcha” is that the distributional parameters may
> not be what you expect. For example, we would expect the parameter of
> an exponential distribution to be *β*, the mean. The R convention,
> however, is for the exponential distribution to be defined by the rate
> = 1/*β*, so we often supply the wrong value. The moral is, study the
> help page before you use a function related to a distribution. Be sure
> you’ve got the parameters right.
### Getting Help on Probability Distributions {-}
To see the R functions related to a particular probability distribution,
use the help command and the full name of the distribution. For example,
this will show the functions related to the Normal distribution:
``` {r, eval=FALSE}
?Normal
```
Some distributions have names that don’t work well with the help
command, such as “Student’s *t*.” They have special help names, as noted
in Tables \@ref(tab:tbl-Discrete) and \@ref(tab:tbl-Continuous): NegBinomial, Chisquare, Lognormal, and
TDist. Thus, to get help on the Student’s *t* distribution, use this:
``` {r, eval=FALSE}
?TDist
```
### See Also {-}
There are many other distributions implemented in downloadable packages;
see the CRAN task view devoted to [probability
distributions](http://cran.r-project.org/web/views/Distributions.html).
The `SuppDists` package is part of the R base, and it includes 10
supplemental distributions. The `MASS` package, which is also part of
the base, provides additional support for distributions, such as
maximum-likelihood fitting for some common distributions as well as
sampling from a multivariate normal distribution.
Counting the Number of Combinations {#recipe-id163}
-----------------------------------
### Problem {-#problem-id163}
You want to calculate the number of combinations of *n* items taken *k*
at a time.
### Solution {-#solution-id163}
Use the `choose` function:
``` {r, eval = FALSE}
choose(n, k)
```
### Discussion {-#discussion-id163}
A common problem in computing probabilities of discrete variables is
counting combinations: the number of distinct subsets of size *k* that
can be created from *n* items. The number is given by *n*!/*r*!(*n* -
*r*)!, but it’s much more convenient to use the `choose`
function—especially as *n* and *k* grow larger:
``` {r}
choose(5, 3) # How many ways can we select 3 items from 5 items?
choose(50, 3) # How many ways can we select 3 items from 50 items?
choose(50, 30) # How many ways can we select 30 items from 50 items?
```
These numbers are also known as *binomial coefficients*.
### See Also {-#see_also-id163}
This recipe merely counts the combinations; see
Recipe \@ref(recipe-id198), ["Generating Combinations"](#recipe-id198), to actually generate them.
Generating Combinations {#recipe-id198}
-----------------------
### Problem {-#problem-id198}
You want to generate all combinations of *n* items taken *k* at a time.
### Solution {-#solution-id198}
Use the `combn` function:
``` {r}
items <- 2:5
k <- 2
combn(items, k)
```
### Discussion {-#discussion-id198}
We can use `combn(1:5,3)` to generate all combinations of the numbers 1
through 5 taken three at a time:
``` {r}
combn(1:5, 3)
```
The function is not restricted to numbers. We can generate combinations
of strings, too. Here are all combinations of five treatments taken
three at a time:
``` {r}
combn(c("T1", "T2", "T3", "T4", "T5"), 3)
```
> **Warning**
>
> As the number of items, *n*, increases, the number of combinations can
> explode—especially if *k* is not near to 1 or *n*.
### See Also {-#see_also-id198}
See Recipe \@ref(recipe-id163), ["Counting the Number of Combinations"](#recipe-id163), to count the
number of possible combinations *before* you generate a huge set.
Generating Random Numbers {#recipe-id165}
-------------------------
### Problem {-#problem-id165}
You want to generate random numbers.
### Solution {-#solution-id165}
The simple case of generating a uniform random number between 0 and 1 is
handled by the `runif` function. This example generates one uniform
random number:
``` {r}
runif(1)
```
> Note: If you are saying `runif` out loud (or even in your head), you should pronounce it "are unif" instead of "run if." The term `runif` is a *portmanteau* of "random uniform" so should not sound as if it's a flow control function.
R can generate random variates from other distributions as well. For a
given distribution, the name of the random number generator is “r”
prefixed to the distribution’s abbreviated name (e.g., `rnorm` for the
normal distribution’s random number generator). This example generates
one random value from the standard normal distribution:
``` {r}
rnorm(1)
```
### Discussion {-#discussion-id165}
Most programming languages have a wimpy random number generator that
generates one random number, uniformly distributed between 0.0 and 1.0,
and that’s all. Not R.
R can generate random numbers from many probability distributions other
than the uniform distribution. The simple case of generating uniform
random numbers between 0 and 1 is handled by the `runif` function:
``` {r}
runif(1)
```
The argument of `runif` is the number of random values to be generated.
Generating a vector of 10 such values is as easy as generating one:
``` {r}
runif(10)
```
There are random number generators for all built-in distributions.
Simply prefix the distribution name with “r” and you have the name of
the corresponding random number generator. Here are some common ones:
```{r, include=FALSE}
set.seed(42)
```
``` {r}
runif(1, min = -3, max = 3) # One uniform variate between -3 and +3
rnorm(1) # One standard normal variate
rnorm(1, mean = 100, sd = 15) # One normal variate, mean 100 and SD 15
rbinom(1, size = 10, prob = 0.5) # One binomial variate
rpois(1, lambda = 10) # One Poisson variate
rexp(1, rate = 0.1) # One exponential variate
rgamma(1, shape = 2, rate = 0.1) # One gamma variate
```
As with `runif`, the first argument is the number of random values to be
generated. Subsequent arguments are the parameters of the distribution,
such as `mean` and `sd` for the normal distribution or `size` and `prob`
for the binomial. See the function’s R help page for details.
The examples given so far use simple scalars for distributional
parameters. Yet the parameters can also be vectors, in which case R will
cycle through the vector while generating random values. The following
example generates three normal random values drawn from distributions
with means of –10, 0, and +10, respectively (all distributions have a
standard deviation of 1.0):
``` {r}
rnorm(3, mean = c(-10, 0, +10), sd = 1)
```
That is a powerful capability in cases such as hierarchical models,
where the parameters are themselves random. The next example calculates
30 draws of a normal variate whose mean is itself randomly distributed
and with hyperparameters of *μ* = 0 and *σ* = 0.2:
``` {r}
means <- rnorm(30, mean = 0, sd = 0.2)
rnorm(30, mean = means, sd = 1)
```
If you are generating many random values and the vector of parameters is
too short, R will apply the Recycling Rule to the parameter vector.
### See Also {-#see_also-id165}
See the [Introduction](#intro-Probability) to this chapter.
Generating Reproducible Random Numbers {#recipe-id201}
--------------------------------------
### Problem {-#problem-id201}
You want to generate a sequence of random numbers, but you want to
reproduce the same sequence every time your program runs.
### Solution {-#solution-id201}
Before running your R code, call the `set.seed` function to initialize
the random number generator to a known state:
``` {r}
set.seed(42) # Or use any other positive integer...
```
### Discussion {-#discussion-id201}
After generating random numbers, you may often want to reproduce the
same sequence of “random” numbers every time your program executes. That
way, you get the same results from run to run.
One of the authors once supported a
complicated Monte Carlo analysis of a huge portfolio of securities. The
users complained about getting slightly different results each time the
program ran. No kidding! The analysis was driven entirely by random numbers, so
of course there was randomness in the output. The solution was to set the
random number generator to a known state at the beginning of the
program. That way, it would generate the same (quasi-)random numbers
each time and thus yield consistent, reproducible results.
In R, the `set.seed` function sets the random number generator to a
known state. The function takes one argument, an integer. Any positive
integer will work, but you must use the same one in order to get the
same initial state.
The function returns nothing. It works behind the scenes, initializing
(or reinitializing) the random number generator. The key here is that
using the same seed restarts the random number generator back at the
same place:
``` {r}
set.seed(165) # Initialize generator to known state
runif(10) # Generate ten random numbers
set.seed(165) # Reinitialize to the same known state
runif(10) # Generate the same ten "random" numbers
```
> **Warning**
>
> When you set the seed value and freeze your sequence of random
> numbers, you are eliminating a source of randomness that may be critical
> to algorithms such as Monte Carlo simulations. Before you call
> `set.seed` in your application, ask yourself: Am I undercutting the
> value of my program or perhaps even damaging its logic?
### See Also {-#see_also-id201}
See Recipe \@ref(recipe-id165), ["Generating Random Numbers"](#recipe-id165), for more about generating
random numbers.
Generating a Random Sample {#recipe-id199}
--------------------------
### Problem {-#problem-id199}
You want to sample a dataset randomly.
### Solution {-#solution-id199}
The `sample` function will randomly select *n* items from a set:
``` {r, eval = FALSE}
sample(set, n)
```
### Discussion {-#discussion-id199}
Suppose your World Series data contains a vector of years when the
Series was played. You can select 10 years at random using `sample`:
``` {r, message=FALSE, warning=FALSE}
world_series <- read_csv("./data/world_series.csv")
sample(world_series$year, 10)
```
The items are randomly selected, so running `sample` again (usually)
produces a different result:
``` {r}
sample(world_series$year, 10)
```
The `sample` function normally samples without replacement, meaning it
will not select the same item twice. Some statistical procedures
(especially the bootstrap) require sampling *with* replacement, which
means that one item can appear multiple times in the sample. Specify
`replace=TRUE` to sample with replacement.
It’s easy to implement a simple bootstrap using sampling with
replacement.
Suppose we have a vector, `x`, of 1,000 random numbers,
drawn from a normal distribution with mean 4 and standard deviation 10.
```{r}
set.seed(42)
x <- rnorm(1000, 4, 10)
```
This code fragment samples 1,000 times from `x` and
calculates the median of each sample:
``` {r}
medians <- numeric(1000) # empty vector of 1000 numbers
for (i in 1:1000) {
medians[i] <- median(sample(x, replace = TRUE))
}
```
From the bootstrap estimates, we can estimate the confidence interval
for the median:
``` {r}
ci <- quantile(medians, c(0.025, 0.975))
cat("95% confidence interval is (", ci, ")\n")
```
We know that `x` was created from a normal distribution with a mean of 4
and, hence, the sample median should be 4 also.
(In a symmetrical distribution like the normal, the mean and the median are the same.)
Our confidence interval easily contains the value.
### See Also {-#see_also-id199}
See Recipe \@ref(recipe-id151), ["Randomly Permuting a Vector"](#recipe-id151), for randomly permuting
a vector and Recipe \@ref(recipe-id159), ["Bootstrapping a Statistic"](#recipe-id159), for more about bootstrapping. Recipe \@ref(recipe-id201), ["Generating Reproducible Random Numbers"](#recipe-id201), discusses setting seeds for quasi-random numbers.
Generating Random Sequences {#recipe-id200}
---------------------------
### Problem {-#problem-id200}
You want to generate a random sequence, such as a series of simulated
coin tosses or a simulated sequence of Bernoulli trials.
### Solution {-#solution-id200}
Use the `sample` function. Sample *n* draws from the set of possible values, and
set `replace=TRUE`:
``` {r, eval=FALSE}
sample(set, n, replace = TRUE)
```
### Discussion {-#discussion-id200}
The `sample` function randomly selects items from a set. It normally
samples *without* replacement, which means that it will not select the
same item twice and will return an error if you try to sample more items than exist in the set. With `replace=TRUE`, however, `sample` can select items
over and over; this allows you to generate long, random sequences of
items.
The following example generates a random sequence of 10 simulated flips
of a coin:
``` {r}
sample(c("H", "T"), 10, replace = TRUE)
```
The next example generates a sequence of 20 Bernoulli trials—random
successes or failures. We use `TRUE` to signify a success:
``` {r}
sample(c(FALSE, TRUE), 20, replace = TRUE)
```
By default, `sample` will choose equally among the set elements and so
the probability of selecting either `TRUE` or `FALSE` is 0.5. With a
Bernoulli trial, the probability *p* of success is not necessarily 0.5.
You can bias the sample by using the `prob` argument of `sample`; this
argument is a vector of probabilities, one for each set element. Suppose
we want to generate 20 Bernoulli trials with a probability of success
*p* = 0.8. We set the probability of `FALSE` to be 0.2 and the
probability of `TRUE` to 0.8:
``` {r}
sample(c(FALSE, TRUE), 20, replace = TRUE, prob = c(0.2, 0.8))
```
The resulting sequence is clearly biased toward `TRUE`. We chose this
example because it’s a simple demonstration of a general technique. For
the special case of a binary-valued sequence you can use `rbinom`, the
random generator for binomial variates:
``` {r}
rbinom(10, 1, 0.8)
```
Randomly Permuting a Vector {#recipe-id151}
---------------------------
### Problem {-#problem-id151}
You want to generate a random permutation of a vector.
### Solution {-#solution-id151}
If `v` is your vector, then `sample(v)` returns a random permutation.
### Discussion {-#discussion-id151}
We typically think of the `sample` function for sampling from large datasets.
However, the default parameters enable you to create a random
rearrangement of the dataset. The function call `sample(v)` is
equivalent to:
``` {r, eval=FALSE}
sample(v, size = length(v), replace = FALSE)
```
which means “select all the elements of `v` in random order while using
each element exactly once.” That is a random permutation. Here is a
random permutation of 1, ..., 10:
``` {r}
sample(1:10)
```
### See Also {-#see_also-id151}
See Recipe \@ref(recipe-id199), ["Generating a Random Sample"](#recipe-id199), for more about `sample`.
Calculating Probabilities for Discrete Distributions {#recipe-id196}
----------------------------------------------------
### Problem {-#problem-id196}
You want to calculate either the simple or the cumulative probability
associated with a discrete random variable.
### Solution {-#solution-id196}
For a simple probability, *P*(*X* = *x*), use the density function. All
built-in probability distributions have a density function whose name is
“d” prefixed to the distribution name. For example, `dbinom` for the
binomial distribution.
For a cumulative probability, *P*(*X* ≤ *x*), use the distribution
function. All built-in probability distributions have a distribution
function whose name is “p” prefixed to the distribution name; thus,
`pbinom` is the distribution function for the binomial distribution.
### Discussion {-#discussion-id196}
Suppose we have a binomial random variable *X* over 10 trials, where
each trial has a success probability of 1/2. Then we can calculate the
probability of observing *x* = 7 by calling `dbinom`:
``` {r}
dbinom(7, size = 10, prob = 0.5)
```
That calculates a probability of about 0.117. R calls `dbinom` the
*density function*. Some textbooks call it the *probability mass function*
or the *probability function*. Calling it a density function keeps the
terminology consistent between discrete and continuous distributions
(see Recipe \@ref(recipe-id197), ["Calculating Probabilities for Continuous Distributions"](#recipe-id197)).
The cumulative probability, *P*(*X* ≤ *x*), is given by the *distribution
function*, which is sometimes called the *cumulative probability function*.
The distribution function for the binomial distribution is `pbinom`.
Here is the cumulative probability for *x* = 7 (i.e., *P*(*X* ≤ 7)):
``` {r}
pbinom(7, size = 10, prob = 0.5)
```
It appears the probability of observing *X* ≤ 7 is about 0.945.
The density functions and distribution functions for some
common discrete distributions are shown in Table \@ref(tab:distributions).
Table: (\#tab:distributions) Discrete distributions
Distribution Density function: *P*(*X* = *x*) Distribution function: *P*(*X* ≤ *x*)
-------------- ---------------------------------- ---------------------------------------
Binomial `dbinom(*x*, *size*, *prob*)` `pbinom(*x*, *size*, *prob*)`
Geometric `dgeom(*x*, *prob*)` `pgeom(*x*, *prob*)`
Poisson `dpois(*x*, *lambda*)` `ppois(*x*, *lambda*)`
The complement of the cumulative probability is the *survival function*,
*P*(*X* > *x*). All of the distribution functions let you find this
right-tail probability simply by specifying `lower.tail=FALSE`:
``` {r}
pbinom(7, size = 10, prob = 0.5, lower.tail = FALSE)
```
Thus we see that the probability of observing *X* > 7 is about 0.055.
The *interval probability*, *P*(*x*~1~ < *X* ≤ *x*~2~), is the
probability of observing *X* between the limits *x*~1~ and *x*~2~. It is calculated as the difference between two cumulative
probabilities: *P*(*X* ≤ *x*~2~) – *P*(*X* ≤ *x*~1~). Here is *P*(3 <
*X* ≤ 7) for our binomial variable:
``` {r}
pbinom(7, size = 10, prob = 0.5) - pbinom(3, size = 10, prob = 0.5)
```
R lets you specify multiple values of *x* for these functions and will
return a vector of the corresponding probabilities. Here we calculate
two cumulative probabilities, *P*(*X* ≤ 3) and *P*(*X* ≤ 7), in one call
to `pbinom`:
``` {r}
pbinom(c(3, 7), size = 10, prob = 0.5)
```
This leads to a one-liner for calculating interval probabilities. The
`diff` function calculates the difference between successive elements of
a vector. We apply it to the output of `pbinom` to obtain the difference
in cumulative probabilities—in other words, the interval probability:
``` {r}
diff(pbinom(c(3, 7), size = 10, prob = 0.5))
```
### See Also {-#see_also-id196}
See this chapter’s [Introduction](#intro-Probability) for more about the
built-in probability distributions.
Calculating Probabilities for Continuous Distributions {#recipe-id197}
------------------------------------------------------
### Problem {-#problem-id197}
You want to calculate the distribution function (DF) or cumulative
distribution function (CDF) for a continuous random variable.
### Solution {-#solution-id197}
Use the distribution function, which calculates *P*(*X* ≤ *x*). All
built-in probability distributions have a distribution function whose
name is “p” prefixed to the distribution’s abbreviated name—for
instance, `pnorm` for the normal distribution.
Example: what's the probability of a draw being below .8 for a draw from a random standard normal distribution?
```{r}
pnorm(q = .8, mean = 0, sd = 1)
```
### Discussion {-#discussion-id197}
The R functions for probability distributions follow a consistent
pattern, so the solution to this recipe is essentially identical to the
solution for discrete random variables
(see Recipe \@ref(recipe-id196), ["Calculating Probabilities for Discrete Distributions"](#recipe-id196)).
The significant difference is that continuous variables have no
“probability” at a single point, *P*(*X* = *x*). Instead, they have a
density at a point.
Given that consistency, the discussion of distribution functions in
Recipe \@ref(recipe-id196), ["Calculating Probabilities for Discrete Distributions"](#recipe-id196), is
applicable here, too. Table \@ref(tab:continuous) gives the distribution
functions for several continuous distributions.
Table: (\#tab:continuous) Continuous distributions
Distribution Distribution function: *P*(*X* ≤ *x*)
-------------------- ---------------------------------------
Normal `pnorm(*x*, *mean*, *sd*)`
Student’s *t* `pt(*x*, *df*)`
Exponential `pexp(*x*, *rate*)`
Gamma `pgamma(*x*, *shape*, *rate*)`
Chi-squared (χ^2^) `pchisq(*x*, *df*)`
We can use `pnorm` to calculate the probability that a man is shorter
than 66 inches, assuming that men’s heights are normally distributed
with a mean of 70 inches and a standard deviation of 3 inches.
Mathematically speaking, we want *P*(*X* ≤ 66) given that *X* \~ *N*(70,
3):
``` {r}
pnorm(66, mean = 70, sd = 3)
```
Likewise, we can use `pexp` to calculate the probability that an
exponential variable with a mean of 40 could be less than 20:
``` {r}
pexp(20, rate = 1 / 40)
```
Just as for discrete probabilities, the functions for continuous
probabilities use `lower.tail=FALSE` to specify the survival function,
*P*(*X* > *x*). This call to `pexp` gives the probability that the
same exponential variable could be greater than 50:
``` {r}
pexp(50, rate = 1 / 40, lower.tail = FALSE)
```
Also like discrete probabilities, the interval probability for a
continuous variable, *P*(*x*~1~ < *X* < *x*~2~), is computed as
the difference between two cumulative probabilities, *P*(*X* <
*x*~2~) – *P*(*X* < *x*~1~). For the same exponential variable, here
is *P*(20 < *X* < 50), the probability that it could fall between
20 and 50:
``` {r}
pexp(50, rate = 1 / 40) - pexp(20, rate = 1 / 40)
```
### See Also {-#see_also-id197}
See this chapter’s [Introduction](#intro-Probability) for more about the
built-in probability distributions.
Converting Probabilities to Quantiles {#recipe-id168}
-------------------------------------
### Problem {-#problem-id168}
Given a probability *p* and a distribution, you want to determine the
corresponding quantile for *p*: the value *x* such that *P*(*X* ≤ *x*) =
*p*.
### Solution {-#solution-id168}
Every built-in distribution includes a quantile function that converts
probabilities to quantiles. The function’s name is “q” prefixed to the
distribution name; thus, for instance, `qnorm` is the quantile function
for the normal distribution.
The first argument of the quantile function is the probability. The
remaining arguments are the distribution’s parameters, such as `mean`,
`shape`, or `rate`:
``` {r}
qnorm(0.05, mean = 100, sd = 15)
```
### Discussion {-#discussion-id168}
A common example of computing quantiles is when we compute the limits of
a confidence interval. If we want to know the 95% confidence interval
(*α* = 0.05) of a standard normal variable, then we need the quantiles
with probabilities of *α*/2 = 0.025 and (1 – *α*)/2 = 0.975:
``` {r}
qnorm(0.025)
qnorm(0.975)
```
In the true spirit of R, the first argument of the quantile functions
can be a vector of probabilities, in which case we get a vector of
quantiles. We can simplify this example into a one-liner:
``` {r}
qnorm(c(0.025, 0.975))
```
All the built-in probability distributions provide a quantile function.
Table \@ref(tab:discrete-quant-dist) shows the quantile functions for some common discrete distributions.
Table: (\#tab:discrete-quant-dist) Discrete quantile distributions
Distribution Quantile function
-------------- -----------------------
Binomial `qbinom(*p*, *size*, *prob*)`
Geometric `qgeom(*p*, *prob*)`
Poisson `qpois(*p*, *lambda*)`
Table \@ref(tab:cont-quant-dist) shows the quantile functions for common continuous distributions.
Table: (\#tab:cont-quant-dist) Continuous quantile distributions
Distribution Quantile function
-------------------- --------------------------------------------------------------
Normal `qnorm(*p*, *mean*, *sd*)`
Student’s *t* `qt(*p*, *df*)`
Exponential `qexp(*p*, *rate*)`
Gamma `qgamma(*p*, *shape*, *rate*=*rate*)` or `qgamma(*p*, *shape*, *scale*=*scale*)`
Chi-squared (χ^2^) `qchisq(*p*, *df*)`
### See Also {-#see-also-id168}
Determining the quantiles of a dataset is different from determining
the quantiles of a distribution—see Recipe \@ref(recipe-id143), ["Calculating Quantiles (and Quartiles) of a Dataset"](#recipe-id143).
Plotting a Density Function {#recipe-id166}
---------------------------
### Problem {-#problem-id166}
You want to plot the density function of a probability distribution.
### Solution {-#solution-id166}
Define a vector `x` over the domain. Apply the distribution’s density
function to `x` and then plot the result. If `x` is a vector of points over the domain we care about plotting, we then calculate the density using one of the `d_____` density functions, like `dlnorm` for lognormal or `dnorm` for normal.
```{r, eval=FALSE}
dens <- data.frame(x = x,
y = d_____(x))
ggplot(dens, aes(x, y)) + geom_line()
```
Here is a specific example that plots the
standard normal distribution for the interval –3 to +3:
```{r std-norm, fig.cap='Smooth density function'}
library(ggplot2)
x <- seq(-3, +3, 0.1)
dens <- data.frame(x = x, y = dnorm(x))
ggplot(dens, aes(x, y)) + geom_line()
```
Figure \@ref(fig:std-norm) shows the smooth density function.
### Discussion {-#discussion-id166}
All the built-in probability distributions include a density function.
For a particular density, the function name is “d” prepended to the
density name. The density function for the normal distribution is
`dnorm`, the density for the gamma distribution is `dgamma`, and so
forth.
If the first argument of the density function is a vector, then the
function calculates the density at each point and returns the vector of
densities.
The following code creates a 2 × 2 plot of four densities:
``` {r densities, fig.cap='Multiple density plots'}
x <- seq(from = 0, to = 6, length.out = 100) # Define the density domains
ylim <- c(0, 0.6)
# Make a data.frame with densities of several distributions
df <- rbind(
data.frame(x = x, dist_name = "Uniform" , y = dunif(x, min = 2, max = 4)),
data.frame(x = x, dist_name = "Normal" , y = dnorm(x, mean = 3, sd = 1)),
data.frame(x = x, dist_name = "Exponential", y = dexp(x, rate = 1 / 2)),
data.frame(x = x, dist_name = "Gamma" , y = dgamma(x, shape = 2, rate = 1)) )
# Make a line plot like before, but use facet_wrap to create the grid
ggplot(data = df, aes(x = x, y = y)) +
geom_line() +
facet_wrap(~dist_name) # facet and wrap by the variable dist_name
```
Figure \@ref(fig:densities) shows four density plots. However, a raw density plot is rarely useful or interesting by itself, and we often shade a region of interest.
```{r densityshaded, fig.cap='Standard normal with shading', echo=FALSE}
library(ggplot2)
x <- seq(from = -3, to = 3, length.out = 100)
df <- data.frame(x = x, y = dnorm(x, mean = 0, sd = 1))
# calculate quantiles
q75 <- quantile(df$x, .75)
q95 <- quantile(df$x, .95)
ggplot(df, aes(x, y)) +
geom_line() +
geom_ribbon(
data = subset(df, x > q75 & x < q95),
aes(ymax = y),
ymin = 0,
fill = "blue",
colour = NA,
alpha = 0.5
) +
labs(
title = "Standard Normal Distribution",
y = "Density",
x = "Quantile"
)
```
Figure \@ref(fig:densityshaded) is a normal distribution with shading from the 75^th^ percentile to the 95^th^ percentile.
We create the plot by first plotting the density and then creating a shaded
region with the `geom_ribbon` function from `ggplot2`.
First, we create some data and draw the density curve like the one shown in Figure \@ref(fig:normalden1)
``` {r normalden1, fig.cap='Density plot'}
x <- seq(from = -3, to = 3, length.out = 100)
df <- data.frame(x = x, y = dnorm(x, mean = 0, sd = 1))
p <- ggplot(df, aes(x, y)) +
geom_line() +
labs(
title = "Standard Normal Distribution",
y = "Density",
x = "Quantile"
)
p
```
Next, we define the region of interest by calculating the `x` value for the quantiles we're interested in. Finally, we use a `geom_ribbon` to add a subset of our original data as a colored region. The resulting plot is shown in Figure \@ref(fig:normalden2).
``` {r normalden2, fig.cap='Normal density with shading'}
q75 <- quantile(df$x, .75)
q95 <- quantile(df$x, .95)
p +
geom_ribbon(
data = subset(df, x > q75 & x < q95),
aes(ymax = y),
ymin = 0,
fill = "blue",
colour = NA,
alpha = 0.5
)
```