edits to ch11

SusannaLange · Aug 15, 2023 · 7608ee0 · 7608ee0
1 parent 4d88d06
commit 7608ee0
Showing 13 changed files with 1,050 additions and 1,019 deletions.
diff --git a/textbook/11/1/Rules_Definitions.ipynb → ...11/1/Probability_1_RulesDefinitions.ipynb b/textbook/11/1/Rules_Definitions.ipynb → ...11/1/Probability_1_RulesDefinitions.ipynb
@@ -4,9 +4,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Probability: Definitions and Rules \n",
+    "#  Probability: Definitions and Rules \n",
     "\n",
-    "The intention here is not to have a comprehensive introduction to probability, but just to provide a reminder of the basic definitions and rules. Every statistics textbook has a chapter on probability that is more complete than this section. We encourage the readers who have not encounter the concept of probability to find a good introductory chapter, and we offer a suggestion/reference at the end of this section.\n",
+    "The intention here is not to have a comprehensive introduction to probability, but just to provide a reminder of the basic definitions and rules. Every statistics textbook has a chapter on probability that is more complete than this section. We encourage the readers who have not encountered the concept of probability to find a good introductory chapter, and we offer a suggestion/reference at the end of this section.\n",
     "\n",
     "We start with some basic definitions illustrated on three examples:\n",
     "\n",
@@ -26,52 +26,60 @@
     "3. Having at least two people sharing birthdays.\n",
     "\n",
     "**Mutually exclusive events**: Events $A$ and $B$ are mutually exclusive (or disjoint) if they have no outcomes in common. Examples:\n",
-    "1. A is as above and B is rolling a 3.\n",
-    "2. A is as above and B is the event that the number of boys is between 60 and 70. \n",
-    "3. A is as above and B is the event that there is a birthday to celebrate for every day in March."
+    "1. A is as above (rolling an even number) and B is rolling a 3.\n",
+    "2. A is as above (less than half of the babies are boys) and B is the event that the number of boys is between 60 and 70. \n",
+    "3. A is as above (at least two people share birthdays) and B is the event that there is a birthday to celebrate for every day in the month of March.\n",
+    "\n",
+    "\n",
+    "\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Complement of an event**: The complement of an event $A$ is the event that $A$ does not occur, denoted by $A^C$. For the events $A$ defined above:\n",
-    "1. $A^C$ is rolling an odd number: $A^C=\\{1,3,5\\}$ \n",
-    "2. $A^C$ is the event that more than half of the babies are boys, or the set of integers from 50 to 100.\n",
-    "3. $A^C$ is the event when there are no shared birthdays.\n",
     "\n",
-    "<img align=\"center\" src=\"./img/complement.png\" width=\"200\"/>"
+    "**Complement of an event**: The complement of an event $A$ is the event that $A$ does not occur, denoted by $A^C$. \n",
+    "\n",
+    "<img align=\"center\" src=\"./img/complement.png\" width=\"200\"/>\n",
+    "\n",
+    "For the events $A$ defined above:\n",
+    "1. $A$ is as above (rolling an even number): $A^C$ is rolling an odd number: $A^C=\\{1,3,5\\}$ \n",
+    "2. $A$ is as above (less than half of the babies are boys): $A^C$ is the event that at least half of the babies are boys, or the set of integers from 50 to 100.\n",
+    "3. $A$ is as above (at least two people share birthdays): $A^C$ is the event that there are no shared birthdays.\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Compound events**: Events built from combinations of other events.\n",
+    "\n",
+    "**Compound events**: Events built from combinations of other events; for example, union and intersection.\n",
     "\n",
     "**Union:** ($A$ or $B$) = ($A\\cup B$): set of all outcomes in $A$, or in $B$, or in both.\n",
     "\n",
-    "<img align=\"center\" src=\"./img/union.png\" width=\"200\"/>"
+    "<img align=\"center\" src=\"./img/union.png\" width=\"200\"/>\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "\n",
     "**Intersection:**  ($A$ and $B$) = ($A\\cap B$): set of all outcomes that are in $A$ and in $B$.\n",
     "\n",
-    "<img align=\"center\" src=\"./img/intersection.png\" width=\"200\"/>"
+    "<img align=\"center\" src=\"./img/intersection.png\" width=\"200\"/>\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Definition of Probability\n",
+    "### Definition of Probability\n",
     "\n",
     "Probabilities describe how likely events are and so probability models consist of:\n",
     "- A list of possible outcomes (sample space)\n",
-    "- An assignment of probabilities $P$\n",
+    "- An assignment of probabilities $P$ to each possible outcome\n",
     "\n",
     "The **frequentist interpretation of the probability** of an event $A$, $\\mbox{P}(A)$, is the long run relative frequency of the event $A$. Suppose you are interested in the probability of \"Heads\" when tossing a coin. In this frequentist interpretation, probability is given by the limit of the relative frequency of \"Heads\" when tossing the coin repeatedly. Note that while you can imagine repeating the coin toss a large number of times (and some people have done it!), there are other events where the intuition behind frequentist probabilities is not as evident. For example, what is the probability that it will rain next Sunday? This is where the **Bayesian interpretation** of probability - based on a subjective degree of belief - is more natural. In the Bayesian world, two people could have different viewpoints and assign different probabilities. \n",
     "\n",
@@ -82,42 +90,41 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Basic Probability Rules\n",
-    "- $0 \\le \\mbox{P}(A) \\le 1$, for any event $A$\n",
+    "### Basic Probability Rules\n",
     "\n",
-    "- $\\mbox{P}(S) = 1$\n",
+    "Given a sample space $S$ and events $A, B \\subseteq S$, we have:\n",
     "\n",
-    "- **Equally likely outcomes**:\n",
-    "$P(A)=\\frac{\\mbox{ Number of outcomes in A}}{\\mbox{ Total number of outcomes}}$\n",
+    "- $0 \\le \\mbox{P}(A) \\le 1$\n",
+    "\n",
+    "- $\\mbox{P}(S) = 1$\n",
     "\n",
-    "-  $\\mbox{P}(E^C) = 1 - \\mbox{P}(E)$ for any event $E$\n",
+    "-  $\\mbox{P}(A^C) = 1 - \\mbox{P}(A)$\n",
     "\n",
     "- $\\mbox{P}(A \\cup B) = \\mbox{P}(A) +\n",
     "\\mbox{P}(B) - \\mbox{P}(A \\cap B)$\n",
     "\n",
+    "- **Equally likely outcomes**:\n",
+    "$$P(A)=\\frac{\\mbox{ Number of outcomes in A}}{\\mbox{ Total number of outcomes}}$$\n",
     "\n",
-    "\n",
-    "\n",
-    "\n"
+    "The last rule applies to situations where all outcomes of an experiment are equally likely (for example, rolling a fair die).\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Conditional Probability\n",
-    "\n",
-    "If $\\mbox{P}(A) \\ne 0$, the conditional probability of event $B$\n",
-    "given $A$ has occurred, denoted by $\\mbox{P}(B|A)$, is defined by,\n",
-    "$ \\mbox{P}(B|A) = \\frac{\\mbox{P}(A \\mbox{ and } B)}{\\mbox{P}(A)}$\n",
+    "### Conditional Probability\n",
+    "If $\\mbox{P}(B) \\ne 0$, the conditional probability of event $A$\n",
+    "given $B$ has occurred, denoted by $\\mbox{P}(A|B)$, is defined by,\n",
+    "$$ \\mbox{P}(A|B) = \\frac{\\mbox{P}(A \\mbox{ and } B)}{\\mbox{P}(B)}$$\n",
     "\n",
     "<img align=\"center\" src=\"./img/conditionalprobability.png\" width=\"600\"/>\n",
     "\n",
     "Example:\n",
     "- Select one subject at random in the US;\n",
-    "- B is the event that the subject read a book last week;\n",
-    "- A is the event that the subject is a college student;\n",
-    "- Consider P(B|A) versus P(B): the fraction of college students who read a book last week is likely different than the fraction of US population who did that.\n",
+    "- A is the event that the subject read a book last week;\n",
+    "- B is the event that the subject is a college student;\n",
+    "- Consider P(A|B) versus P(A): the fraction of college students who read a book last week is likely different from the fraction of the US population who did so.\n",
     "\n",
     "**Multiplication rule**: $\\mbox{P}(A \\mbox{ and } B) = \\mbox{P}(A|B) \\mbox{P}(B)$. Note that this follows directly from the definition of conditional probability."
    ]
@@ -126,21 +133,19 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Independence\n",
+    "### Independence\n",
     "\n",
     "Events $A$ and $B$ are called independent if $\\mbox{P}(A|B) =\n",
     "\\mbox{P}(A)$ (or equivalently, $\\mbox{P}(B|A) = \\mbox{P}(B)$)\n",
     "\n",
     "Equivalent condition for **independence**: \n",
-    "$\\mbox{P}(A \\mbox{ and } B) = \\mbox{P}(A) \\mbox{P}(B)$\n",
+    "$$\\mbox{P}(A \\mbox{ and } B) = \\mbox{P}(A) \\mbox{P}(B)$$\n",
     "\n",
     "### The Bayes Theorem\n",
     "\n",
     "The following property follows directly from the definition of conditional probability and the multiplication rule:\n",
     "\n",
-    "\\begin{eqnarray}\n",
-    "\\mbox{P}(A|B) & = & \\frac{\\mbox{P}(B|A) \\mbox{P}(A)}{\\mbox{P}(B)} \\nonumber\n",
-    "\\end{eqnarray}\n",
+    "$$\\mbox{P}(A|B)  = \\frac{\\mbox{P}(B|A) \\mbox{P}(A)}{\\mbox{P}(B)}$$\n",
     "\n",
     "This is one of the most important rules in statistics and data science because it describes statistical learning, and provides a way to update a belief (probability) given additional evidence (data)."
    ]
@@ -149,33 +154,23 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## The solution to the birthday problem\n",
+    "### The solution to the birthday problem\n",
     "\n",
     "We will use the **equally likely outcomes** formula from the Basic Probability Rules above. Note that, for $n$ random subjects, the total number of outcomes (number of possible combinations of birthdays) is \n",
-    "\n",
     "$$365^n.$$\n",
     "\n",
     "The number of outcomes that lead to a set of distinct birthdays is\n",
-    "\n",
     "$$365\\times364\\times ...\\times (365-n+1)$$\n",
-    "\n",
     "and the intuition comes from the way we can count the total number of distinct birthdays as follows:\n",
     "- suppose you look at people sequentially;\n",
-    "- first person can have any of the 365 birthdays without leading to mathched birthdays;\n",
+    "- the first person can have any of the 365 birthdays without leading to matched birthdays;\n",
     "- the second can have any of the birthdays except that of the first person: so 364 possibilities;\n",
     "- the $n$-th person can have any of the birthdays except the $(n-1)$ different birthdays of the other people: so $(365-n+1)$ possibilities.\n",
     "\n",
-    "So the probability of having $n$ distinct birtdays is:\n",
-    "\n",
-    "$$\n",
-    "\\frac{365\\times364\\times ...\\times (365-n+1)}{365^n}\n",
-    "$$\n",
-    "\n",
+    "So the probability of having $n$ distinct birthdays is:\n",
+    "$$\\frac{365\\times364\\times ...\\times (365-n+1)}{365^n}$$\n",
     "The complement of this event is the event of interest (at least two people share birthdays) and so the probability of interest is:\n",
-    "\n",
-    "$$\n",
-    "P_n ~=~ 1-\\frac{365\\times364\\times ...\\times (365-n+1)}{365^n}\n",
-    "$$\n",
+    "$$P_n ~=~ 1-\\frac{365\\times364\\times ...\\times (365-n+1)}{365^n}$$\n",
     "\n",
     "**Reference.**\n",
     "\n",
@@ -185,7 +180,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
@@ -199,9 +194,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.8.8"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 4
+ "nbformat_minor": 2
 }
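Review note on the closed-form birthday result in this notebook: the formula $P_n = 1-\frac{365\times364\times ...\times (365-n+1)}{365^n}$ can be checked numerically with a few lines of Python. This is only a minimal sketch, not code from the notebook; the function name `prob_shared_birthday` and the `days` parameter are hypothetical, and it makes the same assumptions as the derivation (365 equally likely, independent birthdays).

```python
from fractions import Fraction


def prob_shared_birthday(n, days=365):
    """Exact probability that at least two of n people share a birthday.

    Hypothetical helper (not part of the notebook). Assumes `days` equally
    likely birthdays and independence between people.
    """
    p_distinct = Fraction(1)
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_distinct *= Fraction(days - k, days)
    # Complement rule: P(A) = 1 - P(A^C)
    return 1 - p_distinct


# For n = 23 this agrees with the value 0.5073 quoted in the chapter.
print(round(float(prob_shared_birthday(23)), 4))
```

Exact fractions are used here only to avoid rounding questions in the product; plain floats would likely be adequate for group sizes of this order.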
diff --git a/textbook/11/1/img/complement.png b/textbook/11/1/img/complement.png
diff --git a/textbook/11/1/img/conditionalprobability.png b/textbook/11/1/img/conditionalprobability.png
diff --git a/textbook/11/1/img/intersection.png b/textbook/11/1/img/intersection.png
diff --git a/textbook/11/1/img/union.png b/textbook/11/1/img/union.png
diff --git a/textbook/11/2/Simulation_Solution.ipynb → .../2/Probability_2_SimulationSolution.ipynb b/textbook/11/2/Simulation_Solution.ipynb → .../2/Probability_2_SimulationSolution.ipynb
@@ -10,16 +10,20 @@
    },
    "outputs": [],
    "source": [
-    "import numpy as np"
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "%matplotlib inline\n",
+    "import matplotlib.pyplot as plt"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# A simulation-based solution\n",
+    "#  A simulation-based solution\n",
     "\n",
-    "Simulations are used to imitate real-world scenarios. They can have many roles, ranging from obtaining insight into a physical system to testing and training of models and algorithms. In this chapter, we will use simulations to estimate probabilities. We will explore other applications later, such as gaining insight into probability distributions.\n",
+    "We have introduced the concept of simulation in Section 5.3 (we simulated the output of rolling a six-sided die). Simulations are used to imitate real-world scenarios. They can have many roles, ranging from obtaining insight into a physical system to testing and training of models and algorithms. In this chapter, we will use simulations to estimate probabilities. We will explore other applications later, such as gaining insight into probability distributions.\n",
     "\n",
     "Here we use the computer to run/execute our simulations and we will develop code that will allow us to do that. There are several steps in designing and executing a simulation:\n",
     "- Conceptualize what to simulate;\n",
@@ -31,7 +35,7 @@
     "- Simulations will give us only an estimate/approximation of the probability we are interested in; the more repetitions, the better the approximation.\n",
     "- The number of repetitions is important and strategies for selecting them will be discussed in more detail later in the chapter.\n",
     "\n",
-    "## Conceptualize and simulate one instance\n",
+    "### Conceptualize and simulate one instance\n",
     "\n",
     "In the birthday problem described at the beginning of this chapter (30 people at a party), the only information that is needed for deciding on matching birthdays is the set of birthdates. The simulation we will construct will focus on that - we just need a function that generates 30 random birthdays. It turns out we have already seen a function in `numpy` that can do that: `random.choice`.\n",
     "\n",
@@ -46,9 +50,9 @@
     {
      "data": {
       "text/plain": [
-       "array([ 48, 240, 100, 295, 177, 359, 107,  96, 255, 100, 126, 217, 271,\n",
-       "       224, 209, 172,  71, 361, 319,  73, 346,  73, 208, 166, 305,  75,\n",
-       "       197, 114, 126, 178])"
+       "array([353, 265, 326, 356, 124, 329, 122, 304, 144, 176, 268,  93, 117,\n",
+       "       178, 215, 112,  76, 120, 231, 139, 142, 227, 110, 295, 159, 210,\n",
+       "        77,   6, 159,  42])"
       ]
      },
      "execution_count": 2,
@@ -78,9 +82,9 @@
     {
      "data": {
       "text/plain": [
-       "array([ 48,  71,  73,  73,  75,  96, 100, 100, 107, 114, 126, 126, 166,\n",
-       "       172, 177, 178, 197, 208, 209, 217, 224, 240, 255, 271, 295, 305,\n",
-       "       319, 346, 359, 361])"
+       "array([  6,  42,  76,  77,  93, 110, 112, 117, 120, 122, 124, 139, 142,\n",
+       "       144, 159, 159, 176, 178, 210, 215, 227, 231, 265, 268, 295, 304,\n",
+       "       326, 329, 353, 356])"
       ]
      },
      "execution_count": 3,
@@ -176,7 +180,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Repeated simulations and summary\n",
+    "### Repeated simulations and summary\n",
     "\n",
     "We now create a function that will simulate **nrep** repetitions of the \n",
     "birthday setting for **n** subjects. The function returns an array that has **nrep** entries, each showing the count (frequency) of the most frequent birthday in one simulation."
@@ -207,7 +211,7 @@
     {
      "data": {
       "text/plain": [
-       "array([1., 2., 2., 2., 2., 2., 1., 1., 1., 1.])"
+       "array([1., 2., 1., 1., 1., 2., 1., 2., 2., 2.])"
       ]
      },
      "execution_count": 8,
@@ -227,11 +231,9 @@
     "\n",
     "We are ready now to estimate the probability that at least two people share a birthday in a group of n random subjects. In a similar fashion to the coin toss example where the probability of heads was given by the long run frequency (number of heads / number of tosses), the estimated probability is \n",
     "\n",
-    "$$\n",
-    "\\frac{\\mbox{number of repetitions with shared birthdays}}{\\mbox{nrep}}\n",
-    "$$\n",
+    "$$\\frac{\\mbox{number of repetitions with shared birthdays}}{\\mbox{nrep}}$$\n",
     "\n",
-    "As mentioned above, the number of repetitions affects both accuracy (better accuracy for more repetitions) and computational time. Section ?? provides more details on this issue. In the cell code below we use 1000 repetitions."
+    "As mentioned above, the number of repetitions affects both accuracy (better accuracy for more repetitions) and computational time. Chapter 12 provides more details on this issue. In the code cell below we use 1000 repetitions."
    ]
   },
   {
@@ -242,7 +244,7 @@
     {
      "data": {
       "text/plain": [
-       "0.488"
+       "0.486"
       ]
      },
      "execution_count": 9,
@@ -260,9 +262,36 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Note that the result we obtain is close but not equal to the probability we calculated at the beginning of this chapter (0.5073). Increasing the number of repetitions will lead to an estimate that is closer to the exact probability.\n",
-    "\n",
-    "## On the assumptions used in simulations\n",
+    "Note that the result we obtain is close but not equal to the probability we calculated at the beginning of this chapter (0.5073). Increasing the number of repetitions tends to give an estimate that is closer to the exact probability. Also, rerunning the above cell will lead to a (slightly) different result:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.514"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "n=23\n",
+    "nrep=1000\n",
+    "sum(birthday_sim(n,nrep)>=2)/nrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### On the assumptions used in simulations\n",
     "\n",
     "It is important to consider whether the assumptions made in the simulations are the same as the ones we made in the mathematical derivation. They are, because: (i) the `birthdays` array has 365 elements (so we implicitly assume the year has 365 days); and (ii) birthdates are independent and equally likely (because the `numpy` `random.choice` function is designed to sample elements this way when only sample size is provided). \n",
     "\n",
@@ -272,7 +301,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
@@ -286,9 +315,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.8.8"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 4
+ "nbformat_minor": 2
 }
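Review note on the simulation notebook: the body of `birthday_sim` is not visible in this diff, so the following is a hedged sketch of what such a function might look like, based only on the description in the markdown cells (an array of **nrep** entries, each holding the count of the most frequent birthday in one simulated group of **n** people). The name `birthday_sim_sketch` and the `rng` parameter are hypothetical, and the sketch uses NumPy's `Generator.choice` rather than the `random.choice` call mentioned in the text.

```python
import numpy as np


def birthday_sim_sketch(n, nrep, rng=None):
    """Simulate nrep groups of n people; return the max birthday count per group.

    Hypothetical stand-in for the notebook's birthday_sim. Assumes 365
    equally likely, independent birthdays.
    """
    rng = np.random.default_rng() if rng is None else rng
    birthdays = np.arange(1, 366)  # days 1..365
    max_counts = np.empty(nrep)
    for i in range(nrep):
        # Sampling is with replacement and uniform by default.
        sample = rng.choice(birthdays, size=n)
        _, counts = np.unique(sample, return_counts=True)
        max_counts[i] = counts.max()
    return max_counts


rng = np.random.default_rng(0)  # seeded so that reruns are reproducible
result = birthday_sim_sketch(23, 1000, rng)
estimate = np.mean(result >= 2)  # fraction of repetitions with a shared birthday
print(estimate)
```

Passing a seeded generator is one way to make the estimate reproducible across reruns, which may be useful when the text contrasts two runs of the same cell; without a seed, the estimate varies from run to run as the notebook describes.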