diff --git a/README.md b/README.md index 22331df..b428c75 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ Currently, there are three branches: * `bookdown`: used to build the Bookdown version of the book -* `leanpub`: sued to build the Leanpub version of the book +* `leanpub`: used to build the Leanpub version of the book ### Before You Start diff --git a/manuscript/apply.Rmd b/manuscript/apply.Rmd index c4f8878..70e3206 100644 --- a/manuscript/apply.Rmd +++ b/manuscript/apply.Rmd @@ -86,7 +86,7 @@ x <- 1:4 lapply(x, runif, min = 0, max = 10) ``` -So now, instead of the random numbers being between 0 and 1 (the default), the are all between 0 and 10. +So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10. The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These are functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace. @@ -165,7 +165,7 @@ where - `f` is a factor (or coerced to one) or a list of factors - `drop` indicates whether empty factors levels should be dropped -The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying tha function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. +The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. 
The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to "generate levels" in a factor variable. @@ -413,13 +413,13 @@ With `mapply()`, instead we can do This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument. -Here's another example for simulating randon Normal variables. +Here's another example for simulating random Normal variables. ```{r} noise <- function(n, mean, sd) { rnorm(n, mean, sd) } -## Simulate 5 randon numbers +## Simulate 5 random numbers noise(5, 1, 2) ## This only simulates 1 set of numbers, not 5 @@ -443,7 +443,7 @@ list(noise(1, 1, 2), noise(2, 2, 2), ## Vectorizing a Function -The `mapply()` function can be use to automatically "vectorize" a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions. +The `mapply()` function can be used to automatically "vectorize" a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions. Here's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\sum_{i=1}^n(x_i-\mu)^2/\sigma^2$. 
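For reference, the sum-of-squares/`mapply()` passage edited above can be sketched as follows; the data `x` are simulated stand-ins, and `sumsq()` follows the formula just given:

```r
sumsq <- function(mu, sigma, x) {
    ## sum of squares for a single (mu, sigma) pair
    sum(((x - mu) / sigma)^2)
}
set.seed(1)
x <- rnorm(100)
## sumsq() expects single values of mu and sigma; mapply()
## applies it over the pairs (1, 1), (2, 2), ..., (5, 5)
mapply(sumsq, 1:5, 1:5, MoreArgs = list(x = x))
```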
diff --git a/manuscript/apply.md b/manuscript/apply.md index 60d71cf..331c964 100644 --- a/manuscript/apply.md +++ b/manuscript/apply.md @@ -140,7 +140,7 @@ Here, the `min = 0` and `max = 10` arguments are passed down to `runif()` every [1] 0.9916910 1.1890256 0.5043966 9.2925392 ~~~~~~~~ -So now, instead of the random numbers being between 0 and 1 (the default), the are all between 0 and 10. +So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10. The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These are functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace. @@ -265,7 +265,7 @@ where - `f` is a factor (or coerced to one) or a list of factors - `drop` indicates whether empty factors levels should be dropped -The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying tha function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. +The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. Here we simulate some data and split it according to a factor variable. 
Note that we use the `gl()` function to "generate levels" in a factor variable. @@ -732,7 +732,7 @@ With `mapply()`, instead we can do This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument. -Here's another example for simulating randon Normal variables. +Here's another example for simulating random Normal variables. {line-numbers=off} @@ -740,7 +740,7 @@ Here's another example for simulating randon Normal variables. > noise <- function(n, mean, sd) { + rnorm(n, mean, sd) + } -> ## Simulate 5 randon numbers +> ## Simulate 5 random numbers > noise(5, 1, 2) [1] -0.5196913 3.2979182 -0.6849525 1.7828267 2.7827545 > @@ -798,7 +798,7 @@ The above call to `mapply()` is the same as ## Vectorizing a Function -The `mapply()` function can be use to automatically "vectorize" a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions. +The `mapply()` function can be used to automatically "vectorize" a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions. Here's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is {$$}\sum_{i=1}^n(x_i-\mu)^2/\sigma^2{/$$}. diff --git a/manuscript/control.Rmd b/manuscript/control.Rmd index b9f98ef..fa6805a 100644 --- a/manuscript/control.Rmd +++ b/manuscript/control.Rmd @@ -25,10 +25,10 @@ Commonly used control structures are - `break`: break the execution of a loop -- `next`: skip an interation of a loop +- `next`: skip an iteration of a loop Most control structures are not used in interactive sessions, but -rather when writing functions or longer expresisons. 
However, these +rather when writing functions or longer expressions. However, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions. @@ -259,7 +259,7 @@ not commonly used in statistical or data analysis applications but they do have their uses. The only way to exit a `repeat` loop is to call `break`. -One possible paradigm might be in an iterative algorith where you may +One possible paradigm might be in an iterative algorithm where you may be searching for a solution and you don't want to stop until you're close enough to the solution. In this kind of situation, you often don't know in advance how many iterations it's going to take to get diff --git a/manuscript/control.md b/manuscript/control.md index 17609a9..aa240db 100644 --- a/manuscript/control.md +++ b/manuscript/control.md @@ -23,10 +23,10 @@ Commonly used control structures are - `break`: break the execution of a loop -- `next`: skip an interation of a loop +- `next`: skip an iteration of a loop Most control structures are not used in interactive sessions, but -rather when writing functions or longer expresisons. However, these +rather when writing functions or longer expressions. However, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions. @@ -317,7 +317,7 @@ not commonly used in statistical or data analysis applications but they do have their uses. The only way to exit a `repeat` loop is to call `break`. -One possible paradigm might be in an iterative algorith where you may +One possible paradigm might be in an iterative algorithm where you may be searching for a solution and you don't want to stop until you're close enough to the solution. 
In this kind of situation, you often don't know in advance how many iterations it's going to take to get diff --git a/manuscript/debugging.Rmd b/manuscript/debugging.Rmd index 53333e3..ab42898 100644 --- a/manuscript/debugging.Rmd +++ b/manuscript/debugging.Rmd @@ -269,7 +269,7 @@ Enter a frame number, or 0 to exit Selection: ``` -The `recover()` function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. +The `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. ## Summary diff --git a/manuscript/debugging.md b/manuscript/debugging.md index 19b2345..5c96b90 100644 --- a/manuscript/debugging.md +++ b/manuscript/debugging.md @@ -305,7 +305,7 @@ Enter a frame number, or 0 to exit Selection: ~~~~~~~~ -The `recover()` function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. +The `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. 
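The `repeat`/`break` pattern discussed above can be sketched with a toy update step; here the fixed point of `cos()` stands in for a real iterative estimation algorithm:

```r
x0 <- 1
tol <- 1e-8
repeat {
    x1 <- cos(x0)            # toy update step; a real algorithm goes here
    if (abs(x1 - x0) < tol)
        break                # the only way to exit a repeat loop
    x0 <- x1
}
x1                           # converges to the fixed point of cos()
```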
## Summary diff --git a/manuscript/dplyr.Rmd b/manuscript/dplyr.Rmd index 14898b7..7b94053 100644 --- a/manuscript/dplyr.Rmd +++ b/manuscript/dplyr.Rmd @@ -221,7 +221,7 @@ Here you can see the names of the first five variables in the `chicago` data fra head(chicago[, 1:5], 3) ``` -The `dptp` column is supposed to represent the dew point temperature adn the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably be renamed to something more sensible. +The `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and should probably be renamed to something more sensible. ```{r} chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2) diff --git a/manuscript/dplyr.md b/manuscript/dplyr.md index 3375593..dfe22d1 100644 --- a/manuscript/dplyr.md +++ b/manuscript/dplyr.md @@ -339,7 +339,7 @@ Here you can see the names of the first five variables in the `chicago` data fra 3 chic 35 29.4 2005-12-29 7.45000 ~~~~~~~~ -The `dptp` column is supposed to represent the dew point temperature adn the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably be renamed to something more sensible. +The `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and should probably be renamed to something more sensible. {line-numbers=off} diff --git a/manuscript/functions.Rmd b/manuscript/functions.Rmd index 80da217..2e830ff 100644 --- a/manuscript/functions.Rmd +++ b/manuscript/functions.Rmd @@ -63,7 +63,7 @@ f() The last aspect of a basic function is the *function arguments*. These are the options that you can specify to the user that the user may -explicity set. For this basic function, we can add an argument that +explicitly set.
For this basic function, we can add an argument that determines how many times "Hello, world!" is printed to the console. ```{r} @@ -98,7 +98,7 @@ print(meaningoflife) In the above function, we didn't have to indicate anything special in order for the function to return the number of characters. In R, the return value of a function is always the very last expression that is evaluated. Because the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function. -Note that there is a `return()` function that can be used to return an explicity value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). +Note that there is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). Finally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error. diff --git a/manuscript/functions.md b/manuscript/functions.md index 9c12963..e639c2c 100644 --- a/manuscript/functions.md +++ b/manuscript/functions.md @@ -68,7 +68,7 @@ Hello, world! The last aspect of a basic function is the *function arguments*. These are the options that you can specify to the user that the user may -explicity set. For this basic function, we can add an argument that +explicitly set. For this basic function, we can add an argument that determines how many times "Hello, world!" is printed to the console. @@ -114,7 +114,7 @@ Hello, world! In the above function, we didn't have to indicate anything special in order for the function to return the number of characters. In R, the return value of a function is always the very last expression that is evaluated. Because the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function. 
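The return-value rule just described (the last evaluated expression becomes the return value) can be sketched as follows; this hello function is a minimal reconstruction along the lines of the text, with `num` given a default for convenience:

```r
f <- function(num = 1) {
    hello <- "Hello, world!\n"
    for (i in seq_len(num)) {
        cat(hello)
    }
    chars <- nchar(hello) * num
    chars        # last expression evaluated, so this is the return value
}
f(2)             # prints the message twice, returns 28
```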
-Note that there is a `return()` function that can be used to return an explicity value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). +Note that there is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). Finally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error. diff --git a/manuscript/overview.Rmd b/manuscript/overview.Rmd index cf15008..57be3e0 100644 --- a/manuscript/overview.Rmd +++ b/manuscript/overview.Rmd @@ -281,7 +281,7 @@ with Douglas Bates and Brian Ripley in June 2004: > **Douglas Bates**: There are several chains of pizzerias in the U.S. that provide for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it's only a matter of time before you will have a pizza-ordering function available. -> **Brian D. Ripley**: Indeed, the GraphApp toolkit (used for the RGui interface under R for Windows, but Guido forgot to include it) provides one (for use in Sydney, Australia, we presume as that is where the GraphApp author hails from). Alternatively, a Padovian has no need of ordering pizzas with both home and neighbourhood restaurants .... +> **Brian D. Ripley**: Indeed, the GraphApp toolkit (used for the RGui interface under R for Windows, but Guido forgot to include it) provides one (for use in Sydney, Australia, we presume as that is where the GraphApp author hails from). Alternatively, a Padovian has no need of ordering pizzas with both home and neighborhood restaurants .... 
At this point in time, I think it would be fairly straightforward to build a pizza ordering R package using something like the `RCurl` or diff --git a/manuscript/overview.md b/manuscript/overview.md index cf15008..57be3e0 100644 --- a/manuscript/overview.md +++ b/manuscript/overview.md @@ -281,7 +281,7 @@ with Douglas Bates and Brian Ripley in June 2004: > **Douglas Bates**: There are several chains of pizzerias in the U.S. that provide for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it's only a matter of time before you will have a pizza-ordering function available. -> **Brian D. Ripley**: Indeed, the GraphApp toolkit (used for the RGui interface under R for Windows, but Guido forgot to include it) provides one (for use in Sydney, Australia, we presume as that is where the GraphApp author hails from). Alternatively, a Padovian has no need of ordering pizzas with both home and neighbourhood restaurants .... +> **Brian D. Ripley**: Indeed, the GraphApp toolkit (used for the RGui interface under R for Windows, but Guido forgot to include it) provides one (for use in Sydney, Australia, we presume as that is where the GraphApp author hails from). Alternatively, a Padovian has no need of ordering pizzas with both home and neighborhood restaurants .... At this point in time, I think it would be fairly straightforward to build a pizza ordering R package using something like the `RCurl` or diff --git a/manuscript/profiler.Rmd b/manuscript/profiler.Rmd index 0db0a57..3688df1 100644 --- a/manuscript/profiler.Rmd +++ b/manuscript/profiler.Rmd @@ -9,11 +9,11 @@ knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE) ``` -R comes with a profiler to help you optimize your code and improve its performance. In generall, it's usually a bad idea to focus on optimizing your code at the very beginning of development. 
Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. +R comes with a profiler to help you optimize your code and improve its performance. In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly should optimize the parts of your code that are running slowly, but how do we know what parts those are? -This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program. +This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program. Sometimes profiling becomes necessary as a project grows and layers of code are placed on top of each other. Often you might write some code that runs fine once. But then later, you might put that same code in a big loop that runs 1,000 times. Now the original code that took 1 second to run is taking 1,000 seconds to run! Getting that little piece of original code to run faster will help the entire loop. @@ -43,9 +43,9 @@ They `system.time()` function takes an arbitrary R expression as input (can be w - *elapsed time*: "wall clock" time, the amount of time that passes for *you* as you're sitting there -Usually, the user time and elapsed time are relatively close, for straight computing tasks. 
But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involes some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). +Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involves some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). -The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallell` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. +The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallel` package can make the elapsed time smaller than the user time. 
When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. Here's an example of where the elapsed time is greater than the user time. diff --git a/manuscript/profiler.md b/manuscript/profiler.md index 1414e39..a508338 100644 --- a/manuscript/profiler.md +++ b/manuscript/profiler.md @@ -7,11 +7,11 @@ -R comes with a profiler to help you optimize your code and improve its performance. In generall, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. +R comes with a profiler to help you optimize your code and improve its performance. In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly should optimize the parts of your code that are running slowly, but how do we know what parts those are? -This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program. +This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program. 
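A minimal illustration of the user/elapsed distinction described above, using `Sys.sleep()` (the CPU is essentially idle while sleeping, so elapsed time dominates):

```r
st <- system.time(Sys.sleep(2))
st["elapsed"]     # about 2 seconds of wall clock time
st["user.self"]   # near 0; the CPU did almost no work
```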
Sometimes profiling becomes necessary as a project grows and layers of code are placed on top of each other. Often you might write some code that runs fine once. But then later, you might put that same code in a big loop that runs 1,000 times. Now the original code that took 1 second to run is taking 1,000 seconds to run! Getting that little piece of original code to run faster will help the entire loop. @@ -41,9 +41,9 @@ They `system.time()` function takes an arbitrary R expression as input (can be w - *elapsed time*: "wall clock" time, the amount of time that passes for *you* as you're sitting there -Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involes some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). +Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involves some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). -The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallell` package can make the elapsed time smaller than the user time. 
When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. +The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallel` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. Here's an example of where the elapsed time is greater than the user time. diff --git a/manuscript/readwritedata.Rmd b/manuscript/readwritedata.Rmd index 69a4f23..a980d0f 100644 --- a/manuscript/readwritedata.Rmd +++ b/manuscript/readwritedata.Rmd @@ -172,7 +172,7 @@ Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is usually an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in -the worst case. So make sure to do a rough calculation of memeory +the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. You'll thank me later. 
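The rough memory calculation recommended above takes one line; for example, for a hypothetical table of 1,500,000 rows and 120 numeric columns at 8 bytes per value:

```r
rows <- 1500000
cols <- 120
bytes <- rows * cols * 8   # numeric values take 8 bytes each
bytes / 2^30               # roughly 1.34 GB of RAM, before read-time overhead
```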
diff --git a/manuscript/readwritedata.md b/manuscript/readwritedata.md index 9e9fb17..d855059 100644 --- a/manuscript/readwritedata.md +++ b/manuscript/readwritedata.md @@ -174,7 +174,7 @@ Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is usually an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in -the worst case. So make sure to do a rough calculation of memeory +the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. You'll thank me later. diff --git a/manuscript/regex.Rmd b/manuscript/regex.Rmd index a37733e..cb8bf67 100644 --- a/manuscript/regex.Rmd +++ b/manuscript/regex.Rmd @@ -74,14 +74,14 @@ g <- grep("Cause: shooting", homicides) length(g) ``` -Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a captial "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. +Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a capital "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. ```{r} g <- grep("Cause: [Ss]hooting", homicides) length(g) ``` -One thing you have to be careful of when processing text data is not not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. +One thing you have to be careful of when processing text data is not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. ```{r} g <- grep("[Ss]hooting", homicides) @@ -90,7 +90,7 @@ length(g) Notice that we see to pick up 2 extra homicides this way. 
We can figure out which ones they are by comparing the results of the two expressions. -First we can get the indices for the first expresssion match. +First we can get the indices for the first expression match. ```{r} i <- grep("[cC]ause: [Ss]hooting", homicides) diff --git a/manuscript/regex.md b/manuscript/regex.md index 4088faa..b1bb3f6 100644 --- a/manuscript/regex.md +++ b/manuscript/regex.md @@ -85,7 +85,7 @@ Another possible way to do this is to `grep()` on the cause of death field, whic [1] 228 ~~~~~~~~ -Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a captial "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. +Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a capital "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. {line-numbers=off} @@ -95,7 +95,7 @@ Notice that we seem to be undercounting again. This is because for some of the e [1] 1263 ~~~~~~~~ -One thing you have to be careful of when processing text data is not not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. +One thing you have to be careful of when processing text data is not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. {line-numbers=off} @@ -107,7 +107,7 @@ One thing you have to be careful of when processing text data is not not `grep() Notice that we see to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions. -First we can get the indices for the first expresssion match. +First we can get the indices for the first expression match. 
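The comparison described above can be sketched on a small stand-in vector (the book's `homicides` data are read from a file, so the entries below are hypothetical):

```r
## stand-in entries mimicking the structure of the real data
homicides <- c("39 yrs old ... Cause: shooting",
               "22 yrs old ... Cause: Shooting",
               "victim died later; witnesses heard shooting nearby",
               "31 yrs old ... Cause: stabbing")
i <- grep("[Cc]ause: [Ss]hooting", homicides)
j <- grep("[Ss]hooting", homicides)
setdiff(j, i)   # entries matched out of context (index 3 here)
```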
 {line-numbers=off}
diff --git a/manuscript/simulation.Rmd b/manuscript/simulation.Rmd
index 2477ec1..37c074c 100644
--- a/manuscript/simulation.Rmd
+++ b/manuscript/simulation.Rmd
@@ -12,14 +12,14 @@ set.seed(10)
 Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs.
-R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R
+R comes with a set of pseudo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R
 - `rnorm`: generate random Normal variates with a given mean and standard deviation
 - `dnorm`: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
 - `pnorm`: evaluate the cumulative distribution function for a Normal distribution
 - `rpois`: generate random Poisson variates with a given rate
-For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates randon numbers from that distribution. The other functions are prefixed with a
+For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates random numbers from that distribution. The other functions are prefixed with a
 - `d` for density
 - `r` for random number generation
@@ -140,7 +140,7 @@ x <- rbinom(100, 1, 0.5)
 str(x) ## 'x' is now 0s and 1s
 ```
-Then we can procede with the rest of the model as before.
+Then we can proceed with the rest of the model as before.
 ```{r Linear Model Binary}
 e <- rnorm(100, 0, 2)
diff --git a/manuscript/simulation.md b/manuscript/simulation.md
index 8fec973..4593506 100644
--- a/manuscript/simulation.md
+++ b/manuscript/simulation.md
@@ -9,14 +9,14 @@
 Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs.
-R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R
+R comes with a set of pseudo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R
 - `rnorm`: generate random Normal variates with a given mean and standard deviation
 - `dnorm`: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
 - `pnorm`: evaluate the cumulative distribution function for a Normal distribution
 - `rpois`: generate random Poisson variates with a given rate
-For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates randon numbers from that distribution. The other functions are prefixed with a
+For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates random numbers from that distribution. The other functions are prefixed with a
 - `d` for density
 - `r` for random number generation
@@ -176,7 +176,7 @@ What if we wanted to simulate a predictor variable `x` that is binary instead of
 int [1:100] 1 0 0 1 0 0 0 0 1 0 ...
 ~~~~~~~~
-Then we can procede with the rest of the model as before.
+Then we can proceed with the rest of the model as before.
 {line-numbers=off}
diff --git a/manuscript/vectorized.Rmd b/manuscript/vectorized.Rmd
index 21b6e36..8da323e 100644
--- a/manuscript/vectorized.Rmd
+++ b/manuscript/vectorized.Rmd
@@ -35,7 +35,7 @@ hands would get very tired from all the typing.
 Another operation you can do in a vectorized manner is logical comparisons. So suppose you wanted to know which elements of a vector
-were greater than 2. You could do he following.
+were greater than 2. You could do the following.
 ```{r}
 x
@@ -64,7 +64,7 @@ x / y
 ## Vectorized Matrix Operations
-Matrix operations are also vectorized, making for nicly compact
+Matrix operations are also vectorized, making for nicely compact
 notation. This way, we can do element-by-element operations on matrices without having to loop over every element.
diff --git a/manuscript/vectorized.md b/manuscript/vectorized.md
index 8c47b31..9c3f25d 100644
--- a/manuscript/vectorized.md
+++ b/manuscript/vectorized.md
@@ -39,7 +39,7 @@ hands would get very tired from all the typing.
 Another operation you can do in a vectorized manner is logical comparisons. So suppose you wanted to know which elements of a vector
-were greater than 2. You could do he following.
+were greater than 2. You could do the following.
 {line-numbers=off}
@@ -82,7 +82,7 @@ Of course, subtraction, multiplication and division are also vectorized.
 ## Vectorized Matrix Operations
-Matrix operations are also vectorized, making for nicly compact
+Matrix operations are also vectorized, making for nicely compact
 notation. This way, we can do element-by-element operations on matrices without having to loop over every element.
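
For reference, the element-by-element matrix behavior discussed in this last hunk can be sanity-checked with a toy example (illustrative only, not part of the patch; `x` and `y` are made-up matrices):

```r
## Element-wise matrix operations: no explicit loops needed
x <- matrix(1:4, 2, 2)         # 2 x 2 matrix, filled column-wise
y <- matrix(rep(10, 4), 2, 2)  # 2 x 2 matrix of 10s

x * y    ## element-wise multiplication
x / y    ## element-wise division
x %*% y  ## true matrix multiplication, for contrast
```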