Fix typos, dead links, outdated information across various chapters #8

Open
wants to merge 24 commits into
base: master
e12695a
dead link at bell-labs; use web-archive version
engineerchange Nov 5, 2021
3f71e78
Overview chapter: syntax and typos
engineerchange Nov 5, 2021
7de756b
remove unnecessary html output
engineerchange Nov 5, 2021
5574868
output md file
engineerchange Nov 5, 2021
d3868b9
gettingstarted: typos
engineerchange Nov 5, 2021
7dcfa28
nutsbolts: syntax, wording, redirected url
engineerchange Nov 5, 2021
ae2e600
readwritedata: stringsAsFactors update, syntax, calculation markdown …
engineerchange Nov 5, 2021
e7d54ee
vectorized: spelling
engineerchange Nov 5, 2021
c9de63a
dplyr: spelling, wording, syntax
engineerchange Nov 5, 2021
0360873
change output
engineerchange Nov 5, 2021
dbddd0b
control: if-else example was malformed, spelling, syntax
engineerchange Nov 5, 2021
3bb3610
functions: spelling, syntax
engineerchange Nov 5, 2021
e7dac02
scoping: spelling
engineerchange Nov 5, 2021
d931163
scoping: spelling, syntax
engineerchange Nov 6, 2021
6b26183
apply: spelling, syntax
engineerchange Nov 6, 2021
e8f5c51
regex: markdown issues, spelling, syntax
engineerchange Nov 6, 2021
24394e2
debugging: spelling, syntax
engineerchange Nov 6, 2021
a166ab6
nutsbolts: consistent spelling of 'modeling'
engineerchange Nov 6, 2021
133ed33
profiling: consistent wording, spelling, syntax
engineerchange Nov 6, 2021
4d86aff
simulation: wording, spelling, syntax
engineerchange Nov 6, 2021
65911e3
example: fix dead url, add proportion metric to narrative, spelling
engineerchange Nov 7, 2021
8685a78
example: fix prop calculation
engineerchange Nov 7, 2021
701ed4d
parallel: update AMD library/link, spelling, syntax
engineerchange Nov 7, 2021
8a9d2bf
revert to previous .md versions
engineerchange Nov 7, 2021
16 changes: 8 additions & 8 deletions manuscript/apply.Rmd
@@ -47,7 +47,7 @@ Note that the actual looping is done internally in C code for efficiency reasons

It's important to remember that `lapply()` always returns a list, regardless of the class of the input.

Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, the the names will be preserved in the output.
Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, then the names will be preserved in the output.


```{r}
@@ -86,7 +86,7 @@ x <- 1:4
lapply(x, runif, min = 0, max = 10)
```

So now, instead of the random numbers being between 0 and 1 (the default), the are all between 0 and 10.
So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.

The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.
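
Reviewer note on the context paragraph above: a minimal sketch of an anonymous function passed to `lapply()`. The list `mats` and its contents are made up for illustration and are not part of this diff.

```r
## Hypothetical input: a named list of two small matrices
mats <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

## function(m) ... has no name; it exists only for the duration of this call
first_cols <- lapply(mats, function(m) m[, 1])

## The result is a list, and the names of the input list are preserved
str(first_cols)
```

Because the function is defined inline, nothing new is bound in the workspace apart from `first_cols`.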

@@ -165,7 +165,7 @@ where
- `f` is a factor (or coerced to one) or a list of factors
- `drop` indicates whether empty factor levels should be dropped

The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying tha function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts.
The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts.

Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to "generate levels" in a factor variable.
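
Reviewer note: the split-then-apply pattern described in this hunk can be sketched as follows. The seed and the simulated values are assumptions chosen only to make the sketch reproducible.

```r
set.seed(1)                                 # arbitrary seed, for reproducibility only
x <- c(rnorm(10), runif(10), rnorm(10, 1))  # 30 simulated values
f <- gl(3, 10)                              # factor with 3 levels, 10 replicates each

groups <- split(x, f)                       # list with one numeric vector per level
means  <- sapply(groups, mean)              # apply mean() to each subset and collate

means
```

`split()` returns a list named by the factor levels, so the collated result from `sapply()` carries those names as well.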

@@ -413,13 +413,13 @@ With `mapply()`, instead we can do
This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument.
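
Reviewer note: a small sketch of the call this sentence describes, for readers who only see the hunk. The variable name `res` is mine.

```r
## Pairs up the two sequences element by element:
## rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
res <- mapply(rep, 1:4, 4:1)

## The pieces have different lengths, so the result stays a list
str(res)
```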


Here's another example for simulating randon Normal variables.
Here's another example for simulating random Normal variables.

```{r}
noise <- function(n, mean, sd) {
rnorm(n, mean, sd)
}
## Simulate 5 randon numbers
## Simulate 5 random numbers
noise(5, 1, 2)

## This only simulates 1 set of numbers, not 5
@@ -484,9 +484,9 @@ Pretty cool, right?

* The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form

* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and the collating the results and returning the collated results.
* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results.

* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere
* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere.

* The `split()` function can be used to divide an R object in to subsets determined by another variable which can subsequently be looped over using loop functions.
* The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions.

19 changes: 9 additions & 10 deletions manuscript/control.Rmd
@@ -28,7 +28,7 @@ Commonly used control structures are
- `next`: skip an iteration of a loop

Most control structures are not used in interactive sessions, but
rather when writing functions or longer expresisons. However, these
rather when writing functions or longer expressions. However, these
constructs do not have to be used in functions and it's a good idea to
become familiar with them before we delve into functions.

@@ -58,8 +58,7 @@ an `else` clause.
```r
if(<condition>) {
## do something
}
else {
} else {
## do something else
}
```
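
Reviewer note on why the `} else {` change matters: when an `if`/`else` is entered at top level (outside a function body or enclosing braces), R treats the `if` statement as complete once the closing brace is followed by a newline, so an `else` starting the next line is a syntax error. A minimal sketch using `parse()`; the helper `parses_ok` is hypothetical and exists only for this illustration.

```r
## Hypothetical helper: TRUE if the text parses as R code, FALSE on a syntax error
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

old_style <- "if (TRUE) {\n  1\n}\nelse {\n  2\n}"   # else begins its own line
new_style <- "if (TRUE) {\n  1\n} else {\n  2\n}"    # else follows the closing brace

parses_ok(old_style)
parses_ok(new_style)
```

Inside a function body the old layout does parse, because the enclosing braces tell the parser that more input is coming; the new layout is safe in both places.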
@@ -123,12 +122,12 @@ if(<condition2>) {

[Watch a video of this section](https://youtu.be/FbT1dGXCCxU)

For loops are pretty much the only looping construct that you will
`for` loops are pretty much the only looping construct that you will
need in R. While you may occasionally find a need for other types of
loops, in my experience doing data analysis, I've found very few
situations where a for loop wasn't sufficient.

In R, for loops take an interator variable and assign it successive
In R, for loops take an iterator variable and assign it successive
values from a sequence or vector. For loops are most commonly used for
iterating over the elements of an object (list, vector, etc.)

@@ -210,7 +209,7 @@ functions (discussed later).

[Watch a video of this section](https://youtu.be/VqrS1Wghq1c)

While loops begin by testing a condition. If it is true, then they
`while` loops begin by testing a condition. If it is true, then they
execute the loop body. Once the loop body is executed, the condition
is tested again, and so forth, until the condition is false, after
which the loop exits.
@@ -223,7 +222,7 @@ while(count < 10) {
}
```

While loops can potentially result in infinite loops if not written
`while` loops can potentially result in infinite loops if not written
properly. Use with care!
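
Reviewer note: one common way to guard against an accidental infinite loop is to add an iteration cap to the condition. A sketch under that assumption; `tol`, `max_iter`, and the halving step are illustrative and not taken from the chapter.

```r
x <- 1
tol <- 1e-8        # stop once x is this small
max_iter <- 100    # safety cap so the loop always terminates
iter <- 0

while (x > tol && iter < max_iter) {
  x <- x / 2       # each pass moves x closer to the stopping condition
  iter <- iter + 1
}

c(x = x, iter = iter)
```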

Sometimes there will be more than one condition in the test.
@@ -259,7 +258,7 @@ not commonly used in statistical or data analysis applications but
they do have their uses. The only way to exit a `repeat` loop is to
call `break`.

One possible paradigm might be in an iterative algorith where you may
One possible paradigm might be in an iterative algorithm where you may
be searching for a solution and you don't want to stop until you're
close enough to the solution. In this kind of situation, you often
don't know in advance how many iterations it's going to take to get
@@ -322,8 +321,8 @@ for(i in 1:100) {

## Summary

- Control structures like `if`, `while`, and `for` allow you to
control the flow of an R program
- Control structures, like `if`, `while`, and `for`, allow you to
control the flow of an R program.

- Infinite loops should generally be avoided, even if (you believe)
they are theoretically correct.
10 changes: 5 additions & 5 deletions manuscript/debugging.Rmd
@@ -128,7 +128,7 @@ You can see now that the correct messages are printed without any warning or err

## Figuring Out What's Wrong

The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important first understand what you were expecting to occur. Then you need to idenfity what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are
The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important to first understand what you were expecting to occur. Then you need to identify what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are

- What was your input? How did you call the function?
- What were you expecting? Output, messages, other results?
@@ -269,11 +269,11 @@ Enter a frame number, or 0 to exit
Selection:
```

The `recover()` function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.
The `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.

## Summary

- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal
- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation
- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions
- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal.
- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation.
- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions.
- Debugging tools are not a substitute for thinking!
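
Reviewer note: the first summary bullet (only an `error` is fatal) can be sketched like this. The function `check_root` is hypothetical and not from the chapter.

```r
check_root <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")        # error: execution halts here
  message("input accepted")                            # message: informational only
  if (x < 0) warning("negative input, using abs(x)")   # warning: execution continues
  sqrt(abs(x))
}

## A message and a warning are signalled, but a value is still returned
res <- suppressMessages(suppressWarnings(check_root(-4)))

## An error stops the function; try() captures it so the script can continue
failed <- inherits(try(check_root("a"), silent = TRUE), "try-error")
```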
20 changes: 10 additions & 10 deletions manuscript/dplyr.Rmd
@@ -40,7 +40,7 @@ Some of the key "verbs" provided by the `dplyr` package are

* `%>%`: the "pipe" operator is used to connect multiple verb actions together into a pipeline

The `dplyr` package as a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.
The `dplyr` package has a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.



@@ -52,7 +52,7 @@ All of the functions that we will discuss in this Chapter will have a few common

2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).

3. The return result of a function is a new data frame
3. The return result of a function is a new data frame.

4. Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.

@@ -84,7 +84,7 @@ You may get some warnings when the package is loaded because there are functions

## `select()`

For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my web site.
For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my website.

After unzipping the archive, you can load the data into R using the `readRDS()` function.

@@ -101,7 +101,7 @@ str(chicago)

The `select()` function can be used to select columns of a data frame that you want to focus on. Often you'll have a large data frame containing "all" of the data, but any *given* analysis might only use a subset of variables or observations. The `select()` function allows you to get the few columns you might need.

Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly.
Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could, for example, use numerical indices. But we can also use the names directly.

```{r}
names(chicago)[1:3]
@@ -197,7 +197,7 @@ and the last few rows.
tail(select(chicago, date, pm25tmean2), 3)
```

Columns can be arranged in descending order too by useing the special `desc()` operator.
Columns can be arranged in descending order too by using the special `desc()` operator.

```{r}
chicago <- arrange(chicago, desc(date))
@@ -221,7 +221,7 @@ Here you can see the names of the first five variables in the `chicago` data fra
head(chicago[, 1:5], 3)
```

The `dptp` column is supposed to represent the dew point temperature adn the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably be renamed to something more sensible.
The `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably need to be renamed to something more sensible.

```{r}
chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
@@ -365,11 +365,11 @@ Here we can see that `o3` tends to be low in the winter months and high in the s

The `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.

Once you learn the `dplyr` grammar there are a few additional benefits
Once you learn the `dplyr` grammar there are a few additional benefits:

* `dplyr` can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package
* `dplyr` can work with other data frame "backends", such as SQL databases. There is a SQL interface for relational databases via the DBI package.

* `dplyr` can be integrated with the `data.table` package for large fast tables
* `dplyr` can be integrated with the `data.table` package for large fast tables.

The `dplyr` package is handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!
The `dplyr` package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!
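
Reviewer note: a short sketch of the `group_by()` plus `summarize()` combination mentioned in this summary. It assumes `dplyr` is installed and uses R's built-in `airquality` dataset instead of the chapter's Chicago data, which is not included in this diff.

```r
library(dplyr)   # assumption: dplyr is installed

monthly <- airquality %>%
  group_by(Month) %>%                            # one group per calendar month
  summarize(ozone = mean(Ozone, na.rm = TRUE),   # Ozone contains missing values
            temp  = mean(Temp))

monthly
```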
