Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added identify the problem episode for review, Dec2024. Changed to Rm… #78

Merged
merged 6 commits into from
Dec 3, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 2 additions & 3 deletions config.yaml
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@

#------------------------------------------------------------
# Values for this lesson.
#------------------------------------------------------------
Expand Down Expand Up @@ -61,7 +62,7 @@ contact: '[email protected]'
episodes:
- 1-intro-reproducible-examples.md
- 2-understanding-your-code.md
- 3-identify-the-problem.md
- 3-identify-the-problem.Rmd
- 4-minimal-reproducible-data.Rmd
- 5-minimal-reproducible-code.Rmd
- 6-asking-your-question.md
Expand All @@ -79,5 +80,3 @@ profiles:
#
# This space below is where custom yaml items (e.g. pinning
# sandpaper and varnish versions) should live


279 changes: 279 additions & 0 deletions episodes/3-identify-the-problem.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,279 @@

---
title: "Identify the problem and make a plan"
teaching: 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update the time estimates for teaching and exercises here.

exercises: 0
---

:::::::::::::::::::::::::::::::::::::: questions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these questions!


- What do I do when I encounter an error?
- What do I do when my code outputs something I don’t expect?
- Why do errors and warnings appear in R?
- Which areas of code are responsible for errors?
- How can I fix my code? What other options exist if I can't fix it?


::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

After completing this episode, participants should be able to...

- decode/describe what an error message is trying to communicate
- Identify specific lines and/or functions generating the error message
- Lookup function syntax, use, and examples using R Documentation (?help calls)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be simplified to just "Use R Documentation to support problem-solving" or something

- Describe a general category of error message (e.g. syntax error, semantic errors, package-specific errors, etc.) # be more explicit about semantic errors?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be more specific like... identify the type of error messages or distinguish between different tyes of error message or be explicit about what you are doing, describe the difference between syntax and semantic errors. Overall though I just think shorter would be better.

- Describe the output of code you are seeking. ## identify relevant warnings or code output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should go closer to the beginning, or at least before line 26. Understanding what you're trying to do is an important precursor to figuring out whether you've got a semantic error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense that you first decode your error message and then you pause and think about what you are actually trying to get

- Identify and quickly fix commonly-encountered R errors ###### what was I thinking here
- Identify which problems are better suited for asking for further help, including online help and reprex

::::::::::::::::::::::::::::::::::::::::::::::::


```{r}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this chunk can be hidden

library(readr)
library(dplyr)
library(ggplot2)
library(stringr)

# Read in the data
rodents <- read_csv("../scripts/data/surveys_complete_77_89.csv")

```


As simple as it may seem, the first step we'll cover is what to do when encountering an error or other undesired output from your code. It is our experience that many seemingly-impossible errors can be fixed on the route to create a reproducible example for an expert helper. With this episode, we hope to teach you the basics about identifying how your error might be occurring, and how to isolate the problem for others to look at.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instructor training discourages the use of language like "as simple as it may seem" or "you just have to do x" since what may appear simple to someone may not be to another and we are trying to avoid making people feel bad about themselves (I know this wasn't your intention, I fall into this trap all the time)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still advocate that you should also introduce that there are different kinds of error that may require different approaches and we are teaching them to identify these correctly



## 3.1 What do I do when I encounter an error?

Something that may go wrong is an error in your code. By this, we mean any code which generates an error message. This happens when R is unable to run your code, for a variety of reasons: some common ones include R being unable to read or interpret your commands, expecting different input types than those inputted, and user-written errors from checks you or other package creators have added to ensure your code is running correctly.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit verbose--consider simplifying language, making sure to avoid potential jargon words.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and I would make this into a proper list too so that it is easily referenced

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe to connect it better with the next section you can say "The most common type of error you might encounter is a syntax error, which means..."


The accompanying error message attempts to tell you exactly how your code failed. For example, consider the following error message that occurs when I run this command in the R console:

```{r,error=T}
ggplot(x = taxa) + geom_bar()
```

Though we know somewhere there is an object called `taxa` (it is actually a column of the dataset `rodents`), R is trying to communicate that it cannot find any such object in the local environment. Let's try again, appropriately pointing ggplot to the `rodents` dataset and `taxa` column using the `$` operator.

```{r, error=T}
ggplot(aes(x = rodents$taxa)) + geom_bar()
```


Whoops! Here we see another error message -- this time, R responds with a perhaps more-uninterpretable message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less interpretable?


Let's go over each part briefly. First, we see an error from a function called `fortify`, which we didn't even call! Then a much more helpful informational message: Did we accidentally pass `aes()` to the `data` argument? This does seem to relate to our function call, as we do pass `aes`, potentially where our data should go. A helpful starting place when attempting to decipher an error message is checking the documentation for the function which caused the error:

` ?ggplot`

Here, a Help window pops up in RStudio which provides some more information. Skipping the general description at the top, we see ggplot takes positional arguments of `data`, **then** `mapping`, which uses the `aes` call. We can see in "Arguments" that the `aes(x = rodents$taxa)` object used in the plot is attempted by `fortify` to be converted to a data.frame: now the picture is clear! We accidentally passed our `mapping` argument (telling ggplot how to map variables to the plot) into the position it expected `data` in the form of a data frame. And if we scroll down to "Examples", to "Pattern 1", we can see exactly how ggplot expects these arguments in practice. Let's amend our result:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need more explanation here


```{r}
ggplot(rodents, aes(x = taxa)) + geom_bar()
```


Here we see our desired plot.



## Summary

In general, when encountering an error message for which a remedy is not immediately apparent, some steps to take include:

1. Reading each part of the error message, to see if we can interpret and act on any.

2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).

3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be helpful to further emphasize the Examples section in this road map, or even pull it out into its own step. I liked that you specifically focused on it in the example in the lesson. A lot of people stop at the arguments section when trying to read documentation, and then they think the documentation isn't helpful. I often scroll straight to examples, and I think more people would look at documentation if they knew that was there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I guess the first thing I look at is Usage ("did I do it like I was supposed to?"), and compare my code with what it expects. Then we can immediately see that there is a data= that wasn't included. I would then go read about data, because data=NULL is confusing to me, ah! now I see this fortify thing and can understand it. I don't think I'd jump through to the example here, the Details are pretty helpful. Honestly, sometimes I find finding what I need in the example to be harder lol All this to say: I think it's fine leaving the highlighting of the example to later


4. Copying and pasting the error message into a search engine for more interpretable explanations.

And, when all else fails, preparing our code into a reproducible example for expert help.


## 3.2 What do I do when my code outputs something I don't expect

Another type of 'error' which you may encounter is when your R code runs without errors, but does not produce the desired output. You may sometimes see these called "semantic errors" (as opposed to syntax errors, though these term themselves are vague within computer science and describe a variety of different scenarios). As with actual R errors, semantic errors may occur for a variety of non-intuitive reasons, and are often harder to solve as there is no description of the error -- you must work out where your code is defective yourself!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a hypothetical example here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of 'error' we should refer to it as something else and then be consistent throughout. Maybe just a problem or issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't like the term "actual errors" it is vague and potentially confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can have a callout about how these terms are vague and you can add some random facts for those curious? Instead of the parenthetical.


With our rodent analysis, the next step in the plan is to subset to just the `Rodent` taxa (as opposed to other taxa: Bird, Rabbit, Reptile or NA). Let's quickly check to see how much data we'd be throwing out by doing so:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand on this narrative. Why is this the next step? Where are we at? Where are we going?


```{r}
table(rodents$taxa)
```

We're interested in the Rodents, and thankfully it seems like a majority of our observations will be maintained when subsetting to rodents. Except wait. In our plot, we can clearly see the presence of NA values. Why are we not seeing them here? This is an example of a semantic error -- our command is executed correctly, but the output is not everything we intended. Having no error message to interpret, let's jump straight to the documentation:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could benefit from a more explicit statement of "we expected x, but we got y"

"We expected to see a frequency table that would tell us how many of each value we had, as well as how many NAs. But instead we're only seeing the counts for each value, and we don't see a count for the number of NAs!"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be achieved through a figure? I'm envisioning the regular output and then, hand-drawn, an extra column with NAs and a question mark. But this could be added at some later date lol


```{r}
?table
```

Here, the documentation provides some clues: there seems to be an argument called `useNA` that accepts "no", "ifany", and "always", but it's not immediately apparent which one we should use. As a second approach, let's go to `Examples` to see if we can find any quick fixes. Here we see a couple lines further down:

```r
table(a) # does not report NA's
table(a, exclude = NULL) # reports NA's
```

That seems like it should be inclusive. Let's try again:

```{r}
table(rodents$taxa, exclude = NULL)
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you weave the research question in throughout this example. I think before going back to it here, this example would be further clarified by explicitly referring back to the thing about expecting the code to do one thing and having it do another.

For example, "That worked! Now we see a table that includes a count of the number of NA values, in addition to counts for the other variables. Now we can tell how many NAs we would be losing [...]" or something like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw 'useNA' is asking whether to include NAs, so 'ifany' will include them IF there are any, and 'always', well, will always include them. Maybe this can be a callout.

Now, we do see that by subsetting to the "Rodent" taxa, we are losing about 357 NAs, which themselves could be rodents! However, in this case, it seems a small enough portion to safely omit. Let's subset our data to the rodent taxa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say more like "However, since we have no way of knowing, we must omit this data if we are trying to ask questions about rodents only"


```{r}
rodents <- rodents %>% filter(taxa == "Rodent")
```

## Summary

In general, when encountering a semantic error for which a remedy is not immediately apparent, some steps to take include:

1. Reading any warning or informational messages that may pop up when executing your code.

2. Changing the input to your function call to see if the behavior is ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening here?


2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part supposed to be the same as previous?


3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

And, when all else fails, preparing our code into a reproducible example for expert help. Note, there are fewer options available as when an error message prevents your code from running. You may find yourself isolating and reproducing your problem more often with semantic errors as easily solvable syntax errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"than with easily solvable..."



::::::::::::::::::::: callout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really helpful! Could potentially be re-worded to be a little less jargon-y, but I think it's a very helpful callout to have here.


Generally, the more your code deviates from just using base R functions, or the more you use specific packages, both the quality of documentation and online help available from search engines and Googling gets worse and worse. While base R errors will often be solvable in a couple of minutes from a quick `?help` check or a long online discussion and solutions on a website like Stack Overflow, errors arising from little-used packages applied in bespoke analyses might merit isolating your specific problem to a reproducible example for online help, or even getting in touch with the developers! Such community input and questions are often the way packages and documentation improves over time.

:::::::::::::::::::::

## 3.3 How can I find where my code is failing?

Isolating your problem may not be as simple as assessing the output from a single function call on the line of code which produces your error. Often, it may be difficult to determine which line(s) in your code are producing the error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, error vs. problem vs. issue...


Consider the example below, where we now are attempting to see how which species of kangaroo rodents appear in different plot types over the years.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand on the example a little. "To answer our first research question, we want to see how the abundance of each kangaroo rat in each plot changed over the course of the experiment"



```{r, error=T}
krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus

ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)

```

Uh-oh. Another error here, this time in "combine_vars?" What is that? "Faceting variables must have at least one value": What does that mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an instructor note to pause and have a small discussion, or insert a 2 min exercise "decode the message" or something

Well it may be clear enough that we seem to be missing "species" values where we intend. Maybe we can try to make a different graph looking at what species values are present? Or perhaps there's an error earlier -- our safest approach may actually be seeing what krats looks like:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would steer away from "it may be clear enough." And this explanation may require a little bit more clarity... or maybe the bringing up of trying a graph is a bit misleading. Maybe just say that we need to take a closer look at our species



```{r}
krats
```
It's empty! What went wrong with our "Dipadomys" filter?

```{r}
rodents %>% count(genus)
```
We see two things here. For one, we've misspelled Dipodomys, which we can now amend. This quick function call also tells us we should expect a data frame with 9573 values resulting after subsetting to the genus Dipodomys.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! I like the highlighting of the number of values. We can point it out even a little more, at least when checking code, always checking how the dimensions of the data change


```{r}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe explain what you are doing before you do it?

krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)

```


Our improved code here looks good. A quick "dim" call confirms we now have all the Dipodomys observations, and our plot is looking better. In general, having a 'print' statement or some other output before plots or other major steps can be a good way to check your code is producing intermediate results consistent with your expectations.

However, there's something funky happening here. The bins are definitely weirdly spaced -- we can see some bins are not filled with any observations, while those exactly containing one of the integer years happens to contain all the observations for that year.

:::::::: challenge

As a group, name some potential problems or undesired outcomes from this graph...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this relate to the learning outcomes? Maybe we can just say that this graph doesn't look great and we want it to look better by changing the time?


::::::::

:::::::: solution

- The graph looks sparse, and unevenly so -- many bins have no observations
- Suggests that some years had more observations and others fewer based on somewhat arbitrary measurements (i.e. what calendar year happened to fall on)
- Hard to compare trends across time, or even subsequent years...

::::::::

As we discussed in the challenge, there are some issues to visualizing our data this way. A solution here might be to tinker with the bin width in the histogram code, but let's step back a minute. Do we necessarily need to dive into the depths of tinkering with the plot? We can evalulate this problem not in terms of the plot having a problem, but with our data type having a problem. There's an opportunity to encode the observation times outside of coarse, somewhat arbitrary year groupings with the real, interpretable date they were collected. Let's do that using the tidyverse's 'lubridate' package. The important details here are that we are creating a 'datetime'-type variable using the recorded years, months, and days, which are currently all encoded as numeric types.

```{r}
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))

ggplot(krats, aes(date, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)

```

This looks much better, and is easier to see the trends over time as well. Note our x axis still shows bins with year labelings, but the continuous spread of our data over these bins shows that dates are treated more continuously and fall more continuously within histogram bins.

:::::::: callout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think this is a good callout, but I'm not entirely following the connection to reproducible examples. Can you explain what you meant by that? Why is this example more relevant to reproducible examples than the other examples in this episode?

One aspect we can see with this exercise above is that by setting up a reproducible example, we can isolate the problem with data rather than simply asking a proximal problem (i.e. 'how can i change my plot to look like so'). This allows helpers and you to directly improve your code, but also allows the community to help in identifying the problem. You don't always need to understand what exact lines of code or function calls are going wrong in order to get help!
::::::::


## Summary

In general, we need to isolate the specific areas of code causing the bug or problem. There is no general rule of thumb as to how large this needs to be, but in general, think about what we would want to include in a reprex. Any early lines which we know run correctly and as intended may not need to be included, and we should seek to isolate the problem area as much as we can to make it understandable to others.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I am following...


Let's add to our list of steps for identifying the problem:

0. Identify the problem area -- add print statements immediately upstream or downstream of problem areas, step into functions, and see whether any intermediate output can be further isolated.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The audience probably won't know what a "print statement" is or what it means to "step into functions". Do you think it's within the scope of this episode to cover those techniques?


1. Read each part of the error or warning message (if applicable), to see if we can immediately interpret and act on any.

2. Pulling up the R Documentation for any function calls causing the error (which may be specific to a package, such as with ggplot).

3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like 2 and 3 cane just be one?


4. Copying and pasting the error message into a search engine for more interpretable explanations.

And, when all else fails, preparing our code into a reproducible example for expert help.

Whereas before we had a list of steps for addressing spot problems arising in one or two lines, we can now organize identifying the problem into a more organizational workflow. Any step in the above that helps us identify the specific areas or aspects of our code that are failing in particular, we can zoom in on and restart the checklist. We can stop as soon as we don't understand anymore how our code fails, at which point we can excise that area for further help.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe flesh this out a little more?





Finally, let's make our plot publication-ready by changing some aesthetics. Let's also add a vertical line to show when the study design changed on the exclosures.

```{r}
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))


krats %>%
ggplot(aes(x = date, fill = plot_type)) +
geom_histogram()+
facet_wrap(~species, ncol = 1)+
theme_bw()+
scale_fill_viridis_d(option = "plasma")+
geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue")

```

It looks like the study change helped to reduce merriami sightings in the Rodent and Short-term Krat exclosures.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest italicizing "merriami"




59 changes: 0 additions & 59 deletions episodes/3-identify-the-problem.md

This file was deleted.

Loading