diff --git a/_freeze/mod_data-viz/execute-results/html.json b/_freeze/mod_data-viz/execute-results/html.json index 04cd52c..51df418 100644 --- a/_freeze/mod_data-viz/execute-results/html.json +++ b/_freeze/mod_data-viz/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "548b2a0252722108ea4f0f0a0dcc026e", + "hash": "6a1b32aed17c224c769757d6600a0e61", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Data Visualization & Exploration\"\ncode-annotations: hover\n---\n\n\n\n\n## Overview\n\nData visualization is a fundamental part of working with data. Visualization can be only used in the final stages of a project to make figures for publication but it can also be hugely valuable for quality control and hypothesis development processes. This module focuses on the fundamentals of graph creation in an effort to empower you to apply those methods in the various contexts where you might find visualization to be helpful.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Explain how data visualization can be used to explore data\n- Define fundamental `ggplot2` vocabulary\n- Identify appropriate graph types for given data type/distribution\n- Discuss differences between presentation- and publication-quality graphs\n- Explain how your graphs can be made more accessible\n\n## Preparation\n\n1. Each Synthesis fellow should download one data file identified for your group's project\n2. _If you are a Mac user_, install [XQuartz](https://www.xquartz.org/)\n3. _If you are an R user_, run the following code:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"librarian\")\nlibrarian::shelf(tidyverse, summarytools, datacleanr, lterdatasampler, supportR, cowplot)\n```\n:::\n\n\n\n\n## Networking Session\n\nWe'll have two guests to kick off today's class. 
Each has been involved in synthesis as an early career researcher and each uses visualization in different ways to assess, clarify, and communicate their data and analyses.\n\n:::{.panel-tabset}\n\n### 2024 Guests\n\n- [Tim Ohlert](https://www.researchgate.net/scientific-contributions/Timothy-Ohlert-2172949124), Postdoctoral Researcher, Colorado State University; DroughtNet Coordinator\n\n- [Kyle Cavanaugh](https://www.ioes.ucla.edu/person/kyle-cavanaugh/), Associate Professor, UCLA Institute of the Environment and Sustainability and the UCLA Geography Department\n\n:::\n\n## Data Visualization & The Synthesis Workflow\n\nAs shown in the graphic below, visualization can be valuable throughout the lifecycle of a synthesis project, albeit in different ways at different phases of a project.\n\n

\nFigure: Diagram of data stages from raw data to published products. Credit: Margaret O'Brian & Li Kui & Sarah Elmendorf\n

\n\n## Visualization for Exploration\n\nExploratory data visualization is an important part of any scientific project. Before launching into analysis it is valuable to make some simple plots to scan the contents. These plots may reveal any number of issues, such as typos, sensor calibration problems or differences in the protocol over time.\n\nThese \"fitness for use\" visualizations are even more critical for synthesis projects. In synthesis, we are often repurposing publicly available datasets to answer questions that differ from the original motivations for data collection. As a result, the metadata included with a published dataset may be insufficient to assess whether the data are useful for your group's question. Datasets may not have been carefully quality-controlled prior to publication and could include any number of 'warts' that can complicate analyses or bias results. Some of these idiosyncrasies you may be able to anticipate in advance (e.g. spelling errors in taxonomy) and we encourage you to explicitly test for those and rectify them during the data harmonization process (see the [Data Wrangling module](https://lter.github.io/ssecr/mod_wrangle.html)). Others may come as a surprise.\n\nDuring the early stages of a synthesis project, you will want to gain skill to quickly scan through large volumes of data. The figures you make will typically be for internal use only, and therefore have low emphasis on aesthetics.\n\n### Exploratory Visualization Applications\n\nSpecific applications of exploratory data visualization include identifying:\n\n1. Dataset coverage (temporal, spatial, taxonomic)\n - For example, the metadata might indicate a dataset covers the period 2010-2020. That could mean one data point in 2010 and one in 2020! This may not be useful for a time-series analysis.\n2. Errors in metadata \n - Do the units \"make sense\" with the figure? 
Typos in metadata do occur, so if you find yourself with elephants weighing only a few grams, it may be necessary to reach out to the dataset contact.\n3. Differences in methodology\n - Do the data from sequential years, replicate sites, different providers generally fall into the same ranges or is there sensor drift or changes in protocols that need to be addressed?\n - A risk of synthesis projects is that you may find you are comparing apples to oranges across datasets, as the individual datasets included in your project were likely not collected in a coordinated fashion.\n - A benefit of synthesis projects is you will typically have large volumes of data, collected from many locations or timepoints. This data volume can be leveraged to give you a good idea of how your response variable looks at a 'typical' location as well as inform your gestalt sense of how much site-to-site, study-to-study, or year-to-year variability is expected. In our experience, where one particular dataset, or time period, strongly differs from the others, the most common root cause is differences in methodology that need to be addressed in the data harmonization process. \n\nIn the data exploration stage you may find:\n\n- Harmonization issues\n - Are all your datasets measured in units that can be converted to the same units?\n - If not, can you envision metrics (relative abundance? Effect size?) that would make datasets intercomparable?\n- Some entire datasets cannot be used\n- Parts of some datasets cannot be used\n- Additional quality control is needed (e.g. filtering large outliers)\n\nThese steps are an important precursor to the data harmonization stage, where you will process the datasets you have selected into an analysis-ready format.\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Data Sleuth\n\nIn this activity, you'll play the role of data detective. You will have many potential datasets to look through. 
It is important to do it correctly, but you likely won't need or want to develop boutique code to examine each dataset, especially since some may be discarded after an initial pass.\n\nAs a project team, discuss the following points:\n\n1. Decide on a structure for tracking results of exploratory data checks\n - Git issues? Additional columns in your team-data-inventory Google Sheet? Something else?\n - Make a list of checks you would want to apply to each dataset before inclusion\n2. Use the `summarytools` and/or `datacleanr` packages to explore one exemplar dataset that you intend to include in your project\n - Discuss any issues you discover \n - Revise the list of checks as necessary\n - Complete a pre-harmonization \"to do\" list for the dataset (e.g. remove 1993 due to incomplete sampling, convert concentrations from mmols to mg/L, contact dataset providers to ask about anomalous values in April 2021)\n3. If you choose to save any exploratory images and/or code for reference after running the interactive exploratory checks, decide on a naming convention and storage location\n - Will you add these files to your `.gitignore` or do you plan on committing them?\n4. What additional plots would you ideally make that are not available through these generic tools?\n\n::::{.panel-tabset}\n##### `summarytools` Package\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the library\nlibrary(summarytools)\n\n# Load data\ndataset_1 <- readr::read_csv(\"your_file_name_here.csv\")\n\n# View the data in your RStudio environment\nsummarytools::view(summarytools::dfSummary(dataset_1), footnote = NA) # <1>\n\n# Alternatively, save the results for viewing later, or to share with your team\nprint(summarytools::dfSummary(dataset_1), footnote = NA,\n file = 'dataset_01_summary.html')\n```\n:::\n\n\n\n1. Careful! 
Use lowercase 'v' in the `view` function of the `summarytools` package\n\n##### `datacleanr` Package\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the library\nlibrary(datacleanr)\n\n# Load data\ndataset_1 <- read_csv(\"your_file_name_here.csv\")\n\n# Launch the shiny app and view the data interactively\ndatacleanr::dcr_app(dataset_1)\n```\n:::\n\n\n\n\n::::\n\n
\n\nBoth of these packages have extensive vignettes and online instructional materials. See [here](https://cran.r-project.org/web/packages/summarytools/vignettes/introduction.html) for one from `summarytools` and [here](https://the-hull.github.io/datacleanr/) for one from `datacleanr`.\n\n:::\n\n## Graphing with `ggplot2`\n\nYou may already be familiar with the `ggplot2` package in R but if you are not, it is a popular graphing library based on [The Grammar of Graphics](https://bookshop.org/p/books/the-grammar-of-graphics-leland-wilkinson/1518348?ean=9780387245447). Every ggplot is composed of four elements:\n\n1. A 'core' `ggplot` function call\n2. Aesthetics\n3. Geometries\n4. Theme\n\nNote that the theme component may be implicit in some graphs because there is a suite of default theme elements that applies unless otherwise specified. \n\nThis module will use example data to demonstrate these tools but as we work through these topics you should feel free to substitute a dataset of your choosing! If you don't have one in mind, you can use the example dataset shown in the code chunks throughout this module. 
This dataset comes from the [`lterdatasampler` R package](https://lter.github.io/lterdatasampler/) and the data are about fiddler crabs (_Minuca pugnax_) at the [Plum Island Ecosystems (PIE) LTER](https://pie-lter.mbl.edu/) site.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse); library(lterdatasampler)\n\n# Load the fiddler crab dataset\ndata(pie_crab)\n\n# Check its structure\nstr(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [392 × 9] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n\n\n:::\n:::\n\n\n\n\nWith this dataset in hand, let's make a series of increasingly customized graphs to demonstrate some of the tools in `ggplot2`.\n\n::::{.panel-tabset}\n### 1. Starter Graph\n\nLet's begin with a scatterplot of crab size on the Y-axis with latitude on the X. We'll forgo doing anything to the theme elements at this point to focus on the other three elements.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. We're defining both the data and the X/Y aesthetics in this top-level bit of the plot. 
Also, note that each line except the last ends with a plus sign\n2. Because we defined the data and aesthetics in the `ggplot()` function call above, this geometry can assume those mappings without re-specifying them\n\n### 2. Custom Theme\n\nWe can improve on this graph by tweaking theme elements to make it use fewer of the default settings.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), # <1>\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-2-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. All theme elements require these `element_...` helper functions. `element_blank` removes theme elements but otherwise you'll need to use the helper function that corresponds to the type of theme element (e.g., `element_text` for theme elements affecting graph text)\n\n### 3. Multiple Geometries\n\nWe can further modify `ggplot2` graphs by adding _multiple_ geometries if you find it valuable to do so. Note, however, that geometry order matters! Geometries added later will be \"in front of\" those added earlier. Also, adding too much data to a plot will begin to make it difficult for others to understand the central take-away of the graph so you may want to be careful about the level of information density in each graph. 
Let's add boxplots behind the points to characterize the distribution of points more quantitatively.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_boxplot(pch = 21) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), \n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-3-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. By putting the boxplot geometry first we ensure that it doesn't cover up the points that overlap with the 'box' part of each boxplot\n\n### 4. Multiple Datasets\n\n`ggplot2` also supports adding more than one data object to the same graph! While this module doesn't cover map creation, maps are a common example of a graph with more than one data object. Another common use would be to include both the full dataset and some summarized facet of it in the same plot.\n\nLet's calculate some summary statistics of crab size to include that in our plot.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the supportR library\nlibrary(supportR)\n\n# Summarize crab size within latitude groups\ncrab_summary <- supportR::summary_table(data = pie_crab, groups = c(\"site\", \"latitude\"),\n response = \"size\", drop_na = TRUE)\n\n# Check the structure\nstr(crab_summary)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t13 obs. 
of 6 variables:\n $ site : chr \"BC\" \"CC\" \"CT\" \"DB\" ...\n $ latitude : num 42.2 41.9 41.3 39.1 30 39.6 41.6 33.3 42.7 34.7 ...\n $ mean : num 16.2 16.8 14.7 15.6 12.4 ...\n $ std_dev : num 4.81 2.05 2.36 2.12 1.8 2.72 2.29 2.42 2.3 2.34 ...\n $ sample_size: int 37 27 33 30 28 30 29 30 28 25 ...\n $ std_error : num 0.79 0.39 0.41 0.39 0.34 0.5 0.43 0.44 0.43 0.47 ...\n```\n\n\n:::\n:::\n\n\n\n\nWith this data object in-hand, we can make a graph that includes both this and the original, unsummarized crab data. To better focus on the 'multiple data objects' bit of this example we'll pare down on the actual graph code.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() + # <1>\n geom_point(pie_crab, mapping = aes(x = latitude, y = size, fill = site),\n pch = 21, size = 2, alpha = 0.2) + \n geom_errorbar(crab_summary, mapping = aes(x = latitude, # <2>\n ymax = mean + std_error,\n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, fill = site),\n pch = 23, size = 3) + \n theme(legend.title = element_blank(),\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-4-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. If you want multiple data objects in the same `ggplot2` graph you need to leave this top level `ggplot()` call _empty!_ Otherwise you'll get weird errors with aesthetics later in the graph\n2. 
This geometry adds the error bars and it's important that we add it before the summarized data points themselves if we want the error bars to be 'behind' their respective points\n\n::::\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P1)\n\nIn a script, attempt the following with one of either yours or your group's datasets:\n\n- Make a graph using `ggplot2`\n - Include at least one geometry\n - Include at least one aesthetic (beyond X/Y axes)\n - Modify at least one theme element from the default\n\n:::\n\n## Streamlining Graph Aesthetics\n\nSynthesis projects often generate an entire network of inter-related papers. Ensuring that all graphs across papers from a given team have a similar \"feel\" is a nice way of implying a certain standard of robustness for all of your group's projects. However, copy/pasting the theme elements of your graphs can (A) be cumbersome to do even once and (B) needs to be re-done every time you make a change anywhere. Fortunately, there is a better way!\n\n`ggplot2` supports adding theme elements to an object that can then be reused as needed elsewhere. This is the same theory behind wrapping repeated operations into custom functions.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Define core theme elements\ntheme_synthesis <- theme(legend.position = \"none\",\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"),\n axis.text = element_text(size = 13)) # <1>\n\n# Create a graph\nggplot(pie_crab, aes(y = water_temp, x = air_temp, color = size, size = size)) +\n geom_point() +\n theme_synthesis +\n theme(legend.position = \"right\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/std-theme-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. This theme element controls the text on the tick marks. `axis.title` controls the text in the _labels_ of the axes\n2. 
As a bonus, subsequent uses of `theme()` will replace defaults defined in your earlier theme object. So, you can design a set of theme elements that are _usually_ appropriate and then easily change just some of them as needed\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P2)\n\nIn a script, attempt the following:\n\n- Remove all theme edits from the graph you made in the preceding activity and assign them to a separate object\n - Then add that object to your graph\n- Make a second (different) graph and add your consolidated theme object to that graph as well\n\n:::\n\n## Multi-Panel Graphs\n\nIt is sometimes the case that you want to make a single graph file that has multiple panels. For many of us, we might default to creating the separate graphs that we want, exporting them, and then using software like Microsoft PowerPoint to stitch those panels into the single image we had in mind from the start. However, as all of us who have used this method know, this is hugely cumbersome when your advisor/committee/reviewers ask for edits and you now have to redo all of the manual work behind your multi-panel graph. \n\nFortunately, there are two nice entirely scripted alternatives that you might consider: **Faceted graphs** and **Plot grids**. See below for more information on both.\n\n:::{.panel-tabset}\n### Facets\n\nIn a faceted graph, every panel of the graph has the same aesthetics. These are often used when you want to show the relationship between two (or more) variables but separated by some other variable. In synthesis work, you might show the relationship between your core response and explanatory variables but facet by the original study. 
This would leave you with one panel per study where each would show the relationship only at that particular study.\n\nLet's check out an example.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(pie_crab, aes(x = date, y = size, color = site))+\n geom_point(size = 2) +\n facet_wrap(. ~ site) + # <1>\n theme_bw() +\n theme(legend.position = \"none\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/facet-1-1.png){fig-align='center' width=576}\n:::\n:::\n\n\n\n1. This is a `ggplot2` function that assumes you want panels laid out in a regular grid. There are other `facet_...` alternatives that let you specify row versus column arrangement. You could also facet by multiple variables by putting something to the left of the tilde\n2. We can remove the legend because the site names are in the facet titles in the gray boxes\n\n### Plot Grids\n\nIn a plot grid, each panel is completely independent of all others. These are often used in publications where you want to highlight several _different_ relationships that have some thematic connection. In synthesis work, your hypotheses may be more complicated than in primary research and such a plot grid would then be necessary to put all visual evidence for a hypothesis in the same location. 
On a practical note, plot grids are also a common way of circumventing figure number limits enforced by journals.\n\nLet's check out an example that relies on the `cowplot` library.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Load a needed library\nlibrary(cowplot)\n\n# Create the first graph\ncrab_p1 <- ggplot(pie_crab, aes(x = site, y = size, fill = site)) + # <1>\n geom_violin() +\n coord_flip() + # <2>\n theme_bw() +\n theme(legend.position = \"none\")\n\n# Create the second\ncrab_p2 <- ggplot(pie_crab, aes(x = air_temp, y = water_temp)) +\n geom_errorbar(aes(ymax = water_temp + water_temp_sd, ymin = water_temp - water_temp_sd),\n width = 0.1) +\n geom_errorbarh(aes(xmax = air_temp + air_temp_sd, xmin = air_temp - air_temp_sd), # <3>\n width = 0.1) +\n geom_point(aes(fill = site), pch = 23, size = 3) +\n theme_bw()\n\n# Assemble into a plot grid\ncowplot::plot_grid(crab_p1, crab_p2, labels = \"AUTO\", nrow = 1) # <4>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/grid-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. Note that we're assigning these graphs to objects!\n2. This is a handy function for flipping X and Y axes without re-mapping the aesthetics\n3. This geometry is responsible for _horizontal_ error bars (note the \"h\" at the end of the function name)\n4. The `labels = \"AUTO\"` argument means that each panel of the plot grid gets the next sequential capital letter. You could also replace that with a vector of labels of your choosing\n:::\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P3)\n\nIn a script, attempt the following:\n\n- Assemble the two graphs you made in the preceding two activities into the appropriate type of multi-panel graph\n\n:::\n\n## Accessibility Considerations\n\nAfter you've made the graphs you need, it is good practice to revisit them to ensure that they are as accessible as possible. 
You can of course also do this during the graph construction process but it is sometimes less onerous to tackle as a penultimate step in the figure creation process. There are many facets to accessibility and we've tried to cover just a few of them below.\n\n### Color Choice\n\nOne of the more well-known facets of accessibility in data visualization is choosing colors that are \"colorblind safe\". Such palettes still create distinctive colors for those with various forms of color blindness (e.g., deuteranomaly, protanomaly). The classic red-green heatmap, for instance, is very colorblind unsafe in that people with some forms of colorblindness cannot distinguish between those colors (hence the rise of the yellow-blue heatmap in recent years). Unfortunately, the `ggplot2` default rainbow palette--while nice for exploratory purposes--_is not_ colorblind safe.\n\nSome websites (such as [colorbrewer2.org](https://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=9)) include a simple checkbox for colorblindness safety which automatically limits the listed options to those that are colorblind safe. Alternatively, you could use a browser plug-in (such as [Let's get color blind](https://chromewebstore.google.com/detail/lets-get-color-blind/bkdgdianpkfahpkmphgehigalpighjck) on Google Chrome) to simulate colorblindness on a particular page.\n\nOne extreme approach you could take is to dodge this issue entirely and format your graphs such that color either isn't used at all or only conveys information that is also conveyed in another graph aesthetic. We don't necessarily recommend this as color--when the palette is chosen correctly--can be a really nice way of making information-dense graphs more informative and easily navigable by viewers.\n\n### Multiple Modalities\n\nRelated to the color conversation is the value of mapping multiple aesthetics to the same variable. 
By presenting information in multiple ways--even if that seems redundant--you enable a wider audience to gain an intuitive sense of what you're trying to display.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, \n fill = site, shape = site)) + # <1>\n geom_jitter(size = 2, width = 0.1, alpha = 0.6) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) + # <2>\n theme_bw() +\n theme(legend.title = element_blank())\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/multi-modal-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. In this graph we're mapping both the fill and shape aesthetics to site\n2. This is a little cumbersome but there are only five 'fill-able' shapes in R so we need to reuse some of them to have a unique one for each site. Using fill-able shapes is nice because you get a crisp black border around each point. See `?pch` for all available shapes\n\nIn the above graph, even though the rainbow palette is not ideal for reasons mentioned earlier, it is now much easier to tell the difference between sites with similar colors. For instance, \"NB\", \"NIB\", and \"PIE\" are all shades of light blue/teal. Now that they have unique shapes it is dramatically easier to look at the graph and identify which points correspond to which site.\n\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Graph Accessibility\n\nWith a group discuss (some of) the following questions:\n\n- What are other facets of accessibility that you think are important to consider when making data visualizations?\n- What changes do you make to your graphs to increase accessibility?\n - What changes _could_ you make going forward?\n\n:::\n\n\n### Presentation vs. Publication\n\nOne final element of accessibility to consider is the difference between a '_presentation_-quality' graph and a '_publication_-quality' one. 
While it may be tempting to create a single version of a given graph and use it in both contexts that is likely to be less effective in helping you to get your point across than making small tweaks to two separate versions of what is otherwise the same graph.\n\n:::{.panel-tabset}\n### Presentation-Focused\n\n**Do:**\n\n- Increase size of text/points **greatly**\n - If possible, sit in the back row of the room where you'll present and look at your graphs from there\n- _Consider_ adding graph elements that highlight certain graph regions\n- Present summarized data (increases focus on big-picture trends and avoids discussion of minutiae)\n- Map multiple aesthetics to the same variables\n\n**Don't:**\n\n- Use technical language / jargon\n- Include _unnecessary_ background elements\n- Use multi-panel graphs (either faceted or plot grid)\n - If you have multiple graph panels, put each on its own slide!\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(crab_summary, aes(x = latitude, y = mean, \n shape = reorder(site, latitude), # <1>\n fill = reorder(site, latitude))) +\n geom_vline(xintercept = 36.5, color = \"black\", linetype = 1) +\n geom_vline(xintercept = 41.5, color = \"black\", linetype = 2) + # <2>\n geom_errorbar(mapping = aes(ymax = mean + std_error, ymin = mean - std_error),\n width = 0.2) +\n geom_point(size = 4) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Size (mm)\") + # <3>\n theme(legend.title = element_blank(),\n axis.line = element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 17),\n axis.text = element_text(size = 15))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/talk-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. We can use the `reorder` function to make the order of sites in the legend (from top to bottom) match the order of sites in the graph (from left to right)\n2. 
Adding vertical lines at particular parts in the graph can make comparisons within the same graph easier\n3. `labs` lets us customize the title and label text of a graph\n\n### Publication-Focused\n\n**Do:**\n\n- Increase size of text/points **slightly**\n - You want to be legible but you can more safely assume that many readers will be able to increase the zoom of their browser window if needed\n- Present un-summarized data (with or without summarized points included)\n - Many reviewers will want to get a sense for the \"real\" data so you should include unsummarized values wherever possible\n- Use multi-panel graphs\n - If multiple graphs \"tell a story\" together, then they should be included in the same file!\n- Map multiple aesthetics to the same variables\n- If publishing in a journal available in print, check to make sure your graph still makes sense in grayscale\n - There are nice browser plug-ins (like [Grayscale the Web](https://chromewebstore.google.com/detail/grayscale-the-web-save-si/mblmpdpfppogibmoobibfannckeeleag) for Google Chrome) for this too\n\n**Don't:**\n\n- Include _unnecessary_ background elements\n- Add graph elements that highlight certain graph regions\n - You can--and should--lean more heavily on the text of your publication to discuss particular areas of a graph\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() +\n geom_point(pie_crab, mapping = aes(x = latitude, y = size,\n color = reorder(site, latitude)),\n pch = 19, size = 1, alpha = 0.3) +\n geom_errorbar(crab_summary, mapping = aes(x = latitude, y = mean, \n ymax = mean + std_error, \n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, \n shape = reorder(site, latitude),\n fill = reorder(site, latitude)),\n size = 4) +\n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Carapace Width (mm)\") + # <1>\n theme(legend.title = element_blank(),\n axis.line = 
element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 15),\n axis.text = element_text(size = 13))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/pub-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. Here we are using a reasonable amount of technical language\n\n### Other Considerations\n\nSome other factors you might consider _regardless of where the graphs will be embedded_ include:\n\n- **White Background**. Ensure figures have a plain, white background for clarity and compatibility with journal formats.\n- **High Resolution**. Use a resolution of at least 300 dpi for print quality. Journals often specify the minimum dpi required.\n- **Bounding Box and Borders**. Add a bounding box or border if it enhances clarity, but avoid excessive framing unless necessary to separate elements clearly.\n- **Clear Axis Labels**. Label axes with clear, concise descriptions, including units of measurement (e.g., \"Temperature (°C)\"). Use readable font sizes that remain legible when scaled.\n- **Consistent Font Style and Size**. Use a uniform font style (e.g., Arial, Helvetica) across all figures and a size that is readable but not overwhelming (typically 8–12 points).\n- **Color Scheme**. Choose a color palette that remains clear in both color and grayscale. Use distinct colors for different categories or groups, and avoid colors that may be difficult for colorblind readers to differentiate (e.g., red-green combinations).\n- **Legend Placement**. Place legends within the figure space if possible, ensuring they don't overlap data or distract from the main content. Keep legends concise.\n- **Minimal Gridlines**. Use minimal and subtle gridlines for reference, but avoid heavy or cluttered lines that may distract from the data.\n- **Error Bars and Statistical Indicators**. 
Add error bars, confidence intervals, or statistical significance markers as needed to represent variability and support interpretation.\n- **Descriptive Figure Caption**. Include a detailed caption that summarizes the figure's purpose, data source, and any essential methods or abbreviations. Captions should be self-contained to ensure figures are understandable independently.\n\n:::\n\n## Code Demo: Post-Harmonization Visualization\n\nAfter harmonizing your data, you'll want to generate one last set of 'sanity check' plots to make sure (1) you have interpreted the metadata correctly, (2) you haven't made any obvious errors in the harmonization, and (3) your data are ready for analysis. Nothing is less fun than finding out your analytical results are due to an error in the underlying data.\n\nThe following is a multi-part code demonstration of three common post-harmonization uses of visualization. In addition to the graphs themselves, the demo includes example code for exporting multiple panels of graphs to separate pages of a PDF, which can be really helpful when reviewing exploratory visualizations as a group (without needing to scroll through a ton of separate graph files).\n\n### Additional Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this demo, you'll need to install the following packages:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## install.packages(\"librarian\")\nlibrarian::shelf(tidyverse, scales, ggforce, slider)\n```\n:::\n\n\n\n\nThe three sets of plots below encompass many of the most common data structures\nwe have encountered in ecological synthesis projects. 
These include \nquantitative measurements collected over many sites, taxonomic data collected\nover many sites, and seasonal time series data.\n\n::: panel-tabset\n### Graph _All_ Numeric Variables\n\nIt can be helpful to visualize all numeric variables in your dataset, grouped by site (or dataset source) to check that the data have been homogenized correctly. As an example, we'll use a 2019 dataset on lake water quality, chemistry, and zooplankton community composition near the [Niwot Ridge](https://nwt.lternet.edu/) LTER. The dataset is a survey of 16 high alpine lakes and has structure similar to one that might be included in a multi-site synthesis. For more information on these data, check out [the data package on EDI](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-nwt.12.1). \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ngreen_biochem <- read.csv(file = file.path(\"data\", \"green-lakes_water-chem-zooplank.csv\")) # <1>\n\n# Check structure\nstr(green_biochem)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t391 obs. of 14 variables:\n $ local_site : chr \"Blue Lake\" \"Blue Lake\" \"Blue Lake\" \"Blue Lake\" ...\n $ location : chr \"LAKE\" \"LAKE\" \"LAKE\" \"LAKE\" ...\n $ depth : num 0 1 2 3 4 5 6 7 8 9 ...\n $ date : chr \"2016-07-08\" \"2016-07-08\" \"2016-07-08\" \"2016-07-08\" ...\n $ time : chr \"09:11:00\" \"09:13:00\" \"09:14:00\" \"09:16:00\" ...\n $ chl_a : num 0.521 NA NA NA NA NA NA NA NA NA ...\n $ pH : num 6.75 6.78 6.72 6.67 6.57 6.55 6.52 6.51 6.48 6.49 ...\n $ temp : num 2.8 2.8 2.73 2.72 7.72 2.65 2.65 2.65 2.64 2.65 ...\n $ std_conduct: num 8 9 10 9 10 9 9 9 9 9 ...\n $ conduct : num 4 5 6 6 6 5 5 5 5 6 ...\n $ DO : num 8.23 8.14 8.14 8.05 8.11 8.07 8.21 8.19 8.17 8.16 ...\n $ sat : num 60.9 60.1 60.2 59.4 59.8 59.4 60.3 60.3 60.1 60 ...\n $ secchi : num 6.25 NA NA NA NA NA NA NA NA NA ...\n $ PAR : num 1472 872 690 530 328 ...\n```\n\n\n:::\n:::\n\n\n\n1. 
Note that you could also read in this data directly from EDI. See ~line 31 of [this script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R) for a syntax example\n\nOnce we have the data, we can programmatically identify all columns that R knows to be numeric.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# determine which columns are numeric in green_biochem\nnumcols <- green_biochem %>%\n dplyr::select(dplyr::where(~ is.numeric(.x) == TRUE)) %>% # <1>\n names(.) %>% \n sort(.)\n\n# Check that out\nnumcols # <2>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"chl_a\" \"conduct\" \"depth\" \"DO\" \"PAR\" \n [6] \"pH\" \"sat\" \"secchi\" \"std_conduct\" \"temp\" \n```\n\n\n:::\n:::\n\n\n\n1. The tilde (`~`) is allowing us to evaluate each column against this conditional\n2. You may notice that these columns all have `\"num\"` next to them in their structure check. The scripted method is _dramatically_ faster and more reproducible than writing these names down by hand\n\nNow that we have our data and a vector of numeric column names, we can generate a multi-page PDF of scatterplots where each page is specific to a numeric variable and each graph panel within a given page shows one site's values plotted against date.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Open PDF 'device'\ngrDevices::pdf(file = file.path(\"qc_all_numeric.pdf\")) # <1>\n\n# Loop across numeric variables\nfor (var in numcols) {\n \n # Create a set of graphs for one variable\n myplot <- ggplot(green_biochem, aes(x = date, y = .data[[var]])) +\n geom_point(alpha = 0.5) + # <2>\n facet_wrap(. ~ local_site)\n \n # Print that variable\n print(myplot)\n}\n\n# Close the device\ndev.off() # <3>\n```\n:::\n\n\n\n1. This function tells R that the graphs created by the following code should be saved to a PDF\n2. A scatterplot may not be the best tool for your data; adjust appropriately\n3. 
This function (when used after a 'device' function like `grDevices::pdf`) tells R when to stop adding things to the PDF and actually save it\n\nThe first page of the resulting plot should look something like the following, with each page having the same content but a different variable on the Y axis.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/demo_all-num-vars_viz-code-real-1.png){fig-align='center' width=768}\n:::\n:::\n\n\n\n\n### Taxonomic Consistency\n\nTaxonomic time series can be tricky to work with due to inconsistencies in nomenclature and/or sampling effort. In particular, 'pseudoturnover', where one species 'disappears' with or without the simultaneous 'appearance' of another taxon, can indicate true extinctions, changes in species names, or changes in methodology that cause particular taxa not to be detected. A second complication is that taxonomic data are often archived as 'presence-only' so it is necessary to _infer_ the absences based on sampling methodology and add them to your dataset before analysis.\n\nWhile there are doubtless many field-collected datasets that have this issue, we've elected to simulate data so that we can emphasize the visualization elements of this problem while avoiding the \"noise\" typical of real data. This simulation is not necessarily vital to the visualization so we've left it out of the following demo. _However_, if that is of interest to you, see [this script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R)--in particular \\~line 41 through \\~80.\n\nA workflow for establishing taxonomic consistency and plotting the results is included below.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ntaxa_df <- read.csv(file.path(\"data\", \"simulated-taxa-df.csv\"))\n\n# Check structure\nstr(taxa_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1025 obs. 
of 4 variables:\n $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...\n $ plot : int 1 1 1 1 1 1 1 1 1 1 ...\n $ taxon: chr \"Taxon_A\" \"Taxon_B\" \"Taxon_C\" \"Taxon_D\" ...\n $ count: int 8 11 7 13 14 15 11 6 9 7 ...\n```\n\n\n:::\n:::\n\n\n\n\nFirst, we'll define units of sampling (year, plot and taxon) and 'pad out' the zeros. In this example, we have only added zeros for taxon-plot-year combinations where that taxon is present in at least one year at a given plot. Again, this zero-padding is prerequisite to the visualization but not necessarily part of it so see \\~lines 84-117 of the [prep script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R) if that process is of interest.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\nwithzeros <- read.csv(file.path(\"data\", \"simulated-taxa-df_with-zeros.csv\"))\n\n# Check structure\nstr(withzeros) # <1>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1100 obs. of 4 variables:\n $ plot : int 1 2 2 3 4 5 5 6 7 8 ...\n $ taxon: chr \"Taxon_A\" \"Taxon_A\" \"Taxon_A\" \"Taxon_A\" ...\n $ year : int 2014 2014 2019 2014 2014 2014 2019 2014 2014 2013 ...\n $ n : int 0 0 0 0 0 0 0 0 0 0 ...\n```\n\n\n:::\n:::\n\n\n\n1. Notice how there are more rows than the preceding data object and several new zeros in the first few rows?\n\nNow that we have the data in the format we need, we'll create a plot of species counts over time with zeros filled in. 
Because there are many plots and it is difficult to see so many panels on the same page, we'll use the `facet_wrap_paginate` function from the `ggforce` package to create a multi-page PDF output.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create the plot of species counts over time (with zeros filled in)\nmyplot <- ggplot(withzeros, aes(x = year, y = n, group = plot, color = plot)) +\n geom_point() +\n scale_x_continuous(breaks = scales::pretty_breaks()) +\n ggforce::facet_wrap_paginate(~ taxon, nrow = 2, ncol = 2) # <1>\n\n# Start the PDF output\ngrDevices::pdf(file.path(\"counts_by_taxon_with_zeros.pdf\"),\n width = 9, height = 5)\n\n# Loop across pages (defined by `ggforce::facet_wrap_paginate`)\n# `n_pages` returns a single number so we need `seq_len` (not `seq_along`) to visit every page\nfor (i in seq_len(ggforce::n_pages(myplot))) {\n \n page_plot <- myplot + \n ggforce::facet_wrap_paginate(~taxon, page = i, \n nrow = 2, ncol = 2)\n \n print(page_plot)\n}\n\n# Close the PDF output\ndev.off()\n```\n:::\n\n\n\n1. This allows a faceted graph to spread across more than one page. See `?ggforce::facet_wrap_paginate` for details\n\nThe first page of the resulting plot should look something like this:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/demo_tax-consist_viz-code-real-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n\n\nNotice how \"Taxon_A\" is absent from all plots in 2014 whereas \"Taxon_B\" has extremely high counts in the same year. Often this can signify inconsistent use of taxonomic names over time.\n\n### Seasonal Time Series\n\nFor time series, intra-annual variation can often make data issues difficult to spot. In these cases, it can be helpful to plot each year onto the same figure and compare trends across study years.\n\nAs an example, we'll use a 2024 dataset on streamflow near the [Niwot Ridge](https://nwt.lternet.edu/) LTER. The dataset is a 22 year time-series of daily streamflow. 
For more information on these data, check out [the data package on EDI](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-nwt.105.18). \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read data\ngreen_streamflow <- read.csv(file.path(\"data\", \"green-lakes_streamflow.csv\")) # <1>\n\n# Check structure\nstr(green_streamflow)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t15451 obs. of 6 variables:\n $ LTER_site : chr \"NWT\" \"NWT\" \"NWT\" \"NWT\" ...\n $ local_site : chr \"gl4\" \"gl4\" \"gl4\" \"gl4\" ...\n $ date : chr \"1981-06-12\" \"1981-06-13\" \"1981-06-14\" \"1981-06-15\" ...\n $ discharge : num 9786 8600 7600 6700 5900 ...\n $ temperature: num NA NA NA NA NA NA NA NA NA NA ...\n $ notes : chr \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" ...\n```\n\n\n:::\n:::\n\n\n\n1. Note again that you could also read in this data directly from EDI. See ~line 129 of [this script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R) for a syntax example\n\nLet's now calculate a moving average encompassing the 5 values before and after each focal value.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Do necessary wrangling\nstream_data <- green_streamflow %>%\n # Calculate moving average for each numeric variable\n dplyr::mutate(dplyr::across(.cols = dplyr::all_of(c(\"discharge\", \"temperature\")),\n .fns = ~ slider::slide_dbl(.x = .x, .f = mean,\n .before = 5, .after = 5),\n .names = \"{.col}_move.avg\" )) %>%\n # Handle date format issues\n dplyr::mutate(yday = lubridate::yday(date),\n year = lubridate::year(date))\n\n# Check the structure of that\nstr(stream_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t15451 obs. 
of 10 variables:\n $ LTER_site : chr \"NWT\" \"NWT\" \"NWT\" \"NWT\" ...\n $ local_site : chr \"gl4\" \"gl4\" \"gl4\" \"gl4\" ...\n $ date : chr \"1981-06-12\" \"1981-06-13\" \"1981-06-14\" \"1981-06-15\" ...\n $ discharge : num 9786 8600 7600 6700 5900 ...\n $ temperature : num NA NA NA NA NA NA NA NA NA NA ...\n $ notes : chr \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" ...\n $ discharge_move.avg : num 7299 6699 5992 5527 5274 ...\n $ temperature_move.avg: num NA NA NA NA NA NA NA NA NA NA ...\n $ yday : num 163 164 165 166 167 168 169 170 171 172 ...\n $ year : num 1981 1981 1981 1981 1981 ...\n```\n\n\n:::\n:::\n\n\n\n\nPlot seasonal timeseries of each numeric variable as points with the moving\naverage included as lines\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Start PDF output\ngrDevices::pdf(file = file.path(\"qc_all_numeric_seasonal.pdf\"))\n\n# Loop across variables\nfor (var in c(\"discharge\", \"temperature\")) {\n \n # Make the graph\n myplot <- ggplot(stream_data, aes(x = yday, group = year, color = year)) +\n geom_point(aes(y = .data[[var]])) + # <1>\n geom_line(aes(y = .data[[paste0(var, \"_move.avg\")]])) + # <2>\n viridis::scale_color_viridis()\n \n # Print it\n print(myplot)\n}\n\n# End PDF creation\ndev.off()\n```\n:::\n\n\n\n1. Add points based on the year\n2. 
Adding lines based on the average\n\nThe resulting figure should look something like this:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/demo_seasons_viz-code-real-1.png){fig-align='center' width=768}\n:::\n:::\n\n\n\n\nOne of these years is not like the others...\n:::\n\n## Multivariate Visualization\n\nIf you are working with multivariate data (i.e., data where multiple columns are all response variables collectively) you may need to use visualization methods unique to that data structure. For more information, check out the [bonus multivariate visualization module](https://lter.github.io/ssecr/mod_multivar-viz.html).\n\n## Maps\n\nYou may find it valuable to create a map as an additional way of visualizing data. Many synthesis groups do this--particularly when there is a strong spatial component to the research questions and/or hypotheses. Check out the [bonus spatial data module](https://lter.github.io/ssecr/mod_spatial.html) for more information on map-making if this is of interest!\n\n## Additional Resources\n\n### Papers & Documents\n\n- Chang, W. _et al._, [`ggplot2`: Elegant Graphics for Data Analysis](https://ggplot2-book.org/). 3^rd^ edition. **2023**.\n- National Center for Ecological Analysis and Synthesis (NCEAS). [Colorblind Safe Color Schemes](https://www.nceas.ucsb.edu/sites/default/files/2022-06/Colorblind%20Safe%20Color%20Schemes.pdf). **2022**.\n- Wilke, C.O. [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/). **2020**.\n\n### Workshops & Courses\n\n- The Carpentries. [Data Analysis and Visualization in R for Ecologists: Data Visualization with `ggplot2`](https://datacarpentry.org/R-ecology-lesson/visualizing-ggplot.html). **2024**.\n- The Carpentries. [Data Analysis and Visualization in Python for Ecologists: Making Plots with `plotnine`](https://datacarpentry.org/python-ecology-lesson/07-visualization-ggplot-python.html). 
**2024**.\n- LTER Scientific Computing Team. [Coding in the Tidyverse: 'Visualize' Module](https://lter.github.io/workshop-tidyverse/visualize.html). **2023**.\n\n### Websites\n\n- [The R Graph Gallery](https://r-graph-gallery.com/)\n", + "markdown": "---\ntitle: \"Data Visualization & Exploration\"\ncode-annotations: hover\n---\n\n\n\n\n## Overview\n\nData visualization is a fundamental part of working with data. Visualization can be used only in the final stages of a project to make figures for publication, but it can also be hugely valuable for quality control and hypothesis development. This module focuses on the fundamentals of graph creation in an effort to empower you to apply those methods in the various contexts where you might find visualization to be helpful.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Explain how data visualization can be used to explore data\n- Define fundamental `ggplot2` vocabulary\n- Identify appropriate graph types for given data type/distribution\n- Discuss differences between presentation- and publication-quality graphs\n- Explain how your graphs can be made more accessible\n\n## Preparation\n\n1. Each Synthesis fellow should download one data file identified for your group's project\n2. _If you are a Mac user_, install [XQuartz](https://www.xquartz.org/)\n3. _If you are an R user_, run the following code:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"librarian\")\nlibrarian::shelf(tidyverse, summarytools, datacleanr, lterdatasampler, supportR, cowplot)\n```\n:::\n\n\n\n\n## Networking Session\n\nWe'll have two guests to kick off today's class. 
Each has been involved in synthesis as an early career researcher and each uses visualization in different ways to assess, clarify, and communicate their data and analyses.\n\n:::{.panel-tabset}\n\n### 2024 Guests\n\n- [Tim Ohlert](https://www.researchgate.net/scientific-contributions/Timothy-Ohlert-2172949124), Postdoctoral Researcher, Colorado State University; DroughtNet Coordinator\n\n- [Kyle Cavanaugh](https://www.ioes.ucla.edu/person/kyle-cavanaugh/), Associate Professor, UCLA Institute of the Environment and Sustainability and the UCLA Geography Department\n\n:::\n\n## Data Visualization & The Synthesis Workflow\n\nAs shown in the graphic below, visualization can be valuable throughout the lifecycle of a synthesis project, albeit in different ways at different phases of a project.\n\n

\n\"Diagram\n

Diagram of data stages from raw data to published products. Credit: Margaret O'Brian & Li Kui & Sarah Elmendorf
\n

\n\n## Visualization for Exploration\n\nExploratory data visualization is an important part of any scientific project. Before launching into analysis, it is valuable to make some simple plots to scan the contents of your data. These plots may reveal any number of issues, such as typos, sensor calibration problems or differences in the protocol over time.\n\nThese \"fitness for use\" visualizations are even more critical for synthesis projects. In synthesis, we are often repurposing publicly available datasets to answer questions that differ from the original motivations for data collection. As a result, the metadata included with a published dataset may be insufficient to assess whether the data are useful for your group's question. Datasets may not have been carefully quality-controlled prior to publication and could include any number of 'warts' that can complicate analyses or bias results. Some of these idiosyncrasies you may be able to anticipate in advance (e.g. spelling errors in taxonomy) and we encourage you to explicitly test for those and rectify them during the data harmonization process (see the [Data Wrangling module](https://lter.github.io/ssecr/mod_wrangle.html)). Others may come as a surprise.\n\nDuring the early stages of a synthesis project, you will want to build the skill of quickly scanning through large volumes of data. The figures you make will typically be for internal use only, so they can place little emphasis on aesthetics.\n\n### Exploratory Visualization Applications\n\nSpecific applications of exploratory data visualization include identifying:\n\n1. Dataset coverage (temporal, spatial, taxonomic)\n - For example, the metadata might indicate a dataset covers the period 2010-2020. That could mean one data point in 2010 and one in 2020! This may not be useful for a time-series analysis.\n2. Errors in metadata \n - Do the units \"make sense\" with the figure? 
Typos in metadata do occur, so if you find yourself with elephants weighing only a few grams, it may be necessary to reach out to the dataset contact.\n3. Differences in methodology\n - Do the data from sequential years, replicate sites, different providers generally fall into the same ranges or is there sensor drift or changes in protocols that need to be addressed?\n - A risk of synthesis projects is that you may find you are comparing apples to oranges across datasets, as the individual datasets included in your project were likely not collected in a coordinated fashion.\n - A benefit of synthesis projects is you will typically have large volumes of data, collected from many locations or timepoints. This data volume can be leveraged to give you a good idea of how your response variable looks at a 'typical' location as well as inform your gestalt sense of how much site-to-site, study-to-study, or year-to-year variability is expected. In our experience, where one particular dataset, or time period, strongly differs from the others, the most common root cause is differences in methodology that need to be addressed in the data harmonization process. \n\nIn the data exploration stage you may find:\n\n- Harmonization issues\n - Are all your datasets measured in units that can be converted to the same units?\n - If not, can you envision metrics (relative abundance? Effect size?) that would make datasets intercomparable?\n- Some entire datasets cannot be used\n- Parts of some datasets cannot be used\n- Additional quality control is needed (e.g. filtering large outliers)\n\nThese steps are an important precursor to the data harmonization stage, where you will process the datasets you have selected into an analysis-ready format.\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Data Sleuth\n\nIn this activity, you'll play the role of data detective. You will have many potential datasets to look through. 
It is important to do it correctly, but you likely won't need or want to develop boutique code to examine each dataset, especially since some may be discarded after an initial pass.\n\nAs a project team, discuss the following points:\n\n1. Decide on a structure for tracking results of exploratory data checks\n - Git issues? Additional columns in your team-data-inventory google sheet? Something else?\n - Make a list of checks you would want to apply to each dataset before inclusion\n2. Use the `summarytools` and/or `datacleanr` packages to explore one exemplar dataset that you intend to include in your project\n - Discuss any issues you discover \n - Revise the list of checks as necessary\n - Complete a pre-harmonization \"to do\" list for the dataset (e.g. remove 1993 due to incomplete sampling, convert concentrations from mmols to mg/L, contact dataset providers to ask about anomalous values in April 2021)\n3. If you choose to save any exploratory images and/or code for reference after running the interactive exploratory checks, decide on a naming convention and storage location\n - Will you add these files to your `.gitignore` or do you plan on committing them?\n4. What additional plots would you ideally make that are not available through these generic tools?\n\n::::{.panel-tabset}\n##### `summarytools` Package\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the library\nlibrary(summarytools)\n\n# Load data (namespacing `readr` so this works even if the tidyverse is not attached)\ndataset_1 <- readr::read_csv(\"your_file_name_here.csv\")\n\n# View the data in your RStudio environment\nsummarytools::view(summarytools::dfSummary(dataset_1), footnote = NA) # <1>\n\n# Alternatively, save the results for viewing later, or to share with your team\nprint(summarytools::dfSummary(dataset_1), footnote = NA,\n file = 'dataset_01_summary.html')\n```\n:::\n\n\n\n1. Careful! 
Use a lowercase 'v' in the `view` function of the `summarytools` package\n\n##### `datacleanr` Package\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the library\nlibrary(datacleanr)\n\n# Load data (namespacing `readr` so this works even if the tidyverse is not attached)\ndataset_1 <- readr::read_csv(\"your_file_name_here.csv\")\n\n# Launch the Shiny app and view the data interactively\ndatacleanr::dcr_app(dataset_1)\n```\n:::\n\n\n\n\n::::\n\n
\n\nBoth of these packages have extensive vignettes and online instructional materials. See [here](https://cran.r-project.org/web/packages/summarytools/vignettes/introduction.html) for one from `summarytools` and [here](https://the-hull.github.io/datacleanr/) for one from `datacleanr`.\n\n:::\n\n## Graphing with `ggplot2`\n\nYou may already be familiar with the `ggplot2` package in R but if you are not, it is a popular graphing library based on [The Grammar of Graphics](https://bookshop.org/p/books/the-grammar-of-graphics-leland-wilkinson/1518348?ean=9780387245447). Every ggplot is composed of four elements:\n\n1. A 'core' `ggplot` function call\n2. Aesthetics\n3. Geometries\n4. Theme\n\nNote that the theme component may be implicit in some graphs because there is a suite of default theme elements that applies unless otherwise specified. \n\nThis module will use example data to demonstrate these tools but as we work through these topics you should feel free to substitute a dataset of your choosing! If you don't have one in mind, you can use the example dataset shown in the code chunks throughout this module. 
This dataset comes from the [`lterdatasampler` R package](https://lter.github.io/lterdatasampler/) and the data are about fiddler crabs (_Minuca pugnax_) at the [Plum Island Ecosystems (PIE) LTER](https://pie-lter.mbl.edu/) site.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse); library(lterdatasampler)\n\n# Load the fiddler crab dataset\ndata(pie_crab)\n\n# Check its structure\nstr(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [392 × 9] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n\n\n:::\n:::\n\n\n\n\nWith this dataset in hand, let's make a series of increasingly customized graphs to demonstrate some of the tools in `ggplot2`.\n\n::::{.panel-tabset}\n### 1. Starter Graph\n\nLet's begin with a scatterplot of crab size on the Y-axis with latitude on the X. We'll forgo doing anything to the theme elements at this point to focus on the other three elements.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. We're defining both the data and the X/Y aesthetics in this top-level bit of the plot. 
Also, note that each line except the last ends with a plus sign\n2. Because we defined the data and aesthetics in the `ggplot()` function call above, this geometry can assume those mappings without re-specifying them\n\n### 2. Custom Theme\n\nWe can improve on this graph by tweaking theme elements to make it use fewer of the default settings.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), # <1>\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-2-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. Most theme elements require these `element_...` helper functions. `element_blank` removes theme elements but otherwise you'll need to use the helper function that corresponds to the type of theme element (e.g., `element_text` for theme elements affecting graph text)\n\n### 3. Multiple Geometries\n\nWe can further modify `ggplot2` graphs by adding _multiple_ geometries if you find it valuable to do so. Note however that geometry order matters! Geometries added later will be \"in front of\" those added earlier. Also, adding too much data to a plot will begin to make it difficult for others to understand the central take-away of the graph so you may want to be careful about the level of information density in each graph. 
Let's add boxplots behind the points to characterize the distribution of points more quantitatively.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_boxplot(pch = 21) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), \n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-3-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. By putting the boxplot geometry first we ensure that it doesn't cover up the points that overlap with the 'box' part of each boxplot\n\n### 4. Multiple Datasets\n\n`ggplot2` also supports adding more than one data object to the same graph! While this module doesn't cover map creation, maps are a common example of a graph with more than one data object. Another common use would be to include both the full dataset and some summarized facet of it in the same plot.\n\nLet's calculate some summary statistics of crab size to include that in our plot.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the supportR library\nlibrary(supportR)\n\n# Summarize crab size within latitude groups\ncrab_summary <- supportR::summary_table(data = pie_crab, groups = c(\"site\", \"latitude\"),\n response = \"size\", drop_na = TRUE)\n\n# Check the structure\nstr(crab_summary)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t13 obs. 
of 6 variables:\n $ site : chr \"BC\" \"CC\" \"CT\" \"DB\" ...\n $ latitude : num 42.2 41.9 41.3 39.1 30 39.6 41.6 33.3 42.7 34.7 ...\n $ mean : num 16.2 16.8 14.7 15.6 12.4 ...\n $ std_dev : num 4.81 2.05 2.36 2.12 1.8 2.72 2.29 2.42 2.3 2.34 ...\n $ sample_size: int 37 27 33 30 28 30 29 30 28 25 ...\n $ std_error : num 0.79 0.39 0.41 0.39 0.34 0.5 0.43 0.44 0.43 0.47 ...\n```\n\n\n:::\n:::\n\n\n\n\nWith this data object in-hand, we can make a graph that includes both this and the original, unsummarized crab data. To better focus on the 'multiple data objects' bit of this example we'll pare down on the actual graph code.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() + # <1>\n geom_point(pie_crab, mapping = aes(x = latitude, y = size, fill = site),\n pch = 21, size = 2, alpha = 0.2) + \n geom_errorbar(crab_summary, mapping = aes(x = latitude, # <2>\n ymax = mean + std_error,\n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, fill = site),\n pch = 23, size = 3) + \n theme(legend.title = element_blank(),\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-4-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. If you want multiple data objects in the same `ggplot2` graph you need to leave this top level `ggplot()` call _empty!_ Otherwise you'll get weird errors with aesthetics later in the graph\n2. 
This geometry adds the error bars and it's important that we add it before the summarized data points themselves if we want the error bars to be 'behind' their respective points\n\n::::\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P1)\n\nIn a script, attempt the following with one of your own or your group's datasets:\n\n- Make a graph using `ggplot2`\n - Include at least one geometry\n - Include at least one aesthetic (beyond X/Y axes)\n - Modify at least one theme element from the default\n\n:::\n\n## Streamlining Graph Aesthetics\n\nSynthesis projects often generate an entire network of inter-related papers. Ensuring that all graphs across papers from a given team have a similar \"feel\" is a nice way of implying a certain standard of robustness for all of your group's projects. However, copy/pasting the theme elements of your graphs (A) can be cumbersome to do even once and (B) needs to be redone every time you make a change anywhere. Fortunately, there is a better way!\n\n`ggplot2` supports adding theme elements to an object that can then be reused as needed elsewhere. This follows the same logic as wrapping repeated operations into custom functions.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Define core theme elements\ntheme_synthesis <- theme(legend.position = \"none\",\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"),\n axis.text = element_text(size = 13)) # <1>\n\n# Create a graph\nggplot(pie_crab, aes(y = water_temp, x = air_temp, color = size, size = size)) +\n geom_point() +\n theme_synthesis +\n theme(legend.position = \"right\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/std-theme-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. This theme element controls the text on the tick marks. `axis.title` controls the text in the _labels_ of the axes\n2. 
As a bonus, subsequent uses of `theme()` will replace defaults defined in your earlier theme object. So, you can design a set of theme elements that are _usually_ appropriate and then easily change just some of them as needed\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P2)\n\nIn a script, attempt the following:\n\n- Remove all theme edits from the graph you made in the preceding activity and assign them to a separate object\n - Then add that object to your graph\n- Make a second (different) graph and add your consolidated theme object to that graph as well\n\n:::\n\n## Multi-Panel Graphs\n\nIt is sometimes the case that you want to make a single graph file that has multiple panels. For many of us, we might default to creating the separate graphs that we want, exporting them, and then using software like Microsoft PowerPoint to stitch those panels into the single image we had in mind from the start. However, as all of us who have used this method know, this is hugely cumbersome when your advisor/committee/reviewers ask for edits and you now have to redo all of the manual work behind your multi-panel graph. \n\nFortunately, there are two nice entirely scripted alternatives that you might consider: **Faceted graphs** and **Plot grids**. See below for more information on both.\n\n:::{.panel-tabset}\n### Facets\n\nIn a faceted graph, every panel of the graph has the same aesthetics. These are often used when you want to show the relationship between two (or more) variables but separated by some other variable. In synthesis work, you might show the relationship between your core response and explanatory variables but facet by the original study. 
This would leave you with one panel per study where each would show the relationship only at that particular study.\n\nLet's check out an example.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(pie_crab, aes(x = date, y = size, color = site))+\n geom_point(size = 2) +\n facet_wrap(. ~ site) + # <1>\n theme_bw() +\n theme(legend.position = \"none\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/facet-1-1.png){fig-align='center' width=576}\n:::\n:::\n\n\n\n1. This is a `ggplot2` function that assumes you want panels laid out in a regular grid. There are other `facet_...` alternatives that let you specify row versus column arrangement. You could also facet by multiple variables by putting something to the left of the tilde\n2. We can remove the legend because the site names are in the facet titles in the gray boxes\n\n### Plot Grids\n\nIn a plot grid, each panel is completely independent of all others. These are often used in publications where you want to highlight several _different_ relationships that have some thematic connection. In synthesis work, your hypotheses may be more complicated than in primary research and such a plot grid would then be necessary to put all visual evidence for a hypothesis in the same location. 
On a practical note, plot grids are also a common way of circumventing figure number limits enforced by journals.\n\nLet's check out an example that relies on the `cowplot` library.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Load a needed library\nlibrary(cowplot)\n\n# Create the first graph\ncrab_p1 <- ggplot(pie_crab, aes(x = site, y = size, fill = site)) + # <1>\n geom_violin() +\n coord_flip() + # <2>\n theme_bw() +\n theme(legend.position = \"none\")\n\n# Create the second\ncrab_p2 <- ggplot(pie_crab, aes(x = air_temp, y = water_temp)) +\n geom_errorbar(aes(ymax = water_temp + water_temp_sd, ymin = water_temp - water_temp_sd),\n width = 0.1) +\n geom_errorbarh(aes(xmax = air_temp + air_temp_sd, xmin = air_temp - air_temp_sd), # <3>\n width = 0.1) +\n geom_point(aes(fill = site), pch = 23, size = 3) +\n theme_bw()\n\n# Assemble into a plot grid\ncowplot::plot_grid(crab_p1, crab_p2, labels = \"AUTO\", nrow = 1) # <4>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/grid-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. Note that we're assigning these graphs to objects!\n2. This is a handy function for flipping X and Y axes without re-mapping the aesthetics\n3. This geometry is responsible for _horizontal_ error bars (note the \"h\" at the end of the function name)\n4. The `labels = \"AUTO\"` argument means that each panel of the plot grid gets the next sequential capital letter. You could also substitute that for a vector with labels of your choosing\n:::\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P3)\n\nIn a script, attempt the following:\n\n- Assemble the two graphs you made in the preceding two activities into the appropriate type of multi-panel graph\n\n:::\n\n## Accessibility Considerations\n\nAfter you've made the graphs you need, it is good practice to revisit them to ensure that they are as accessible as possible. 
You can of course also do this during the graph construction process but it is sometimes less onerous to tackle as a penultimate step in the figure creation process. There are many facets to accessibility and we've tried to cover just a few of them below.\n\n### Color Choice\n\nOne of the more well-known facets of accessibility in data visualization is choosing colors that are \"colorblind safe\". Such palettes still create distinctive colors for those with various forms of color blindness (e.g., deuteranomaly, protanomaly, etc.). The classic red-green heatmap for instance is very colorblind unsafe in that people with some forms of colorblindness cannot distinguish between those colors (hence the rise of the yellow-blue heatmap in recent years). Unfortunately, the `ggplot2` default rainbow palette--while nice for exploratory purposes--_is not_ colorblind safe.\n\nSome websites (such as [colorbrewer2.org](https://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=9)) include a simple checkbox for colorblindness safety which automatically limits the listed options to those that are colorblind safe. Alternatively, you could use a browser plug-in (such as [Let's get color blind](https://chromewebstore.google.com/detail/lets-get-color-blind/bkdgdianpkfahpkmphgehigalpighjck) on Google Chrome) to simulate colorblindness on a particular page.\n\nOne extreme approach you could take is to dodge this issue entirely and format your graphs such that color either isn't used at all or only conveys information that is also conveyed in another graph aesthetic. We don't necessarily recommend this as color--when the palette is chosen correctly--can be a really nice way of making information-dense graphs more informative and easily-navigable by viewers.\n\n### Multiple Modalities\n\nRelated to the color conversation is the value of mapping multiple aesthetics to the same variable. 
By presenting information in multiple ways--even if that seems redundant--you enable a wider audience to gain an intuitive sense of what you're trying to display.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, \n fill = site, shape = site)) + # <1>\n geom_jitter(size = 2, width = 0.1, alpha = 0.6) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) + # <2>\n theme_bw() +\n theme(legend.title = element_blank())\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/multi-modal-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. In this graph we're mapping both the fill and shape aesthetics to site\n2. This is a little cumbersome but there are only five 'fill-able' shapes in R so we need to reuse some of them to have a unique one for each site. Using fill-able shapes is nice because you get a crisp black border around each point. See `?pch` for all available shapes\n\nIn the above graph, even though the rainbow palette is not ideal for reasons mentioned earlier, it is now much easier to tell the difference between sites with similar colors. For instance, \"NB\", \"NIB\", and \"PIE\" are all shades of light blue/teal. Now that they have unique shapes it is dramatically easier to look at the graph and identify which points correspond to which site.\n\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Graph Accessibility\n\nWith a group discuss (some of) the following questions:\n\n- What are other facets of accessibility that you think are important to consider when making data visualizations?\n- What changes do you make to your graphs to increase accessibility?\n - What changes _could_ you make going forward?\n\n:::\n\n\n### Presentation vs. Publication\n\nOne final element of accessibility to consider is the difference between a '_presentation_-quality' graph and a '_publication_-quality' one. 
While it may be tempting to create a single version of a given graph and use it in both contexts that is likely to be less effective in helping you to get your point across than making small tweaks to two separate versions of what is otherwise the same graph.\n\n:::{.panel-tabset}\n### Presentation-Focused\n\n**Do:**\n\n- Increase size of text/points **greatly**\n - If possible, sit in the back row of the room where you'll present and look at your graphs from there\n- _Consider_ adding graph elements that highlight certain graph regions\n- Present summarized data (increases focus on big-picture trends and avoids discussion of minutiae)\n- Map multiple aesthetics to the same variables\n\n**Don't:**\n\n- Use technical language / jargon\n- Include _unnecessary_ background elements\n- Use multi-panel graphs (either faceted or plot grid)\n - If you have multiple graph panels, put each on its own slide!\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(crab_summary, aes(x = latitude, y = mean, \n shape = reorder(site, latitude), # <1>\n fill = reorder(site, latitude))) +\n geom_vline(xintercept = 36.5, color = \"black\", linetype = 1) +\n geom_vline(xintercept = 41.5, color = \"black\", linetype = 2) + # <2>\n geom_errorbar(mapping = aes(ymax = mean + std_error, ymin = mean - std_error),\n width = 0.2) +\n geom_point(size = 4) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Size (mm)\") + # <3>\n theme(legend.title = element_blank(),\n axis.line = element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 17),\n axis.text = element_text(size = 15))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/talk-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. We can use the `reorder` function to make the order of sites in the legend (from top to bottom) match the order of sites in the graph (from left to right)\n2. 
Adding vertical lines at particular parts in the graph can make comparisons within the same graph easier\n3. `labs` lets us customize the title and label text of a graph\n\n### Publication-Focused\n\n**Do:**\n\n- Increase size of text/points **slightly**\n - You want to be legible but you can more safely assume that many readers will be able to increase the zoom of their browser window if needed\n- Present un-summarized data (with or without summarized points included)\n - Many reviewers will want to get a sense for the \"real\" data so you should include unsummarized values wherever possible\n- Use multi-panel graphs\n - If multiple graphs \"tell a story\" together, then they should be included in the same file!\n- Map multiple aesthetics to the same variables\n- If publishing in a journal available in print, check to make sure your graph still makes sense in grayscale\n - There are nice browser plug-ins (like [Grayscale the Web](https://chromewebstore.google.com/detail/grayscale-the-web-save-si/mblmpdpfppogibmoobibfannckeeleag) for Google Chrome) for this too\n\n**Don't:**\n\n- Include _unnecessary_ background elements\n- Add graph elements that highlight certain graph regions\n - You can--and should--lean more heavily on the text of your publication to discuss particular areas of a graph\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() +\n geom_point(pie_crab, mapping = aes(x = latitude, y = size,\n color = reorder(site, latitude)),\n pch = 19, size = 1, alpha = 0.3) +\n geom_errorbar(crab_summary, mapping = aes(x = latitude, y = mean, \n ymax = mean + std_error, \n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, \n shape = reorder(site, latitude),\n fill = reorder(site, latitude)),\n size = 4) +\n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Carapace Width (mm)\") + # <1>\n theme(legend.title = element_blank(),\n axis.line = 
element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 15),\n axis.text = element_text(size = 13))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/pub-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n\n\n1. Here we are using a reasonable amount of technical language\n\n### Other Considerations\n\nSome other factors you might consider _regardless of where the graphs will be embedded_ include:\n\n- **White Background**. Ensure figures have a plain, white background for clarity and compatibility with journal formats.\n- **High Resolution**. Use a resolution of at least 300 dpi for print quality. Journals often specify the minimum dpi required.\n- **Bounding Box and Borders**. Add a bounding box or border if it enhances clarity, but avoid excessive framing unless necessary to separate elements clearly.\n- **Clear Axis Labels**. Label axes with clear, concise descriptions, including units of measurement (e.g., \"Temperature (°C)\"). Use readable font sizes that remain legible when scaled.\n- **Consistent Font Style and Size**. Use a uniform font style (e.g., Arial, Helvetica) across all figures and a size that is readable but not overwhelming (typically 8–12 points).\n- **Color Scheme**. Choose a color palette that remains clear in both color and grayscale. Use distinct colors for different categories or groups, and avoid colors that may be difficult for colorblind readers to differentiate (e.g., red-green combinations).\n- **Legend Placement**. Place legends within the figure space if possible, ensuring they don't overlap data or distract from the main content. Keep legends concise.\n- **Minimal Gridlines**. Use minimal and subtle gridlines for reference, but avoid heavy or cluttered lines that may distract from the data.\n- **Error Bars and Statistical Indicators**. 
Add error bars, confidence intervals, or statistical significance markers as needed to represent variability and support interpretation.\n- **Descriptive Figure Caption**. Include a detailed caption that summarizes the figure's purpose, data source, and any essential methods or abbreviations. Captions should be self-contained to ensure figures are understandable independently.\n\n:::\n\n## Code Demo: Post-Harmonization Visualization\n\nAfter harmonizing your data, you'll want to generate one last set of 'sanity check' plots to make sure (1) you have interpreted the metadata correctly, (2) you haven't made any obvious errors in the harmonization, and (3) your data are ready for analysis. Nothing is less fun than finding out your analytical results are due to an error in the underlying data.\n\nThe following is a multi-part code demonstration of three common post-harmonization uses of visualization. In addition to producing useful graphs, the demo includes example code for exporting multiple panels of graphs into separate pages of a PDF, which can be really helpful when reviewing exploratory visualizations as a group (without needing to scroll through a ton of separate graph files).\n\n### Additional Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this demo, you'll need to install the following packages:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## install.packages(\"librarian\")\nlibrarian::shelf(tidyverse, scales, ggforce, slider, viridis)\n```\n:::\n\n\n\n\nThe three sets of plots below encompass many of the most common data structures\nwe have encountered in ecological synthesis projects. 
These include \nquantitative measurements collected over many sites, taxonomic data collected\nover many sites, and seasonal time series data.\n\n::: panel-tabset\n### Graph _All_ Numeric Variables\n\nIt can be helpful to visualize all numeric variables in your dataset, grouped by site (or dataset source) to check that the data have been homogenized correctly. As an example, we'll use a 2019 dataset on lake water quality, chemistry, and zooplankton community composition near the [Niwot Ridge](https://nwt.lternet.edu/) LTER. The dataset is a survey of 16 high alpine lakes and has structure similar to one that might be included in a multi-site synthesis. For more information on these data, check out [the data package on EDI](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-nwt.12.1). \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ngreen_biochem <- read.csv(file = file.path(\"data\", \"green-lakes_water-chem-zooplank.csv\")) # <1>\n\n# Check structure\nstr(green_biochem)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t391 obs. of 14 variables:\n $ local_site : chr \"Blue Lake\" \"Blue Lake\" \"Blue Lake\" \"Blue Lake\" ...\n $ location : chr \"LAKE\" \"LAKE\" \"LAKE\" \"LAKE\" ...\n $ depth : num 0 1 2 3 4 5 6 7 8 9 ...\n $ date : chr \"2016-07-08\" \"2016-07-08\" \"2016-07-08\" \"2016-07-08\" ...\n $ time : chr \"09:11:00\" \"09:13:00\" \"09:14:00\" \"09:16:00\" ...\n $ chl_a : num 0.521 NA NA NA NA NA NA NA NA NA ...\n $ pH : num 6.75 6.78 6.72 6.67 6.57 6.55 6.52 6.51 6.48 6.49 ...\n $ temp : num 2.8 2.8 2.73 2.72 7.72 2.65 2.65 2.65 2.64 2.65 ...\n $ std_conduct: num 8 9 10 9 10 9 9 9 9 9 ...\n $ conduct : num 4 5 6 6 6 5 5 5 5 6 ...\n $ DO : num 8.23 8.14 8.14 8.05 8.11 8.07 8.21 8.19 8.17 8.16 ...\n $ sat : num 60.9 60.1 60.2 59.4 59.8 59.4 60.3 60.3 60.1 60 ...\n $ secchi : num 6.25 NA NA NA NA NA NA NA NA NA ...\n $ PAR : num 1472 872 690 530 328 ...\n```\n\n\n:::\n:::\n\n\n\n1. 
Note that you could also read in this data directly from EDI. See ~line 31 of [this script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R) for a syntax example\n\nOnce we have the data file, we can programmatically identify all columns that R knows to be numeric.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Determine which columns are numeric in green_biochem\nnumcols <- green_biochem %>%\n dplyr::select(dplyr::where(~ is.numeric(.x) == TRUE)) %>% # <1>\n names(.) %>% \n sort(.)\n\n# Check that out\nnumcols # <2>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"chl_a\" \"conduct\" \"depth\" \"DO\" \"PAR\" \n [6] \"pH\" \"sat\" \"secchi\" \"std_conduct\" \"temp\" \n```\n\n\n:::\n:::\n\n\n\n1. The tilde (`~`) is allowing us to evaluate each column against this conditional\n2. You may notice that these columns all have `\"num\"` next to them in their structure check. The scripted method is _dramatically_ faster and more reproducible than writing these names down by hand\n\nNow that we have our data and a vector of numeric column names, we can generate a multi-page PDF of scatterplots where each page is specific to a numeric variable and each graph panel within a given page reflects the time series at each site.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Open PDF 'device'\ngrDevices::pdf(file = file.path(\"qc_all_numeric.pdf\")) # <1>\n\n# Loop across numeric variables\nfor (var in numcols) {\n \n # Create a set of graphs for one variable\n myplot <- ggplot(green_biochem, aes(x = date, y = .data[[var]])) +\n geom_point(alpha = 0.5) + # <2>\n facet_wrap(. ~ local_site)\n \n # Print that variable\n print(myplot)\n}\n\n# Close the device\ndev.off() # <3>\n```\n:::\n\n\n\n1. This function tells R that the following code should be saved as a PDF\n2. A scatterplot may not be the best tool for your data; adjust appropriately\n3. 
This function (when used after a 'device' function like `grDevices::pdf`) tells R when to stop adding things to the PDF and actually save it\n\nThe first page of the resulting plot should look something like the following, with each page having the same content but a different variable on the Y axis.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/demo_all-num-vars_viz-code-real-1.png){fig-align='center' width=768}\n:::\n:::\n\n\n\n\n### Taxonomic Consistency\n\nTaxonomic time series can be tricky to work with due to inconsistencies in nomenclature and/or sampling effort. In particular, 'pseudoturnover', where one species 'disappears' with or without the simultaneous 'appearance' of another taxon, can indicate true extinctions, changes in species names, or changes in methodology that cause particular taxa not to be detected. A second complication is that taxonomic data are often archived as 'presence-only' so it is necessary to _infer_ the absences based on sampling methodology and add them to your dataset before analysis.\n\nWhile there are doubtless many field-collected datasets that have this issue, we've elected to simulate data so that we can emphasize the visualization elements of this problem while avoiding the \"noise\" typical of real data. This simulation is not necessarily vital to the visualization so we've left it out of the following demo. _However_, if that is of interest to you, see [this script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R)--in particular \\~line 41 through \\~80.\n\nA workflow for establishing taxonomic consistency and plotting the results is included below.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ntaxa_df <- read.csv(file.path(\"data\", \"simulated-taxa-df.csv\"))\n\n# Check structure\nstr(taxa_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1025 obs. 
of 4 variables:\n $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...\n $ plot : int 1 1 1 1 1 1 1 1 1 1 ...\n $ taxon: chr \"Taxon_A\" \"Taxon_B\" \"Taxon_C\" \"Taxon_D\" ...\n $ count: int 8 11 7 13 14 15 11 6 9 7 ...\n```\n\n\n:::\n:::\n\n\n\n\nFirst, we'll define units of sampling (year, plot and taxon) and 'pad out' the zeros. In this example, we have only added zeros for taxon-plot-year combinations where that taxon is present in at least one year at a given plot. Again, this zero-padding is a prerequisite to the visualization but not necessarily part of it, so see \\~lines 84-117 of the [prep script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R) if that process is of interest.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\nwithzeros <- read.csv(file.path(\"data\", \"simulated-taxa-df_with-zeros.csv\")) %>% \n dplyr::mutate(plot = factor(plot))\n\n# Check structure\nstr(withzeros) # <1>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1100 obs. of 4 variables:\n $ plot : Factor w/ 10 levels \"1\",\"2\",\"3\",\"4\",..: 1 2 2 3 4 5 5 6 7 8 ...\n $ taxon: chr \"Taxon_A\" \"Taxon_A\" \"Taxon_A\" \"Taxon_A\" ...\n $ year : int 2014 2014 2019 2014 2014 2014 2019 2014 2014 2013 ...\n $ n : int 0 0 0 0 0 0 0 0 0 0 ...\n```\n\n\n:::\n:::\n\n\n\n1. Notice how there are more rows than the preceding data object and several new zeros in the first few rows?\n\nNow that we have the data in the format we need, we'll create a plot of species counts over time with zeros filled in. 
Because there are many taxa and it is difficult to see so many panels on the same page, we'll use the `facet_wrap_paginate` function from the `ggforce` package to create a multi-page PDF output.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create the plot of species counts over time (with zeros filled in)\nmyplot <- ggplot(withzeros, aes(x = year, y = n, group = plot, color = plot)) +\n geom_line() +\n scale_x_continuous(breaks = scales::pretty_breaks()) +\n ggforce::facet_wrap_paginate(~ taxon, nrow = 2, ncol = 2) # <1>\n\n# Start the PDF output\ngrDevices::pdf(file.path(\"counts_by_taxon_with_zeros.pdf\"),\n width = 9, height = 5)\n\n# Loop across pages (defined by `ggforce::facet_wrap_paginate`)\n# `n_pages()` returns a single number so we need `seq_len()` to count from 1 up to it\nfor (i in seq_len(ggforce::n_pages(myplot))) {\n \n page_plot <- myplot + \n ggforce::facet_wrap_paginate(~taxon, page = i, \n nrow = 2, ncol = 2)\n \n print(page_plot)\n}\n\n# Close the PDF output\ndev.off()\n```\n:::\n\n\n\n1. This allows a faceted graph to spread across more than one page. See `?ggforce::facet_wrap_paginate` for details\n\nThe first page of the resulting plot should look something like this:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/demo_tax-consist_viz-code-real-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n\n\nNotice how \"Taxon_A\" is absent from all plots in 2014 whereas \"Taxon_B\" has extremely high counts in the same year. Often this can signify inconsistent use of taxonomic names over time.\n\n### Seasonal Time Series\n\nFor time series, intra-annual variation can often make data issues difficult to spot. In these cases, it can be helpful to plot each year onto the same figure and compare trends across study years.\n\nAs an example, we'll use a 2024 dataset on streamflow near the [Niwot Ridge](https://nwt.lternet.edu/) LTER. The dataset is a 22-year time series of daily streamflow. 
For more information on these data, check out [the data package on EDI](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-nwt.105.18). \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read data\ngreen_streamflow <- read.csv(file.path(\"data\", \"green-lakes_streamflow.csv\")) # <1>\n\n# Check structure\nstr(green_streamflow)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t15451 obs. of 6 variables:\n $ LTER_site : chr \"NWT\" \"NWT\" \"NWT\" \"NWT\" ...\n $ local_site : chr \"gl4\" \"gl4\" \"gl4\" \"gl4\" ...\n $ date : chr \"1981-06-12\" \"1981-06-13\" \"1981-06-14\" \"1981-06-15\" ...\n $ discharge : num 9786 8600 7600 6700 5900 ...\n $ temperature: num NA NA NA NA NA NA NA NA NA NA ...\n $ notes : chr \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" ...\n```\n\n\n:::\n:::\n\n\n\n1. Note again that you could also read in this data directly from EDI. See ~line 129 of [this script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R) for a syntax example\n\nLet's now calculate a moving average encompassing the 5 values before and after each focal value.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Do necessary wrangling\nstream_data <- green_streamflow %>%\n # Calculate moving average for each numeric variable\n dplyr::mutate(dplyr::across(.cols = dplyr::all_of(c(\"discharge\", \"temperature\")),\n .fns = ~ slider::slide_dbl(.x = .x, .f = mean,\n .before = 5, .after = 5),\n .names = \"{.col}_move.avg\" )) %>%\n # Handle date format issues\n dplyr::mutate(yday = lubridate::yday(date),\n year = lubridate::year(date))\n\n# Check the structure of that\nstr(stream_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t15451 obs. 
of 10 variables:\n $ LTER_site : chr \"NWT\" \"NWT\" \"NWT\" \"NWT\" ...\n $ local_site : chr \"gl4\" \"gl4\" \"gl4\" \"gl4\" ...\n $ date : chr \"1981-06-12\" \"1981-06-13\" \"1981-06-14\" \"1981-06-15\" ...\n $ discharge : num 9786 8600 7600 6700 5900 ...\n $ temperature : num NA NA NA NA NA NA NA NA NA NA ...\n $ notes : chr \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" \"flow data estimated from intermittent observations\" ...\n $ discharge_move.avg : num 7299 6699 5992 5527 5274 ...\n $ temperature_move.avg: num NA NA NA NA NA NA NA NA NA NA ...\n $ yday : num 163 164 165 166 167 168 169 170 171 172 ...\n $ year : num 1981 1981 1981 1981 1981 ...\n```\n\n\n:::\n:::\n\n\n\n\nPlot seasonal timeseries of each numeric variable as points with the moving\naverage included as lines\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Start PDF output\ngrDevices::pdf(file = file.path(\"qc_all_numeric_seasonal.pdf\"))\n\n# Loop across variables\nfor (var in c(\"discharge\", \"temperature\")) {\n \n # Make the graph\n myplot <- ggplot(stream_data, aes(x = yday, group = year, color = year)) +\n geom_point(aes(y = .data[[var]])) + # <1>\n geom_line(aes(y = .data[[paste0(var, \"_move.avg\")]])) + # <2>\n viridis::scale_color_viridis()\n \n # Print it\n print(myplot)\n}\n\n# End PDF creation\ndev.off()\n```\n:::\n\n\n\n1. Add points based on the year\n2. 
Add lines based on the moving average\n\nThe first page of the resulting figure should look something like this:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/demo_seasons_viz-code-real-1.png){fig-align='center' width=768}\n:::\n:::\n\n\n\n\nOne of these years is not like the others...\n:::\n\n## Multivariate Visualization\n\nIf you are working with multivariate data (i.e., data where multiple columns are all response variables collectively) you may need to use visualization methods unique to that data structure. For more information, check out the [bonus multivariate visualization module](https://lter.github.io/ssecr/mod_multivar-viz.html).\n\n## Maps\n\nYou may find it valuable to create a map as an additional way of visualizing data. Many synthesis groups do this--particularly when there is a strong spatial component to the research questions and/or hypotheses. Check out the [bonus spatial data module](https://lter.github.io/ssecr/mod_spatial.html) for more information on map-making if this is of interest!\n\n## Additional Resources\n\n### Papers & Documents\n\n- Chang, W. _et al._, [`ggplot2`: Elegant Graphics for Data Analysis](https://ggplot2-book.org/). 3^rd^ edition. **2023**.\n- National Center for Ecological Analysis and Synthesis (NCEAS). [Colorblind Safe Color Schemes](https://www.nceas.ucsb.edu/sites/default/files/2022-06/Colorblind%20Safe%20Color%20Schemes.pdf). **2022**.\n- Wilke, C.O. [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/). **2020**.\n\n### Workshops & Courses\n\n- The Carpentries. [Data Analysis and Visualization in R for Ecologists: Data Visualization with `ggplot2`](https://datacarpentry.org/R-ecology-lesson/visualizing-ggplot.html). **2024**.\n- The Carpentries. [Data Analysis and Visualization in Python for Ecologists: Making Plots with `plotnine`](https://datacarpentry.org/python-ecology-lesson/07-visualization-ggplot-python.html). 
**2024**.\n- LTER Scientific Computing Team. [Coding in the Tidyverse: 'Visualize' Module](https://lter.github.io/workshop-tidyverse/visualize.html). **2023**.\n\n### Websites\n\n- [The R Graph Gallery](https://r-graph-gallery.com/)\n", "supporting": [ "mod_data-viz_files" ], diff --git a/_freeze/mod_data-viz/figure-html/demo_tax-consist_viz-code-real-1.png b/_freeze/mod_data-viz/figure-html/demo_tax-consist_viz-code-real-1.png index cd2ac33..34a5d09 100644 Binary files a/_freeze/mod_data-viz/figure-html/demo_tax-consist_viz-code-real-1.png and b/_freeze/mod_data-viz/figure-html/demo_tax-consist_viz-code-real-1.png differ diff --git a/_freeze/mod_data-viz/figure-html/multi-modal-1.png b/_freeze/mod_data-viz/figure-html/multi-modal-1.png index 0db1cdb..ac4d7e5 100644 Binary files a/_freeze/mod_data-viz/figure-html/multi-modal-1.png and b/_freeze/mod_data-viz/figure-html/multi-modal-1.png differ diff --git a/mod_data-viz.qmd b/mod_data-viz.qmd index 167e4d3..7b94691 100644 --- a/mod_data-viz.qmd +++ b/mod_data-viz.qmd @@ -590,7 +590,7 @@ str(green_biochem) ``` 1. Note that you could also read in this data directly from EDI. See ~line 31 of [this script](https://github.com/lter/ssecr/blob/main/scripts/prep-data_data-viz-demo.R) for a syntax example -Once we have the data, we can programmatically identify all columns that R knows to be numeric. +Once we have the data file, we can programmatically identify all columns that R knows to be numeric. ```{r demo_all-num-vars_numcols} # determine which columns are numeric in green_biochem @@ -605,7 +605,7 @@ numcols # <2> 1. The tilde (`~`) is allowing us to evaluate each column against this conditional 2. You may notice that these columns all have `"num"` next to them in their structure check. 
The scripted method is _dramatically_ faster and more reproducible than writing these names down by hand -Now that we have our data and a vector of numeric column names, we can generate a multi-page PDF of scatterplots where each page is specific to a numeric variable and each graph panel within a given page reflects a site-by-date combination. +Now that we have our data and a vector of numeric column names, we can generate a multi-page PDF of scatterplots where each page is specific to a numeric variable and each graph panel within a given page reflects the time series at each site. ```{r demo_all-num-vars_viz-code-fake} #| eval: false @@ -670,7 +670,8 @@ First, we'll define units of sampling (year, plot and taxon) and 'pad out' the z ```{r demo_tax-consist_data-2} # Read in data -withzeros <- read.csv(file.path("data", "simulated-taxa-df_with-zeros.csv")) +withzeros <- read.csv(file.path("data", "simulated-taxa-df_with-zeros.csv")) %>% + dplyr::mutate(plot = factor(plot)) # Check structure str(withzeros) # <1> @@ -684,7 +685,7 @@ Now that we have the data in the format we need, we'll create a plot of species # Create the plot of species counts over time (with zeros filled in) myplot <- ggplot(withzeros, aes(x = year, y = n, group = plot, color = plot)) + - geom_point() + + geom_line() + scale_x_continuous(breaks = scales::pretty_breaks()) + ggforce::facet_wrap_paginate(~ taxon, nrow = 2, ncol = 2) # <1> @@ -718,7 +719,7 @@ The first page of the resulting plot should look something like this: # Make multi-page graph myplot <- ggplot(withzeros, aes(x = year, y = n, group = plot, color = plot)) + - geom_point() + + geom_line() + scale_x_continuous(breaks = scales::pretty_breaks()) + ggforce::facet_wrap_paginate(~ taxon, nrow = 2, ncol = 2) @@ -793,7 +794,7 @@ dev.off() 1. Add points based on the year 2. 
Adding lines based on the average -The resulting figure should look something like this: +The first page of the resulting figure should look something like this: ```{r demo_seasons_viz-code-real} #| fig-align: center