03--Data_formatting.Rmd

---
bibliography: bibtexlib.bib
---

```{r init, echo=FALSE}
library(knitr)
library(xtable)
library(tools)
library(knitcitations)
opts_chunk$set(fig.width = 10,
               fig.height = 5,
               cache = FALSE)
cite_options(citation_format = "pandoc", max.names = 3, style = "html", 
             hyperlink = "to.doc")
source("functions.R")
gloss <- make_glossary_function()
ftf_def <- paste(
  "a text file to hold data in a tabular format, where cells are separated by a",
  "common character such as comma or tab. Common formats are csv and tsv."
)
```

# Data Formatting

GIGO is an acronym that stands for "Garbage In, Garbage Out", and this is often 
a mantra to remember when gathering data. The same goes for entering and storing
data in a spreadsheet. Often, the first point of error for any analysis is at 
the data entry stage. This section will show you the best practices of data 
storage and why this is important for your analyses, whether you are analyzing 
disease progress, fungicide resistance, or complex bioinformatic data. Much of this
is based off of the Data Carpentry module "[Using Spreadsheet Programs for
Scientific Data](http://www.datacarpentry.org/2015-03-09-ISI-CODATA/lessons/excel/ecology-examples/00-intro.html)"
designed by Christie Bahlai and Aleksandra Pawlik. Please refer to the previous
link if you need any more information.

## Isn't Excel/OpenOffice/Google Sheets good enough?

**No** [@zeeberg2004mistaken]. 
![An example of gene names accidentally being converted to dates in Microsoft
Excel](images/excel_date_error.jpg){.big_window .knit_image .knit_border}

For data entry, Excel is a fantastic program because it's intuitive to enter data into cells, but Excel also reinforces a lot of bad habits that can make your data inaccessible to analyses down the line.

## How should I format my data?

It all depends on what your data is, but ultimately there are a few guidelines
or best practices to follow:

 - Place all data on one sheet in a contiguous rectangle
 - One datum per cell
 - Saved as a `r gloss("flat text file", ftf_def)` (e.g. `.csv`, `.tsv`, `.txt`)
 - Observations in rows (note: 1 sample ≠ 1 observation)
 - Variables in columns (e.g. Treatment, Replicate, Response)
 - Consistent formatting of missing data (use either a blank cell or `NA`)
 - NEVER merge cells
 
The rest of this section will show some examples of good and bad data formatting related to plant pathology. For a more detailed insight on data formatting from someone who has A LOT of experience with different data formats, take a look at [Karl Broman's Data Organization Tutorial](http://kbroman.org/dataorg/). He covers the following topics:
 
 - [Be consistent](http://kbroman.org/dataorg/pages/consistency.html)
 - [Write dates as `YYYY-MM-DD`](http://kbroman.org/dataorg/pages/dates.html)
 - [Fill in all of the cells](http://kbroman.org/dataorg/pages/no_empty_cells.html)
 - [Put just one thing in a cell](http://kbroman.org/dataorg/pages/one_thing_per_cell.html)
 - [Make it a rectangle](http://kbroman.org/dataorg/pages/rectangle.html)
 - [Create a data dictionary](http://kbroman.org/dataorg/pages/dictionary.html)
 - [No calculations in the raw data files](http://kbroman.org/dataorg/pages/no_calculations.html)
 - [Don't use font color or highlighting as data](http://kbroman.org/dataorg/pages/no_highlighting.html)
 - [Choose good names for things](http://kbroman.org/dataorg/pages/names.html)
 - [Make backups](http://kbroman.org/dataorg/pages/backups.html)
 - [Use data validation to avoid data entry mistakes](http://kbroman.org/dataorg/pages/validation.html)
 - [Save the data in plain text files](http://kbroman.org/dataorg/pages/csv_files.html)
 - [Other things to avoid](http://kbroman.org/dataorg/pages/avoid.html)
 
# Same Data, Different Formats

In the following examples, we will use data from a stripe rust case study
provided by APS. [Here is the link to the case study.](http://www.apsnet.org/edcenter/advanced/topics/EcologyAndEpidemiologyInR/DiseaseProgress/Pages/StripeRust.aspx)

## The good: Following the rules outlined above

There are a few flavors 
of well-formatted data presented here, but they all have the above rules in
common and, thus, can all easily be converted to and from each other within R
and other computer programs.

#### Long data

Long data is so named because there are typically a vast number of rows, each 
with a single datum and its associated treatments, creating tables that tend
to have more rows than columns.

![](images/long_data.png){.knit_border .knit_image}

#### Wide data

Wide data is the opposite: each row has data for all treatments and each treatment is a column. 
It's called wide data because the number of columns tends to exceed the number of rows.

![](images/wide_data.png){.knit_border .knit_image}

#### An appropriate mixture of long and wide data

Of course, recording observations in either long or wide data exclusively is not necessarily intuitive for the person entering the data, so a combination of these can be beneficial. Also, tools exist to convert between long and wide tables so data can be entered however it is most convenient. The example below encodes two columns for data: one for fungicide treated and the other for untreated. The date and cultivar are specified as they are in the long data in the first two fields. 

![](images/two_column_data.png){.knit_border .knit_image}

## The bad: Storing data in excel and using multiple sheets

Saving data in Excel is a recipe
for disaster, [**especially if you are storing
dates**](http://ecologybits.com/index.php/2016/07/06/beware-this-scary-thing-excel-can-do-to-your-data/).

#### Multiple sheets

One thing that is commonly done is to store data across multiple sheets. While
this may seem to help organize, it prevents you from storing your data in a flat
text format and makes it more difficult to read it in to R (although it is
possible with the *readxl* package).

![](images/two_sheet_data.png){.knit_border .knit_image}

## The ugly: multiple tables, color-coding, etc...

Ugly data is, well, ugly. At least with bad data, it can be read into R. With
ugly data, there is little to no hope of getting it into R without reformatting
(and at that point, you might as well convert it into good data :).

#### Small Multiples

Separate tables within the spreadsheet with merged cells, numbers and spaces
within headers, and inconsistent formatting (different names for columns, zeroes
vs blanks).

![](images/ugly_data.png){.knit_border .knit_image}

# Resources

There are a few ways to manage data without the use of Excel or Google Sheets, and one way is to use an editor that specifically works on CSV files. A couple of good examples are:

 - [Comma Chameleon](http://comma-chameleon.io/)
 - [Table Tool](https://github.com/jakob/TableTool#readme) (OSX only)
 
These will allow you to enter your data and that's about it. No bells, whistles, automatic reformatting, or guessing on the editors side, giving you complete control.

# Try it! {#exercise}

------

### Question 1

Take a look at the [data set in the Wide Data section
above (csv format)](http://www.apsnet.org/edcenter/advanced/topics/EcologyAndEpidemiologyInR/DiseaseProgress/Documents/FungicideExample.csv).
It's formatted as wide data, but is there anything that could be improved?

### Question 2

Examine the following data sets:

 - [Fungicide Data](https://github.com/grunwaldlab/Reproducible-science-in-R/blob/master/bad_data/Fungicide%20Spread%20Sheet%202012.xls?raw=true) (xls format)
 - [Temperature Data](https://github.com/grunwaldlab/Reproducible-science-in-R/blob/master/bad_data/Temperature%20Spreadsheet%202012.xlsx?raw=true) (xlsx format)
 
Answer the following questions:

 - What categories would you put these data in?
 - What formatting mistakes do you see?
 - Reformat one of these sheets into a proper format.


# Glossary

```{r glossary, results = 'asis', echo = FALSE}
gloss(display = TRUE)
```

# References