From a4c34669b7fd75f98ad406edc5dd06e34084386f Mon Sep 17 00:00:00 2001
From: Jeremy Stanley
Date: Sun, 5 Apr 2015 07:49:22 -0400
Subject: [PATCH] #29 first draft vignette

---
 .gitignore                             |   1 +
 vignettes/introduction-to-tidyjson.Rmd | 443 +++++++++++++++++++++++++
 2 files changed, 444 insertions(+)
 create mode 100644 vignettes/introduction-to-tidyjson.Rmd

diff --git a/.gitignore b/.gitignore
index 21275f5..7b16023 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,3 +2,4 @@
 .RData
 .Rhistory
 *.swp
+inst/doc
diff --git a/vignettes/introduction-to-tidyjson.Rmd b/vignettes/introduction-to-tidyjson.Rmd
new file mode 100644
index 0000000..b5b99af
--- /dev/null
+++ b/vignettes/introduction-to-tidyjson.Rmd
@@ -0,0 +1,443 @@
---
title: "Introduction to tidyjson"
author: "Jeremy Stanley"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to tidyjson}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

[JSON](http://json.org/) (JavaScript Object Notation) is a lightweight data
format that is easy for humans to read and for machines to parse. It is also
incredibly flexible. JSON has become a common format used in:

- Public APIs (e.g., [Twitter](https://dev.twitter.com/rest/public))
- Document-oriented NoSQL databases (e.g., [MongoDB](https://www.mongodb.org/))
- Flexible JSON columns in relational databases (e.g., [PostgreSQL](http://www.postgresql.org/docs/9.4/static/datatype-json.html))

As such, R users are increasingly faced with JSON data sets, and need easy and
reliable ways to turn those data sets into data.frames for analysis or modeling.

There are already several libraries for working with JSON data in R, such as
[rjson](http://cran.r-project.org/web/packages/rjson/index.html),
[RJSONIO](http://cran.r-project.org/web/packages/RJSONIO/index.html) and
[jsonlite](http://cran.r-project.org/web/packages/jsonlite/index.html). Using
these libraries, you can transform JSON into a nested R list. However, working
with nested lists using base R functionality is difficult.

The jsonlite package goes further by automatically creating a nested R data.frame.
This is easier to work with than a list, but has two main limitations. First, the
resulting data.frame isn't [tidy](http://vita.had.co.nz/papers/tidy-data.pdf),
and so it can still be difficult to work with. Second, the structure of the
data.frame may vary as the JSON sample changes, which can happen any time you
change the database query or API call that generated the data.

The tidyjson package takes an alternate approach to structuring JSON data into tidy
data.frames. Similar to [tidyr](http://cran.r-project.org/web/packages/tidyr/index.html),
tidyjson builds a grammar for manipulating JSON into a tidy table structure.
Tidyjson is based on the following principles:

- Leverage other libraries for efficiently parsing JSON ([jsonlite](http://cran.r-project.org/web/packages/jsonlite/index.html))
- Integrate with pipelines built on [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html)
and the [magrittr](http://cran.r-project.org/web/packages/magrittr/index.html) `%>%` operator
- Turn arbitrarily complex and nested JSON into tidy data.frames that can be joined later
- Guarantee a deterministic data.frame column structure
- Naturally handle 'ragged' arrays and / or objects (varying lengths by document)
- Allow for extraction of data in values *or* key names
- Ensure edge cases are handled correctly (especially empty data)

## A Simple Example

A simple example of how tidyjson works is as follows:

```{r, message = FALSE}
library(tidyjson) # this package
library(dplyr)    # for %>% and other dplyr functions

# Define a simple JSON array of people
people <- '[{"name": "bob", "age": 32}, {"name": "susan", "age": 54}]'

# Structure the data
people %>%        # Use the %>% pipe operator to pass JSON through a pipeline
  as.tbl_json %>% # Parse the JSON and set up a 'tbl_json' object
  gather_array %>% # Gather (stack) the array by index
  spread_values(   # Spread (widen) values into new data.frame columns
    user.name = jstring("name"), # Extract the "name" value as a character column "user.name"
    user.age = jnumber("age")    # Extract the "age" value as a numeric column "user.age"
  )
```

For such a simple example, we can use `fromJSON` in the jsonlite package to do
this much faster:

```{r}
library(jsonlite)
jsonlite::fromJSON(people)
```

However, if the structure of the data changed, so would the output from `fromJSON`.
So even in this simple example there is value in the explicit structure defined
in the tidyjson pipeline above.

## A Complex Example

The tidyjson package really shines in a more complex example. Consider the
following JSON, which describes three purchases of five items made by two
individuals:

```{r}
purch_json <- '
[
  {
    "name": "bob",
    "purchases": [
      {
        "date": "2014/09/13",
        "items": [
          {"name": "shoes", "price": 187},
          {"name": "belt", "price": 35}
        ]
      }
    ]
  },
  {
    "name": "susan",
    "purchases": [
      {
        "date": "2014/10/01",
        "items": [
          {"name": "dress", "price": 58},
          {"name": "bag", "price": 118}
        ]
      },
      {
        "date": "2015/01/03",
        "items": [
          {"name": "shoes", "price": 115}
        ]
      }
    ]
  }
]'
```

Suppose we want to find out how much each person has spent.

Using jsonlite, we can parse the JSON:

```{r}
library(jsonlite)
# Parse the JSON into a data.frame
purch_df <- jsonlite::fromJSON(purch_json)
# Examine results
purch_df
```

However, the resulting data structure is a complex nested data.frame:

```{r}
str(purch_df)
```

This is difficult to work with, and we end up writing code like this:

```{r}
lapply(lapply(purch_df$purchases, `[[`, "items"), lapply, `[[`, "price")
```

Reasoning about code like this is nearly impossible, and further, the relational
structure of the data is lost (we no longer have the name of the user).
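
To see how unwieldy this gets, here is one way (just a sketch against the
`purch_df` object above) to compute each person's total spend with base R,
stitching the names back on manually:

```{r}
# Sum the price of every item in every purchase, for each person
setNames(
  sapply(purch_df$purchases,
         function(p) sum(unlist(lapply(p$items, `[[`, "price")))),
  purch_df$name
)
```

This works, but it is hard to read, hard to verify and brittle to any change in
the structure of the JSON.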
Using tidyjson, we can build a pipeline to turn this JSON into a tidy data.frame
where each row corresponds to a purchased item:

```{r}
purch_items <- purch_json %>%
  as.tbl_json %>% gather_array %>%
  spread_values(person = jstring("name")) %>%
  enter_object("purchases") %>% gather_array %>%
  spread_values(purchase.date = jstring("date")) %>%
  enter_object("items") %>% gather_array %>%
  spread_values(
    item.name = jstring("name"),
    item.price = jnumber("price")
  ) %>%
  select(person, purchase.date, item.name, item.price)
```

The resulting data.frame is exactly what we want:

```{r}
purch_items
```

And we can easily continue the pipeline in dplyr to compute derived data:

```{r}
purch_items %>% group_by(person) %>% summarize(spend = sum(item.price))
```

## Data

The tidyjson package comes with several JSON example datasets:

* `commits`: commit data for the dplyr repo from the GitHub API
* `issues`: issue data for the dplyr repo from the GitHub API
* `worldbank`: World Bank funded projects from
[jsonstudio](http://jsonstudio.com/resources/)
* `companies`: startup company data from
[jsonstudio](http://jsonstudio.com/resources/)

Each dataset has some example tidyjson queries in `help(commits)`,
`help(issues)`, `help(worldbank)` and `help(companies)`.

## JSON

(TODO: Need to describe JSON more here).

### Create a `tbl_json` object

The first step in using tidyjson is to convert your JSON into a `tbl_json` object.
Almost every function in tidyjson accepts a `tbl_json` object as its first
argument, and returns a `tbl_json` object for downstream use. `tbl_json`
inherits from `dplyr::tbl`.

A `tbl_json` object consists of a `data.frame` with an additional attribute,
`JSON`, that contains a list of JSON data of the same length as the number of
rows in the `data.frame`. Each row of data in the `data.frame` corresponds to the
JSON found at the same index of the `JSON` attribute.

The easiest way to construct a `tbl_json` object is directly from a character
string or vector:

```{r}
# Will return a 1 row data.frame with a length 1 JSON attribute
'{"key": "value"}' %>% as.tbl_json

# Will still return a 1 row data.frame with a length 1 JSON attribute, as
# the character string is of length 1 (even though it contains a JSON array of
# length 2)
'[{"key1": "value1"}, {"key2": "value2"}]' %>% as.tbl_json

# Will return a 2 row data.frame with a length 2 JSON attribute
c('{"key1": "value1"}', '{"key2": "value2"}') %>% as.tbl_json
```

Behind the scenes, `as.tbl_json()` parses the JSON strings and creates a
data.frame with one column, `document.id`, which keeps track of the character
vector position (index) where the JSON data came from.

TODO: Need to show how to create one from a data.frame
TODO: Also need to talk about JSON lines format

## Verbs

The rest of tidyjson consists of verbs that operate on `tbl_json`
objects and return `tbl_json` objects. They are meant to be used in a pipeline
with the `%>%` operator.

Note that these verbs all operate on *both* the underlying data.frame and the
JSON, iteratively moving data from the JSON into the data.frame. Any
modification of the underlying data.frame outside of these operations
may produce unintended consequences where the data.frame and JSON become out of
sync.

The following table provides a reference for how each verb is used and what
(if any) effect it has on the data.frame rows and columns and on the associated
JSON.
| Verb                 | JSON   | Arguments       | Row Effect        | Column Effect    | JSON Effect    |
|:---------------------|:-------|:----------------|:------------------|:-----------------|:---------------|
| `enter_object()`     | object | ... = key path  | Drops without key | none             | object value   |
| `json_types()`       | any    | column.name     | none              | type column      | none           |
| `gather_array()`     | array  | column.name     | Duplicates rows   | index column     | array values   |
| `gather_keys()`      | object | column.name     | Duplicates rows   | key column       | object values  |
| `spread_values()`    | object | ... = columns   | none              | N value columns  | none           |
| `append_values_X()`  | scalar | column.name     | none              | column of type X | none           |

### Identify JSON structure with `json_types()`

One of the first steps you will want to take is to investigate the structure of
your JSON data. The function `json_types()` inspects the JSON associated with
each row of the data.frame, and adds a new column (`type` by default) that
identifies the type according to the [JSON standard](http://json.org/).

```{r}
types <- c('{"a": 1}', '[1, 2]', '"a"', '1', 'true', 'null') %>% as.tbl_json %>%
  json_types
types$type
```

This is particularly useful for inspecting your JSON data types, and can be added
after `gather_array()` (or `gather_keys()`) to inspect the types of the elements
(or values) in arrays (or objects).

### Stack a JSON array with `gather_array()`

Arrays are sometimes vectors (fixed or varying length integer, character or
logical vectors). But they also often contain lists of other objects (like
a list of purchases for a user). The function `gather_array()` takes JSON arrays
and duplicates the rows in the data.frame to correspond to the indices of the
array, and puts the elements of the array into the JSON attribute.
This is equivalent to "stacking" the array in the data.frame, and lets you
continue to manipulate the remaining JSON in the elements of the array.

```{r}
'[1, "a", {"k": "v"}]' %>% as.tbl_json %>% gather_array %>% json_types
```

This allows you to *enter into* an array and begin processing its elements
with other tidyjson functions. It retains the `array.index` in case the relative
position of elements in the array is useful information.

### Stack a JSON object with `gather_keys()`

Similar to `gather_array()`, `gather_keys()` takes JSON objects and duplicates
the rows in the data.frame to correspond to the keys of the object, and puts the
values of the object into the JSON attribute.

```{r}
'{"name": "bob", "age": 32}' %>% as.tbl_json %>% gather_keys %>% json_types
```

This allows you to *enter into* the keys of an object just like `gather_array()`
lets you enter the elements of an array.

### Create new columns from JSON values with `spread_values()`

Adding new columns to your `data.frame` is accomplished with `spread_values()`,
which lets you dive into (potentially nested) JSON objects and extract specific
values.
`spread_values()` takes `jstring()`, `jnumber()` or `jlogical()`
function calls as arguments in order to specify the type of the data that should
be captured at each desired key location.

These values can be of varying types at varying depths, e.g.,

```{r}
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
  as.tbl_json %>%
  spread_values(
    first.name = jstring("name", "first"),
    age = jnumber("age")
  )
```

### Stack all values of a specified type with `append_values_X()`

The `append_values_X()` functions let you take the remaining JSON and add it as
a column X (for X in "string", "number", "logical") insofar as it is of the
JSON type specified. For example:

```{r}
'{"first": "bob", "last": "jones"}' %>% as.tbl_json %>%
  gather_keys() %>% append_values_string()
```

Any values that do not conform to the type specified will be NA in the resulting
column. This includes other scalar types (e.g., numbers or logicals if you are
using `append_values_string()`) and *also* any rows where the JSON is still an
object or an array.

### Dive into a specific object "key" with `enter_object()`

For complex JSON structures, you will often need to navigate into nested objects
in order to continue structuring your data. The function `enter_object()` lets
you dive into a specific object key in the JSON attribute, so that all further
tidyjson calls happen inside that object (all other JSON data outside the object
is discarded). If the object doesn't exist for a given row / index, then that
data.frame row will be discarded.

```{r}
c('{"name": "bob", "children": ["sally", "george"]}', '{"name": "anne"}') %>%
  as.tbl_json %>% spread_values(parent.name = jstring("name")) %>%
  enter_object("children") %>%
  gather_array %>% append_values_string("children")
```

This is useful when you want to limit your data to just the information found in
a specific key.

## Strategies

When beginning to work with JSON data, you often don't have easy access to a
schema describing what is in the JSON. One of the benefits of document-oriented
data structures is that they let developers create data without having to worry
about defining the schema explicitly.

Thus, the first task is usually to understand the structure of the JSON. A good
starting point is to look at individual records with `jsonlite::prettify()`:

```{r, message = FALSE}
library(jsonlite)
# purch_json (defined above) stands in here for a single record of your data
prettify(purch_json)
```

Examining various random records can begin to give you a sense of what the JSON
contains and how it is structured. However, keep in mind that in many cases
documents that are missing data (either unknown or irrelevant) may omit the
entire JSON structure.

Next, you can begin working with the data in R.

```{r}
# assuming documents are newline delimited, otherwise use readChar
# json <- readLines(file.json) # TODO: Need to change this

# Inspect the types of objects
# json %>% json_types %>% table
```

Then, if you want to work with a single row of data for each JSON object, use
`spread_values()` to get at (potentially nested) key-value pairs.

If all you care about is data from a certain sub-object, then use `enter_object()`
to dive into that object directly. Make sure you first use `spread_values()` to
capture any top-level identifiers you might need for analytics, summarization or
relational uses downstream.

If you want to access arrays, use `gather_array()` to stack their elements, and
then proceed as though you had separate documents (again, first spread any
top-level keys you need).
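
For instance, here is a small sketch of that pattern on a made-up `visits`
record: spread the top-level identifier first, then stack the array so each
element can be worked with as though it were its own document:

```{r}
visits <- '{"user": "bob", "visits": [
  {"page": "home", "seconds": 10},
  {"page": "about", "seconds": 4}
]}'

visits %>% as.tbl_json %>%
  spread_values(user = jstring("user")) %>%   # keep the top-level identifier
  enter_object("visits") %>% gather_array %>% # stack the array elements
  spread_values(
    page = jstring("page"),
    seconds = jnumber("seconds")
  )
```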
Finally, if you have data where information is encoded in both keys and values,
then consider using `gather_keys()` and `append_values_X()`, where `X` is the type
of JSON scalar data you expect in the values.

It's important to remember that any of the above can be combined iteratively
to do some fairly complex data extraction. For example:

```{r}
json <- '{
  "name": "bob",
  "shopping cart":
    [
      {
        "date": "2014-04-02",
        "basket": {"books": 2, "shirts": 0}
      },
      {
        "date": "2014-08-23",
        "basket": {"books": 1}
      }
    ]
}'

json %>% as.tbl_json %>%
  spread_values(customer = jstring("name")) %>% # Keep the customer name
  enter_object("shopping cart") %>%             # Look at their cart
  gather_array %>%           # Expand the data.frame and dive into each array element
  spread_values(date = jstring("date")) %>%     # Keep the date of the cart
  enter_object("basket") %>%                    # Look at their basket
  gather_keys("product") %>% # Expand the data.frame for each product and capture its name
  append_values_number("quantity")              # Capture the values as the quantity
```

Note that there are often multiple arrays or objects of differing types at the
same level of the JSON hierarchy. In this case, you need to use `enter_object()`
to enter each of them in *separate* pipelines to create *separate* `data.frames`
that can then be joined relationally.

Finally, don't forget that once you are done with your JSON tidying, you can
use [dplyr](http://github.com/hadley/dplyr) to continue manipulating the
resulting data at your leisure!
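
For example, a quick sketch that stores the result of the pipeline above in a
(hypothetically named) `cart_items` object and then summarizes it with dplyr:

```{r}
cart_items <- json %>% as.tbl_json %>%
  spread_values(customer = jstring("name")) %>%
  enter_object("shopping cart") %>% gather_array %>%
  spread_values(date = jstring("date")) %>%
  enter_object("basket") %>%
  gather_keys("product") %>%
  append_values_number("quantity")

# Total number of items each customer has purchased
cart_items %>% group_by(customer) %>% summarize(total.items = sum(quantity))
```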