From fae9ff2ffbbda398e701b5ce984af41ea46bff65 Mon Sep 17 00:00:00 2001
From: Jeremy Stanley
Date: Sun, 5 Apr 2015 07:49:22 -0400
Subject: [PATCH] #29 first draft vignette

---
 .gitignore                             |   1 +
 vignettes/introduction-to-tidyjson.Rmd | 443 +++++++++++++++++++++++++
 2 files changed, 444 insertions(+)
 create mode 100644 vignettes/introduction-to-tidyjson.Rmd

diff --git a/.gitignore b/.gitignore
index 21275f5..7b16023 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,3 +2,4 @@
 .RData
 .Rhistory
 *.swp
+inst/doc
diff --git a/vignettes/introduction-to-tidyjson.Rmd b/vignettes/introduction-to-tidyjson.Rmd
new file mode 100644
index 0000000..b5b99af
--- /dev/null
+++ b/vignettes/introduction-to-tidyjson.Rmd
@@ -0,0 +1,443 @@
+---
+title: "Introduction to tidyjson"
+author: "Jeremy Stanley"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Introduction to tidyjson}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\usepackage[utf8]{inputenc}
+---
+
+[JSON](http://json.org/) (JavaScript Object Notation) is a lightweight data
+format that is easy for humans to read and for machines to parse. It is also
+incredibly flexible. JSON has become a common format used in:
+
+- Public APIs (e.g., [Twitter](https://dev.twitter.com/rest/public))
+- Document-oriented NoSQL databases (e.g., [MongoDB](https://www.mongodb.org/))
+- Flexible JSON columns in relational databases (e.g., [PostgreSQL](http://www.postgresql.org/docs/9.4/static/datatype-json.html))
+
+As such, R users are increasingly faced with JSON data sets, and need easy and
+reliable ways to turn those data sets into data.frames for analysis or modeling.
+
+There are already several libraries for working with JSON data in R, such as
+[rjson](http://cran.r-project.org/web/packages/rjson/index.html),
+[RJSONIO](http://cran.r-project.org/web/packages/RJSONIO/index.html) and
+[jsonlite](http://cran.r-project.org/web/packages/jsonlite/index.html). Using
+these libraries, you can transform JSON into a nested R list. However, working
+with nested lists using base R functionality is difficult.
+
+The jsonlite package goes further by automatically creating a nested R data.frame.
+This is easier to work with than a list, but has two main limitations. First, the
+resulting data.frame isn't [tidy](http://vita.had.co.nz/papers/tidy-data.pdf),
+and so it can still be difficult to work with. Second, the structure of the
+data.frame may vary as the JSON sample changes, which can happen any time you
+change the database query or API call that generated the data.
+
+The tidyjson package takes an alternative approach to structuring JSON data into tidy
+data.frames. Similar to [tidyr](http://cran.r-project.org/web/packages/tidyr/index.html), tidyjson builds
+a grammar for manipulating JSON into a tidy table structure. The package is based
+on the following principles:
+
+- Leverage other libraries for efficiently parsing JSON ([jsonlite](http://cran.r-project.org/web/packages/jsonlite/index.html))
+- Integrate with pipelines built on [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html)
+and the [magrittr](http://cran.r-project.org/web/packages/magrittr/index.html) `%>%` operator
+- Turn arbitrarily complex and nested JSON into tidy data.frames that can be joined later
+- Guarantee a deterministic data.frame column structure
+- Naturally handle 'ragged' arrays and / or objects (varying lengths by document)
+- Allow for extraction of data in values *or* key names
+- Ensure edge cases are handled correctly (especially empty data)
+
+## A Simple Example
+
+A simple example of how tidyjson works is as follows:
+
+```{r, message = FALSE}
+library(tidyjson) # for as.tbl_json() and the tidyjson verbs
+library(dplyr) # for %>% and other dplyr functions
+
+# Define a simple JSON array of people
+people <- '[{"name": "bob", "age": 32}, {"name": "susan", "age": 54}]'
+
+# Structure the data
+people %>% # Use the %>% pipe operator to pass json through a pipeline
+  as.tbl_json %>% # Parse the JSON and set up a 'tbl_json' object
+  gather_array %>% # Gather (stack) the array by index
+  spread_values( # Spread (widen) values to widen the data.frame
+    user.name = jstring("name"), # Extract the "name" object as a character column "user.name"
+    user.age = jnumber("age") # Extract the "age" object as a numeric column "user.age"
+  )
+```
+
+In such a simple example, we can use `fromJSON` in the jsonlite package to do
+this much faster:
+
+```{r}
+library(jsonlite)
+jsonlite::fromJSON(people)
+```
+
+However, if the structure of the data changed, so would the output from `fromJSON`.
+So even in this simple example there is value in the explicit structure defined
+in the tidyjson pipeline above.
+
+## A Complex Example
+
+The tidyjson package really shines in a more complex example. Consider the
+following JSON, which describes three purchases of five items made by two
+individuals:
+
+```{r}
+purch_json <- '
+[
+  {
+    "name": "bob",
+    "purchases": [
+      {
+        "date": "2014/09/13",
+        "items": [
+          {"name": "shoes", "price": 187},
+          {"name": "belt", "price": 35}
+        ]
+      }
+    ]
+  },
+  {
+    "name": "susan",
+    "purchases": [
+      {
+        "date": "2014/10/01",
+        "items": [
+          {"name": "dress", "price": 58},
+          {"name": "bag", "price": 118}
+        ]
+      },
+      {
+        "date": "2015/01/03",
+        "items": [
+          {"name": "shoes", "price": 115}
+        ]
+      }
+    ]
+  }
+]'
+```
+
+Suppose we want to find out how much each person has spent.
+
+Using jsonlite, we can parse the JSON:
+
+```{r}
+library(jsonlite)
+# Parse the JSON into a data.frame
+purch_df <- jsonlite::fromJSON(purch_json)
+# Examine results
+purch_df
+```
+
+However, the resulting data structure is a complex nested data.frame:
+
+```{r}
+str(purch_df)
+```
+
+This is difficult to work with, and we end up writing code like this:
+
+```{r}
+lapply(lapply(purch_df$purchases, `[[`, "items"), lapply, `[[`, "price")
+```
+
+Reasoning about code like this is nearly impossible, and further, the relational
+structure of the data is lost (we no longer have the name of the user).
+
+Using tidyjson, we can build a pipeline to turn this JSON into a tidy data.frame
+where each row corresponds to a purchased item:
+
+```{r}
+purch_items <- purch_json %>%
+  as.tbl_json %>% gather_array %>%
+  spread_values(person = jstring("name")) %>%
+  enter_object("purchases") %>% gather_array %>%
+  spread_values(purchase.date = jstring("date")) %>%
+  enter_object("items") %>% gather_array %>%
+  spread_values(
+    item.name = jstring("name"),
+    item.price = jnumber("price")
+  ) %>%
+  select(person, purchase.date, item.name, item.price)
+```
+
+The resulting data.frame is exactly what we want:
+
+```{r}
+purch_items
+```
+
+And we can easily continue the pipeline in dplyr to compute derived data:
+
+```{r}
+purch_items %>% group_by(person) %>% summarize(spend = sum(item.price))
+```
+
+## Data
+
+The tidyjson package comes with several JSON example datasets:
+
+* `commits`: commit data for the dplyr repo from the GitHub API
+* `issues`: issue data for the dplyr repo from the GitHub API
+* `worldbank`: World Bank funded projects from
+[jsonstudio](http://jsonstudio.com/resources/)
+* `companies`: startup company data from
+[jsonstudio](http://jsonstudio.com/resources/)
+
+Each dataset has some example tidyjson queries in `help(commits)`,
+`help(issues)`, `help(worldbank)` and `help(companies)`.
+
+## JSON
+
+(TODO: Need to describe JSON more here).
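+
+Until this section is fleshed out, here is a minimal sketch of the JSON value
+types listed at [json.org](http://json.org/): objects, arrays, strings,
+numbers, `true` / `false` and `null`. It relies only on `jsonlite::fromJSON()`
+and `jsonlite::prettify()`; the sample documents are made up for illustration.
+
+```{r}
+library(jsonlite)
+
+# One made-up document per JSON value type
+docs <- c(
+  object  = '{"name": "bob", "age": 32}',
+  array   = '[1, 2, 3]',
+  string  = '"bob"',
+  number  = '32',
+  logical = 'true',
+  null    = 'null'
+)
+
+# Parse each document without simplification, so that objects and arrays
+# both come back as R lists
+parsed <- lapply(docs, jsonlite::fromJSON, simplifyVector = FALSE)
+str(parsed)
+
+# Objects and arrays can nest inside one another
+nested <- '{"name": "bob", "children": [{"name": "sally"}, {"name": "george"}]}'
+jsonlite::prettify(nested)
+```
+
+Nesting of this kind is what the verbs described below are designed to unpack.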
+
+### Create a `tbl_json` object
+
+The first step in using tidyjson is to convert your JSON into a `tbl_json` object.
+Almost every function in tidyjson accepts a `tbl_json` object as its first
+parameter, and returns a `tbl_json` object for downstream use. `tbl_json`
+inherits from `dplyr::tbl`.
+
+A `tbl_json` object consists of a `data.frame` with an additional attribute,
+`JSON`, that contains a list of JSON data of the same length as the number of
+rows in the `data.frame`. Each row of data in the `data.frame` corresponds to the
+JSON found in the same index of the `JSON` attribute.
+
+The easiest way to construct a `tbl_json` object is directly from a character
+string or vector.
+
+```{r}
+# Will return a 1 row data.frame with a length 1 JSON attribute
+'{"key": "value"}' %>% as.tbl_json
+
+# Will still return a 1 row data.frame with a length 1 JSON attribute as
+# the character string is of length 1 (even though it contains a JSON array of
+# length 2)
+'[{"key1": "value1"}, {"key2": "value2"}]' %>% as.tbl_json
+
+# Will return a 2 row data.frame with a length 2 JSON attribute
+c('{"key1": "value1"}', '{"key2": "value2"}') %>% as.tbl_json
+```
+
+Behind the scenes, `as.tbl_json()` parses the JSON strings and creates a
+data.frame with 1 column, `document.id`, which keeps track of the character
+vector position (index) where the JSON data came from.
+
+TODO: Need to show how to create one from a data.frame
+TODO: Also need to talk about JSON lines format
+
+## Verbs
+
+The rest of tidyjson consists of various verbs which operate on `tbl_json`
+objects and return `tbl_json` objects. They are meant to be used in a pipeline
+with the `%>%` operator.
+
+Note that these verbs all operate on *both* the underlying data.frame and the
+JSON, iteratively moving data from the JSON into the data.frame. Any
+modifications of the underlying data.frame outside of these operations
+may produce unintended consequences where the data.frame and JSON become out of
+sync.
+
+The following table provides a reference of how each verb is used and what
+(if any) effect it has on the data.frame rows and columns and on the associated
+JSON.
+
+| Verb                | JSON   | Arguments       | Row Effect        | Column Effect    | JSON Effect    |
+|:--------------------|:-------|:----------------|:------------------|:-----------------|:---------------|
+| `enter_object()`    | object | ... = key path  | Drops without key | none             | object value   |
+| `json_types()`      | any    | column.name     | none              | type column      | none           |
+| `gather_array()`    | array  | column.name     | Duplicates rows   | index column     | array values   |
+| `gather_keys()`     | object | column.name     | Duplicates rows   | key column       | object values  |
+| `spread_values()`   | object | ... = columns   | none              | N value columns  | none           |
+| `append_values_X()` | scalar | column.name     | none              | column of type X | none           |
+
+### Identify JSON structure with `json_types()`
+
+One of the first steps you will want to take is to investigate the structure of
+your JSON data. The function `json_types()` inspects the JSON associated with
+each row of the data.frame, and adds a new column (`type` by default) that
+identifies the type according to the [JSON standard](http://json.org/).
+
+```{r}
+types <- c('{"a": 1}', '[1, 2]', '"a"', '1', 'true', 'null') %>% as.tbl_json %>%
+  json_types
+types$type
+```
+
+This is particularly useful for inspecting your JSON data types, and can be added
+after `gather_array()` (or `gather_keys()`) to inspect the types of the elements
+(or values) in arrays (or objects).
+
+### Stack a JSON array with `gather_array()`
+
+Arrays are sometimes vectors (fixed or varying length integer, character or
+logical vectors). But they also often contain lists of other objects (like
+a list of purchases for a user). The function `gather_array()` takes JSON arrays
+and duplicates the rows in the data.frame to correspond to the indices of the
+array, and puts the elements of the array into the JSON attribute.
+This is equivalent to "stacking" the array in the data.frame, and lets you
+continue to manipulate the remaining JSON in the elements of the array.
+
+```{r}
+'[1, "a", {"k": "v"}]' %>% as.tbl_json %>% gather_array %>% json_types
+```
+
+This allows you to *enter into* an array and begin processing its elements
+with other tidyjson functions. It retains the array.index in case the relative
+position of elements in the array is useful information.
+
+### Stack a "key": object with `gather_keys()`
+
+Similar to `gather_array()`, `gather_keys()` takes JSON objects and duplicates
+the rows in the data.frame to correspond to the keys of the object, and puts the
+values of the object into the JSON attribute.
+
+```{r}
+'{"name": "bob", "age": 32}' %>% as.tbl_json %>% gather_keys %>% json_types
+```
+
+This allows you to *enter into* the keys of the objects just like `gather_array()`
+lets you enter elements of the array.
+
+### Create new columns with JSON values with `spread_values()`
+
+Adding new columns to your `data.frame` is accomplished with `spread_values()`,
+which lets you dive into (potentially nested) JSON objects and extract specific
+values. `spread_values()` takes `jstring()`, `jnumber()` or `jlogical()`
+function calls as arguments in order to specify the type of the data that should
+be captured at each desired key location.
+
+These values can be of varying types at varying depths, e.g.,
+
+```{r}
+'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
+  as.tbl_json %>%
+  spread_values(
+    first.name = jstring("name", "first"),
+    age = jnumber("age")
+  )
+```
+
+### Stack all values of a specified type with `append_values_X()`
+
+The `append_values_X()` functions let you take the remaining JSON and add it as
+a column X (for X in "string", "number", "logical") insofar as it is of the
+JSON type specified. For example:
+
+```{r}
+'{"first": "bob", "last": "jones"}' %>% as.tbl_json %>%
+  gather_keys() %>% append_values_string()
+```
+
+Any values that do not conform to the type specified will be NA in the resulting
+column. This includes other scalar types (e.g., numbers or logicals if you are
+using `append_values_string()`) and *also* any rows where the JSON is still an
+object or an array.
+
+### Dive into a specific object "key" with `enter_object()`
+
+For complex JSON structures, you will often need to navigate into nested objects
+in order to continue structuring your data. The function `enter_object()` lets
+you dive into a specific object key in the JSON attribute, so that all further
+tidyjson calls happen inside that object (all other JSON data outside the object
+is discarded). If the object doesn't exist for a given row / index, then that
+data.frame row will be discarded.
+
+```{r}
+c('{"name": "bob", "children": ["sally", "george"]}', '{"name": "anne"}') %>%
+  as.tbl_json %>% spread_values(parent.name = jstring("name")) %>%
+  enter_object("children") %>%
+  gather_array %>% append_values_string("children")
+```
+
+This is useful when you want to limit your data to just information found in
+a specific key.
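+
+As a sketch of diving more than one level deep, the example below assumes that
+`enter_object()` accepts a key path in the same way `jstring()` does (as the
+"... = key path" entry in the verb table suggests). The document and the
+column names `coordinate` and `value` are made up for illustration.
+
+```{r}
+library(tidyjson)
+library(dplyr)
+
+# A made-up document with an object nested two levels down
+geo_json <- '{"name": "bob", "address": {"geo": {"lat": 40.7, "lon": -74.0}}}'
+
+geo <- geo_json %>%
+  as.tbl_json %>%
+  spread_values(name = jstring("name")) %>% # Keep the top-level identifier
+  enter_object("address", "geo") %>% # Follow the assumed key path address -> geo
+  gather_keys("coordinate") %>% # One row per key in the geo object
+  append_values_number("value") # Capture each numeric value
+geo
+```
+
+Because `name` is spread before `enter_object()` is called, it is kept on
+every row even though the JSON outside `geo` is discarded.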
+
+## Strategies
+
+When beginning to work with JSON data, you often don't have easy access to a
+schema describing what is in the JSON. One of the benefits of document-oriented
+data structures is that they let developers create data without having to worry
+about defining the schema explicitly.
+
+Thus, the first step is usually to understand the structure of the JSON. A good
+way to start is to look at individual records with `jsonlite::prettify()`:
+
+```{r, message = FALSE}
+library(jsonlite)
+prettify('{"key": "value", "array": [1, 2, 3]}') # a small stand-in record
+```
+
+Examining various random records can begin to give you a sense of what the JSON
+contains and how it is structured. However, keep in mind that in many cases
+documents that are missing data (either unknown or irrelevant) may omit the
+entire JSON structure.
+
+Next, you can begin working with the data in R.
+
+```{r}
+# assuming documents are newline delimited (one per line), otherwise use readChar
+# json <- readLines(file.json) # TODO: Need to change this
+
+# Inspect the types of objects
+# json %>% json_types %>% table
+```
+
+Then, if you want to work with a single row of data for each JSON object, use
+`spread_values()` to get at (potentially nested) key-value pairs.
+
+If all you care about is data from a certain sub-object, then use `enter_object()`
+to dive into that object directly. Make sure you first use `spread_values()` to
+capture any top level identifiers you might need for analytics, summarization or
+relational uses downstream.
+
+If you want to access arrays, use `gather_array()` to stack their elements, and
+then proceed as though you had separate documents. (Again, first spread any
+top-level keys you need.)
+
+Finally, if you have data where information is encoded in both keys and values,
+then consider using `gather_keys()` and `append_values_X()` where `X` is the type
+of JSON scalar data you expect in the values.
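+
+The exploration steps above can be sketched on a small made-up sample. The
+documents and the column name `key` below are illustrative assumptions; only
+verbs introduced earlier in this vignette are used.
+
+```{r}
+library(tidyjson)
+library(dplyr)
+
+# Two made-up documents, one of which is missing the "age" key
+docs <- c(
+  '{"name": "bob", "age": 32, "tags": ["a", "b"]}',
+  '{"name": "anne", "tags": ["c"]}'
+)
+
+# What type is each document at the top level?
+top <- docs %>% as.tbl_json %>% json_types
+table(top$type)
+
+# Which keys appear, and what type of value does each hold?
+keys <- docs %>% as.tbl_json %>% gather_keys("key") %>% json_types
+table(keys$key, keys$type)
+```
+
+A cross-tabulation like this shows which keys are optional and which hold
+nested structures before you commit to specific `spread_values()` calls.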
+
+It's important to remember that any of the above can be combined together
+iteratively to do some fairly complex data extraction. For example:
+
+```{r}
+json <- '{
+  "name": "bob",
+  "shopping cart":
+    [
+      {
+        "date": "2014-04-02",
+        "basket": {"books": 2, "shirts": 0}
+      },
+      {
+        "date": "2014-08-23",
+        "basket": {"books": 1}
+      }
+    ]
+}'
+json %>% as.tbl_json %>%
+  spread_values(customer = jstring("name")) %>% # Keep the customer name
+  enter_object("shopping cart") %>% # Look at their cart
+  gather_array %>% # Expand the data.frame and dive into each array element
+  spread_values(date = jstring("date")) %>% # Keep the date of the cart
+  enter_object("basket") %>% # Look at their basket
+  gather_keys("product") %>% # Expand the data.frame for each product and capture its name
+  append_values_number("quantity") # Capture the values as the quantity
+```
+
+Note that there are often situations where there are multiple arrays or objects
+of differing types that exist at the same level of the JSON hierarchy. In this
+case, you need to use `enter_object()` to enter each of them in *separate*
+pipelines to create *separate* data.frames that can then be joined
+relationally.
+
+Finally, don't forget that once you are done with your JSON tidying, you can
+use [dplyr](http://github.com/hadley/dplyr) to continue manipulating the
+resulting data at your leisure!
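+
+As a sketch of that last step, the shopping cart pipeline can be handed
+straight to dplyr, mirroring the earlier purchases example. The names
+`cart_json` and `cart` and the summary column `items` are made up for
+illustration.
+
+```{r}
+library(tidyjson)
+library(dplyr)
+
+# A condensed copy of the shopping cart document, so the chunk is self-contained
+cart_json <- '{"name": "bob", "shopping cart": [
+  {"date": "2014-04-02", "basket": {"books": 2, "shirts": 0}},
+  {"date": "2014-08-23", "basket": {"books": 1}}
+]}'
+
+cart <- cart_json %>% as.tbl_json %>%
+  spread_values(customer = jstring("name")) %>%
+  enter_object("shopping cart") %>% gather_array %>%
+  spread_values(date = jstring("date")) %>%
+  enter_object("basket") %>% gather_keys("product") %>%
+  append_values_number("quantity") %>%
+  select(customer, date, product, quantity)
+
+# Total quantity of each product per customer
+cart %>% group_by(customer, product) %>% summarize(items = sum(quantity))
+```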