From 41df4173e3a01a1179c0bfb99056f81757891d7e Mon Sep 17 00:00:00 2001
From: Jeremy Stanley
Date: Sun, 12 Apr 2015 09:31:42 -0400
Subject: [PATCH] #29 add json_lengths and resolve todos

---
 vignettes/introduction-to-tidyjson.Rmd | 67 ++++++++++++++++----------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/vignettes/introduction-to-tidyjson.Rmd b/vignettes/introduction-to-tidyjson.Rmd
index cb03af4..e2356f4 100644
--- a/vignettes/introduction-to-tidyjson.Rmd
+++ b/vignettes/introduction-to-tidyjson.Rmd
@@ -11,6 +11,7 @@ vignette: >
 
 ```{r, echo = FALSE}
 knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
+options(dplyr.print_min = 4L, dplyr.print_max = 4L)
 ```
 
 [JSON](http://json.org/) (JavaScript Object Notation) is a lightweight and
@@ -212,9 +213,10 @@ purch_items %>% group_by(person) %>% summarize(spend = sum(item.price))
 ### Creating a `tbl_json` object
 
 The first step in using tidyjson is to convert your JSON into a `tbl_json` object.
-Almost every function in tidyjson accepts a `tbl_json` object as it's first
-parameter, and returns a `tbl_json` object for downstream use. `tbl_json`
-inherits from `dplyr::tbl`.
+Almost every function in tidyjson accepts either a `tbl_json` object or a character
+vector of JSON data as its first parameter, and returns a `tbl_json` object for
+downstream use. To facilitate integration with dplyr, `tbl_json` inherits from
+`dplyr::tbl`.
 
 The easiest way to construct a `tbl_json` object is directly from a character
 string:
@@ -229,7 +231,7 @@ attr(x, "JSON")
 Behind the scenes, `as.tbl_json` is parsing the JSON string and creating a
 data.frame with 1 column, `document.id`, which keeps track of the character
 vector position (index) where the JSON data came from. In addition, each
-`tbl_json` object has an additional attribute, `JSON`, that contains a list of
+`tbl_json` object has a "JSON" attribute that contains a list of
 JSON data of the same length as the number of rows in the `data.frame`.
 
 Often times you will have many lines of JSON data that you want to work with,
@@ -249,12 +251,22 @@ of the character vector. We can see the underlying parsed JSON:
 ```{r}
 attr(y, "JSON")
 ```
-TODO:
+If your JSON data is already embedded in a data.frame, then you will need
+to call `as.tbl_json` directly in order to specify which column contains
+the JSON data. Note that the JSON in the data.frame should be character data,
+and not a factor. Use `stringsAsFactors = FALSE` in constructing the data.frame
+to avoid turning the JSON into a factor.
 
-* Describe preservation of JSON under various operations ([, filter, etc.)
-* Add sections on files, data.frames
-* Show a table of methods for tbl_json
-* Explain that you don't have to call as.tbl_json with verbs
+```{r}
+df <- data.frame(
+  x = 1:2,
+  JSON = c('{"key1": "value1"}', '{"key2": "value2"}'),
+  stringsAsFactors = FALSE
+)
+z <- df %>% as.tbl_json(json.column = "JSON")
+z
+attr(z, "JSON")
+```
 
 ### JSON included in the package
 
@@ -286,18 +298,19 @@ The following table provides a reference of how each verb is used and what
 (if any) effect it has on the data.frame rows and columns and on the associated
 JSON.
 
-| Verb | JSON | Arguments | Row Effect | Column Effect | JSON Effect |
-|:--------------------|:-------|:----------------|:------------------|:-----------------|:---------------|
-| `enter_object()` | object | ... = key path | Drops without key | none | object value |
-| `json_types()` | any | column.name | Duplicates rows | type column | object keys |
-| `gather_array()` | array | column.name | Duplicates rows | index column | array values |
-| `gather_keys()` | object | column.name | Duplicates rows | key column | object values |
-| `spread_values()` | object | ... = columns | none | N value columns | none |
-| `append_values_X()` | scalar | column.name | none | column of type X | none |
+| Verb | JSON | Arguments | Row Effect | Column Effect | JSON Effect |
+|:--------------------|:-------|:----------------|:------------------|:-----------------|:--------------------|
+| `enter_object()` | object | ... = key path | Drops without key | none | enter object value |
+| `json_types()` | any | column.name | none | type column | none |
+| `gather_array()` | array | column.name | Duplicates rows | index column | enter array values |
+| `gather_keys()` | object | column.name | Duplicates rows | key column | enter object values |
+| `spread_values()` | object | ... = columns | none | N value columns | none |
+| `append_values_X()` | scalar | column.name | none | column of type X | none |
+| `json_lengths()` | any | column.name | none | length column | none |
 
 TODO:
 
-* Add `json_lengths()` here and below
+* Add `json_lengths()` below
 * Length descriptions above
 * Re-order below and above to be consistent
 
@@ -309,9 +322,7 @@ each row of the data.frame, and adds a new column (`type` by default) that
 identifies the type according to the [JSON standard](http://json.org/).
 
 ```{r}
-types <- c('{"a": 1}', '[1, 2]', '"a"', '1', 'true', 'null') %>%
-  json_types
-types$type
+c('{"a": 1}', '[1, 2]', '"a"', '1', 'true', 'null') %>% json_types
 ```
 
 This is particularly useful for inspecting your JSON data types, and can added
@@ -404,6 +415,15 @@ c('{"name": "bob", "children": ["sally", "george"]}', '{"name": "anne"}') %>%
 This is useful when you want to limit your data to just information found in a
 specific key.
 
+### Identify length of JSON objects with `json_lengths()`
+
+When investigating JSON data it can be helpful to identify the lengths of the
+JSON objects or arrays, especially when they are 'ragged' across documents:
+
+```{r}
+c('[1, 2, 3]', '{"k1": 1, "k2": 2}', '1') %>% json_lengths
+```
+
 ## A real example
 
 Included in the tidyjson package is a `r length(worldbank)` record sample,
@@ -461,7 +481,7 @@ summary(amts$total.m)
 Many are 0, the mean is $80m and the max is over $1bn.
 
 Let's now aggregate by the sector and compute, on a dollar weighted basis,
-where the money is going by sector
+where the money is going by sector:
 
 ```{r}
 amts %>%
@@ -474,9 +494,6 @@ amts %>%
   arrange(desc(spend.dist))
 ```
 
-It looks like in this sample of projects, "Information and Communication" is
-really low on the worldbank priority list!
-
 ## Strategies
 
 When beginning to work with JSON data, you often don't have easy access to a