improve visualizations and text #4
Jeremy committed Aug 26, 2016
1 parent 8c99c24 commit 09c589e
Showing 1 changed file with 206 additions and 56 deletions.
262 changes: 206 additions & 56 deletions vignettes/visualizing-json.Rmd
@@ -1,5 +1,5 @@
---
title: "Visualizing JSON"
title: "Visualizing JSON Schema"
author: "Jeremy Stanley"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
@@ -14,37 +14,128 @@ knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height
options(dplyr.print_min = 4L, dplyr.print_max = 4L)
```

JSON is a very simple data standard that, through nested data structures, can
represent incredibly complex datasets. In some cases, a set of JSON data
closely corresponds to a table in a SQL database. More commonly, however, a
JSON document maps more closely to an entire SQL database.

Understanding the structure of your JSON data is critical before you begin
analyzing the data. In this vignette, we use `tidyjson` to inspect the
structure of JSON data and then create various visualizations to help
understand a complex JSON dataset.

## JSON Definition

For a refresher, see the [JSON specification](http://www.json.org/), which is
a very concise summary of how JSON is formatted. In essence, there are
three types of JSON data structures.

An object is an unordered set of name/value pairs, like `'{"string": "value"}'`:

![](http://www.json.org/object.gif)

An array is an ordered list, like `'[1, 2, 3]'`:

![](http://www.json.org/array.gif)

A value is a string, number, logical or null scalar (or, recursively, an
object or array):

![](http://www.json.org/value.gif)

## Load required libraries

Before we start, let's load `tidyjson` along with other data manipulation and
visualization libraries, and set a seed so we get consistent results.

```{r, message = FALSE}
library(tidyjson) # this library
library(dplyr) # for %>% and other dplyr functions
library(ggplot2) # for plotting
library(igraph) # for graph visualizations
library(RColorBrewer) # for colors
library(jsonlite) # for fromJSON
library(purrr) # for list operations
library(wordcloud)
library(magrittr)
library(needs)
needs(tidyjson, jsonlite, dplyr, purrr, magrittr,
ggplot2, igraph, RColorBrewer, wordcloud, viridis)
set.seed(1)
```

## Companies Data

Let's work with a sample of the companies data
Let's work with the `companies` dataset included in the `tidyjson` package,
originating at [jsonstudio](http://jsonstudio.com/resources). It is a
`r class(companies)` vector of
`r length(companies) %>% format(big.mark = ',')`
JSON strings, each describing a startup company.

First, let's convert the JSON to a nested list using `jsonlite::fromJSON`, where
we use `simplifyVector = FALSE` to avoid any simplification (which, while handy,
can lead to inconsistent results across documents that may not have the same
set of objects).

```{r}
set.seed(1)
samp_co <- companies[sample(1:length(companies), 50)]
co_list <- companies %>% map(fromJSON, simplifyVector = FALSE)
```
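
As a minimal sketch of why simplification is turned off, the following compares
the two modes on a small hand-written document (the `doc` and `records` strings
are made up for illustration and are not part of `companies`):

```r
library(jsonlite)  # for fromJSON

doc <- '{"name": "Acme", "tags": ["a", "b", "c"]}'

simplified   <- fromJSON(doc)                         # default: simplifyVector = TRUE
unsimplified <- fromJSON(doc, simplifyVector = FALSE)

is.atomic(simplified$tags)   # the array became an atomic character vector
is.list(unsimplified$tags)   # the array stays a list of three scalars

# Simplification depends on the content of each document: an array of
# records with matching fields is turned into a data.frame
records <- fromJSON('[{"x": 1}, {"x": 2}]')
is.data.frame(records)
```

With `simplifyVector = FALSE` every object and array comes back as a plain
list, so all documents share the same kind of structure regardless of content.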

We can see the structure of a sample record with `str`
We can then find out how complex each record is by recursively unlisting it
and computing the length:

```{r}
str(fromJSON(samp_co[[1]]))
co_length <- co_list %>% map(unlist, recursive = TRUE) %>% map_int(length)
```
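
To make this complexity measure concrete, here is a small sketch on a
hand-built nested list (the `toy` record is an invented example, shaped like
what `fromJSON` returns without simplification):

```r
# A made-up record shaped like the output of fromJSON(..., simplifyVector = FALSE)
toy <- list(
  name = "Acme",
  tags = list("a", "b"),
  office = list(city = "Springfield", geo = list(lat = 1.5, lon = -2.5)),
  products = list()  # an empty array contributes no leaves
)

leaves <- unlist(toy, recursive = TRUE)
names(leaves)   # nested names are joined with "."
length(leaves)  # 1 name + 2 tags + 1 city + 2 geo coordinates = 6
```

Note that the measure counts only scalar leaves, so empty arrays and objects
add nothing to the length.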

Then we can visualize the distribution of lengths on a log-scale:

```{r}
co_length %>%
data_frame(length = .) %>%
ggplot(aes(length)) +
geom_density() +
scale_x_log10() +
annotation_logticks(side = 'b')
```

It appears that some companies have an unlisted length of less than 10, while
others are in the hundreds or even thousands. The median is `r median(co_length)`.

Let's pick an example that is particularly small to start with:

```{r}
first_examp_index <- co_length %>%
detect_index(equals, 20L)
co_examp <- companies[first_examp_index]
co_examp
```

Even for such a small example it's hard to understand the structure from the
raw JSON. We can instead use `jsonlite::prettify` to print a prettier version:

```{r}
co_examp %>%
prettify(indent = 2) %>%
capture.output %>%
paste(collapse = "\n") %>%
gsub("\\[\n\n( )*\\]", "[ ]", .) %>%
writeLines
```

Everything after `prettify` is there to collapse empty arrays, of which there
are many, so that each occupies a single line rather than several.
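
The effect of the `gsub` step can be seen in isolation on a hand-written
string. The `pretty` value below is an assumption: it imitates how a
prettified empty array is spread over several lines, rather than being actual
`prettify` output.

```r
# Hand-written stand-in for prettified JSON containing an empty array
pretty <- "{\n  \"name\": \"Acme\",\n  \"products\": [\n\n  ]\n}"

# "[", two newlines, any number of spaces, "]"  ->  "[ ]"
collapsed <- gsub("\\[\n\n( )*\\]", "[ ]", pretty)

writeLines(collapsed)
```

The pattern only matches brackets separated by nothing but two newlines and
spaces, so non-empty arrays are left untouched.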

Alternatively, we can visualize the same object after we converted it into
an R list using `str`:

```{r}
str(co_list[first_examp_index])
```

Finally, we can compute the structure using `tidyjson::json_structure`:

```{r}
co_examp %>% json_structure %>% select(-document.id)
```

This gives us a `data.frame` where each row corresponds to an object, array
or scalar in the JSON document.
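
The idea of one row per node can be sketched in base R. The `walk_nodes`
helper below is hypothetical and is not how `json_structure` is implemented;
it assumes the document has already been parsed into a nested list in which
objects are named lists and arrays are unnamed lists.

```r
# Hypothetical helper: one row per object, array or scalar in a nested list
walk_nodes <- function(x, key = NA_character_, level = 0L) {
  type <- if (!is.list(x)) "scalar" else if (!is.null(names(x))) "object" else "array"
  row <- data.frame(level = level, key = key, type = type,
                    stringsAsFactors = FALSE)
  if (!is.list(x) || length(x) == 0) return(row)  # leaf or empty container
  child_keys <- if (is.null(names(x))) rep(NA_character_, length(x)) else names(x)
  children <- Map(walk_nodes, x, child_keys, level + 1L)
  do.call(rbind, c(list(row), unname(children)))
}

# A made-up document: an object holding a scalar and an array of two scalars
toy <- list(name = "Acme", tags = list("a", "b"))
nodes <- walk_nodes(toy)
nodes
```

Here the root object, the `name` scalar, the `tags` array and its two elements
each get their own row, for five rows in total.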

## Working with many companies

```{r}
structure <- companies %>% json_structure
@@ -62,101 +153,160 @@ As a word cloud
keys %$% wordcloud(key, ndoc, scale = c(1.5, .1), min.freq = 100)
```

In ggplot2
Alternatively in ggplot2

```{r, fig.height = 12}
```{r, fig.height = 9}
keys %>%
ggplot(aes(ndoc, key, colour = type)) +
geom_segment(aes(xend = 0, yend = key)) +
facet_grid(level ~ ., scale = "free_y", space = "free", switch = 'y')
ungroup %>%
group_by(type) %>%
arrange(desc(ndoc), level) %>%
mutate(rank = 1:n()) %>%
ggplot(aes(1, rank)) +
geom_text(aes(label = key, color = ndoc)) +
scale_y_reverse() +
facet_grid(. ~ type) +
theme_void() +
theme(legend.position = "bottom") +
scale_color_viridis(direction = -1)
```

## Visualizing as Graphs

Plot as a network graph

```{r}
plot_structure_graph <- function(json, legend = TRUE) {
plot_structure_graph <- function(json, legend = TRUE, vertex.size = 4,
edge.width = 2, show.labels = TRUE) {
structure <- json %>% json_structure
type_colors <- brewer.pal(6, "Accent")
graph_edges <- structure %>%
filter(!is.na(parent.id)) %>%
select(parent.id, child.id)
graph_vertices <- structure %>%
transmute(child.id,
vertex.color = type_colors[as.integer(type)],
vertex.label = key)
if (!show.labels)
graph_vertices$vertex.label <- rep(NA_character_, nrow(graph_vertices))
g <- graph_from_data_frame(
structure %>%
filter(!is.na(parent.id)) %>%
select(parent.id, child.id),
directed = TRUE,
vertices = structure %>%
transmute(child.id,
vertex.color = type_colors[as.integer(type)],
vertex.label = ifelse(type %in% c("object", "array") & length > 0,
key, NA_character_)))
g <- graph_from_data_frame(graph_edges, vertices = graph_vertices,
directed = FALSE)
op <- par(mar = c(0, 0, 0, 0))
plot(g, edge.arrow.size = .1, vertex.color = V(g)$vertex.color, vertex.size = 4,
vertex.label = V(g)$vertex.label, layout = layout_with_kk,
edge.color = 'grey70', edge.width = 2)
plot(g,
vertex.color = V(g)$vertex.color,
vertex.size = vertex.size,
vertex.label = V(g)$vertex.label,
vertex.frame.color = NA,
layout = layout_with_kk,
edge.color = 'grey70',
edge.width = edge.width)
if (legend)
legend(x = -1.3, y = -.6, levels(structure$type), pch = 21,
col="#777777", pt.bg = type_colors,
col= "white", pt.bg = type_colors,
pt.cex = 2, cex = .8, bty = "n", ncol = 1)
par(op)
NULL
invisible(NULL)
}
```

Plot a single company

```{r}
samp_co[[1]] %>% plot_structure_graph
co_examp %>% plot_structure_graph
```

A lot of variety

```{r}
nrow <- 3
ncol <- 4
```{r, fig.height = 8}
nrow <- 7
ncol <- 6
op <- par(mfrow = c(nrow, ncol))
walk(samp_co[1:(nrow*ncol)], plot_structure_graph, legend = FALSE)
walk(companies[1:(nrow*ncol)], plot_structure_graph, legend = FALSE,
edge.width = .5, show.labels = FALSE)
par(op)
```

The most complex

```{r}
lengths <- companies %>%
map(fromJSON, simplifyVector = FALSE) %>%
map(unlist, recursive = TRUE) %>%
map_int(length)
most_complex <- companies[which(lengths == max(lengths))]
most_complex <- companies[which(co_length == max(co_length))]
most_complex %>% spread_values(name = jstring("name")) %>% extract2("name")
```

Let's try to plot it!

```{r}
plot_structure_graph(most_complex)
plot_structure_graph(most_complex, vertex.size = 2, edge.width = 1,
show.labels = FALSE)
```

That is just too big, let's simplify things
That is just too big, so let's simplify things by looking only at the
top-level objects

```{r, fig.height = 8}
sub_objects <- most_complex %>%
```{r}
objects <- most_complex %>%
gather_keys %>%
json_types %>%
filter(type == "object")
nrow <- 2
ncol <- 2
op <- par(mfrow = c(nrow, ncol))
for(i in 1:nrow(objects)) {
plot_structure_graph(objects[i, ], legend = FALSE,
vertex.size = 2, edge.width = 1)
title(objects$key[i], col.main = 'red')
}
par(op)
```

Now let's look at just the arrays, and take only the longest for each

```{r}
arrays <- most_complex %>%
gather_keys %>%
json_types %>%
json_lengths %>%
filter(length > 1)
filter(type == "array") %>%
gather_array
nrow <- 5
arrays %>%
ggplot(aes(key)) +
geom_bar() +
coord_flip()
```

Many are very long and likely have similar structures (but no guarantees
in JSON!), so let's just look at the first element of each:

```{r, fig.height = 9}
arrays %<>% filter(array.index == 1)
nrow <- 4
ncol <- 3
op <- par(mfrow = c(nrow, ncol))
for(i in 1:nrow(sub_objects)) {
plot_structure_graph(sub_objects[i, ], legend = FALSE)
title(sub_objects$key[i], col.main = 'red')
for(i in 1:nrow(arrays)) {
plot_structure_graph(arrays[i, ], legend = FALSE)
title(arrays$key[i], col.main = 'red')
}
par(op)
```

TODO:

1. Start structure levels at 0 (more intuitive for root to be level 0)
1. Consider adding a "complexity" function to tidyjson which unlists recursively
and takes the length of the resulting JSON
1. Is there some way to create a schema?

