[WIP] Crossfilter discussion #1316
Comments
👏 a fantastic tour of the Javascript data flow landscape, along with some very useful |
I feel like the core of all this data flow goodness is topological sorting. If only there were a way to declaratively specify functional reactive dependencies, and have a topological-sorting-based engine evaluate the parts of the dependency graph that change over time (max 60 FPS), this may solve the issue of arbitrary crossfilter interconnects between visualization components. MobX is one nice implementation of the above concept. The Redux pattern does not use topological sort. If you look deeply into solving the problem using RxJS, Bacon, or other FRP libraries, you'll find that the constructs they implement for this (usually called "when") also do not use topological sorting, so they fail on certain cases that come up in complex visualizations.
Possibly a good place to start with implementing the data flow concept in visualizations would be the D3 margin convention, with responding to resize.
I have done some work in this area, though the libraries never saw wide usage:
I'm not sure why these never took off. They solve the core problems, but there must be something hindering adoption. Perhaps the abstractions introduced are too heavy to add as a dependency. Perhaps it's because (unlike MobX) these libraries are not designed from the get-go to be integrated with a strong component model like React. In any case, thank you @monfera for your thoughts here. It's an interesting read for sure. I'm interested in solving the same problems - Grammar of Graphics + interactions + crossfilters + data flows. |
@curran thanks for your note and links. Yes, topological sorting is useful for synchronous updates. A note is that we're investigating various things and there is no rushed decision toward anything (in-house, or FRP inspired, or even whether it'll be different from the current patterns). Since you mention your tools haven't quite taken off, here are some plausible reasons, unrelated to their technical qualities:
So it might be that one-off projects choose from one of the established libraries, and larger libraries (vega, highland.js) roll their own takes that are tailored to their needs and patterns. |
@curran wow you've got a fantastic paper on this! Very illustrative: |
Thanks for the rich discussion @monfera, very informative! Let me clarify/expand on some points related to crosstalk and the linked views infrastructure within plotly (the R package). At its core, crosstalk's JS library just provides a "standard" way to set/get values and emit events when their values change (it's a bit like flyd). Crosstalk itself makes no assumptions about the For some more context, this diagram lays out the general idea of how plotly & leaflet are linked via crosstalk (which enables R users to do stuff like this without any knowledge of plotly.js or web technologies): As for the actual implementation on the plotly end, a long time ago I decided that the update logic in plotly should favor abstraction over speed (i.e., the Of course, in order to implement, I've also had to add my own JSON spec for defining links between traces. Right now, it's based on key/set attributes (plus, attributes I pass along from the
When a desired event is triggered, I subset every trace matching the relevant set (obtained via the event data), then call This is getting a bit into the weeds, but I should also mention that I support different classes of |
@cpsievert thanks for your incredibly useful and detailed comments! The way you summarize it is evocative of something like a 'crossfilter protocol' where relationships are established, as usual with plotly, in a declarative manner, imposing relatively few constraints on the implementation, and allowing a good level of interoperability with components that do not originate from plotly.js. I've already been thinking about using 'duck typing' inside the implementation. As an example, using crossfilter.js, or similar, latency-optimized in-memory database, would be desirable for some applications, but unnecessary payload increase for others, where data quantity is low, or selections are not occurring in a rapid and incremental manner, or group aggregates do not lend themselves to the reducer based incremental updates. For a first version, we may not need to support accelerators like this though, as a lot can be achieved with an initial version that adds no new sizeable dependency. Similarly, some applications would need to use asynchronous operations, e.g. fetching new payload (such as facet data, carto or temporal data, queried from a very large server database) from a server, computing something in a Web Worker, or phasing some dependent widgets in and out of the exploratory or dashboard UI without blocking user interaction. But it's not worth baking in a sizeable dependency (or any) as a lot of use cases just won't need it. Extending on your thoughts and the general utility of not baking in implementation detail, it looks feasible to stick to established declaration patterns such as the way you already specify links, to help with interoperability and pooling of resources. So to summarize, these general goals appear interesting:
I'll run some experiments and circle back on this one. Thanks again for your detailed comments! |
Awesome community notes from OpenVis talk about reactivity, including the announcement of |
@monfera - Another thing to keep in mind is whether crossfiltering could work across divs (across multiple graphs) instead of just in one context through subplots. In contexts like plotly dashboards, dash (shiny for python), shiny, and (eventually) our chart maker, it's easier for the user (and/or developer) to build interfaces that involve multiple plots as separate graphs in separate divs instead of a single plotly context with multiple graphs in subplots. This isn't a hard requirement, it's just something to keep in mind, and it may make crossfiltering applicable to more applications. |
@chriddyp awesome, thanks for bringing up component scope. Indeed, there should not be boundaries such as subplots and Users may have outside widgets or views such as a data table, numerical aggregates (e.g. showing a currently filtered total counts/values), a custom SVG map etc. as well as outside controls for filtering, all of which would need to be linked into a spreadsheet-like directed acyclic graph (reactive data flow). It'd be useful to leave open the integration with other reactive kit, e.g. Since Probably the chart maker would also benefit from some hooks, because in the past, individual plots/traces got fairly independent data and various plots use disparate data formats containing only the specific data parts (e.g. for a scatterplot, just 2 dimensions out of a possibly multivariate dataset that's also source for a In the crossfiltering concept, there is a notion of data as one or more separate 'repositories' from which plots/traces feed, with the mediation of the current filtering etc. control state. IOW the crossfilter would act as a data flow glue among the currently more independent parts. |
No two days elapsed since the announcement of |
(with experimental plotly component: https://idyll-lang.github.io/idyll-loader-component/examples/basic/build/ ) |
This video Vega-Lite: A Grammar of Interactive Graphics by @arvind has a great overview of the interaction and multiple-view techniques available currently in vega-lite. |
Linking @mbostock's writeup on |
Though now that I've had a chance to read about it, express is for exploration. idyll is really just aimed at presentation 😄 |
hi @monfera - i'm late to this (it took a nudge from @micahstubbs) but would love to discuss further! i'm working on similar stuff for Riffyn and am going to be talking about it at PlotCon. if you're around in Oakland, t'would be great to grab a coffee and swap notes 😉 |
As an update to the great summary in this issue description: Vega-Lite now has support for crossfiltering. See for a small example: https://vega.github.io/new-editor/?mode=vega-lite&spec=layered_crossfilter |
Evolving this small spreadsheet engine toward use at Plotly |
Interesting writeup that's also a jumpboard to academic papers on incremental computation esp. from a rendering viewpoint. It currently focuses on React and D3 and doesn't touch on functional reactive programming or observables. https://blogs.janestreet.com/incrementality-and-the-web/ |
An interesting topic with crossfiltering is the concept of the filtered set. A basic 'dashboard' (Coordinated Multiple Views) made with So, there are often two sets, context and focus. Data scientists and analysts often ask for a third, even more modal set, which is a current selection (be it hover, click or multiple click activated). A global filter which would constrain what's referred to above as the context is also common, for example, for limiting data transfer to the browser, or more interestingly, for adhering to whatever (temporal, spatial, other or combined) context is relevant for the user. So in summary:
This paper on Keshif uses three sets (besides discussing other topics):
For those interested, the current state-of-the-art of crossfilter and Plotly.js is documented in the open data public health initiative. There is also a roadmap with planned next steps. |
I've created a small dash app that implements bostock's http://square.github.io/crossfilter crossfilter.js example here: https://gist.github.com/nite/aff146e2b161c19f6d553dc0a4ce3622 - not quite the same level of realtime & slick UI/UX as the original, but good enough for a PoC. |
Closing, as this ticket won't lead to any work in this repo. We've made plotly.js + crossfilter.js example repo -> https://github.com/plotly/plotly.js-crossfilter.js Maybe we should transfer (still in beta 😏 ) this ticket over to https://github.com/plotly/plotly.js-crossfilter.js/issues ? Please let me know if anyone of you would like to continue this discussion. |
Such a great read! |
Reactive, crossfiltered data visualization
Plotly originally focused on generating visualizations, and interactivity increased over time. By now, Plotly has acquired rich layout, style and data update facilities, even animations. Data transformations such as declarative input grouping and filtering have also been added.
As there are growing expectations for fluid, efficient, yet still declarative interactions such as crossfiltering, we are starting a discussion with the purpose of shaping an API in line with Plotly conventions, current practices and future expectations.
Crossfilters are behaviors that let the user subset a multivariate dataset via direct manipulation across multiple views on that dataset. Crossfiltering is also known as linked brushing or linked filtering. The set of views included in one crossfilter is called coordinated views by the crossfilter.js doc, or sometimes linked views. There is no clear-cut boundary for the functional scope and features of crossfilters.
The archetypal crossfilter example by Mike Bostock, author of crossfilter.js, showing multidimensional filtering and aggregation on a quarter million records, also updating a sample table:
This text is just to get the ball rolling. There is prior art surrounding the Plotly toolchain and its dependencies such as D3. Since these tools are in active use and well-documented, this description doesn't detail them, except for listing them and highlighting some of their properties.
Also, there is a fantastic discussion on the topic by Carson Sievert, including many of the crossfilter concepts. Due to the richness of that material, this writeup can be a bit sparser on the crossfilter behaviors and more detailed on implementation concerns.
It's still useful to start with one way of thinking about interactivity, as crossfiltering is a particular instance of it. Also, crossfilter cores such as crossfilter.js usually peer-depend on change propagation or reactivity. Section 1 may be skipped for directly jumping into the crossfilter-specific part.
1. Reactivity
What runtime changes may occur to a visualization?
Not all types of visualizations require sophisticated updates. For example, a command tool such as the typical use of ggplot2 is technically a single-step execution even if the dataviz maker may repeatedly invoke it with various projections, aesthetics and data. These are common things that need data flow:
responsive design )
object constancy , or it can be used for effects of aesthetic appeal or engagement
Browser standards may cover some of the above items. For example, a CSS media query might provide print layouting; the <title> SVG tag provides basic tooltip hover; CSS supports transitions and animations for HTML and DOM elements. Often, these have limitations: CSS transitions and animations do not work for Canvas and WebGL (and in IE11, even SVG is poorly supported); the tooltip is very basic; sometimes the browsers have bugs, making CSS based layout changes hard or impossible (for example, non-scaling-stroke is buggy in some browser versions, and CSS translations can run into numerical issues).
Therefore, while following the standards is important for accessibility and progressive enhancement, they do not in general substitute for JavaScript execution for dataviz recalculation and rerender.
Why do runtime changes need some data flow concept?
Various terms exist for the need of a data flow concept. Perhaps the most often used term is "reactivity", not to be confused with react, a library that solves some rendering aspects of a reactive UI. The term responsive is sometimes used, although it's often meant in a regrettably limiting sense, such as redrawing on a window resize. There are various technical names such as streams and observables. Below, we'll stick to a generic term "data flow". There are related things like promises, publish/subscribe pattern, observer pattern, all trying to solve some aspect of the data flow problem.
Some visualizations may not really need one
For example, a very simple D3 or react based visualization may just rely on these respective libraries for the initial rendering and update (rerendering). Both D3 and react have been designed to allow idempotent rendering, such that the user may have a simple concept of 'data in, view out' - and these libraries handle the rest. Even in this case, there's some data flow concept, hidden beneath the library, but expressed through the API. In the case of D3 there are selections, data binding and the General Update Pattern, involving most DOM-specific API calls such as selection.data().enter(), selection.attr(), selection.transition(). D3 also provides common interactions such as brushing and dragging, as well as simple event dispatch and HTTP request handling. In react, the basic idea is that a pure function maps some data object to a DOM fragment; its underlying mechanism is DOM diffing via the virtual DOM, and it allows methods for component lifecycle events such as insertion or removal of a node.
Anyway, D3 views are often embedded in some framework that provides data flow functions, and react, or lighter weight alternatives such as inferno and preact, are often accompanied by data-centric tools such as MobX or redux.
Also, some use cases simply involve a one-off rendering, for example, outputting a static visualization, with no or basic interactivity features.
Some visualizations do need a data flow concept
A lot can be done just by using the simplest approach with D3 or react, so why go further?
A reminder is that Web standards are often quite limited (browser version limitations, IE feature lagging, no Canvas/WebGL animation support via CSS, more complex dataviz, see above).
One reason is declarative, denotational semantics, letting users specify what the visualizations and interactions should result in, rather than how the desired effects are achieved (an operational notion, implementation detail).
Some of the larger, more complex, ambitious data visualization libraries such as Plotly and Vega/Vega-Lite strive to be declarative, letting users tell what the dataviz should be - and this principle has merit even as an implementation concept. Current research is going into making not only visual output but also interactions declarative; sensible due to how much interactivity became integral to data visualizations.
When a visualization gets complex, working with data flow declaratively helps developer understanding and system overview. Even a most basic view, a single line or area plot has a lot of calculations which are best described as relations in a directed graph (annotations added to a vanilla Apple Numbers template):
For another simple example, consider
-- the points need to move according to the new projection
-- axis ticks must be rerendered
Relationships get much more complex if there are lots of lines, projections and transitions. For example, an exploratory tool may allow the replacement of one axis with another, or even the transition from one plot type (e.g. scatterplot) to another (e.g. beehive plot). Then there may be animations, filter, pan, zoom, small multiple or trellised views, multipanel views and dashboards with diverse sets of visuals on them. Being declarative in the implementation means that new time-varying or reactive behaviors may be easier to compose from existing ones, with easier reuse (pure functions), and testing is easier as mocks aren't needed.
Another reason is efficiency, an operational concern which is important for fluidity thus good user experience. An idealized computer would be able to calculate with infinite speed, no impact on the battery life, and we'd have a way of just recalculating everything from direct inputs and the user's interaction history. Actually, this is a bit like the model for the most basic react or D3 use, as well as a main concept of elm and redux time travel, and this works fine for a lot of use cases (we'll consider it a data flow model and come back to its pros and cons later).
But computers are not infinitely fast, so there is a host of reasons for why it's not sufficient in general:
In short, a basic reason for thinking about the data flow is that we want fluid user experience in a world of asynchronous actions, limited CPU and battery power. Janky interactions, or the avoidance of fluid interactions altogether, underutilize the computer medium and are a competitive disadvantage.
A simple example (follow link for writeup) for granular, incremental recalculations to reflect ongoing configuration on a live, real time updated view, e.g. changing bandline quantiles for outlier-vs-not shading:
We also expect that morphing from one visual representation (projections, channels, aesthetics) to another is going to become more common, for dashboard building via direct manipulation as well as exploratory analysis. An early Plotly concept morphs from a parcoords panel to a scatter plot, preserving filtering:
Couldn't we solve the problem without some data flow concept? (informal data flow)
We'll categorize such solutions as data flow concepts :-) But here they go anyway:
state object gets incrementally updated on each new piece of input. For example, newState.min = Math.min(previousState.min, input.newPrice). It's the redux model. It's great for single-layer, relatively simple actions, but isn't that suitable for the type of deeply cascading changes that characterizes data visualization.
selection.data() functions and carefully tailored enter vs update discrimination, is a powerful way for SVG visualizations. For example, it's possible to enhance an initially raw dataset with expensive aggregate statistics, and run a recalculation only if needed (e.g. a new point is added), which requires that the key function incorporate the data array length or some surrogate (hash etc.). Limitations: large DOM trees may be slow; more convoluted, and rigid, less component oriented design; data needs to be naturally hierarchical or otherwise crosslinks are needed; easily introduced bugs when a recalculation isn't done though it should be, or the other way around. Canvas support is doable but somewhat convoluted.
react lifecycle methods. The lifecycle methods make it possible to compute things just once. But model calculations are an anti-pattern in react; even the presence of lifecycle methods remove quite a bit from the react philosophy; and the issues mentioned for D3 above also apply.
Again, these approaches work, and can be very compact and natural to use, but they don't scale well to complex visualizations. Now on to some alternatives that are often used for larger projects:
console.log or a debugger statement. In contrast, sophisticated approaches require a good amount of learning and debugging practice (non-trivial costs).
model from view or view from controller is not trivial. In the case of MVVM, the separation of model and viewModel is also a bit arbitrary. MV* also typically uses some data binding pattern, e.g. observer or pub/sub. There's also a competition among the MV* zoo and the definitions aren't clear enough to even firmly know which is which.
The above three approaches have the common problem that they can lead to overreliance on tribal knowledge. There are no hard and fast rules or protocols about these approaches; they're grown organically (manual update processing) or are vague guidelines that leave the details up to debate and an endless stream of 'best practices' books. Often, the data flow code (if this separate aspect is kept as separate code) is developed in-house, and lacks proper documentation.
RxJS is incorporated (discussed separately, as it's been an established library on its own). Both angular versions are rather large, opinionated frameworks with idiosyncrasies, and neither is quite efficient for dataviz. It's unclear which of the two Angulars will be more popular. Similarly, as react is not a comprehensive framework, its complementing (independent) data flow tools are discussed on their own right.
Data flow tool categories
The below list includes a few specific libraries, not meant to imply that Plotly should follow or use any of these specifically.
A. Object-centered approaches
Usually, operations are done to objects via method calls, and methods achieve effects via altering various objects. It is hard to establish causal links: during debugging, one can't often get to a root cause just by traversing up the call stack, since the failing calculation fails likely because some of its input object properties are wrong, but those properties were not set in a frame currently on the stack, but some unknown different stack that preceded the current execution. In addition to familiarity with the API, a lot of implementation detail needs to be known to a contributor. Data is often exposed on objects, which commits the solution to particular representation structures, an operational rather than declarative concept. The flow of the data is implicit in the code and hard to have a mental image of.
relayout/restyle - idempotent plot update
B. Special-purpose data flow tool: low-level, idempotent, data-driven renderers
Some view generator solutions have their built-in data propagation patterns, such as data binding, which are fairly powerful, yet not quite appropriate for complex functions such as a crossfilter. Also, these tools themselves don't scale well to a moderate number of DOM elements for executions as frequent as the animation frame (60FPS).
D3 data binding and frequent, on-event rerendering for dashboard-level data flow
react component tree; often, lifecycle methods and stateful components
react alternatives with smaller scope and minimal footprint (inferno, preact, react-lite...)
regl, inspired by react, transforms specifications to efficiently generated and executed WebGL API calls (Plotly parcoords already uses regl.)
C. Special-purpose data flow tool: pipes
These tools usually facilitate one-off execution of a sequence of data transformations, sometimes including side effecting processing steps or terminal nodes. Due to their one-off nature, they're often built to handle explicit invocation (e.g. command line execution) or individual input events, synchronously or via promises. The archetype is the unix pipe. Usually, branching is beyond the scope or very limited, so pipes are not as natural for handling diverse inputs that factor into various points in the series of transformations, or intermediary transformations that take data from and/or feed into multiple other transformations.
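As a rough illustration of the pipe idea, here is a minimal sketch in plain JavaScript. The `pipe` helper, the step functions and the sample records are all hypothetical and exist only for this example; they are not taken from any particular library.

```javascript
// pipe(f, g, h)(x) evaluates h(g(f(x))): a one-off, strictly linear chain.
const pipe = (...steps) => (input) =>
  steps.reduce((value, step) => step(value), input);

// Hypothetical sample data, for illustration only.
const records = [
  { city: "Oslo", price: 12 },
  { city: "Rome", price: 7 },
  { city: "Oslo", price: 30 },
];

// Each step takes exactly one input and returns exactly one output.
const onlyCity = (city) => (rows) => rows.filter((row) => row.city === city);
const pluck = (field) => (rows) => rows.map((row) => row[field]);
const sum = (values) => values.reduce((total, v) => total + v, 0);

const osloTotal = pipe(onlyCity("Oslo"), pluck("price"), sum)(records);
console.log(osloTotal); // 42
```

Note how every step has a single upstream: a second input arriving mid-chain (say, a brush range that only the filtering step cares about) has no natural place here other than rebuilding and rerunning the whole pipe.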
D. Special-purpose data flow tool: crossfilters
Crossfilters usually want to efficiently and scalably solve the problem of multidimensional selection of individual data points for fast querying of the resulting sets or their aggregations. They typically process filter range changes and even new data points incrementally. Usually, processing is done with reducer functions, efficient if the incremental change is of limited frequency, but not as efficient when the changes are big enough to warrant a tight, cache aware numerical processing loop. They often do not want to provide a mechanism for notification, whether it's related to their input (new data or interactions) or output (downstream changes on the changed itemized and aggregate queries), so a crossfilter, on its own, isn't sufficient for crossfiltering; it needs to be embedded in a more general data propagation mechanism. Internally, crossfilters use interesting implementations for efficiently updating query sets, and are rather stateful so as to save computational costs for handling incremental changes with low latency.
1. crossfilter.js
2. vega-crossfilter
3. scijs/cwise based (idea: turn reducer functions into an efficient loop body)
1. plotly vertex shader based mini-crossfilter as in the new Plotly parcoords
2. regl-cwise based (idea: turn reducer functions into shader code and hierarchical aggregations)
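To make the "reducer functions plus retained state" idea concrete, here is a small sketch. It is not the crossfilter.js API; `createIncrementalGroup` and the sample `payments` are hypothetical names chosen for illustration. The sketch keeps per-group running sums and, when the filter changes, only runs the add/remove reducers for records whose membership actually flipped. For simplicity it still evaluates the predicate on every record; real implementations typically use sorted indexes to locate just the changed range.

```javascript
// Hypothetical helper: grouped sums maintained incrementally under a changing filter.
function createIncrementalGroup(records, keyOf, valueOf) {
  const sums = new Map();     // group key -> running sum
  const included = new Set(); // indices of records currently passing the filter

  const add = (i) => {
    const key = keyOf(records[i]);
    sums.set(key, (sums.get(key) || 0) + valueOf(records[i]));
    included.add(i);
  };
  const remove = (i) => {
    const key = keyOf(records[i]);
    sums.set(key, sums.get(key) - valueOf(records[i]));
    included.delete(i);
  };

  records.forEach((_, i) => add(i)); // start with no filter applied

  return {
    // Apply a new predicate; returns how many records changed membership.
    filter(predicate) {
      let touched = 0;
      records.forEach((record, i) => {
        const passes = predicate(record);
        if (passes && !included.has(i)) { add(i); touched++; }
        else if (!passes && included.has(i)) { remove(i); touched++; }
      });
      return touched;
    },
    sums: () => Object.fromEntries(sums),
  };
}

const payments = [
  { type: "cash", total: 10 },
  { type: "card", total: 25 },
  { type: "cash", total: 40 },
  { type: "card", total: 5 },
];
const byType = createIncrementalGroup(payments, (p) => p.type, (p) => p.total);
const firstPass = byType.filter((p) => p.total >= 10);  // drops one record
const secondPass = byType.filter((p) => p.total >= 20); // drops one more
console.log(byType.sums()); // { cash: 40, card: 25 }
```

The statefulness is the point: narrowing the range from 10 to 20 costs one reducer call rather than a full re-aggregation, which is also why nothing here notifies anyone downstream; that part is left to the surrounding data flow.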
E. General data flow tool categories
These can be thought of as a spreadsheet, in that the developer doesn't state how a sum is calculated and updated: whenever some input changes, it propagates downstream in the directed acyclic graph that is the data flow structure. Yet, proper FRP, coined by Conal Elliott, has a rigorous foundation so we call JS libraries FRP inspired, as they center around operational concerns such as a data propagation graph, event emission, backpressure etc. While sound in principle, many of these libraries make it hard to debug userland code, because the stack is usually deep, verbose, nondescript and even with blackboxing, it's hard to see what initial change cascaded down to the current stack, and what transformations took place. MobX puts more emphasis on letting the coder understand cause and effect relationships in the debugger.
1. MobX
1. RxJS
2. Bacon
3. Kefir
4. Flyd
5. most.js
6. xstream
Motives and properties recapped here
Libraries not listed as none currently exists for JS
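The spreadsheet analogy for this category can be sketched in a few lines. The following is a toy, not the design of any library listed above: `createSheet` and the cell names are hypothetical. Cells are either inputs or formulas over other cells; setting an input recomputes only the downstream cells, in a topological order, so each one is evaluated once and only after all of its own inputs are fresh.

```javascript
// Hypothetical mini spreadsheet over a directed acyclic graph of cells.
function createSheet() {
  const cells = new Map(); // name -> { deps, formula, value }

  // Reverse post-order DFS: a topological order of everything downstream of `start`.
  function downstreamOrder(start) {
    const visited = new Set();
    const order = [];
    const visit = (name) => {
      if (visited.has(name)) return;
      visited.add(name);
      for (const [other, cell] of cells) {
        if (cell.deps.includes(name)) visit(other);
      }
      order.unshift(name); // all dependents are already in the list, so prepend
    };
    visit(start);
    return order.slice(1); // drop the changed input itself
  }

  const evaluate = (cell) =>
    cell.formula(...cell.deps.map((dep) => cells.get(dep).value));

  return {
    input(name, value) {
      cells.set(name, { deps: [], formula: null, value });
    },
    formula(name, deps, formula) {
      const cell = { deps, formula, value: undefined };
      cells.set(name, cell);
      cell.value = evaluate(cell);
    },
    // Returns the names of the recomputed cells, in evaluation order.
    set(name, value) {
      cells.get(name).value = value;
      const recomputed = downstreamOrder(name);
      for (const dependent of recomputed) {
        const cell = cells.get(dependent);
        cell.value = evaluate(cell);
      }
      return recomputed;
    },
    get: (name) => cells.get(name).value,
  };
}

// Margin-convention style example with a diamond-shaped dependency.
const sheet = createSheet();
sheet.input("width", 500);
sheet.input("margin", 50);
sheet.formula("innerWidth", ["width", "margin"], (w, m) => w - 2 * m);
sheet.formula("tickCount", ["innerWidth"], (iw) => Math.floor(iw / 150));
sheet.formula("tickSpacing", ["innerWidth", "tickCount"], (iw, n) => iw / n);

const recomputed = sheet.set("width", 900); // e.g. on window resize
console.log(recomputed, sheet.get("tickSpacing"));
```

Here `tickSpacing` depends on `innerWidth` both directly and through `tickCount`; the ordering guarantees it is computed once, from consistent values, and `margin` is never touched.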
F. Reducer based
Redux is a predictable state container, a reducer based library. It handles singular changes, called actions, elegantly and in a functionally pure way, responsible for the predictability part. Each action is mapped into a transform of a (current) state to a next state; the state object itself is modeled as a large, inert JSON-like object, whose hierarchical structure can represent inputs or derived data. Since redux handles direct actions and doesn't in itself handle the rippling effects of such actions, it's combined with change propagation means for deeper dependency graphs.
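The reducer idea itself needs no library to demonstrate. The sketch below does not use the redux package; the action shape (`type`, `newPrice`) and the state fields are hypothetical, loosely following the min-price example mentioned earlier in this writeup.

```javascript
// A reducer is a pure function: (previous state, action) -> next state.
const initialState = { min: Infinity, max: -Infinity, count: 0 };

function priceReducer(previousState, action) {
  switch (action.type) {
    case "PRICE_ADDED":
      // Return a new object; the previous state is never mutated.
      return {
        min: Math.min(previousState.min, action.newPrice),
        max: Math.max(previousState.max, action.newPrice),
        count: previousState.count + 1,
      };
    default:
      return previousState; // unknown actions leave the state untouched
  }
}

// Replaying the action log from the initial state always yields the same result.
const actions = [12, 7, 30].map((newPrice) => ({ type: "PRICE_ADDED", newPrice }));
const finalState = actions.reduce(priceReducer, initialState);
console.log(finalState); // { min: 7, max: 30, count: 3 }
```

What the sketch does not show is equally telling: anything derived from `min` or `max` (scales, ticks, aggregates) still has to be recomputed by some other mechanism after each action.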
G. View and logic together
These tools bind some data propagation concept / tool with a view rendering mechanism such as DOM updates. They can be made to work on Canvas/WebGL, though in this case the benefit of being cycle-oriented is somewhat underutilized.
1. cycle.js
2. motorcycle
3. TSERS (few recent commits)
2. Crossfiltering
Crossfiltering is a major data visualization interaction type that lets the user slice and subset their data, most often by highlighting a range on an axis or an area on a plot. An archetypal implementation (for me, having used it first) is Bostock's crossfilter.js published in 2012.
Interactivity in data visualization is only limited by creativity and practicality. Yet, there are archetypal interactions that can be easily identified in literature and implementations alike, such as
The latter is often called crossfiltering on a multi-plot view when the purpose of selecting elements or a range of elements is not primarily to get detailed, itemized info on them, but to control what is shown on the other subplots, conveying the notion to users that they interact with a single dataset, filterable in any of the interactive subplots, all of which provide a particular view into the single dataset.
Crossfiltering is an important solution for what we can term the big problem of data visualization: the focusing problem. Crossfiltering lets the user start exploratory analysis by viewing the visualization based on the entirety of the data, or a pertinent set (e.g. last 30 days), but then focus on subsets of data, guided by their goals and patterns in already rendered subsets. It is also usable in explanatory analytics such as interactive journalism or education: the reader or student may gain useful extra information using the same set of views, altering just the set of data in scope, e.g. selecting their city of residence or highlighting an interesting range of distance.
Common crossfiltering facilities - overview
Interactions:
Responses:
object constancy is preserved where possible
Crossfilter implementations
To inform crossfilter API design, it's useful to touch on current, actually available crossfiltering methods. Features are listed so that the common, and perhaps some rare, functions are input to API design. Similarly, current limitations (which may become obsolete) are mentioned not as criticism, but simply to gauge the extent to which each API has needed to cope with planned use cases.
Crossfilter.js
Crossfilter is an in-memory, incremental mapReduce implementation in JS created by Mike Bostock who also authored D3.
Possible gotcha: "a grouping intersects the crossfilter's current filters, except for the associated dimension's filter. Thus, group methods consider only records that satisfy every filter except this dimension's filter. So, if the crossfilter of payments is filtered by type and total, then group by total only observes the filter by type"
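The quoted rule is easier to see on a handful of records. The sketch below emulates the semantics in plain JavaScript rather than calling crossfilter.js; the `filters` object, the `groupCount` helper and the sample `payments` are hypothetical and only mimic the behavior described in the quote.

```javascript
// Hypothetical sample data.
const payments = [
  { type: "tab", total: 190 },
  { type: "tab", total: 90 },
  { type: "cash", total: 100 },
  { type: "visa", total: 300 },
];

// Hypothetical filter state: one predicate per dimension.
const filters = {
  type: (p) => p.type === "tab",
  total: (p) => p.total >= 100,
};

// Count records per key of `dimension`, applying every filter
// EXCEPT the one that belongs to `dimension` itself.
function groupCount(records, dimension, keyOf) {
  const otherFilters = Object.entries(filters)
    .filter(([name]) => name !== dimension)
    .map(([, predicate]) => predicate);
  const counts = {};
  for (const record of records) {
    if (otherFilters.every((passes) => passes(record))) {
      const key = keyOf(record);
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

// Grouping by total only observes the filter on type:
const countsByTotal = groupCount(payments, "total", (p) => p.total);
// Grouping by type only observes the filter on total:
const countsByType = groupCount(payments, "type", (p) => p.type);
console.log(countsByTotal, countsByType);
```

In this emulation the 90 tab payment still shows up in the by-total counts even though the total filter excludes it. That is what lets a brushed histogram keep drawing the bars outside its own brush, but it can surprise anyone expecting every group to reflect the full intersection of filters.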
Key features
Very small (10kB uncompressed, 4.4kB compressed)
Very mature and stable
Fast for large datasets e.g. 100k elements, if reducers are fast (though obviously not as fast as array looping)
Does one thing and one thing well
Small API surface
Limitation of scope
These are inherent either in the focused scope of this component (do one thing well), or in the JS language and runtimes (no weak maps, no object finalization etc.) so they're just observations rather than criticism.
Vega crossfiltering
Vega is an interesting, long-running project run by the Interactive Data Lab; its approaches demonstrate important research, and there's a level of rigorousness and compactness about the concepts. Vega implements a visualization grammar (see also Wilkinson's Grammar of Graphics, ggplot2), a declarative format for creating interactive visualizations.
Vega is based on reactive data flow, and has enabled the creation of crossfiltering, although not in a particularly declarative way. The award winning research paper describes the addition of declarative graphics interactions.
Vega
Example: https://vega.github.io/vega-editor/?mode=vega&spec=crossfilter
Depends on vega-dataflow and vega-crossfilter.
Vega is a reactive library of broad, general data visualization scope.
Uses its own reactive data flow means rather than depending on another lib.
342kB uncompressed.
While Vega supports crossfiltering in that reactive streams causing a crossfilter mechanism can be established, the creation is somewhat intricate, and isn't a concise, high level, declarative API.
Vega-dataflow
Dependency of vega-crossfilter and vega. Streams scalar and composite data.
https://github.com/vega/vega-dataflow
Relatively large, bundle is 88kB uncompressed.
Vega-crossfilter
https://github.com/vega/vega-crossfilter/blob/master/test/crossfilter-test.js
Uses vega-dataflow but doesn't use Bostock's crossfilter.js
Dependency of vega
Vega-lite
Vega-lite is a translation layer between the Vega-Lite compact, higher level visualization grammar format and the powerful, more verbose Vega visualization grammar format.
As of January 16 2017 there's no crossfilter or declarative (or any) interactions; declarative interactions are currently in feature branches and slated to arrive soon. If I'm not mistaken, even with declarative interactivity in Vega-lite, it won't be as simple as identifying dimensions and subplots for a crossfiltering relationship. But at the expense of more verbosity, there'll be more flexibility as well, permitting custom and hybrid interactions.
devDepends on vega.
Crosstalk (htmlwidgets)
Crosstalk is a protocol for linked brushing across multiple, possibly heterogeneous htmlwidgets. It uses shared state (SharedData) among various htmlwidgets. A htmlwidget can be made compatible with crosstalk by following a well-documented protocol.
Limitations (as of writing; evergreen doc):
Bokeh crossfilter
Bokeh has a crossfilter, also referred to as linked brushing, that redraws subplots only upon completion of the selection; the rectangular or lassoed area doesn't persist and therefore cannot be interactively moved. This is a possible way of bypassing stringent latency requirements, and is a useful option to consider for an initial Plotly implementation.
Interestingly, the Bokeh examples seen have no explicit crossfilter specifications beyond listing the interaction start buttons box_select, auto_select. According to the text, the only other criterion is that multiple plots use the same dataset (same identity). It has a lot of appeal by virtue of its simplicity, although Plotly, given its numerous connectors, serialized tree representation and granular data structures, probably can't follow this model. Yet, it shows that the API search space should include very terse or implied linking. In the absence of relying on dataset identity, the closest option would be simply to add a filtergroup attribute to all plots (see below).
Upshot
Crossfilter API design thoughts for Plotly
Based on the above landscape and some motives below, as well as strong, preexisting Plotly API conventions that have been found useful by a wide base of users, we can start assembling thoughts on possible crossfilter API elements for Plotly.
For simplicity, the term Plotly means Plotly.js here; all the language API bindings and the Plotly Workspace would likely expose the crossfilter specifications to their respective users.
Existing interactive and related features in Plotly.js
Plotly already supports interactivity and data processing features that relate to crossfiltering:
restyle
,relayout
)pointcloud
)groupBy
andfilter
parcoords
parcoords
Currently limiting features in Plotly
data block, in its name, is about data, and indeed, contains column vectors such as x and y. But it also deals with plot (trace) specifications, for example, whether a scatter or a heatmap is required, what aesthetics would be present (markers, lines) and with what channel styling.
filter transforms that are code strings. Direct calculations are handy for specifying custom predicates for e.g. point inclusion or aggregation.
plot function, and invoking chart-specific mutations such as attributes, defaults, calcs, render. Some of the main functions have hundreds of code lines. The result is that updates are coarse-grained and much redundant recalculation occurs.
Understanding prior art
crossfilter.js such as the possibility for enumerated or range based filtering, and generation of aggregates
vega representations aren't crossfilter-specific, but are at a finer granularity of interactions (Vega-Lite) and streams (Vega) - yet it makes a lot of sense to learn from how Vega-Lite represents interactions in a compact yet versatile manner
crosstalk is a great example that can be very simply used to establish crossfiltering links, although currently not dealing with the complexities of aggregates
Desired functional features in the Plotly crossfilter
Flexible data subsetting in crossfiltering
Specification for
crossfilter
, domain)
filtergroup attribute per plot
Diverse selection sets and filtering algebra
For compact, common representation, both enumerated values and contiguous ranges are ideally supported. We may consider
An initial implementation is already useful with one simple, single range based filter per dimension, as done for parcoords.
Aggregations
Some crossfilters, e.g. R's crosstalk, may only (currently) support crossfiltering over atomic data. It is already useful, since it can yield linked brushing. Going beyond this, most crossfilters support the inclusion of groups or aggregates. Selecting a subset of the scatter points may lead to updated histograms similar to this dc.js example.
In addition to updating aggregates, it is desirable if projection ranges (brushed areas) or glyphs corresponding to aggregates, such as histogram bars or choropleth maps, are themselves subject to selection. For example, highlighting a range of bars on a histogram would highlight the source scatter points, and other aggregates would be updated based on this highlighted set of scatter points (link below does this too, relying on crossfilter.js in dc.js).
This reverse direction requires an explicit bijective relationship between an aggregate plot and the source data, otherwise the corresponding atomic data points can't be identified. I think Plotly doesn't yet handle this aspect, but again, aggregates, and especially the selection of aggregates, need not be part of an initial step. Plotly currently handles few, discrete types of aggregations, such as binning for histograms, so adding inverse mapping doesn't seem burdensome. More challenging is that users do, or are led to, preaggregate data themselves to make their own aggregations, in effect using Plotly as a dumb, static view with the data processing steps residing outside Plotly. In this case, establishing links is impossible, unless we invent some heavyweight annotation for bijective mapping. Consequently, the Plotly API would need to move more into data handling territory with datasets, dimensions and aggregation keys as first-class JSON structures; then individual plots or traces may refer to said datasets as their data, and dimensions in their axes, as opposed to the current practice of supplying data directly to the traces.
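As a rough illustration of the inverse mapping for binned aggregates, the sketch below keeps, for each histogram bin, the indices of the source rows that fell into it, so that selecting bars can be resolved back to atomic points. The function names `binValues` and `rowsForSelectedBins`, their parameters and the sample values are assumptions made for this example; none of it is existing Plotly.js API.

```javascript
// Bin values into fixed-width bins, remembering which source rows fed each bin.
function binValues(values, start, size, count) {
  const bins = Array.from({ length: count }, (_, i) => ({
    x0: start + i * size,        // inclusive lower edge
    x1: start + (i + 1) * size,  // exclusive upper edge
    indices: [],                 // inverse mapping: bin -> source row indices
  }));
  values.forEach((v, rowIndex) => {
    const b = Math.floor((v - start) / size);
    if (b >= 0 && b < count) bins[b].indices.push(rowIndex);
  });
  return bins;
}

// Selecting bars resolves to the union of their source rows, in row order.
function rowsForSelectedBins(bins, selectedBinIndices) {
  return selectedBinIndices
    .flatMap(b => bins[b].indices)
    .sort((a, b) => a - b);
}

const ages = [3, 17, 25, 31, 38, 44];             // made-up sample values
const bins = binValues(ages, 0, 20, 3);           // [0,20), [20,40), [40,60)
const barHeights = bins.map(b => b.indices.length);
const selectedRows = rowsForSelectedBins(bins, [1, 2]);
```

The bar heights fall out of the stored indices, so the forward aggregate and the reverse lookup share one structure. Data that was preaggregated outside the library carries no such indices, which is the difficulty described above.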
Many dashboards in the wild display solely aggregates (no items in sight). It's good to consider an API with at least eventual aggregation support in mind.
Some other dashboards such as an implementation of Stephen Few's student dashboard in d3 feature itemized data selection, updating aggregates, where each item itself is composite, e.g. a student that's a foreign key in a per-student attendance time series table:
If sorting is present (analogous to using Plotly.restyle with a different order for ordinal ticks), the previously contiguous selection range becomes fragmented (or conversely, we may use an ordering-then-brushing facility to avoid complications with multiple set selections), yet the aggregation itself doesn't change:
Familiarity
Lots of good work went into crossfilters in JS and other languages via the above mentioned libraries and many libraries not mentioned here. To make things easy for users, our design should recognize established, learned patterns. Since the concepts are fairly transferable across tools, yet the actual behaviors, limitations, and the method and granularity of specification are diverse, it's best to follow the concepts and do it in a way that's coherent with Plotly patterns, on the principle of least surprise to the users.
Time series data
It's often the case that crossfiltering is combined with, or applied to, time series data. This poses additional demands, because of the number of data points and especially the DOM impact involved. Headroom for smooth rerendering may be achieved by hybrid charts where the one or few performance-critical layers are rendered with WebGL, e.g. via regl:
There are additional use cases with time series data:
Animating filters
It's useful for animations to also work with crossfiltering, so that a single dimension filter can be declared for animation while the visual effects show in all the rendered plots that involve the filtered dimension.
Desired non-functional features
Serializability
Low latency
The lower the latency, the better - the ideal is 30FPS-60FPS. Below around 10-15FPS, the illusion of direct manipulation, which often underpins crossfiltering, is lost, and users need to wait for debounced, delayed recalculations, i.e. views are out of sync. Therefore it's important to optimize data paths in some systematic manner, or settle for deferred view updates.
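As one illustration of optimizing a data path, an aggregate can be adjusted by the rows that enter or leave a brushed range instead of being recomputed from scratch on every brush move. The sketch below is a simplified, hypothetical stand-in for what incremental crossfilters do internally; `createRangeSum`, `bisectLeft` and the half-open `[from, to)` range convention are assumptions of this example.

```javascript
// Index of the first element >= x in an ascending array (binary search).
function bisectLeft(sorted, x) {
  let lo = 0, hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < x) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Keeps the sum of values inside a half-open range [from, to),
// touching only the rows that changed membership when the range moves.
function createRangeSum(values) {
  const sorted = Float64Array.from(values).sort(); // ascending numeric order
  let i0 = 0, i1 = 0;                              // current window: [i0, i1)
  let sum = 0;

  function move(from, to) {
    const n0 = bisectLeft(sorted, from);
    const n1 = Math.max(n0, bisectLeft(sorted, to));
    let touched = 0;
    const adjust = (a, b, sign) => {
      for (let i = a; i < b; i++) { sum += sign * sorted[i]; touched++; }
    };
    if (n0 >= i1 || n1 <= i0) {
      adjust(i0, i1, -1);   // windows are disjoint: drop old, add new
      adjust(n0, n1, +1);
    } else {
      adjust(i0, n0, -1);   // rows dropped at the lower edge
      adjust(n0, i0, +1);   // rows gained at the lower edge
      adjust(i1, n1, +1);   // rows gained at the upper edge
      adjust(n1, i1, -1);   // rows dropped at the upper edge
    }
    i0 = n0; i1 = n1;
    return { sum, count: i1 - i0, touched };
  }

  return { move };
}

const agg = createRangeSum([5, 1, 9, 3, 7]);
const first = agg.move(2, 8);    // initial brush
const second = agg.move(4, 10);  // brush dragged slightly to the right
```

The second move visits only the two rows at the edges of the brush rather than the whole column. With floating point data, repeated additions and subtractions can accumulate rounding error, so a real implementation may want to resynchronize occasionally.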
Low latency has lots of elements: efficient filtering code (e.g. crossfilter.js uses heavy bit fiddling; our parcoords crossfilter in the vertex shader); avoiding unnecessary recalculations, since the changes may be very cheap to calculate compared to an initial rendering; touching the DOM sparingly, using e.g. d3selection.data(fun, key) to detect changes and rely on the DOM diffing of the General Update Pattern.
Reusing existing Plotly facilities
A lot of existing Plotly facilities may be reused for crossfiltering.
transforms such as filter and groupBy are good candidates for both filtering and grouping for aggregations, although the current implementations are not incremental and are of somewhat high latency
parcoords
parcoords GPU based crossfilter might be shared for supporting large data quantities plotted with WebGL; it works as an N dimensional crossfilter even for rendering parcoords lines
A sample API for simple, atomic crossfiltering
Unlike Bokeh, Plotly can't currently rely on a single, shared data structure to deduce a default crossfiltering behavior. Also, current axis keys (keys of the JSON object) can't serve to indicate dimensional unity, because of their preexisting separation for e.g. layouting in screen space (called domain in Plotly).
But there would be ways for retaining the current Plotly semantics and API, while introducing datasets as first class objects.
Establishing unity of data and dimensions can be done by modeling these as first class entities. It would yield a compact, scalable and high level representation.
What looks like this now, with repeated vectors for disparate plots or traces:
may be, in order to preserve relations, represented as
In addition to retaining data relations, it would have other benefits:
Tools surrounding Plotly.js, such as the Workspace, already have analogous facilities, so it can be considered a natural absorption of useful features into Plotly.js.
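To make the idea of datasets as first class objects more tangible, here is a hypothetical sketch in which columns are stored once and traces refer to them by name. The `datasets` and `dataset` keys, the use of column names in `x` and `y`, and the `resolveTraces` helper are all assumptions invented for this illustration, not existing or proposed Plotly.js API; the numbers are sample values.

```javascript
// Hypothetical specification: columns live once in `datasets`,
// and traces reference a dataset plus column names.
const spec = {
  datasets: {
    cars: {
      mpg:    [21.0, 22.8, 18.7],
      hp:     [110, 93, 175],
      weight: [2.62, 2.32, 3.44],
    },
  },
  traces: [
    { type: "scatter", mode: "markers", dataset: "cars", x: "hp",     y: "mpg" },
    { type: "scatter", mode: "markers", dataset: "cars", x: "weight", y: "mpg" },
  ],
};

// Expand references into the column vectors that trace objects carry today.
function resolveTraces({ datasets, traces }) {
  return traces.map(({ dataset, x, y, ...rest }) => {
    const table = datasets[dataset];
    if (!table) throw new Error(`Unknown dataset: ${dataset}`);
    return { ...rest, x: table[x], y: table[y] };
  });
}

const data = resolveTraces(spec);

// Because both traces point at the same column, their relation is preserved:
// the y vectors are the same array, not two copies that merely look alike.
const sharesY = data[0].y === data[1].y;
```

The resolved `data` array has the shape of today's trace objects, which suggests such a layer could sit in front of the current API without changing its semantics. Whether references should be column names, paths, or something else is left open here.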
API possibilities for grouping
Groups, in general, can be many things: nodes in a normalized relational star schema model; or calculated on the fly, such as specific bins; or in the simplest case, just another dimension (denormalized representation). There's ample precedent for this last option in Plotly, such as the current transforms/groupBy specification, or the use of ordinal or nominal dimensions (e.g. overplotting points with semitransparent markers).
Therefore, groups might be specified, quite verbosely, as
Recognizing that this is a lot of text for mundane aggregations, supposedly coming from a list of Plotly-implemented aggregation functions, it should either be made much briefer, or the reward should be a lot more power, for example, some plugin mechanism for custom, programmed aggregator and filter components, even if the API doesn't go to the length Vega goes, which encourages infix and functional algebraic expressions represented as strings.
An example for the former option - much briefer notation - could be a simple reference to the aggregation in the view:
API for filter state; other API elements
Compared to representing data relations such as shared data and aggregations, the problem of representing and serializing filter states is quite trivial. It just falls into place once these larger problems are resolved. The crossfilter.js API doc contains sensible options, such as using [from, to] filter domains, or [elem1, elem2, ...] enumerations for specifying filter state. Inspired by this, Plotly may add
though some questions remain, such as whether the ranges, denoted with arrays, are right-open or right-closed.
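A serializable filter state along these lines could look like the following sketch. The `range` and `values` keys, the `matches` and `filteredRows` helpers and the sample columns are hypothetical. Ranges are treated as half-open (lower bound included, upper bound excluded), which is how crossfilter.js documents its `filterRange`; adopting the same convention for Plotly is an assumption of this example, not a settled answer to the question above.

```javascript
// Hypothetical filter state: one entry per dimension name.
// Explicit keys avoid confusing a two-element enumeration with a range.
const filterState = {
  hp:  { range: [90, 150] },   // half-open: 90 <= hp < 150
  cyl: { values: [4, 6] },     // enumeration of accepted values
};

// Test a single value against a single dimension filter.
function matches(filter, v) {
  if (!filter) return true;
  if (filter.range) return v >= filter.range[0] && v < filter.range[1];
  if (filter.values) return filter.values.includes(v);
  throw new Error("Unsupported filter");
}

// Indices of rows that satisfy every dimension's filter (intersection).
function filteredRows(columns, state) {
  const n = Object.values(columns)[0].length;
  const rows = [];
  for (let i = 0; i < n; i++) {
    const ok = Object.entries(state)
      .every(([dim, f]) => matches(f, columns[dim][i]));
    if (ok) rows.push(i);
  }
  return rows;
}

const columns = { hp: [110, 93, 175, 150], cyl: [6, 4, 8, 6] }; // sample values
const selected = filteredRows(columns, filterState);

// The state is plain JSON, so it survives serialization unchanged.
const roundTripped = JSON.parse(JSON.stringify(filterState));
```

The last row has `hp` equal to the upper bound and is excluded, which is exactly the kind of edge case the right-open versus right-closed decision determines.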
An alternative is to use the relations similar to the current filter transforms, building up the filtered set more verbosely but perhaps giving more flexibility:
though there'll need to be more algebra such as specifying unions and intersections.
Draft conclusions
Adding crossfilter to Plotly is sought after, given the current level of interactivity, the user expectations toward data exploration and Plotly facilities, support for heterogeneous subplots/dashboards, and upcoming plots that need to use crossfiltering such as parcoords, small multiple charts, SPLOM and trellised plots.
A Plotly crossfilter would benefit from (depend on) concurrently introduced new concepts, such as
As this list contains elements on which current libraries have iterated for years - such as LINQ.js for specifying aggregates and other derived queries, not to mention host language features and common libraries in R, Python etc. such as the very compact dplyr API - a question is, where should boundaries be drawn, whether the API of an existing tool should be adopted, or whether it's possible to postpone the introduction of such concepts altogether.
Also, the listed changes may require some refactoring and API change (or addition) such as
graphDiv.data, so that internal representations aren't committed to, and the code is free to delay data exports until needed
data and traces/plots in the JSON specification