Skip to content

Streaming API

Ari Hartikainen edited this page Sep 30, 2020 · 6 revisions

Streaming API needs a way to define suitable/wanted plots, statistics, and diagnostics. For the general interface, we probably need a wrapper for different libraries. There is a high probability that needed interfaces are not yet developed to different samplers, which means we need a simple interface, which is easy to implement by the sampler devs.

Also, we need to consider if there is a performance penalty and how big it is.

Some information we probably want to have as soon as possible (e.g. new draws), but the output can have a small lag in the update rate, so this would probably mean that we can access new data "in chunks" if needed.

For online algorithms, we need to consider do we implement our own (with numba) or use some external package.

  • Plotting
  • Stats & Diagnostics
    • Online algorithms
    • Default algorithms
  • General tools and callbacks
  • Refresh: Per draw vs Time-triggered

See

Timed window with 0.1 seconds could be ok

plots

There should be a way to update only the new parts for the figure when new data arrives. For example bokeh.ColumnDataSource.stream does this. Creating a new plot each time data is updated is wasteful, and usually slow.

streamz

To get the most out of the streaming, we should have a fast way to move data from the sampling process to another process/thread without blocking the sampling. This can probably be done with async or running streamz in another thread, there are builtin ways to do both of them in streamz.

def update_source(data, cds):
    # with timed_window
    # data is a list of outputs
    data = f(data) # if needed
    cds.stream(data)

# asynchronous=None -> normal
# asynchronous=False -> run in subthread (non-blocking)
# asynchronous=True (non-blocking)
source = streamz.Stream(asynchronous=None)
stream = source.timed_window(0.1).map(update, cds=cds)
pane = pn.pane.Streamz(stream, always_watch=True)
...
source.emit(data)

Data container

We need to figure out should we keep all the data in plotting container too, this could mean 2-3x space, or access data dynamically, if users ask for it.

bokeh.ColumnDataSource

ColumnDataSource can be used to incrementally add more data.

cds = bokeh.ColumnDataSource(df) # pandas.DataFrame or dict
for _ in range(n):
    cds.stream(new_df)

Adding data to cds one by one is a slow process (list.append / pd.DataFrame.append). Sending data in batches might be enough, but it is still slow.

Clone this wiki locally