Update Timeseries user-guide
jonmmease committed Jan 30, 2019
1 parent 8696a95 commit dc8e83f
Showing 1 changed file with 10 additions and 9 deletions.
19 changes: 10 additions & 9 deletions examples/user_guide/3_Timeseries.ipynb
@@ -387,11 +387,11 @@
"\n",
"The examples above all used a small number of very long time series, which is one important use case for Datashader. Another important use case is visualizing very large numbers of time series, even if each individual curve is relatively short. If you have hundreds of thousands of timeseries, putting each one into a Pandas dataframe column and aggregating it individually will not be very efficient. \n",
"\n",
-"Luckily, Datashader can render arbitrarily many separate curves, limited only by what you can fit into a Dask dataframe (which in turn is limited only by your system's total disk storage). Instead of having a dataframe with one column per curve, you would instead use a single column for 'x' and one for 'y', with an extra row containing a NaN value to separate each curve from its neighbor (so that no line will connect between them). In this way you can plot millions or billions of curves efficiently.\n",
+"Luckily, Datashader can render arbitrarily many separate curves, limited only by what you can fit into a Dask dataframe (which in turn is limited only by your system's total disk storage). Instead of having a dataframe where each pair of columns (one for `x`, one for `y`) represents a curve, you can have a dataframe where each row represents a fixed-length curve. In this case, the `x` and `y` arguments should be set to lists of the labels of the columns that represent the curve coordinates, and the `axis` argument should be set to 1. If all of the lines share the same coordinates for one of the dimensions, then the corresponding argument (either `x` or `y`) can be replaced with a 1-dimensional numpy array containing these coordinates.\n",
"\n",
-"To make it simpler to construct such a dataframe for the special case of having multiple time series of the same length, Datashader includes a utility function accepting a 2D Numpy array and returning a NaN-separated dataframe. (See [datashader issue 286](https://github.com/bokeh/datashader/issues/286#issuecomment-334619499) for background.) \n",
+"In this way you can plot millions or billions of fixed-length curves efficiently.\n",
"\n",
-"As an example, let's generate 100,000 sequences, each with 10 points, as a Numpy array:"
+"As an example, let's generate a Numpy array containing 100,000 sequences with 10 points each, where each sequence represents a 1-dimensional random walk with step size drawn from the standard normal distribution:"
]
},
{
@@ -402,15 +402,16 @@
"source": [
"n = 100000\n",
"points = 10\n",
-"data = np.random.normal(0, 100, size = (n, points))\n",
+"time = np.linspace(0, 1, points)\n",
+"data = np.cumsum(np.random.randn(n, points), axis=1)\n",
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"We can create a suitable Datashader-compatible tidy dataframe using the utility:"
+"We can create a pandas dataframe from this numpy array directly, where each row in the resulting dataframe represents an independent trial of the random walk:"
]
},
{
@@ -419,15 +420,15 @@
"metadata": {},
"outputs": [],
"source": [
-"df = ds.utils.dataframe_from_multiple_sequences(np.arange(points), data)\n",
+"df = pd.DataFrame(data)\n",
"df.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"And then render it as usual:"
+"To render these lines, we set the `x` argument to `time` (the 1-dimensional numpy array containing the shared x-coordinates), the `y` argument to `list(range(points))` (a list of the column labels of `df` that contain the y-coordinates of each line), and the `axis` argument to `1` (indicating that each row of the dataframe represents a line, rather than each pair of columns)."
]
},
{
@@ -437,7 +438,7 @@
"outputs": [],
"source": [
"cvs = ds.Canvas(plot_height=400, plot_width=1000)\n",
-"agg = cvs.line(df, 'x', 'y', ds.count()) \n",
+"agg = cvs.line(df, x=time, y=list(range(points)), agg=ds.count(), axis=1)\n",
"img = tf.shade(agg, how='eq_hist')\n",
"img"
]
@@ -446,7 +447,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Here, the 10 high and low peaks each represent one of the 10 values in each sequence, with lines connecting those random values to the next one in the sequence. Thanks to the `eq-hist` colorization, you can see subtle differences in the likelihood of any particular pixel being crossed by these line segments, with the values towards the middle of each gap most heavily crossed as you would expect. You'll see a similar plot for 1,000,000 or 10,000,000 curves, and much more interesting plots if you have real data to show!"
+"Here, each line represents an independent trial of this random walk process. At time 0 (all the way to the left) the lines are positioned according to the standard normal distribution. At each time step, each line moves upward or downward from its prior position by a distance drawn from the standard normal distribution. Thanks to the `eq_hist` colorization, you can see the dispersion in the density of the overall distribution as time advances. You can also see the individual outliers at the extremes of the distribution. You'll see a similar plot for 1,000,000 or 10,000,000 curves, and much more interesting plots if you have real data to show!"
]
}
],
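
The diff above replaces a NaN-separated `x`/`y` dataframe with a one-row-per-curve dataframe. As a rough illustration of how the two layouts relate, the sketch below builds both from the same random-walk array using only NumPy and pandas. The helper `to_nan_separated`, the fixed seed, and the smaller `n` are assumptions made for this example; they are not part of the commit or of Datashader's API.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's data; the seed and the smaller n are assumptions.
rng = np.random.default_rng(0)
n, points = 1000, 10
time = np.linspace(0, 1, points)  # shared x-coordinates for every curve
data = np.cumsum(rng.standard_normal((n, points)), axis=1)

# New layout: one row per curve, one integer-labelled column per time step.
wide = pd.DataFrame(data)


def to_nan_separated(x, ys):
    """Hypothetical helper: flatten (n, points) curves into the older
    layout, with one NaN row after each curve so lines do not connect."""
    n_curves = ys.shape[0]
    xs = np.tile(np.append(x, np.nan), n_curves)
    padded = np.column_stack([ys, np.full(n_curves, np.nan)])
    return pd.DataFrame({"x": xs, "y": padded.ravel()})


long_df = to_nan_separated(time, data)

print(wide.shape)     # (1000, 10)
print(long_df.shape)  # (11000, 2)
```

Under these assumptions, `wide` has the shape expected by the new call in the diff, `cvs.line(df, x=time, y=list(range(points)), agg=ds.count(), axis=1)`, while `long_df` matches the shape expected by the removed call, `cvs.line(df, 'x', 'y', ds.count())`. The wide layout avoids storing a repeated `x` column and one separator row per curve.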
