Plotting many lines as separate columns takes a long time #286
Out of curiosity, I wanted to see how fast bokeh could do these same plots (not going to look good, but I was interested in comparing the timings). Bokeh was much faster, but its time appears to increase at an n^2 rate instead of the n log(n) rate, so it will soon cross over.
Running in a notebook should have the same speed as a bare script, for something like this. Here it seems like you are aggregating each line separately, then summing over all of them? I haven't tried that approach with large numbers of lines, and I'm not sure what its computational complexity would be. What we usually do is to have all such lines in a single data frame, separated by all-NaN rows, in which case it's approximately linear in the total number of points (assuming the total number of points is much greater than the number of lines).
@jbednar can you provide an example of what you are describing, perhaps using the example sine data I have above? My implementation was patterned after the time series notebook you have on Anaconda Cloud. Here is the core bit of code I am timing:
That was the best example I found of plotting many lines.
See the OpenSky example, which has about 200,000 separate trajectories. The timeseries notebook focuses on a few curves with very many samples each, so it's maybe not the best starting point for this case.
In the OpenSky example, it is not clear how a single flight trajectory is stored in the DataFrame. Are these simply straight lines between two points? The city-specific plots make it look like that is not the case. I am not sure how I would reformat many lines from a 2-column text file into a comparable DataFrame. I have tens of thousands of data files. As they all have the same x-points, I currently load all the y-values into a NumPy array. I could just as easily put each set of y-values into a column in a DataFrame, but that is nothing like what the OpenSky DataFrame looks like. It almost looks like it was drawn by connecting subsequent longitude-latitude points, but this would draw lines between where one plane lands and the next takes off (unless it is all the same plane, which does not seem likely). Mimicking this with thousands of sine curves, I would end up with lines from the last point of one curve to the first point of the next curve. How do you avoid this?
Line segments, time series, and trajectories are all plotted using the same underlying Line glyph support. Fundamentally, datashader accepts a dataframe where each row represents one point along a multiline, and it will then connect that point to the point provided in the subsequent row. You are correct that doing so would falsely connect unrelated lines, but as I briefly mentioned above, whenever a row has a NaN in the x or y column (or usually both), that row is treated as a break between lines, and the preceding and subsequent points are not connected. So, if you have 5 curves of 10 points each, the dataframe would store the 10 (x,y) pairs from the first curve, then (NaN,NaN), then the 10 (x,y) pairs from the second curve, and so on, ending up with 54 or 55 rows in the dataframe, not 50. Creating such a data structure from numpy arrays shouldn't be very difficult:
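A minimal sketch of one way to build such a NaN-separated frame, assuming all curves share the same x values (the array names and sine data here are only illustrative):

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 2 * np.pi, 10)              # shared x values, 10 points per curve
curves = [np.sin(k * x) for k in range(1, 6)]  # 5 example curves

pieces = []
for y in curves:
    pieces.append(pd.DataFrame({'x': x, 'y': y}))
    pieces.append(pd.DataFrame({'x': [np.nan], 'y': [np.nan]}))  # break between lines

df = pd.concat(pieces, ignore_index=True)      # 55 rows: 5 * 10 points + 5 NaN separators
```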
(But note that the released version of datashader has a bug in the line support, skipping over the last or first point (can't remember which), so you should use the github master until we can do a new release.)
This is MUCH faster! Here are the timing comparisons. This prompts me to ask, why not use this method of rearranging the data for the time series data? With any number of data sets, reformatting the data into two columns is orders of magnitude faster (~4000x faster for 8 data sets, close to the 10 used in the time series data). I also wonder what it would take to apply the faster functionality of the two-column method to many columns. Here is the code to reformat the above data and apply the OpenSky method (with the timing bits).
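A rough sketch of the two-column ("OpenSky") plotting step with a simple timer, assuming a NaN-separated `df` built as in the sketch above; the canvas size is arbitrary:

```python
import time

import datashader as ds
import datashader.transfer_functions as tf

start = time.time()
cvs = ds.Canvas(plot_width=600, plot_height=400)
agg = cvs.line(df, 'x', 'y')   # one aggregation over all NaN-separated lines at once
img = tf.shade(agg)
print('aggregation + shading: {:.3f} s'.format(time.time() - start))
```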
This provides a complete example (without the timing bits). Perhaps this could go in the gallery examples (#288):
While it remains awkward to plot NumPy arrays, I think the second method resolves the speed issue. In connection with issue #283, it would be nice to make the input method for plotting lines from NumPy arrays more intuitive, which would remove the awkward DataFrame generation. For now, the two-column DataFrame seems a functional option.
Thanks @StevenCHowell for this helpful discussion. Also thanks to @jbednar for suggesting that we use datashader in response to our paper at SciPy 2017 (http://conference.scipy.org/proceedings/scipy2017/narendra_mukherjee.html). I was just wondering what the current state of affairs is for plotting multiple sequences stored in a numpy array with datashader. Although the workaround described above works very well on the memory-consumption side of things for me (it reduces memory use by about a factor of 4 compared to plainly overlaying the sequences on top of each other with matplotlib), creating the pandas dataframe consumes a significant amount of time compared to everything else. Here I am plotting ~65000 time series, each with 45 time points; here are some rough timing numbers:
Creating the pandas dataframe is by far the slowest step of the process - it takes almost a minute. While that is not that significant for plotting just one set of 65000 curves, I usually have to plot 40-50 of these sets in one go. That, obviously, takes a lot of time :|
Hmm. We too are a bit frustrated here, but I don't know the best way forward. The NaN-separated format is fast to plot, but it is very expensive to create, and not very usable once it's created. E.g. there's no obvious way to index into it to select or filter out individual polylines. We could support directly iterating over a multidimensional numpy or xarray array of polylines, but unlike the NaN-separated format, that approach would be limited to equal-length lines, which is only one of many common cases. For instance, when rendering network graphs, we want to plot a very large number of short polylines of variable lengths. It would be great if there were a very efficient ragged array format that could fit all of these requirements, i.e. something indexable, quickly iterable in Numba code, and memory efficient. I'm pretty sure there isn't one readily available in Python, but as fixed-length arrays are an important special case for graph rendering (directly connected nodes), maybe we should just bite the bullet and support the special case of large Numpy/xarray arrays of fixed-length polylines? @philippjfr, what do you think?
Since it doesn't seem like we are going to solve the ragged array issue any time soon, I'd be in favor of adding some handling to support fixed-length arrays in the short term. What I'm not clear on is which format we'd want to support: the most obvious options are a dataframe with an index for the x-values and N columns for the y-values, or simply an array where the first column represents the x-values and the remaining columns the y-values. The dataframe format seems to make more sense, since that's what we support already.
Sorry for the confusion @jbednar and @philippjfr - I think the dataframe appending step in the solution posted above was the slow step, and I just copied it lazily. Here's a much faster implementation that avoids the appending step with some simple Numpy operations:
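A sketch along those lines, assuming the curves are equal length and stored as rows of a 2-D NumPy array (the function and variable names are illustrative, not the exact code referenced above):

```python
import numpy as np
import pandas as pd

def to_nan_separated(x, Y):
    """Flatten equal-length curves (rows of Y) into a NaN-separated two-column frame."""
    n_curves, n_points = Y.shape
    xs = np.tile(np.append(x, np.nan), n_curves)                     # x values + NaN per curve
    ys = np.column_stack([Y, np.full((n_curves, 1), np.nan)]).ravel()  # y values + NaN per curve
    return pd.DataFrame({'x': xs, 'y': ys})

# e.g. ~65000 curves of 45 points each, built with no per-curve DataFrame appends
x = np.arange(45)
Y = np.random.randn(65000, 45).cumsum(axis=1)
df = to_nan_separated(x, Y)
```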
That's a simple fix to the problem, at least for equal-length sequences - as long as the appending part is not done in pandas, the speed issue shouldn't be a problem. Maybe we could put this up as a cleaner example of how to deal with a large number of sequences?
We can certainly provide utility functions for converting common data structures that are used in practice into ones that datashader can deal with directly. I'd be happy to see the above generalized and put into a function in ds.utils. With that in place we can then determine if there are big savings to be had by supporting a fixed-length array type directly.
Cool, I will have a look at ds.utils and tseries.ipynb and put in a PR soon :)
I am hoping there is a misunderstanding or misuse on my part, but plotting lots of lines with Datashader is taking much longer than I would expect, at least compared to plotting points. To provide an example case, I wrote the following code, which demonstrates what I am seeing. As a summary, this figure shows how many seconds it takes to plot n lines, each with 50 data points. While this shows it scales as n log(n), better than n^2, the large amount of time, over 4.5 hours for 8192 lines, makes me wonder if it is actually running optimized Numba code. Again, I hope this is simply misuse on my part, as I want to use this to plot over 100,000 lines. I have gotten similar results when running on a Linux desktop (12 cores, 32 GB RAM) and on my Mac laptop. I did run this in a Jupyter notebook; would it perform better in a Python script?
Here is the code I am running:
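A hedged reconstruction of the per-column approach being timed, aggregating each line's column separately and summing the aggregates (not the exact script; the line count and sine data are placeholders):

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

n_lines, n_points = 128, 50
x = np.linspace(0, 4 * np.pi, n_points)
df = pd.DataFrame({'x': x})
for i in range(n_lines):
    df['y{}'.format(i)] = np.sin(x + 0.1 * i)   # one column per line

cvs = ds.Canvas(plot_width=400, plot_height=400)
# Aggregate each line separately, then sum the per-line aggregates and shade.
agg = sum(cvs.line(df, 'x', 'y{}'.format(i)) for i in range(n_lines))
img = tf.shade(agg)
```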
Here is the last image (using 8192 of the total 100000 lines):