Initial Dask trimesh support #696
Conversation
Force-pushed from a602b90 to bf47918:
…k dataframe. Computing the mesh still requires bringing the entire vertices/simplices dataframes into memory, but the resulting mesh is now a Dask dataframe with partitions that are chosen intentionally to not cause triangles to straddle partitions.
Force-pushed from bf47918 to ca5b977.
```diff
@@ -123,7 +123,8 @@ def __init__(self, x, y, z=None, weight_type=True, interp=True):

     @property
     def inputs(self):
-        return tuple([self.x, self.y] + list(self.z))
+        return (tuple([self.x, self.y] + list(self.z)) +
+                (self.weight_type, self.interpolate))
```
This change was needed because the `inputs` tuple is used by the parent class to implement hashing and equality, which are in turn used for memoization. Without this, I was seeing cases where repeated use of `canvas.trimesh` with different values for `interpolate` was not resulting in updated aggregation behavior.
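For context, a rough sketch of how a base class might derive hashing and equality from `inputs` (this is an assumption about the pattern, not datashader's actual base class):

```python
class GlyphBase:
    """Hypothetical base class: hashing/equality are derived from the
    inputs tuple, and memoized results are keyed on that hash, so every
    parameter that affects aggregation must appear in inputs."""

    @property
    def inputs(self):
        raise NotImplementedError

    def __hash__(self):
        # If interpolate/weight_type were left out of inputs, two glyphs
        # differing only in those parameters would hash the same and could
        # reuse each other's memoized aggregation results.
        return hash((type(self), self.inputs))

    def __eq__(self, other):
        return type(self) is type(other) and self.inputs == other.inputs
```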
```diff
-        vals = vals.reshape(np.prod(vals.shape[:2]), vals.shape[2])
-        res = pd.DataFrame(vals, columns=vertices.columns)
+        # TODO: For dask: avoid .compute() calls
+        res = _pd_mesh(vertices.compute(), simplices.compute())
```
We were calling `compute` on both `vertices` and `simplices` anyway, so I opted to just call the pandas version. In addition to making this more concise, the pandas version has winding auto-detection enabled, which was not previously enabled here.
@jbednar Ready for review.
Looks like good progress without making it more complex; thanks!
@jbednar tests passing!
Overview
This PR provides initial support for parallel aggregation of trimesh glyphs using dask.
Note: This PR is based on #694 as it relies on some of the refactoring performed in that PR.
Usage
To take advantage of dask trimesh support, the `datashader.utils.mesh` utility function should be called with dask `DataFrame`s for the `vertices` and `simplices` arguments. In this case, the resulting `mesh` DataFrame will be a dask DataFrame rather than a pandas DataFrame. When this dask mesh is passed into the `cvs.trimesh` function, the trimesh aggregations will be performed in parallel. For example, see the sketch below.
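A minimal usage sketch, assuming illustrative column names and data (the PR's original example is not reproduced here):

```python
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.utils as du
import datashader.transfer_functions as tf

# Illustrative vertex/simplex tables; the column names here are an assumption.
# Each simplex row holds the indices of its three vertices.
vertices = pd.DataFrame({'x': [0.0, 1.0, 0.0, 1.0],
                         'y': [0.0, 0.0, 1.0, 1.0],
                         'z': [1.0, 2.0, 3.0, 4.0]})
simplices = pd.DataFrame({'v0': [0, 3], 'v1': [1, 2], 'v2': [2, 1]})

# Wrap the inputs as dask DataFrames so du.mesh produces a dask mesh.
d_vertices = dd.from_pandas(vertices, npartitions=4)
d_simplices = dd.from_pandas(simplices, npartitions=4)

mesh = du.mesh(d_vertices, d_simplices)  # dask DataFrame, triangle-aligned partitions

cvs = ds.Canvas(plot_width=400, plot_height=400)
agg = cvs.trimesh(d_vertices, d_simplices, mesh=mesh)  # aggregated in parallel
img = tf.shade(agg)
```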
Implementation Notes
The job of the `du.mesh` function is to return a DataFrame containing the coordinates of every vertex of every triangle in the mesh, in the proper winding order. A triangle is represented in this data structure by three rows, one for each vertex. If a single vertex is used by more than one triangle, then the coordinates of that vertex will show up in multiple rows of this DataFrame.
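As an illustration of this layout (values made up for this sketch), two triangles that share an edge occupy six rows, with the shared vertices' coordinates repeated:

```python
import pandas as pd

# Hypothetical mesh layout: rows 0-2 are triangle 0, rows 3-5 are triangle 1.
# The two vertices shared by both triangles appear twice.
mesh = pd.DataFrame({
    'x': [0.0, 1.0, 0.0,   1.0, 0.0, 1.0],
    'y': [0.0, 0.0, 1.0,   0.0, 1.0, 1.0],
    'z': [1.0, 2.0, 3.0,   2.0, 3.0, 4.0],
})
print(mesh)
```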
One important characteristic of the updated `du.mesh` function when called with a dask DataFrame is that it ensures no triangle straddles a partition boundary in the output. This amounts to making sure that the number of rows in each partition is a multiple of 3. The function attempts to build the output dask DataFrame with the greater of the number of partitions in `vertices` and `simplices`, but the constraint to avoid breaking up triangles takes precedence over the number of partitions.
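A hedged sketch of this partitioning constraint (an illustrative helper, not the PR's actual code): distribute whole triangles, rather than rows, across partitions so each chunk length is a multiple of 3.

```python
import numpy as np

def triangle_aligned_chunks(n_rows, npartitions):
    """Illustrative helper (not from the PR): split n_rows (3 rows per
    triangle) into roughly equal chunks whose lengths are all multiples
    of 3, so no triangle straddles a partition boundary."""
    n_tris = n_rows // 3
    tris_per_part = np.full(npartitions, n_tris // npartitions)
    tris_per_part[: n_tris % npartitions] += 1   # spread the remainder
    return [int(3 * t) for t in tris_per_part if t > 0]

# e.g. 10 triangles (30 rows) over 4 partitions -> [9, 9, 6, 6]
print(triangle_aligned_chunks(30, 4))
```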
The speedup here is only in the call to `cvs.trimesh`; the call to `du.mesh` still requires pulling the `vertices` and `simplices` DataFrames into memory. Parallelizing this step in the calculation will take a bit more thought, and may require some spatial ordering of the input DataFrames.

Benchmarking
I ran some benchmark tests comparing the pandas aggregation with this new dask aggregation. These were run on a 2015 MBP with a quad-core processor. All of the dask tests were run using 4 partitions.
To scale the number of triangles, I used the Chesapeake Bay mesh (https://github.com/pyviz/datashader/blob/master/examples/topics/bay_trimesh.ipynb) and duplicated the simplices between 1 and 100 times. This scales from ~1 million to ~100 million triangles.
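A sketch of that scaling step, assuming a helper like the following (the PR's benchmark script is not shown, so this is illustrative only):

```python
import pandas as pd

def duplicate_simplices(simplices, times):
    # Repeat the simplex table `times` times to scale the triangle count;
    # the vertices are reused, so only the number of triangles grows.
    return pd.concat([simplices] * times, ignore_index=True)
```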
[Figure: pandas vs dask runtime]
[Figure: dask speedup factor (pandas runtime / dask runtime)]
So the dask implementation is about 1.2 times faster on 1 million triangles, and up to 4.3 times faster on 100 million triangles.