Implements tree reduction in the dask layer #926
Conversation
Tree reductions are go
Quick fix for pandas Categorical dtypes
It looks like a95dd1a doesn't handle the full range of types that can be present on Dataframe Arrays and is breaking the test suite. I'll work on a fix.
Improve dtype inspection
@jbednar I'd be interested in your thoughts on this PR and the use of a Reduction Operator to aggregate the images produced in each chunk.
@jbednar any chance to look at this?
@jbednar, ping. It would be very nice to get this (and #927) merged so we can do a proper shadeMS release in time for ADASS. I attach a copy of my poster in the hope that it motivates: shadeMS ADASS Poster.pdf
I'm in favor of merging this, but are there any concrete benchmarks about the impact on performance and memory usage?
The changes here are way over my head, so I have no idea how to tell whether they are a good idea. The basic idea of a tree reduction definitely makes sense, but I wouldn't be able to detect problems with this implementation if there are any, and as @philippjfr suggests it would be good to know what the actual impact on performance is, given that it's a clearly more complex approach than the current one.
@@ -70,18 +76,103 @@ def default(glyph, df, schema, canvas, summary, cuda=False):
    y_mapper = canvas.y_axis.mapper
    extend = glyph._build_extend(x_mapper, y_mapper, info, append)

    def chunk(df):
        # Here be dragons
Agreed, but explaining precisely what those dragons are would be helpful! :-)
Low-level dask graphs can represent a very broad range of operations and data. Dask Arrays and DataFrames are simply metadata describing an associated layer of a low-level dask graph. For example:
>>> import numpy as np
>>> import dask.array as da
>>> dsk = {
...     ('a', 0, 0): (np.full, (2, 2), 1),
...     ('a', 0, 1): (np.full, (2, 2), 2),
...     ('a', 1, 0): (np.full, (2, 2), 3),
...     ('a', 1, 1): (np.full, (2, 2), 4)}
>>> chunks = ((2, 2), (2, 2))
>>> array = da.Array(dsk, 'a', chunks, meta=np.empty((0, 0), dtype=np.int32))
>>> array
dask.array<a, shape=(4, 4), dtype=int32, chunksize=(2, 2), chunktype=numpy.ndarray>
>>> array.compute()
array([[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]])
In the above example, dsk is the low-level graph, chunks describes the purported sizes of the arrays produced by the 4 dask tasks, and meta describes the type of each chunk, here represented with an empty array. This exists to support array types other than NumPy (e.g. sparse arrays).
The dragon is that it is entirely possible to construct dsk in a way that disagrees with the metadata, and this doesn't matter at all at the graph level. The only time there's a problem is:

1. if array.compute() is directly called (because dask will try to construct an array of type meta), or
2. if the outputs of the array's tasks are used as input to tasks that expect ndarrays.
(1) doesn't apply in this case, and here a dask Array is constructed from tasks producing Pandas DataFrames, which are passed into tasks (combine, aggregate) that expect DataFrames, circumventing (2).
The final dask Array produced by the reduction describes a task that does return an ndarray, so everything works out under the hood.
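For concreteness, here is a minimal sketch of that pattern (illustrative names and shapes only, not the code in this PR): chunk tasks that really return pandas DataFrames are wrapped behind a dask Array whose meta claims ndarray chunks, and those chunks are only ever consumed by an aggregate task that itself returns a real ndarray.

# Sketch only: chunk/aggregate and the shapes below are illustrative
# assumptions, not datashader's actual implementation.
import numpy as np
import pandas as pd
import dask.array as da

def chunk(i):
    # Per-chunk "aggregation" that returns a DataFrame, not the ndarray
    # the Array metadata advertises.
    return pd.DataFrame({"x": [i, i], "y": [10 * i, 10 * i]})

def aggregate(*dfs):
    # Combines the DataFrame chunks and returns a genuine ndarray,
    # so the final Array's metadata is honest again.
    return sum(df.to_numpy() for df in dfs)

chunk_name, agg_name = "chunked", "aggregated"
dsk = {(chunk_name, i): (chunk, i) for i in range(4)}
dsk[(agg_name, 0, 0)] = (aggregate,) + tuple((chunk_name, i) for i in range(4))

# meta lies about the chunk type; nothing ever computes the DataFrame-backed
# tasks directly as an array, so dask never notices the mismatch.
result = da.Array(dsk, agg_name, chunks=((2,), (2,)),
                  meta=np.empty((0, 0), dtype=np.int64))
print(result.compute())   # [[ 6 60]
                          #  [ 6 60]]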
Right, so in summary, no issues in the way it's currently used in the codebase (given that everything goes via dataframes), just leaving a signpost for future developers to look out for if they start hacking the code for another purpose?
Yep, I can leave a link to the explanation in the code.
    dtype = np.result_type(*dtypes)
    # Create a meta object so that dask.array doesn't try to look
    # too closely at the type of the chunks it's wrapping
    # they're actually dataframes, tell dask they're ndarrays
Why?
Oh, and the poster looks amazing! I'm so glad that Datashader has been useful for this, and I apologize that it's taken so long for me to catch up to all the cool things you guys are doing! I particularly like your "Principle of Maximal Beigeness", and plan to use that in my own work. :-)
I ran a few measurements for ratt-ru/shadeMS#29 and ratt-ru/shadeMS#34 back in the day. Basically, without this PR, my plotting problems were blowing out the RAM (on a 512GB box), so the comparison was rather binary. But I can try to do some tests with a reduced dataset.
@philippjfr, I added some benchmarks here: ratt-ru/shadeMS#34 (comment). Bottom line is, it depends on many factors, but the tree reduction version never does worse, and in some regimes (big problems, medium chunk sizes) does considerably better.
I've fixed tests on master, so could you please rebase?
All good to go.
Thanks for the contribution! Sorry I couldn't get the tests fixed in time for the conference presentation; all I accomplished last week is checking election results about 40,000 times a day. :-/ Should be ready to make a new release soon.
Aye, I think the election immobilized everyone... anyway, I realized we'll need a release anyway before I can release shadeMS to PyPI, so it was never going to be on time. Motivated people have been able to install off the github branch anyway.
I've tagged a dev release that's building now; see https://travis-ci.org/github/holoviz/datashader/builds/743042551 . If it completes successfully, you'll be able to do …
Thanks -- any idea why I can't …
The conda package is up and I've verified that it installs successfully using …
Build failed: …
Yep; looks like twine has undeclared dependencies. :-( I've forced those to be installed, with the new build at https://travis-ci.org/github/holoviz/datashader/builds/743245089 (It's been running for nearly 3 hours now!). The result will be v0.11.2a2 if it succeeds.
It did not; apparently a further missing dependency. Another try: https://travis-ci.org/github/holoviz/datashader/builds/743321413
Failed again, sadly...
Aye. Still building with another minor tweak. Rather than me posting all the URLs each time, just look for a green one with a version number in this table: https://travis-ci.org/github/holoviz/datashader/builds/ Note that this is only a dev release. To get a real release, we have to make sure the new categorical stuff works ok on GPUs (even if just to raise an appropriate error message), fix line rendering on GPUs (broken by a different PR), and firmly commit to abandoning Python2 (as the recent PRs are only working in Python3). Not trivial!
The pip build finally completed, and says that it uploaded datashader-0.11.2a5.tar.gz to https://test.pypi.org/legacy/ (https://travis-ci.org/github/holoviz/datashader/jobs/743444276). I assume that means that it should be available from PyPI, but …
Not me... @Athanaseus or @gijzelaerr, any idea?
I think probably PyPI doesn't handle version extensions like …
Thing is, I remember pushing out packages like radiopadre 1.0pre9, and PyPI was fine with that (and knew that 1.0pre9 < 1.0). So I'm not sure what exactly is different here!
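(For what it's worth, PEP 440 does order suffixes like these before the final release; a quick check with the packaging library, included here purely as an illustration:)

# Illustration only: PEP 440 ordering of pre-release suffixes,
# verified with the `packaging` library.
from packaging.version import Version

assert Version("0.11.2a2") < Version("0.11.2")   # alpha sorts before final
assert Version("1.0pre9") < Version("1.0")       # "pre" normalizes to "rc"
print(Version("1.0pre9"))                        # -> 1.0rc9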
The difference is that we upload dev releases to test.pypi.org, not the main pypi.org repo. We've long debated changing that, but for now @jbednar or I could manually copy it over.
If pip handles dev releases properly at the main repo, i.e. …
In any case, I do seem to remember pip now handling dev releases fine, though I don't remember the details, and we definitely aren't ready to make a full x.y.z release while CUDA and py2 are so badly broken.
Yes, support for --pre is pretty universal at this point; only very old pip installations won't handle it, I believe. For now you can add --index-url https://test.pypi.org --pre to install from the other repo.
But I can't put that into my …
2 things: …
Following discussions in ratt-ru/shadeMS#29 (in particular, using small chunks in the dask layer would, counterintuitively, explode RAM usage), @sjperkins replaced the current chunk aggregation code in dask.py with a tree reduction.
This has been tested extensively with https://github.com/ratt-ru/shadeMS, and was found to reduce RAM usage considerably. It has not been tested in a CUDA context at all, though, so if somebody more knowledgeable than me could take a look at it, that'd be great.
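To illustrate the general idea (a sketch with assumed names and chunk counts, not the actual dask.py code): instead of a single task that combines every per-chunk aggregate at once, partial aggregates are combined pairwise, level by level, which lets the scheduler release chunk results early rather than holding all of them in memory simultaneously.

# Sketch of a flat vs. tree reduction over per-chunk aggregates.
# chunk_agg/combine and the chunk count are illustrative assumptions;
# the sketch assumes a power-of-two number of chunks.
import numpy as np
import dask

def chunk_agg(i):
    # stand-in for aggregating one dask partition onto a canvas
    return np.full((2, 2), i)

def combine(a, b):
    # stand-in for merging two partial aggregates
    return a + b

nchunks = 8
chunks = {("chunk", i): (chunk_agg, i) for i in range(nchunks)}

# Flat reduction: one task depends on every chunk at once.
flat = dict(chunks)
flat["agg"] = (lambda *parts: sum(parts),) + tuple(("chunk", i) for i in range(nchunks))

# Tree reduction: combine pairwise, log2(nchunks) levels deep.
tree = dict(chunks)
keys, level = [("chunk", i) for i in range(nchunks)], 0
while len(keys) > 1:
    next_keys = []
    for j in range(0, len(keys), 2):
        key = ("combine", level, j // 2)
        tree[key] = (combine, keys[j], keys[j + 1])
        next_keys.append(key)
    keys, level = next_keys, level + 1

# Both graphs produce the same result with the synchronous scheduler.
assert np.array_equal(dask.get(flat, "agg"), dask.get(tree, keys[0]))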