Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements tree reduction in the dask layer #926

Merged
merged 19 commits into from
Nov 11, 2020

Conversation

o-smirnov
Copy link
Contributor

Following discussions in ratt-ru/shadeMS#29 (in particular, using small chunks in the dask layer would, counterintuitive, explode RAM usage), @sjperkins replaced the current chunk aggregation code in dask.py with a tree reduction.

This has been tested extensively with https://github.com/ratt-ru/shadeMS, and was found to reduce RAM usage considerably. Not tested in a CUDA context at all though, so if somebody more knowledgable than me can take a look at it, that'd be great.

@sjperkins
Copy link

It looks like a95dd1a doesn't handle the full range of types that can be present on Dataframe Arrays and is breaking the test suite. I'll work on a fix.

@sjperkins
Copy link

@jbednar I'd be interested in your thoughts on this PR and the use of a Reduction Operator to aggregate the images produced in each chunk.

@o-smirnov
Copy link
Contributor Author

@jbednar any chance to look at this?

@o-smirnov
Copy link
Contributor Author

@jbednar, ping. It would be very nice to get this (and #927) merged so we can do a proper shadeMS release in time for ADASS. I attach a copy of my poster in the hope that it motivates: shadeMS ADASS Poster.pdf

Copy link
Member

@philippjfr philippjfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of merging this, but are there any concrete benchmarks about the impact on performance and memory usage?

datashader/data_libraries/dask.py Outdated Show resolved Hide resolved
Copy link
Member

@jbednar jbednar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here are way over my head, so I have no idea how to tell whether they are a good idea. The basic idea of a tree reduction definitely makes sense, but I wouldn't be able to detect problems with this implementation if there are any, and as @philippjfr suggests it would be good to know what the actual impact on performance is, given that it's a clearly more complex approach than the current one.

@@ -70,18 +76,103 @@ def default(glyph, df, schema, canvas, summary, cuda=False):
y_mapper = canvas.y_axis.mapper
extend = glyph._build_extend(x_mapper, y_mapper, info, append)

def chunk(df):
# Here be dragons
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but explaining precisely what those dragons are would be helpful! :-)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low-level dask graphs can represent a very broad range of operations and data. Dask Arrays and Dataframes are simply metadata describing the layer of an associated Low-Level Dask Array. For example:

>>> dsk = {
      ('a', 0, 0): (np.full, (2, 2), 1),
      ('a', 0, 1): (np.full, (2, 2), 2),
      ('a', 1, 0): (np.full, (2, 2), 3),
      ('a', 1, 1): (np.full, (2, 2), 4)}

>>> chunks = ((2, 2), (2, 2))
>>> array = da.Array(dsk, 'a', chunks, meta=np.empty((0,0), dtype=np.int32))
>>> array
dask.array<a, shape=(4, 4), dtype=int32, chunksize=(2, 2), chunktype=numpy.ndarray>
>>> array.compute()                                                                                                                                                      
Out[7]: 
array([[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]])

In the above example, the low-level graph, chunks describe the purported size of the arrays produced by the 4 dask tasks, while the meta describes the type of each chunk, here represented with an empty array. This is to support arrays other that NumPy (e.g. sparse arrays). At low-level graph

The dragon is that it is entirely possibly to construct dsk in a way that disagrees with the metadata and this doesn't matter at all at the graph level. The only time there's a problem is:

  1. if array.compute() is directly called (Because dask will try to construct an array of type meta)
  2. If the output of arrays tasks are used as input to tasks that expect ndarrays.

(1) doesn't apply in this case and here, a dask Array is constructed from tasks producing Pandas Dataframes which are passed into tasks (combine, aggregate), that expect Dataframes, circumventing (2).

The final dask Array produced by the reduction describes a task that does return an ndarray so everything works out, under the hood.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so in summary, no issues in the way it's currently used in the codebase (given that everything goes via dataframes), just leaving a signpost for future developers to look out for if they start hacking the code for another purpose?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I can leave a link to the explanation in the code.

dtype = np.result_type(*dtypes)
# Create a meta object so that dask.array doesn't try to look
# too closely at the type of the chunks it's wrapping
# they're actually dataframes, tell dask they're ndarrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

@jbednar
Copy link
Member

jbednar commented Nov 2, 2020

Oh, and the poster looks amazing! I'm so glad that Datashader has been useful for this, and I apologize that it's taken so long for me to catch up to all the cool things you guys are doing! I particularly like your "Principle of Maximal Beigeness", and plan to use that in my own work. :-)

@o-smirnov
Copy link
Contributor Author

but are there any concrete benchmarks about the impact on performance and memory usage?

I ran a few measurements for ratt-ru/shadeMS#29 and ratt-ru/shadeMS#34 back in the day. Basically, without this PR, my plotting problems were blowing out the RAM (on a 512GB box), so the comparison was rather binary. But I can try to do some tests with a reduced dataset.

@o-smirnov
Copy link
Contributor Author

@philippjfr, I added some benchmarks here: ratt-ru/shadeMS#34 (comment).

Bottom line is, it depends on many factors, but the tree reduction version never does worse, and in some regimes (big problems, medium chunk sizes) does considerably better.

@o-smirnov
Copy link
Contributor Author

This is perhaps the most striking example of where the master branch almost runs aground -- check that RAM curve just going up and up -- while tree reduction just trundles along happily. (This tends to happen when a large number of categories, 64 in this case, is employed):

master version tree reduction
image image

@jbednar
Copy link
Member

jbednar commented Nov 11, 2020

I've fixed tests on master, so could you please rebase?

@o-smirnov
Copy link
Contributor Author

All good to go.

@jbednar jbednar merged commit 1f54bcf into holoviz:master Nov 11, 2020
@jbednar
Copy link
Member

jbednar commented Nov 11, 2020

Thanks for the contribution! Sorry I couldn't get the tests fixed in time for the conference presentation; all I accomplished last week is checking election results about 40,000 times a day. :-/ Should be ready to make a new release soon.

@o-smirnov
Copy link
Contributor Author

Aye, I think the election immobilized everyone... anyway, I realized we'll need a release anyway before I can release shadeMS to PyPI, so it was never going to be on time. Motivated people have been able to install off the github branch anyway.

@jbednar
Copy link
Member

jbednar commented Nov 11, 2020

I've tagged a dev release that's building now; see https://travis-ci.org/github/holoviz/datashader/builds/743042551 . If it completes successfully, you'll be able to do conda install -c pyviz/label/dev datashader=0.11.2a1 to try it out with this PR's contents and also #927's.

@o-smirnov
Copy link
Contributor Author

Thanks -- any idea why I can't pip install --pre datashader==0.11.2a1? The badge for the dev release is up on https://pypi.org/project/datashader/, but pip refuses to see it. I've never played with pip and dev releases before, so I'm fully ignorant here...

@jbednar
Copy link
Member

jbednar commented Nov 12, 2020

The conda package is up and I've verified that it installs successfully using conda install -c pyviz/label/dev datashader=0.11.2a1. Looks like the pip build didn't run because of a previous unrelated job, so I restarted it. When https://travis-ci.org/github/holoviz/datashader/jobs/743042557 completes I'd hope that either a0 or a1 will be installable with pip (not sure which one this run is, as I tagged it twice! :-).

@o-smirnov
Copy link
Contributor Author

Build failed:

pkg_resources.DistributionNotFound: The 'rfc3986>=1.4.0' distribution was not found and is required by the application

@jbednar
Copy link
Member

jbednar commented Nov 12, 2020

Yep; looks like twine has undeclared dependencies. :-( I've forced those to be installed, with the new build at https://travis-ci.org/github/holoviz/datashader/builds/743245089

(It's been running for nearly 3 hours now!). The result will be v0.11.2a2 if it succeeds.

@jbednar
Copy link
Member

jbednar commented Nov 13, 2020

It did not; apparently a further missing dependency. Another try: https://travis-ci.org/github/holoviz/datashader/builds/743321413

@o-smirnov
Copy link
Contributor Author

Failed again, sadly...

@jbednar
Copy link
Member

jbednar commented Nov 13, 2020

Aye. Still building with another minor tweak. Rather than me posting all the URLs each time, just look for a green one with a version number in this table: https://travis-ci.org/github/holoviz/datashader/builds/
Then when there is one, that's the version that can be installed on pip. (It's already installable on conda.)

Note that this is only a dev release. To get a real release, we have to make sure the new categorical stuff works ok on GPUs (even if just to raise an appropriate error message), fix line rendering on GPUs (broken by a different PR), and firmly commit to abandoning Python2 (as the recent PRs are only working in Python3). Not trivial!

@jbednar
Copy link
Member

jbednar commented Nov 13, 2020

The pip build finally completed, and says that it uploaded datashader-0.11.2a5.tar.gz to https://test.pypi.org/legacy/ (https://travis-ci.org/github/holoviz/datashader/jobs/743444276). I assume that means that it should be available from PyPI, but pip install --pre datashader==0.11.2a5 doesn't find any alpha releases. Anyone know whether the upload command needs updating? I don't have any experience with pip dev releases myself...

@o-smirnov
Copy link
Contributor Author

Not me... @Athanaseus or @gijzelaerr, any idea?

@gijzelaerr
Copy link

gijzelaerr commented Nov 15, 2020

I think probably PyPI doesn't handle version extensions like a* very well, and I recommend against using them in general, just use simple x.y.z versioning.

@o-smirnov
Copy link
Contributor Author

Thing is, I remember pushing out packages like radiopadre 1.0pre9, and PyPI was fine with that (and knew that 1.0pre9 < 1.0). So I'm not sure what exactly is different here!

@philippjfr
Copy link
Member

The difference is that we upload dev releases to test.pypi.org not the main pypi.org repo. We've long debated changing that but for now @jbednar or I could manually copy it over.

@jbednar
Copy link
Member

jbednar commented Nov 15, 2020

If pip handles dev releases properly at the main repo, i.e. pip install datashader only installs actual releases, then I think they should automatically go to the main repo. Alternatively, is there an easy way for people to install dev releases from test.pypi.org?

@jbednar
Copy link
Member

jbednar commented Nov 15, 2020

In any case, I do seem to remember pip now handling dev releases fine, though I don't remember the details, and we definitely aren't ready to make a full x.y.z release while CUDA and py2 are so badly broken.

@philippjfr
Copy link
Member

Yes support for --pre is pretty universal at this point, only very old pip installations won't handle it I believe. For now you can add --index-url https://test.pypi.org --pre to install from the other repo.

@o-smirnov
Copy link
Contributor Author

For now you can add --index-url https://test.pypi.org --pre to install from the other repo

But I can't put that into my setup.py, or can I? For dev purposes I'm fine installing straight off github -- it's for the pre-release that I need a package on pypi.

@gijzelaerr
Copy link

2 things:

  • afaik test.pypi.org is for testing the upload of your package, not for alpha packages.
  • A custom upload repo is a pypi configuration setting, and should be set in the pypi configuration.

@philippjfr philippjfr added this to the v0.12.0 milestone Dec 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants