category_modulo and category_binning #927
Hmm ok, I gotta get the tests to work first, clearly... |
This looks great, thanks! I recently proposed implementing binning of numeric dimensions in #875 (comment) to make 3D aggregations, and it's fun to see it actually appear! Once the tests pass I'm happy to try it out and give feedback. |
Thanks, @jbednar! I'm quite keen for this and #926 to go in, because then I can release shadems on PyPI. This PR should also allow other interesting user-defined categorizers, for example an outer product over multiple categories, or a category remapper... I might need some help with the Travis build though. It seems everything passes except this one thing:
....which has absolutely nothing to do with any code I touched, so I'm at a loss where to start here. |
Actually it looks like master is currently failing in exactly the same way, so I'll just wait for you to fix it and merge master in again... |
Ok, I've asked @kebowen730 to look into that. Stay tuned! |
@jbednar with the latest changes to master all the tests pass. Could you please look into merging this? |
Sure, but at the moment I'm distracted by SciPy2020, and haven't had a chance to look at it yet. Soon! |
@jbednar any chance to look at this? |
Looks good; sorry for the delay in reviewing! It's hard to test it out without examples and tests, though. Would it be possible to add those?
Thanks for the review @jbednar, I've implemented the suggested fixes, and there's now a whole bunch of tests for the reductions added in. They pass for me locally. |
I think master itself is currently not passing tests (I also see something like "fixture 'benchmark' not found" failing elsewhere). So once master is fixed, this should be good to go. |
I've fixed tests on master, so could you please rebase? |
Good to go! |
Thanks so much for the contribution; this is really cool stuff! I'll have to think about how to highlight it in the docs; any suggestions or sample bits of code welcome! |
Sure, shall I just write something up right here and you can cut-and-paste appropriately? |
That would be great, thanks! |
Let's say you have a dataframe with columns named x, y, gender, weight and age.

Counting categories

The traditional count_cat reduction counts points per category:

    agg = canvas.points(df,'x','y', agg=ds.count_cat('gender'))

The resulting cube has three axes: x, y and gender. Each pixel in the cube will contain a count of the points of the appropriate gender that fall into the corresponding x,y bin. The by reduction can express the same count:

    agg = canvas.points(df,'x','y', agg=ds.by('gender', ds.count()))

Aggregating statistics by category

However, more elaborate reduction functions can also be supplied:

    agg = canvas.points(df,'x','y', agg=ds.by('gender', ds.mean('weight')))

This returns a 3D cube where each x, y, gender pixel gives the mean weight of that gender over the x, y bin.

Categorizing by non-categorical columns

The examples above only work with a categorical column type. What if one wanted to categorize by a non-categorical column such as age (or even weight)? This can be done by creating a categorizer object for that column, and giving it to by:

    cat = ds.category_modulo('age', modulo=10, offset=16)
    agg = canvas.points(df,'x','y', agg=ds.by(cat, ds.mean('weight')))

This returns a 3D cube containing 10 slices. The category is computed as

The previous example is, admittedly, contrived. Here is something more realistic: let us look at the standard deviation in weight for particular age brackets:

    cat = ds.category_binning('age', lower=20, higher=100, nbins=8, include_under=False, include_over=False)
    agg = canvas.points(df,'x','y', agg=ds.by(cat, ds.std('weight')))

This returns a 3D cube containing 9 (nbins+1) slices. Slice 0 gives the stddev in weight (per x, y bin) for ages [20,30), slice 1 for ages [30,40), ..., slice 7 for ages [90,100). The last slice, #8, is the "odd bin", i.e. it catches all "other" categories -- in this case, it gives the stddev in weight for ages below 20, over 100, and for an age of NaN (the latter would only be a possibility if age was a float, of course). If we were to give

Binning can also be done over a float-valued column:

    cat = ds.category_binning('weight', lower=0, higher=200, nbins=10)
    agg = canvas.points(df,'x','y', agg=ds.by(cat, ds.max('age')))

This returns a 3D cube containing 11 slices. Each point gives the maximum age over a particular x, y and weight bin. The last slice (#10) will catch negative and NaN weights, as well as weights >= 200.

Custom categorizers

It is possible to implement your own custom categorizations. You must derive a subclass from |
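The slice numbering described above for category_modulo and category_binning can be pictured with a small pure-Python sketch. It does not call datashader: the helper names modulo_category and binning_category are hypothetical, the formula (value - offset) % modulo is an assumption about how the modulo categorizer assigns indices, and the binning helper only models the include_under=False, include_over=False case from the examples.

```python
import math


def modulo_category(value, modulo, offset=0):
    """Hypothetical helper. Assumed rule: index = (value - offset) % modulo."""
    return (value - offset) % modulo


def binning_category(value, lower, higher, nbins):
    """Hypothetical helper: bin index in [0, nbins), or nbins for the "odd bin".

    Models only the case where out-of-range values are NOT folded into the
    first or last bin (include_under=False, include_over=False).
    """
    # NaN can only occur for float input; it goes to the catch-all slice.
    if isinstance(value, float) and math.isnan(value):
        return nbins
    # Bins are half-open, so the upper bound itself is out of range.
    if value < lower or value >= higher:
        return nbins
    width = (higher - lower) / nbins
    # min() guards against floating-point rounding at the top edge.
    return min(int((value - lower) // width), nbins - 1)


# Age brackets from the example: lower=20, higher=100, nbins=8.
ages = [20, 29, 30, 99, 100, 19, float("nan")]
age_slices = [binning_category(a, 20, 100, 8) for a in ages]
```

With these assumptions, ages 20 and 29 land in slice 0, age 30 in slice 1, age 99 in slice 7, and ages 100, 19 and NaN all land in slice 8, matching the description of the "odd bin".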
I suggested this in #907 -- I would like to humbly submit a proposed implementation.
Background
I often need to colourize (categorize) data using (a) indices from an integer column (possibly modulo some preset number of categories), or (b) by binning a float-valued column and assigning categories based on which bin a value falls into.
The data is typically too big to be in core, and is computed or loaded in a lazy fashion using dask dataframes and https://github.com/ska-sa/dask-ms. I have not found a way to map this into a Categorical column without triggering a computation (and the ensuing disk I/O), which defeats the purpose of the lazy-evaluation dask layer.
In addition, in #907 @maihde requested a mechanism for mapping a large number of categories into a reduced number of colours.
Proposal
This PR does the following:
- Tweaks the by() reduction so that it deals with an abstract "categorizer" object (rather than explicitly using category_codes or category_values).
- Implements two new categorizers, category_modulo and category_binning, derived from category_codes. These assign categories based on the two use cases described above.

The by constructor will now accept a categorizer object, as an alternative to a column name. Thus one can, for example, construct a by-reduction using 16 categories, with category 0 being 0<=x<10, category 1 being 10<=x<20, etc.

Calling the constructor with a column name implicitly constructs a category_codes, thus retaining the old behaviour.

TODO

- The cuda code path of these two new categorizers currently throws a NotImplementedError. It's probably trivial to implement in cuda (as a minimum, categorization can be done on the CPU, then a to_gpu_array() can be called), but I haven't got the knowledge.
- @maihde's simplify_categories() function can be implemented as another categorizer.
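The categorizer idea in the proposal can be sketched in plain Python. This is a toy model, not datashader's implementation: the class names ColumnCategorizer, ModuloCategorizer, RemapCategorizer and By are hypothetical, rows are plain dicts rather than dataframe columns, and the modulo rule (value - offset) % modulo is an assumption. It only illustrates how a reduction can accept either a column name or a categorizer object, and how a category remapper could be written as one more categorizer.

```python
class ColumnCategorizer:
    """Toy default: the column already holds integer category codes."""

    def __init__(self, column):
        self.column = column

    def categorize(self, row):
        return row[self.column]


class ModuloCategorizer(ColumnCategorizer):
    """Toy modulo categorizer (assumed rule: (value - offset) % modulo)."""

    def __init__(self, column, modulo, offset=0):
        super().__init__(column)
        self.modulo = modulo
        self.offset = offset

    def categorize(self, row):
        return (row[self.column] - self.offset) % self.modulo


class RemapCategorizer(ColumnCategorizer):
    """Toy remapper: collapse many codes onto fewer, with a default bucket."""

    def __init__(self, column, mapping, default):
        super().__init__(column)
        self.mapping = mapping
        self.default = default

    def categorize(self, row):
        return self.mapping.get(row[self.column], self.default)


class By:
    """Toy by-reduction that counts rows per category."""

    def __init__(self, cat):
        # A bare column name is wrapped in the default categorizer,
        # so existing call sites keep their old behaviour.
        self.categorizer = ColumnCategorizer(cat) if isinstance(cat, str) else cat

    def count(self, rows):
        counts = {}
        for row in rows:
            key = self.categorizer.categorize(row)
            counts[key] = counts.get(key, 0) + 1
        return counts


rows = [{"code": 0, "age": 16}, {"code": 1, "age": 27}, {"code": 2, "age": 36}]
by_name = By("code").count(rows)
by_modulo = By(ModuloCategorizer("age", modulo=10, offset=16)).count(rows)
by_remap = By(RemapCategorizer("code", {0: 0}, default=1)).count(rows)
```

Keeping the dispatch inside the constructor means the counting loop never needs to know which kind of categorizer it was given.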