category_modulo and category_binning #927

Merged
merged 9 commits into from
Nov 11, 2020


o-smirnov
Contributor

I suggested this in #907 -- I would like to humbly submit a proposed implementation.

Background

I often need to colourize (categorize) data using (a) indices from an integer column (possibly modulo some preset number of categories), or (b) by binning a float-valued column and assigning categories based on which bin a value falls into.

The data is typically too big to be in core, and is computed or loaded in a lazy fashion using dask dataframes and https://github.com/ska-sa/dask-ms. I have not found a way to map this into a Categorical column without triggering a computation (and the ensuing disk I/O), which defeats the purpose of the lazy-evaluation dask layer.

In addition, in #907 @maihde requested a mechanism for mapping a large number of categories into a reduced number of colours.

Proposal

This PR does the following:

  • Tweaks the by() reduction so that it deals with an abstract "categorizer" object (rather than explicitly using category_codes or category_values)

  • Implements two new categorizers, category_modulo and category_binning, derived from category_codes. These assign categories based on the two use cases described above.

  • The by constructor will now accept a categorizer object, as an alternative to a column name. Thus e.g.

    by(category_binning('x', 0, 10, 16))
    

    constructs a by-reduction using 16 categories, with category 0 being 0<=x<10, category 1 being 10<=x<20, etc.

    Calling the constructor with a column name implicitly constructs a category_codes, thus retaining the old behaviour.
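
The dispatch described in the last bullet can be illustrated with a minimal, library-free sketch. The class and function names below (Categorizer, Codes, Modulo, as_categorizer) are hypothetical stand-ins invented for this illustration, not datashader's actual classes, and a "row" is just a dict here:

```python
class Categorizer:
    """Stand-in base class: maps one row to an integer category index."""

    def __init__(self, column):
        self.column = column

    def apply(self, row):
        raise NotImplementedError


class Codes(Categorizer):
    """Stand-in for category_codes: the column already holds integer codes."""

    def apply(self, row):
        return row[self.column]


class Modulo(Categorizer):
    """Stand-in for category_modulo: integer column folded into `modulo` categories."""

    def __init__(self, column, modulo):
        super().__init__(column)
        self.modulo = modulo

    def apply(self, row):
        return row[self.column] % self.modulo


def as_categorizer(cat):
    """Accept either a categorizer object or a bare column name."""
    if isinstance(cat, Categorizer):
        return cat
    if isinstance(cat, str):
        return Codes(cat)  # a column name keeps the old behaviour
    raise TypeError("expected a column name or a categorizer object")


row = {"gender": 1, "age": 37}
print(as_categorizer("gender").apply(row))           # 1
print(as_categorizer(Modulo("age", 10)).apply(row))  # 7
```

The point of the design is that the reduction only ever calls apply on whatever object it ends up with, so new categorization schemes do not require changes to the reduction itself.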

TODO

  • The cuda code path of these two new categorizers currently throws a NotImplementedError. It's probably trivial to implement in cuda (as a minimum, categorization can be done on the CPU, then a to_gpu_array() can be called), but I haven't got the knowledge.

  • @maihde's simplify_categories() function can be implemented as another categorizer.

@o-smirnov
Contributor Author

Hmm ok, I gotta get the tests to work first, clearly...

@jbednar
Member

jbednar commented Jun 16, 2020

This looks great, thanks! I recently proposed implementing binning of numeric dimensions in #875 (comment) to make 3D aggregations, and it's fun to see it actually appear! Once the tests pass I'm happy to try it out and give feedback.

@o-smirnov
Contributor Author

Thanks, @jbednar! I'm quite keen for this and #926 to go in, because then I can release shadems on PyPI.

This PR should also allow other interesting user-defined categorizers, for example an outer product over multiple categories, or a category remapper...

I might need some help with the Travis build though. It seems everything passes except this one thing:

Traceback (most recent call last):
  File "/home/travis/miniconda/envs/3.6/bin/datashader", line 11, in <module>
    load_entry_point('datashader', 'console_scripts', 'datashader')()
  File "/home/travis/build/holoviz/datashader/datashader/__main__.py", line 9, in main
    return pyct.cmd.substitute_main('datashader',args=args)
  File "/home/travis/miniconda/envs/3.6/lib/python3.6/site-packages/pyct/cmd.py", line 455, in substitute_main
    args.func(args) if hasattr(args,'func') else parser.error("must supply command to run")
  File "/home/travis/miniconda/envs/3.6/lib/python3.6/site-packages/pyct/cmd.py", line 394, in <lambda>
    parser.set_defaults(func=lambda args: fn(name, **{k: getattr(args,k) for k in vars(args) if k!='func'} ))
TypeError: fetch_data() got an unexpected keyword argument 'verbose'

...which has absolutely nothing to do with any code I touched, so I'm at a loss as to where to start here.

@o-smirnov
Contributor Author

> I might need some help with the Travis build though. It seems everything passes except this one thing:

Actually it looks like master is currently failing in exactly the same way, so I'll just wait for you to fix it and merge master in again...

@jbednar
Member

jbednar commented Jun 18, 2020

Ok, I've asked @kebowen730 to look into that. Stay tuned!

@o-smirnov
Contributor Author

@jbednar with the latest changes to master all the tests pass. Could you please look into merging this?

@jbednar
Member

jbednar commented Jul 8, 2020

Sure, but at the moment I'm distracted by SciPy2020, and haven't had a chance to look at it yet. Soon!

@o-smirnov
Contributor Author

@jbednar any chance to look at this?

Member

@jbednar jbednar left a comment

Looks good; sorry for the delay in reviewing! It's hard to test it out without examples and tests, though. Would it be possible to add those?

Three review comments on datashader/reductions.py (outdated, resolved)
@o-smirnov
Contributor Author

Thanks for the review @jbednar, I've implemented the suggested fixes, and there's now a whole bunch of tests for the reductions added in. They pass for me locally.

@o-smirnov
Contributor Author

I think master itself is currently not passing tests (I also see something like "fixture 'benchmark' not found" failing elsewhere). So once master is fixed, this should be good to go.

@jbednar
Member

jbednar commented Nov 11, 2020

I've fixed tests on master, so could you please rebase?

@o-smirnov
Contributor Author

Good to go!

@jbednar jbednar merged commit 86d4498 into holoviz:master Nov 11, 2020
@jbednar
Member

jbednar commented Nov 11, 2020

Thanks so much for the contribution; this is really cool stuff! I'll have to think about how to highlight it in the docs; any suggestions or sample bits of code welcome!

@jbednar jbednar changed the title suggested implementation for category_modulo and category_binning category_modulo and category_binning Nov 11, 2020
@jbednar jbednar mentioned this pull request Nov 11, 2020
@o-smirnov
Contributor Author

Sure, shall I just write something up right here and you can cut-and-paste appropriately?

@jbednar
Member

jbednar commented Nov 11, 2020

That would be great, thanks!

@o-smirnov
Contributor Author

Let's say you have a dataframe with columns named x, y, gender, age and weight. Gender is a categorical column with N categories, age is integer, the others are floats.

Counting categories

The traditional count_cat aggregator can be used to aggregate points categorized by gender.

    agg = canvas.points(df,'x','y', agg=ds.count_cat('gender'))

The resulting cube has three axes: x, y and gender. Each pixel in the cube will contain a count of the points of the appropriate gender that fall into the corresponding x,y bin.

The by aggregator is a generalization of this. It is constructed with two arguments: a column, and a reduction function. The following is an exact equivalent of ds.count_cat('gender'):

    agg = canvas.points(df,'x','y', agg=ds.by('gender', ds.count()))

Aggregating statistics by category

However, more elaborate reduction functions can also be supplied:

    agg = canvas.points(df,'x','y', agg=ds.by('gender', ds.mean('weight')))

This returns a 3D cube where each x, y, gender pixel gives the mean weight of that gender over the x, y bin.
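
To make the "mean per x, y, category cell" idea concrete, here is a minimal pure-Python sketch of what such a reduction computes. It is not how datashader implements it: the function by_mean and its arguments are made up for this illustration, the edge handling is simplified (half-open ranges), and the result is a sparse dict rather than a dense cube:

```python
from collections import defaultdict


def by_mean(points, width, height, x_range, y_range):
    """Mean of `value` per (row, col, category) cell.

    points: iterable of (x, y, category, value) tuples.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, cat, value in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # point falls outside the canvas
        col = int((x - x0) / (x1 - x0) * width)
        row = int((y - y0) / (y1 - y0) * height)
        sums[(row, col, cat)] += value
        counts[(row, col, cat)] += 1
    return {key: sums[key] / counts[key] for key in sums}


points = [
    (0.1, 0.1, "F", 60.0),
    (0.2, 0.3, "F", 70.0),
    (0.15, 0.2, "M", 80.0),
    (0.9, 0.9, "M", 90.0),
]
cube = by_mean(points, width=2, height=2, x_range=(0, 1), y_range=(0, 1))
print(cube[(0, 0, "F")])  # 65.0
print(cube[(0, 0, "M")])  # 80.0
print(cube[(1, 1, "M")])  # 90.0
```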

Categorizing by non-categorical columns

The examples above only work with a categorical column type. What if one wanted to categorize by a non-categorical column such as age (or even weight)? This can be done by creating a categorizer object for that column, and giving it to by:

    cat = ds.category_modulo('age', modulo=10, offset=16)
    agg = canvas.points(df,'x','y', agg=ds.by(cat, ds.mean('weight')))

This returns a 3D cube containing 10 slices. The category is computed as (age - 16) % 10. Thus, the first slice will aggregate the mean weight (over an x, y bin) for ages 16, 26, 36, ..., the second slice for ages 17, 27, 37, ..., and so on.
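
The slice assignment can be checked by hand with a few lines of plain Python. This is only the arithmetic from the formula above, written as a hypothetical helper (modulo_category is not a datashader function):

```python
def modulo_category(value, modulo, offset=0):
    """Category index for an integer value: (value - offset) % modulo."""
    return (value - offset) % modulo


for age in (16, 26, 36, 17, 27, 15):
    print(age, modulo_category(age, modulo=10, offset=16))
# 16 0
# 26 0
# 36 0
# 17 1
# 27 1
# 15 9
```

Note the last line: in this Python sketch an age below the offset wraps around to the end (15 lands in slice 9), because Python's % returns a non-negative result for a positive modulus.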

The previous example is, admittedly, contrived. Here is something more realistic: let us look at the standard deviation in weight for particular age brackets:

    cat = ds.category_binning('age', lower=20, upper=100, nbins=8, include_under=False, include_over=False)
    agg = canvas.points(df,'x','y', agg=ds.by(cat, ds.std('weight')))

This returns a 3D cube containing 9 (nbins+1) slices. Slice 0 gives the stddev in weight (per x, y bin) for ages [20,30), slice 1 for ages [30,40), ..., slice 7 for ages [90,100). The last slice, #8, is the "odd bin", i.e. it catches everything that falls outside the regular bins -- in this case, it gives the stddev in weight for ages below 20, ages of 100 and over, and for an age of NaN (the latter would only be a possibility if age was a float, of course).

With include_under=True (the default), ages below 20 would instead be included in the first bin, nominally [20,30). Likewise, with include_over=True (also the default), ages >= 100 would be included in the last bin, nominally [90,100). With both enabled, the "odd bin" will only contain points with NaN ages.
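
The bin assignment rules described above can be sketched in plain Python. This is a hand-written illustration of the described behaviour, not datashader's implementation; the helper bin_category and its argument names are chosen for this sketch only:

```python
import math


def bin_category(value, lower, upper, nbins, include_under=True, include_over=True):
    """Slice index for a value; index `nbins` is the "odd bin"."""
    if math.isnan(value):
        return nbins  # NaN always lands in the odd bin
    if value < lower:
        return 0 if include_under else nbins
    if value >= upper:
        return nbins - 1 if include_over else nbins
    binsize = (upper - lower) / nbins
    # min() guards against floating-point rounding at the top edge
    return min(int((value - lower) // binsize), nbins - 1)


ages = [19, 20, 29, 30, 99, 100, float("nan")]

strict = [bin_category(a, 20, 100, 8, include_under=False, include_over=False) for a in ages]
print(strict)   # [8, 0, 0, 1, 7, 8, 8]

clamped = [bin_category(a, 20, 100, 8) for a in ages]
print(clamped)  # [0, 0, 0, 1, 7, 7, 8]
```

With both flags off, out-of-range ages go to slice 8 along with NaN; with both on, they are folded into the first and last regular bins and only NaN reaches slice 8.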

Binning can also be done over a float-valued column:

    cat = ds.category_binning('weight', lower=0, upper=200, nbins=10)
    agg = canvas.points(df,'x','y', agg=ds.by(cat, ds.max('age')))

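This returns a 3D cube containing 11 slices. Each pixel gives the maximum age over a particular x, y and weight bin. Since include_under and include_over are left at their default of True here, negative weights are folded into the first bin and weights >= 200 into the last bin, so the last slice (#10) will only catch NaN weights.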

Custom categorizers

It is possible to implement your own custom categorizations. You must derive a subclass from category_codes, category_modulo or category_binning, and then implement the following methods: __init__, _hashable_inputs, categories, validate, and apply. See e.g. the implementation of category_modulo in reductions.py for an example.
