Problems with _colorize() #899

Closed
o-smirnov opened this issue Apr 20, 2020 · 7 comments

@o-smirnov
Contributor

ALL software version info

0.10.0 release, but same problem on master.

Description of expected behavior and the observed behavior

I'm using an aggregator similar to color_cat() to colorize data. The resulting raster comes out very under-saturated, as it seems to have only two values in the alpha channel: 0 and 40.

Stepping through with the debugger, I can see the problem is here: https://github.com/holoviz/datashader/blob/master/datashader/transfer_functions/__init__.py#L375

I haven't got an explicit span set, so at this point span equals [0, max(total)] (the latter being some very large number), while the a array is normalized to [0, 1]. Calling interp with such a huge span on such small values of a pins every non-zero value of a to min_alpha=40.

Comparing to the implementation of _interpolate, span should be set to min(a),max(a) at this point for correct results...
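
The pinning can be reproduced with a few lines of NumPy. This is a minimal sketch rather than datashader's actual code: the helper `alpha_from` and the sample values are hypothetical, and it assumes the alpha channel comes from a linear `np.interp` of `a` over `span` onto `[min_alpha, 255]`.

```python
import numpy as np

def alpha_from(a, span, min_alpha=40):
    """Hypothetical helper: map `a` linearly from `span` onto [min_alpha, 255]."""
    alpha = np.interp(a, span, [min_alpha, 255])
    # Empty pixels (a == 0) stay fully transparent.
    alpha = np.where(a > 0, alpha, 0)
    return alpha.astype(np.uint8)

a = np.array([0.0, 0.1, 0.5, 1.0])        # already normalized to [0, 1]

bad = alpha_from(a, [0.0, 1e6])           # span taken from a huge max(total)
good = alpha_from(a, [a.min(), a.max()])  # span taken from a itself

print(bad)   # every non-zero pixel collapses onto min_alpha
print(good)  # non-zero pixels spread between min_alpha and 255
```

With the span derived from `a` itself, the non-zero pixels use the full alpha range instead of collapsing onto `min_alpha`.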

@o-smirnov
Contributor Author

Looking at the code, I think there's actually a more insidious problem if _colorize is used together with a by() reduction. Let's say I do by('category_column', mean('value_column')) (I'd need to get #900 to work first, but for the sake of argument let's say it does), and the mean of the value column is negative in some categories. I think this bit of code implicitly assumes that the result of the reduction is positive (no doubt because it started life when the only reduction was "count"):

    total = data.sum(axis=2)
    # zero-count pixels will be 0/0, but it's safe to ignore that when dividing
    with np.errstate(divide='ignore', invalid='ignore'):
        r = (data.dot(rs)/total).astype(np.uint8)
        g = (data.dot(gs)/total).astype(np.uint8)
        b = (data.dot(bs)/total).astype(np.uint8)

...and also the same assumption down below, where total is masked on >0. So it's all going to break down with reductions that can return negative values.

I guess the solution is to shift and rescale the data into a positive interval up front. In fact, as a workaround, I can do this to the raster before I even call shade.
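
To illustrate why mixed-sign values break the weighted color average, here is a small sketch for a single pixel. The three-color key, the helper `mix`, and the sample values are all made up for illustration; only the "weighted sum divided by total" structure follows the snippet quoted above.

```python
import numpy as np

# Hypothetical color key: one RGB row per category (red, green, blue).
colors = np.array([[255.0, 0.0, 0.0],
                   [0.0, 255.0, 0.0],
                   [0.0, 0.0, 255.0]])

def mix(data):
    """Value-weighted average color for one pixel, as floats (no uint8 cast)."""
    return data @ colors / data.sum()

mixed = np.array([-1.0, 0.0, 2.0])   # per-category values with mixed signs
shifted = mixed - mixed.min()        # workaround: shift into a non-negative range

print(mix(mixed))    # channels fall outside the valid 0..255 range
print(mix(shifted))  # channels are valid, but the minimum category has weight 0
```

The shift keeps every channel inside 0..255, at the cost that the category holding the minimum value no longer contributes to the color at all.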

@jbednar jbednar added this to the 0.11.0 milestone Apr 25, 2020
@jbednar jbednar changed the title from "_colorize() results in alpha channel all set to 0 or min_alpha" to "Problems with _colorize()" May 14, 2020
@jbednar
Member

jbednar commented May 14, 2020

We've improved a lot of the by and colorize code in #910, and I think the original issue has been addressed. However, even using the latest version of that PR as of now, there are still problems with negative and zero values as @o-smirnov suggests above.

Theory/Design

Colorize (part of tf.shade) is always meant to use color to indicate the average category in this pixel according to the color_key, and alpha to indicate the average or total value in this pixel. For counts, the color is then the average color of all categories in that pixel, weighted by their relative counts, and the alpha channel reflects the total counts per pixel. So far, so good.

For non-count aggregates that can include negative or zero values, it's difficult to work out precisely what should happen to the colors and alpha. My intuition is that colors should still reflect an average of the categories present, now weighted by value rather than count, and that alpha should map the minimum value found (even if negative or zero) to min_alpha, and the maximum value found (even if negative or zero) to alpha.

As suspected by @o-smirnov , the dot code is not working properly for zero values and may not be handling color properly for negative values, though we do appear to be handling alpha properly already.

Color handling for zero and negative values

For the aggn example in 2_Pipeline.ipynb:

dfn = df.copy()
dfn.val.replace({20:-20, 30:0, 40:-40}, inplace=True)
aggn = ds.Canvas().points(dfn,'x','y', agg=ds.mean("val"))

only the agg_c (count) plot respects this design. Focusing first on the colors in the agg_m (mean) plot, the color appears to be calculated correctly in general, but not for the case where the value is zero:

image

The medium blob (third largest and third smallest, in the lower left) shows up gray instead of the correct green color, indicating that it was mapped to black (with some transparency) rather than green. That blob has a value of 0, and in the current version of the quoted bit of _colorize code I think it is clear that pixels with 0 value will not end up with a meaningful color (0/0):

color_data = data.copy()
color_data[np.isnan(data)] = 0
with np.errstate(divide='ignore', invalid='ignore'):
    r = (color_data.dot(rs)/total).astype(np.uint8)
    g = (color_data.dot(gs)/total).astype(np.uint8)
    b = (color_data.dot(bs)/total).astype(np.uint8)

Here color_data replaces NaNs with 0 to keep them from propagating through the results, but a pixel whose only value is 0 then has a total of 0 as well, so its color is 0/0. We have to find another approach for that case.
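
The 0/0 case can be shown for a single pixel. This is only a sketch with a made-up three-category pixel; `gs` here stands for the green components of a hypothetical red/green/blue color key, and `total` is computed with `np.nansum` as an assumption about how NaNs are skipped.

```python
import numpy as np

gs = np.array([0.0, 255.0, 0.0])   # green components of a hypothetical color key

# One pixel: only the second (green) category is present, and its value is 0.
data = np.array([np.nan, 0.0, np.nan])

color_data = data.copy()
color_data[np.isnan(data)] = 0     # same NaN replacement as in the quoted code
total = np.nansum(data)            # 0.0 for this pixel

with np.errstate(divide='ignore', invalid='ignore'):
    g = color_data.dot(gs) / total  # 0/0: the green channel is undefined

print(g)  # nan
```

The green information is lost before any cast to uint8 happens, which is consistent with the blob being rendered as black rather than green.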

The other colors all look correct; the largest is orange as expected, and if you take that away you can see each of the others matches the colors from agg_c, with only the bottom left black blob shown with an incorrect color:

image

After fixing what happens for zero values, we'll need to look more closely at the color mixing to see if there are any other problems with how the colors are weighted. It's possible that it's working properly when negative values are present only when one color dominates the results, so that the sign in the dot product matches between the color_data and the total (cancelling out the minus sign); not sure.

alpha mapping for negative and zero values

We can look at the alpha handling by editing _colorize to force the color to be red for all cases:

r=r*0+255
g=g*0
b=b*0

When we do that, the alpha channel mapping is:

image

Remember that the values here are, from smallest blob to largest blob, d1(10), d2(-20), d3(0), d4(-40), d5(50), so we'd expect the blobs to be ordered d4(-40), d2(-20), d3(0), d1(10), d5(50) in alpha value. It's a bit hard to see, but as best I can make out in an image editor the alpha channel is indeed ordered as expected.
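
The expected ordering follows directly from the design stated earlier (minimum value maps to min_alpha, maximum value maps to alpha). A quick sketch, assuming a plain linear mapping and the values min_alpha=40 and alpha=255; the variable names are illustrative only:

```python
import numpy as np

# Blob values as listed above.
blobs = {"d1": 10.0, "d2": -20.0, "d3": 0.0, "d4": -40.0, "d5": 50.0}
values = np.array(list(blobs.values()))

min_alpha, alpha = 40, 255  # assumed settings for this sketch

# Linear map: smallest observed value -> min_alpha, largest -> alpha.
a = np.interp(values, [values.min(), values.max()], [min_alpha, alpha])

# Blob names sorted from most transparent to most opaque.
order = [name for name, _ in sorted(zip(blobs, a), key=lambda pair: pair[1])]
print(order)
```

Because the mapping is monotonic, negative and zero values need no special treatment for alpha; only their rank within the observed range matters.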

So, to make progress, we first need to ensure that zero-valued categories still get the correct color; then we can look at the overall results again.

Offsetting the data

I did try adding data -= np.nanmin(data) as a test before using data in colorize(), but that won't work, as it simply makes the -40 blob map to zero and thus lose its color.

@jbednar
Member

jbednar commented May 21, 2020

As explained on #910, I ended up having to add a special case for any pixels mapping to zero, as the color was undefined in those cases. Defining the color as the average color for all non-NaN category values seems to work well in practice.
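
As a rough illustration of that special case (a sketch of the idea only, not the code merged in #910; `pixel_color` and the color key are hypothetical), a per-pixel version might look like this:

```python
import numpy as np

# Hypothetical color key: one RGB row per category (red, green, blue).
colors = np.array([[255.0, 0.0, 0.0],
                   [0.0, 255.0, 0.0],
                   [0.0, 0.0, 255.0]])

def pixel_color(data, colors):
    """Value-weighted color for one pixel, with a fallback when the total is 0."""
    present = ~np.isnan(data)
    if not present.any():
        return np.zeros(3)              # empty pixel: color is irrelevant
    weights = np.where(present, data, 0.0)
    total = weights.sum()
    if total == 0:
        # Color is undefined (0/0), so use the plain average color of the
        # categories that are present (non-NaN) in this pixel.
        return colors[present].mean(axis=0)
    return weights @ colors / total

zero_pixel = np.array([np.nan, 0.0, np.nan])  # only green present, value 0
print(pixel_color(zero_pixel, colors))        # green instead of black
```

Note that in this sketch the fallback also triggers when positive and negative values cancel to a total of exactly 0.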

@jbednar
Member

jbednar commented May 22, 2020

Fixed in #910.

@jbednar jbednar closed this as completed May 22, 2020
@o-smirnov
Contributor Author

Thanks for looking into this @jbednar! I have been distracted by a different software release here, so I couldn't keep up from my side. I'm going to have to find some time over the next few days to absorb the changes into shadems, and report back here.

I think we're really on the bleeding edge here in terms of using colour and alpha to represent data, so the uncertainty is to be expected, and kind of exciting!

@jbednar
Member

jbednar commented May 22, 2020

Thanks! The notebook 2_Pipeline.ipynb illustrates how it all works now, so please look at the new examples in that section and see if it fits your intuitions. There's a breaking change to the behavior that we'll mention in the release notes, with color_baseline=0 required to get the old behavior that was always referenced to zero; everything is now referenced to the minimum of the data observed. In every case that I tried the actual minimum of the data was in fact 0 already, so it should not be a change except in unusual cases (e.g. zoomed in very close to some oddly behaving region of a dataset). But if you find otherwise, please let us know. We are preparing a release now, and will release later today if we find no problems.

@o-smirnov
Contributor Author

@jbednar, thanks, I have belatedly verified that this all seems to work as expected now, at least with positive-valued reductions. I have taken all my workarounds out of shadems, and the colours and alphas come up as expected.

The negative values (and, even worse, total=0) case still breaks my head a little bit. I'm not sure what the right approach is, or if there even is one...

Just thinking out loud now:

  • I would like to use colour to indicate that a category is dominant (or is an outlier) w.r.t. some statistic.

  • Let's say I have three categories: red, green and blue. If I use a strictly-positive reduction like std, the picture is simple: reddish points indicate that the red category has a significantly higher standard deviation compared to the other two categories. So far, so useful.

  • Let's say I use mean instead, and red comes out with a mean of -1, green with mean 0, and blue with mean 1. Is there a right way to colorize this? What is the "opposite" of red? Or should the colour be something like R=0, G=128, B=255 in this scenario? That might be useful -- but how to "marry" it consistently with the positive-only case?

I still don't know -- but if I come up with something clever, I'll post here...
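
For what it's worth, the R=0, G=128, B=255 idea from the last bullet can be written down as a sketch. It assumes each category's own color channel is simply driven by that category's mean, rescaled from the observed range to 0..255; this is one possible reading of the idea, not anything datashader does:

```python
import numpy as np

# Per-category means from the example above: red, green, blue.
means = np.array([-1.0, 0.0, 1.0])

# Rescale from [min, max] of the observed means onto the 0..255 channel range.
channels = np.interp(means, [means.min(), means.max()], [0, 255])
rgb = np.rint(channels).astype(np.uint8)

print(rgb)  # one intensity per channel: R, G, B
```

Under this reading a category with the lowest mean contributes nothing to the color, so it still doesn't answer what the "opposite" of red should be.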

jbednar pushed a commit that referenced this issue Nov 11, 2020
* Fixes performance issue raised in #899