Problems with _colorize() #899

o-smirnov · 2020-04-20T13:06:44Z

ALL software version info

0.10.0 release, but same problem on master.

Description of expected behavior and the observed behavior

I'm using an aggregator similar to color_cat() to colorize data. The resulting raster comes out very under-saturated, as it seems to have only two values in the alpha channel: 0 and 40.

Stepping through with the debugger, I can see the problem is here: https://github.com/holoviz/datashader/blob/master/datashader/transfer_functions/__init__.py#L375

I haven't set an explicit span, so at this point span is equal to [0, max(total)] (the latter being some very large number), while the a array is normalized to [0,1]. Calling interp with such a huge span on such small values of a pins every non-zero value of a to min_alpha=40.

Comparing to the implementation of _interpolate, span should be set to min(a),max(a) at this point for correct results...
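
The effect can be reproduced outside datashader with plain NumPy. What follows is a minimal sketch, not the `_colorize` code itself: the `total` values are made up for illustration, and the masking of zero-count pixels that `_colorize` applies is left out.

```python
import numpy as np

min_alpha, alpha = 40, 255

# Hypothetical per-pixel totals; the largest one is a very large number.
total = np.array([0.0, 250_000.0, 500_000.0, 1_000_000.0])
a = total / total.max()  # normalized to [0, 1], as described above

# Span taken from the un-normalized totals: every value lands on min_alpha.
wide_span = [0, total.max()]
pinned = np.interp(a, wide_span, [min_alpha, alpha]).astype(np.uint8)

# Span taken from the normalized array itself: the full alpha range is used.
tight_span = [a.min(), a.max()]
spread = np.interp(a, tight_span, [min_alpha, alpha]).astype(np.uint8)

print(pinned.tolist())  # [40, 40, 40, 40]
print(spread.tolist())  # [40, 93, 147, 255]
```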

o-smirnov · 2020-04-22T08:48:27Z

Looking at the code, I think there's actually a more insidious problem if _colorize is used together with a by() reduction. Let's say I do by('category_column', mean('value_column')) (I'd need to get #900 to work first, but for the sake of argument let's say it does), and the mean of the value column is negative in some categories. I think this bit of code implicitly assumes that the result of the reduction is positive (no doubt because it started life when the only reduction was "count"):

    total = data.sum(axis=2)
    # zero-count pixels will be 0/0, but it's safe to ignore that when dividing
    with np.errstate(divide='ignore', invalid='ignore'):
        r = (data.dot(rs)/total).astype(np.uint8)
        g = (data.dot(gs)/total).astype(np.uint8)
        b = (data.dot(bs)/total).astype(np.uint8)

...and also the same assumption down below, where total is masked on >0. So it's all going to break down with reductions that can return negative values.
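
To make the concern concrete, here is a small NumPy sketch of the same weighted average applied to a single pixel. The two categories, their colors, and the values are hypothetical, and the `uint8` cast is omitted so the raw numbers stay visible.

```python
import numpy as np

# Red and blue channel values for two hypothetical categories (red, blue).
rs = np.array([255, 0])
bs = np.array([0, 255])

def weighted_channel(data, channel):
    """Weighted average of one color channel, without the uint8 cast."""
    total = data.sum(axis=2)
    with np.errstate(divide='ignore', invalid='ignore'):
        return data.dot(channel) / total

# One pixel each, shape (ny, nx, ncategories).
counts = np.array([[[3.0, 1.0]]])   # a count-like, non-negative reduction
means = np.array([[[3.0, -1.0]]])   # a reduction with one negative value

# Non-negative weights give channel values inside [0, 255].
r_counts = weighted_channel(counts, rs)[0, 0]  # 191.25
b_counts = weighted_channel(counts, bs)[0, 0]  # 63.75

# A negative weight pushes both channels outside the valid range.
r_means = weighted_channel(means, rs)[0, 0]    # 382.5
b_means = weighted_channel(means, bs)[0, 0]    # -127.5
```

With mixed signs the result is no longer a convex combination of the category colors, so the values can fall outside [0, 255] before the cast.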

I guess the solution is to shift and rescale the data into a positive interval first. In fact, as a workaround, I can do this to the raster myself before I even call shade.

jbednar · 2020-05-14T04:58:53Z

We've improved a lot of the by and colorize code in #910, and I think the original issue has been addressed. However, even using the latest version of that PR as of now, there are still problems with negative and zero values as @o-smirnov suggests above.

Theory/Design

Colorize (part of tf.shade) is always meant to use color to indicate the average category in this pixel according to the color_key, and alpha to indicate the average or total value in this pixel. For counts, the color is then the average color of all categories in that pixel, weighted by their relative counts, and the alpha channel reflects the total counts per pixel. So far, so good.

For non-count aggregates that can include negative or zero values, it's difficult to work out precisely what should happen to the colors and alpha. My intuition is that colors should still reflect an average of the categories present, now weighted by value rather than count, and that alpha should map the minimum value found (even if negative or zero) to min_alpha, and the maximum value found (even if negative or zero) to alpha.

As suspected by @o-smirnov , the dot code is not working properly for zero values and may not be handling color properly for negative values, though we do appear to be handling alpha properly already.

Color handling for zero and negative values

For the aggn example in 2_Pipeline.ipynb:

    dfn = df.copy()
    dfn.val.replace({20:-20, 30:0, 40:-40}, inplace=True)
    aggn = ds.Canvas().points(dfn,'x','y', agg=ds.mean("val"))

only the agg_c (count) plot respects this design. Focusing first on the colors in the agg_m (mean) plot, the color appears to be calculated correctly in general, but not for the case where the value is zero:

The medium blob (third largest and third smallest, in the lower left) shows up gray instead of the correct green color, indicating that it was mapped to black (with some transparency) rather than green. That blob has a value of 0, and in the current version of the quoted bit of _colorize code I think it is clear that pixels with 0 value will not end up with a meaningful color (0/0):

    color_data = data.copy()
    color_data[np.isnan(data)] = 0
    with np.errstate(divide='ignore', invalid='ignore'):
        r = (color_data.dot(rs)/total).astype(np.uint8)
        g = (color_data.dot(gs)/total).astype(np.uint8)
        b = (color_data.dot(bs)/total).astype(np.uint8)

Here the color_data replaces nans with 0 to avoid having nans propagate throughout the results, but we have to find another approach.
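
The 0/0 case can be seen in isolation with a few lines of NumPy. This is only a sketch: the three categories and the two pixels are hypothetical, `total` is computed here from `color_data`, and the `uint8` cast is left out so the NaN stays visible.

```python
import numpy as np

# Green channel for three hypothetical categories (red, green, blue).
gs = np.array([0, 255, 0])

# Two pixels that contain only the green category: one with value 0,
# one with value 10. NaN marks a category that is absent from the pixel.
data = np.array([[[np.nan, 0.0, np.nan],
                  [np.nan, 10.0, np.nan]]])

color_data = data.copy()
color_data[np.isnan(data)] = 0
total = color_data.sum(axis=2)

with np.errstate(divide='ignore', invalid='ignore'):
    g = color_data.dot(gs) / total  # float result, before any uint8 cast

# g[0, 0] is NaN (0/0) for the zero-valued pixel; g[0, 1] is 255.0.
print(g)
```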

The other colors all look correct; the largest is orange as expected, and if you take that away you can see each of the others matches the colors from agg_c, with only the bottom left black blob shown with an incorrect color:

After fixing what happens for zero values, we'll need to look more closely at the color mixing to see if there are any other problems with how the colors are weighted. It's possible that it's working properly when negative values are present only when one color dominates the results, so that the sign in the dot product matches between the color_data and the total (cancelling out the minus sign); not sure.

Alpha mapping for negative and zero values

We can look at the alpha handling by editing _colorize to force the color to be red for all cases:

    r = r*0 + 255
    g = g*0
    b = b*0

When we do that, the alpha channel mapping is:

Remember that the values here are, from smallest blob to largest blob, d1(10), d2(-20), d3(0), d4(-40), d5(50), so we'd expect the blobs to be ordered d4(-40), d2(-20), d3(0), d1(10), d5(50) in alpha value. It's a bit hard to see, but as best I can make out in an image editor the alpha channel is indeed ordered as expected.
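
For reference, the expected ordering follows directly from a linear mapping of the minimum value to min_alpha and the maximum to alpha, as proposed under Theory/Design. The sketch below assumes a plain linear `np.interp` mapping with min_alpha=40 and alpha=255; the actual `how` setting used by shade may distribute the intermediate values differently, though the ordering would be the same for any monotonic mapping.

```python
import numpy as np

min_alpha, alpha = 40, 255

# Blob values from the example, smallest blob to largest blob.
blobs = {"d1": 10, "d2": -20, "d3": 0, "d4": -40, "d5": 50}
values = np.array(list(blobs.values()), dtype=float)

# Map the observed minimum to min_alpha and the observed maximum to alpha.
span = [values.min(), values.max()]
alphas = np.interp(values, span, [min_alpha, alpha]).astype(np.uint8)

alpha_of = dict(zip(blobs, alphas.tolist()))
order = sorted(alpha_of, key=alpha_of.get)

print(alpha_of)  # {'d1': 159, 'd2': 87, 'd3': 135, 'd4': 40, 'd5': 255}
print(order)     # ['d4', 'd2', 'd3', 'd1', 'd5']
```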

So, to make progress, we first need to ensure that zero-valued categories still get the correct color; then we can look at the overall results again.

Offsetting the data

I did try adding data -= np.nanmin(data) as a test before using data in colorize(), but that won't work, as it simply makes the -40 blob map to zero and thus lose its color.
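
A tiny NumPy sketch of why the offset alone does not help, using two hypothetical pixels and two hypothetical categories (NaN marks an absent category):

```python
import numpy as np

# Pixel 0 holds only the first category with value -40,
# pixel 1 holds only the second category with value 50.
data = np.array([[[-40.0, np.nan],
                  [np.nan, 50.0]]])

shifted = data - np.nanmin(data)
total = np.nansum(shifted, axis=2)

# The minimum-valued pixel now has weight 0, so its color is 0/0 again.
print(shifted[0, 0, 0])  # 0.0
print(total.tolist())    # [[0.0, 90.0]]
```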

jbednar · 2020-05-21T07:30:51Z

As explained on #910, I ended up having to add a special case for any pixels mapping to zero, as the color was undefined in those cases. Defining the color as the average color for all non-NaN category values seems to work well in practice.
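
The idea can be sketched as follows. This is an illustration only, not the code merged in #910: the helper name `channel_with_zero_fallback`, the categories, and the data are all hypothetical.

```python
import numpy as np

def channel_with_zero_fallback(data, channel):
    """Weighted channel average, falling back to the unweighted average
    over the non-NaN categories wherever the weights sum to zero."""
    present = ~np.isnan(data)
    weights = np.where(present, data, 0.0)
    total = weights.sum(axis=2)
    with np.errstate(divide='ignore', invalid='ignore'):
        weighted = weights.dot(channel) / total
        # Average color of the categories that are present in each pixel.
        fallback = present.dot(channel) / present.sum(axis=2)
    return np.where(total == 0, fallback, weighted)

# Green channel for three hypothetical categories (red, green, blue).
gs = np.array([0, 255, 0])

# Two green-only pixels: one with value 0, one with value 10.
data = np.array([[[np.nan, 0.0, np.nan],
                  [np.nan, 10.0, np.nan]]])

g = channel_with_zero_fallback(data, gs)
print(g.tolist())  # [[255.0, 255.0]]
```

Note that in this sketch the fallback also applies when positive and negative weights happen to cancel to exactly zero, and a pixel with no categories at all still yields NaN.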

jbednar · 2020-05-22T20:29:45Z

Fixed in #910.

o-smirnov · 2020-05-22T20:41:06Z

Thanks for looking into this @jbednar! I have been distracted by a different software release here, so I couldn't keep up from my side. I'm going to have to find some time over the next few days to absorb the changes into shadems, and report back here.

I think we're really on the bleeding edge here in terms of using colour and alpha to represent data, so the uncertainty is to be expected, and kind of exciting!

jbednar · 2020-05-22T20:45:17Z

Thanks! The notebook 2_Pipeline.ipynb illustrates how it all works now, so please look at the new examples in that section and see if it fits your intuitions. There's a breaking change to the behavior that we'll mention in the release notes, with color_baseline=0 required to get the old behavior that was always referenced to zero; everything is now referenced to the minimum of the data observed. In every case that I tried the actual minimum of the data was in fact 0 already, so it should not be a change except in unusual cases (e.g. zoomed in very close to some oddly behaving region of a dataset). But if you find otherwise, please let us know. We are preparing a release now, and will release later today if we find no problems.
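
As a rough illustration of why the change in reference point rarely matters, the sketch below compares weights referenced to the observed minimum with weights referenced to zero. The helper `referenced_weights` is hypothetical and is not datashader's implementation; it only mimics the idea of a baseline that defaults to the minimum of the data.

```python
import numpy as np

def referenced_weights(data, baseline=None):
    """Hypothetical helper: weights measured from `baseline`,
    which defaults to the minimum of the data observed."""
    if baseline is None:
        baseline = np.nanmin(data)
    return data - baseline

# Typical case: the observed minimum is already 0, so nothing changes.
typical = np.array([0.0, 3.0, 7.0])
same_min_ref = referenced_weights(typical)
same_zero_ref = referenced_weights(typical, baseline=0)

# Unusual case (e.g. zoomed in on a dense region): the minimum is above 0.
zoomed = np.array([5.0, 8.0, 12.0])
zoomed_min_ref = referenced_weights(zoomed)               # [0, 3, 7]
zoomed_zero_ref = referenced_weights(zoomed, baseline=0)  # [5, 8, 12]
```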

o-smirnov · 2020-05-29T18:33:24Z

@jbednar, thanks, I have belatedly verified that this all seems to work as expected now, at least with positive-valued reductions. I have taken all my workarounds out of shadems, and the colours and alphas come up as expected.

The negative values (and, even worse, total=0) case still breaks my head a little bit. I'm not sure what the right approach is, or if there even is one...

Just thinking out loud now:

I would like to use colour to indicate that a category is dominant (or is an outlier) w.r.t. some statistic.
Let's say I have three categories: red, green and blue. If I use a strictly-positive reduction like std, the picture is simple: reddish points indicate that the red category has a significantly higher standard deviation compared to the other two categories. So far, so useful.
Let's say I use mean instead, and red comes out with a mean of -1, green with mean 0, and blue with mean 1. Is there a right way to colorize this? What is the "opposite" of red? Or should the colour be something like R=0, G=128, B=255 in this scenario? That might be useful -- but how to "marry" it consistently with the positive-only case?
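
For what it's worth, the arithmetic of two candidate readings can be written down in NumPy. Both are assumptions made for illustration, not proposals from datashader: option A rescales each category's mean to [0, 1] and uses it as that category's intensity, which gives the R=0, G=128 (rounded), B=255 colour; option B shifts the means to non-negative weights and takes the usual weighted average.

```python
import numpy as np

# Pure red, green and blue for the three categories (rows are RGB triples).
colors = np.array([[255, 0, 0],
                   [0, 255, 0],
                   [0, 0, 255]], dtype=float)
means = np.array([-1.0, 0.0, 1.0])  # red, green, blue

# Option A: rescale each mean to [0, 1] and use it as a per-category intensity.
scaled = (means - means.min()) / (means.max() - means.min())
option_a = scaled.dot(colors)

# Option B: shift to non-negative weights, then take a weighted average.
weights = means - means.min()
option_b = weights.dot(colors) / weights.sum()

print(option_a.tolist())  # [0.0, 127.5, 255.0]
print(option_b.tolist())  # [0.0, 85.0, 170.0]
```

In both readings the category with the lowest mean contributes nothing to the colour, which is the same loss of information noted earlier for offsetting the data.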

I still don't know -- but if I come up with something clever, I'll post here...

o-smirnov added a commit to o-smirnov/datashader that referenced this issue Apr 20, 2020

fixes holoviz#899, hopefully

dba9080

o-smirnov mentioned this issue Apr 21, 2020

odd performance degradation when row chunks are set too small ratt-ru/shadeMS#29

Closed

jbednar added this to the 0.11.0 milestone Apr 25, 2020

jbednar mentioned this issue May 7, 2020

Additional fixes for colorize span #910

Merged

jbednar changed the title ~~_colorize() results in alpha channel all set to 0 or min_alpha~~ Problems with _colorize() May 14, 2020

jbednar closed this as completed May 22, 2020

jbednar pushed a commit that referenced this issue Nov 11, 2020

Implements tree reduction in the dask layer (#926)

1f54bcf

* Fixes performance issue raised in #899
