Fix upper-bound bin error for auto-ranged data #459
Conversation
Thanks, @jbcrail. I'm trying to decide whether this is truly the appropriate fix.

To start with, I don't think either of the behaviors shown in #457 is correct -- previously, datashader resulted in the point being just over the edge to the top right and thus not visible, but the numpy.histogram2d example had an extra row and column on one side, which doesn't seem correct either. With this PR, datashader now does the right thing:

[image: https://user-images.githubusercontent.com/1695496/30405426-61ecd42c-98b1-11e7-87fd-0896adc17b48.png]

Given our upper-exclusive bounds policy, it achieves this result by enlarging the floating-point auto range just barely enough to ensure that all points will map into the resulting set of pixels, without changing any user-supplied range. This clearly works, but one implication is that if the user does compute a range based on finding the max and min, the max points won't be included in the plot.

One could argue that such behavior is just doing what the user asks, but we should consider whether it would be less surprising to users if we just changed the definition of x_range and y_range to be upper inclusive in general, e.g. by adding np.spacing to compute_scale_and_translate instead. That might surprise users less and is less likely to hide data points unbeknownst to the users.

Any opinions, @jbcrail, @philippjfr, and @hsparra?
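(For readers following along: a minimal sketch of the np.spacing idea mentioned above, using a simplified stand-in for the scale/translate computation rather than datashader's actual compute_scale_and_translate.)

import numpy as np

# Illustrative only: nudging the upper bound to the next representable float
# makes the range effectively upper-inclusive, so a point exactly at the max
# lands in the last pixel instead of just past it.
def scale_and_translate_inclusive(lo, hi, n):
    hi = hi + np.spacing(hi)
    s = n / (hi - lo)   # data units -> pixel units
    t = -lo * s         # shift lo to pixel 0
    return s, t

s, t = scale_and_translate_inclusive(0.0, 1.0, 10)
print(int(1.0 * s + t))  # 9: the max-valued point stays inside a 10-pixel grid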
Naive question: why shouldn't auto-scale include all values from the input? If the user supplies min/max values, I'm guessing that is the same. If both methods are in question, is there a more fundamental problem? I was hoping the 2D histogram and datashader would have the same result.

BTW: I'm comparing the results of datashader (before this change) and the 2D histogram on larger datasets, and they are always different; furthermore, on very large sparse input, datashader exhausts memory on problems where the 2D histogram does not.
Internally, datashader has to map every single datapoint to one, and only one, grid cell. That's true everywhere in the array, from the bottom left to the top right, and the only way that can be achieved for all locations, including those precisely on the borders between two grid cells, is if two of the borders on each internal grid cell are inclusive, and two are exclusive. If the borders were all inclusive, a datapoint on the border between two grid cells would be counted in both of them, which is clearly incorrect statistically. Unfortunately, if that policy is applied consistently to all cells, which for good performance is the only reasonable choice, then at the border of the overall array, some of the borders will be exclusive. Having those borders be exclusive is fine as a definition, but with autoranging it's clearly not the right overall answer, and is clearly confusing and surprising.

This PR works around the problem by bumping up the two exclusive edges just very slightly so that they include the value on the edge, effectively making the outer borders all be inclusive and thus avoiding the problem in the autoranging case, but not if someone naively supplies bounds that they expect to be inclusive.

The behavior of histogram2d is not correct in this case, at least not based on the example you supplied, and so we are certainly not going to match that (adding an extra dummy row and column). But I would like to address the specific problem of not including the edge datapoints, as this PR does.

Memory and performance issues should be addressed in other issues, which I think have been raised already, but those will take some investigation.
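(A tiny illustration of the policy just described, using a toy binning function rather than datashader's actual mapping code.)

import numpy as np

# Each point maps to exactly one of n bins covering [lo, hi); a point exactly
# at hi therefore lands one bin past the end unless the edge is nudged.
def bin_index(x, lo, hi, n):
    return int((x - lo) / (hi - lo) * n)

lo, hi, n = 0.0, 1.0, 10
print(bin_index(hi, lo, hi, n))                   # 10 -> falls outside bins 0..9
print(bin_index(hi - np.spacing(hi), lo, hi, n))  # 9  -> inside, after the nudge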
I recently observed this difference between DS and NP 2D histogram in places other than the edges, presumably due to the approach you describe. I concur about the counting on the seams.

Should there be a reference aggregation dataset for benchmarking and known results? I have not reviewed all of your test cases; perhaps there is one. Something tiny, simple, medium, and 'large'.

There is too much to do, but correct and repeatable results would be helpful for comparing the implementations by inspecting their outputs: 2D histogram, DS, etc. (there are some other techniques out there that are very fast with hash codes).

I also wonder what the GRID (weather) community has for their aggregations.
@jbednar I agree that extending the upper-exclusive bounds policy to the entire array could be confusing and surprising to users. Anecdotally, I pre-calculated the min and max when generating my first Datashader examples, so I also ran into this issue. I think upper-inclusive bounds are the least surprising, which is probably why they are used in other frameworks.

Here's why upper-exclusive bounds might cause confusion: First, the […]. Second, even though Datashader only handles numeric values, if we were plotting dates or strings, then an upper-inclusive range would also be least surprising, e.g. plotting […].

As you mentioned, I think […]
OK, @jbcrail, can you please update the PR to add the fix in the new spot?
This reverts to exclusive ranges both manual and auto for all glyphs (points/line) without introducing regressions of holoviz#318, holoviz#330, and holoviz#343. I refactored several tests to make xarray coordinate indices easier to read and more explicit.
I also fixed extend_line to calculate only with floats until the line is actually drawn (if accepted). We were previously juggling floats and their associated integer values, which caused incorrect mapping onto the grid.
Let me know when it's ready to review, once the tests pass...
I had two strategies to address the uniform distribution of points, which we did not previously test. The first strategy was to increase the upper bound on the x-axis and y-axis to the next nearest floating-point value using np.spacing. The second strategy (currently used) was to adjust the point and line glyphs to decrement data that landed on either upper bound using np.spacing.
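(A rough side-by-side sketch of the two strategies under a simplified linear mapping; the function names and the mapping itself are illustrative assumptions, not the code in this PR.)

import numpy as np

# Strategy 1 (illustrative): enlarge the computed range itself, so the derived
# scale/translate already places max-valued points inside the grid.
def enlarge_upper_bound(lo, hi):
    return lo, hi + np.spacing(hi)

# Strategy 2 (illustrative; the approach currently in the PR): leave the range
# alone and nudge only the points that land exactly on the upper bound.
def map_onto_pixel_sketch(x, xmax, s, t):
    xx = x * s + t
    if x == xmax:
        xx -= np.spacing(xx)
    return int(xx)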
I think this seems workable. @philippjfr, it would be worth having a second opinion, given how difficult it has been to get this right.
datashader/glyphs.py (outdated)

    if y == ymax:
        yy -= np.spacing(yy)
    return int(xx), int(yy)
I have not tested this, but would it be better to fix it up in the integer domain, instead of relying on np.spacing to be sufficient to bump it back by an integer?
    xx = int(x_mapper(x) * sx + tx)
    yy = int(y_mapper(y) * sy + ty)
    return (xx-1 if x==xmax else xx,
            yy-1 if y==ymax else yy)
Adjusting in the integer domain instead fails compared to the original. For instance,

    x_mapper, y_mapper = lambda n: n, lambda n: n
    xmax, ymax = 1.0, 1.0
    sx, tx, sy, ty = 0.1, 0.1, 0.1, 0.1

given the point (1.0, 1.0), the original algorithm returns (0, 0) while the above returns (-1, -1).
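(A quick runnable check of that example; the two blocks below are standalone stand-ins for the original and proposed mappings, not the actual glyphs.py code.)

import numpy as np

x_mapper = lambda n: n
xmax, sx, tx = 1.0, 0.1, 0.1
x = 1.0

# Original approach: nudge in the float domain, then truncate.
xx = x_mapper(x) * sx + tx
if x == xmax:
    xx -= np.spacing(xx)
print(int(xx))                          # 0

# Proposed approach: truncate first, then step back one pixel on the boundary.
xx2 = int(x_mapper(x) * sx + tx)
print(xx2 - 1 if x == xmax else xx2)    # -1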
I used this script to compare implementations:
import numpy as np

def map_onto_pixel(x, bound, s, t):
    xx = x * s + t
    if x == bound:
        xx -= np.spacing(xx)
    return int(xx)

def map_onto_pixel2(x, bound, s, t):
    xx = int(x * s + t)
    return xx-1 if x == bound else xx

def floats(origin, variance):
    fs = [origin]
    n = origin
    for _ in range(variance):
        n = n - np.spacing(n)
        fs.append(n)
    n = origin
    for _ in range(variance):
        n = n + np.spacing(n)
        fs.append(n)
    fs.sort()
    return fs

bound = 1.0
values = [0.1 * n for n in range(10)]
for s in values:
    for t in values:
        for f in floats(bound, 10):
            a = map_onto_pixel(f, bound, s, t)
            b = map_onto_pixel2(f, bound, s, t)
            if a != b:
                print("{:.20f} {} {} {} {}".format(f, a, b, s, t))
For that script, all the cases where the two methods differ are when f == bound, which makes sense given that both versions will otherwise always return int(x * s + t). But what I can't tell is which answer (if either) is correct when the two do differ. When they differ, is that meant to be a valid actual input, given that the points are being cropped against the specified bounds already?
The value of the first implementation, a, is assumed to be the correct value, since it's the current implementation used by all of the passing tests. The script was meant to verify that the second implementation would be a drop-in replacement.

I just verified that the newer implementation passes the tests, which makes me believe more investigation may be needed to improve the test cases.
I'm not sure that's true; the test cases could well be fine. It's true that the above script finds differences, but I can't tell whether it's finding them in cases that could ever be encountered in practice, given that s and t are not arbitrary, but are presumably derived from the bounds (and the aggregate array size), and given that the points are already being cropped against the bounds.

What would be helpful would be to take a case where the values are observed to differ, and then figure out whether that case could ever happen, given how s and t are actually calculated. My guess is that it couldn't ever happen, but I have not worked it through myself. And given that both approaches pass the tests, I am not at all confident that the existing approach in the PR is safer or more general.
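(To make "s and t are derived from the bounds" concrete, here is an assumed sketch of how that derivation might look; it is not a quote of datashader's compute_scale_and_translate.)

# Assumed sketch only -- not datashader's actual compute_scale_and_translate.
def compute_scale_and_translate_sketch(lo, hi, n):
    s = n / (hi - lo)   # data units -> pixels
    t = -lo * s         # shift lo to pixel 0
    return s, t

# Under this assumption, a point with x == hi maps to approximately n in the
# float domain, so the question above is whether the two boundary fixes can
# ever disagree for the s and t values that actually arise this way.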
I agree with your assessment. I took my modified script and created additional uniformity test cases. Both implementations produced the same results with the test suite, so I replaced the mapping with the simpler implementation.
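(A hypothetical example of what such a uniformity test case might look like, using the public Canvas API; this is illustrative and not the actual test added in this PR.)

import numpy as np
import pandas as pd
import datashader as ds

def test_uniform_points_include_both_bounds():
    # Points spread uniformly across the range, including values exactly at
    # both bounds; every point should land in the grid, so counts sum to n.
    n = 10
    coords = np.linspace(0.0, 1.0, n)
    df = pd.DataFrame({'x': coords, 'y': coords})
    cvs = ds.Canvas(plot_width=n, plot_height=n,
                    x_range=(0.0, 1.0), y_range=(0.0, 1.0))
    agg = cvs.points(df, 'x', 'y', ds.count())
    assert int(agg.sum()) == n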
datashader/glyphs.py (outdated)

        xx -= np.spacing(xx)
    if y == ymax:
        yy -= np.spacing(yy)
    return int(xx), int(yy)
Same comment as above regarding np.spacing vs. -1.
Same as above.
I think it's ready to merge, thanks!
Fix #457