Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sum does not use K-B-N summation for >1D arrays #1258

Closed
kmsquire opened this issue Sep 5, 2012 · 16 comments
Closed

sum does not use K-B-N summation for >1D arrays #1258

kmsquire opened this issue Sep 5, 2012 · 16 comments

Comments

@kmsquire
Copy link
Member

kmsquire commented Sep 5, 2012

julia> v = [1,1e100,1,-1e100]*1000
4-element Float64 Array:
 1000.0    
    1.0e103
 1000.0    
   -1.0e103

julia> sum(v)
2000.0

julia> A = [v reverse(v) v2 reverse(v2)]
4x4 Float64 Array:
 1000.0        -1.0e103  1000.0         1.0e103
    1.0e103  1000.0        -1.0e103  1000.0    
 1000.0         1.0e103  1000.0        -1.0e103
   -1.0e103  1000.0         1.0e103  1000.0    

julia> sum(A)
8000.0

julia> sum(A,1)
1x4 Float64 Array:
 0.0  1000.0  0.0  1000.0

julia> [sum(A[:,1]) sum(A[:,2]) sum(A[:,3]) sum(A[:,4])]
1x4 Float64 Array:
 2000.0  2000.0  2000.0  2000.0

julia> sum(A,2)
4x1 Float64 Array:
    0.0
 1000.0
    0.0
 1000.0

julia> [sum(A[1,:]), sum(A[2,:]), sum(A[3,:]), sum(A[4,:])]
4-element Float64 Array:
 2000.0
 2000.0
 2000.0
 2000.0
@JeffBezanson
Copy link
Member

I believe we decided to use naive summation by default. Reasons:

  • Every other environment we know of does it that way
  • On some platforms the performance penalty is 2x, which is steep
  • K-B-N does not give a perfect answer, just a better answer, so it is kind of an arbitrary tradeoff

@kmsquire
Copy link
Member Author

kmsquire commented Oct 6, 2012

Using naive summation by default is fine, although it would still be nice to have a way to have more accurate summation. My main real interest in this is that I often work with probability vectors with 20,000 values, and with naive summation, those may not sum to 1.0.

julia> a = rand(20000,1);

julia> s = 0; for i = 1:length(a)
          s = s + a[i]
       end

julia> s
9954.844294400755

julia> sum(a)
9954.84429440082

julia> a1 = a/s;

julia> a2 = a/sum(a);

julia> sum(a1)
1.0000000000000064

julia> sum(a2)
0.9999999999999999

I guess this shows that KBN summation is better, but not perfect either...

There are usually ways around the issue, if you know the problem exists.

It might be worthwhile to specifically use KBN summation when normalizing probability vectors in multinomial and categorical distributions, especially if it is going away for the normal summation algorithm.

CC: @johnmyleswhite @dmbates

@StefanKarpinski
Copy link
Member

Even precise floating-point summation may not help here. It's entirely possible that the normalized values simply don't add up to 1.

@kmsquire
Copy link
Member Author

kmsquire commented Oct 6, 2012

True, but in my example, the KBN normalized sum was closer (and in the few
other trials I tested, worked out to 1.0 within machine precision). I'll
try to quantify this more at some point.

Kevin

On Sat, Oct 6, 2012 at 3:28 PM, Stefan Karpinski
[email protected]:

Even precise floating-point summation may not help here. It's entirely
possible that the normalized values simply don't add up to 1.


Reply to this email directly or view it on GitHubhttps://github.com//issues/1258#issuecomment-9203117.

@johnmyleswhite
Copy link
Member

I don't think I have anything intelligent to add here. It seems like it would be nice to have the option to produce more accurate behavior, but it's hard to say how important these differences are in any given application. The numbers being cited are differences in probabilities that I would guess could rarely be reliably measured in most data collection environments. In the multinomial draws case, it seems like you'd need trillions of samples to tell the difference between them, but I may be overlooking some obvious case where things break down.

@StefanKarpinski
Copy link
Member

I think the issue here is less that the probabilities be accurate and more that it's problematic when they don't appear to sum to 1. But yes, we probably want a way to switch this, the issue is how and limited to what scope.

@kmsquire
Copy link
Member Author

kmsquire commented Oct 9, 2012

Right, at some point in the past, I had some issues sampling from discrete distributions with very small probabilities when those probabilities did not add up to 1.0, so I was (am) concerned about that here. I tested this by drawing samples from an exponential distribution and normalizing, for various sizes (https://gist.github.com/3856163). Note that this tries to show the worst case by sorting from high to low before summing.

Generally speaking, up to 100,000 values, the normalized summation error (difference from 1.0) is a couple of magnitudes (or more) lower than the smallest probability, so there's probably not much effect on the accuracy. If this were used, e.g., as a Multinomial distribution, there may be some small issues with the last element getting more or less probability mass than it should, with the exact issue being problem dependent.

So I guess it's worth closing this issue and perhaps (re)opening a separate issue to provide the possibility of replacing standard summation with more accurate summing. This may be slightly challenging to provide universally, though, since e.g., the Multinomial distribution does it's own normalization because it checks for negative values.

@stevengj
Copy link
Member

PS. For interested parties on this thread, note that we now use pairwise summation (#4039), which is often surprisingly close to Kahan summation for large arrays, but without the performance penalty. But we still use naive summation for sums along particular dimensions of multidimensional arrays.

@afbarnard
Copy link

Python has had a specialized floating-point sum implementation, math.fsum, since version 2.6. Here are some recipes for various implementations with different characteristics.

I think Julia ought to have a specialized floating-point summation algorithm given its emphasis on numerical computing. It is definitely something I would appreciate seeing added. Then, given documentation of the trade-offs, users could pick either the naive or the correct version.

@StefanKarpinski
Copy link
Member

See @stevengj's comment right above – we use pairwise summation. Near-optimal performance, near-optimal accuracy, so it side-steps the trade-off altogether. However, since we're using a base case of iterative summation in that algorithm, we do fail the test case given in that article:

julia> v = [1, 1e100, 1, -1e100] * 10000
4-element Array{Float64,1}:
 10000.0
     1.0e104
 10000.0
    -1.0e104

julia> sum(v)
0.0

@timholy
Copy link
Member

timholy commented Mar 27, 2014

And we also have sum_kbn in base.

@ivarne
Copy link
Member

ivarne commented Mar 27, 2014

julia> sum_kbn(v)
20000.0

@afbarnard
Copy link

Yes, I read issue #4039, but I was still concerned because the guarantees for pairwise summation might not be good enough for a particular application. KBN does better and Python's fsum does even better than that. KBN would be fine for my uses but I didn't see it documented (the suffix "kbn" is not intuitive to me, especially compared to something like "float_sum" or "fsum", but I found it now through search) and so did not know it was part of base rather than just part of someone's code. In particular, it might be helpful to mention sum_kbn under the documentation for sum.

@ivarne
Copy link
Member

ivarne commented Mar 27, 2014

The documentation for sum should definitely suggest using sum_kbn if the numbers vary with many orders of magnitude, and speed does not matter.

@timholy
Copy link
Member

timholy commented Mar 27, 2014

We could really use a systematic approach for "See also:" in our documentation system.

@ivarne
Copy link
Member

ivarne commented Mar 27, 2014

@timholy So that it renders to links on platforms that support it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants