Fused computation of statistics #78

Open
lindahua opened this issue Jun 23, 2014 · 10 comments

@lindahua
Contributor

I have been considering a uniform interface for computing multiple statistics all at once, while allowing them to share part of the computation.

Consider the following example. We want to compute sum, mean, var, and std from x:

s, m, v, sd = sum(x), mean(x), var(x), std(x)

This clearly wastes a lot of computation (e.g. it actually computes the sum four times, the mean three times, and the variance twice).

A more efficient way would be

s = sum(x)
m = s / length(x)
v = varm(x, m)
sd = sqrt(v)

This is more efficient, but not as concise and convenient.

I am considering the following approach:

s, m, v, sd = stats(x, (sum_, mean_, var_, std_))

Internally, it should find an efficient routine that computes them all together. Here, sum_, mean_, etc. are typed indicators defined as

type Sum_ end
type Mean_ end
type Var_ end
type Std_ end

const sum_ = Sum_()
const mean_ = Mean_()
const var_ = Var_()
const std_ = Std_()

Different combinations of statistics are different tuple types, and therefore we can leverage Julia's multiple dispatch mechanism to choose the optimal computation paths.
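
For illustration, here is a minimal sketch of one such dispatch-based method, assuming the indicator types above (spelled struct in current Julia) and a hypothetical stats function; varm is used so the already-computed mean is reused:

using Statistics

# One method per requested combination of statistics, dispatching on the
# tuple of indicator types; intermediate results are shared explicitly.
function stats(x, ::Tuple{Sum_,Mean_,Var_,Std_})
    s  = sum(x)
    m  = s / length(x)
    v  = varm(x, m)    # variance about the precomputed mean
    sd = sqrt(v)
    return s, m, v, sd
end

s, m, v, sd = stats(randn(1000), (sum_, mean_, var_, std_))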

This is not urgent, but it would be really nice to have. I am not going to implement this in the near future; I am just opening this thread to collect ideas, suggestions, and opinions.

@lindahua
Contributor Author

Another way is to use macros:

s, m, v, sd = @stats(x, (sum, mean, var, std))

Instead of using multiple dispatch (where the number of methods would grow exponentially as the tuple size increases), the macro can generate the optimal code at compile time.

And we don't have to define a bunch of things like sum_, mean_, etc.
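
As a very rough sketch of the idea, the hypothetical @stats macro below handles only the sum/mean/var/std case and always computes all intermediates; a real implementation would emit only the intermediates that the requested statistics actually need:

# Rewrites a tuple of known statistic names into a single block that
# computes shared intermediates once.
macro stats(x, fns)
    fns isa Expr && fns.head === :tuple || error("expected a tuple of statistics")
    # Map each requested statistic to the expression that produces it.
    lookup = Dict(:sum => :_s, :mean => :_m, :var => :_v, :std => :(sqrt(_v)))
    outs = Expr(:tuple, (lookup[f] for f in fns.args)...)
    quote
        local _x = $(esc(x))
        local _s = sum(_x)
        local _m = _s / length(_x)
        local _v = sum(abs2, _x .- _m) / (length(_x) - 1)
        $outs
    end
end

s, m, v, sd = @stats(x, (sum, mean, var, std))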

@simonbyrne
Member

I agree we need something like this; it would be even better if we could tie it into the sufficient statistics machinery in Distributions.

@nalimilan
Member

Good idea (better than mean_and_std :-). The macro solution looks the best to me.

@johnmyleswhite
Member

This would be great to have. Similar issues come up in HypothesisTests.jl, where many statistics require the calculation of other, simpler statistics.

@tonyhffong

How about this idea: http://en.wikipedia.org/wiki/Incremental_computing

So we could have an incremental computing model that is initialized on the data set. Statistics computed on this model would cache any reusable intermediate values, so that subsequent requests for those values, or for other statistics that depend on them, just get the cached values from the model, speeding up computation. This way we are not limited to just a few standard functions.

The code may look something like (very hand-wavy)

model = IC_single_factor_model()
setdata!(model, x)
mean = get(model, :mean)  # this depends on sum, so
sum  = get(model, :sum)   # this becomes constant time
std  = get(model, :std)   # this requires another O(n) sum of squares, but an O(1) cached lookup of sum
var  = get(model, :var)   # but at this point var becomes constant time

With all this flexibility, these are the prices we'd pay (cons):

  • the caching mechanism presumably forces us to store intermediate values in a dictionary at runtime, rather than in variables that could be optimized into registers at compile time when the set of required statistics is known to be small.
  • the incremental computing model needs to know which statistics we want computed on the data in order to properly generate the dependency graph. This feels like duplicated coding effort for the sake of a constant-factor improvement (although this constant factor can be significant as the data size grows).
  • aesthetically, the lookup syntax may seem a bit unnatural, but it can be hidden via macros.
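
To make this concrete, here is a minimal sketch of such a cached-lookup model; the StatModel type and getstat function are made-up names, and the dependency graph is hard-coded rather than derived automatically:

# Each statistic is computed on demand, cached in a Dict, and reused by
# statistics that depend on it.
struct StatModel{T}
    x::Vector{T}
    cache::Dict{Symbol,Float64}
end
StatModel(x) = StatModel(x, Dict{Symbol,Float64}())

function getstat(m::StatModel, s::Symbol)
    get!(m.cache, s) do
        s === :sum  ? sum(m.x) :
        s === :mean ? getstat(m, :sum) / length(m.x) :
        s === :var  ? sum(abs2, m.x .- getstat(m, :mean)) / (length(m.x) - 1) :
        s === :std  ? sqrt(getstat(m, :var)) :
        error("unknown statistic $s")
    end
end

model = StatModel(randn(1000))
getstat(model, :std)   # computes and caches sum, mean, and var along the way
getstat(model, :mean)  # now an O(1) cached lookup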

@Nosferican
Contributor

Status of this?

@nalimilan
Member

Note that we can now use s, m, v, sd = stats(x, (sum, mean, var, std)) since functions have their own types.
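
For example, a tuple of functions like (sum, mean, var, std) has the concrete type Tuple{typeof(sum),typeof(mean),typeof(var),typeof(std)}, so a hypothetical stats function can dispatch on it directly (using Statistics is needed for var, std, and varm in current Julia):

using Statistics

# Dispatch on the tuple-of-functions type; reuse the mean when computing var.
function stats(x, ::Tuple{typeof(sum),typeof(mean),typeof(var),typeof(std)})
    s = sum(x)
    m = s / length(x)
    v = varm(x, m)
    return s, m, v, sqrt(v)
end

s, m, v, sd = stats(randn(100), (sum, mean, var, std))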

@nalimilan
Member

This isn't implemented; my last comment just meant that we can now use (sum, mean) instead of (:sum, :mean).

@nalimilan reopened this Oct 28, 2023
@ParadaCarleton
Contributor

Thanks, sorry about that.

@ParadaCarleton
Contributor

I'll note that this can be done automatically using the lazy computation provided by something like Transducers.jl, or graph reduction algorithms in general.
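
Not Transducers.jl itself, but for reference, here is a sketch of the kind of single-pass fused loop such an approach could reduce to (Welford's algorithm; the function name is made up):

# One sweep over x yields sum, mean, variance, and std together;
# Welford's online update keeps the variance numerically stable.
function fused_sum_mean_var_std(x)
    n, s, m, M2 = 0, zero(float(eltype(x))), 0.0, 0.0
    for xi in x
        n  += 1
        s  += xi
        d   = xi - m
        m  += d / n
        M2 += d * (xi - m)
    end
    v = M2 / (n - 1)
    return s, m, v, sqrt(v)
end

fused_sum_mean_var_std(randn(10_000))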
