Fused computation of statistics #78

Open
lindahua opened this issue Jun 23, 2014 · 10 comments

@lindahua
Contributor

I have been considering a uniform interface for computing multiple statistics all at once, while allowing them to share part of the computation.

Consider the following example. We want to compute sum, mean, var, and std from x:

s, m, v, sd = sum(x), mean(x), var(x), std(x)

This clearly wastes a lot of computation (e.g. it actually computes the sum four times, the mean three times, and the variance twice).

A more efficient way would be

s = sum(x)
m = s / length(x)
v = varm(x, m)
sd = sqrt(v)

This is more efficient, but not as concise and convenient.

I am considering the following approach:

s, m, v, sd = stats(x, (sum_, mean_, var_, std_))

Internally, it should find an efficient routine that computes them all together. Here, sum_, mean_, etc. are typed indicators defined as

type Sum_ end
type Mean_ end
type Var_ end
type Std_ end

const sum_ = Sum_()
const mean_ = Mean_()
const var_ = Var_()
const std_ = Std_()

Different combinations of statistics are different tuple types, and therefore we can leverage Julia's multiple dispatch mechanism to choose the optimal computation paths.
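
For illustration, here is a minimal sketch of one such dispatch-based method, assuming the indicator types above (spelled struct in current Julia) and a hypothetical stats function; varm is used so the already-computed mean is reused:

using Statistics

# One method per requested combination of statistics, dispatching on the
# tuple of indicator types; intermediate results are shared explicitly.
function stats(x, ::Tuple{Sum_,Mean_,Var_,Std_})
    s  = sum(x)
    m  = s / length(x)
    v  = varm(x, m)    # variance about the precomputed mean
    sd = sqrt(v)
    return s, m, v, sd
end

s, m, v, sd = stats(randn(1000), (sum_, mean_, var_, std_))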

This is not urgent, but it would be really nice to have. I am not going to implement this in the near future; I am just opening this thread to collect ideas, suggestions, and opinions.

@lindahua
Contributor Author

Another way is to use macros:

s, m, v, sd = @stats(x, (sum, mean, var, std))

Instead of using multiple dispatch (where the number of methods would grow exponentially as the tuple size increases), the macro can generate the optimal code at compile time.

And we don't have to define a bunch of things like sum_, mean_, etc.
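
As a very rough sketch of the idea, the hypothetical @stats macro below handles only the sum/mean/var/std case and always computes all intermediates; a real implementation would emit only the intermediates that the requested statistics actually need:

# Rewrites a tuple of known statistic names into a single block that
# computes shared intermediates once.
macro stats(x, fns)
    fns isa Expr && fns.head === :tuple || error("expected a tuple of statistics")
    # Map each requested statistic to the expression that produces it.
    lookup = Dict(:sum => :_s, :mean => :_m, :var => :_v, :std => :(sqrt(_v)))
    outs = Expr(:tuple, (lookup[f] for f in fns.args)...)
    quote
        local _x = $(esc(x))
        local _s = sum(_x)
        local _m = _s / length(_x)
        local _v = sum(abs2, _x .- _m) / (length(_x) - 1)
        $outs
    end
end

s, m, v, sd = @stats(x, (sum, mean, var, std))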

@simonbyrne
Member

I agree we need something like this; it would be even better if we could tie it into the sufficient statistics machinery in Distributions.

@nalimilan
Member

Good idea (better than mean_and_std :-). The macro solution looks the best to me.

@johnmyleswhite
Member

This would be great to have. Similar issues come up in HypothesisTests.jl, where many statistics require the calculation of other, simpler statistics.

@tonyhffong

How about this idea: http://en.wikipedia.org/wiki/Incremental_computing

So we could have an incremental computing model that is initialized on the data set. Statistics computed on this model would cache any reusable intermediate values, so that subsequent requests for those values, or for other statistics that depend on them, just get the cached values from the model, speeding up computation. This way we are not limited to just a few standard functions.

The code may look something like (very hand-wavy)

model = IC_single_factor_model()
setdata!(model, x)
mean = get(model, :mean)  # this depends on sum, so
sum  = get(model, :sum)   # this becomes constant time
std  = get(model, :std)   # this requires another O(n) sum of squares, but an O(1) cached lookup of sum
var  = get(model, :var)   # but at this point var becomes constant time

With all this flexibility, these are the prices we'd pay (cons):

  • the caching mechanism presumably forces us to store intermediate values in a dictionary at runtime, rather than in variables that could be optimized into registers at compile time when the set of required statistics is known to be small.
  • the incremental computing model needs to know which statistics we want computed on the data in order to properly generate the dependency graph. This feels like duplicated coding effort for the sake of a constant-factor improvement (although this constant factor can be significant as the data size grows).
  • aesthetically, the lookup syntax may seem a bit unnatural, but it can be hidden via macros.
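
To make this concrete, here is a minimal sketch of such a cached-lookup model; the StatModel type and getstat function are made-up names, and the dependency graph is hard-coded rather than derived automatically:

# Each statistic is computed on demand, cached in a Dict, and reused by
# statistics that depend on it.
struct StatModel{T}
    x::Vector{T}
    cache::Dict{Symbol,Float64}
end
StatModel(x) = StatModel(x, Dict{Symbol,Float64}())

function getstat(m::StatModel, s::Symbol)
    get!(m.cache, s) do
        s === :sum  ? sum(m.x) :
        s === :mean ? getstat(m, :sum) / length(m.x) :
        s === :var  ? sum(abs2, m.x .- getstat(m, :mean)) / (length(m.x) - 1) :
        s === :std  ? sqrt(getstat(m, :var)) :
        error("unknown statistic $s")
    end
end

model = StatModel(randn(1000))
getstat(model, :std)   # computes and caches sum, mean, and var along the way
getstat(model, :mean)  # now an O(1) cached lookup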

@Nosferican
Contributor

Status of this?

@nalimilan
Member

Note that we can now use s, m, v, sd = stats(x, (sum, mean, var, std)) since functions have their own types.
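
For example, a tuple of functions like (sum, mean, var, std) has the concrete type Tuple{typeof(sum),typeof(mean),typeof(var),typeof(std)}, so a hypothetical stats function can dispatch on it directly (using Statistics is needed for var, std, and varm in current Julia):

using Statistics

# Dispatch on the tuple-of-functions type; reuse the mean when computing var.
function stats(x, ::Tuple{typeof(sum),typeof(mean),typeof(var),typeof(std)})
    s = sum(x)
    m = s / length(x)
    v = varm(x, m)
    return s, m, v, sqrt(v)
end

s, m, v, sd = stats(randn(100), (sum, mean, var, std))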

@nalimilan
Member

This isn't implemented; my last comment just meant that we can now use (sum, mean) instead of (:sum, :mean).

@nalimilan reopened this Oct 28, 2023
@ParadaCarleton
Contributor

Thanks, sorry about that.

@ParadaCarleton
Contributor

I'll note that this can be done automatically using the lazy computation provided by something like Transducers.jl, or graph reduction algorithms in general.
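
Not Transducers.jl itself, but for reference, here is a sketch of the kind of single-pass fused loop such an approach could reduce to (Welford's algorithm; the function name is made up):

# One sweep over x yields sum, mean, variance, and std together;
# Welford's online update keeps the variance numerically stable.
function fused_sum_mean_var_std(x)
    n, s, m, M2 = 0, zero(float(eltype(x))), 0.0, 0.0
    for xi in x
        n  += 1
        s  += xi
        d   = xi - m
        m  += d / n
        M2 += d * (xi - m)
    end
    v = M2 / (n - 1)
    return s, m, v, sqrt(v)
end

fused_sum_mean_var_std(randn(10_000))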
