Fused computation of statistics #78
Another way is to use macros:
Instead of using multiple dispatch (which would grow exponentially as the tuple size increases), a macro can generate the optimal code at compile time, and we don't have to define a bunch of specialized methods by hand.
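As a sketch of what the macro approach might look like (the `@fused_stats` name and its output shape are invented for illustration, not an existing API):

```julia
# Hypothetical macro: expands into a single block that computes
# sum, mean, and var while sharing intermediate results.
macro fused_stats(x)
    quote
        local data = $(esc(x))
        local n = length(data)
        local s = sum(data)                                   # one O(n) pass
        local m = s / n                                       # reuses s
        local v = sum(abs2, xi - m for xi in data) / (n - 1)  # one more O(n) pass
        (sum = s, mean = m, var = v)
    end
end
```

For example, `r = @fused_stats rand(100)` followed by `r.mean`: all three statistics come from just two passes over the data.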
I agree we need something like this; it would be even better if we could tie it into the sufficient-statistics machinery in Distributions.jl.
Good idea.
This would be great to have. Similar issues come up in HypothesisTests.jl, where many statistics require the calculation of other, simpler statistics.
How about this idea: http://en.wikipedia.org/wiki/Incremental_computing

We could have an incremental computing model that is initialized on the data set. Statistics on this model would then cache any reusable intermediate values, so that subsequent requests for those values, or for other statistics that depend on them, just get the cached values from the model, speeding up computations. This way we are not limited to just the few standard functions. The code may look something like (very hand-wavy):

```julia
model = IC_single_factor_model()
setdata!(model, x)
mean = get(model, :mean)  # this depends on sum, so
sum  = get(model, :sum)   # this becomes constant time
std  = get(model, :std)   # this requires another O(n) sum of squares, but an O(1) cached lookup of sum
var  = get(model, :var)   # but at this point var becomes constant time
```

With all these flexibilities, these are the prices we'd pay (cons):
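A minimal sketch of how such a caching model could work (the `ICModel` type and this two-argument `get` method are invented for illustration, not an existing API):

```julia
# Hypothetical incremental-computing model: each statistic is computed
# on demand and cached, so dependent statistics reuse earlier results.
mutable struct ICModel
    data::Vector{Float64}
    cache::Dict{Symbol,Float64}
end
ICModel(x) = ICModel(x, Dict{Symbol,Float64}())

function Base.get(m::ICModel, key::Symbol)
    haskey(m.cache, key) && return m.cache[key]       # O(1) cached lookup
    n = length(m.data)
    val = key === :sum  ? sum(m.data) :
          key === :mean ? get(m, :sum) / n :          # reuses :sum
          key === :var  ? (μ = get(m, :mean);         # reuses :mean
                           sum(abs2, xi - μ for xi in m.data) / (n - 1)) :
          key === :std  ? sqrt(get(m, :var)) :        # reuses :var
          error("unknown statistic: $key")
    m.cache[key] = val                                # cache and return
end
```

Asking for `:mean` first makes a later `:sum` request constant time, matching the hand-wavy sketch above.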
Status of this?
Note that we can now use
This isn't implemented; my last comment just meant that we can now use
Thanks, sorry about that.
I'll note that this can be done automatically using the lazy computation provided by something like Transducers.jl, or graph reduction algorithms in general. |
I have been considering a uniform interface for computing multiple statistics all at once, while allowing them to share part of the computation.
Consider the following example: we want to compute `sum`, `mean`, `var`, and `std` from `x`. Calling each function separately clearly wastes a lot of computation (e.g. it actually computes the sum four times, the mean three times, and the variance twice).
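The wasteful pattern described is presumably along these lines:

```julia
using Statistics  # provides mean, var, std

x  = [1.0, 2.0, 3.0, 4.0]
s  = sum(x)   # one pass over x
m  = mean(x)  # computes the sum again internally
v  = var(x)   # computes the mean (and hence the sum) again internally
sd = std(x)   # computes the variance (mean, sum) again internally
```

Each call is correct on its own, but none of the shared intermediate results are reused.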
A more efficient way would be to compute each shared quantity once and reuse it.
This is more efficient, but not as concise and convenient.
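The efficient hand-written version would look something like this, with every intermediate result computed exactly once:

```julia
x  = [1.0, 2.0, 3.0, 4.0]
n  = length(x)
s  = sum(x)                                   # single O(n) pass
m  = s / n                                    # reuses s
v  = sum(abs2, xi - m for xi in x) / (n - 1)  # one more O(n) pass (Bessel-corrected)
sd = sqrt(v)                                  # reuses v
```

Two passes over the data yield all four statistics, but every call site has to spell this out by hand.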
I am considering the following way:
Internally, it should find an efficient routine that computes them all together. Here, `sum_` and `mean_` are typed indicators. Different combinations of statistics are different tuple types, and therefore we can leverage Julia's multiple dispatch mechanism to choose the optimal computation path.
This is not urgent, but it would be really nice to have. I am not going to implement this in the near future; I'm just opening this thread to collect ideas, suggestions, and opinions.