Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move statistics out of Base #27152

Merged
merged 1 commit into from
May 28, 2018
Merged

move statistics out of Base #27152

merged 1 commit into from
May 28, 2018

Conversation

fredrikekre
Copy link
Member

see #27140

@fredrikekre fredrikekre changed the title RFC/WIP: move statistics out of Base RFC: move statistics out of Base May 19, 2018
@fredrikekre fredrikekre force-pushed the fe/stats branch 2 times, most recently from 59448dc to 66ea31d Compare May 20, 2018 00:56
@fredrikekre
Copy link
Member Author

fredrikekre commented May 20, 2018

Ok, this is done, StatsBase PR at JuliaStats/StatsBase.jl#379 which passes on Julia v0.6.2, master and on this branch.

I have some questions and/or comments about this PR, and specifically to people that don't like this.

  1. If you are doing statistics in Julia, are these functions in Base enough? I feel like they are pretty limited, and if you are doing real statistics work you probably already have a using StatsBase in your code, no? If so, this change should essentially be a no-op.

  2. Currently it is a huge disadvantage to be in Base (or stdlib) since the releases are much less frequent. Package code gets updated much faster, and you can get a bugfix releas usually within a week. Isn't this a great change w.r.t that?

  3. Does it not make sense to group all stats related functions in one package? Why do you want some code in Base, and some code in StatsBase? This change should make it much easier to maintain etc.

Just my 2¢.

@fredrikekre fredrikekre added deprecation This change introduces or involves a deprecation triage This should be discussed on a triage call excision Removal of code from Base or the repository labels May 20, 2018
@innerlee
Copy link
Contributor

I want to see mean and std (perhaps median and quantile if possible) to stay in Base, they are frequently used even I do not do statistics.

@musm
Copy link
Contributor

musm commented May 20, 2018

Is it possible to rename StatsBase to Statistics, since this is promoting the package to be a fundamental package for statistics (even if it is indeed a 'base' package, Statistics sounds more appropriate and similar to the other base packages).

Currently it is a huge disadvantage to be in Base (or stdlib) since the releases are much less frequent. Package code gets updated much faster, and you can get a bugfix releas usually within a week. Isn't this a great change w.r.t that?

My understanding was that packages under stdlib, can indeed be updated independently of base julia and thus releases of bug fixes of stdlib packages are not tied to any specific release schedule of base.

@quinnj
Copy link
Member

quinnj commented May 20, 2018

I also have a case where I rely on statistics functions in Base (std, mean, cor), but don't need StatsBase. I'd prefer to see the following:

  • Move more of the current StatsBase.jl package functionality to a Statistics stdlib module
  • Get rid of StatsBase.jl and either move non-stdlib functionality to other stats packages (or the currently proposed Stats.jl) or possible a StatisticsExtensions.jl package if needed
  • Further work on Stats.jl to load the whole kitchen sink in terms of common statistical packages

I like that Statistics would be a proper stdlib package, with the stdlib "stamp of approval", proximity to Base code (for timely updates and consideration w/ Base changes), and really, this is Julia we're talking about: fresh numerical computing! Surely we can take some "base" statistics functionality seriously enough to include it in our stdlib. I think it's an important signal about how Julia views stats functionality.

@JeffBezanson
Copy link
Member

I'm fine with having a Statistics package in the stdlib. It could either be StatsBase+(functions currently in Base), or a separate package if there's a logical place to draw a line, but I'm not sure there is.

@o314
Copy link
Contributor

o314 commented May 20, 2018

A separation line could be to keep online algorithms eg. algo that can be computed in one pass with o(1) memory (streamable map-reduce) in stdlib, if not base and move more heavy lifting stuff to its own package.

Josh Day has provided a good list of them
We can cite among them mean, variance, quantiles, maximum minimum, sum, histogram

@nalimilan
Copy link
Member

nalimilan commented May 21, 2018

I agree we have to keep some stats methods in the stdlib. Let's start by moving statistics.jl to a Statistics module, and then we can import things which are deemed essential from StatsBase. Here's a quick summary:

Standard features which would be useful in the stdlib:

  • mean functions (geometric, harmonic and generalized mean)
  • scalar stats (moments, span, coefficient of variation, standard error of the mean, median absolute deviation, z-scores, entropies, percentiles, n-quantiles, interquartile range, modes)
  • robust statistics (trimming and winsoring iterators, variance of trimmed mean)
  • weight vectors and weighted methods of standard functions (documented in other pages)
  • ranks and rank correlations
  • sampling: at least the sample function, but probably not all the specific functions

Things which may not be needed or may need to be adapted first:

  • histograms and ECDF: these are relatively complex so we have to be sure we're happy with the API (Proposal: Move histograms to separate package JuliaStats/StatsBase.jl#650). Take ecdf at least.
  • deviations: these can now be computed easily and efficiently using a generator; though some of them have rather complex formulas, so could still be convenient enough to keep
  • scatter matrix and cov2cor/cor2cov: not sure how useful they are; at least cov2cor/cor2cov seem to be pretty standard
  • counting functions: these are essential, but the API or function names should probably be adjusted, and maybe not all functions are needed; one of the main issues is that ideally these would return NamedArray/AxisArray objects like FreqTables, but it's not possible in Base. Maybe at least take countmap, and rename counts to something more explicit.
  • correlation analysis of signals: are these considered as essential enough? I'm not familiar enough with them to have an opinion. I guess it's nice to have them.
  • miscellaneous (run-length encoding, index/levels maps, indicator matrix, midpoints)
  • stats models abstractions: move to StatsModels

@fredrikekre
Copy link
Member Author

My understanding was that packages under stdlib, can indeed be updated independently of base julia and thus releases of bug fixes of stdlib packages are not tied to any specific release schedule of base.

Yes, that is the goal, but the functionality for that does not exist yet, but would most likely be very much simpler to implement if the standard libraries had their own repos. So I see this PR as a step in that direction without the stopover in JuliaLang/julia/stdlib, see my thoughts in #27197

I also have a case where I rely on statistics functions in Base (std, mean, cor), but don't need StatsBase. I'd prefer to see the following:

  • Move more of the current StatsBase.jl package functionality to a Statistics stdlib module

But then you effectively depend on all the code in StatsBase anyway, so what is the difference?

I like that Statistics would be a proper stdlib package, with the stdlib "stamp of approval", proximity to Base code (for timely updates and consideration w/ Base changes), and really, this is Julia we're talking about: fresh numerical computing! Surely we can take some "base" statistics functionality seriously enough to include it in our stdlib. I think it's an important signal about how Julia views stats functionality.

Yes! And I think that this PR is a step in that direction, as explained above and #27197. But also, if we have a Statistics stdlib it should be the real deal, not just these 8 or so functions?

@StefanKarpinski
Copy link
Member

StefanKarpinski commented May 24, 2018

Here's where I propose cutting the baby in half...

Keep in Base:

  • mean, mean!
  • median, median!
  • middle
  • quantile, quantile!

Put in stdlib/Statistics:

  • cor
  • cov
  • std, stdm
  • var, varm
  • linreg

And merge StatsBase into stdlib/Statistics in the future.

@JeffBezanson
Copy link
Member

I'm ok with that. I can certainly see removing mean annoying a lot of people, so we might as well keep that and a couple other functions in Base, and move the rest.

@fredrikekre
Copy link
Member Author

Alright, I can update the PR.

We shouldn't just move cor, cov ... to StatsBase directly instead of the stopover at stdlib/Statistics? Then StatsBase can rename itself to Statistics and we are done, no?

@StefanKarpinski
Copy link
Member

Sure, that would be fine too.

@fredrikekre fredrikekre force-pushed the fe/stats branch 3 times, most recently from f27a71d to fe2b611 Compare May 25, 2018 11:37
@fredrikekre
Copy link
Member Author

Ok, so I here we get

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

on all three Travis bots, timeout on AppVeyor and

E: Unable to locate package g++-4.8-multilib

on CircleCI. What a time to be alive. Drop julia support for everything but FreeBSD?

Anyway, passes locally.

@StefanKarpinski
Copy link
Member

#27260 for those who are wondering why everything is broken.

@fredrikekre fredrikekre reopened this May 25, 2018
@fredrikekre fredrikekre changed the title RFC: move statistics out of Base move statistics out of Base May 28, 2018
@fredrikekre
Copy link
Member Author

Lets merge when CI is done.

@ViralBShah ViralBShah merged commit 746d08f into master May 28, 2018
@ViralBShah ViralBShah deleted the fe/stats branch May 28, 2018 12:38
@fredrikekre fredrikekre removed the triage This should be discussed on a triage call label May 28, 2018
martinholters added a commit to JuliaLang/Compat.jl that referenced this pull request May 29, 2018
These functions call through to Base functions, which is rendered
impossible after their moval to StatsBase (JuliaLang/julia#27152). As
StatsBase is using Compat, Compat cannot use StatsBase. In the future,
these functions should print deprecation warinings also on 0.6, telling
people how to get the desired functionality, once that is settled.

Alternatively, we could duplicate the code now moved to StatsBase in
Compat.
martinholters added a commit to JuliaLang/Compat.jl that referenced this pull request May 29, 2018
These functions call through to Base functions, which is rendered
impossible after their moval to StatsBase (JuliaLang/julia#27152). As
StatsBase is using Compat, Compat cannot use StatsBase. In the future,
these functions should print deprecation warinings also on 0.6, telling
people how to get the desired functionality, once that is settled.

Alternatively, we could duplicate the code now moved to StatsBase in
Compat.
martinholters added a commit to JuliaLang/Compat.jl that referenced this pull request May 30, 2018
These functions call through to Base functions, which is rendered
impossible after their moval to StatsBase (JuliaLang/julia#27152). As
StatsBase is using Compat, Compat cannot use StatsBase. In the future,
these functions should print deprecation warinings also on 0.6, telling
people how to get the desired functionality, once that is settled.

Alternatively, we could duplicate the code now moved to StatsBase in
Compat.
martinholters added a commit to JuliaLang/Compat.jl that referenced this pull request May 30, 2018
….7 (#559)

These functions call through to Base functions, which is rendered
impossible after their moval to StatsBase (JuliaLang/julia#27152). As
StatsBase is using Compat, Compat cannot use StatsBase. In the future,
these functions should print deprecation warinings also on 0.6, telling
people how to get the desired functionality, once that is settled.
@fredrikekre fredrikekre added the needs news A NEWS entry is required for this change label May 31, 2018
@ararslan
Copy link
Member

ararslan commented Jun 1, 2018

I think this was a huge mistake. #27374

ararslan added a commit that referenced this pull request Jun 1, 2018
Revert "move cor, cov, std, stdm, var, varm and linreg to StatsBase (#27152)"
This reverts commit 746d08f.
Fixes #27374
@StefanKarpinski
Copy link
Member

991c8331ac24c5d9c70a29f534055088

@ararslan
Copy link
Member

ararslan commented Jun 1, 2018

Yes, pretty much.

@ViralBShah
Copy link
Member

I don't think so. I think perhaps we may need an iteration or two on what belongs where, but the general direction is right.

@StefanKarpinski
Copy link
Member

To be clear, Job post does not mean I think this is a huge mistake, I think it's the right direction too.

@KristofferC KristofferC removed the needs news A NEWS entry is required for this change label Nov 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deprecation This change introduces or involves a deprecation excision Removal of code from Base or the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.