-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats does not document the bins used to compute median and mode #3883
Comments
I do not think that the ISIS stats program uses bins at all to compute its statistics. It considers all of the pixels as a big pile of numbers, and computes the statistics of those numbers. |
The typical definition of mode is the bin that is most populated, so bins need to be involved in some way. Median can be computed without using bins, but I find it most efficient to compute it at the same time I split up the data into bins to compute the mode (and histogram) (no point sorting the data twice...). So the nature of the bins is not actually relevant to median. Forget that part of my post. But the fact we are debating this suggests the documentation can be improved... |
I don't think that's the typical definition of mode. Binning is not required to calculate the mode of a set of values. Searches on 'statistics mode' provide no binning requirement, and most of the software implementations of calculating the mode also do not require a bin selection prior to the calculation. Surely, there are software implementations where if you were binning data, you might be able to pull the mode out without a separate iteration through the data, but that's not required, and (perhaps more importantly) not how the ISIS stats program works. Documentation can always be improved, to be sure. |
Yes, there are ways to define mode that do not involve bins, but the stats application, run in GUI-mode, reports that it is spending a lot of time "Gathering histogram." No histogram is produced by the stats application and none of the reported statistics, other than mode, are associated with calculating a histogram. So, either (a) stats is doing a useless computation of a histogram that it does not report back to the user, (b) stats is using a histogram to compute mode but is not reporting the histogram parameters needed by the user to use the reported value for mode, or (c) the GUI is reporting that it is "gathering histogram" when it is actually doing something else. At this point I no longer know how to classify this ticket... |
The Histogram object does use bins for its median and mode. [This function](https://github.com/USGS-Astrogeology/ISIS3/blob/dev/isis/src/base/objs Putting this into the output file will be too far too much data. I think adding a few sentences to the documentation is sufficient. |
@jessemapel yeah, so this is kind of weird design in the Yes, the However, the ISIS Statistics class does not appear to have a Mode or Median method (although it easily could), which is why the When you actually apply some kind of binning to a Histogram object, sure, it makes sense to calculate those things on the binned data, so it makes complete sense that the Histogram class has those 'special' methods, but in this case, for the way that the So I think the messages about a "Histogram" during the run of stats are kind of misleading. Yes, the software does compute a histogram, but it doesn't have to, and the results are mathematically equivalent to not doing so. The right solution would be to implement those statistical methods in the ISIS Statistics class, |
@rbeyer The Statistics class does not store the input data, it only accumulates it. This is done so that the entire cube doesn't need to be loaded into memory when computing things like this. When we need something like the median or mode that requires sorting the data, sorting it into pre-defined bins is much easier than reading the entire cube into memory and sorting it. For extremely large cubes, I would be very concerned about creating and sorting a multi-gigabyte array. |
Ah, that's what I was missing in that software design. Okay, sounds good. |
Description
The value for median and mode of a data set depend on the way in which the data is binned. While it may be cumbersome to document this in the output text file, the online user documentation should explain how the binning is set up.
Suggested Edit
Explain how stats decides on the min/max range of the bins and the number of bins and the size of the bins. If this documentation is somewhere other than in the documentation for stats, a link to that location from the stats documentation is important to add.
Thanks!
The text was updated successfully, but these errors were encountered: