stats does not document the bins used to compute median and mode #3883

lavalaz · 2020-05-27T01:05:46Z

Description

The value for median and mode of a data set depend on the way in which the data is binned. While it may be cumbersome to document this in the output text file, the online user documentation should explain how the binning is set up.
Suggested Edit

Explain how stats decides on the min/max range of the bins and the number of bins and the size of the bins. If this documentation is somewhere other than in the documentation for stats, a link to that location from the stats documentation is important to add.
Thanks!

rbeyer · 2020-05-27T01:32:56Z

I do not think that the ISIS stats program uses bins at all to compute its statistics. It considers all of the pixels as a big pile of numbers, and computes the statistics of those numbers.

lavalaz · 2020-05-27T01:54:53Z

The typical definition of mode is the bin that is most populated, so bins need to be involved in some way. Median can be computed without using bins, but I find it most efficient to compute it at the same time I split up the data into bins to compute the mode (and histogram) (no point sorting the data twice...). So the nature of the bins is not actually relevant to median. Forget that part of my post. But the fact we are debating this suggests the documentation can be improved...

rbeyer · 2020-05-27T02:02:48Z

I don't think that's the typical definition of mode. Binning is not required to calculate the mode of a set of values. Searches on 'statistics mode' provide no binning requirement, and most of the software implementations of calculating the mode also do not require a bin selection prior to the calculation.

Surely, there are software implementations where if you were binning data, you might be able to pull the mode out without a separate iteration through the data, but that's not required, and (perhaps more importantly) not how the ISIS stats program works.

Documentation can always be improved, to be sure.

lavalaz · 2020-05-27T15:00:36Z

Yes, there are ways to define mode that do not involve bins, but the stats application, run in GUI-mode, reports that it is spending a lot of time "Gathering histogram." No histogram is produced by the stats application and none of the reported statistics, other than mode, are associated with calculating a histogram. So, either (a) stats is doing a useless computation of a histogram that it does not report back to the user, (b) stats is using a histogram to compute mode but is not reporting the histogram parameters needed by the user to use the reported value for mode, or (c) the GUI is reporting that it is "gathering histogram" when it is actually doing something else. At this point I no longer know how to classify this ticket...

jessemapel · 2020-05-27T15:44:56Z

The Histogram object does use bins for its median and mode.

[This function](https://github.com/USGS-Astrogeology/ISIS3/blob/dev/isis/src/base/objs
/Histogram/Histogram.cpp#L273) is how stats determines the min/max and bins to use. In short, it tries to make one bin for every possible DN in the cube. If the cube is 32-bit, then it uses 65536 bins and queries the min and max values in the cube to determine the min/max.

Putting this into the output file will be too far too much data. I think adding a few sentences to the documentation is sufficient.

rbeyer · 2020-05-27T16:39:39Z

@jessemapel yeah, so this is kind of weird design in the stats program that maybe I'm not 100% understanding.

Yes, the stats program tells the cube to create a Histogram object (at line 119 which is why Laz sees that "Gathering Histogram" message) where there are as many bins as there are DN, as you point out. However, mathematically making a "histogram" like this to calculate these values is equivalent to not making a histogram at all, and just calculating these values based on the DN values. In fact, the ISIS Histogram class is a descendent of the ISIS Statistics class, so many of the functions that are being called on the instantiated Histogram in the stats program (which is humorously named "stats") are just straight up ISIS Statistics class functions.

However, the ISIS Statistics class does not appear to have a Mode or Median method (although it easily could), which is why the stats program must use the ISIS Histogram object for those methods. However, there is this computational cost of creating a (potentially very large) Histogram object to perform math that could just be done in the ISIS Statistics class with a much lower computational cost.

When you actually apply some kind of binning to a Histogram object, sure, it makes sense to calculate those things on the binned data, so it makes complete sense that the Histogram class has those 'special' methods, but in this case, for the way that the stats program works, creating this weird every-DN-a-bin Histogram just to enable calculating these statistical values on an array that already exists, seems like extra work.

So I think the messages about a "Histogram" during the run of stats are kind of misleading. Yes, the software does compute a histogram, but it doesn't have to, and the results are mathematically equivalent to not doing so.

The right solution would be to implement those statistical methods in the ISIS Statistics class,
so that you can have the stats program just have the cube create a Statistics object, rather than a Histogram object, which would provide identical results to users of stats, remove the misleading messages about histograms, and would speed up the result. However, even for large cubes, stats doesn't really take all that long to run, so patching docs is probably easier.

jessemapel · 2020-05-27T16:54:06Z

@rbeyer The Statistics class does not store the input data, it only accumulates it. This is done so that the entire cube doesn't need to be loaded into memory when computing things like this.

When we need something like the median or mode that requires sorting the data, sorting it into pre-defined bins is much easier than reading the entire cube into memory and sorting it.

For extremely large cubes, I would be very concerned about creating and sorting a multi-gigabyte array.

rbeyer · 2020-05-27T16:57:26Z

Ah, that's what I was missing in that software design. Okay, sounds good.

acpaquette self-assigned this Jun 11, 2020

acpaquette mentioned this issue Jun 15, 2020

Stats Docs #3923

Merged

11 tasks

jlaura closed this as completed Sep 8, 2020

scsides added a commit to scsides/ISIS3 that referenced this issue Sep 22, 2020

Removed library file names from library paths. Closes DOI-USGS#3883

02bdaf8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stats does not document the bins used to compute median and mode #3883

stats does not document the bins used to compute median and mode #3883

lavalaz commented May 27, 2020

rbeyer commented May 27, 2020

lavalaz commented May 27, 2020

rbeyer commented May 27, 2020

lavalaz commented May 27, 2020

jessemapel commented May 27, 2020

rbeyer commented May 27, 2020

jessemapel commented May 27, 2020

rbeyer commented May 27, 2020

stats does not document the bins used to compute median and mode #3883

stats does not document the bins used to compute median and mode #3883

Comments

lavalaz commented May 27, 2020

rbeyer commented May 27, 2020

lavalaz commented May 27, 2020

rbeyer commented May 27, 2020

lavalaz commented May 27, 2020

jessemapel commented May 27, 2020

rbeyer commented May 27, 2020

jessemapel commented May 27, 2020

rbeyer commented May 27, 2020