Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats does not document the bins used to compute median and mode #3883

Closed
lavalaz opened this issue May 27, 2020 · 8 comments
Closed

stats does not document the bins used to compute median and mode #3883

lavalaz opened this issue May 27, 2020 · 8 comments
Assignees

Comments

@lavalaz
Copy link

lavalaz commented May 27, 2020

Description

The value for median and mode of a data set depend on the way in which the data is binned. While it may be cumbersome to document this in the output text file, the online user documentation should explain how the binning is set up.
Suggested Edit

Explain how stats decides on the min/max range of the bins and the number of bins and the size of the bins. If this documentation is somewhere other than in the documentation for stats, a link to that location from the stats documentation is important to add.
Thanks!

@rbeyer
Copy link
Contributor

rbeyer commented May 27, 2020

I do not think that the ISIS stats program uses bins at all to compute its statistics. It considers all of the pixels as a big pile of numbers, and computes the statistics of those numbers.

@lavalaz
Copy link
Author

lavalaz commented May 27, 2020

The typical definition of mode is the bin that is most populated, so bins need to be involved in some way. Median can be computed without using bins, but I find it most efficient to compute it at the same time I split up the data into bins to compute the mode (and histogram) (no point sorting the data twice...). So the nature of the bins is not actually relevant to median. Forget that part of my post. But the fact we are debating this suggests the documentation can be improved...

@rbeyer
Copy link
Contributor

rbeyer commented May 27, 2020

I don't think that's the typical definition of mode. Binning is not required to calculate the mode of a set of values. Searches on 'statistics mode' provide no binning requirement, and most of the software implementations of calculating the mode also do not require a bin selection prior to the calculation.

Surely, there are software implementations where if you were binning data, you might be able to pull the mode out without a separate iteration through the data, but that's not required, and (perhaps more importantly) not how the ISIS stats program works.

Documentation can always be improved, to be sure.

@lavalaz
Copy link
Author

lavalaz commented May 27, 2020

Yes, there are ways to define mode that do not involve bins, but the stats application, run in GUI-mode, reports that it is spending a lot of time "Gathering histogram." No histogram is produced by the stats application and none of the reported statistics, other than mode, are associated with calculating a histogram. So, either (a) stats is doing a useless computation of a histogram that it does not report back to the user, (b) stats is using a histogram to compute mode but is not reporting the histogram parameters needed by the user to use the reported value for mode, or (c) the GUI is reporting that it is "gathering histogram" when it is actually doing something else. At this point I no longer know how to classify this ticket...

@jessemapel
Copy link
Contributor

The Histogram object does use bins for its median and mode.

[This function](https://github.com/USGS-Astrogeology/ISIS3/blob/dev/isis/src/base/objs
/Histogram/Histogram.cpp#L273) is how stats determines the min/max and bins to use. In short, it tries to make one bin for every possible DN in the cube. If the cube is 32-bit, then it uses 65536 bins and queries the min and max values in the cube to determine the min/max.

Putting this into the output file will be too far too much data. I think adding a few sentences to the documentation is sufficient.

@rbeyer
Copy link
Contributor

rbeyer commented May 27, 2020

@jessemapel yeah, so this is kind of weird design in the stats program that maybe I'm not 100% understanding.

Yes, the stats program tells the cube to create a Histogram object (at line 119 which is why Laz sees that "Gathering Histogram" message) where there are as many bins as there are DN, as you point out. However, mathematically making a "histogram" like this to calculate these values is equivalent to not making a histogram at all, and just calculating these values based on the DN values. In fact, the ISIS Histogram class is a descendent of the ISIS Statistics class, so many of the functions that are being called on the instantiated Histogram in the stats program (which is humorously named "stats") are just straight up ISIS Statistics class functions.

However, the ISIS Statistics class does not appear to have a Mode or Median method (although it easily could), which is why the stats program must use the ISIS Histogram object for those methods. However, there is this computational cost of creating a (potentially very large) Histogram object to perform math that could just be done in the ISIS Statistics class with a much lower computational cost.

When you actually apply some kind of binning to a Histogram object, sure, it makes sense to calculate those things on the binned data, so it makes complete sense that the Histogram class has those 'special' methods, but in this case, for the way that the stats program works, creating this weird every-DN-a-bin Histogram just to enable calculating these statistical values on an array that already exists, seems like extra work.

So I think the messages about a "Histogram" during the run of stats are kind of misleading. Yes, the software does compute a histogram, but it doesn't have to, and the results are mathematically equivalent to not doing so.

The right solution would be to implement those statistical methods in the ISIS Statistics class,
so that you can have the stats program just have the cube create a Statistics object, rather than a Histogram object, which would provide identical results to users of stats, remove the misleading messages about histograms, and would speed up the result. However, even for large cubes, stats doesn't really take all that long to run, so patching docs is probably easier.

@jessemapel
Copy link
Contributor

@rbeyer The Statistics class does not store the input data, it only accumulates it. This is done so that the entire cube doesn't need to be loaded into memory when computing things like this.

When we need something like the median or mode that requires sorting the data, sorting it into pre-defined bins is much easier than reading the entire cube into memory and sorting it.

For extremely large cubes, I would be very concerned about creating and sorting a multi-gigabyte array.

@rbeyer
Copy link
Contributor

rbeyer commented May 27, 2020

Ah, that's what I was missing in that software design. Okay, sounds good.

@acpaquette acpaquette self-assigned this Jun 11, 2020
@acpaquette acpaquette mentioned this issue Jun 15, 2020
11 tasks
@jlaura jlaura closed this as completed Sep 8, 2020
scsides added a commit to scsides/ISIS3 that referenced this issue Sep 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants