Capture running statistics for algorithms #19

Closed
wants to merge 9 commits into from
@nishihatapalmer nishihatapalmer commented May 28, 2022

This PR has changes which allow algorithms to report running statistics.

  • Algorithms that don't implement statistics don't have to do anything.
  • A wide range of stats can be reported on, including memory, size of tables, number of shifts, fast path count, etc.
  • An algorithm can define up to 8 algorithm-specific stats it wishes to report on.
  • Stats summaries are printed to the console during a run, written to the HTML reports, and written to a tab-delimited text file.

Running
To obtain stats in a run, use the -stats flag:

   ./smart -text englishTexts -pre -stats
____________________________________________________________
Experimental results on englishTexts: EXP1653747757
Searching for a set of 500 patterns with length 4
Testing 2 algorithms

 - [1/2] TWFR4 ...............[OK]  	0.01 + 1.82 ms     
 - [2/2] WFR4 ................[OK]  	0.02 + 2.40 ms     	mem=65536, size=65536, bytes=4202857, shifts=1048572, tests=2470, bits=9, empty=65526
____________________________________________________________
Experimental results on englishTexts: EXP1653747757
Searching for a set of 500 patterns with length 8
Testing 2 algorithms

 - [1/2] TWFR4 ...............[OK]  	0.01 + 0.73 ms     
 - [2/2] WFR4 ................[OK]  	0.02 + 0.84 ms     	mem=65536, size=65536, bytes=865484, shifts=212562, tests=126, bits=33, empty=65502
____________________________________________________________
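
The stats portion of each console line above is a comma-separated list of name=value pairs. As a rough sketch of how such a line could be assembled (formatStatSummary is a hypothetical helper written for illustration, not a function from this PR):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "name=value, name=value, ..." into out.
   Returns the number of characters written, or -1 if out is too small. */
int formatStatSummary(char *out, size_t outSize,
                      const char *const names[], const long values[], int count)
{
    size_t used = 0;
    if (outSize == 0) return -1;
    out[0] = '\0';
    for (int i = 0; i < count; i++) {
        /* Prefix every pair except the first with ", ". */
        int written = snprintf(out + used, outSize - used, "%s%s=%ld",
                               i > 0 ? ", " : "", names[i], values[i]);
        if (written < 0 || (size_t)written >= outSize - used) return -1;
        used += (size_t)written;
    }
    return (int)used;
}
```

Called with the names "bits" and "empty" and the values 9 and 65526, this produces the text "bits=9, empty=65526", matching the tail of the WFR4 line for pattern length 4.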

Standard Stats
It's entirely up to the implementer which stats are reported, but a standard set of statistics is provided that fits most search algorithms.

enum searchInfoStats {
    TEXT_LENGTH,             // length of text searched
    SEARCH_INDEX_BYTES,      // memory consumed by search indexes
    SEARCH_INDEX_ENTRIES,    // number of entries in the search index
    SEARCH_INDEX2_ENTRIES,   // number of entries in the second search index (if present)
    MAIN_LOOP_COUNT,         // number of times the main loop executes
    TEXT_BYTES_READ,         // number of bytes read from the text
    INDEX_LOOKUP_COUNT,      // number of times a lookup in an index is made
    FAST_PATH_COUNT,         // number of times the fast path is taken.
    FAST_PATH_SHIFTS,        // sum of shifts obtained in the fast path
    SLOW_PATH_COUNT,         // number of times the slow path is taken.
    SLOW_PATH_SHIFTS,        // sum of shifts obtained in the slow path
    VALIDATION_COUNT,        // number of times a pattern validation is attempted
    VALIDATION_BYTES_READ,   // text bytes read during pattern validation
    VALIDATION_SHIFTS,       // sum of shifts obtained following validation
    NUM_SHIFTS,              // number of times the algorithm actually shifts (not the sum of the shifts)
    MATCH_COUNT,             // the number of matches the algorithm detects
    ALGO_VALUES              // algorithm defined values (up to 8)
};
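
To illustrate how these identifiers might be used, here is a minimal sketch of a Horspool search that fills in several of the standard stats as it runs. The statCounters struct and the horspoolWithStats function are assumptions made for this example; the PR's real stats container and search signature are not shown in this description.

```c
#include <string.h>

/* Standard stat identifiers, copied from the enum above. */
enum searchInfoStats {
    TEXT_LENGTH, SEARCH_INDEX_BYTES, SEARCH_INDEX_ENTRIES, SEARCH_INDEX2_ENTRIES,
    MAIN_LOOP_COUNT, TEXT_BYTES_READ, INDEX_LOOKUP_COUNT, FAST_PATH_COUNT,
    FAST_PATH_SHIFTS, SLOW_PATH_COUNT, SLOW_PATH_SHIFTS, VALIDATION_COUNT,
    VALIDATION_BYTES_READ, VALIDATION_SHIFTS, NUM_SHIFTS, MATCH_COUNT,
    ALGO_VALUES
};

/* Hypothetical container (an assumption): one counter per standard stat. */
struct statCounters {
    long value[ALGO_VALUES];
};

/* Horspool search instrumented with stats. Returns the number of matches. */
int horspoolWithStats(const unsigned char *pattern, int m,
                      const unsigned char *text, int n,
                      struct statCounters *stats)
{
    int shift[256];
    int pos = 0;

    memset(stats, 0, sizeof *stats);
    stats->value[TEXT_LENGTH] = n;
    if (m <= 0 || n < m) return 0;

    /* Build the bad-character shift table from all but the last pattern byte. */
    for (int i = 0; i < 256; i++) shift[i] = m;
    for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;
    stats->value[SEARCH_INDEX_BYTES] = (long)sizeof shift;
    stats->value[SEARCH_INDEX_ENTRIES] = 256;

    while (pos <= n - m) {
        int j = 0;
        stats->value[MAIN_LOOP_COUNT]++;

        /* Validate the current window against the pattern. */
        stats->value[VALIDATION_COUNT]++;
        while (j < m) {
            stats->value[VALIDATION_BYTES_READ]++;
            stats->value[TEXT_BYTES_READ]++;
            if (text[pos + j] != pattern[j]) break;
            j++;
        }
        if (j == m) stats->value[MATCH_COUNT]++;

        /* Shift by the table entry for the last byte of the window. */
        stats->value[TEXT_BYTES_READ]++;
        stats->value[INDEX_LOOKUP_COUNT]++;
        stats->value[NUM_SHIFTS]++;
        stats->value[VALIDATION_SHIFTS] += shift[text[pos + m - 1]];
        pos += shift[text[pos + m - 1]];
    }
    return (int)stats->value[MATCH_COUNT];
}
```

Searching for "abra" in "abracadabra" with this sketch visits three windows and finds two matches, so MAIN_LOOP_COUNT and NUM_SHIFTS both end at 3 while VALIDATION_SHIFTS holds the sum of the shift distances.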

Algorithm defined Stats
Algorithms can define up to 8 stats of their own. To do this, they must implement the getAlgoValueNames() function. The same stats often apply to a whole family of algorithms. Below is an example from \include\bitstats.h, which defines two stats useful for some bit-oriented algorithms.

struct algoValueNames getAlgoValueNames() {
    struct algoValueNames names = {0};
    setAlgoValueName(&names, 0, "bits", "Count of bits set in index");
    setAlgoValueName(&names, 1, "empty", "Count of empty index entries");
    return names;
}

If you don't want to implement any special stats for an algorithm, an empty set of stat names should be returned:

struct algoValueNames getAlgoValueNames() {
    struct algoValueNames names = {0};
    return names;
}
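
The definitions of struct algoValueNames and setAlgoValueName() are not shown in this description. The following is one way they could look, purely as an assumption for illustration: the field names, buffer sizes, and the countAlgoValueNames() helper are all hypothetical.

```c
#include <stdio.h>

#define MAX_ALGO_VALUES 8   /* the PR allows up to 8 algorithm-defined stats */
#define SHORT_NAME_LEN 16   /* assumed size for the short console name */
#define LONG_NAME_LEN 64    /* assumed size for the longer description */

/* Hypothetical layout: a short name and a description per slot. */
struct algoValueNames {
    char shortName[MAX_ALGO_VALUES][SHORT_NAME_LEN];
    char longName[MAX_ALGO_VALUES][LONG_NAME_LEN];
};

/* Stores the names for one slot; out-of-range indexes are ignored.
   snprintf truncates names that are too long and always terminates them. */
void setAlgoValueName(struct algoValueNames *names, int index,
                      const char *shortName, const char *longName)
{
    if (index < 0 || index >= MAX_ALGO_VALUES) return;
    snprintf(names->shortName[index], SHORT_NAME_LEN, "%s", shortName);
    snprintf(names->longName[index], LONG_NAME_LEN, "%s", longName);
}

/* Hypothetical helper: counts the slots that have been given a name. */
int countAlgoValueNames(const struct algoValueNames *names)
{
    int count = 0;
    for (int i = 0; i < MAX_ALGO_VALUES; i++) {
        if (names->shortName[i][0] != '\0') count++;
    }
    return count;
}
```

With a layout like this, the zero-initialised struct returned by an algorithm without special stats reports zero names, so the reporting code needs no separate flag to tell the two cases apart.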

Algorithms
This PR makes no changes to any existing algorithms, so on its own, running -stats will do nothing: no algorithms support it yet.
In further PRs, I'll submit batches of changes that add stats to families of existing algorithms.
To test stats, you should look at those PRs, which are branched off this one.

  * Outputs summary stats to the console during running.
  * Outputs stats to a .stats.txt file for algorithms with statistics.
  * Writes the stats into the HTML report under each algo graph that has stats.
    The table supports dynamically showing or hiding best or worst stats,
    similar to the time comparison table.

Also includes utility headers, shiftstats.h and bitstats.h that define
common stat names that many algorithms use.
   * A few search algorithms have more than one main
     index, e.g. Boyer Moore.
   * This just provides a convenient place to capture
     those stats without having to resort to
     algo-specific values.
  * If there are a few algorithm-defined stats, they tend to run over a line
    in the console, even with reasonably long lines.
  * Keep the names as short as possible, to keep things as neat as possible.
  * A few algorithms maintain more than one table; this header
    can be used to define the basic bit stats for two tables.
  * We don't record this anymore, as we store a stat
    per pattern length anyway.
@nishihatapalmer

I should add that collecting stats does not affect normal benchmark timings, because stats are collected in a separate search function that runs after the main search has been timed.
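
A minimal sketch of that separation, assuming a plain search function and an instrumented twin. All names here (runOnce, naiveSearch, naiveSearchStats, struct runResult) are hypothetical and are not taken from the PR:

```c
#include <stddef.h>
#include <time.h>

typedef int (*searchFn)(const unsigned char *p, int m,
                        const unsigned char *t, int n);
typedef int (*statsSearchFn)(const unsigned char *p, int m,
                             const unsigned char *t, int n, long *stats);

struct runResult {
    double searchMs;     /* time spent in the uninstrumented search only */
    int matches;
    int statsCollected;  /* 1 if the stats variant was run */
};

/* Plain search: this is the only code that gets timed. */
int naiveSearch(const unsigned char *p, int m, const unsigned char *t, int n)
{
    int matches = 0;
    for (int pos = 0; pos + m <= n; pos++) {
        int j = 0;
        while (j < m && t[pos + j] == p[j]) j++;
        if (j == m) matches++;
    }
    return matches;
}

/* Instrumented twin: stats[0] counts byte comparisons. */
int naiveSearchStats(const unsigned char *p, int m,
                     const unsigned char *t, int n, long *stats)
{
    int matches = 0;
    stats[0] = 0;
    for (int pos = 0; pos + m <= n; pos++) {
        int j = 0;
        while (j < m) {
            stats[0]++;
            if (t[pos + j] != p[j]) break;
            j++;
        }
        if (j == m) matches++;
    }
    return matches;
}

/* Times the plain search, then optionally runs the stats variant untimed. */
struct runResult runOnce(searchFn search, statsSearchFn statsSearch,
                         const unsigned char *p, int m,
                         const unsigned char *t, int n,
                         int wantStats, long *stats)
{
    struct runResult r = {0.0, 0, 0};
    clock_t start = clock();
    r.matches = search(p, m, t, n);
    r.searchMs = 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;

    if (wantStats && statsSearch != NULL) {
        statsSearch(p, m, t, n, stats);  /* runs outside the timed region */
        r.statsCollected = 1;
    }
    return r;
}
```

Because the counters live only in the instrumented twin, the timed function carries no extra increments or branches, at the cost of running the search twice when stats are requested.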

@nishihatapalmer

One thing I have found using stats is that the code uses too much stack memory. With a normal number of algorithms to test it's not a problem - but I recently profiled over 1000 algorithms and it would crash. I had to comment out the stats code to run normal profiling.

I think, if you're still interested in it, it should be altered to allocate memory on the heap. I have found the stats capability useful in practice to understand what algorithms are doing, although I only use it occasionally.
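
As a rough sketch of what heap allocation could look like, the table below holds one stats record per algorithm and pattern length and is allocated with calloc instead of being declared as a local array. The struct names, the helper functions, and the figure of 24 values per record (16 standard stats plus 8 algorithm-defined ones) are assumptions for illustration, not the PR's actual data structures.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_STAT_VALUES 24  /* assumed: 16 standard + 8 algorithm-defined */

struct algoStats {
    long value[NUM_STAT_VALUES];
};

/* Hypothetical table: numAlgos x numPatternLengths records on the heap. */
struct statsTable {
    size_t numAlgos;
    size_t numPatternLengths;
    struct algoStats *entries;
};

/* Allocates a zeroed table. Returns 0 on success, -1 on failure. */
int statsTableInit(struct statsTable *table, size_t numAlgos,
                   size_t numPatternLengths)
{
    table->numAlgos = 0;
    table->numPatternLengths = 0;
    table->entries = NULL;
    if (numAlgos == 0 || numPatternLengths == 0) return -1;
    if (numAlgos > SIZE_MAX / numPatternLengths) return -1;  /* overflow guard */

    table->entries = calloc(numAlgos * numPatternLengths, sizeof *table->entries);
    if (table->entries == NULL) return -1;
    table->numAlgos = numAlgos;
    table->numPatternLengths = numPatternLengths;
    return 0;
}

/* Returns the record for (algo, patternLength index), or NULL if out of range. */
struct algoStats *statsTableAt(struct statsTable *table, size_t algo,
                               size_t lengthIndex)
{
    if (algo >= table->numAlgos || lengthIndex >= table->numPatternLengths)
        return NULL;
    return &table->entries[algo * table->numPatternLengths + lengthIndex];
}

/* Releases the table and resets it to the empty state. */
void statsTableFree(struct statsTable *table)
{
    free(table->entries);
    table->entries = NULL;
    table->numAlgos = 0;
    table->numPatternLengths = 0;
}
```

A table sized this way grows with the number of algorithms selected at run time, so a run over 1000 algorithms fails cleanly with an allocation error rather than overflowing the stack.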

@nishihatapalmer nishihatapalmer closed this by deleting the head repository Nov 4, 2022