An evaluation tool for pairwise preference judgements.
The evaluation measures and file formats are described in the following:
B. Carterette, P.N. Bennett, O. Chapelle (2008). A Test Collection of
Preference Judgments. In Proceedings of the SIGIR 2008 Beyond Binary
Relevance: Preferences, Diversity, and Set-Level Judgments Workshop.
Singapore. July, 2008.
http://research.microsoft.com/en-us/um/people/pauben/papers/sigir-2008-bbr-data-preference-overview.pdf
There is one notable exception to the evaluation measures calculated:
APpref as computed by this tool is somewhat different from what is described in
the above paper. Here, we average ppref@k values across ALL the positions where
a *preferred* document occurs, that is, a document that has ever been preferred
to any other document for this query. Note that these are exactly the positions
where rpref@k could change, although it does not necessarily do so.
We also include "positions" beyond the end of the rank list, in order to account
for preferred documents that were not retrieved by the system. This is in
keeping with standard MAP calculations, which average over all judged relevant
documents whether or not they were retrieved.
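A rough sketch of that averaging (not the tool's actual code; here
preferred_docs is assumed to be the set of documents ever preferred for the
query, and ppref_at(k) a helper returning ppref@k for the ranked list):

    def appref(ranked_docs, preferred_docs, ppref_at):
        # ppref@k at every rank position (1-based) that holds a preferred document.
        values = [ppref_at(k) for k, doc in enumerate(ranked_docs, start=1)
                  if doc in preferred_docs]
        # Preferred documents that were never retrieved contribute 0, mirroring
        # how MAP averages over all judged relevant documents.
        values += [0.0] * len(preferred_docs - set(ranked_docs))
        return sum(values) / len(values) if values else 0.0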
pairwise_pref_eval.py [-q] [-v] [-i] PREFERENCES_FILE RESULTS_FILE
Evaluates the results against the provided preferences.
PREFERENCES_FILE should be in a 4-column format:
    QID SOURCE_DOC TARGET_DOC PREFERENCE
where
    QID == query ID
    SOURCE_DOC, TARGET_DOC are document identifiers
    PREFERENCE (-2, -1, 0, 1, 2) indicates the preference value:
        -2 -- indicates SOURCE_DOC is BAD; TARGET_DOC should be "NA"
        -1 -- indicates SOURCE_DOC is preferred to TARGET_DOC
         0 -- indicates SOURCE_DOC and TARGET_DOC are duplicates
         1 -- indicates TARGET_DOC is preferred to SOURCE_DOC
         2 -- indicates TARGET_DOC is BAD; SOURCE_DOC should be "NA"
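For example, a preferences file stating that document d1 is preferred to d2 for
query 101, and that d3 is a BAD document, would contain lines like the
following (the query and document IDs are made up for illustration):

    101 d1 d2 -1
    101 d3 NA -2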
RESULTS_FILE is in the standard 5-column TREC format:
    QID "Q0" DOC RANK SCORE [COMMENT+]
Note: the "Q0", RANK & COMMENT columns are ignored.
-q -- produces per-query output
-v -- produces (very, very) verbose output
-i -- does not assume preferences are transitive (transitivity is assumed by
      default)
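For example, to evaluate the bundled sample run with per-query output (paths
assume the command is run from the directory containing test_data/):

    pairwise_pref_eval.py -q test_data/preferences.txt test_data/results.txt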
test_eval.sh
    Runs the evaluation script above on sample data, comparing the output to
    the expected output.

test_data/
    small_results, small_preferences, small_expected_results.txt
        - toy examples & expected results
    results.txt, preferences.txt, expected_results.txt
        - more complicated examples & expected results