README

This package requires one more library, in addition to the files shipped with
the program: Tim Henderson's Zhang-Shasha library, available at
https://github.com/timtadh/zhang-shasha. It is known to work with revision
138c991, and should work with versions after commit 7c910cc. This includes the
most recent version on PyPi at the time of writing, version 1.1.

SYNOPSIS:
    syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] fileA fileB
    syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] --dirs dir...

DESCRIPTION:
    --tree
        Read phrase structure trees instead of dependency trees. The phrase
        structure format is slightly idiosyncratic. See the section "Phrase
        structure format" below for details.

    --conll
        Read CoNLL formatted dependency trees. This is the default.

    --acc
        Compute uncorrected accuracies in addition to alpha score. For
        dependency trees, UAS, LAS and label accuracy is computed, and for
        phrase structure trees Jaccard similarity is computed.

    --metric=plain|diff|norm|all
        Select the metric to use, or compute all metrics at the same time. The
        default metric is the plain metric.

        NOTE: For any use beyond the reproduction of the results presented in
        Skjærholt (2014) we discourage the use of any other metric that
        α_plain.

    --dirs
        Enable multi-annotator mode. In cases where there are more than two
        annotators, it is common that not all annotators have annotated all of
        the texts. Therefore, we use a mode of operation where each
        annotator's output is in a separate directory. Sentences from files
        with matching names will be grouped together to account for missing
        annotations.

        The file and directory structure must follow the following convention:
        We assume the basename of the directory path to be the "name" of the
        annotator, and the files within to be named thusly:
        $prefix-$name.conll (or $prefix-$name.tree for constituency trees),
        where files with the same prefix in different directories are assumed
        to contain *exactly* the same sentences.

        If the --acc option is also passed, pairwise accuracies are computed.

WEIRD RESULTS ON SMALL DATA SETS
    During initial testing to make sure everything is working, it's common to
    run the tool on very small data sets; if the data set is extremely small
    (more precisely, a single sentence), the tool will return correct results
    that are nonetheless counter-intuitive.

    First, we note that alpha is defined to be 1 - Do/De, where the observed
    distance Do is the mean distance between all pairs of annotations for the
    same sentence (that is, for all sentences compute mean distance between
    annotations of the sentence; Do is the mean of these means), and De is the
    mean distance between all possible pairs of annotations.

    Now, if the data set being processed consists of a set of annotations for
    a single sentence, where at least one annotation differs from the others,
    alpha will be 0. This is because the set of pairs within sentences and the
    set of all possible pairs will be identical, which in turn means that
    Do=De, and thus Do/De=1 and alpha=0.

    If the data set is a set of annotations for a single sentence, and all the
    annotations are identical (because the tool is passed the same
    single-sentence file as corpusA and corpusB, for example), the program
    will terminate with a ZeroDivisionError. This is because all the trees in
    the data are identical, which yields De=0 and thus alpha being undefined.

PHRASE STRUCTURE FORMAT:
    Assume we have the following tree for the sentence "I saw the dog":
                       S
                       ^
                      / \
                     /   VP
                    |     ^
                    |    / \
                    |   /   NP
                   NP  |     ^
                    |  |    / \
                    P  V   D   N
                    |  |   |   |
                    I saw the dog

    The program then expects the tree to be stored *delexicalised* as follows:
        (S (NP P) (VP V (NP D N)))

BUGS
    Probably. If you find any, please create an issue in the GitHub repository
    at <https://github.com/arnsholt/syn-agreement/issues> or contact the
    author by email.

AUTHOR
    Arne Skjærholt <arnsholt@gmail.com>

    Also, many thanks to Andreas Peldszus for invaluable help with finding and
    debugging issues before the initial realease of the code.

LICENCES:
    The files syn-agreement.py and conll.py are (c) 2014 Arne Skjærholt and
    released under the GNU GPL version 2 or later:
    <http://gnu.org/licenses/gpl.html>

    The code in alpha.py is (c) 2011-2014 Thomas Grill and released under the
    Creative Commons Attribution-ShareAlike licence:
    <http://creativecommons.org/licenses/by-sa/3.0/>

    The data from the Norwegian Dependency Treebank in data/ndt/ is free for
    all uses, as long as they are not published as running, human readable
    text.

    The data from the Copenhagen Dependency Treebanks in data/cdt/ is licenced
    under the GNU GPL version 2: <http://gnu.org/licenses/gpl.html>

    The SSD dataset in data/ssd/ is released under the MIT licence.