arnsholt/syn-agreement
This package requires one library in addition to the files shipped with the
program: Tim Henderson's Zhang-Shasha library, available at
https://github.com/timtadh/zhang-shasha. It is known to work with revision
138c991, and should work with versions after commit 7c910cc. This includes
the most recent version on PyPI at the time of writing, version 1.1.

SYNOPSIS:
    syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] fileA fileB
    syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] --dirs dir...

DESCRIPTION:
    --tree
        Read phrase structure trees instead of dependency trees. The phrase
        structure format is slightly idiosyncratic. See the section "Phrase
        structure format" below for details.

    --conll
        Read CoNLL formatted dependency trees. This is the default.

    --acc
        Compute uncorrected accuracies in addition to the alpha score. For
        dependency trees, UAS, LAS and label accuracy are computed; for
        phrase structure trees, Jaccard similarity is computed.

    --metric=plain|diff|norm|all
        Select the metric to use, or compute all metrics at the same time.
        The default metric is the plain metric.

        NOTE: For any use beyond the reproduction of the results presented
        in Skjærholt (2014) we discourage the use of any metric other than
        α_plain.

    --dirs
        Enable multi-annotator mode. In cases where there are more than two
        annotators, it is common that not all annotators have annotated all
        of the texts. Therefore, we use a mode of operation where each
        annotator's output is in a separate directory. Sentences from files
        with matching names will be grouped together to account for missing
        annotations.

        The file and directory structure must follow this convention: the
        basename of the directory path is taken to be the "name" of the
        annotator, and the files within must be named $prefix-$name.conll
        (or $prefix-$name.tree for constituency trees). Files with the same
        prefix in different directories are assumed to contain *exactly*
        the same sentences.
        If the --acc option is also passed together with --dirs, pairwise
        accuracies are computed.

WEIRD RESULTS ON SMALL DATA SETS
During initial testing to make sure everything is working, it's common to run
the tool on very small data sets. If the data set is extremely small (more
precisely, a single sentence), the tool will return results that are correct
but counter-intuitive.

First, note that alpha is defined as 1 - Do/De. The observed distance Do is
the mean distance between all pairs of annotations of the same sentence: for
each sentence, compute the mean distance between its annotations; Do is the
mean of these means. The expected distance De is the mean distance between
all possible pairs of annotations in the data set.

Now, if the data set consists of a set of annotations for a single sentence,
where at least one annotation differs from the others, alpha will be 0. This
is because the set of pairs within sentences and the set of all possible
pairs are identical, which means that Do=De, and thus Do/De=1 and alpha=0.

If the data set is a set of annotations for a single sentence and all the
annotations are identical (because the tool is passed the same
single-sentence file as fileA and fileB, for example), the program will
terminate with a ZeroDivisionError. This is because all the trees in the data
are identical, which yields Do=De=0, so that Do/De, and with it alpha, is
undefined.

PHRASE STRUCTURE FORMAT:
Assume we have the following tree for the sentence "I saw the dog":

       S
       ^
      / \
     /   VP
    |     ^
    |    / \
    |   /   NP
    NP |     ^
    |  |    / \
    P  V   D   N
    |  |   |   |
    I saw the dog

The program then expects the tree to be stored *delexicalised* as follows:

    (S (NP P) (VP V (NP D N)))

BUGS
Probably. If you find any, please create an issue in the GitHub repository at
<https://github.com/arnsholt/syn-agreement/issues> or contact the author by
email.
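To make the single-sentence behaviour described under WEIRD RESULTS ON SMALL
DATA SETS concrete, here is a minimal Python sketch of the alpha definition
given there. It is not the code in alpha.py: the names `alpha`, `mean` and
`toy_distance` are hypothetical, and the 0/1 distance is an assumed stand-in
for the tree distance the tool actually uses.

```python
from itertools import combinations


def toy_distance(a, b):
    """Assumed stand-in for a tree distance: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def alpha(sentences, distance=toy_distance):
    """Compute 1 - Do/De as described in the text.

    `sentences` is a list of lists; each inner list holds all annotations
    of one sentence (any comparable objects will do for this sketch).
    """
    # Do: for each sentence, the mean distance between its annotations;
    # then the mean of those per-sentence means.
    d_o = mean(
        mean(distance(a, b) for a, b in combinations(annotations, 2))
        for annotations in sentences
    )
    # De: mean distance over all possible pairs of annotations in the data.
    pooled = [ann for annotations in sentences for ann in annotations]
    d_e = mean(distance(a, b) for a, b in combinations(pooled, 2))
    return 1 - d_o / d_e


# One sentence, one annotation differs: both pair sets coincide, so alpha is 0.
print(alpha([["t1", "t2", "t1"]]))  # 0.0

# One sentence, all annotations identical: De is 0 and alpha is undefined.
try:
    alpha([["t1", "t1"]])
except ZeroDivisionError:
    print("alpha undefined: De = 0")
```

With two or more sentences the pooled pairs include cross-sentence pairs, so
Do and De generally differ and the degenerate cases above no longer apply.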
AUTHOR
Arne Skjærholt <[email protected]>

Also, many thanks to Andreas Peldszus for invaluable help with finding and
debugging issues before the initial release of the code.

LICENCES:
The files syn-agreement.py and conll.py are (c) 2014 Arne Skjærholt and
released under the GNU GPL version 2 or later:
<http://gnu.org/licenses/gpl.html>

The code in alpha.py is (c) 2011-2014 Thomas Grill and released under the
Creative Commons Attribution-ShareAlike licence:
<http://creativecommons.org/licenses/by-sa/3.0/>

The data from the Norwegian Dependency Treebank in data/ndt/ is free for all
uses, as long as it is not published as running, human-readable text.

The data from the Copenhagen Dependency Treebanks in data/cdt/ is licenced
under the GNU GPL version 2: <http://gnu.org/licenses/gpl.html>

The SSD dataset in data/ssd/ is released under the MIT licence.
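As an illustration of the file naming convention that --dirs expects (see
DESCRIPTION), the following Python sketch groups files from several annotator
directories by their shared prefix. It is only a sketch under stated
assumptions, not the tool's own logic: the function `group_by_prefix`, the
example paths, and the handling of a trailing slash are all hypothetical.

```python
import os
from collections import defaultdict


def group_by_prefix(listing, ext=".conll"):
    """Group files named $prefix-$name<ext> by prefix.

    `listing` maps an annotator directory path to the file names it
    contains. The annotator name is taken to be the directory's basename.
    Returns {prefix: {annotator_name: file_path}}.
    """
    groups = defaultdict(dict)
    for directory, filenames in listing.items():
        # Assumption: a trailing slash on the directory path is ignored.
        name = os.path.basename(os.path.normpath(directory))
        suffix = "-" + name + ext
        for filename in filenames:
            # Files that do not follow the convention are skipped here.
            if filename.endswith(suffix):
                prefix = filename[: -len(suffix)]
                groups[prefix][name] = os.path.join(directory, filename)
    return dict(groups)


# Hypothetical layout: bob has not annotated doc2.
listing = {
    "annotations/alice/": ["doc1-alice.conll", "doc2-alice.conll"],
    "annotations/bob": ["doc1-bob.conll", "notes.txt"],
}
for prefix, files in sorted(group_by_prefix(listing).items()):
    print(prefix, sorted(files))
```

Use ext=".tree" for constituency trees. A prefix that appears in only one
directory corresponds to a text with a single annotation, which is the
missing-annotation situation the multi-annotator mode is meant to handle.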
Compute alpha-agreement for syntactic data