This package requires one more library, in addition to the files shipped with
the program: Tim Henderson's Zhang-Shasha library, available at
https://github.com/timtadh/zhang-shasha. It is known to work with revision
138c991, and should work with versions after commit 7c910cc. This includes the
most recent version on PyPI at the time of writing, version 1.1.
SYNOPSIS:
syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] fileA fileB
syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] --dirs dir...
DESCRIPTION:
--tree
Read phrase structure trees instead of dependency trees. The phrase
structure format is slightly idiosyncratic. See the section "Phrase
structure format" below for details.
--conll
Read CoNLL formatted dependency trees. This is the default.
--acc
Compute uncorrected accuracies in addition to the alpha score. For
dependency trees, UAS, LAS and label accuracy are computed, and for
phrase structure trees Jaccard similarity is computed.
--metric=plain|diff|norm|all
Select the metric to use, or compute all metrics at the same time. The
default metric is the plain metric.
NOTE: For any use beyond the reproduction of the results presented in
Skjærholt (2014) we discourage the use of any metric other than
α_plain.
--dirs
Enable multi-annotator mode. In cases where there are more than two
annotators, it is common that not all annotators have annotated all of
the texts. Therefore, we use a mode of operation where each
annotator's output is in a separate directory. Sentences from files
with matching names will be grouped together to account for missing
annotations.
The file and directory structure must follow this convention: the
basename of each directory path is taken to be the "name" of the
annotator, and the files within must be named
$prefix-$name.conll (or $prefix-$name.tree for constituency trees).
Files with the same prefix in different directories are assumed
to contain *exactly* the same sentences.
If the --acc option is also passed, pairwise accuracies are computed.
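As an illustration of the naming convention, the following Python sketch
groups the files found in several annotator directories by their shared
prefix. It is not the tool's own code: the function name group_by_prefix
and its extension parameter are hypothetical, and only the
$prefix-$name.conll pattern is taken from this README.

```python
from collections import defaultdict
from pathlib import Path


def group_by_prefix(dirs, extension="conll"):
    """Map each file prefix to a dict of {annotator name: file path}.

    Hypothetical helper; it only mirrors the naming convention
    described in this README.
    """
    groups = defaultdict(dict)
    for d in dirs:
        directory = Path(d)
        name = directory.name  # basename of the directory = annotator name
        suffix = f"-{name}.{extension}"
        for path in sorted(directory.glob(f"*{suffix}")):
            prefix = path.name[: -len(suffix)]
            groups[prefix][name] = path
    return dict(groups)
```

A prefix that maps to fewer annotators than there are directories
corresponds to a text that not every annotator has annotated.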
WEIRD RESULTS ON SMALL DATA SETS
During initial testing to make sure everything is working, it's common to
run the tool on very small data sets; if the data set is extremely small
(more precisely, a single sentence), the tool will return correct results
that are nonetheless counter-intuitive.
First, we note that alpha is defined to be 1 - Do/De, where the observed
distance Do is the mean distance between all pairs of annotations for the
same sentence (that is, for all sentences compute mean distance between
annotations of the sentence; Do is the mean of these means), and De is the
mean distance between all possible pairs of annotations.
Now, if the data set being processed consists of a set of annotations for
a single sentence, where at least one annotation differs from the others,
alpha will be 0. This is because the set of pairs within sentences and the
set of all possible pairs will be identical, which in turn means that
Do=De, and thus Do/De=1 and alpha=0.
If the data set is a set of annotations for a single sentence, and all the
annotations are identical (because the tool is passed the same
single-sentence file as both fileA and fileB, for example), the program
will terminate with a ZeroDivisionError. This is because all the trees in
the data are identical, which yields De=0 and thus alpha being undefined.
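Both edge cases can be reproduced with a minimal Python sketch of the
definition alpha = 1 - Do/De. It is not the code in alpha.py: the
function names are hypothetical, and toy_distance is an assumed stand-in
for the tree distance the tool actually uses. Every sentence is assumed
to have at least two annotations.

```python
from itertools import combinations
from statistics import mean


def toy_distance(a, b):
    # Assumed stand-in for the real tree distance: the number of
    # positions that differ, plus the difference in length.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def alpha(corpus, distance=toy_distance):
    """corpus is a list of sentences; each sentence is a list of annotations."""
    # Do: mean pairwise distance within each sentence, then the mean of those means.
    observed = mean(
        mean(distance(a, b) for a, b in combinations(annotations, 2))
        for annotations in corpus
    )
    # De: mean distance over all pairs of annotations, ignoring sentence boundaries.
    pooled = [ann for annotations in corpus for ann in annotations]
    expected = mean(distance(a, b) for a, b in combinations(pooled, 2))
    return 1 - observed / expected  # raises ZeroDivisionError when De is 0
```

With a single sentence whose annotations are (0, 1, 2), (0, 1, 2) and
(0, 2, 2), the within-sentence pairs and the pooled pairs are the same
set, so the function returns 0. With a single sentence whose annotations
are all identical, De is 0 and the call raises ZeroDivisionError.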
PHRASE STRUCTURE FORMAT:
Assume we have the following tree for the sentence "I saw the dog":
       S
       ^
      / \
     /   VP
    |     ^
    |    / \
    |   /   NP
    NP |     ^
    |  |    / \
    P  V   D   N
    |  |   |   |
    I saw the dog
The program then expects the tree to be stored *delexicalised* as follows:
(S (NP P) (VP V (NP D N)))
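To make the bracketing concrete, here is a small Python sketch that reads
such a string into nested (label, children) tuples. It is inferred from
the single example above and is not the reader used by syn-agreement.py;
the function name parse_tree and the tuple representation are
assumptions.

```python
def parse_tree(text):
    """Parse a delexicalised bracketed tree into (label, children) tuples.

    A bare token such as P is kept as a plain string, since it has no
    children once the words have been removed.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # bare part-of-speech tag
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing input after tree")
    return tree
```

For the example string, the result is
("S", [("NP", ["P"]), ("VP", ["V", ("NP", ["D", "N"])])]). Unbalanced
brackets are not handled gracefully and surface as an IndexError.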
BUGS
Probably. If you find any, please create an issue in the GitHub repository
at <https://github.com/arnsholt/syn-agreement/issues> or contact the
author by email.
AUTHOR
Arne Skjærholt <[email protected]>
Also, many thanks to Andreas Peldszus for invaluable help with finding and
debugging issues before the initial release of the code.
LICENCES:
The files syn-agreement.py and conll.py are (c) 2014 Arne Skjærholt and
released under the GNU GPL version 2 or later:
<http://gnu.org/licenses/gpl.html>
The code in alpha.py is (c) 2011-2014 Thomas Grill and released under the
Creative Commons Attribution-ShareAlike licence:
<http://creativecommons.org/licenses/by-sa/3.0/>
The data from the Norwegian Dependency Treebank in data/ndt/ is free for
all uses, as long as it is not published as running, human readable
text.
The data from the Copenhagen Dependency Treebanks in data/cdt/ is licenced
under the GNU GPL version 2: <http://gnu.org/licenses/gpl.html>
The SSD dataset in data/ssd/ is released under the MIT licence.