Data format of the Alto AMR preprocessor and aligner
The preprocessing pipeline described in [Link Text](Link URL) produces several versions of Alto corpora along the way. The most relevant are finalAlto.corpus and namesDatesNumbers_AlsFixed_sorted.corpus; the other corpora are only by-products of the process.
finalAlto.corpus contains three entries per sentence: first the sentence itself, then a constituent syntax tree created by the Stanford Parser, and then the AMR. The AMR preserves the tree and reentrancy structure (i.e. the linearization) of the original AMR corpus, but gives each node an explicit node label. For example, it turns the substructure
#!
(c3 / country :wiki "United_States" :name (n2 / name :op1 "United" :op2 "States"))
into
#!
(c / country :wiki (explicitanon1 / "United_States") :name (n / name :op1 (explicitanon2 / "United") :op2 (explicitanon3 / "States")))
Note that the current smatch code does not treat these two graphs as equivalent; their smatch score is less than 1.0.
The following script heuristically removes most of these additional annotations.
#!bash
sed -E 's/\(u_[0-9]+ \/ ([-+0-9]+)\)/\1/g' "$1" | sed -E 's/\(explicitanon[0-9]+ \/ ([^"()]+)\)/"\1"/g' | sed -E 's/\(explicitanon[0-9]+ \/ ("[^"]+")\)/\1/g' | sed -E 's/"([-+0-9]+)"/\1/g'
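For use inside a Python pipeline, the same four substitutions can be ported one-to-one. The following is a minimal sketch, not part of the Alto code base; the function name `strip_explicit_labels` is hypothetical, and it processes one line at a time, as sed does.

```python
import re

# The four sed substitutions from the script above, in the same order.
_SUBSTITUTIONS = [
    # (u_12 / 5)                  ->  5
    (re.compile(r'\(u_[0-9]+ / ([-+0-9]+)\)'), r'\1'),
    # (explicitanon1 / foo)       ->  "foo"
    (re.compile(r'\(explicitanon[0-9]+ / ([^"()]+)\)'), r'"\1"'),
    # (explicitanon1 / "foo")     ->  "foo"
    (re.compile(r'\(explicitanon[0-9]+ / ("[^"]+")\)'), r'\1'),
    # "123"                       ->  123
    (re.compile(r'"([-+0-9]+)"'), r'\1'),
]


def strip_explicit_labels(line: str) -> str:
    """Heuristically remove the explicit node labels from one line of AMR text."""
    for pattern, replacement in _SUBSTITUTIONS:
        line = pattern.sub(replacement, line)
    return line


example = ('(c / country :wiki (explicitanon1 / "United_States") '
           ':name (n / name :op1 (explicitanon2 / "United") '
           ':op2 (explicitanon3 / "States")))')
print(strip_explicit_labels(example))
# (c / country :wiki "United_States" :name (n / name :op1 "United" :op2 "States"))
```

Like the sed version, this is a heuristic: it works on the text of the graph and does not parse the AMR.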
Also, all graphs have an explicit <root> marker at their root.
The file finalAlto.align contains our heuristic alignments (note that some of the nodes may still be unaligned at this point), one line per sentence in the corpus. The format of an alignment is n1|n2|...||span||weight, where n1, n2, ... are the aligned nodes separated by | (lexical nodes are followed by an exclamation point !), then after a || comes the aligned span, and finally, after another ||, the alignment weight (the weight is mostly for aligner-internal use and can be safely ignored).
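The n1|n2|...||span||weight layout can be split with plain string operations. The sketch below parses a single alignment; the names `Alignment` and `parse_alignment` are hypothetical, and the span text `3-4` in the example is made up for illustration. The span and weight are kept as raw strings because their inner formats are not specified here, and how several alignments are separated within one line is not covered.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    nodes: List[str]    # all aligned node names, with any trailing "!" removed
    lexical: List[str]  # the subset of nodes that carried the "!" marker
    span: str           # raw span text (inner format not interpreted)
    weight: str         # raw weight text (safe to ignore)


def parse_alignment(text: str) -> Alignment:
    """Split one alignment of the form n1|n2|...||span||weight."""
    node_part, span, weight = text.strip().split("||")
    nodes, lexical = [], []
    for token in node_part.split("|"):
        if token.endswith("!"):
            token = token[:-1]
            lexical.append(token)
        nodes.append(token)
    return Alignment(nodes, lexical, span, weight)


# Illustrative input only; node names and span are invented.
print(parse_alignment("n1!|n2||3-4||1.0"))
# Alignment(nodes=['n1', 'n2'], lexical=['n1'], span='3-4', weight='1.0')
```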
Some of the node names were added during corpus conversion and do not exist in the original AMR corpus. However, since finalAlto.corpus preserves linearization, this alignment format may be convertible to e.g. the JAMR format where nodes are specified by their address in the linearized AMR.
The file finalAlto.palign is similar, but contains all candidate alignments (i.e. they overlap). This file is unused in the pipeline for our ACL 2018 paper.
namesDatesNumbers_AlsFixed_sorted.corpus exists only in the train and nnDev folders. It differs from finalAlto.corpus in several ways. First and foremost, there are additional entries per sentence here, for a total of 11. These 11 "interpretations" are:
- string: the original sentence (also in finalAlto.corpus)
- tree: the Stanford parse tree (also in finalAlto.corpus)
- graph: the AMR (with changes, see below)
- alignment: the corresponding alignment of the finalAlto.align file, now with all nodes aligned.
- alignmentp: the corresponding alignment of the finalAlto.palign file, now with all nodes aligned.
- all above have a version with prefix rep, where names, dates and numbers have been replaced with special tokens.
- spanmap: For each token in repstring, this gives the corresponding span in string.
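The count of 11 follows from five base interpretations, their five rep-prefixed counterparts, and spanmap. As a small sketch: repstring and repgraph are named in this document, while reptree, repalignment and repalignmentp are inferred here from the "prefix rep" rule and should be checked against the corpus header.

```python
# Base interpretations listed above.
BASE = ["string", "tree", "graph", "alignment", "alignmentp"]

# Each base interpretation has a "rep" version; spanmap links the two strings.
# The rep names are derived from the prefix rule (an assumption for some of them).
INTERPRETATIONS = BASE + ["rep" + name for name in BASE] + ["spanmap"]

print(len(INTERPRETATIONS))  # 11
```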
Differences in the AMR (both for graph and repgraph) are:
- The AMRs in this corpus no longer necessarily have the same linearization as in the original corpus.
- Reentrant edges that the AM algebra cannot handle properly have been removed (this does not apply to the baseline versions).
Further, the entries in this corpus are sorted by AMR size.
Documentation to come
This section describes the files in the nnData subfolder of train and nnDev. One central file is the tags.txt file. For each token in repstring of the namesDatesNumbers_AlsFixed_sorted.corpus file, this contains the supertag obtained from our heuristic method for getting AM dependency trees. In other words, all tags together describe a split of the original graph into parts. However, the node names do not correspond to the original node names anymore, but rather are changed such that all isomorphic supertags have the same literal string.
Further, since the tags are separated by whitespace, all whitespace inside each tag has been replaced with __ALTO_WS__, which makes the file hard to read directly. After recovering the original whitespace, a tag may look like this:
#!
(q<root> / --LEX-- :ARG0 (i<s>))--TYPE--(s())
Here, --LEX-- is a placeholder for the label of the lexicalized node of this tag, and --TYPE-- separates the s-graph (in original AMR notation, with added sources in angle brackets) from its type (i.e. source annotations). The pair of s-graph and type then forms an as-graph.
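To make an entry of tags.txt readable again, the whitespace marker can be replaced and the result split at --TYPE--. This is a minimal sketch with a hypothetical function name `decode_tag`; it assumes each __ALTO_WS__ stands for a single space, and it returns None as the type if a tag contains no --TYPE-- separator (whether such tags occur is not stated here).

```python
from typing import Optional, Tuple

WS_MARKER = "__ALTO_WS__"
TYPE_SEPARATOR = "--TYPE--"


def decode_tag(raw: str) -> Tuple[str, Optional[str]]:
    """Return (s-graph, type) for one whitespace-encoded tag from tags.txt."""
    # Assumption: every marker replaced exactly one space character.
    readable = raw.replace(WS_MARKER, " ")
    graph, separator, graph_type = readable.partition(TYPE_SEPARATOR)
    return graph, (graph_type if separator else None)


raw_tag = ("(q<root>__ALTO_WS__/__ALTO_WS__--LEX--"
           "__ALTO_WS__:ARG0__ALTO_WS__(i<s>))--TYPE--(s())")
print(decode_tag(raw_tag))
# ('(q<root> / --LEX-- :ARG0 (i<s>))', '(s())')
```

Since the tags in a line are separated by whitespace, a whole line can be decoded with `[decode_tag(t) for t in line.split()]`.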
This documentation is a work in progress and will be expanded soon.