Data format of the Alto AMR pre-processor and aligner

jgroschwitz edited this page Jun 26, 2018 · 1 revision

The preprocessing pipeline described in [Link Text](Link URL) produces several versions of Alto corpora along the way. The most relevant are finalAlto.corpus and namesDatesNumbers_AlsFixed_sorted.corpus; the other corpora are only by-products of the process.

finalAlto.corpus

This corpus contains three entries per sentence: first the sentence itself, then a constituent syntax tree created by the Stanford Parser, and then the AMR. The AMR preserves the tree and reentrancy structure (i.e. the linearization) of the original AMR corpus, but gives each node an explicit node label. For example, it turns the substructure

#!
(c3 / country :wiki "United_States" :name (n2 / name :op1 "United" :op2 "States"))

into

#!
(c / country :wiki (explicitanon1 / "United_States") :name (n / name :op1 (explicitanon2 / "United") :op2 (explicitanon3 / "States")))

Note that the current smatch code does not treat these two graphs as equivalent; their smatch score is less than 1.0.

The following script heuristically removes most of these additional annotations.

#!bash

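# $1 is the corpus file to clean; the result is written to stdout.
# Stage 1: (u_N / 5) becomes 5.
sed -E 's/\(u_[0-9]+ \/ ([-+0-9]+)\)/\1/g' "$1" |
  # Stage 2: (explicitanonN / foo) becomes "foo".
  sed -E 's/\(explicitanon[0-9]+ \/ ([^"()]+)\)/"\1"/g' |
  # Stage 3: (explicitanonN / "foo") becomes "foo".
  sed -E 's/\(explicitanon[0-9]+ \/ ("[^"]+")\)/\1/g' |
  # Stage 4: quoted numbers such as "5" lose their quotes.
  sed -E 's/"([-+0-9]+)"/\1/g'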

Also, all graphs have an explicit <root> marker at their root.
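
For use inside a Python workflow, the four sed substitutions above can be mirrored with the standard `re` module. This is only a sketch that applies the same regular expressions in the same order; the function name `strip_explicit_labels` is an illustrative choice and not part of Alto, and the result is just as heuristic as the sed version.

```python
import re

# The four sed expressions from the script above, applied in the same order.
_SUBSTITUTIONS = [
    # (u_12 / 5) -> 5
    (re.compile(r'\(u_[0-9]+ / ([-+0-9]+)\)'), r'\1'),
    # (explicitanon3 / foo) -> "foo"
    (re.compile(r'\(explicitanon[0-9]+ / ([^"()]+)\)'), r'"\1"'),
    # (explicitanon3 / "foo") -> "foo"
    (re.compile(r'\(explicitanon[0-9]+ / ("[^"]+")\)'), r'\1'),
    # "5" -> 5
    (re.compile(r'"([-+0-9]+)"'), r'\1'),
]


def strip_explicit_labels(amr_line: str) -> str:
    """Heuristically remove the explicit labels from one line of AMR text.

    Hypothetical helper (not part of Alto); intended to be called line by
    line, as sed would process the file.
    """
    for pattern, replacement in _SUBSTITUTIONS:
        amr_line = pattern.sub(replacement, amr_line)
    return amr_line


if __name__ == "__main__":
    example = ('(c / country :wiki (explicitanon1 / "United_States") '
               ':name (n / name :op1 (explicitanon2 / "United") '
               ':op2 (explicitanon3 / "States")))')
    print(strip_explicit_labels(example))
```

Applied to the converted country example from this section, this yields `(c / country :wiki "United_States" :name (n / name :op1 "United" :op2 "States"))`. The node names (c, n) remain those of finalAlto.corpus, not those of the original AMR corpus.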

Alignments: finalAlto.align

The file finalAlto.align contains our heuristic alignments, one line per sentence in the corpus (note that some of the nodes may still be unaligned at this point). The format of an alignment is n1|n2|...||span||weight: first the aligned nodes n1, n2, ..., separated by | (lexical nodes are followed by an exclamation point !), then, after a ||, the aligned span, and finally, after another ||, the alignment weight. The weight is mostly for aligner-internal use and can be safely ignored.
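
As an illustration of the n1|n2|...||span||weight format, the following Python sketch splits a single alignment into its parts. Several things here are assumptions rather than documented facts: the names `Alignment`, `parse_alignment` and `parse_alignment_line` are made up for this example, the alignments on one line are assumed to be separated by whitespace, and the span and weight are kept as raw strings because their exact notation is not specified on this page.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """One alignment; hypothetical container, not an Alto class."""
    nodes: List[Tuple[str, bool]]  # (node name, is_lexical)
    span: str                      # raw span string, left uninterpreted
    weight: str                    # raw weight string, can be ignored


def parse_alignment(text: str) -> Alignment:
    """Parse a single 'n1|n2|...||span||weight' string."""
    parts = text.split("||")
    if len(parts) != 3:
        raise ValueError(f"expected nodes||span||weight, got: {text!r}")
    node_field, span, weight = parts
    nodes = []
    for name in node_field.split("|"):
        # Lexical nodes are marked with a trailing exclamation point.
        is_lexical = name.endswith("!")
        nodes.append((name[:-1] if is_lexical else name, is_lexical))
    return Alignment(nodes, span, weight)


def parse_alignment_line(line: str) -> List[Alignment]:
    """Parse all alignments of one sentence.

    Assumption: alignments on a line are separated by whitespace.
    """
    return [parse_alignment(token) for token in line.split()]


if __name__ == "__main__":
    # The span notation "2-3" is a made-up placeholder for illustration.
    print(parse_alignment("n1|n2!||2-3||1.0"))
```

Keeping the span as an opaque string avoids guessing its notation; a caller who knows the actual span format can parse that field separately.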

Some of the node names were added during corpus conversion and do not exist in the original AMR corpus. However, since finalAlto.corpus preserves linearization, this alignment format may be convertible to e.g. the JAMR format where nodes are specified by their address in the linearized AMR.

The file finalAlto.palign is similar, but contains all candidate alignments (i.e. the alignments may overlap). This file is unused in the pipeline for our ACL 2018 paper.

namesDatesNumbers_AlsFixed_sorted.corpus

This corpus exists only in the train and nnDev folders. It differs from finalAlto.corpus in several ways. First and foremost, there are additional entries per sentence here, for a total of 11. These 11 "interpretations" are:

  • string: the original sentence (also in finalAlto.corpus)
  • tree: the Stanford parse tree (also in finalAlto.corpus)
  • graph: the AMR (with changes, see below)
  • alignment: the corresponding alignment of the finalAlto.align file, now with all nodes aligned.
  • alignmentp: the corresponding alignment of the finalAlto.palign file, now with all nodes aligned.
  • Each of the five interpretations above also has a version with the prefix rep (e.g. repstring, repgraph), in which names, dates and numbers have been replaced with special tokens.
  • spanmap: For each token in repstring, this gives the corresponding span in string.

Differences in the AMR (both for graph and repgraph) are:

  1. The AMRs in this corpus no longer necessarily have the same linearization as in the original corpus.
  2. Reentrant edges that the AM algebra cannot handle properly have been removed (this does not apply to the baseline versions).

Further, the entries in this corpus are sorted by AMR size.

evalInput.corpus for dev and test set

Documentation to come

Neural Network input

This section describes the files in the nnData subfolder of train and nnDev. One central file is the tags.txt file. For each token in repstring of the namesDatesNumbers_AlsFixed_sorted.corpus file, this contains the supertag obtained from our heuristic method for getting AM dependency trees. In other words, all tags together describe a split of the original graph into parts. However, the node names do not correspond to the original node names anymore, but rather are changed such that all isomorphic supertags have the same literal string.

Further, since the tags are separated by whitespace, all whitespace inside each tag has been replaced with __ALTO_WS__, which makes the file hard to read directly. After restoring the original whitespace, a tag may look like this:

#!
(q<root> / --LEX-- :ARG0 (i<s>))--TYPE--(s())

Here, --LEX-- is a placeholder for the label of the lexicalized node of this tag, and --TYPE-- separates the s-graph (in original AMR notation, with added sources in angle brackets) from its type (i.e. source annotations). The pair of s-graph and type then forms an as-graph.
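
To make the tag encoding concrete, here is a minimal Python sketch that restores the whitespace in a tag and splits it at --TYPE-- into s-graph and type. The function names are hypothetical, and the sketch assumes that tags.txt holds the tags of one sentence per line; that layout is not stated on this page. A tag without a --TYPE-- separator is returned with `None` as its type instead of being rejected.

```python
from typing import List, Optional, Tuple

WS_TOKEN = "__ALTO_WS__"       # stands in for whitespace inside a tag
TYPE_SEPARATOR = "--TYPE--"    # separates the s-graph from its type


def decode_tag(raw_tag: str) -> Tuple[str, Optional[str]]:
    """Restore whitespace and split one tag into (s-graph, type).

    Hypothetical helper; the type is None if the separator is absent.
    """
    restored = raw_tag.replace(WS_TOKEN, " ")
    graph, separator, graph_type = restored.partition(TYPE_SEPARATOR)
    return graph, (graph_type if separator else None)


def decode_tag_line(line: str) -> List[Tuple[str, Optional[str]]]:
    """Decode all whitespace-separated tags on one line.

    Assumption: each line of tags.txt holds the tags of one sentence.
    """
    return [decode_tag(token) for token in line.split()]


if __name__ == "__main__":
    raw = ("(q<root>__ALTO_WS__/__ALTO_WS__--LEX--__ALTO_WS__"
           ":ARG0__ALTO_WS__(i<s>))--TYPE--(s())")
    print(decode_tag(raw))
```

For the example tag from this section, `decode_tag` returns the s-graph `(q<root> / --LEX-- :ARG0 (i<s>))` and the type `(s())`.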

This documentation is a work in progress and will be expanded soon.