VistaExtractionTop

JoaoSilva edited this page Jul 12, 2011 · 7 revisions

Propbank/Treebank extraction

The lkb2standard tool extracts the constituency representation of the parse tree.

Getting and running the tool

...

Additional comments

The tool needs to find a variety of data. The relevant paths are defined in the configuration file. These paths are closely tied to the way our repository is organized, but they can be redefined/overridden in the configuration file. The VERSION number (provided through the command line above) is used when piecing together the path in order to select specific versions of the dataset.

The most important fields in the configuration file are:

ADJUDICATION
The directory that contains the test suite subdirectories which, in turn, store the .gz files (one per sentence) with the exported data.

CINTIL
The directory in the repository that contains a subdirectory for each test suite. Each of these subdirectories contains, among many other files, the items.txt file, which stores an annotated version of all the sentences in that test suite.

LEXICON
The file containing the lexicon of the grammar (.tdl format).

The SSTREEB and SUITEGZ fields are Java regular expressions that can be used to run the tool over just a subset of test suites and .gz files.
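
As a rough illustration of how a Java regular expression can select a subset of names, here is a minimal sketch. The class name, the method name, the sample names and the patterns are all hypothetical, and the sketch assumes the pattern must match the whole name; the tool itself may apply the expressions differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of lkb2standard.
public class SuiteFilter {

    // Keep only the names that the regular expression matches in full
    // (assumption: whole-name matching, as done by Matcher.matches()).
    static List<String> select(String regex, List<String> names) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up test suite and file names, for illustration only.
        List<String> suites = Arrays.asList("suite-a", "suite-b", "other");
        List<String> files = Arrays.asList("1.gz", "2.gz", "10.gz", "notes.txt");

        System.out.println(select("suite-.*", suites));   // prints [suite-a, suite-b]
        System.out.println(select("[0-9]\\.gz", files));  // prints [1.gz, 2.gz]
    }
}
```

Note that in a Java regular expression an unescaped dot matches any character, so a literal ".gz" extension is written as `\.gz` in the pattern (and as `"\\.gz"` inside a Java string literal).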

Further additional comments

Though I'd love it if this tool was found to be immediately useful to everyone, I'm aware that, as development progressed and the tool became more complex, it also became more dependent on specific details of our grammar (e.g. particular derivation rules and category names). I think that this is inevitable, but perhaps the tool can still be of use for those wanting to tackle a similar task.

More specifically, things may go wrong when applying this tool directly to data from other grammars, since it has some hard-coded references to:

  • In exported files: names of grammar rules and tags of the tag set specific to our grammar
  • In the lexicon: specific lexical types of our grammar
  • In annotated sentences: references to specific Portuguese lexical items and to specifics of our grammar's internal input format

Extra further additional comments

To review the design options behind this constituency view, you may find the following handbook useful:

To see examples of the constituency representation extracted with this tool, and what a treebank extracted with its help looks like, you may want to check the online search service over our CINTIL-Treebank, which is available at http://lxcenter.di.fc.ul.pt/services/en/LXServicesSearcher.html.