VistaExtractionTop
The lkb2standard tool extracts the constituency representation of the parse tree.
...
The tool needs to find a variety of data. The relevant paths are defined in the configuration file. These paths are closely tied to the way our repository is organized, but they can be redefined/overridden in the configuration file. The VERSION number (provided through the command line above) is used when piecing together the path in order to select specific versions of the dataset.
The most important fields in the configuration file are:
ADJUDICATION
The directory that contains the test suite subdirectories which, in turn,
store the .gz files (one per sentence) with the exported data.
CINTIL
In the repository there is a directory that contains a subdirectory for
each test suite. Each of these subdirectories contains, among many other
files, the items.txt file, which stores an annotated version of all the
sentences in that test suite.
LEXICON
The file containing the lexicon of the grammar (.tdl format).
The SSTREEB and SUITEGZ fields are Java regular expressions that can be used to run the tool over just a subset of test suites and .gz files.
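As a rough illustration of how such regular expressions can narrow down a run, here is a minimal sketch. It is not the tool's actual code: the class name `SuiteFilter`, the helper `select`, and all sample suite and file names are hypothetical, and it assumes that a name is selected only when the whole name matches the expression (the tool may instead accept partial matches).

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of lkb2standard.
class SuiteFilter {

    // Keep only the names that the regular expression matches in full.
    // Assumption: whole-name matching (Matcher.matches), not substring search.
    static List<String> select(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return names.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up values standing in for the SSTREEB and SUITEGZ fields.
        String sstreeb = "suite0[12]";
        String suitegz = "1\\d*\\.gz";

        // Made-up directory and file names.
        List<String> suites = List.of("suite01", "suite02", "suite10");
        List<String> files = List.of("1.gz", "12.gz", "25.gz", "notes.txt");

        System.out.println(select(suites, sstreeb));
        System.out.println(select(files, suitegz));
    }
}
```

Under these assumptions, only `suite01` and `suite02` would be processed, and within them only `1.gz` and `12.gz`. Note that a literal dot has to be escaped (`\\.` in Java source), since an unescaped `.` matches any character.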
Though I'd love it if this tool was found to be immediately useful to everyone, I'm aware that, as development progressed and the tool became more complex, it also became more dependent on specific details of our grammar (e.g. particular derivation rules and category names). I think that this is inevitable, but perhaps the tool can still be of use for those wanting to tackle a similar task.
More specifically, things may go wrong when this tool is applied directly to data from other grammars, because it has hard-coded references to:
- In exported files: names of grammar rules and tags from the tag set specific to our grammar
- In the lexicon: lexical types specific to our grammar
- In annotated sentences: specific Portuguese lexical items and particularities of our grammar's internal input format
For the design options behind this constituency view, you may find the following handbook useful:
- Branco, António, João Silva, Francisco Costa and Sérgio Castro, 2011, [http://semanticshare.di.fc.ul.pt/publications/BrancoSilvaCostaCastro2011.pdf CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency], University of Lisbon, Faculty of Sciences, Department of Informatics.
To see examples of the constituency representation extracted with this tool, and what a treebank extracted with its help looks like, you may want to check the online search service over our CINTIL-Treebank, which is found [http://lxcenter.di.fc.ul.pt/services/en/LXServicesSearcher.html here].