Berkeley Neural Parser

A high-accuracy parser with models for 11 languages, implemented in Python. Based on "Constituency Parsing with a Self-Attentive Encoder" from ACL 2018, with additional changes described in "Multilingual Constituency Parsing with Self-Attention and Pre-Training".

Installation

To install the parser, run the commands:

$ pip install cython numpy
$ pip install benepar[cpu]

Cython and numpy should be installed separately prior to installing benepar. Note that pip install benepar[cpu] has a dependency on the tensorflow pip package, which is a CPU-only version of tensorflow. Use pip install benepar[gpu] to instead introduce a dependency on tensorflow-gpu. Installing a GPU-enabled version of TensorFlow will likely require additional steps; see the official TensorFlow installation instructions for details.

Benepar integrates with one of two NLP libraries for Python: NLTK or spaCy.

If using NLTK, you should install the NLTK sentence and word tokenizers:

>>> import nltk
>>> nltk.download('punkt')

If using spaCy, you should install a spaCy model for your language. For English, the installation command is:

$ python -m spacy download en

Parsing models need to be downloaded separately, using the commands:

>>> import benepar
>>> benepar.download('benepar_en2')

See the Available Models section below for a full list of models.

Usage

Usage with NLTK

>>> import benepar
>>> parser = benepar.Parser("benepar_en2")
>>> tree = parser.parse("Short cuts make long delays.")
>>> print(tree)
(S
  (NP (JJ Short) (NNS cuts))
  (VP (VBP make) (NP (JJ long) (NNS delays)))
  (. .))

Speed note: the first call to parse will take much longer than subsequent calls, as caches are being warmed up.
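A minimal sketch of how the warm-up cost could be measured. The helper name `time_calls` is hypothetical and not part of benepar; it accepts any callable, so `parser.parse` from the example above could be passed in.

```python
import time

def time_calls(parse_fn, sentence, repeats=3):
    """Return the wall-clock duration in seconds of each of `repeats` calls.

    `parse_fn` is any callable that takes one sentence, e.g. `parser.parse`.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse_fn(sentence)
        durations.append(time.perf_counter() - start)
    return durations

# Hypothetical usage with the parser created above:
#   durations = time_calls(parser.parse, "Short cuts make long delays.")
#   print("first call: %.2fs" % durations[0])
#   print("later calls: %.2fs" % min(durations[1:]))
```

In latency-sensitive code, one option is to issue a single throwaway call right after constructing the parser so that later calls run at the steady-state speed.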

The parser can also parse pre-tokenized text. For some languages (including Chinese), this is required due to the lack of a built-in tokenizer.

>>> parser.parse(['Short', 'cuts', 'make', 'long', 'delays', '.'])

Use parse_sents to parse multiple sentences. It accepts an entire document as a string, or a list of sentences.

>>> parser.parse_sents("The time for action is now. It's never too late to do something.")
>>> parser.parse_sents(["The time for action is now.", "It's never too late to do something."])
>>> parser.parse_sents([['The', 'time', 'for', 'action', 'is', 'now', '.'], ['It', "'s", 'never', 'too', 'late', 'to', 'do', 'something', '.']])

All parse trees returned are represented using nltk.Tree objects.

Usage with spaCy

Benepar also ships with a component that integrates with spaCy:

>>> import spacy
>>> from benepar.spacy_plugin import BeneparComponent
>>> nlp = spacy.load('en')
>>> nlp.add_pipe(BeneparComponent("benepar_en2"))
>>> doc = nlp(u"The time for action is now. It's never too late to do something.")
>>> sent = list(doc.sents)[0]
>>> print(sent._.parse_string)
(S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))
>>> sent._.labels
('S',)
>>> list(sent._.children)[0]
The time for action

Since spaCy does not provide an official constituency parsing API, all methods are accessible through the extension namespaces Span._ and Token._.

The following extension properties are available:

Span._.labels: a tuple of labels for the given span. A span may have multiple labels when there are unary chains in the parse tree.
Span._.parse_string: a string representation of the parse tree for a given span.
Span._.constituents: an iterator over Span objects for sub-constituents in a pre-order traversal of the parse tree.
Span._.parent: the parent Span in the parse tree.
Span._.children: an iterator over child Spans in the parse tree.
Token._.labels, Token._.parse_string, Token._.parent: these behave the same as calling the corresponding method on the length-one Span containing the token.

These methods will raise an exception when called on a span that is not a constituent in the parse tree. Such errors can be avoided by traversing the parse tree starting at either sentence level (by iterating over doc.sents) or with an individual Token object.
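To make "pre-order traversal" concrete, here is a standalone sketch that walks the bracketed string returned by parse_string above. It uses only the standard library and is not benepar's implementation: the helpers `read_tree`, `words`, and `constituents` are hypothetical names, the sketch assumes labels and words contain no whitespace or literal parentheses, and it skips part-of-speech nodes, which may differ from what Span._.constituents yields.

```python
def read_tree(parse_string):
    """Parse a bracketed tree into nested (label, children) tuples.

    Leaf words are kept as plain strings.
    """
    tokens = parse_string.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()

def words(tree):
    """Return the words covered by a node, left to right."""
    _, children = tree
    out = []
    for child in children:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

def constituents(tree):
    """Yield (label, text) for every phrase-level node, parent before children."""
    label, children = tree
    # Skip part-of-speech nodes, whose only child is a word.
    if not (len(children) == 1 and isinstance(children[0], str)):
        yield label, " ".join(words(tree))
    for child in children:
        if not isinstance(child, str):
            yield from constituents(child)

parse_string = ("(S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) "
                "(VP (VBZ is) (ADVP (RB now))) (. .))")
for label, text in constituents(read_tree(parse_string)):
    print(label, "->", text)
```

Each parent is visited before its children, so the sentence-level S comes first, followed by the subject NP and everything nested inside it, and only then the VP.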

Available Models

The following trained parser models are available:

Model	Language	Info
`benepar_en2`	English	95.17 F1 on WSJ test set, 94 MB on disk.
`benepar_en2_large`	English	95.52 F1 on WSJ test set, 274 MB on disk. This model is up to 3x slower than `benepar_en2` when running on CPU; we recommend running it on a GPU instead.
`benepar_zh`	Chinese	91.69 F1 on CTB 5.1 test set. Usage with NLTK requires tokenized sentences (untokenized raw text is not supported.) Use a package such as jieba for tokenization. Usage with spaCy first requires implementing Chinese support in spaCy. There is no official Chinese support in spaCy at the time of writing, but unofficial packages may work.
`benepar_ar`	Arabic	Usage with NLTK requires tokenized sentences (untokenized raw text is not supported.) Usage with spaCy first requires implementing Arabic support in spaCy. Accepts Unicode as input, but was trained on transliterated text (see `src/transliterate.py`); please let us know if there are any bugs.
`benepar_de`	German	Full support for NLTK and spaCy; use `python -m spacy download de` to download spaCy model for German.
`benepar_eu`	Basque	Usage with NLTK requires tokenized sentences (untokenized raw text is not supported.) Usage with spaCy first requires implementing Basque support in spaCy.
`benepar_fr`	French	Full support for NLTK and spaCy; use `python -m spacy download fr` to download spaCy model for French.
`benepar_he`	Hebrew	Usage with NLTK requires tokenized sentences (untokenized raw text is not supported.) Usage with spaCy first requires implementing Hebrew support in spaCy. Accepts Unicode as input, but was trained on transliterated text (see `src/transliterate.py`); please let us know if there are any bugs.
`benepar_hu`	Hungarian	Usage with NLTK requires tokenized sentences (untokenized raw text is not supported.) Usage with spaCy requires a Hungarian model for spaCy.
`benepar_ko`	Korean	Usage with NLTK requires tokenized sentences (untokenized raw text is not supported.) Usage with spaCy first requires implementing Korean support in spaCy.
`benepar_pl`	Polish	Full support for NLTK (including parsing from raw text.) Usage with spaCy first requires implementing Polish support in spaCy.
`benepar_sv`	Swedish	Full support for NLTK (including parsing from raw text.) Usage with spaCy first requires implementing Swedish support in spaCy.
`benepar_en`	English	No part-of-speech tagging capabilities: we recommend using `benepar_en2` instead. When using this model, part-of-speech tags will be inherited from either NLTK (requires `nltk.download('averaged_perceptron_tagger')`) or spaCy; however, we've found that our own tagger in models such as `benepar_en2` gives better results. This model was released to accompany our ACL 2018 paper, and is retained for compatibility. 95.07 F1 on WSJ test set.
`benepar_en_small`	English	No part-of-speech tagging capabilities: we recommend using `benepar_en2` instead. This model was released to accompany our ACL 2018 paper, and is retained for compatibility. A smaller model that is 3-4x faster than `benepar_en` when running on CPU because it uses a smaller version of ELMo. 94.65 F1 on WSJ test set.
`benepar_en_ensemble`	English	No part-of-speech tagging capabilities: we recommend using `benepar_en2_large` instead. This model was released to accompany our ACL 2018 paper, and is retained for compatibility. An ensemble of two parsers: one that uses the original ELMo embeddings and one that uses the 5.5B ELMo embeddings. A GPU is highly recommended for running the ensemble. 95.43 F1 on WSJ test set.

Training

The code used to train our parsing models is currently different from the code used to parse sentences in the release version described above, though both are stored in this repository. The training code uses PyTorch and can be obtained by cloning this repository from GitHub. The release version uses TensorFlow instead, because it allows serializing the parsing model into a single file on disk in a way that minimizes software dependencies and reduces file size on disk.

Software Requirements for Training

Python 3.6 or higher.
Cython 0.25.2 or any compatible version.
PyTorch 0.4.1, 1.0/1.1, or any compatible version.
EVALB. Before starting, run make inside the EVALB/ directory to compile an evalb executable. This will be called from Python for evaluation. If training on the SPMRL datasets, you will need to run make inside the EVALB_SPMRL/ directory instead.
AllenNLP 0.7.0 or any compatible version (only required when using ELMo word representations)
pytorch-pretrained-bert 0.4.0 or any compatible version (only required when using BERT word representations)

Pre-trained Models (PyTorch)

The following pre-trained parser models are available for download:

en_charlstm_dev.93.61.pt: Our best English single-system parser that does not rely on external word representations
en_elmo_dev.95.21.pt: The best English single-system parser from our ACL 2018 paper. Using this parser requires ELMo weights, which must be downloaded separately.

To use ELMo embeddings, download the following files into the data/ folder (preserving their names):

There is currently no command-line option for configuring the locations/names of the ELMo files.

Pre-trained BERT weights will be automatically downloaded as needed by the pytorch-pretrained-bert package.

Training Instructions

A new model can be trained using the command python src/main.py train .... Some of the available arguments are:

Argument	Description	Default
`--model-path-base`	Path base to use for saving models	N/A
`--evalb-dir`	Path to EVALB directory	`EVALB/`
`--train-path`	Path to training trees	`data/02-21.10way.clean`
`--dev-path`	Path to development trees	`data/22.auto.clean`
`--batch-size`	Number of examples per training update	250
`--checks-per-epoch`	Number of development evaluations per epoch	4
`--subbatch-max-tokens`	Maximum number of words to process in parallel while training (a full batch may not fit in GPU memory)	2000
`--eval-batch-size`	Number of examples to process in parallel when evaluating on the development set	100
`--numpy-seed`	NumPy random seed	Random
`--use-words`	Use learned word embeddings	Do not use word embeddings
`--use-tags`	Use predicted part-of-speech tags as input	Do not use predicted tags
`--use-chars-lstm`	Use learned CharLSTM word representations	Do not use CharLSTM
`--use-elmo`	Use pre-trained ELMo word representations	Do not use ELMo
`--use-bert`	Use pre-trained BERT word representations	Do not use BERT
`--bert-model`	Pre-trained BERT model to use if `--use-bert` is passed	`bert-base-uncased`
`--no-bert-do-lower-case`	Instructs the BERT tokenizer to retain case information (setting should match the BERT model in use)	Perform lowercasing
`--predict-tags`	Adds a part-of-speech tagging component and auxiliary loss to the parser	Do not predict tags

Additional arguments are available for other hyperparameters; see make_hparams() in src/main.py. These can be specified on the command line, such as --num-layers 2 (for numerical parameters), --use-tags (for boolean parameters that default to False), or --no-partitioned (for boolean parameters that default to True).
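The flag convention described above can be illustrated with a small argparse sketch. This is not the code in src/main.py: `build_parser` is a hypothetical helper, and the values in `DEFAULTS` are placeholders rather than the real defaults from make_hparams().

```python
import argparse

# Placeholder defaults for illustration; see make_hparams() for the real ones.
DEFAULTS = {"num_layers": 8, "use_tags": False, "partitioned": True}

def build_parser(defaults):
    """Create one command-line option per hyperparameter."""
    parser = argparse.ArgumentParser()
    for name, default in defaults.items():
        flag = name.replace("_", "-")
        if isinstance(default, bool):
            if default:
                # Defaults to True: expose --no-<name> to switch it off.
                parser.add_argument("--no-" + flag, dest=name, action="store_false")
            else:
                # Defaults to False: expose --<name> to switch it on.
                parser.add_argument("--" + flag, dest=name, action="store_true")
        else:
            # Numerical (or string) parameter: --<name> VALUE.
            parser.add_argument("--" + flag, dest=name,
                                type=type(default), default=default)
    return parser

args = build_parser(DEFAULTS).parse_args(["--num-layers", "2", "--no-partitioned"])
print(args.num_layers, args.use_tags, args.partitioned)  # 2 False False
```

Deriving the flag name from the default value means a boolean option never needs an explicit true/false argument on the command line.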

If --use-tags is passed, the training and development trees are assumed to have predicted part-of-speech tags. If --predict-tags is passed, the data is assumed to have ground-truth tags instead. As a result, these two options can't be used simultaneously. Note that the files we provide in the data/ folder have predicted tags, and that data with gold tags must be obtained separately.

For each development evaluation, the F-score on the development set is computed and compared to the previous best. If the current model is better, the previous model will be deleted and the current model will be saved. The new filename will be derived from the provided model path base and the development F-score.
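The keep-only-the-best behaviour described above might be sketched as follows. The class `BestCheckpoint` and the helper `checkpoint_path` are hypothetical, and the filename pattern is an assumption made for illustration; the exact format produced by src/main.py may differ.

```python
import os
import tempfile

def checkpoint_path(model_path_base, dev_fscore):
    # Assumed naming scheme, for illustration only.
    return "{}_dev.{:.2f}.pt".format(model_path_base, dev_fscore)

class BestCheckpoint:
    """Keep only the checkpoint with the highest development F-score."""

    def __init__(self, model_path_base):
        self.model_path_base = model_path_base
        self.best_fscore = float("-inf")
        self.best_path = None

    def update(self, dev_fscore, save_fn):
        """Call save_fn(path) if dev_fscore is a new best; return True if saved."""
        if dev_fscore <= self.best_fscore:
            return False
        if self.best_path is not None and os.path.exists(self.best_path):
            os.remove(self.best_path)  # delete the previous best model
        self.best_fscore = dev_fscore
        self.best_path = checkpoint_path(self.model_path_base, dev_fscore)
        save_fn(self.best_path)
        return True

with tempfile.TemporaryDirectory() as tmp:
    tracker = BestCheckpoint(os.path.join(tmp, "en_charlstm"))
    write_stub = lambda path: open(path, "w").close()  # stands in for saving a model
    for fscore in [91.20, 90.85, 93.61]:
        tracker.update(fscore, write_stub)
    saved = os.listdir(tmp)
    print(saved)  # ['en_charlstm_dev.93.61.pt']
```

Only one file remains after the three evaluations: the second score does not beat the first, and the third replaces it.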

As an example, to train an English parser using the default hyperparameters, you can use the command:

python src/main.py train --use-words --use-chars-lstm --model-path-base models/en_charlstm --d-char-emb 64

To train an English parser that uses ELMo embeddings, the command is:

python src/main.py train --use-elmo --model-path-base models/en_elmo --num-layers 4

To train an English parser that uses BERT, the command is:

python src/main.py train --use-bert --model-path-base models/en_bert --bert-model "bert-large-uncased" --num-layers 2 --learning-rate 0.00005 --batch-size 32 --eval-batch-size 16 --subbatch-max-tokens 500

Evaluation Instructions

A saved model can be evaluated on a test corpus using the command python src/main.py test ... with the following arguments:

Argument	Description	Default
`--model-path-base`	Path base of saved model	N/A
`--evalb-dir`	Path to EVALB directory	`EVALB/`
`--test-path`	Path to test trees	`data/23.auto.clean`
`--test-path-raw`	Alternative path to test trees that is used for evalb only (used to double-check that evaluation against pre-processed trees does not contain any bugs)	Compare to trees from `--test-path`
`--eval-batch-size`	Number of examples to process in parallel when evaluating on the test set	100

If the parser was trained to have predicted part-of-speech tags as input (via the --use-tags flag), the test trees are assumed to have predicted part-of-speech tags. Otherwise, the tags in the test trees are not used as input to the parser.

As an example, after extracting the pre-trained model, you can evaluate it on the test set using the following command:

python src/main.py test --model-path-base models/nk_base6_lstm_dev.93.61.pt

The pre-trained model with CharLSTM embeddings obtains F-scores of 93.61 on the development set and 93.55 on the test set. The pre-trained model with ELMo embeddings obtains F-scores of 95.21 on the development set and 95.13 on the test set.
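For readers unfamiliar with the metric, the F-scores above are labeled-bracket scores computed by EVALB. The sketch below shows the core calculation only, under simplifying assumptions: each tree has already been converted to a list of (label, start, end) spans, and none of EVALB's additional conventions (such as its parameter file for ignoring certain labels) are applied. The function name `bracket_f1` is hypothetical.

```python
from collections import Counter

def bracket_f1(gold_sents, pred_sents):
    """Corpus-level labeled precision, recall and F1, in percent.

    Each sentence is a list of (label, start, end) constituent spans.
    """
    matched = gold_total = pred_total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_counts, pred_counts = Counter(gold), Counter(pred)
        # Counter intersection keeps the minimum count of each bracket.
        matched += sum((gold_counts & pred_counts).values())
        gold_total += sum(gold_counts.values())
        pred_total += sum(pred_counts.values())
    precision = 100.0 * matched / pred_total if pred_total else 0.0
    recall = 100.0 * matched / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# One sentence where the parser misplaces the start of the object NP.
gold = [[("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]]
pred = [[("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 2, 5)]]
print(bracket_f1(gold, pred))  # (75.0, 75.0, 75.0)
```

Counts are accumulated over the whole corpus before dividing, so long sentences contribute more brackets than short ones.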

Using the Trained Models

See the run_parse function in src/main.py for an example of how a parser can be loaded from disk and used to parse sentences using the PyTorch codebase.

The export/export.py file contains the code we used to convert our ELMo-based parser to a TensorFlow graph (for use in the release version of the parser). For our BERT-based parsers, consult export/export_bert.py instead. This exporting code hard-codes certain hyperparameter choices, so you will likely need to tweak it to export your own models. Exporting the model to TensorFlow allows it to be stored in a single file, including all ELMo/BERT weights. We also use TensorFlow's graph transforms to shrink the model size on disk with only a tiny impact on parsing accuracy: the compressed ELMo model obtains an F1-score of 95.07 on the test set, compared to 95.13 for the uncompressed model.

Reproducing Experiments

The code used for our ACL 2018 paper is tagged acl2018 in git. The EXPERIMENTS.md file in that version of the code contains additional notes about the command-line arguments we used to perform the experiments reported in our ACL 2018 paper.

The version of the code currently in this repository has added new features (such as BERT support and part-of-speech tag prediction), eliminated some of the less-performant parser variations (e.g. the CharConcat word representation), and updated to a newer version of PyTorch. The EXPERIMENTS.md file now describes the commands used to train our best-performing single-system parser for each language that we evaluate on.

Citation

If you use this software for research, please cite our paper as follows:

@InProceedings{Kitaev-2018-SelfAttentive,
  author    = {Kitaev, Nikita and Klein, Dan},
  title     = {Constituency Parsing with a Self-Attentive Encoder},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics},
}

Credits

The code in this repository and portions of this README are based on https://github.com/mitchellstern/minimal-span-parser
