This project includes utilities and scripts for automatic dataset generation. It is used in the following papers:
- Warstadt, A., Cao, Y., Grosu, I., Peng, W., Blix, H., Nie, Y., Alsop, A., Bordia, S., Liu, H., Parrish, A., Wang, S.-F., & Bowman, S. R. (2019). Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs. arXiv preprint arXiv:1909.02597.
- Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. F., & Bowman, S. R. (2019). BLiMP: A Benchmark of Linguistic Minimal Pairs for English. arXiv preprint arXiv:1912.00582.
- Jeretic, P., Warstadt, A., Bhooshan, S., & Williams, A. (2020). Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition. arXiv preprint arXiv:2004.03066.
To run a sample data generation script, navigate to the `data_generation` directory and run the following command:

```
python -m generation_projects.blimp.adjunct_island
```

If all dependencies are present in your workspace, this will generate the adjunct_island dataset from BLiMP. Generation takes about a minute to begin, after which progress can be watched in `outputs/benchmark/adjunct_island.jsonl`.
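Once the jsonl file appears, a quick way to sanity-check it is to print the first few records. A minimal sketch (it makes no assumption about field names, since these vary by project):

```python
import json

# Print the first three generated records to verify the output looks sane.
with open("outputs/benchmark/adjunct_island.jsonl") as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(json.loads(line))
```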
- With the exception of BLiMP, all project-specific code is kept in separate branches. BLiMP appears in master as a helpful exemplar.
- Major branches include:
- blimp
- imppres
- npi
- msgs
- structure_dependence
- qp
- The project contains the following packages:
  - `generation_projects`: scripts for generating data, organized into subdirectories by research project.
  - `mturk_qc`: code for carrying out Amazon Mechanical Turk quality control.
  - `outputs`: generated data, organized into subdirectories by research project.
  - `results`: experiment results files.
  - `results_processing`: scripts for analyzing results and producing figures.
  - `utils`: shared code for generation projects, including utilities for processing the vocabulary, generating constituents, manipulating generated strings, etc.
- It also contains the vocabulary file and its documentation:
  - `vocabulary.csv`: the vocab file.
  - `vocab_documentation.md`: the vocab documentation.
- The vocabulary file is `vocabulary.csv`.
- Each row in the .csv is a lexical item. Each column is a feature encoding grammatical information about the lexical item. Detailed documentation of the columns can be found in `vocab_documentation.md`.
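For illustration, the vocabulary can be inspected with the standard csv module. A minimal sketch, assuming only the `animate` feature column discussed in the examples below (check `vocab_documentation.md` for the full column inventory):

```python
import csv

# Each row is a lexical item; each column is a grammatical feature.
with open("vocabulary.csv", newline="") as f:
    rows = list(csv.DictReader(f))

animates = [row for row in rows if row.get("animate") == "1"]
print(f"{len(animates)} of {len(rows)} lexical items are animate")
print(animates[0])  # inspect one full row to see every feature column
```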
- The following notation is used to define selectional restrictions in the `arg_1`, `arg_2`, and `arg_3` columns:

  ```
  <DISJUNCTION> := <CONJUNCTION> | <CONJUNCTION>;<DISJUNCTION>
  <CONJUNCTION> := <CONDITION> | <CONDITION>^<CONJUNCTION>
  <CONDITION>   := <COLUMN>=<VALUE>
  ```
- In other words, the entire restriction is written in disjunctive normal form, where `;` is used for disjunction and `^` is used for conjunction.
- Example 1: `arg_1` of lexical item *breaking* is `animate=1`. This means any noun appearing as the subject of *breaking* must have value `1` in the column `animate`
.
- Example 2: `arg_1` of lexical item *buys* is `institution=1^sg=1;animate=1^sg=1`. This means any noun appearing as the subject of *buys* must meet one of the following conditions (a checker sketch follows this list):
  - have value `1` in column `institution` and value `1` in column `sg`, or
  - have value `1` in column `animate` and value `1` in column `sg`.
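To make the notation concrete, here is a minimal checker written directly from the grammar above; it is an illustrative sketch, not the project's own implementation:

```python
def satisfies(restriction, item):
    """Check a lexical item (dict of column -> value) against a selectional
    restriction: ';' separates disjuncts, '^' separates COLUMN=VALUE conjuncts."""
    if not restriction:
        return True  # no restriction imposed
    for conjunction in restriction.split(";"):
        conditions = (cond.split("=", 1) for cond in conjunction.split("^"))
        if all(item.get(column) == value for column, value in conditions):
            return True
    return False

# The arg_1 restriction of "buys" from Example 2:
buyer = {"institution": "0", "animate": "1", "sg": "1"}
print(satisfies("institution=1^sg=1;animate=1^sg=1", buyer))  # True
```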
- Disclaimer: As this project is under active development, data generated with different versions of the vocabulary may differ slightly.
- The `utils` package contains the shared code for the various generation projects:
  - `utils.conjugate` includes functions which conjugate verbs and add selecting auxiliaries/modals.
  - `utils.constituent_building` includes functions which "do syntax". The following are especially useful:
    - `verb_args_from_verb`: gathers all arguments of a verb into a dictionary.
    - `V_to_VP_mutate`: given a verb, modifies the expression to contain the string corresponding to a full VP.
    - `N_to_DP_mutate`: given a noun, gathers all arguments and a determiner, and modifies the expression to contain the string corresponding to a full DP.
  - `utils.data_generator` defines general classes that are instantiated by a particular generation project. The classes contain metadata fields, the main loop for generating a dataset (`generate_paradigm`), and functions for logging and exception handling.
  - `utils.data_type` contains the data type necessary for the numpy structured array data structure used in the vocabulary. If the columns of the vocabulary file are ever modified, this file must be modified to match.
  - `utils.string_utils` contains functions for cleaning up generated strings (removing extra whitespace, fixing capitalization, etc.).
  - `utils.vocab_sets` contains constants for accessing commonly used sets of vocab entries. Building these constants takes about a minute at the start of a generation script, but this speeds up generation of large datasets.
  - `utils.vocab_table` contains functions for creating and accessing the vocabulary table (see the sketch after this list):
    - `get_all`: gathers all vocab items with a given restriction.
    - `get_all_conjunctive`: gathers all vocab items with the given restrictions.
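As a hedged illustration of the kind of lookup `utils.vocab_table` provides, here is a minimal sketch over a toy numpy structured array; the field names, dtype, and the exact signatures of `get_all` and `get_all_conjunctive` in the package are assumptions and may differ:

```python
import numpy as np

# Toy vocabulary as a numpy structured array. The fields here are
# illustrative only; the real dtype lives in utils.data_type.
vocab = np.array(
    [("woman", "1", "1"), ("women", "1", "0"), ("bank", "0", "1")],
    dtype=[("expression", "U32"), ("animate", "U1"), ("sg", "U1")],
)

def get_all(column, value, table=vocab):
    """Gather all vocab items whose `column` equals `value`."""
    return table[table[column] == value]

def get_all_conjunctive(conditions, table=vocab):
    """Gather all vocab items satisfying every (column, value) condition."""
    mask = np.ones(len(table), dtype=bool)
    for column, value in conditions:
        mask &= table[column] == value
    return table[mask]

print(get_all("animate", "1")["expression"])                 # ['woman' 'women']
print(get_all_conjunctive([("animate", "1"), ("sg", "1")]))  # the "woman" row
```

Using a structured array here mirrors the `utils.data_type` note above: the dtype must stay in sync with the vocabulary's columns.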
If you use the data generation project in your work, please cite the BLiMP paper:
@article{warstadt2019blimp,
title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R},
journal={arXiv preprint arXiv:1912.00582},
year={2019}
}