prepare-squad-dataset

How to preprocess the SQuAD dataset for various NLP tasks.

Classes and Methods

Class SquadGuru

SquadGuru is an NLP expert that converts the original SQuAD dataset into the input format required by a given NLP task. Input formats for the GEN_QA, GEN_QG, EXT_QA and CORPUS tasks are currently available. Use the SquadGuru class to extract the end-to-end feature X and the ground truth Y from the SQuAD dataset.

Constructor Signature
```
SquadGuru(parser: SquadParser,     # a parser that implements SquadParser
          tokenizer=None,          # a tokenizer that implements .tokenize(text: str)
          tags=SQUAD_TAGS,         # iterable of str
          versions=SQUAD_VERSIONS  # iterable of float
)
```
- Inject a parser, which the guru uses to extract X and Y from the original SQuAD dataset.
- Inject a tokenizer to produce tokenized X and Y. If None, no tokenization is applied.
- Inject an iterable of str naming the tags of the dataset to load.
- Inject an iterable of float naming the versions of the dataset to load.
.gather(only_first_answer=False, verbose=False)
- SquadGuru gathers feature X and Y from the dataset.
- Set only_first_answer to True to extract only the first answer from each question's answer set.
- Set verbose to True to print logs.
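
To make the effect of only_first_answer concrete, the sketch below walks the public SQuAD JSON layout (data, paragraphs, qas, answers) and collects (context, question, answers) triples. The helper `iter_squad_examples` and the `sample` dictionary are hypothetical and written only for illustration; this is not SquadGuru's actual implementation.

```python
def iter_squad_examples(squad_json, only_first_answer=False):
    """Yield (context, question, answers) triples from a SQuAD-format dict.

    Hypothetical helper: it mirrors the documented flag, not the library code.
    """
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Each answer is a dict with "text" and "answer_start".
                answers = [a["text"] for a in qa.get("answers", [])]
                if only_first_answer:
                    answers = answers[:1]  # keep at most the first answer
                yield context, qa["question"], answers


# A tiny hand-written sample in the SQuAD layout.
sample = {
    "version": "1.1",
    "data": [{
        "title": "Hamlet",
        "paragraphs": [{
            "context": "Hamlet is a tragedy written by William Shakespeare.",
            "qas": [{
                "id": "q1",
                "question": "Who wrote Hamlet?",
                "answers": [
                    {"text": "William Shakespeare", "answer_start": 31},
                    {"text": "Shakespeare", "answer_start": 39},
                ],
            }],
        }],
    }],
}

all_answers = list(iter_squad_examples(sample))
first_only = list(iter_squad_examples(sample, only_first_answer=True))
```

With the flag off, the single question keeps both answers; with it on, only "William Shakespeare" remains.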
.to_dataframe()
- Returns a pandas.DataFrame object.
- X is mapped to the 'Input' series.
- Y is mapped to the 'Target' series.
.to_numpy()
- Returns a numpy array of shape (N, 2), where N is the number of examples.
- Column 0 is X and column 1 is Y.
.to_file(x_outfile, y_outfile)
- Writes X to the text file x_outfile and Y to the text file y_outfile.
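
The on-disk layout used by .to_file is not described here. As a rough sketch, one common convention is two line-aligned text files, where line i of the X file pairs with line i of the Y file. The function `write_pairs` below is a hypothetical stand-in that assumes this layout; the real method may format its output differently.

```python
import os
import tempfile


def write_pairs(xs, ys, x_outfile, y_outfile):
    """Write X and Y to two text files, one example per line (assumed layout)."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must have the same number of examples")
    with open(x_outfile, "w", encoding="utf-8") as fx, \
         open(y_outfile, "w", encoding="utf-8") as fy:
        for x, y in zip(xs, ys):
            # Collapse whitespace runs (including newlines) so that
            # each example stays on exactly one line.
            fx.write(" ".join(x.split()) + "\n")
            fy.write(" ".join(y.split()) + "\n")


# Round-trip a single example through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "x.txt")
    y_path = os.path.join(tmp, "y.txt")
    write_pairs(["Who wrote Hamlet?\nHamlet is a play."], ["Shakespeare"],
                x_path, y_path)
    with open(x_path, encoding="utf-8") as f:
        x_lines = f.read().splitlines()
    with open(y_path, encoding="utf-8") as f:
        y_lines = f.read().splitlines()
```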

Class SquadParser

SquadParser implements methods to parse the original SQuAD dataset into the task-specific X, Y format.

.from_nlp_task(task: str)
- Currently available tasks are "GEN_QA", "GEN_QG", "EXT_QA" and "CORPUS".
.parse(context: str, question: str, answers: str iterable)
- Parses the given question-answers pair in the given context (paragraph) from the SQuAD dataset.
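
As an illustration of the .parse interface, here is a minimal sketch of a parser-like class for a generative QA setup. Everything in it is an assumption: the class name `SimpleGenQAParser`, the " [SEP] " separator, the choice of X as question plus context, and the return value as a list of (X, Y) pairs. The built-in parsers may produce a different format.

```python
class SimpleGenQAParser:
    """Hypothetical parser: X is question + context, Y is one answer text."""

    def __init__(self, sep=" [SEP] "):
        self.sep = sep  # separator between question and context (assumed)

    def parse(self, context, question, answers):
        """Return one (X, Y) pair per answer for a single question."""
        x = question + self.sep + context
        return [(x, answer) for answer in answers]


parser = SimpleGenQAParser()
pairs = parser.parse(
    context="Hamlet is a tragedy written by William Shakespeare.",
    question="Who wrote Hamlet?",
    answers=["William Shakespeare", "Shakespeare"],
)
```

A question with two answers yields two pairs that share the same X, which is the situation the only_first_answer flag of .gather is meant to avoid.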

Tokenizer

Any tokenizer that implements .tokenize(text: str) can be used. In the examples, transformers/BertTokenizer is used.
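
Since only a .tokenize(text: str) method is required, a small custom class should also qualify. The sketch below is a hypothetical `SimpleTokenizer` that lowercases the text and splits it into word and punctuation tokens; returning a list of str is an assumption modeled on what BertTokenizer.tokenize returns.

```python
import re


class SimpleTokenizer:
    """Hypothetical tokenizer exposing the required .tokenize(text) method."""

    def tokenize(self, text: str):
        # Lowercase, then split into runs of word characters
        # and single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = SimpleTokenizer().tokenize("Who wrote Hamlet?")
```

Here `tokens` is `['who', 'wrote', 'hamlet', '?']`.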

Check Examples

Extractive QA Dataset.ipynb

Generative QA Dataset.ipynb

Generative QG Dataset.ipynb

SQuAD Corpus.ipynb

binchoo/prepare-squad-dataset
