diff --git a/core_inception/README.md b/core_inception/README.md
index db6070c..46f7d56 100644
--- a/core_inception/README.md
+++ b/core_inception/README.md
@@ -4,6 +4,12 @@
 This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy.
+To get started, clone this project using Weasel:
+`spacy project clone --repo https://github.com/New-Languages-for-NLP/project-templates.git my_project_name`
+
+Then, follow the instructions in the README in the assets directory to set up your project's assets.
+
+
 ## 📋 project.yml
 The [`project.yml`](project.yml) defines the data assets required by the
@@ -20,7 +26,6 @@ Commands are only re-run if their inputs have changed.
 | --- | --- |
 | `install-dependencies` | Install python dependencies |
 | `install-language` | Install the language module from Cadet |
-| `validate-annotations` | Validate the files exported from INCEpTION |
 | `convert-raw-text` | Convert raw text files to spaCy's format |
 | `convert-annotations` | Convert annotated data from INCEpTION to spaCy's format |
 | `split-data` | Split the data into training, validation, and test sets |
@@ -40,9 +45,9 @@ inputs have changed.
 | Workflow | Steps |
 | --- | --- |
-| `all` | `install-dependencies` → `install-language` → `validate-annotations` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` → `debug-config` → `pretrain-model` → `train-model` |
+| `all` | `install-dependencies` → `install-language` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` → `debug-config` → `pretrain-model` → `train-model` |
 | `install` | `install-dependencies` → `install-language` |
-| `setup` | `validate-annotations` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` |
+| `setup` | `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` |
 | `train` | `debug-config` → `pretrain-model` → `train-model` |
 ### 🗂 Assets
diff --git a/core_inception/assets/README.md b/core_inception/assets/README.md
index aa3781d..224eec3 100644
--- a/core_inception/assets/README.md
+++ b/core_inception/assets/README.md
@@ -1,20 +1,33 @@
 # Project Assets
+
 ## Annotations (`/annotations`)
+
 This directory contains annotated data exported from INCEpTION.
-Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension, by convention.
+Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension.
+
+If you have named entity annotations and wish to combine them with your other syntactic annotations, you can additionally export the [CoNLL-2002 NER format](https://www.clips.uantwerpen.be/conll2002/ner/) for each file. These files will have the `.conll` extension. After export, you can use the `merge_annotations` script to add the NER annotations to the `MISC` column of your CoNLL-U files, for example:
-Because CoNLL-U does not support named entity annotation without a custom extension, named entity annotations are stored in the simpler [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Each annotated text is stored in a single file with the `.conll` extension, similar to [CoNLL-2002 data](https://www.cnts.ua.ac.be/conll2002/ner/).
+```bash
+python scripts/merge_annotations.py annotations/my_text.conllu annotations/my_text.conll > merged.conllu
+```
+
+For more information, try:
-When the data is converted into spaCy's binary format, any `.conllu` and `.conll` files with the same base name will be joined together into a single collection of documents. For example, `my_text.conllu` and `my_text.conll` will be joined together into a single collection of documents named `my_text`. **If the filenames differ, the data will be treated as separate documents, which will impact your model's accuracy.**
+```bash
+python scripts/merge_annotations.py --help
+```
 The included examples are annotated data from Project Gutenberg; see section on the [text](../text) directory below for more information. This example data was annotated automatically and is not intended to be used for training a real model.
+
 ## Language Module (`/lang`)
+
 This directory contains the language module exported from Cadet. The language module needs to be installable via `pip`, so it must include (at a minimum) a `setup.py` file and a `__init__.py` file. The `setup.py` file uses spaCy's entry points to register the language with spaCy.
 The module should have a directory structure like this:
+
 ```
 lang
 ├── zxx
@@ -25,11 +38,14 @@ lang
 ```
 **Replace the contents of this directory with your own language module**, renaming the directories labeled `zxx` to your [ISO-639 language code](https://www.loc.gov/standards/iso639-2/php/code_list.php). Then:
+
 - change the value of the `lang` variable in `project.yml` to your language code
 - change the value of `[nlp.lang]` in `configs/config.cfg` to your language code
 When you run `spacy project run install-language`, spaCy will install your language module as a Python package, and register it with spaCy.
+
 ## Raw Text (`/text`)
+
 This directory contains two example texts from Project Gutenberg:
 - _A Muramasa blade: A story of feudalism in old Japan_ by Louis Wertheimber (1887) - [muramasa.txt](muramasa.txt)
@@ -37,7 +53,7 @@ This directory contains two example texts from Project Gutenberg:
 For the license governing the use of these texts, see [LICENSE](LICENSE).
-You can use plain text (`.txt`) files like this to pre-train your language model.
+You can use plain text (`.txt`) files like this to pre-train your language model.
 **Replace these texts with ones from your target language.**
diff --git a/core_inception/project.yml b/core_inception/project.yml
index 95163e2..55f77b6 100644
--- a/core_inception/project.yml
+++ b/core_inception/project.yml
@@ -1,5 +1,11 @@
 title: "Train new language core model with Cadet and INCEpTION"
-description: "This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy."
+description: |
+  This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy.
+
+  To get started, clone this project using Weasel:
+  `spacy project clone --repo https://github.com/New-Languages-for-NLP/project-templates.git my_project_name`
+
+  Then, follow the instructions in the README in the assets directory to set up your project's assets.
 # Variables can be referenced across the project.yml using ${vars.var_name}
 vars: