Update documentation

New-Languages-for-NLP · Jan 15, 2024 · 8330076
1 parent 40dad93
commit 8330076
Showing 3 changed files with 35 additions and 8 deletions.
diff --git a/core_inception/README.md b/core_inception/README.md
@@ -4,6 +4,12 @@
 
 This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy.
 
+To get started, clone this project using Weasel:
+`spacy project clone --repo https://github.com/New-Languages-for-NLP/project-templates.git my_project_name`
+
+Then, follow the instructions in the README in the assets directory to set up your project's assets.
+
+
 ## 📋 project.yml
 
 The [`project.yml`](project.yml) defines the data assets required by the
@@ -20,7 +26,6 @@ Commands are only re-run if their inputs have changed.
 | --- | --- |
 | `install-dependencies` | Install python dependencies |
 | `install-language` | Install the language module from Cadet |
-| `validate-annotations` | Validate the files exported from INCEpTION |
 | `convert-raw-text` | Convert raw text files to spaCy's format |
 | `convert-annotations` | Convert annotated data from INCEpTION to spaCy's format |
 | `split-data` | Split the data into training, validation, and test sets |
@@ -40,9 +45,9 @@ inputs have changed.
 
 | Workflow | Steps |
 | --- | --- |
-| `all` | `install-dependencies` &rarr; `install-language` &rarr; `validate-annotations` &rarr; `convert-raw-text` &rarr; `convert-annotations` &rarr; `split-data` &rarr; `debug-data` &rarr; `debug-config` &rarr; `pretrain-model` &rarr; `train-model` |
+| `all` | `install-dependencies` &rarr; `install-language` &rarr; `convert-raw-text` &rarr; `convert-annotations` &rarr; `split-data` &rarr; `debug-data` &rarr; `debug-config` &rarr; `pretrain-model` &rarr; `train-model` |
 | `install` | `install-dependencies` &rarr; `install-language` |
-| `setup` | `validate-annotations` &rarr; `convert-raw-text` &rarr; `convert-annotations` &rarr; `split-data` &rarr; `debug-data` |
+| `setup` | `convert-raw-text` &rarr; `convert-annotations` &rarr; `split-data` &rarr; `debug-data` |
 | `train` | `debug-config` &rarr; `pretrain-model` &rarr; `train-model` |
 
 ### 🗂 Assets

diff --git a/core_inception/assets/README.md b/core_inception/assets/README.md
@@ -1,20 +1,33 @@
 # Project Assets
+
 ## Annotations (`/annotations`)
+
 This directory contains annotated data exported from INCEpTION.
 
-Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension, by convention.
+Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension.
+
+If you have named entity annotations and wish to combine them with your other syntactic annotations, you can additionally export the [CoNLL-2002 NER format](https://www.clips.uantwerpen.be/conll2002/ner/) for each file. These files will have the `.conll` extension. After export, you can use the `merge_annotations` script to add the NER annotations to the `MISC` column of your CoNLL-U files, for example:
 
-Because CoNLL-U does not support named entity annotation without a custom extension, named entity annotations are stored in the simpler [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Each annotated text is stored in a single file with the `.conll` extension, similar to [CoNLL-2002 data](https://www.cnts.ua.ac.be/conll2002/ner/).
+```bash
+python scripts/merge_annotations.py annotations/my_text.conllu annotations/my_text.conll > merged.conllu
+```
+
+For more information, try:
 
-When the data is converted into spaCy's binary format, any `.conllu` and `.conll` files with the same base name will be joined together into a single collection of documents. For example, `my_text.conllu` and `my_text.conll` will be joined together into a single collection of documents named `my_text`. **If the filenames differ, the data will be treated as separate documents, which will impact your model's accuracy.**
+```bash
+python scripts/merge_annotations.py --help
+```
 
 The included examples are annotated data from Project Gutenberg; see section on the [text](../text) directory below for more information. This example data was annotated automatically and is not intended to be used for training a real model.
+
 ## Language Module (`/lang`)
+
 This directory contains the language module exported from Cadet.
 
 The language module needs to be installable via `pip`, so it must include (at a minimum) a `setup.py` file and a `__init__.py` file. The `setup.py` file uses spaCy's entry points to register the language with spaCy.
 
 The module should have a directory structure like this:
+
 ```
 lang
 ├── zxx
@@ -25,19 +38,22 @@ lang
 ```
 
 **Replace the contents of this directory with your own language module**, renaming the directories labeled `zxx` to your [ISO-639 language code](https://www.loc.gov/standards/iso639-2/php/code_list.php). Then:
+
 - change the value of the `lang` variable in `project.yml` to your language code
 - change the value of `[nlp.lang]` in `configs/config.cfg` to your language code
 
 When you run `spacy project run install-language`, spaCy will install your language module as a Python package, and register it with spaCy.
+
 ## Raw Text (`/text`)
+
 This directory contains two example texts from Project Gutenberg:
 
 - _A Muramasa blade: A story of feudalism in old Japan_ by Louis Wertheimber (1887) - [muramasa.txt](muramasa.txt)
 - _The Vanguard of Venus_ by Landell Bartlett (1944) - [vanguard.txt](vanguard.txt)
 
 For the license governing the use of these texts, see [LICENSE](LICENSE).
 
-You can use plain text (`.txt`) files like this to pre-train your language model. 
+You can use plain text (`.txt`) files like this to pre-train your language model.
 
 **Replace these texts with ones from your target language.**
 

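Note on the `merge_annotations` step described in `assets/README.md` above: the script itself is not part of this diff, so the following is only a minimal sketch of what such a merge could look like, not the repository's implementation. It assumes the CoNLL-2002 file has one `token tag` pair per line, that both files contain the same tokens in the same order, and that tags are written to `MISC` under a hypothetical `NER=` key. The function name `merge_ner_into_conllu` and the sample data are illustrative.

```python
def merge_ner_into_conllu(conllu_text: str, conll_text: str) -> str:
    """Append NER tags from CoNLL-2002 text to the MISC column of CoNLL-U text."""
    # One tag per non-blank CoNLL-2002 line; the tag is the last field.
    tags = iter(
        line.split()[-1] for line in conll_text.splitlines() if line.strip()
    )
    merged = []
    for line in conllu_text.splitlines():
        # Keep sentence breaks and comment lines unchanged.
        if not line.strip() or line.startswith("#"):
            merged.append(line)
            continue
        cols = line.split("\t")
        # Multiword-token ranges (1-2) and empty nodes (1.1) carry no tag here.
        if "-" in cols[0] or "." in cols[0]:
            merged.append(line)
            continue
        tag = next(tags, None)
        if tag is None:
            raise ValueError("CoNLL-2002 file has fewer tokens than CoNLL-U file")
        # MISC is the tenth column; "_" marks it as empty.
        # "NER=" is an assumed key, not confirmed by the source.
        cols[9] = f"NER={tag}" if cols[9] == "_" else f"{cols[9]}|NER={tag}"
        merged.append("\t".join(cols))
    return "\n".join(merged) + "\n"


# Illustrative sample data (not taken from the repository).
SAMPLE_CONLLU = (
    "# text = Anna sings.\n"
    "1\tAnna\tAnna\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsings\tsing\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
SAMPLE_CONLL = "Anna B-PER\nsings O\n. O\n"

if __name__ == "__main__":
    print(merge_ner_into_conllu(SAMPLE_CONLLU, SAMPLE_CONLL), end="")
```

Existing `MISC` values such as `SpaceAfter=No` are preserved and the tag is appended after a `|`, which is the separator CoNLL-U uses for multiple `MISC` attributes.
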
diff --git a/core_inception/project.yml b/core_inception/project.yml
@@ -1,5 +1,11 @@
 title: "Train new language core model with Cadet and INCEpTION"
-description: "This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy."
+description: |
+  This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy.
+
+  To get started, clone this project using Weasel:
+  `spacy project clone --repo https://github.com/New-Languages-for-NLP/project-templates.git my_project_name`
+
+  Then, follow the instructions in the README in the assets directory to set up your project's assets.
 
 # Variables can be referenced across the project.yml using ${vars.var_name}
 vars: