Support joining CoNLL documents #8

Merged 3 commits on Jan 15, 2024
11 changes: 8 additions & 3 deletions core_inception/README.md
@@ -4,6 +4,12 @@

This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy.

To get started, clone this project using Weasel:
`spacy project clone --repo https://github.com/New-Languages-for-NLP/project-templates.git my_project_name`

Then, follow the instructions in the README in the assets directory to set up your project's assets.


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
@@ -20,7 +26,6 @@ Commands are only re-run if their inputs have changed.
| --- | --- |
| `install-dependencies` | Install python dependencies |
| `install-language` | Install the language module from Cadet |
-| `validate-annotations` | Validate the files exported from INCEpTION |
| `convert-raw-text` | Convert raw text files to spaCy's format |
| `convert-annotations` | Convert annotated data from INCEpTION to spaCy's format |
| `split-data` | Split the data into training, validation, and test sets |
@@ -40,9 +45,9 @@ inputs have changed.

| Workflow | Steps |
| --- | --- |
-| `all` | `install-dependencies` → `install-language` → `validate-annotations` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` → `debug-config` → `pretrain-model` → `train-model` |
+| `all` | `install-dependencies` → `install-language` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` → `debug-config` → `pretrain-model` → `train-model` |
| `install` | `install-dependencies` → `install-language` |
-| `setup` | `validate-annotations` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` |
+| `setup` | `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` |
| `train` | `debug-config` → `pretrain-model` → `train-model` |

### 🗂 Assets
24 changes: 20 additions & 4 deletions core_inception/assets/README.md
@@ -1,20 +1,33 @@
# Project Assets

## Annotations (`/annotations`)

This directory contains annotated data exported from INCEpTION.

-Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension, by convention.
+Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension.
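
As a rough illustration of the layout described above (one token per line, blank lines between sentences), the following sketch splits CoNLL-U text into sentences. It is not part of this project's scripts; the function name `read_sentences` and the sample text are assumptions made for the example.

```python
from typing import Iterator

# The ten tab-separated CoNLL-U columns, in the order defined by
# Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_sentences(text: str) -> Iterator[list[dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: list[dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        else:
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Made-up sample data, not taken from the project's annotations.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "1\tYes\tyes\tINTJ\t_\t_\t0\troot\t_\t_\n"
)

sentences = list(read_sentences(SAMPLE))
```

Each token keeps all ten columns as strings, so the `misc` value is available for any later processing.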

If you have named entity annotations and wish to combine them with your other syntactic annotations, you can additionally export the [CoNLL-2002 NER format](https://www.clips.uantwerpen.be/conll2002/ner/) for each file. These files will have the `.conll` extension. After export, you can use the `merge_annotations` script to add the NER annotations to the `MISC` column of your CoNLL-U files, for example:

-Because CoNLL-U does not support named entity annotation without a custom extension, named entity annotations are stored in the simpler [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Each annotated text is stored in a single file with the `.conll` extension, similar to [CoNLL-2002 data](https://www.cnts.ua.ac.be/conll2002/ner/).
```bash
python scripts/merge_annotations.py annotations/my_text.conllu annotations/my_text.conll > merged.conllu
```

For more information, try:

-When the data is converted into spaCy's binary format, any `.conllu` and `.conll` files with the same base name will be joined together into a single collection of documents. For example, `my_text.conllu` and `my_text.conll` will be joined together into a single collection of documents named `my_text`. **If the filenames differ, the data will be treated as separate documents, which will impact your model's accuracy.**
```bash
python scripts/merge_annotations.py --help
```
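
To give a sense of what adding NER annotations to the `MISC` column involves, here is a minimal sketch. It is not the code of `merge_annotations.py`, whose behavior may differ. It assumes the two files are token-aligned, that the NER tag is the last whitespace-separated field on each `.conll` line, and that the tag is stored under a `name=` key; the function `merge_ner` and that key are assumptions made for illustration.

```python
def merge_ner(conllu_text: str, conll_text: str, key: str = "name") -> str:
    """Append each token's NER tag to the MISC column as `key=TAG`.

    Hypothetical helper; the `name` key is an assumption.
    """
    # In CoNLL-2002 style files the tag is the last field on the line.
    tags = [ln.split()[-1] for ln in conll_text.splitlines() if ln.strip()]
    lines = conllu_text.splitlines()
    # Only plain integer IDs are real tokens; comments, multiword ranges
    # (1-2), and empty nodes (1.1) are left untouched.
    token_rows = [
        i for i, ln in enumerate(lines)
        if ln.strip() and not ln.startswith("#") and ln.split("\t")[0].isdigit()
    ]
    if len(token_rows) != len(tags):
        raise ValueError(
            f"token count mismatch: {len(token_rows)} CoNLL-U tokens "
            f"vs {len(tags)} NER tags"
        )
    for i, tag in zip(token_rows, tags):
        cols = lines[i].split("\t")
        entry = f"{key}={tag}"
        # MISC is the tenth column; "_" means it is empty.
        cols[9] = entry if cols[9] == "_" else f"{cols[9]}|{entry}"
        lines[i] = "\t".join(cols)
    return "\n".join(lines) + "\n"


# Made-up sample data for the sketch.
conllu = (
    "1\tAnna\tAnna\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsings\tsing\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
conll = "Anna B-PER\nsings O\n. O\n"

merged = merge_ner(conllu, conll)
```

Raising an error on a token count mismatch is a deliberate choice in this sketch: silently misaligned tags would attach entities to the wrong tokens.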

The included examples are annotated data from Project Gutenberg; see the section on the [text](../text) directory below for more information. This example data was annotated automatically and is not intended to be used for training a real model.

## Language Module (`/lang`)

This directory contains the language module exported from Cadet.

The language module needs to be installable via `pip`, so it must include (at a minimum) a `setup.py` file and a `__init__.py` file. The `setup.py` file uses spaCy's entry points to register the language with spaCy.
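
The entry point registration mentioned above might look roughly like the sketch below. spaCy discovers custom languages through the `spacy_languages` entry point group; everything else here (the helper `build_setup_kwargs`, the version string, and the class name `Zxx`) is a placeholder assumed for the example, not the contents of the file exported by Cadet.

```python
def build_setup_kwargs(lang_code: str, class_name: str) -> dict:
    """Return keyword arguments for `setuptools.setup` (hypothetical helper)."""
    return {
        "name": lang_code,
        "version": "0.0.1",  # placeholder version
        "packages": [lang_code],  # assumes the package directory is named after the code
        "entry_points": {
            # Format: "<language code> = <module>:<Language subclass>"
            "spacy_languages": [f"{lang_code} = {lang_code}:{class_name}"],
        },
    }


kwargs = build_setup_kwargs("zxx", "Zxx")

# In an actual setup.py these arguments would be passed on:
#     from setuptools import setup
#     setup(**kwargs)
```

Once a package with such an entry point is installed, the language code should resolve through spaCy's registry without any further imports.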

The module should have a directory structure like this:

```
lang
├── zxx
@@ -25,19 +38,22 @@ lang
```

**Replace the contents of this directory with your own language module**, renaming the directories labeled `zxx` to your [ISO-639 language code](https://www.loc.gov/standards/iso639-2/php/code_list.php). Then:

- change the value of the `lang` variable in `project.yml` to your language code
- change the value of `[nlp.lang]` in `configs/config.cfg` to your language code

When you run `spacy project run install-language`, spaCy will install your language module as a Python package, and register it with spaCy.

## Raw Text (`/text`)

This directory contains two example texts from Project Gutenberg:

- _A Muramasa blade: A story of feudalism in old Japan_ by Louis Wertheimber (1887) - [muramasa.txt](muramasa.txt)
- _The Vanguard of Venus_ by Landell Bartlett (1944) - [vanguard.txt](vanguard.txt)

For the license governing the use of these texts, see [LICENSE](LICENSE).

You can use plain text (`.txt`) files like this to pre-train your language model.

**Replace these texts with ones from your target language.**
