Source features support for V2.0 #2090

Merged · 24 commits · Sep 9, 2021
30 changes: 30 additions & 0 deletions .github/workflows/push.yml
@@ -42,6 +42,16 @@ jobs:
-src_vocab /tmp/onmt.vocab.src \
-tgt_vocab /tmp/onmt.vocab.tgt \
&& rm -rf /tmp/sample
- name: Test vocabulary build with features
run: |
python onmt/bin/build_vocab.py \
-config data/features_data.yaml \
-save_data /tmp/onmt_feat \
-src_vocab /tmp/onmt_feat.vocab.src \
-tgt_vocab /tmp/onmt_feat.vocab.tgt \
-src_feats_vocab '{"feat0": "/tmp/onmt_feat.vocab.feat0"}' \
-n_sample -1 \
&& rm -rf /tmp/sample
- name: Test field/transform dump
run: |
# The dumped fields are used later when testing tools
@@ -169,6 +179,26 @@ jobs:
-state_dim 256 \
-n_steps 10 \
-n_node 64
- name: Testing training with features
run: |
python onmt/bin/train.py \
-config data/features_data.yaml \
-src_vocab /tmp/onmt_feat.vocab.src \
-tgt_vocab /tmp/onmt_feat.vocab.tgt \
-src_feats_vocab '{"feat0": "/tmp/onmt_feat.vocab.feat0"}' \
-src_vocab_size 1000 -tgt_vocab_size 1000 \
-rnn_size 2 -batch_size 10 \
-word_vec_size 5 -rnn_size 10 \
-report_every 5 -train_steps 10 \
-save_model /tmp/onmt.model \
-save_checkpoint_steps 10
- name: Testing translation with features
run: |
python translate.py \
-model /tmp/onmt.model_step_10.pt \
-src data/data_features/src-test.txt \
-src_feats "{'feat0': 'data/data_features/src-test.feat0'}" \
-verbose
- name: Test RNN translation
run: |
head data/src-test.txt > /tmp/src-test.txt
1 change: 1 addition & 0 deletions data/data_features/src-test.feat0
@@ -0,0 +1 @@
C B A B
1 change: 1 addition & 0 deletions data/data_features/src-test.txt
@@ -0,0 +1 @@
she is a hard-working.
3 changes: 3 additions & 0 deletions data/data_features/src-train.feat0
@@ -0,0 +1,3 @@
A A A A B A A A C
A B C D E
C B A B
3 changes: 3 additions & 0 deletions data/data_features/src-train.txt
@@ -0,0 +1,3 @@
however, according to the logs, she is a hard-working.
however, according to the logs,
she is a hard-working.
1 change: 1 addition & 0 deletions data/data_features/src-val.feat0
@@ -0,0 +1 @@
C B A B
1 change: 1 addition & 0 deletions data/data_features/src-val.txt
@@ -0,0 +1 @@
she is a hard-working.
3 changes: 3 additions & 0 deletions data/data_features/tgt-train.txt
@@ -0,0 +1,3 @@
however, according to the logs, she is a hard-working.
however, according to the logs,
she is a hard-working.
1 change: 1 addition & 0 deletions data/data_features/tgt-val.txt
@@ -0,0 +1 @@
she is a hard-working.
11 changes: 11 additions & 0 deletions data/features_data.yaml
@@ -0,0 +1,11 @@
# Corpus opts:
data:
corpus_1:
path_src: data/data_features/src-train.txt
path_tgt: data/data_features/tgt-train.txt
src_feats:
feat0: data/data_features/src-train.feat0
transforms: [filterfeats, inferfeats]
valid:
path_src: data/data_features/src-val.txt
path_tgt: data/data_features/tgt-val.txt
70 changes: 70 additions & 0 deletions docs/source/FAQ.md
@@ -477,3 +477,73 @@ Training options to perform vocabulary update are:
* `-update_vocab`: set this option
* `-reset_optim`: set the value to "states"
* `-train_from`: checkpoint path


## How can I use source word features?

Extra information can be added to the words in the source sentences by defining word features.

Features should be defined in a separate file, using blank spaces as separators, with each row corresponding to a source sentence. Example input files:

data.src
```
however, according to the logs, she is hard-working.
```

feat0.txt
```
A C C C C A A B
```

**Notes**
- Prior tokenization is not necessary; features will be inferred using the `FeatInferTransform` transform.
- `FilterFeatsTransform` and `FeatInferTransform` are required for the functionality to work properly (see the sketch after these notes).
- Shared embeddings are not possible (at least with the `feat_merge: concat` method).
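For intuition, here is a minimal standalone sketch (plain Python, not OpenNMT-py code; file names are placeholders) of the alignment constraint that `FilterFeatsTransform` enforces: each feature line must carry exactly one label per whitespace-separated source token.

```python
# Standalone sketch of the feature/token alignment check.
# File names below are illustrative placeholders.

def check_feature_alignment(src_path, feat_path):
    with open(src_path, encoding="utf-8") as src_file, \
         open(feat_path, encoding="utf-8") as feat_file:
        for line_no, (src_line, feat_line) in enumerate(
                zip(src_file, feat_file), start=1):
            n_tokens = len(src_line.split())
            n_labels = len(feat_line.split())
            if n_tokens != n_labels:
                raise ValueError(
                    f"line {line_no}: {n_tokens} tokens vs "
                    f"{n_labels} feature labels")

check_feature_alignment("data.src", "feat0.txt")
```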

Sample config file:

```yaml
data:
dummy:
path_src: data/train/data.src
path_tgt: data/train/data.tgt
src_feats:
feat_0: data/train/data.src.feat_0
feat_1: data/train/data.src.feat_1
transforms: [filterfeats, onmt_tokenize, inferfeats, filtertoolong]
weight: 1
valid:
path_src: data/valid/data.src
path_tgt: data/valid/data.tgt
src_feats:
feat_0: data/valid/data.src.feat_0
feat_1: data/valid/data.src.feat_1
transforms: [filterfeats, onmt_tokenize, inferfeats]

  # Vocab opts
src_vocab: exp/data.vocab.src
tgt_vocab: exp/data.vocab.tgt
src_feats_vocab:
feat_0: exp/data.vocab.feat_0
feat_1: exp/data.vocab.feat_1
feat_merge: "sum"

```
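Assuming the config above is saved as `config.yaml` (a placeholder name), the vocabularies, including one per feature, can then be built with `build_vocab.py`, mirroring the CI test added in this PR:

```bash
python onmt/bin/build_vocab.py \
    -config config.yaml \
    -src_vocab exp/data.vocab.src \
    -tgt_vocab exp/data.vocab.tgt \
    -src_feats_vocab '{"feat_0": "exp/data.vocab.feat_0", "feat_1": "exp/data.vocab.feat_1"}' \
    -n_sample -1
```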

During inference you can pass features using the `--src_feats` argument. Its value is expected to be a Python-like dict mapping each feature name to its data file:

```
{'feat_0': '../data.txt.feats0', 'feat_1': '../data.txt.feats1'}
```

**Important note!** During inference, the input sentence is expected to be already tokenized, so feature inference must be handled before running the translate command. Example:

```bash
python translate.py -model model_step_10.pt -src ../data.txt.tok -output ../data.out --src_feats "{'feat_0': '../data.txt.feats0', 'feat_1': '../data.txt.feats1'}"
```

When using the Transformer architecture, make sure the following options are set appropriately (a sketch follows this list):

- `src_word_vec_size` and `tgt_word_vec_size`, or `word_vec_size`
- `feat_merge`: how to merge the feature embeddings with the word embeddings
- `feat_vec_size` and maybe `feat_vec_exponent`
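A minimal sketch of these options in the config, with illustrative values only (assumptions, not tuned recommendations):

```yaml
# Illustrative values; adjust to your architecture.
word_vec_size: 512   # embedding size for source and target words
feat_merge: sum      # merge feature embeddings into word embeddings
feat_vec_size: 512   # embedding size for each feature
```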
7 changes: 6 additions & 1 deletion onmt/bin/build_vocab.py
@@ -32,11 +32,13 @@ def build_vocab_main(opts):
transforms = make_transforms(opts, transforms_cls, fields)

logger.info(f"Counter vocab from {opts.n_sample} samples.")
src_counter, tgt_counter = build_vocab(
src_counter, tgt_counter, src_feats_counter = build_vocab(
opts, transforms, n_sample=opts.n_sample)

logger.info(f"Counters src:{len(src_counter)}")
logger.info(f"Counters tgt:{len(tgt_counter)}")
for feat_name, feat_counter in src_feats_counter.items():
logger.info(f"Counters {feat_name}:{len(feat_counter)}")

def save_counter(counter, save_path):
check_path(save_path, exist_ok=opts.overwrite, log=logger.warning)
@@ -52,6 +54,9 @@ def save_counter(counter, save_path):
else:
save_counter(src_counter, opts.src_vocab)
save_counter(tgt_counter, opts.tgt_vocab)

for k, v in src_feats_counter.items():
save_counter(v, opts.src_feats_vocab[k])


def _get_parser():
16 changes: 13 additions & 3 deletions onmt/bin/translate.py
@@ -6,6 +6,7 @@

import onmt.opts as opts
from onmt.utils.parse import ArgumentParser
from collections import defaultdict


def translate(opt):
@@ -15,12 +16,21 @@ def translate(opt):
translator = build_translator(opt, logger=logger, report_score=True)
src_shards = split_corpus(opt.src, opt.shard_size)
tgt_shards = split_corpus(opt.tgt, opt.shard_size)
shard_pairs = zip(src_shards, tgt_shards)

for i, (src_shard, tgt_shard) in enumerate(shard_pairs):
features_shards = []
features_names = []
for feat_name, feat_path in opt.src_feats.items():
features_shards.append(split_corpus(feat_path, opt.shard_size))
features_names.append(feat_name)
shard_pairs = zip(src_shards, tgt_shards, *features_shards)

for i, (src_shard, tgt_shard, *features_shard) in enumerate(shard_pairs):
features_shard_ = defaultdict(list)
for j, x in enumerate(features_shard):
features_shard_[features_names[j]] = x
logger.info("Translating shard %d." % i)
translator.translate(
src=src_shard,
src_feats=features_shard_,
tgt=tgt_shard,
batch_size=opt.batch_size,
batch_type=opt.batch_type,
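For readers following the diff: a simplified, self-contained sketch of the sharding logic introduced above. `split_corpus` is reimplemented here as a stand-in, and all paths are placeholders; only the zip-based alignment of feature shards with source shards mirrors the actual change.

```python
def split_corpus(path, shard_size):
    # Stand-in for the real helper: yield lists of at most shard_size lines.
    with open(path, encoding="utf-8") as f:
        shard = []
        for line in f:
            shard.append(line)
            if len(shard) == shard_size:
                yield shard
                shard = []
        if shard:
            yield shard

src_feats = {"feat0": "src-test.feat0"}  # placeholder paths
features_names = list(src_feats)
features_shards = [split_corpus(p, 32) for p in src_feats.values()]
src_shards = split_corpus("src-test.txt", 32)

# Each iteration yields one source shard plus one aligned shard per feature.
for i, (src_shard, *features_shard) in enumerate(
        zip(src_shards, *features_shards)):
    feats = dict(zip(features_names, features_shard))
    print(f"shard {i}: {len(src_shard)} sentences, features: {list(feats)}")
```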
1 change: 1 addition & 0 deletions onmt/constants.py
@@ -22,6 +22,7 @@ class CorpusName(object):
class SubwordMarker(object):
SPACER = '▁'
JOINER = '■'
CASE_MARKUP = ["⦅mrk_case_modifier_C⦆", "⦅mrk_begin_case_region_U⦆", "⦅mrk_end_case_region_U⦆"]


class ModelTask(object):