
Source features support for V2.0 #2090

Merged
Merged 24 commits into OpenNMT:master on Sep 9, 2021
Conversation

anderleich
Contributor

This is an initial draft to allow source features in the new data loading paradigm of OpenNMT-py v2.0.

What is done so far:

  1. Building feature vocabularies
  • Features should be in separate files, one feature per token
  • Update feature counters as is done for source sentences
  • Save feature vocabularies
  2. Training
  • Read features from the previously saved vocabulary files
  • Create fields by adding those features
  • Apply the vocabularies to the features in TextMultiField
  • The training phase stays the same
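Since each feature lives in its own file, every feature file must stay token-aligned with the source file. A minimal sketch of such an alignment check (file contents and tag names below are hypothetical, not the PR's code):

```python
# Sketch: verify a feature stream is token-aligned with its source stream.
# One sentence per line; one whitespace-separated feature per source token.
def check_feature_alignment(src_lines, feat_lines):
    for i, (src, feat) in enumerate(zip(src_lines, feat_lines)):
        n_tok, n_feat = len(src.split()), len(feat.split())
        if n_tok != n_feat:
            raise ValueError(
                f"line {i}: {n_tok} tokens but {n_feat} features")

src = ["John lives in Madrid"]
feat_0 = ["B O O B"]  # e.g. one named-entity tag per token
check_feature_alignment(src, feat_0)  # passes silently: 4 tokens, 4 tags
```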

I've also added some checks in the parser to ensure necessary options are set.

@anderleich
Contributor Author

anderleich commented Aug 24, 2021

The interaction between subword tokenization and features is handled by two additional transforms: InferFeatsTransform and FilterFeatsTransform.

Currently InferFeatsTransform is a simple transform which relies on the tokenizer's joiner_annotate option to expand features to subword units. Case markup is also handled.
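The expansion can be approximated as follows: each subword unit inherits the feature of the original token it came from, using the joiner marker produced by joiner_annotate to detect continuation pieces. A simplified sketch, not the actual transform code (the joiner symbol is an assumption):

```python
# Sketch: propagate token-level features to subword units. A subword
# prefixed with the joiner marker continues the previous token, so it
# reuses that token's feature instead of advancing to the next one.
JOINER = "￭"  # assumed joiner symbol from joiner_annotate

def infer_subword_feats(subwords, token_feats):
    feats, tok_idx = [], -1
    for sw in subwords:
        if not sw.startswith(JOINER):
            tok_idx += 1  # a new original token starts here
        feats.append(token_feats[tok_idx])
    return feats

# "Madrid is nice" split into subwords; 3 original tokens, 3 features
print(infer_subword_feats(["Mad", "￭rid", "is", "nice"], ["B", "O", "O"]))
# ['B', 'B', 'O', 'O']
```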

@anderleich
Contributor Author

anderleich commented Aug 24, 2021

Sample config:

data:
    dummy:
        path_src: data/train/data.src
        path_tgt: data/train/data.tgt
        src_feats:
            feat_0: data/train/data.src.feat_0
            feat_1: data/train/data.src.feat_1
        transforms: [filterfeats, onmt_tokenize, inferfeats, filtertoolong]
        weight: 1
    valid:
        path_src: data/valid/data.src
        path_tgt: data/valid/data.tgt
        src_feats:
            feat_0: data/valid/data.src.feat_0
            feat_1: data/valid/data.src.feat_1
        transforms: [filterfeats, onmt_tokenize, inferfeats]

# Vocab opts
src_vocab: exp/data.vocab.src
tgt_vocab: exp/data.vocab.tgt
src_feats_vocab: 
    feat_0: exp/data.vocab.feat_0
    feat_1: exp/data.vocab.feat_1
feat_merge: "sum"
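Assuming a config like the above, the vocabulary-building step (including the per-feature vocabulary files) would typically be run through the standard OpenNMT-py 2.x entry point; the config file name here is hypothetical:

```shell
# Build source, target and feature vocabularies from the YAML config
# (-n_sample -1 processes the full corpus).
onmt_build_vocab -config config.yaml -n_sample -1
```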

@anderleich anderleich marked this pull request as ready for review August 24, 2021 11:49
Contributor

@Zenglinxiao Zenglinxiao left a comment


This looks great.
Things to be done before merging:

  • Resolve marked places
  • Add some unit tests to make sure everything works as expected: the new Transform classes, for example
  • Maybe add an integration test to the CI workflow for ease of maintenance: a small dataset with source features may be needed
  • Relevant documentation or examples to be added to the FAQ
  • Some cleaning: debug code, comments, etc.
  • Changes for inference?

Review threads (resolved): onmt/inputters/corpus.py, onmt/bin/build_vocab.py, onmt/inputters/fields.py, onmt/transforms/features.py
@anderleich
Contributor Author

I've resolved the marked issues and improved InferFeatsTransform to better handle tokenization. I've added TODOs in the transform for tokenization options that are not handled yet.

Review threads (resolved): onmt/transforms/features.py, onmt/inputters/inputter.py
@Zenglinxiao
Contributor

Currently, the vocabulary preparation and training phases look good.
The last thing to deal with is the inference phase.
To use the trained extra source feature embeddings, you might need to make some changes in order to load the extra source feature files at inference time.

@anderleich
Contributor Author

anderleich commented Aug 27, 2021

During inference, I've noticed no transforms are used. Therefore, I need to perform subword feature inference separately and then feed the translator. Wouldn't it be easier to allow transforms at inference time, to ensure data preprocessing is the same?

With the current code, I first need to tokenize my data file, infer the features and finally pass the preprocessed files to onmt_translate. Otherwise I could have used the config file and passed only the raw data file to the script (as is done for the validation set).

@Zenglinxiao
Contributor

@anderleich
Yes, currently onmt_translate accepts only preprocessed data. Using transforms at inference requires some work, and people may want to use the model with other frameworks like CTranslate2, so this is not trivial to change. For now, you can consider that the input file is already prepared.

@anderleich
Contributor Author

anderleich commented Aug 30, 2021

Ok. However, I find the process rather complex. Were source features implemented in early versions of the inference process, as they were for training? If so, I guess some code blocks could be reusable.

@Zenglinxiao
Contributor

Previously, source features were implemented inline as token│feature1│feature2; both training and inference received input in this form and relied on _feature_tokenize and TextMultiField. You can check the following code:

feat_delim = u"│" if n_feats > 0 else None
for i in range(n_feats + 1):
    name = base_name + "_feat_" + str(i - 1) if i > 0 else base_name
    tokenize = partial(
        _feature_tokenize,
        layer=i,
        truncate=truncate,
        feat_delim=feat_delim)
    use_len = i == 0 and include_lengths
    feat = Field(
        init_token=bos, eos_token=eos,
        pad_token=pad, tokenize=tokenize,
        include_lengths=use_len)
    fields_.append((name, feat))
assert fields_[0][0] == base_name  # sanity check
field = TextMultiField(fields_[0][0], fields_[0][1], fields_[1:])
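For illustration, the layer selection that _feature_tokenize performed on this inline format can be sketched roughly like this (a simplified stand-in using "|" as the delimiter, not the actual OpenNMT-py function):

```python
# Sketch: split "token|feat1|feat2"-style input and keep a single layer.
# Layer 0 is the surface token; layer i > 0 is the i-th feature.
def feature_tokenize(string, layer=0, feat_delim=None):
    tokens = string.split()
    if feat_delim is not None:
        tokens = [t.split(feat_delim)[layer] for t in tokens]
    return tokens

sent = "John|B lives|O in|O Madrid|B"
print(feature_tokenize(sent, layer=0, feat_delim="|"))  # surface tokens
print(feature_tokenize(sent, layer=1, feat_delim="|"))  # feature layer
```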

@anderleich
Contributor Author

anderleich commented Aug 30, 2021

I've made some initial changes to add source features at inference. Instead of relying on _feature_tokenize, it now relies on a dictionary of features, which it uses to find the appropriate field in the TextMultiField (similar to what is done in the training process).

@anderleich
Contributor Author

anderleich commented Aug 30, 2021

We no longer expect features inline with the text, as in token|feature1|feature2. Therefore, _feature_tokenize is deprecated in favour of selectively assigning source text and features to the corresponding fields in TextMultiField.

I've changed the whole pipeline (vocabulary building, training and inference) to support source features correctly, as features were not being handled correctly during the training phase (the fields in TextMultiField were not correctly linked to the data provided).

I see some tests are failing because the Examples have changed, and there might be other minor issues, which I'll try to fix as soon as possible.

@@ -140,8 +155,7 @@ def preprocess(self, x):
             lists of tokens/feature tags for the sentence. The output
             is ordered like ``self.fields``.
             """
-        return [f.preprocess(x) for _, f in self.fields]
+        return [f.preprocess(x[fn]) for fn, f in self.fields]
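The effect of that one-line change is that preprocess now receives a dict keyed by field name instead of a single raw string, so each sub-field pulls its own data stream. A self-contained sketch of the new behaviour (the field class is a stand-in, not torchtext's Field):

```python
# Sketch: the example is now a dict keyed by field name, and each
# (name, field) pair looks up and preprocesses its own input stream.
class StubField:
    def preprocess(self, x):
        return x.split()

fields = [("src", StubField()), ("src_feat_0", StubField())]
example = {"src": "John lives in Madrid", "src_feat_0": "B O O B"}

out = [f.preprocess(example[fn]) for fn, f in fields]
# [['John', 'lives', 'in', 'Madrid'], ['B', 'O', 'O', 'B']]
```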
Contributor Author


This is the key part of the last change

@anderleich
Contributor Author

sample_str = "dummy input here ."

What is the expected behaviour here? If features are defined in the TextMultiField but we do not provide them in the input, what should the behaviour be? The current test does not return any errors.

Contributor

@Zenglinxiao Zenglinxiao left a comment


This looks great!
The only thing left is to fix the failing tests.

The unit test you are referring to checks the shape after preprocess, to make sure the correct number of feature fields is returned.
You can always check the output of GitHub Actions for the error messages, or run the test_*.py files to see any failure messages.

@@ -320,4 +320,4 @@ def validate_train_opts(cls, opt):

     @classmethod
     def validate_translate_opts(cls, opt):
-        pass
+        opt.src_feats = eval(opt.src_feats) if opt.src_feats else {}
Contributor


Why do we need this line? What's the expected input for this argument?
I'm assuming it would be a list of file paths; in that case, using nargs in the group.add("-src_feats", ...) of onmt/opts.py could do the trick.

Contributor Author

@anderleich anderleich Aug 31, 2021


No, it's not a list of paths; it's a dictionary mapping feature names to the corresponding file paths. Like this:

--src_feats "{'feat0': '../kk.txt.feats0', 'feat1': '../kk.txt.feats1'}"
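For what it's worth, a string like that parses into a plain dict; ast.literal_eval is a stricter stdlib alternative to the eval used in the PR, since it accepts only Python literals (a suggestion sketch, not the PR's code):

```python
import ast

# Parse the --src_feats argument string into a feature-name -> path dict.
arg = "{'feat0': '../kk.txt.feats0', 'feat1': '../kk.txt.feats1'}"
src_feats = ast.literal_eval(arg)  # safer than eval for untrusted input
print(src_feats["feat0"])  # ../kk.txt.feats0
```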

Contributor


Ok, that's reasonable then.

@anderleich
Contributor Author

I've updated the failing unit tests. I guess this is the expected behaviour.

@anderleich
Contributor Author

Everything works now!

Contributor

@Zenglinxiao Zenglinxiao left a comment


Please also update the documentation in the FAQ for the inference part.
Same for an integration test in onmt/tests/pull_request_chk.sh and .github/workflows/push.yml.

Do you have any comments to add @francoishernandez ?

Member

@francoishernandez francoishernandez left a comment


At first glance this looks good!
Thanks a lot @anderleich for diving into this and @Zenglinxiao for helping him out!
I'll review more in depth (and test it) soon.

In the meantime, @anderleich, it looks like there are a few unwanted comments left (pdb traces, TODOs that are probably already handled, etc.) that you might want to remove.
Also, the src_feats dict passed at inference is not very user-friendly, but I guess there might not necessarily be a better way.

Member

@francoishernandez francoishernandez left a comment


This looks good @anderleich. I tested the default config as well as the adaptation to the Transformer, and it seems to run fine. I did not test the transforms, though the code seems fine.
A few comments to address, and then I think we'll merge.

Review threads (resolved): docs/source/FAQ.md, onmt/inputters/corpus.py, onmt/inputters/text_dataset.py, onmt/opts.py, onmt/transforms/features.py
@anderleich
Contributor Author

anderleich commented Sep 9, 2021

I've made all those quick fixes. I've also removed the pdb traces I had left in the code.

anderleich and others added 2 commits September 9, 2021 11:36
@francoishernandez
Member

Merging, massive thanks! I'll bump to 2.2.0 soon to mark the change.
Next step: target features? :)

@francoishernandez francoishernandez merged commit 7e8c7c2 into OpenNMT:master Sep 9, 2021
@anderleich
Contributor Author

anderleich commented Sep 9, 2021

Great news! I'll run some experiments with source features. I guess I'll need to adapt the server for these models, so I will submit a PR soon.
Thanks for your time @francoishernandez and @Zenglinxiao!

PS: target features seem way more complicated to implement, but who knows... ;)

@francoishernandez
Member

I guess I'll need to adapt the server for these models so I will submit a PR soon.

Good idea.

PS: target features seem way more complicated to implement, but who knows... ;)

It's not that complicated actually. We started to have a go at it in #1710, but then we were sidetracked by the whole 2.0 refactoring and other topics and never got to wrap it up.

@anderleich
Contributor Author

Hi @francoishernandez ,

Have target features ever been implemented in the PyTorch version of OpenNMT? Maybe in v1.0? I might be able to give it a try, and I'd like to know if there is something already implemented in the code, as there was for source features.

Thanks!

@vince62s
Member

@anderleich, not sure if you saw, but we started the v3.0 version in a specific branch.
Since all the input logic has changed, you may want to start reading this new code.
Also, I have skipped one specific point in the translation_server with respect to features, if you have time to take a look.

@anderleich
Contributor Author

anderleich commented Oct 19, 2022

@vince62s Great! I'll try to take a look at it
With respect to target features, is there any publication I can base my implementation on? If I understood @francoishernandez correctly, he started #1710 but it was never merged, so there are no traces of target features in the main code, are there?
