
Source features support for V2.0 #2090

Merged
Merged 24 commits into OpenNMT:master on Sep 9, 2021
Conversation

anderleich
Contributor

This is an initial draft to allow source features in the new data loading paradigm of OpenNMT-py v2.0.

What is done so far:

  1. Building feature vocabularies
  • Features should be in separate files, one feature per token
  • Update feature counters as is done for source sentences
  • Save feature vocabularies
  2. Training
  • Read features from the previously saved vocabulary files
  • Create fields by adding those features
  • Apply the vocabularies to the features in TextMultiField
  • The training phase stays the same
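Since each feature lives in its own file, every feature file must stay token-aligned with the source file. A minimal sketch of such an alignment check (file contents and tag names below are hypothetical, not the PR's code):

```python
# Sketch: verify a feature stream is token-aligned with its source stream.
# One sentence per line; one whitespace-separated feature per source token.
def check_feature_alignment(src_lines, feat_lines):
    for i, (src, feat) in enumerate(zip(src_lines, feat_lines)):
        n_tok, n_feat = len(src.split()), len(feat.split())
        if n_tok != n_feat:
            raise ValueError(
                f"line {i}: {n_tok} tokens but {n_feat} features")

src = ["John lives in Madrid"]
feat_0 = ["B O O B"]  # e.g. one named-entity tag per token
check_feature_alignment(src, feat_0)  # passes silently: 4 tokens, 4 tags
```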

I've also added some checks in the parser to ensure necessary options are set.

@anderleich
Contributor Author

anderleich commented Aug 24, 2021

The interaction between subword tokenization and features is handled by two additional transforms: InferFeatsTransform and FilterFeatsTransform.

Currently InferFeatsTransform is a simple transform which relies on the tokenizer's joiner_annotate option to expand features to subword units. Case markup is also handled.
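The expansion can be approximated as follows: each subword unit inherits the feature of the original token it came from, using the joiner marker produced by joiner_annotate to detect continuation pieces. A simplified sketch, not the actual transform code (the joiner symbol is an assumption):

```python
# Sketch: propagate token-level features to subword units. A subword
# prefixed with the joiner marker continues the previous token, so it
# reuses that token's feature instead of advancing to the next one.
JOINER = "￭"  # assumed joiner symbol from joiner_annotate

def infer_subword_feats(subwords, token_feats):
    feats, tok_idx = [], -1
    for sw in subwords:
        if not sw.startswith(JOINER):
            tok_idx += 1  # a new original token starts here
        feats.append(token_feats[tok_idx])
    return feats

# "Madrid is nice" split into subwords; 3 original tokens, 3 features
print(infer_subword_feats(["Mad", "￭rid", "is", "nice"], ["B", "O", "O"]))
# ['B', 'B', 'O', 'O']
```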

@anderleich
Contributor Author

anderleich commented Aug 24, 2021

Sample config:

data:
    dummy:
        path_src: data/train/data.src
        path_tgt: data/train/data.tgt
        src_feats:
            feat_0: data/train/data.src.feat_0
            feat_1: data/train/data.src.feat_1
        transforms: [filterfeats, onmt_tokenize, inferfeats, filtertoolong]
        weight: 1
    valid:
        path_src: data/valid/data.src
        path_tgt: data/valid/data.tgt
        src_feats:
            feat_0: data/valid/data.src.feat_0
            feat_1: data/valid/data.src.feat_1
        transforms: [filterfeats, onmt_tokenize, inferfeats]

# Vocab opts
src_vocab: exp/data.vocab.src
tgt_vocab: exp/data.vocab.tgt
src_feats_vocab: 
    feat_0: exp/data.vocab.feat_0
    feat_1: exp/data.vocab.feat_1
feat_merge: "sum"
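Assuming a config like the above, the vocabulary-building step (including the per-feature vocabulary files) would typically be run through the standard OpenNMT-py 2.x entry point; the config file name here is hypothetical:

```shell
# Build source, target and feature vocabularies from the YAML config
# (-n_sample -1 processes the full corpus).
onmt_build_vocab -config config.yaml -n_sample -1
```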

@anderleich anderleich marked this pull request as ready for review August 24, 2021 11:49
Contributor

@Zenglinxiao Zenglinxiao left a comment


This looks great.
Things to be done before merging:

  • Resolve marked places
  • Add some unit tests to make sure everything works as expected: the new Transform classes, for example
  • Maybe add an integration test to the CI workflow for ease of maintenance: a small dataset with source features may be needed
  • Relevant documentation or examples to be added to the FAQ
  • Some cleaning: debug code, comments, etc.
  • Changes for inference?

Review threads (resolved): onmt/inputters/corpus.py, onmt/bin/build_vocab.py, onmt/inputters/fields.py, onmt/transforms/features.py
@anderleich
Contributor Author

I've resolved the marked issues and improved InferFeatsTransform to better handle tokenization. I've added TODOs in the transform for tokenization options that are not handled yet.

Review threads (resolved): onmt/transforms/features.py, onmt/inputters/inputter.py
@Zenglinxiao
Contributor

Currently, the vocabulary preparation and training phases look good.
The last thing to deal with is the inference phase.
To use the trained extra source feature embeddings, you might need to make some changes in order to load the extra source feature files at inference time.

@anderleich
Contributor Author

anderleich commented Aug 27, 2021

During inference, I've noticed no transforms are used. Therefore, I need to perform subword feature inference separately and then feed the translator. Wouldn't it be easier to allow transforms at inference time, to ensure data preprocessing is the same?

With the current code, I first need to tokenize my data file, infer the features and finally pass the preprocessed files to onmt_translate. Otherwise I could have used the config file and passed only the raw data file to the script (as is done for the validation set).

@Zenglinxiao
Contributor

@anderleich
Yes, currently onmt_translate accepts only preprocessed data. Using transforms at inference requires some work, and people may want to use the model with other frameworks like CTranslate2, so this is not trivial to change. For now, you can consider that the input file is already prepared.

@anderleich
Contributor Author

anderleich commented Aug 30, 2021

Ok. However, I find the process rather complex. Were source features implemented in early versions of the inference process, as they were for training? If so, I guess some code blocks could be reusable.

@Zenglinxiao
Contributor

Previously, source features were implemented inline as token│feature1│feature2; both training and inference received input in this form and relied on _feature_tokenize and TextMultiField. You can check the following code:

feat_delim = u"│" if n_feats > 0 else None
for i in range(n_feats + 1):
    name = base_name + "_feat_" + str(i - 1) if i > 0 else base_name
    tokenize = partial(
        _feature_tokenize,
        layer=i,
        truncate=truncate,
        feat_delim=feat_delim)
    use_len = i == 0 and include_lengths
    feat = Field(
        init_token=bos, eos_token=eos,
        pad_token=pad, tokenize=tokenize,
        include_lengths=use_len)
    fields_.append((name, feat))
assert fields_[0][0] == base_name  # sanity check
field = TextMultiField(fields_[0][0], fields_[0][1], fields_[1:])
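For illustration, the layer selection that _feature_tokenize performed on this inline format can be sketched roughly like this (a simplified stand-in using "|" as the delimiter, not the actual OpenNMT-py function):

```python
# Sketch: split "token|feat1|feat2"-style input and keep a single layer.
# Layer 0 is the surface token; layer i > 0 is the i-th feature.
def feature_tokenize(string, layer=0, feat_delim=None):
    tokens = string.split()
    if feat_delim is not None:
        tokens = [t.split(feat_delim)[layer] for t in tokens]
    return tokens

sent = "John|B lives|O in|O Madrid|B"
print(feature_tokenize(sent, layer=0, feat_delim="|"))  # surface tokens
print(feature_tokenize(sent, layer=1, feat_delim="|"))  # feature layer
```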

@anderleich
Contributor Author

anderleich commented Aug 30, 2021

I've made some initial changes to add source features at inference. Instead of relying on _feature_tokenize, it now relies on a dictionary of features, which it uses to find the appropriate field in the TextMultiField (similar to what is done in the training process).

@anderleich
Contributor Author

anderleich commented Aug 30, 2021

We no longer expect features inline with the text, as in token|feature1|feature2. Therefore, _feature_tokenize is deprecated in favour of selectively assigning source text and features to the corresponding fields in TextMultiField.

I've changed the whole pipeline (vocabulary building, training and inference) to support source features correctly, as features were not being handled correctly during the training phase (the fields in TextMultiField were not correctly linked to the data provided).

I see some tests are failing because the Examples have changed, and there might be other minor issues, which I'll try to fix as soon as possible.

@@ -140,8 +155,7 @@ def preprocess(self, x):
             lists of tokens/feature tags for the sentence. The output
             is ordered like ``self.fields``.
             """
-        return [f.preprocess(x) for _, f in self.fields]
+        return [f.preprocess(x[fn]) for fn, f in self.fields]
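The effect of that one-line change is that preprocess now receives a dict keyed by field name instead of a single raw string, so each sub-field pulls its own data stream. A self-contained sketch of the new behaviour (the field class is a stand-in, not torchtext's Field):

```python
# Sketch: the example is now a dict keyed by field name, and each
# (name, field) pair looks up and preprocesses its own input stream.
class StubField:
    def preprocess(self, x):
        return x.split()

fields = [("src", StubField()), ("src_feat_0", StubField())]
example = {"src": "John lives in Madrid", "src_feat_0": "B O O B"}

out = [f.preprocess(example[fn]) for fn, f in fields]
# [['John', 'lives', 'in', 'Madrid'], ['B', 'O', 'O', 'B']]
```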
Contributor Author


This is the key part of the last change

@anderleich
Contributor Author

sample_str = "dummy input here ."

What is the expected behaviour here? If features are defined in the TextMultiField but we do not provide them in the input, what should the behaviour be? The current test does not return any errors.

Contributor

@Zenglinxiao Zenglinxiao left a comment


This looks great!
The only thing left is to fix the failing tests.

The unit test you are referring to checks the shape after preprocess, to make sure the correct number of feature fields is returned.
You can always check the output of GitHub Actions for the error messages, or run the test_*.py files to see any failure messages.

@@ -320,4 +320,4 @@ def validate_train_opts(cls, opt):

     @classmethod
     def validate_translate_opts(cls, opt):
-        pass
+        opt.src_feats = eval(opt.src_feats) if opt.src_feats else {}
Contributor


Why do we need this line? What's the expected input for this argument?
I'm assuming it would be a list of file paths; in that case, using nargs in the group.add("-src_feats", ...) of onmt/opts.py could do the trick.

Contributor Author

@anderleich anderleich Aug 31, 2021


No, it's not a list of paths; it's a dictionary mapping feature names to the corresponding file paths. Like this:

--src_feats "{'feat0': '../kk.txt.feats0', 'feat1': '../kk.txt.feats1'}"
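For what it's worth, a string like that parses into a plain dict; ast.literal_eval is a stricter stdlib alternative to the eval used in the PR, since it accepts only Python literals (a suggestion sketch, not the PR's code):

```python
import ast

# Parse the --src_feats argument string into a feature-name -> path dict.
arg = "{'feat0': '../kk.txt.feats0', 'feat1': '../kk.txt.feats1'}"
src_feats = ast.literal_eval(arg)  # safer than eval for untrusted input
print(src_feats["feat0"])  # ../kk.txt.feats0
```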

Contributor


Ok, that's reasonable then.

@anderleich
Contributor Author

I've updated the failing unit tests. I guess this is the expected behaviour.

@anderleich
Contributor Author

Everything works now!

Contributor

@Zenglinxiao Zenglinxiao left a comment


Please also update the documentation in the FAQ for the inference part.
Same for an integration test in onmt/tests/pull_request_chk.sh and .github/workflows/push.yml.

Do you have any comments to add @francoishernandez ?

Member

@francoishernandez francoishernandez left a comment


At first glance this looks good!
Thanks a lot @anderleich for diving into this and @Zenglinxiao for helping him out!
I'll review more in depth (and test it) soon.

In the meantime, @anderleich, it looks like there are a few unwanted comments left (pdb traces, TODOs that are probably already handled, etc.) that you might want to remove.
Also, the src_feats dict passed at inference is not very user-friendly, but I guess there might not necessarily be a better way.

Member

@francoishernandez francoishernandez left a comment


This looks good @anderleich. I tested the default config as well as the adaptation to the Transformer, and it seems to run fine. I did not test the transforms, though the code seems fine.
A few comments to address, and then I think we'll merge.

Review threads (resolved): docs/source/FAQ.md, onmt/inputters/corpus.py, onmt/inputters/text_dataset.py, onmt/opts.py, onmt/transforms/features.py
@anderleich
Contributor Author

anderleich commented Sep 9, 2021

I've made all those quick fixes. I've also removed the pdb traces I had left in the code.

anderleich and others added 2 commits September 9, 2021 11:36
@francoishernandez
Member

Merging, massive thanks! I'll bump to 2.2.0 soon to mark the change.
Next step: target features? :)

@francoishernandez francoishernandez merged commit 7e8c7c2 into OpenNMT:master Sep 9, 2021
@anderleich
Contributor Author

anderleich commented Sep 9, 2021

Great news! I'll run some experiments with source features. I guess I'll need to adapt the server for these models, so I will submit a PR soon.
Thanks for your time @francoishernandez and @Zenglinxiao!

PS: target features seem way more complicated to implement, but who knows... ;)

@francoishernandez
Member

I guess I'll need to adapt the server for these models so I will submit a PR soon.

Good idea.

PS: target features seem way more complicated to implement, but who knows... ;)

It's not that complicated actually. We started to have a go at it in #1710, but then we were sidetracked by the whole 2.0 refactoring and other topics and never got to wrap it up.

@anderleich
Contributor Author

Hi @francoishernandez ,

Have target features ever been implemented in the PyTorch version of OpenNMT? Maybe in v1.0? I might be able to give it a try, and I'd like to know if there is something already implemented in the code, as there was for source features.

Thanks!

@vince62s
Member

@anderleich, not sure if you saw, but we started the v3.0 version in a specific branch.
Since all the input logic has changed, you may want to start reading this new code.
Also, I have skipped one specific point in the translation_server with respect to features, if you have time to take a look.

@anderleich
Contributor Author

anderleich commented Oct 19, 2022

@vince62s Great! I'll try to take a look at it
With respect to target features, is there any publication I can base my implementation on? If I understood @francoishernandez correctly, he started #1710 but it was never merged, so there are no traces of target features in the main code, are there?
