Source features support for V2.0 #2090
Conversation
The interaction between subword tokenization and features is handled by two additional transforms. Currently, feature inference is a dummy transform which relies on the tokenizer's joiner_annotate option to expand features to subword units. Case markup is also handled.

Sample config:
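(The original sample was not preserved in this extract; below is a hypothetical reconstruction. All transform names, option names, and paths are assumptions for illustration, not verbatim from this PR.)

```yaml
# Hypothetical sketch of a config enabling source features.
# Names (filterfeats, inferfeats, src_feats_vocab, paths) are assumed.
data:
    corpus_1:
        path_src: data/train.src            # "the cat"
        path_tgt: data/train.tgt
        src_feats:                          # one parallel file per feature,
            feat0: data/train.src.feat0     # one value per source token: "DET NOUN"
        transforms: [filterfeats, onmt_tokenize, inferfeats]
src_feats_vocab:
    feat0: data/feat0.vocab
feat_merge: sum
```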
This looks great.
Things to be done before merging:
- Resolve the marked places
- Add some unit tests to make sure everything works as expected: the new Transform classes, for example
- Maybe add an integration test to the CI workflow for ease of maintenance: a small dataset with source features may be needed
- Add relevant documentation or an example to the FAQ or the examples
- Some cleaning: debug code, comments, etc.
- Changes for inference?
I've solved the marked issues and improved the FeatInferTransform to better handle tokenization. I've added TODOs in the transform for other tokenization options not handled yet.
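For intuition, here is a minimal sketch of the kind of expansion such a transform performs, assuming joiner-annotated subwords (the joiner "￭" marks pieces that rejoin on detokenization). This is an illustration, not the PR's actual FeatInferTransform:

```python
# Illustrative sketch, not the actual FeatInferTransform: propagate
# word-level feature values to joiner-annotated subword units.
JOINER = "\uffed"  # the OpenNMT Tokenizer joiner "￭"

def expand_feats_to_subwords(subwords, word_feats):
    """['he￭', 'llo', 'world'], ['A', 'B'] -> ['A', 'A', 'B']"""
    expanded = []
    word_idx = 0
    for i, tok in enumerate(subwords):
        expanded.append(word_feats[word_idx])
        joins_right = tok.endswith(JOINER)
        next_joins_left = i + 1 < len(subwords) and subwords[i + 1].startswith(JOINER)
        # A word boundary falls after `tok` only if neither side joins.
        if not joins_right and not next_joins_left:
            word_idx += 1
    return expanded

assert expand_feats_to_subwords(["he\uffed", "llo", "world"], ["A", "B"]) == ["A", "A", "B"]
```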
Currently, the vocabulary preparation and training phases look good.
During inference, I've noticed no Transforms are used. Therefore, I need to perform subword feature inference separately and then feed the translator. Wouldn't it be easier to be able to use transforms at inference, to ensure data preprocessing is the same? With the current code, I first need to tokenize my data file, infer the features, and finally pass the preprocessed files to the translator.
@anderleich |
Ok. However, I find the process rather complex. Were source features implemented in early versions of the inference process, as they were for the training process? If they were, I guess some code blocks could be reused.
Previously, source features were implemented in onmt/inputters/text_dataset.py, lines 174 to 189 at commit 4cd9978.
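For context, the inline format attached features to tokens with the "￨" separator (word￨feat0￨feat1). A self-contained sketch of that kind of parsing (an illustration, not the code at those lines):

```python
# Illustrative sketch of parsing the old inline-feature format,
# e.g. "the￨DET cat￨NOUN"; not the actual text_dataset.py code.
FEAT_SEP = "\uffe8"  # the feature separator "￨"

def split_inline_feats(line):
    columns = [tok.split(FEAT_SEP) for tok in line.strip().split()]
    n_feats = len(columns[0]) - 1
    words = [col[0] for col in columns]
    feats = [[col[i + 1] for col in columns] for i in range(n_feats)]
    return words, feats

words, feats = split_inline_feats("the\uffe8DET cat\uffe8NOUN")
assert words == ["the", "cat"] and feats == [["DET", "NOUN"]]
```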
I've made some initial changes to add source features at inference. Instead of relying on …
We no longer expect features inline with the text (the old word￨feat style). I've changed the whole pipeline (build vocab, training and inference) to support source features correctly, as features were not being handled correctly during the training phase (the corresponding fields were not being linked correctly). I see some tests are not running as expected.
```diff
@@ -140,8 +155,7 @@ def preprocess(self, x):
         lists of tokens/feature tags for the sentence. The output
         is ordered like ``self.fields``.
         """
-        return [f.preprocess(x) for _, f in self.fields]
+        return [f.preprocess(x[fn]) for fn, f in self.fields]
```
This is the key part of the last change.
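In other words (a self-contained toy example with a dummy Field class, not torchtext's): examples are now dicts keyed by field name, so each field explicitly picks out its own column instead of relying on positional alignment:

```python
# Toy illustration with a dummy Field class (not torchtext's):
# `x` is now a dict keyed by field name.
class Field:
    def preprocess(self, s):
        return s.split()

fields = [("src", Field()), ("feat0", Field())]
x = {"src": "the cat", "feat0": "DET NOUN"}
print([f.preprocess(x[fn]) for fn, f in fields])
# -> [['the', 'cat'], ['DET', 'NOUN']]
```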
What is the expected behaviour here? If features are defined in the …
This looks great!
The only thing left is to fix those failing tests.
The unit test you are referring to checks the shape after preprocess, to make sure the correct number of feature fields is returned.
You could always check the output of GitHub Actions for the error messages, or run the test_*.py files to see any failure messages.
```diff
@@ -320,4 +320,4 @@ def validate_train_opts(cls, opt):

     @classmethod
     def validate_translate_opts(cls, opt):
-        pass
+        opt.src_feats = eval(opt.src_feats) if opt.src_feats else {}
```
Why do we need this line? What's the expected input for this argument? I'm assuming it would be a list of file paths; then using `nargs` in the `group.add("-src_feats", ...)` of onmt/opts.py could do the trick.
No, it's not a list of paths, it's a dictionary mapping feature names to the corresponding file paths, like this:
`--src_feats "{'feat0': '../kk.txt.feats0', 'feat1': '../kk.txt.feats1'}"`
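As a side note, `ast.literal_eval` can parse such a dict-valued argument without executing arbitrary code; a minimal sketch (the PR itself calls `eval`, so this is just an alternative, with an assumed raw string):

```python
# Sketch: safely parse a dict-valued --src_feats argument.
# ast.literal_eval only evaluates Python literals, unlike eval.
import ast

raw = "{'feat0': '../kk.txt.feats0', 'feat1': '../kk.txt.feats1'}"
src_feats = ast.literal_eval(raw) if raw else {}
print(src_feats["feat0"])  # ../kk.txt.feats0
```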
Ok, that's reasonable then.
I've updated the failing unit tests. I guess this is the expected behaviour.
Everything works now!
Please also update the documentation in the FAQ for the inference part. Same for an integration test in onmt/tests/pull_request_chk.sh and .github/workflows/push.yml.
Do you have any comments to add, @francoishernandez?
At first glance this looks good!
Thanks a lot @anderleich for diving into this and @Zenglinxiao for helping him out!
I'll review more in depth (and test it) soon.
In the meantime @anderleich, it looks like there are a few unwanted comments left (pdb traces, TODOs that are probably handled, etc.) that you might want to remove.
Also, the src_feats dict passed at inference is not very user friendly, but I guess there might not necessarily be a better way.
This looks good @anderleich. I tested the default config as well as the adaptation to the Transformer; it seems to run fine. I did not test the transforms, though the code seems fine.
A few comments to address and we'll merge, I think.
I've made all those quick fixes. I've also removed the pdb traces I had left in the code.
Fix some typos -- Source Features 2.0 PR
Merging, massive thanks! I'll bump to 2.2.0 soon to mark the change.
Great news! I'll run some experiments with source features. I guess I'll need to adapt the server for these models, so I will submit a PR soon. PS: target features seem way more complicated to implement, but who knows... ;)
Good idea.
It's not that complicated actually. I started to have a go at it in #1710, but then we were sidetracked by the whole 2.0 refactoring and other topics and never got to wrap it up.
Hi @francoishernandez, have target features ever been implemented in the PyTorch version of OpenNMT? Maybe in v1.0? I might be able to give it a try, and I'd like to know if there is something already implemented in the code, as there was for the source features. Thanks!
@anderleich Not sure if you saw, but we started the v3.0 version in a specific branch.
@vince62s Great! I'll try to take a look at it.
This is an initial draft to allow source features in the new data loading paradigm of OpenNMT-py v2.0.
What is done so far: …
I've also added some checks in the parser to ensure the necessary options are set.