Need help with understanding tokenization and preprocessing in the case of a translation problem #906
The best (but hard) way to gain insight is to study the source code of SubwordTextEncoder.
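For a quick look without reading all of the source, here is a minimal sketch of loading the subword vocabulary produced by `t2t-datagen` and inspecting how it segments a sentence. The vocab filename is an assumption; the exact name depends on your T2T version and `--data_dir`.

```python
# Sketch: inspect how a trained SubwordTextEncoder segments text.
from tensor2tensor.data_generators import text_encoder

# Assumed path; check your --data_dir for the actual vocab filename.
vocab_path = "t2t_data/vocab.translate_ende_wmt32k.32768.subwords"
encoder = text_encoder.SubwordTextEncoder(vocab_path)

ids = encoder.encode("Tokenization of dates like 12.5.2018 is worth checking.")
print(ids)                       # subword ids fed to the Transformer
print(encoder.decode_list(ids))  # the individual subword pieces
print(encoder.decode(ids))       # round-trips to the original string
```

Note that `SubwordTextEncoder` is invertible on raw text, so no separate hand-written tokenizer has to be applied before it.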
See Macháček et al. (2018), who explore various morphologically-motivated pre-tokenizations before applying BPE or SubwordTextEncoder. They also claim that SubwordTextEncoder is more than 4 BLEU better than the default BPE, but that this gap can be almost closed with simple tricks. See Kudo (2018) for an even better approach to subwords (orthogonal to Macháček et al.), which is unfortunately not integrated into T2T yet.
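For reference, Kudo's unigram-LM approach is available in the standalone SentencePiece library, so it can be tried outside T2T. A minimal sketch follows; the file names and hyperparameters are illustrative, not recommendations from this thread.

```python
# Sketch: Kudo (2018) unigram subword model via SentencePiece (not part of T2T).
import sentencepiece as spm

# Train a unigram subword model on raw (untokenized) training text.
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="unigram_en",
    vocab_size=32000, model_type="unigram")

sp = spm.SentencePieceProcessor(model_file="unigram_en.model")

# Deterministic segmentation (best tokenization under the unigram LM).
print(sp.encode("The quick brown fox.", out_type=str))

# Subword regularization: sample one of several plausible segmentations,
# as proposed in Kudo (2018).
print(sp.encode("The quick brown fox.", out_type=str,
                enable_sampling=True, alpha=0.1, nbest_size=-1))
```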
Thanks for the reply @martinpopel. I also want to know whether data normalization, such as replacing dates, numbers, etc. with a single tag like <date>, was used here. If it was not, would normalizing my data beforehand improve translation performance? I have come across many word-based systems that do such normalization and claim it improves performance, but I am not sure how it would impact results when used with BPE.
In word-based NMT, this kind of normalization was commonly used. I would suggest you try the baseline subword NMT on your data and check whether the dev-set translations contain errors related to numbers, dates, etc. If you spot frequent errors in these phenomena, you can try the normalizations, but note that a proper implementation which handles all the edge cases is not as simple as it may seem: it is difficult to detect which tokens should not be translated (especially proper names) and still get better results than the subword NMT baseline. You will also need to align multiple occurrences of these special tokens so you can substitute them back in post-processing (T2T does not output word alignments out of the box, and the special tokens may be re-ordered within the translation).
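To make the back-substitution problem concrete, here is a minimal, illustrative sketch of placeholder normalization using indexed tags. The regex, tag names, and indexing scheme are assumptions for illustration, not a T2T feature.

```python
# Sketch: indexed placeholder normalization for numbers, with back-substitution.
import re

NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def normalize(sentence):
    """Replace each number with an indexed tag and remember the originals."""
    placeholders = []
    def repl(match):
        placeholders.append(match.group(0))
        return f"<num{len(placeholders) - 1}>"
    return NUM_RE.sub(repl, sentence), placeholders

def denormalize(translation, placeholders):
    """Substitute the originals back; indexed tags survive re-ordering."""
    for i, value in enumerate(placeholders):
        translation = translation.replace(f"<num{i}>", value)
    return translation

src, saved = normalize("The meeting on 12.5.2018 starts at 9.")
# src == "The meeting on <num0> starts at <num1>."
# Translate src, then: denormalize(translated_output, saved)
```

Note that such tags would themselves be segmented by the subword vocabulary unless they are added as reserved tokens, which is part of why getting this right is harder than it looks.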
I have tried English-to-German translation using the Transformer network provided in the walkthrough (translate_ende_wmt32k) and got decent results. But I am unable to understand the preprocessing procedure: is any pre-tokenization done before applying BPE, and if not, would applying it change the performance? Also, if such pre-tokenization is applied, will the same be applied at test time, and how does this impact performance? Kindly provide me some insights so that I can understand how the preprocessing works.
Thank you.