
Need help with understanding tokenization and pre processing in case of translation problem. #906

Closed
sugeeth14 opened this issue Jul 2, 2018 · 3 comments

Comments

@sugeeth14

I have tried English-to-German translation using the Transformer network provided in the walkthrough file (translate_ende_wmt32k) and got decent enough results. But I am unable to understand the pre-processing procedure, i.e., is there any pre-tokenization done before applying BPE, and if not, would applying it change the performance? Also, if any such pre-tokenization is applied, will the same be applied at test time, and how does this impact performance? Kindly provide me some insights so that I can understand how the pre-processing works.
Thank you.

@martinpopel
Contributor

In translate_ende_wmt32k a SubwordTextEncoder is used. It works similarly to BPE (see below), but it does the tokenization (on spaces and punctuation) jointly with the splitting into subwords. Unlike BPE, it also encodes the spaces (or the absence of spaces), so the raw sequence is fully reproducible.
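To make the reversibility concrete, here is a minimal sketch (not from the original thread) of building and using a SubwordTextEncoder; it assumes a tensor2tensor installation that provides SubwordTextEncoder.build_from_generator, and the corpus and vocabulary size are toy values:

```python
# Minimal sketch, assuming tensor2tensor provides SubwordTextEncoder.build_from_generator;
# the corpus and target_size are toy values for illustration only.
from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder

corpus = ["Resumption of the session", "Wiederaufnahme der Sitzungsperiode"]

# Build a subword vocabulary from an iterable of text lines.
# The real translate_ende_wmt32k problem targets roughly 2**15 subwords.
encoder = SubwordTextEncoder.build_from_generator(corpus, target_size=100)

ids = encoder.encode("Wiederaufnahme der Sitzungsperiode")
print(ids)                  # a list of subtoken ids
print(encoder.decode(ids))  # the exact original string, spaces included
```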

Kindly provide me some insights so that I can understand how the pre-processing works.

The best (but hard) way to gain insight is to study the source code of SubwordTextEncoder.
See also Section 3 of Schuster and Nakajima (2012) for a description of the WordPiece model (a predecessor of SubwordTextEncoder).

Also, if any such pre-tokenization is applied, will the same be applied at test time, and how does this impact performance?

See Macháček et al. (2018), who explore various morphologically motivated pre-tokenizations applied before BPE or SubwordTextEncoder. They also claim that SubwordTextEncoder is more than 4 BLEU better than the default BPE, but that this gap can be almost closed with simple tricks.

See Kudo (2018) for an even better approach to subwords (orthogonal to Macháček), which is unfortunately not integrated into T2T yet.

@sugeeth14
Author

sugeeth14 commented Jul 2, 2018

Thanks for the reply @martinpopel. I also want to know whether data normalisation, like replacing dates, numbers etc. with a single tag such as <date>, was used here, and if not, whether preparing my data by normalising it first would improve translation performance. I have come across many word-based systems that do such data normalisation and claim it improves performance, but I am not sure how it would behave when used with BPE.

@martinpopel
Contributor

In word-based NMT, this kind of normalization (<date>, <number>, <UNK>, ...) is used mostly to limit the vocabulary to a reasonable size (which is necessary so that the model fits into memory and training and inference are not too slow).
In subword-based NMT, it is not necessary, because any input text can be encoded with a fixed vocabulary size and without <UNK> tokens (SubwordTextEncoder can even encode any Unicode character): numbers and dates are split into subwords (e.g. n-grams of digits), and NMT learns that numbers and proper names should not be translated in most cases (but it can also learn how to change the date format or how to transliterate proper names when needed).
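As a hedged illustration (not from the original thread), you can inspect how a date gets segmented; note that the vocabulary filename below is an assumption (the exact file produced by t2t-datagen depends on the T2T version and your data_dir), and the resulting segmentation depends on the learned vocabulary:

```python
# Hedged illustration: a date/number is segmented into subword pieces instead of <UNK>.
# The vocab path is an assumption; substitute the vocab file t2t-datagen created for you.
from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder

encoder = SubwordTextEncoder("data_dir/vocab.translate_ende_wmt32k.32768.subwords")
pieces = encoder.decode_list(encoder.encode("Delivered on 02/07/2018."))
print(pieces)  # the date comes out as short digit/punctuation pieces, never <UNK>
```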

I would suggest you try the baseline subword NMT on your data and check whether there are any errors related to numbers, dates, etc. in the dev-set translations. If you spot frequent errors of this kind, you can try the normalizations, but note that a proper implementation which handles all the edge cases is not as simple as it may seem: it is difficult to detect which tokens should not be translated (especially proper names) and still get better results than the subword NMT baseline. You will also need to align multiple occurrences of these special tokens so that you can substitute them back in post-processing (T2T does not output word alignment out of the box, and the special tokens may be re-ordered within the translation).
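If you do decide to experiment with it, the following purely hypothetical sketch (the regex, the <number_i> placeholder scheme and both helper functions are illustrative assumptions, not part of T2T) shows the kind of substitute-and-restore plumbing you would need, and why re-ordered or dropped placeholders make it fragile:

```python
import re

# Purely hypothetical pre-/post-processing sketch of the tag-based normalization
# discussed above; the regex and the <number_i> placeholders are illustrative only.
NUM_RE = re.compile(r"\d[\d.,/]*\d|\d")

def normalize(sentence):
    """Replace each number/date with an indexed placeholder and remember the originals."""
    originals = []
    def repl(match):
        originals.append(match.group(0))
        return "<number_%d>" % len(originals)
    return NUM_RE.sub(repl, sentence), originals

def denormalize(translation, originals):
    """Substitute the originals back; NMT may re-order or drop the placeholders."""
    for i, value in enumerate(originals, 1):
        translation = translation.replace("<number_%d>" % i, value, 1)
    return translation

src, originals = normalize("The meeting on 02/07/2018 lasted 45 minutes.")
# src == "The meeting on <number_1> lasted <number_2> minutes."
# ... translate `src`, then call denormalize(translation, originals)
```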
