Hi there! I was wondering if I can use SentencePiece to segment text into tokens that are "multi-word" instead of "sub-word". I want to do this to reduce the number of tokens in the segmentation.
I was thinking that if whitespace were treated as a regular symbol, it could appear in the middle of tokens too (unlike now, where whitespace can only be at the beginning or end of a token), and this would allow "multi-word" segmentation. To that end, I thought of trying --control_symbols=" ", but I think it is not a good idea because I would lose all the whitespace information in the encoded output.
I hope I am clear about what I intend to do. Look forward to your suggestions.
Thanks!
Does "multi-word" mean extracting pieces like "Hello_world"?
If so, you might want to try spm_train --split_by_whitespace=false. This flag allows pieces that contain whitespace in the middle to be extracted.
However, in my preliminary experiments, allowing multi-word pieces brought no quality improvement, at least in MT.
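For anyone who prefers the Python bindings over the spm_train command line, the same flag can be passed as a keyword argument. The snippet below is only a rough sketch: the helper names, the corpus path, the model prefix, and the vocabulary size are placeholders I made up for illustration, not anything from this thread, and it assumes the sentencepiece package is installed.

```python
def multiword_trainer_kwargs(corpus_path, model_prefix, vocab_size=8000):
    """Build keyword arguments for SentencePieceTrainer.train (hypothetical helper)."""
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # placeholder value; tune for your data
        # The default is True. False lets a piece span a whitespace boundary.
        "split_by_whitespace": False,
    }


def train_and_encode(corpus_path, model_prefix, text, vocab_size=8000):
    """Train a model that may contain multi-word pieces, then segment `text`."""
    # Imported lazily so the helper above works without the package installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **multiword_trainer_kwargs(corpus_path, model_prefix, vocab_size)
    )
    sp = spm.SentencePieceProcessor(model_file=model_prefix + ".model")
    # out_type=str returns the pieces themselves rather than integer ids.
    return sp.encode(text, out_type=str)
```

Calling train_and_encode("corpus.txt", "multiword", "Hello world") would then return a list of pieces. Whether any piece actually crosses a word boundary depends on the corpus and the vocabulary size, so it is worth comparing token counts against a model trained with the default setting.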
Many thanks! I was looking exactly for this. I want to do this because my target sequences are very long and contain a lot of redundant information. I am hoping this would help. Thanks again!