Multi-word segmentation #220

akshatdewan · 2018-10-23T11:25:12Z

Hi there! I was wondering if I can utilize sentencepiece for segmenting text into tokens where tokens would be "multi-word" instead of "sub-word". I want to do this to reduce the number of tokens in the segmentation.

I was thinking that if whitespace were treated as a regular symbol, it could also appear in the middle of tokens (currently whitespace can only be at the beginning or end of a token), which would allow "multi-word" segmentation. To that end, I thought of trying --control_symbols=" ", but I don't think that is a good idea because I would lose all the whitespace information in the encoded output.

I hope I am clear about what I intend to do. Look forward to your suggestions.

Thanks!

taku910 · 2018-10-24T02:37:37Z

Does 'multi-word' mean extracting pieces like "Hello_world"?
If so, you might want to try spm_train --split_by_whitespace=false. This flag allows the trainer to extract pieces that contain whitespace in the middle.
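
To make the effect of that flag concrete, here is a toy sketch in plain Python. It is not SentencePiece's actual training code: `candidate_pieces` and `max_len` are hypothetical names invented for this illustration, and the only thing taken from the thread is the idea that, when text is pre-split at whitespace, no candidate piece can span a word boundary. The "▁" character is the marker SentencePiece uses to represent whitespace.

```python
from collections import Counter

META = "\u2581"  # "▁", the marker SentencePiece uses in place of a space


def candidate_pieces(text, split_by_whitespace=True, max_len=12):
    """Count substrings that could become pieces (toy model, hypothetical helper).

    With split_by_whitespace=True the text is first cut at every whitespace
    marker, so a candidate can never contain the marker in its middle.
    With split_by_whitespace=False the whole line is one chunk, so
    candidates may span several words.
    """
    normalized = META + text.replace(" ", META)
    if split_by_whitespace:
        # One chunk per word, each keeping its leading whitespace marker.
        chunks = [META + word for word in normalized.split(META) if word]
    else:
        chunks = [normalized]

    counts = Counter()
    for chunk in chunks:
        for start in range(len(chunk)):
            stop_limit = min(len(chunk), start + max_len)
            for stop in range(start + 1, stop_limit + 1):
                counts[chunk[start:stop]] += 1
    return counts


text = "hello world hello world"
with_split = candidate_pieces(text, split_by_whitespace=True)
without_split = candidate_pieces(text, split_by_whitespace=False)

multi_word = META + "hello" + META + "world"
print(multi_word in with_split)      # False: pieces cannot cross whitespace
print(multi_word in without_split)   # True: a multi-word candidate exists
```

In this toy model the repeated phrase only shows up as a single candidate when pre-splitting is turned off, which is the kind of piece that could shorten long, repetitive sequences. Whether the real trainer keeps such a piece depends on its own scoring and vocabulary size.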

However, in my preliminary experiments, allowing multi-word pieces brought no quality improvement, at least in MT.

Thank you.

akshatdewan · 2018-10-24T08:34:08Z

Many thanks! This is exactly what I was looking for. I want to do this because my target sequences are very long and contain a lot of redundant information, and I am hoping this will help. Thanks again!

akshatdewan closed this as completed Oct 24, 2018

This was referenced Oct 24, 2023

Does 'multi-word' mean to extract pieces like "Hello_world" ? #923

Closed

A recent EMNLP work to share about task-adaptive tokenization with variable segmentation #924

Open