docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)
The existing default list of separators for the `RecursiveCharacterTextSplitter`
assumes spaces are word boundaries. Some languages [don't use spaces between
words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries)
(Chinese, Japanese, Thai, Burmese).

This PR extends the documentation to explain how to cater for those
languages by adding additional punctuation to the separators, along with
zero-width spaces, which some typesetters insert between words and which
help the splitter avoid splitting inside words.

Ideally, **these separators could be a constant in the module** but for
now, defining them in the documentation is a start.
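
As a rough illustration of why the extra separators matter, here is a minimal sketch in plain Python. The constant name `CJK_THAI_SEPARATORS` and the helper `split_recursive` are hypothetical and exist only for this example. The helper is a simplified stand-in, not LangChain's implementation: it drops the separators and does not merge small pieces back together.

```python
# Hypothetical constant: the separators from this PR, in the same order.
CJK_THAI_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # Zero-width space
    "\uff0c",  # Fullwidth comma
    "\u3001",  # Ideographic comma
    "\uff0e",  # Fullwidth full stop
    "\u3002",  # Ideographic full stop
    "",
]

# The default list quoted in the documentation change.
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def split_recursive(text, separators, chunk_size):
    """Simplified recursive split (illustration only, not LangChain's code)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep == "":
            break  # Last resort: cut at fixed character offsets.
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                # Oversized pieces are retried with the remaining separators.
                chunks.extend(split_recursive(piece, separators[i + 1:], chunk_size))
            return chunks
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


# Two Japanese sentences, each ending in an ideographic full stop.
text = "今日は晴れです。明日は雨です。"
default_chunks = split_recursive(text, DEFAULT_SEPARATORS, 10)
extended_chunks = split_recursive(text, CJK_THAI_SEPARATORS, 10)
print(default_chunks)
print(extended_chunks)
```

With the default list, none of the separators occur in the text, so the sketch falls back to cutting at character 10, in the middle of the second sentence. With the extended list, it splits at the ideographic full stop and each sentence stays whole.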
tonybaloney authored Mar 26, 2024
1 parent 441a801 commit 6c9b0f9
Showing 1 changed file with 47 additions and 0 deletions.
@@ -111,6 +111,53 @@
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "2b74939c",
"metadata": {},
"source": [
"## Splitting text from languages without word boundaries\n",
"\n",
"Some writing systems do not have [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries), for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
"\n",
"* Add ASCII full stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"`.`\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese)\n",
"* Add [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space) (U+200B), used in Thai, Myanmar, Khmer, and Japanese\n",
"* Add ASCII comma \"`,`\", Unicode fullwidth comma \"`,`\", and Unicode ideographic comma \"`、`\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d48a8ef",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
"    separators=[\n",
"        \"\\n\\n\",\n",
"        \"\\n\",\n",
"        \" \",\n",
"        \".\",\n",
"        \",\",\n",
"        \"\\u200B\",  # Zero-width space\n",
"        \"\\uff0c\",  # Fullwidth comma\n",
"        \"\\u3001\",  # Ideographic comma\n",
"        \"\\uff0e\",  # Fullwidth full stop\n",
"        \"\\u3002\",  # Ideographic full stop\n",
"        \"\",\n",
"    ],\n",
"    # Existing args\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1177ee4f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {