docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)

The default list of separators for `RecursiveCharacterTextSplitter` assumes that spaces mark word boundaries. Some languages [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) (for example Chinese, Japanese, Thai, and Burmese). This PR extends the documentation to explain how to cater for those languages by adding extra punctuation to the separators, along with zero-width spaces, which some typesetters insert between words and which help the splitter avoid splitting inside a word. Ideally, **these separators could be a constant in the module**, but for now, defining them in the documentation is a start.
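
As a rough sketch of what such a constant might look like, the snippet below collects the separators in one list and shows why they matter. The name `CJK_THAI_SEPARATORS` and the helpers `first_matching_separator` and `split_once` are hypothetical and are not part of LangChain. The helpers are a simplified stand-in for the idea of trying separators in priority order, not the splitter's actual implementation.

```python
# Hypothetical constant: LangChain does not ship a name like this.
CJK_THAI_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # Zero-width space
    "\uff0c",  # Fullwidth comma
    "\u3001",  # Ideographic comma
    "\uff0e",  # Fullwidth full stop
    "\u3002",  # Ideographic full stop
    "",
]


def first_matching_separator(text, separators=CJK_THAI_SEPARATORS):
    """Return the first separator, in priority order, that occurs in text.

    The empty string always matches and means "split between characters".
    """
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return ""


def split_once(text, separators=CJK_THAI_SEPARATORS):
    """Split text a single time on the highest-priority separator found."""
    sep = first_matching_separator(text, separators)
    if sep == "":
        # No usable separator: fall back to individual characters.
        return list(text)
    # Drop empty pieces left by a trailing separator.
    return [piece for piece in text.split(sep) if piece]
```

With the default list `["\n\n", "\n", " ", ""]`, a Japanese sentence containing no spaces or newlines falls through to the empty separator and is cut between arbitrary characters. With the extended list, the same sentence is cut at the ideographic full stop instead.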
langchain-ai · Mar 26, 2024 · 6c9b0f9
1 parent 441a801
Showing 1 changed file with 47 additions and 0 deletions.
diff --git a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
@@ -111,6 +111,53 @@
    "metadata": {},
    "outputs": [],
    "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b74939c",
+   "metadata": {},
+   "source": [
+    "## Splitting text from languages without word boundaries\n",
+    "\n",
+    "Some writing systems do not have [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries), for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
+    "\n",
+    "* Add ASCII full-stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"`．`\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese)\n",
+    "* Add [Zero-width space](https://en.wikipedia.org/wiki/Zero-width_space) used in Thai, Myanmar, Khmer, and Japanese.\n",
+    "* Add ASCII comma \"`,`\", Unicode fullwidth comma \"`，`\", and Unicode ideographic comma \"`、`\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d48a8ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    separators=[\n",
+    "        \"\\n\\n\",\n",
+    "        \"\\n\",\n",
+    "        \" \",\n",
+    "        \".\",\n",
+    "        \",\",\n",
+    "        \"\\u200B\",  # Zero-width space\n",
+    "        \"\\uff0c\",  # Fullwidth comma\n",
+    "        \"\\u3001\",  # Ideographic comma\n",
+    "        \"\\uff0e\",  # Fullwidth full stop\n",
+    "        \"\\u3002\",  # Ideographic full stop\n",
+    "        \"\",\n",
+    "    ],\n",
+    "    # Existing args\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1177ee4f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {