docs: Add guidance for splitting Chinese, Japanese, and Thai #19295

tonybaloney · 2024-03-19T23:33:44Z

The default separator list for the `RecursiveCharacterTextSplitter` assumes that spaces mark word boundaries. Some languages [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) (Chinese, Japanese, Thai, Burmese).

This PR extends the documentation to explain how to handle those languages: add extra punctuation marks to the separators, along with the zero-width space, which some typesetters insert between words and which helps the splitter avoid breaking in the middle of a word.

Ideally, these separators could be a constant in the module but for now, defining them in the documentation is a start.
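
To illustrate the effect, here is a minimal standard-library sketch, not LangChain's implementation. The names `CJK_FRIENDLY_SEPARATORS` and `split_recursively` are hypothetical, the specific code points are an assumption about which punctuation marks are meant, and the function is a simplified stand-in that neither merges small pieces nor keeps separators. With LangChain itself, a list like this would be passed as the `separators` argument of `RecursiveCharacterTextSplitter`.

```python
# Hypothetical separator list: whitespace first, then ASCII and
# CJK punctuation, then the empty string as a last resort.
CJK_FRIENDLY_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # zero-width space
    "\uff0c",  # fullwidth comma
    "\u3001",  # ideographic comma
    "\uff0e",  # fullwidth full stop
    "\u3002",  # ideographic full stop
    "",
]


def split_recursively(text, chunk_size, separators):
    """Simplified recursive split: try each separator in order and
    recurse with the remaining ones on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: cut at fixed offsets, possibly inside a word.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(split_recursively(piece, chunk_size, separators[i + 1:]))
            return chunks
    # No separator matched and no "" fallback was supplied.
    return [text]


text = "これは文です。これも文です。"

# Whitespace-only separators fall through to a fixed-offset cut.
print(split_recursively(text, 8, ["\n\n", "\n", " ", ""]))
# ['これは文です。こ', 'れも文です。']

# With the ideographic full stop available, the cut lands on sentence ends.
print(split_recursively(text, 8, CJK_FRIENDLY_SEPARATORS))
# ['これは文です', 'これも文です']
```

In the first call the second sentence loses its first character to the previous chunk; in the second call each chunk is a whole sentence, which is the behaviour the extra separators are meant to encourage.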

Explain how to use the recursive splitter with Japanese, Chinese, Thai and other writing systems

498394f

dosubot bot added the size:M, Ɑ: text splitters, and 🤖:docs labels Mar 19, 2024

vercel bot had a problem deploying to Preview March 19, 2024 23:38 Failure

ccurme self-assigned this Mar 20, 2024

Merge branch 'master' into extend_docs_non_word_boundaries

633f65c

baskaryan enabled auto-merge (squash) March 26, 2024 00:27

vercel bot deployed to Preview March 26, 2024 00:33 View deployment

baskaryan merged commit 6c9b0f9 into langchain-ai:master Mar 26, 2024
11 checks passed
