Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)
The existing default list of separators for the `RecursiveTextSplitter` assumes spaces are word boundaries. Some languages [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) (Chinese, Japanese, Thai, Burmese). This PR extends the documentation to explain how to cater for those languages by adding additional punctuation to the separators and zero-width spaces which are used by some typesetters and will assist the splitter to not split in words. Ideally, **these separators could be a constant in the module** but for now, defining them in the documentation is a start.
- Loading branch information