Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: text splitters improvements #4490

Merged
merged 4 commits into from
May 18, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -410,7 +410,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -363,7 +363,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
Expand Down
3 changes: 1 addition & 2 deletions docs/modules/indexes/retrievers/examples/databerry.ipynb
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "9fc6205b",
"metadata": {},
Expand Down Expand Up @@ -87,7 +86,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.1"
"version": "3.10.6"
}
},
"nbformat": 4,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
Expand Down
2 changes: 1 addition & 1 deletion docs/modules/indexes/retrievers/examples/metal.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
Expand Down
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "ab66dd43",
"metadata": {},
Expand Down Expand Up @@ -32,7 +31,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "95d5d7f9",
"metadata": {},
Expand Down Expand Up @@ -109,7 +107,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "dbc025d6",
"metadata": {},
Expand All @@ -131,7 +128,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "96bf8879",
"metadata": {},
Expand All @@ -156,7 +152,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "23601ddb",
"metadata": {},
Expand Down Expand Up @@ -269,7 +264,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
Expand All @@ -283,7 +278,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -202,7 +202,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
Expand Down
43 changes: 39 additions & 4 deletions docs/modules/indexes/text_splitters.rst
Original file line number Diff line number Diff line change
Expand Up @@ -30,12 +30,47 @@ For an introduction to the default text splitter and generic functionality see:
./text_splitters/getting_started.ipynb


We also have documentation for all the types of text splitters that are supported.
Please see below for that list.
Usage examples for the text splitters:

- `Character <./text_splitters/examples/character_text_splitter.html>`_
- `LaTeX <./text_splitters/examples/latex.html>`_
- `Markdown <./text_splitters/examples/markdown.html>`_
- `NLTK <./text_splitters/examples/nltk.html>`_
- `Python code <./text_splitters/examples/python.html>`_
- `Recursive Character <./text_splitters/examples/recursive_text_splitter.html>`_
- `spaCy <./text_splitters/examples/spacy.html>`_
- `tiktoken (OpenAI) <./text_splitters/examples/tiktoken_splitter.html>`_


.. toctree::
:maxdepth: 1
:glob:
:caption: Text Splitters
:name: text_splitters
:hidden:

./text_splitters/examples/character_text_splitter.ipynb
./text_splitters/examples/latex.ipynb
./text_splitters/examples/markdown.ipynb
./text_splitters/examples/nltk.ipynb
./text_splitters/examples/python.ipynb
./text_splitters/examples/recursive_text_splitter.ipynb
./text_splitters/examples/spacy.ipynb
./text_splitters/examples/tiktoken_splitter.ipynb


Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters.
In order to get a more accurate estimate, we can use tokenizers to count the number of tokens in the text.
We use this number inside the `..TextSplitter` classes.
This implemented as the `from_<tokenizer>` methods of the `..TextSplitter` classes:

- `Hugging Face tokenizer <./text_splitters/examples/huggingface_length_function.html>`_
- `tiktoken (OpenAI) tokenizer <./text_splitters/examples/tiktoken.html>`_

.. toctree::
:maxdepth: 1
:caption: Text Splitters with Tokens
:name: text_splitter_with_tokens
:hidden:

./text_splitters/examples/*
./text_splitters/examples/huggingface_length_function.ipynb
./text_splitters/examples/tiktoken.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -5,19 +5,21 @@
"id": "5c461b26",
"metadata": {},
"source": [
"# Character Text Splitter\n",
"# Character\n",
"\n",
"This is a more simple method. This splits based on characters (by default \"\\n\\n\") and measure chunk length by number of characters.\n",
"This is the simplest method. This splits based on characters (by default \"\\n\\n\") and measure chunk length by number of characters.\n",
"\n",
"1. How the text is split: by single character\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
"2. How the chunk size is measured: by number of characters"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9c21e679",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
Expand All @@ -29,7 +31,9 @@
"cell_type": "code",
"execution_count": 2,
"id": "79ff6737",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
Expand Down Expand Up @@ -87,6 +91,37 @@
"documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)\n",
"print(documents[0])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "90ac0381-855a-469a-b8bf-e33230132bbe",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_splitter.split_text(state_of_the_union)[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "875c20be-9f63-4aee-b05a-34e9c04c1091",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand All @@ -105,7 +140,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,14 @@
"id": "13dc0983",
"metadata": {},
"source": [
"# Hugging Face Length Function\n",
"Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use Hugging Face tokenizers to count the text length.\n",
"# Hugging Face tokenizer\n",
"\n",
">[Hugging Face](https://huggingface.co/docs/tokenizers/index) has many tokenizers.\n",
"\n",
"We use Hugging Face tokenizer, the [GPT2TokenizerFast](https://huggingface.co/Ransaka/gpt2-tokenizer-fast) to count the text length in tokens.\n",
"\n",
"1. How the text is split: by character passed in\n",
"2. How the chunk size is measured: by Hugging Face tokenizer"
"2. How the chunk size is measured: by number of tokens calculated by the `Hugging Face` tokenizer\n"
]
},
{
Expand Down Expand Up @@ -89,7 +92,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
Expand Down
58 changes: 49 additions & 9 deletions docs/modules/indexes/text_splitters/examples/latex.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -5,29 +5,35 @@
"id": "3a2f572e",
"metadata": {},
"source": [
"# Latex Text Splitter\n",
"# LaTeX\n",
"\n",
"LatexTextSplitter splits text along Latex headings, headlines, enumerations and more. It's implemented as a simple subclass of RecursiveCharacterSplitter with Latex-specific separators. See the source code to see the Latex syntax expected by default.\n",
">[LaTeX](https://en.wikipedia.org/wiki/LaTeX) is widely used in academia for the communication and publication of scientific documents in many fields, including mathematics, computer science, engineering, physics, chemistry, economics, linguistics, quantitative psychology, philosophy, and political science.\n",
"\n",
"1. How the text is split: by list of latex specific tags\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
"`LatexTextSplitter` splits text along `LaTeX` headings, headlines, enumerations and more. It's implemented as a subclass of `RecursiveCharacterSplitter` with LaTeX-specific separators. See the source code for more details.\n",
"\n",
"1. How the text is split: by list of `LaTeX` specific tags\n",
"2. How the chunk size is measured: by number of characters"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"id": "c2503917",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.text_splitter import LatexTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 3,
"id": "e46b753b",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"latex_text = \"\"\"\n",
Expand Down Expand Up @@ -84,6 +90,40 @@
"source": [
"docs"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "40e62829-9485-414e-9ea1-e1a8fc7c88cb",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"['\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle',\n",
" 'Introduction}\\nLarge language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.',\n",
" 'History of LLMs}\\nThe earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.',\n",
" 'Applications of LLMs}\\nLLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\\n\\n\\\\end{document}']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"latex_splitter.split_text(latex_text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7deb8f25-a062-4956-9f90-513802069667",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand All @@ -102,7 +142,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
Expand Down
Loading