
Add alternative token-based text splitter #816

Merged
merged 2 commits into from
Feb 3, 2023

Conversation


@kahkeng kahkeng commented Jan 31, 2023

This splitter does not involve a separator; it naively chunks the input text at fixed boundaries in token space.

This is helpful when there are strict token length limits, so every chunk must stay within the specified chunk size, and we can't use aggressive separators like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting them, which could cause overflow errors downstream.

Splitting at arbitrary token boundaries is not ideal, but this is hopefully mitigated by a decent amount of overlap between chunks. It also produces chunks with exactly the desired number of tokens (apart from a possibly shorter final chunk), instead of sometimes overcounting when shorter strings are concatenated.

Potentially also helps with #528.
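
For illustration, the token-space chunking described above can be sketched roughly as follows. This is a minimal sketch, not the code merged in this PR: the function name `split_text_by_tokens`, its parameters, and the byte-level `toy_encode` / `toy_decode` helpers are all hypothetical, and the toy tokenizer merely stands in for whatever real tokenizer is used in practice.

```python
from typing import Callable, List


def split_text_by_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    chunk_size: int,
    chunk_overlap: int,
) -> List[str]:
    """Chunk text at fixed token boundaries, ignoring separators.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    token_ids = encode(text)
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(token_ids):
        # Each window holds at most chunk_size tokens.
        chunks.append(decode(token_ids[start:start + chunk_size]))
        if start + chunk_size >= len(token_ids):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Toy byte-level tokenizer (an assumption, standing in for a real tokenizer).
def toy_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")


# A 25-character string with no spaces, which a separator-based splitter
# would have no place to cut.
long_string = "abcdefghijklmnopqrstuvwxy"
chunks = split_text_by_tokens(
    long_string, toy_encode, toy_decode, chunk_size=10, chunk_overlap=2
)
```

With these toy settings, every chunk except the last holds exactly `chunk_size` tokens, and each chunk begins with the last `chunk_overlap` tokens of the previous one. With a real subword tokenizer, a cut could also fall inside a word or a multi-byte character, which is the trade-off the description mentions.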


@hwchase17 hwchase17 left a comment


awesome!!!

@hwchase17 hwchase17 merged commit 4a8f5cd into langchain-ai:master Feb 3, 2023
@blob42 blob42 mentioned this pull request Feb 21, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this pull request Mar 8, 2023