Add alternative token-based text splitter #816

kahkeng · 2023-01-31T06:48:46Z

This does not involve a separator, and will naively chunk input text at the appropriate boundaries in token space.

This is helpful when a strict token limit means chunks must never exceed the specified chunk size, and we can't rely on aggressive separators like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting them, which could cause overflow errors downstream.
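To make the failure mode concrete, here is a minimal sketch of separator-based splitting. It is a simplified illustration, not CharacterTextSplitter's actual implementation: the function name `split_on_separator` is hypothetical, and length is measured in characters rather than tokens to keep the example self-contained.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Greedily merge separator-delimited pieces (simplified, hypothetical)."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # an oversized piece is kept whole, never cut
    if current:
        chunks.append(current)
    return chunks


# A 30-character run with no spaces cannot be broken by a " " separator.
chunks = split_on_separator("short words then " + "x" * 30, separator=" ", chunk_size=10)
oversized = [c for c in chunks if len(c) > 10]
```

In this sketch `oversized` ends up holding the unbroken 30-character run, which is the kind of chunk that could overflow a downstream limit.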

Splitting at arbitrary token boundaries is not ideal, but it is hopefully mitigated by a decent amount of overlap. This also results in chunks that have exactly the desired number of tokens, instead of sometimes overcounting when we concatenate shorter strings.
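The idea can be sketched roughly as a sliding window over token ids. This is an illustration under assumptions, not the code in this PR: `split_by_tokens` and the toy one-token-per-character tokenizer are hypothetical stand-ins, and a real tokenizer's encode/decode functions would be passed in instead.

```python
from typing import Callable, List


def split_by_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    chunk_size: int,
    chunk_overlap: int,
) -> List[str]:
    """Slide a fixed-size window over the token ids (hypothetical helper)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    token_ids = encode(text)
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(token_ids), step):
        chunks.append(decode(token_ids[start:start + chunk_size]))
        if start + chunk_size >= len(token_ids):
            break  # the window already reached the end of the input
    return chunks


# Toy tokenizer for illustration only: one token per character.
def toy_encode(s: str) -> List[int]:
    return [ord(c) for c in s]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


chunks = split_by_tokens("abcdefghij", toy_encode, toy_decode, chunk_size=4, chunk_overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

Every chunk except possibly the last holds exactly `chunk_size` tokens, and consecutive chunks share `chunk_overlap` tokens, which is what softens the arbitrary cut points.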

Potentially also helps with #528.

hwchase17

awesome!!!

kahkeng added 2 commits February 2, 2023 00:25

Move to integration test

d1a7319

hwchase17 approved these changes Feb 3, 2023

hwchase17 merged commit 4a8f5cd into langchain-ai:master Feb 3, 2023

blob42 mentioned this pull request Feb 21, 2023

fix searx blob42/langchain#1

Closed
