Is your feature request related to a problem? Please describe.
A problem we have faced when using CSV and Excel-converted files in RAG pipelines is that these tables often contain many empty rows or columns, depending on how the source file was made.
We have found in practice that this can waste an unnecessary number of tokens in the LLM's context window, since a space plus a comma is often encoded as a single token (as in the tokenizer used for GPT-4o). You can verify this in OpenAI's playground with the string `, , , , ,`.
So we find it beneficial to remove empty rows and columns where possible, both to save tokens and to improve performance by removing potentially distracting empty content.
Describe the solution you'd like
It would be great to support cleaning of CSV documents. I think it makes the most sense to create a new component called something like CSVDocumentCleaner rather than expand the DocumentCleaner component, since the style of cleaning will be quite different.
Additionally, if we make a separate component like CSVDocumentCleaner we can easily inform users that this component will only work on Documents whose contents can be loaded using a CSV reader.
Additional context
Given the push to remove dataframes from documents, I think it makes the most sense to create a component that assumes the document's content is formatted as CSV and then removes empty rows and columns.
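To make the idea concrete, here is a minimal sketch of the cleaning step such a component could perform. The function name `clean_csv_content` and the overall shape are assumptions for illustration, not an existing Haystack API; it simply parses the document's content with a CSV reader (via pandas) and drops rows and columns that are entirely empty:

```python
import io

import pandas as pd


def clean_csv_content(csv_text: str) -> str:
    """Drop rows and columns that are entirely empty from CSV text.

    Hypothetical helper illustrating what a CSVDocumentCleaner
    component might do internally.
    """
    # Parse without treating any row as a header, keeping cells as strings
    df = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)
    # Drop rows, then columns, where every cell is missing
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    return df.to_csv(index=False, header=False)


# Example: a table with one fully empty row and two fully empty columns
raw = "a,,b,\n,,,\nc,,d,\n"
print(clean_csv_content(raw))  # → "a,b\nc,d\n"
```

A real component would wrap this logic in a `run` method over a list of Documents and would need to decide how to handle content that a CSV reader cannot parse (e.g., skip the document or raise a warning).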