Create a CSV Document cleaner component #8783

Open
sjrl opened this issue Jan 29, 2025 · 0 comments
Labels
type:feature New feature or request

sjrl commented Jan 29, 2025

Is your feature request related to a problem? Please describe.
A problem we have faced when using files converted from CSV and Excel in RAG pipelines is that these tables often contain many empty rows or columns, depending on how the source file was made.

We have found in practice that this wastes tokens in the LLM's context window, since " ," (a space followed by a comma) is often a single token, as in the tokenizer used for GPT-4o. You can see this using OpenAI's playground with the string ", , , , ,".

So we find in practice that it's beneficial to remove empty rows or columns where possible, both to save tokens and to improve the LLM's performance by removing potentially distracting empty rows and columns.

Describe the solution you'd like
It would be great to support cleaning of CSV documents. I think it makes the most sense to create a new component called something like CSVDocumentCleaner rather than expanding the DocumentCleaner component, since the style of cleaning will be quite different.

Additionally, if we make a separate component like CSVDocumentCleaner, we can easily inform users that this component will only work on Documents whose contents can be loaded using a CSV reader.

Additional context
Given the push to remove dataframes from documents, I think it makes the most sense to create a component that assumes the document content is formatted as CSV and then removes empty rows and columns.
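
To make the idea concrete, here is a rough sketch of the cleaning step using only Python's standard csv module. The function name remove_empty_rows_and_columns is hypothetical and not an existing API, and treating whitespace-only cells as empty is an assumption on my part. A real component would wrap something like this and apply it to each Document's content.

```python
import csv
import io


def remove_empty_rows_and_columns(csv_text: str) -> str:
    """Hypothetical helper: drop rows and columns that contain only blank cells.

    Assumption: a cell counts as empty if it is "" or whitespace-only.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""

    # Pad ragged rows so every row has the same number of columns.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    # Keep only rows with at least one non-blank cell.
    rows = [row for row in rows if any(cell.strip() for cell in row)]

    # Keep only column indices with at least one non-blank cell.
    keep = [i for i in range(width) if any(row[i].strip() for row in rows)]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()


if __name__ == "__main__":
    raw = "name,,age\n,,\nAda,,36\n , ,\n"
    print(remove_empty_rows_and_columns(raw))
```

One open design question this sketch glosses over is whether leading header rows or index columns should be exempt from removal, which could be exposed as parameters on the component.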

sjrl added the type:feature label Jan 29, 2025