Is your feature request related to a problem? Please describe.
A problem we have faced when using CSV and Excel-converted files in RAG pipelines is that these tables often contain many empty rows or columns, depending on how the source file was made.
We have found in practice that this can waste an unnecessary number of tokens in the LLM's context window, since a space plus a comma is often encoded as a single token (as in the tokenizer used for GPT-4o). You can verify this in OpenAI's playground with the string `, , , , ,`.
So we find it beneficial to remove empty rows and columns where possible, both to save tokens and to improve performance by removing potentially distracting empty content.
Describe the solution you'd like
It would be great to support cleaning of CSV documents. I think it makes the most sense to create a new component called something like CSVDocumentCleaner rather than expand the DocumentCleaner component, since the style of cleaning will be quite different.
Additionally, if we make a separate component like CSVDocumentCleaner we can easily inform users that this component will only work on Documents whose contents can be loaded using a CSV reader.
Additional context
Given the push to remove dataframes from documents, I think it makes the most sense to create a component that assumes the document's content is formatted as CSV and then removes empty rows and columns.
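To make the idea concrete, here is a minimal sketch of the cleaning step such a component could perform. The function name `clean_csv_content` and the overall shape are assumptions for illustration, not an existing Haystack API; it simply parses the document's content with a CSV reader (via pandas) and drops rows and columns that are entirely empty:

```python
import io

import pandas as pd


def clean_csv_content(csv_text: str) -> str:
    """Drop rows and columns that are entirely empty from CSV text.

    Hypothetical helper illustrating what a CSVDocumentCleaner
    component might do internally.
    """
    # Parse without treating any row as a header, keeping cells as strings
    df = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)
    # Drop rows, then columns, where every cell is missing
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    return df.to_csv(index=False, header=False)


# Example: a table with one fully empty row and two fully empty columns
raw = "a,,b,\n,,,\nc,,d,\n"
print(clean_csv_content(raw))  # → "a,b\nc,d\n"
```

A real component would wrap this logic in a `run` method over a list of Documents and would need to decide how to handle content that a CSV reader cannot parse (e.g., skip the document or raise a warning).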