How to retrieve the whole document for a chunk? #69

Open
istvan-deak opened this issue Aug 30, 2024 · 2 comments
Labels
question Further information is requested

Comments

@istvan-deak

What is your question or problem? Please describe.

I would like to use the long context window of the LLM of my choice and pass whole files to the prompt.

Describe what you would like to happen

During retrieval, I'd like the system to:

  1. First fetch the small chunks as it currently does
  2. Then look up the parent IDs for those chunks
  3. Return the larger documents or even the whole file associated with those parent IDs

This approach would allow for more context to be provided to the LLM, potentially improving its performance on tasks that require broader context.

@istvan-deak istvan-deak added the question Further information is requested label Aug 30, 2024
@szymondudycz
Contributor

If you want to use whole files in indexing, just don't use a splitter, and make sure the parser doesn't split documents (e.g. use mode="single" in ParseUnstructured).

Doing exactly what you want, that is, indexing over small chunks but retrieving whole documents, is not easily supported. What you can do is write your own splitter that inserts the full document text into the metadata of each chunk. Then, after the chunks are retrieved, use the full document text from the metadata rather than the returned chunk text.
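
The workaround above could look roughly like the following. This is a library-agnostic sketch in plain Python, not Pathway's splitter API: the function names, the `full_text` metadata key, and the fixed-size chunking are all assumptions made for illustration, and a real custom splitter would need to be adapted to whatever interface the indexing pipeline expects.

```python
from typing import Any

Chunk = tuple[str, dict[str, Any]]  # (chunk text, chunk metadata)


def split_with_full_text(
    text: str, metadata: dict[str, Any], chunk_size: int = 200
) -> list[Chunk]:
    """Cut text into fixed-size chunks and copy the whole document into each
    chunk's metadata under the hypothetical key "full_text"."""
    chunks: list[Chunk] = []
    for start in range(0, len(text), chunk_size):
        chunk_meta = {**metadata, "full_text": text}
        chunks.append((text[start:start + chunk_size], chunk_meta))
    return chunks


def full_documents(retrieved: list[Chunk]) -> list[str]:
    """Replace retrieved chunks with their parent documents, dropping
    duplicates while preserving the retrieval order."""
    seen: set[str] = set()
    documents: list[str] = []
    for _chunk_text, meta in retrieved:
        full_text = meta["full_text"]
        if full_text not in seen:
            seen.add(full_text)
            documents.append(full_text)
    return documents
```

Note that each chunk stores its own copy of the whole document, so the index grows with the number of chunks per document.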

@dxtrous
Member

dxtrous commented Sep 5, 2024

@szymondudycz I believe this question has come up a number of times already. Perhaps we should turn it into a feature request? The resolution could be, for example, a code template that shows how to keep a table of full_document_metadata and a table of chunks with document_id in their metadata, how to retrieve full_document_metadata for a given chunk, and maybe also how to load or reread the document on demand (with a UDF).
@istvan-deak if you have any thoughts here, please don't hesitate to share.
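
To make the proposed template more concrete, here is a minimal plain-Python sketch of the lookup it describes. It does not use any Pathway API: the `Chunk` class, the `path` metadata key, and the `read_document` and `parent_documents` helpers are hypothetical names, and a dictionary stands in for the full_document_metadata table.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Chunk:
    text: str
    document_id: str  # key into the full_document_metadata mapping


def read_document(meta: dict) -> str:
    """Reread a document on demand from the path stored in its metadata
    (the "path" key is an assumption of this sketch)."""
    return Path(meta["path"]).read_text(encoding="utf-8")


def parent_documents(
    retrieved: list[Chunk],
    full_document_metadata: dict[str, dict],
    load: Callable[[dict], str] = read_document,
) -> list[str]:
    """Map retrieved chunks to their parent documents, loading each parent
    at most once and keeping the order of first appearance."""
    seen: set[str] = set()
    documents: list[str] = []
    for chunk in retrieved:
        if chunk.document_id in seen:
            continue
        seen.add(chunk.document_id)
        documents.append(load(full_document_metadata[chunk.document_id]))
    return documents
```

Here the chunks carry only a document_id, so the full text is stored once (or not at all, if it is reread from disk), at the cost of an extra lookup after retrieval.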
