Improve the RAG Pipeline #22

Open
3 of 11 tasks
sestinj opened this issue Sep 21, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@sestinj
Contributor

sestinj commented Sep 21, 2023

The "@codebase" context provider allows you to ask questions without explicitly specifying which files should be included as context. Instead, Continue will use embeddings to pull out the most important files to answer your question.

The current implementation uses a fairly simple setup with LanceDB. There is plenty of room to improve the indexing and retrieval steps. Most of the code can be found in core/indexing.
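
For readers new to the approach, the retrieval step described above can be sketched roughly as follows. This is not Continue's implementation and it does not use LanceDB: `embed` is a hypothetical toy bag-of-words stand-in for a real embedding model, and the file paths and contents are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, files: dict[str, str], k: int = 2) -> list[str]:
    """Return the k file paths whose contents are most similar to the query."""
    q = embed(query)
    ranked = sorted(files, key=lambda path: cosine(q, embed(files[path])), reverse=True)
    return ranked[:k]


# Hypothetical repository contents, for illustration only.
files = {
    "auth/login.py": "def login(user, password): check password hash and create session",
    "db/migrate.py": "def migrate(schema): apply schema migrations to the database",
    "ui/button.py": "class Button: render a clickable button widget",
}
result = top_k("how is the password checked at login", files, k=1)
```

A real pipeline would swap `embed` for a learned model and keep the vectors in a vector store instead of recomputing them on every query; the ranking logic stays the same in spirit.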

Here are some of the ideas for how the pipeline can be improved (and you can also contribute by adding your own ideas here!):

  • Chunking
  • Code-aware chunking (for example chunking by function or class) (consider using tree-sitter)
  • Separating the text used for similarity search and the text actually returned (for example, you might write a short preamble summary in the text used for similarity search, or use the reverse of the technique of converting the question to a potential answer before doing search)
  • Convert the input to some text that is more appropriate for search (e.g. to a possible answer to the question, and then similarity search on that)
  • Custom embeddings model (currently using ada, or sentence transformers when running locally)
  • Re-ranking: retrieve many options and then prune afterward
  • Improve the re-ranking prompts (currently there is a "remove" prompt that chooses which files are irrelevant, and an "include" prompt that says which files are important)
  • Weight chunks by information like commit frequency/recency, file length, etc.
  • Use other retrieval methods like fuzzy search, ripgrep, etc. to expand the initial pool
  • Take into account metadata like filename or path
  • Use code graph to include files that are adjacent to multiple other selected files, or for other reasons
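
To make the code-aware chunking idea from the list above more concrete, here is a minimal sketch that emits one chunk per top-level function or class. It is only an illustration under stated assumptions: it uses Python's built-in `ast` module as a stand-in for tree-sitter, so it handles Python source only, and the chunk fields (`name`, `start_line`, `end_line`, `text`) are hypothetical names, not Continue's schema.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available on Python 3.8+
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


# Small made-up module used only to exercise the function.
example = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, chunk):
        self.chunks.append(chunk)
'''
chunks = chunk_by_definition(example)
```

Note that this sketch drops anything outside a definition (such as the import line), and a large class still becomes a single chunk. A tree-sitter based version could apply the same idea across languages and split oversized nodes further.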
@sestinj sestinj converted this from a draft issue Sep 21, 2023
@sestinj sestinj added the enhancement New feature or request label Sep 21, 2023
@mamolli

mamolli commented Sep 21, 2023

Possibility to insert documents from outside the repo. It would be good to have a separate feed channel into the DB for things like tickets, convos, and other docs.

@SensorLock

Two suggestions:

  • Check out RepoCoder, which generates code, finds code similar to what it generated, then re-generates using those snippets as examples
  • Add an option to insert the origin of RAG context items into a comment before the generated code
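
The second suggestion could look something like the sketch below, which prepends the origin of each context item as a comment above the generated code. Everything here is an assumption for illustration: `ContextItem`, `with_origin_comment`, the header wording, and the example path are hypothetical and not part of Continue.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    """Hypothetical record describing where one retrieved chunk came from."""
    path: str
    start_line: int
    end_line: int


def with_origin_comment(generated_code: str, items: list[ContextItem], prefix: str = "#") -> str:
    """Prepend one comment line per context item to the generated code."""
    if not items:
        return generated_code
    header = [f"{prefix} Context used for this generation:"]
    header += [f"{prefix}   {it.path}:{it.start_line}-{it.end_line}" for it in items]
    return "\n".join(header) + "\n" + generated_code


# Made-up context item and generated snippet, for illustration only.
items = [ContextItem("src/example.ts", 10, 42)]
output = with_origin_comment("const x = 1;", items, prefix="//")
```

The `prefix` parameter is there because the comment syntax depends on the target language; making the whole feature opt-in, as suggested, would keep the generated code clean by default.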

@malaki12003

Hey @sestinj, I noticed this ticket is still open and doesn’t seem to be assigned to anyone yet. Would it be alright if I pick it up?
