This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Better embedding extraction #295

Open
LLukas22 opened this issue Jun 3, 2023 · 1 comment
Labels
issue:enhancement New feature or request

Comments

@LLukas22
Contributor

LLukas22 commented Jun 3, 2023

As pointed out in #291, the quality of embeddings produced by the models at present appears to be suboptimal.

Our current approach uses the embedding of the final token as the representation for the entire input sequence, which may discard semantic information from earlier tokens. The approach employed by SGPT: GPT Sentence Embeddings for Semantic Search offers an alternative: it uses a position-weighted mean to combine the embeddings of all tokens in the input sequence. According to the MTEB benchmark, this method produces superior embeddings.
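The position-weighted mean from SGPT can be sketched as below. This is a minimal illustration, not code from this repository: the function name `weighted_mean_pooling` is hypothetical, and the weighting scheme (token i gets weight proportional to i, so later tokens, which have attended to more context in a causal model, count more) follows the SGPT paper.

```python
def weighted_mean_pooling(token_embeddings):
    """Combine per-token embeddings into one sequence embedding.

    token_embeddings: list of equal-length float vectors, one per
    token, in sequence order. Token i (1-indexed) receives weight
    i / (1 + 2 + ... + n), so the weights sum to 1 and later tokens
    dominate.
    """
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    total = n * (n + 1) / 2  # 1 + 2 + ... + n
    weights = [(i + 1) / total for i in range(n)]
    pooled = [0.0] * dim
    for w, emb in zip(weights, token_embeddings):
        for d in range(dim):
            pooled[d] += w * emb[d]
    return pooled
```

For example, with two tokens the second token's embedding receives twice the weight of the first (2/3 vs. 1/3).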

So, this poses the question: should we integrate this method into our implementation? Alternatively, should we leave it to users to manually extract the embeddings for each token and carry out the calculations themselves?

LLukas22 added the issue:enhancement label on Jun 3, 2023
@philpax
Collaborator

philpax commented Jun 3, 2023

Good catch! I think we should integrate this, but keep it separate from the existing embeddings. I'm also not sure how best to expose it. Any ideas for API changes that are understandable and restricted to only where they make sense? This would only make sense with feed_prompt, right?
