This repository has been archived by the owner on Jun 24, 2024. It is now read-only.
As pointed out in #291, the quality of embeddings produced by the models at present appears to be suboptimal.
Our current approach uses the embedding of the final token as the representation for the entire input sequence, which can discard semantic information from earlier tokens. The approach employed by SGPT: GPT Sentence Embeddings for Semantic Search offers an alternative: a position-weighted mean pooling over the embeddings of all tokens in the input sequence. On the MTEB benchmark, this method produces noticeably better embeddings.
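For reference, SGPT's position-weighted mean pooling is simple to state: token `i` (1-based, out of `n` tokens) gets weight `i / (1 + 2 + ... + n)`, so later tokens count more. A minimal Rust sketch, assuming we already have one hidden-state vector per token; the function name and input layout here are illustrative, not part of the current API:

```rust
/// Position-weighted mean pooling over per-token hidden states, as in SGPT:
/// later tokens receive proportionally larger weights.
///
/// `token_embeddings` holds one `Vec<f32>` of length `dim` per input token.
fn weighted_mean_pool(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let n = token_embeddings.len();
    assert!(n > 0, "need at least one token embedding");
    let dim = token_embeddings[0].len();

    // Normalizing constant: 1 + 2 + ... + n.
    let total = (n * (n + 1) / 2) as f32;

    let mut pooled = vec![0.0f32; dim];
    for (i, emb) in token_embeddings.iter().enumerate() {
        // 1-based position weight: w_i = (i + 1) / total.
        let w = (i + 1) as f32 / total;
        for (acc, x) in pooled.iter_mut().zip(emb.iter()) {
            *acc += w * x;
        }
    }
    pooled
}

fn main() {
    // Toy example: three token embeddings of dimension 2.
    let toks = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    println!("{:?}", weighted_mean_pool(&toks));
}
```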
This raises the question: should we integrate this method into our implementation, or should we leave it to users to extract the per-token embeddings themselves and do the pooling on their side?
Good catch! I think we should integrate this, but keep it separate from the existing embeddings. I'm also not sure how best to expose it. Any ideas for API changes that are understandable and restricted to the places where pooling actually makes sense? This would only make sense with feed_prompt, right?
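For the sake of discussion, one option would be to let the caller pick a pooling strategy instead of adding a parallel embeddings path. A rough sketch, with all names hypothetical and not existing crate types:

```rust
/// Hypothetical pooling selector; none of these names exist in the crate today.
pub enum EmbeddingPooling {
    /// Current behaviour: hidden state of the final token only.
    LastToken,
    /// SGPT-style position-weighted mean over all token hidden states.
    WeightedMean,
}
```

The output request that feed_prompt fills could then carry an `EmbeddingPooling`, so callers only pay for pooling when they actually request embeddings, and the existing last-token behaviour stays the default.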