Smart Collections FR: Pinecone Adapter #4

arminta7 · 2022-12-27T20:36:23Z

Would it be possible to have the option to store the embeddings in Pinecone?

brianpetro · 2022-12-27T23:33:40Z

It's possible.

Integrating Pinecone would require:

- reformatting the embeddings to the "Pinecone vectors array format"
  - storing vector.metadata (file.path, file.mtime, etc.)
- garbage collection process to keep the embeddings stored in Pinecone up-to-date
- replace the current cosine similarity calculation with a query to the Pinecone API
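
As a rough illustration of the first two requirements, here is a minimal sketch. It assumes a hypothetical local record layout (`vec`, `path`, `mtime`); the plugin's real embeddings.json structure may differ. The output follows the `id` / `values` / `metadata` record shape that Pinecone upserts use, and no Pinecone client is called. `find_stale_ids` is an illustrative helper, not part of any library.

```python
from typing import Dict, List, TypedDict


class LocalEmbedding(TypedDict):
    # Hypothetical local record; the real embeddings.json layout may differ.
    vec: List[float]
    path: str
    mtime: float


def to_pinecone_vectors(local: Dict[str, LocalEmbedding]) -> List[dict]:
    """Convert {id: record} into a list of {id, values, metadata} dicts."""
    return [
        {
            "id": key,
            "values": rec["vec"],
            "metadata": {"path": rec["path"], "mtime": rec["mtime"]},
        }
        for key, rec in local.items()
    ]


def find_stale_ids(
    remote_meta: Dict[str, dict], current_mtimes: Dict[str, float]
) -> List[str]:
    """Return ids whose note was deleted or modified after it was embedded.

    remote_meta maps vector id -> stored metadata; current_mtimes maps
    file path -> modification time on disk right now.
    """
    stale = []
    for vec_id, meta in remote_meta.items():
        on_disk = current_mtimes.get(meta["path"])
        if on_disk is None or on_disk > meta["mtime"]:
            stale.append(vec_id)
    return stale
```

The ids returned by `find_stale_ids` would be the ones to delete or re-embed during garbage collection.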

I would consider doing this mainly because of performance, but calculating cosine similarity on my vault containing ~1,500 notes runs pretty smoothly at the moment.

Is the performance why you are asking about this? Or is there another reason?

Thanks!

arminta7 · 2022-12-28T01:04:08Z

My vault is about 20k notes. Part of it is performance. The other is being able to reuse the embeddings for other things rather than paying for the process multiple times.

brianpetro · 2022-12-28T16:36:34Z

20K is significantly more notes than I have tested with myself. Your embeddings.json file must be almost 2GB! And that's all pulled into memory, which would likely cause performance degradation on an average computer.
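
For a sense of where a figure like that could come from, here is a back-of-the-envelope sketch. Both defaults are assumptions rather than measurements of the plugin: 1,536 dimensions per vector and roughly 20 bytes per float once serialized as JSON text.

```python
def estimate_json_bytes(
    num_vectors: int, dims: int = 1536, bytes_per_float: int = 20
) -> int:
    """Approximate size of num_vectors embeddings stored as JSON text."""
    return num_vectors * dims * bytes_per_float


def vectors_for_size(
    total_bytes: int, dims: int = 1536, bytes_per_float: int = 20
) -> int:
    """How many vectors fit in a file of total_bytes under the same assumptions."""
    return total_bytes // (dims * bytes_per_float)


print(estimate_json_bytes(1))           # bytes for a single vector
print(vectors_for_size(2_000_000_000))  # vectors in a 2GB file
```

Under these assumptions each vector costs about 30KB, so a 2GB file would hold on the order of 65,000 vectors, or roughly three per note in a 20K-note vault.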

Regarding reusing the embeddings, the main issue with that is synchronization—the metadata limit for Pinecone is 10kb. So smaller notes will fit completely into the metadata at this limit, but not the larger notes (>~10,000 characters).

1. One way to get around this is to store references to the notes in the metadata (i.e. file.path), but that requires that the "secondary" applications have access to your notes file system.
2. Another possibility is to limit the text behind each embedding to <10kb, which would decrease the maximum "chunk" sizes to about a third of what they are now (~8,000 tokens ~= 30,000 characters ~= 30kb). This way, you could avoid accessing your notes file system directly from "secondary" applications.

I feel option 2 goes against the Obsidian.md ethos of "owning your data" since all your notes would be hosted in the cloud.

Option 1 has its drawbacks, too. "Secondary" applications outside of Obsidian would be more difficult to develop. However, other Obsidian plugins (i.e., Smart Completions) will have no problem reusing the embeddings stored within Obsidian. So it depends on your use case.
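
The two options can be sketched side by side. This is only an illustration: the 10,000-byte limit is taken from the 10kb figure quoted above, metadata size is approximated as the UTF-8 length of its JSON encoding, and the field names (`path`, `mtime`, `chunk`, `text`) are hypothetical.

```python
import json
from typing import List

METADATA_LIMIT_BYTES = 10_000  # assumed from the 10kb figure quoted in the thread


def metadata_size(metadata: dict) -> int:
    """Approximate size as the UTF-8 length of the JSON encoding."""
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))


def reference_metadata(path: str, mtime: float) -> dict:
    """Option 1: store only a pointer back to the note."""
    return {"path": path, "mtime": mtime}


def split_to_fit(text: str, budget: int) -> List[str]:
    """Greedily split text so each piece costs at most budget JSON bytes."""
    chunks, current, used = [], [], 0
    for ch in text:
        # Cost of one character inside a JSON string (minus the two quotes).
        cost = len(json.dumps(ch, ensure_ascii=False).encode("utf-8")) - 2
        if used + cost > budget and current:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks


def inline_metadata(
    path: str, mtime: float, text: str, limit: int = METADATA_LIMIT_BYTES
) -> List[dict]:
    """Option 2: store the note text itself, split into chunks under limit."""
    # Reserve room for the fixed fields, using a generously large chunk index.
    overhead = metadata_size({"path": path, "mtime": mtime, "chunk": 10**6, "text": ""})
    if limit <= overhead:
        raise ValueError("limit is too small to hold any text")
    pieces = split_to_fit(text, limit - overhead)
    return [
        {"path": path, "mtime": mtime, "chunk": i, "text": piece}
        for i, piece in enumerate(pieces)
    ]
```

With option 1 a secondary application has to open `path` itself to read the note; with option 2 every chunk carries its own text, at the cost of uploading the note contents.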

What is the average number of notes in an Obsidian vault? If it's much more than what I've anticipated (<1000 notes), then I think option 1 could make sense for performance reasons. That said, performance has been an afterthought at this point. There is still likely a lot of low-hanging fruit in terms of performance that wouldn't require an additional API service provider.

I'm thinking out loud here, so any feedback would be appreciated.

Thanks!

arminta7 · 2022-12-28T18:16:45Z

Yes, the file is... unwieldy lol.

I don't know too much about the specifics of the different options. I know there's also something like Weaviate? Not sure if that's better. It is open source, right?

Just checked and my largest note is ~4 million characters. And plenty of others over 10k.

As far as the average number of notes? I have no idea. I'm probably on the larger end, not the largest I've heard. I'm sure there are plenty over 1,000 notes.

brianpetro · 2022-12-29T01:55:43Z

Thanks for suggesting Weaviate. It's pretty comparable to Pinecone. Hosting your own instance looks non-trivial and may not be easily packaged into the plugin. I'll have to look into it more before saying for sure.

It needs further research, but there should be a relatively simple solution to manage the vector calculations better. The storage file can be separated based on a cosine similarity clustering algorithm. Then the calculations could be prioritized based on the nearest cluster. I'm surprised I haven't seen anything like this, but I haven't looked much.
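
A minimal sketch of that clustering idea, in pure Python. It is not the plugin's implementation: it uses a naive k-means with cosine similarity, seeds the centroids with the first `k` vectors, and assumes roughly unit-length embeddings. The approach resembles the inverted-file (IVF) indexes used by approximate nearest-neighbor libraries, and it shares their trade-off: a close match sitting in a neighboring cluster can be missed.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(centroids: List[Vector], v: Vector) -> int:
    """Index of the centroid most similar to v."""
    return max(range(len(centroids)), key=lambda i: cosine(centroids[i], v))


def assign(vectors: Dict[str, Vector], centroids: List[Vector]) -> List[List[str]]:
    """Group vector ids by their nearest centroid."""
    buckets: List[List[str]] = [[] for _ in centroids]
    for vid, vec in vectors.items():
        buckets[nearest(centroids, vec)].append(vid)
    return buckets


def cluster(
    vectors: Dict[str, Vector], k: int, iterations: int = 10
) -> Tuple[List[Vector], List[List[str]]]:
    """Naive k-means; each bucket could be written to its own storage file."""
    centroids = [list(v) for v in list(vectors.values())[:k]]  # naive seeding
    for _ in range(iterations):
        buckets = assign(vectors, centroids)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster empties out
                dims = len(centroids[c])
                centroids[c] = [
                    sum(vectors[m][d] for m in members) / len(members)
                    for d in range(dims)
                ]
    return centroids, assign(vectors, centroids)


def search(
    query: Vector,
    vectors: Dict[str, Vector],
    centroids: List[Vector],
    buckets: List[List[str]],
    top_n: int = 3,
) -> List[str]:
    """Score only the vectors in the cluster nearest to the query."""
    bucket = buckets[nearest(centroids, query)]
    ranked = sorted(bucket, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:top_n]
```

Only the centroids need to stay in memory; the bucket for the nearest cluster can be loaded on demand, and further clusters can be searched afterwards if more results are wanted.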

I'll continue to look into this. Thanks for the feedback.

vguillet · 2023-01-31T11:23:41Z

I second this. Being able to pull embeddings from Pinecone would allow for potentially leveraging purpose-made embedding tools capable of taking in a large variety of files (PowerPoints/PDFs, for example). This could in turn unlock better query responses while also keeping a single base embedding repository shared across all tools that leverage personal data!

brianpetro · 2023-01-31T13:31:10Z

@vguillet I see you already commented on brianpetro/obsidian-smart-connections#27, thanks!

It's a similar idea. I still think a Pinecone/Weaviate integration will happen. But I need to learn more about how people are using them.

brianpetro · 2024-06-28T20:10:31Z

Recorded response https://youtu.be/J5ARc_91fzs

arminta7 changed the title ~~FR: Store embedding sun Pinecone~~ FR: Store embeddings in Pinecone Dec 27, 2022

brianpetro added the enhancement New feature or request label Dec 27, 2022

Mearman mentioned this issue Feb 12, 2023

FR: Generate embeddings on demand brianpetro/obsidian-smart-connections#40

Closed

brianpetro transferred this issue from brianpetro/obsidian-smart-connections Jun 27, 2024

brianpetro changed the title ~~FR: Store embeddings in Pinecone~~ Smart Collections FR: Pinecone Adapter Jun 27, 2024
