How can multi-lingual vaults be supported? #251

tophee · 2023-06-17T14:58:50Z

I have notes in three different languages, and one disadvantage of that is that word-based similarity measures can't discover related notes in other languages. But it struck me that embedding vectors should solve that problem, because they are representations of meaning, not words. Since I had not read anything about this despite the AI hype in recent months, I was sceptical: it's probably not an easy solution...

So I asked ChatGPT and it basically told me: whether or not it works depends, try it out.

So I tried it in my vault by creating notes in one language that are deliberately similar to notes I have in another language. Sometimes Smart Connections connects them, but most of the time it suggests notes in the same language which are not really similar, at least not more similar than many other notes in a different language.
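The test described above can be made a little more systematic once embedding vectors for the notes are available. The following is a minimal sketch, not how Smart Connections ranks notes: it assumes similarity is measured by cosine similarity, and the note names, language tags, and 3-dimensional vectors are invented purely for illustration (real embeddings have hundreds or thousands of dimensions). The helper `first_cross_language_rank` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, notes):
    """Return (name, lang, score) tuples sorted by descending similarity.

    `notes` maps a note name to a (language, embedding) pair.
    """
    scored = [
        (name, lang, cosine_similarity(query_vec, vec))
        for name, (lang, vec) in notes.items()
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


def first_cross_language_rank(query_lang, ranked):
    """1-based rank of the best-scoring note in another language, or None."""
    for position, (_, lang, _) in enumerate(ranked, start=1):
        if lang != query_lang:
            return position
    return None


# Toy vectors, made up for this example only.
notes = {
    "gardening-en": ("en", [0.9, 0.1, 0.0]),
    "gartenarbeit-de": ("de", [0.7, 0.6, 0.1]),
    "taxes-en": ("en", [0.8, 0.3, 0.1]),
}
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a new English note

ranked = rank_by_similarity(query, notes)
for name, lang, score in ranked:
    print(f"{score:.3f}  [{lang}]  {name}")
print("best cross-language rank:", first_cross_language_rank("en", ranked))
```

With these toy numbers the unrelated English note outranks the German counterpart, which mirrors the behaviour reported here. Running the same check over real note embeddings would show how often a deliberately matched note in another language makes it near the top.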

Based on that, I would say that Smart Connections currently does not really work across languages. But I am still hoping that this is not due to the underlying embeddings technology per se, and that some tweaks may be enough to get it to work.

I wonder, for example, whether it is necessary to specify the language of the payload when submitting chunks to the embeddings API or whether the model detects the language automatically. If the language is specified by the user, then I assume that Smart Connections sends all my notes with the same language label, which would explain why it doesn't work across languages. In that case, there would be great potential in having the language either autodetected by the plugin or specified in the YAML header of the note.
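To illustrate the YAML-header idea, here is a minimal sketch of reading a per-note language tag. Everything in it is an assumption: the key name `lang` is hypothetical and not a Smart Connections setting, the parser only handles simple `key: value` lines rather than full YAML, and nothing in this thread establishes that the embeddings API accepts a language label at all. It only shows how a plugin could recover such a tag if one were useful.

```python
def read_frontmatter_language(note_text, key="lang", default=None):
    """Return the value of `key` from a leading front matter block.

    The block must start on the first line with `---` and be closed by a
    second `---`. Only simple `key: value` lines are understood.
    """
    lines = note_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return default
    # Locate the closing delimiter; without one there is no front matter.
    closing = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if closing is None:
        return default
    for line in lines[1:closing]:
        name, sep, value = line.strip().partition(":")
        if sep and name.strip() == key:
            value = value.strip().strip("'\"")  # drop optional quotes
            return value or default
    return default


def group_by_language(notes, default="unknown"):
    """Map each language tag to the list of note names carrying it."""
    groups = {}
    for name, text in notes.items():
        lang = read_frontmatter_language(text, default=default)
        groups.setdefault(lang, []).append(name)
    return groups


# Hypothetical notes, for illustration only.
notes = {
    "a": "---\nlang: de\n---\nEin Text über Gärten.",
    "b": "A note without any front matter.",
    "c": '---\ntitle: Gardens\nlang: "en"\n---\nA text about gardens.',
}
print(group_by_language(notes))
```

Notes without a tag fall into the `unknown` group here, which is where autodetection would have to take over.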

If language detection is not the issue here, what else could be done?

brianpetro · 2023-06-17T15:18:30Z

Maintainer

This is most likely a model-level issue.

So, in the future, when we have even better embeddings models, this should work as you would expect with no other changes.

Unfortunately, I don't think that there's much we can do besides wait for those improvements to the underlying model.

I haven't seen much about cross-language embedding capabilities. OpenAI mentions that their embeddings work with additional languages, but I haven't seen them explicitly mention cross-language capabilities.

It seems like cross-language capabilities shouldn't be too far off, though. Given my understanding of how embeddings work, by matching the meaning of the text, I would have expected better results than you are seeing.

Thanks for bringing this to my attention
Brian 🌴

1 reply

ivanovzlatan Oct 9, 2024

Is this supported now in any of the available models?
