SISTER (SImple SenTence EmbeddeR)
pip install sister
import sister
sentence_embedding = sister.MeanEmbedding(lang="en")
sentence = "I am a dog."
vector = sentence_embedding(sentence)
If you have custom model file by yourself, you can load it too.
(Data Format has to be loadable as gensim.models.KeyedVectors
for word2vec model files)
import sister
from sister.word_embedders import Word2VecEmbedding
sentence_embedding = sister.MeanEmbedding(
lang="ja", Word2VecEmbedding(model_path="/path/to/model")
)
sentence = "I am a dog."
vector = sentence_embedding(sentence)
- English
- Japanese
- French
In order to support a new language, please implement Tokenizer
(inheriting sister.tokenizers.Tokenizer
) and add fastText
pre-trained url to word_embedders.get_fasttext()
(List of model urls).
Actually Albert for English, CamemBERT for French and BERT for Japanese.
To use BERT, you need to install sister by pip install 'sister[bert]'
.
import sister
bert_embedding = sister.BertEmbedding(lang="en")
sentence = "I am a dog."
vector = bert_embedding(sentence)
You can also give multiple sentences to it (more efficient).
import sister
bert_embedding = sister.BertEmbedding(lang="en")
sentences = ["I am a dog.", "I want be a cat."]
vectors = bert_embedding(sentences)