[BUG]: Sentence-transformers models fail if "/" in the name #741
Comments
@blythed - guidance appreciated :) |
Hello :) @duarteocarmo did you try running the same model without the superduperdb wrapper? Thanks |
@thejumpman2323 - not really, I was testing through sentence-transformers. |
@duarteocarmo Sorry, I meant plain sentence-transformers |
Ah! Yes! It works fine with sentence transformers. |
@blythed @rec - this still fails..
Example:
import sentence_transformers
import superduperdb as s
from pymongo import MongoClient
from superduperdb.container.document import Document
from superduperdb.container.listener import Listener
from superduperdb.container.model import Model
from superduperdb.container.vector_index import VectorIndex
from superduperdb.db.mongodb.query import Collection
from superduperdb.ext.numpy.array import array
# This fails, because of the model name
MODEL_NAME = "BAAI/bge-small-en-v1.5"
VECTOR_SIZE = 384
# Below works
# MODEL_NAME = "all-MiniLM-L6-v2"
# VECTOR_SIZE = 384
IDENTIFIER_ID = "my-index"
COLLECTION_NAME = "docs"
client = MongoClient("localhost", 27017)
db = s.superduper(client.documents)
collection = Collection(name=COLLECTION_NAME)
data = [
    {
        "title": "Anarchism",
        "abstract": "Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions they claim maintain unnecessary coercion and hierarchy, typically including, though not necessarily limited to, the state and capitalism. Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations.",
    },
    {
        "title": "Albedo",
        "abstract": "Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, to 1, corresponding to a body that reflects all incident radiation.",
    },
    {
        "title": "Achilles",
        "abstract": 'In Greek mythology, Achilles ( ) or Achilleus ( Accessed 5 May 2017. the latter being the dative of the former. The name grew more popular, even becoming common soon after the seventh century BCEpigraphical database gives 476 matches for Ἀχιλ-.The earliest ones: Corinth 7th c. BC, Delphi 530 BC, Attica and Elis 5th c. BC. and was also turned into the female form Ἀχιλλεία (Achilleía), attested in Attica in the fourth century BC (IG II² 1617) and, in the form Achillia, on a stele in Halicarnassus as the name of a female gladiator fighting an "Amazon".',
    },
]
model = Model(
    identifier=MODEL_NAME,
    object=sentence_transformers.SentenceTransformer(MODEL_NAME, device="mps"),
    encoder=array("float32", shape=(VECTOR_SIZE,)),
    predict_method="encode",
    batch_predict=True,
)
db.add(
    VectorIndex(
        identifier=IDENTIFIER_ID,
        indexing_listener=Listener(
            model=model,
            key="abstract",
            select=Collection(name=COLLECTION_NAME).find(),
        ),
    )
)
print(db.show("listener"))
print(db.show("model"))
print(db.show("vector_index"))
data = [Document(r) for r in data]
db.execute(collection.insert_many(data))
for doc in db.execute(
    collection.like({"abstract": "something"}, n=2, vector_index=IDENTIFIER_ID)
):
    print(doc)
Error:
|
Sorry for the delay, but so far I can't reproduce this on HEAD or on 0.0.7 with Python 3.8. The program prints
and then continues to run. Does this happen immediately? I see you're on Python 3.10, so I'll install a 3.10 venv and try again. |
I still can't reproduce it with that code!! I also audited all calls to So let's start with some simple tests: try executing these from the command line:
|
I just tried creating a brand new directory, installing superduperdb and sentence-transformers, and running the program, and this did not reproduce the issue. I'm sure I can figure it out given just a bit more info!! :-) |
I'm now installing from the github repo and that part works now! Not sure what happened, but thanks for the help @rec ! |
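For anyone who sees the same difference between a released install and an install from the GitHub repo, it may help to confirm which versions are actually present in the active environment (note that the `package_versions` field in the system information below is empty). The following is a minimal sketch using only the standard library; the list of package names is an assumption based on the imports in the script above, and `installed_versions` is a hypothetical helper, not part of superduperdb.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each distribution name to its version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Package names assumed from the imports in the reproduction script.
report = installed_versions(["superduperdb", "sentence-transformers", "pymongo"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this output between the failing and the working environment would show whether the two installs really differ.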
Contact Details [Optional]
[email protected]
System Information
{
"cfg": {
"apis": {
"providers": {},
"retry": {
"stop_after_attempt": 2,
"wait_max": 10.0,
"wait_min": 4.0,
"wait_multiplier": 1.0
}
},
"cdc": false,
"dask": {
"password": "",
"port": 8786,
"username": "",
"ip": "localhost",
"deserializers": [],
"serializers": [],
"local": true
},
"data_layers": {
"artifact": {
"cls": "mongodb",
"connection": "pymongo",
"kwargs": {
"password": "",
"port": 27017,
"username": "",
"host": "localhost"
},
"name": "_filesystem:test_db"
},
"data_backend": {
"cls": "mongodb",
"connection": "pymongo",
"kwargs": {
"password": "",
"port": 27017,
"username": "",
"host": "localhost"
},
"name": "test_db"
},
"metadata": {
"cls": "mongodb",
"connection": "pymongo",
"kwargs": {
"password": "",
"port": 27017,
"username": "",
"host": "localhost"
},
"name": "test_db"
}
},
"distributed": false,
"logging": {
"level": "INFO",
"type": "STDERR",
"kwargs": {}
},
"model_server": {
"password": "",
"port": 5001,
"username": "",
"host": "127.0.0.1"
},
"notebook": {
"ip": "0.0.0.0",
"password": "",
"port": 8888,
"token": ""
},
"server": {
"host": "127.0.0.1",
"port": 3223,
"protocol": "http"
},
"vector_search": {
"host": "localhost",
"password": "",
"port": 19530,
"type": {
"backfill_batch_size": 100,
"inmemory": true
},
"backfill_batch_size": 100,
"username": ""
},
"verbose": false,
"downloads": {
"hybrid": false,
"root": "data/downloads"
}
},
"cwd": "/Users/duarteocarmo/Repos/thechangelogbot-backend",
"git": {
"branch": "('branch', '--show-current') failed with [Errno 2] No such file or directory: 'branch'",
"commit": "('show', '-s', '--format="%h: %s"') failed with [Errno 2] No such file or directory: 'show'"
},
"hostname": "duartes-macbook-pro.home",
"os_uname": [
"Darwin",
"duartes-macbook-pro.home",
"22.4.0",
"Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020",
"arm64"
],
"package_versions": {},
"platform": {
"platform": "macOS-13.3.1-arm64-arm-64bit",
"python_version": "3.10.11"
},
"startup_time": "2023-08-21 22:16:57.100440",
"superduper_db_root": "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages",
"sys": {
"argv": [
"/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/main.py",
"info"
],
"path": [
"/Users/duarteocarmo/Repos/thechangelogbot-backend",
"/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python310.zip",
"/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10",
"/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10/lib-dynload",
"/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages",
"/Users/duarteocarmo/Repos/thechangelogbot-backend/src"
]
}
}
What happened?
I was trying to use another model (e.g., BAAI/bge-base-en) for my vector use case. Here is my code:
When running this, I get the error below. This is probably because of the way we split this here.
I tried changing it to something like:
But I get another error such as:
FileNotFoundError: Can't find model: bge-base-en in metadata
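To illustrate the suspected cause: if an identifier is split on "/" and only the last component is kept, `BAAI/bge-base-en` becomes `bge-base-en`, which would be consistent with the error above. The sketch below is not superduperdb's actual code; `last_component` and `safe_identifier` are hypothetical helpers that show the behaviour and one possible workaround, namely registering the model under a slash-free identifier while still loading it by its full Hugging Face name.

```python
def last_component(identifier: str) -> str:
    """Hypothetical: mimic a lookup that keeps only the text after the last '/'."""
    return identifier.split("/")[-1]


def safe_identifier(model_name: str) -> str:
    """Hypothetical workaround: replace '/' so the identifier has no path separator."""
    return model_name.replace("/", "--")


MODEL_NAME = "BAAI/bge-base-en"

# The organisation prefix is lost when only the last component is kept.
print(last_component(MODEL_NAME))

# A slash-free identifier that still distinguishes models by organisation.
print(safe_identifier(MODEL_NAME))
```

With this approach the script would pass `identifier=safe_identifier(MODEL_NAME)` to `Model(...)` and keep `SentenceTransformer(MODEL_NAME)` unchanged. Whether this avoids the failure on the affected version has not been verified here.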
Steps to reproduce
Just execute the script above.
Relevant log output