[BUG]: Sentence-transformers models fail if "/" in the name #741

Closed
duarteocarmo opened this issue Aug 21, 2023 · 10 comments
Labels
🐛 bug Something isn't working

Comments

@duarteocarmo
Contributor

System Information

{
"cfg": {
"apis": {
"providers": {},
"retry": {
"stop_after_attempt": 2,
"wait_max": 10.0,
"wait_min": 4.0,
"wait_multiplier": 1.0
}
},
"cdc": false,
"dask": {
"password": "",
"port": 8786,
"username": "",
"ip": "localhost",
"deserializers": [],
"serializers": [],
"local": true
},
"data_layers": {
"artifact": {
"cls": "mongodb",
"connection": "pymongo",
"kwargs": {
"password": "",
"port": 27017,
"username": "",
"host": "localhost"
},
"name": "_filesystem:test_db"
},
"data_backend": {
"cls": "mongodb",
"connection": "pymongo",
"kwargs": {
"password": "",
"port": 27017,
"username": "",
"host": "localhost"
},
"name": "test_db"
},
"metadata": {
"cls": "mongodb",
"connection": "pymongo",
"kwargs": {
"password": "",
"port": 27017,
"username": "",
"host": "localhost"
},
"name": "test_db"
}
},
"distributed": false,
"logging": {
"level": "INFO",
"type": "STDERR",
"kwargs": {}
},
"model_server": {
"password": "",
"port": 5001,
"username": "",
"host": "127.0.0.1"
},
"notebook": {
"ip": "0.0.0.0",
"password": "",
"port": 8888,
"token": ""
},
"server": {
"host": "127.0.0.1",
"port": 3223,
"protocol": "http"
},
"vector_search": {
"host": "localhost",
"password": "",
"port": 19530,
"type": {
"backfill_batch_size": 100,
"inmemory": true
},
"backfill_batch_size": 100,
"username": ""
},
"verbose": false,
"downloads": {
"hybrid": false,
"root": "data/downloads"
}
},
"cwd": "/Users/duarteocarmo/Repos/thechangelogbot-backend",
"git": {
"branch": "('branch', '--show-current') failed with [Errno 2] No such file or directory: 'branch'",
"commit": "('show', '-s', '--format="%h: %s"') failed with [Errno 2] No such file or directory: 'show'"
},
"hostname": "duartes-macbook-pro.home",
"os_uname": [
"Darwin",
"duartes-macbook-pro.home",
"22.4.0",
"Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020",
"arm64"
],
"package_versions": {},
"platform": {
"platform": "macOS-13.3.1-arm64-arm-64bit",
"python_version": "3.10.11"
},
"startup_time": "2023-08-21 22:16:57.100440",
"superduper_db_root": "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages",
"sys": {
"argv": [
"/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/main.py",
"info"
],
"path": [
"/Users/duarteocarmo/Repos/thechangelogbot-backend",
"/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python310.zip",
"/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10",
"/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10/lib-dynload",
"/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages",
"/Users/duarteocarmo/Repos/thechangelogbot-backend/src"
]
}
}

What happened?

I was trying to use another model (e.g., BAAI/bge-base-en) for my vector use case. Here is my code:

import sentence_transformers
import superduperdb as s
from pymongo import MongoClient
from superduperdb.container.document import Document
from superduperdb.container.listener import Listener
from superduperdb.container.model import Model
from superduperdb.container.vector_index import VectorIndex
from superduperdb.db.mongodb.query import Collection
from superduperdb.ext.numpy.array import array

# This fails, because of the model name 
MODEL_NAME = "BAAI/bge-base-en"
VECTOR_SIZE = 768

# Below works
# MODEL_NAME = "all-MiniLM-L6-v2"
# VECTOR_SIZE = 384

IDENTIFIER_ID = "my-index"
COLLECTION_NAME = "docs"

client = MongoClient("localhost", 27017)
db = s.superduper(client.documents)
collection = Collection(name=COLLECTION_NAME)

data = [
    {
        "title": "Anarchism",
        "abstract": "Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions they claim maintain unnecessary coercion and hierarchy, typically including, though not necessarily limited to, the state and capitalism. Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations.",
    },
    {
        "title": "Albedo",
        "abstract": "Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, to 1, corresponding to a body that reflects all incident radiation.",
    },
    {
        "title": "Achilles",
        "abstract": 'In Greek mythology, Achilles ( ) or Achilleus ( Accessed 5 May 2017. the latter being the dative of the former. The name grew more popular, even becoming common soon after the seventh century BCEpigraphical database gives 476 matches for Ἀχιλ-.The earliest ones: Corinth 7th c. BC, Delphi 530 BC, Attica and Elis 5th c. BC. and was also turned into the female form Ἀχιλλεία (Achilleía), attested in Attica in the fourth century BC (IG II² 1617) and, in the form Achillia, on a stele in Halicarnassus as the name of a female gladiator fighting an "Amazon".',
    },
]


model = Model(
    identifier=MODEL_NAME,
    object=sentence_transformers.SentenceTransformer(MODEL_NAME),
    encoder=array("float32", shape=(VECTOR_SIZE,)),
    predict_method="encode",
    batch_predict=True,
)


db.add(
    VectorIndex(
        identifier=IDENTIFIER_ID,
        indexing_listener=Listener(
            model=model,
            key="abstract",
            select=Collection(name=COLLECTION_NAME).find(),
        ),
    )
)


print(db.show("listener"))
print(db.show("model"))
print(db.show("vector_index"))

data = [Document(r) for r in data]
db.execute(collection.insert_many(data))

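When running this, I get the error below. This is probably because of how the listener identifier is split in _build_task_workflow in superduperdb/db/base/db.py (model, key = identifier.split('/')), which assumes the model name contains no "/".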

I tried changing it to something like:

 *_, model, key = identifier.split("/")

But then I get a different error: FileNotFoundError: Can't find model: bge-base-en in metadata
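
To illustrate both failures, here is a minimal standalone sketch. It does not use superduperdb; the identifier string is the one printed by db.show("listener") in the log output below, and the variable names are only illustrative.

```python
# Listener identifier as printed by db.show("listener") in the log output.
identifier = "BAAI/bge-base-en/abstract"

# A plain split yields three parts, so unpacking into two names fails.
try:
    model, key = identifier.split("/")
except ValueError as exc:
    error = str(exc)  # message starts with "too many values to unpack"

# Star-unpacking avoids the ValueError but discards the "BAAI" prefix,
# so a later metadata lookup would search for "bge-base-en" only.
*_, model, key = identifier.split("/")
print(model, key)
```

This is consistent with the second error message, which names bge-base-en without the BAAI/ prefix.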

Steps to reproduce

Just execute the script above.

Relevant log output

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-base-en
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
INFO:root:Adding model BAAI/bge-base-en to db
WARNING:root:model/BAAI/bge-base-en/0 already exists - doing nothing
INFO:root:Done.
0it [00:00, ?it/s]
Batches: 0it [00:00, ?it/s]
INFO:root:loading hashes: 'my-index'
Loading vectors into vector-table...: 0it [00:00, ?it/s]
['BAAI/bge-base-en/abstract']
['BAAI/bge-base-en']
['my-index']
Traceback (most recent call last):
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/ola.py", line 66, in <module>
    db.execute(collection.insert_many(data))
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 263, in execute
    return self.insert(query)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 292, in insert
    return insert(self)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/mongodb/query.py", line 789, in __call__
    graph = db.refresh_after_update_or_insert(
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 348, in refresh_after_update_or_insert
    task_workflow: TaskWorkflow = self._build_task_workflow(
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 533, in _build_task_workflow
    model, key = identifier.split('/')
ValueError: too many values to unpack (expected 2)
@duarteocarmo duarteocarmo added the 🐛 bug Something isn't working label Aug 21, 2023
@duarteocarmo
Contributor Author

@blythed - guidance appreciated :)

rec added a commit to rec/superduperdb that referenced this issue Aug 22, 2023
@thejumpman2323
Contributor

thejumpman2323 commented Aug 22, 2023

Hello :) @duarteocarmo
Great bug/issue report, thanks!

Did you try running the same model without the superduperdb wrapper,
with plain sentence-transformers?

Thanks

@duarteocarmo
Contributor Author

@thejumpman2323 - not really, I was testing through sentence-transformers.

@thejumpman2323
Contributor

@duarteocarmo Sorry, I meant plain sentence-transformers.
Did this model work there?

@duarteocarmo
Contributor Author

Ah! Yes! It works fine with sentence transformers.
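
For reference, a check of this kind could look like the sketch below. The helper name embedding_shape is hypothetical, and the import is placed inside the function only so that defining the helper does not itself require the package to be installed.

```python
def embedding_shape(model_name, texts):
    """Load a model with plain sentence-transformers and return the
    shape of the embeddings it produces for the given texts."""
    # Lazy import: sentence-transformers is only needed when this is called.
    from sentence_transformers import SentenceTransformer

    # The model name may contain "/" (organisation/model on the Hugging Face Hub).
    model = SentenceTransformer(model_name)
    # encode() returns a 2-D NumPy array for a list of input strings.
    return model.encode(texts).shape
```

Calling embedding_shape("BAAI/bge-base-en", ["some text"]) downloads the model on first use. If it loads correctly, the second dimension should match the VECTOR_SIZE used in the report (768 for this model).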

@rec rec closed this as completed in cfca570 Aug 23, 2023
@duarteocarmo
Contributor Author

@blythed @rec - this still fails.

superduperdb==0.0.7

Example:

import sentence_transformers
import superduperdb as s
from pymongo import MongoClient
from superduperdb.container.document import Document
from superduperdb.container.listener import Listener
from superduperdb.container.model import Model
from superduperdb.container.vector_index import VectorIndex
from superduperdb.db.mongodb.query import Collection
from superduperdb.ext.numpy.array import array

# This fails, because of the model name
MODEL_NAME = "BAAI/bge-small-en-v1.5"
VECTOR_SIZE = 384


# Below works
# MODEL_NAME = "all-MiniLM-L6-v2"
# VECTOR_SIZE = 384

IDENTIFIER_ID = "my-index"
COLLECTION_NAME = "docs"

client = MongoClient("localhost", 27017)
db = s.superduper(client.documents)
collection = Collection(name=COLLECTION_NAME)

data = [
    {
        "title": "Anarchism",
        "abstract": "Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions they claim maintain unnecessary coercion and hierarchy, typically including, though not necessarily limited to, the state and capitalism. Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations.",
    },
    {
        "title": "Albedo",
        "abstract": "Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, to 1, corresponding to a body that reflects all incident radiation.",
    },
    {
        "title": "Achilles",
        "abstract": 'In Greek mythology, Achilles ( ) or Achilleus ( Accessed 5 May 2017. the latter being the dative of the former. The name grew more popular, even becoming common soon after the seventh century BCEpigraphical database gives 476 matches for Ἀχιλ-.The earliest ones: Corinth 7th c. BC, Delphi 530 BC, Attica and Elis 5th c. BC. and was also turned into the female form Ἀχιλλεία (Achilleía), attested in Attica in the fourth century BC (IG II² 1617) and, in the form Achillia, on a stele in Halicarnassus as the name of a female gladiator fighting an "Amazon".',
    },
]


model = Model(
    identifier=MODEL_NAME,
    object=sentence_transformers.SentenceTransformer(MODEL_NAME, device="mps"),
    encoder=array("float32", shape=(VECTOR_SIZE,)),
    predict_method="encode",
    batch_predict=True,
)


db.add(
    VectorIndex(
        identifier=IDENTIFIER_ID,
        indexing_listener=Listener(
            model=model,
            key="abstract",
            select=Collection(name=COLLECTION_NAME).find(),
        ),
    )
)


print(db.show("listener"))
print(db.show("model"))
print(db.show("vector_index"))

data = [Document(r) for r in data]
db.execute(collection.insert_many(data))


for doc in db.execute(
    collection.like({"abstract": "something"}, n=2, vector_index=IDENTIFIER_ID)
):
    print(doc)

Error:

Loading vectors into vector-table...: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/ola.py", line 72, in <module>
    for doc in db.execute(
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 276, in execute
    return self.like(query)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 334, in like
    return like(self)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/mongodb/query.py", line 319, in __call__
    ids, scores = db._select_nearest(
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 972, in _select_nearest
    vi = self.vector_indices[vector_index]
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 989, in __missing__
    value = self[key] = self.database.load(self.field, key)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 507, in load
    m.on_load(self)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/container/vector_index.py", line 107, in on_load
    self._initialize_vector_database(db)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/container/vector_index.py", line 239, in _initialize_vector_database
    h = record.outputs(key, self.indexing_listener.model.identifier)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/container/document.py", line 48, in outputs
    document = self.unpack()[_OUTPUTS_KEY][key][model]
KeyError: 'BAAI/bge-small-en-v1.5'
Loading vectors into vector-table...: 0it [00:00, ?it/s]

@rec rec reopened this Sep 19, 2023
@rec
Contributor

rec commented Sep 19, 2023

Sorry for the delay, but so far I can't reproduce this, either on HEAD or on 0.0.7, on Python 3.8.

The program prints

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-base-en
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu

and then continues to run. Does this happen immediately?

I see you're on Python 3.10, so I'll install a 3.10 venv and try again.

@rec
Contributor

rec commented Sep 19, 2023

I still can't reproduce it with that code!!

I also audited all calls to split and all strings looking like '/' while I was waiting and it does look as if we are careful to use rpartition everywhere we do the model/key split.
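
As a rough illustration of that approach (a sketch only, not the actual superduperdb code; the function name is hypothetical), splitting on the last "/" keeps any "/" inside the model name:

```python
from typing import Tuple


def split_listener_identifier(identifier: str) -> Tuple[str, str]:
    """Split 'model/key' on the LAST '/', so the model part may contain '/'."""
    model, sep, key = identifier.rpartition("/")
    if not sep:
        # No '/' at all: the identifier is not of the form 'model/key'.
        raise ValueError(f"expected 'model/key', got {identifier!r}")
    return model, key


print(split_listener_identifier("BAAI/bge-base-en/abstract"))
print(split_listener_identifier("all-MiniLM-L6-v2/abstract"))
```

The trade-off is that this assumes the key itself never contains "/"; a key with a slash would be split in the wrong place.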

So let's start with some simple tests: try executing these from the command line:

python -c 'import superduperdb as s; print(s.__version__, s.__file__)'
python -m pip freeze | grep superduperdb
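
The same information can be gathered from inside Python, which helps when the interpreter that runs the script differs from the one pip points at. This is a sketch using only the standard library; describe_install is a hypothetical helper, and it assumes the distribution name and the import name are both superduperdb.

```python
from importlib import metadata
from importlib.util import find_spec
from typing import Optional, Tuple


def describe_install(package: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (installed version, import location); None where not found."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    # find_spec locates the package without importing it.
    spec = find_spec(package)
    location = spec.origin if spec is not None else None
    return version, location


print(describe_install("superduperdb"))
```

A version or location that differs from what was expected would point to a stale or duplicate install rather than a bug in the current code.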

@rec
Contributor

rec commented Sep 20, 2023

So I just tried creating a brand new directory, installing superduperdb and sentence-transformers, and running the program, and this did not reproduce the issue.

I'm sure I can figure it out given just a bit more info!! :-)

@duarteocarmo
Contributor Author

I'm now installing from the GitHub repo and that part works now! Not sure what happened, but thanks for the help @rec!

fnikolai pushed a commit that referenced this issue Dec 4, 2023