Embeddings.search leads to segfault error #813

Closed
Pringled opened this issue Nov 19, 2024 · 3 comments

@Pringled

Hi! When running one of the examples, I ran into an issue.

Issue

The following code crashes with a segfault when search is called. The crash is accompanied by this warning:

UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Code to reproduce:

from txtai import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]

# Index the list of text
embeddings.index(data)

print(f"{'Query':20} Best Match")
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war",
              "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print(f"{query:20} {data[uid]}")

Environment info

Running on macOS (Apple M3) with Python 3.10.14.
Packages installed in the virtual environment:

aiohappyeyeballs==2.4.3
aiohttp==3.11.4
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.6.2.post1
async-timeout==5.0.1
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
diskcache==5.6.3
distro==1.9.0
exceptiongroup==1.2.2
faiss-cpu==1.9.0
fasteners==0.19
fasttext==0.9.3
filelock==3.16.1
frozenlist==1.5.0
fsspec==2024.10.0
h11==0.14.0
httpcore==1.0.7
httpx==0.27.2
huggingface-hub==0.26.2
idna==3.10
importlib_metadata==8.5.0
Jinja2==3.1.4
jiter==0.7.1
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
litellm==1.52.10
llama_cpp_python==0.3.2
lz4==4.3.3
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
model2vec==0.3.2
mpmath==1.3.0
msgpack==1.1.0
multidict==6.1.0
networkx==3.4.2
numpy==2.1.3
openai==1.54.4
packaging==24.2
pillow==11.0.0
propcache==0.2.0
pybind11==2.13.6
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
pymagnitude-lite==0.1.143
python-dotenv==1.0.1
PyYAML==6.0.2
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.21.0
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.3.1
skops==0.10.0
sniffio==1.3.1
sympy==1.13.1
tabulate==0.9.0
threadpoolctl==3.5.0
tiktoken==0.8.0
tokenizers==0.20.3
torch==2.5.1
tqdm==4.67.0
transformers==4.46.3
txtai==8.0.0
typing_extensions==4.12.2
urllib3==2.2.3
xxhash==3.5.0
yarl==1.17.2
zipp==3.21.0

@davidmezzetti
Member

Hello, thank you for the detailed report.

This is typically caused by a known issue between Faiss and macOS (kyamagu/faiss-wheels#100).

The usual mitigations are:

Issue

Segmentation faults and similar errors on macOS

Solution

Set the following environment parameters.

  • Disable OpenMP threading via the environment variable export OMP_NUM_THREADS=1
  • Disable PyTorch MPS device via export PYTORCH_MPS_DISABLE=1
  • Disable llama.cpp metal via export LLAMA_NO_METAL=1

Source: https://neuml.github.io/txtai/faq/
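
When exporting shell variables is inconvenient, the same settings can be applied from Python. The following is a minimal sketch, not something taken from the FAQ: it assumes the variables are read when the native libraries load, so they are set before txtai (and with it Faiss and PyTorch) is imported. The variable names are the ones quoted above; the dictionary name MACOS_WORKAROUNDS is made up for illustration, and whether these settings resolve the crash on a given machine is not guaranteed.

```python
import os

# Variable names taken from the txtai FAQ quoted above.
# MACOS_WORKAROUNDS is an illustrative name, not part of txtai.
MACOS_WORKAROUNDS = {
    "OMP_NUM_THREADS": "1",      # disable OpenMP threading
    "PYTORCH_MPS_DISABLE": "1",  # disable the PyTorch MPS device
    "LLAMA_NO_METAL": "1",       # disable llama.cpp Metal
}

# Assumption: the variables only take effect if they are set before
# the native libraries are loaded, so this runs before any txtai import.
os.environ.update(MACOS_WORKAROUNDS)

# from txtai import Embeddings  # import only after the environment is set
```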

There is also this workaround from kyamagu/faiss-wheels#73 (comment):

export KMP_DUPLICATE_LIB_OK=TRUE

It would be great to have a programmatic solution as I'm sure there are plenty of macOS users that encounter this error and just move on to another library.
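
As a rough idea of what a programmatic solution could look like, here is a sketch of a helper that applies two of the workarounds mentioned in this thread only on macOS and only when the user has not set them already. The function apply_macos_workarounds is hypothetical and not part of txtai or Faiss; it would need to be called before those libraries are imported, and it does not address the underlying upstream issue.

```python
import os
import sys


def apply_macos_workarounds(platform=sys.platform, environ=os.environ):
    """Hypothetical helper: set workaround variables on macOS only.

    Values that are already present are left untouched, so explicit
    user settings win. Returns the names this call actually set.
    """
    if platform != "darwin":
        return []

    # Values as given in this thread; not verified against every setup.
    defaults = {
        "OMP_NUM_THREADS": "1",
        "KMP_DUPLICATE_LIB_OK": "TRUE",
    }

    applied = []
    for name, value in defaults.items():
        if name not in environ:
            environ[name] = value
            applied.append(name)
    return applied
```

Passing the platform and environment as parameters keeps the helper easy to check without touching the real process environment.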

@Pringled
Author

Hi @davidmezzetti, thanks for the detailed reply! The other backends indeed seem to work fine. I guess an alternative solution would be to have a different default for the index method (e.g. hnsw), but I guess that's not as nice as it would introduce more base dependencies for txtai. For now I'll just use a different backend as I would be using hnsw from faiss anyway, thanks!
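
For readers who want to try the same route, switching backends is a configuration change. The sketch below assumes txtai's backend option accepts "hnsw" and that the hnswlib package is installed; the helper embeddings_config is made up for illustration, and the txtai calls are left commented out so the snippet runs without the library.

```python
def embeddings_config(backend="hnsw"):
    """Illustrative helper: keyword arguments for Embeddings with a
    non-default ANN backend (assumes "hnsw" is available)."""
    return {
        "path": "sentence-transformers/nli-mpnet-base-v2",
        "backend": backend,
    }


config = embeddings_config()

# Usage, assuming txtai and hnswlib are installed:
# from txtai import Embeddings
# embeddings = Embeddings(**config)
# embeddings.index(data)
```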

@davidmezzetti
Member

In the past, I had setup.py conditionally install hnswlib for mac/windows and faiss for linux as the defaults. But that became confusing as the results were different based on the OS.
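
To illustrate the kind of conditional default described here, a per-OS choice might look like the sketch below. This is not the actual code from txtai's old setup.py; the function name is hypothetical, and the package names faiss-cpu and hnswlib are assumptions based on the names used elsewhere in this thread.

```python
import sys


def default_ann_requirement(platform=sys.platform):
    """Hypothetical: pick the default ANN dependency by operating system."""
    # Faiss on Linux, hnswlib everywhere else (macOS, Windows).
    if platform.startswith("linux"):
        return "faiss-cpu"
    return "hnswlib"
```

The drawback mentioned above follows directly: the same code returns results from different index implementations depending on where it runs.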

I've been hoping the upstream library would find a solution but I've been holding my breath for a while 😄
