Vector storage suggestions #38

Open
thomas-yanxin opened this issue Apr 26, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@thomas-yanxin
Member

No description provided.

@sujianwei1

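Qdrant can be run in several modes, and there are minor differences depending on which one you choose. The options are:
Local mode, no server required
On-premise server deployment
Cloud deployment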

@sujianwei1

In local mode, without a Qdrant server, the vectors can also be stored on disk so that they persist between runs.

    from langchain.vectorstores import Qdrant

    qdrant = Qdrant.from_documents(
        docs, embeddings,
        path="/tmp/local_qdrant",
        collection_name="my_documents",
    )

@sujianwei1

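That should work. If you want to reuse an existing collection, you can always create a Qdrant instance yourself and pass it a QdrantClient instance that holds the connection details.

    import qdrant_client
    from langchain.vectorstores import Qdrant

    # Open the same on-disk location that was used when the collection was created.
    client = qdrant_client.QdrantClient(
        path="/tmp/local_qdrant", prefer_grpc=True
    )
    # Wrap the existing collection instead of rebuilding it.
    qdrant = Qdrant(
        client=client, collection_name="my_documents",
        embedding_function=embeddings.embed_query
    )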

@sujianwei1

This is a bit like loading the previously stored data at startup, which ensures that nothing is ever lost.
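
The "load stored data at startup" idea can be illustrated without Qdrant at all. What follows is only a minimal sketch using the Python standard library. `TinyVectorStore`, its `add` method, and the JSON file layout are hypothetical names invented for this illustration; they are not part of Qdrant or LangChain, and a real vector database persists data very differently.

```python
import json
import os
import tempfile


class TinyVectorStore:
    """Toy stand-in for a disk-backed vector store (not the Qdrant API)."""

    def __init__(self, path):
        self.path = path
        self.points = {}
        # Load whatever an earlier run saved, if the file exists.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.points = json.load(f)

    def add(self, point_id, vector, text):
        self.points[point_id] = {"vector": vector, "text": text}
        # Write through on every add so the file always matches memory.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.points, f, ensure_ascii=False)


store_path = os.path.join(tempfile.mkdtemp(), "store.json")

first_run = TinyVectorStore(store_path)
first_run.add("doc-0", [0.1, 0.9], "hello")

# A second instance on the same path sees the data written by the first one.
second_run = TinyVectorStore(store_path)
print(second_run.points["doc-0"]["text"])
```

Creating a second instance on the same path plays the role of restarting the application: the data is read back instead of being rebuilt.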

@sujianwei1

Retrieval:

    query = "What did the president say about Ketanji Brown Jackson"
    found_docs = qdrant.similarity_search(query)
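
Conceptually, a similarity search embeds the query and returns the stored items whose vectors are closest to it. The sketch below shows that ranking step with plain Python, assuming cosine similarity as the metric. The function names `cosine` and `top_k_by_cosine` and the three-dimensional example vectors are made up for illustration; a real vector store uses an index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, stored, k=4):
    """Return the k stored texts whose vectors are most similar to query_vec."""
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# (text, vector) pairs; in practice the vectors come from an embedding model.
stored = [
    ("about judges", [0.9, 0.1, 0.0]),
    ("about taxes", [0.2, 1.0, 0.0]),
    ("about weather", [0.0, 0.1, 0.9]),
]
results = top_k_by_cosine([1.0, 0.0, 0.0], stored, k=2)
print(results)
```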

@sujianwei1

For batch-loading documents, take a look at this function:

    from langchain.document_loaders import DirectoryLoader

    loader = DirectoryLoader(solidity_root, glob="**/*.txt")
    docs = loader.load()

Split the documents into chunks:

    split_docs = text_splitter.split_documents(docs)

Then embed the chunks and store them in the vector database:

    vectorstore = vectorstore.from_documents(split_docs, embeddings, persist_directory=persist_directory)
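
The splitting step relies on a `text_splitter` that the comment does not define. As a rough illustration of what chunking with a size and an overlap means, here is a minimal fixed-width splitter in plain Python. The function name `split_fixed_width` is hypothetical, and the parameter names are borrowed from LangChain's splitters only for familiarity. LangChain's own splitters break text on separators first, so their output will differ from this sketch.

```python
def split_fixed_width(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = split_fixed_width("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Slicing by characters rather than bytes means the same function also works on Chinese text.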

@thomas-yanxin thomas-yanxin pinned this issue Apr 26, 2023
@HkkSimple

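If you need vector storage for a large existing corpus of documents, caching with a disk-based database may be the better choice.
As far as I can tell, GPTCache is built on this idea and is designed specifically for LLMs, so it may work out of the box. (https://github.com/zilliztech/GPTCache)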

@online2311
Contributor

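Milvus Lite (https://github.com/milvus-io/milvus-lite) is fully compatible with Milvus
and can be embedded in a Python application. Install it with pip install milvus (https://pypi.org/project/milvus/).
This makes it easy to switch to Milvus in a production environment later, so it covers both use cases.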

@benli2023

Here is my code. Could someone check whether anything is wrong with it? It cannot retrieve similar Chinese text.

    import os

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.vectorstores import Qdrant
    from tqdm import tqdm

    def qdrant(docs_path):
        texts = []
        for doc in tqdm(os.listdir(docs_path)):
            if doc.endswith('.txt'):
                with open(f'{docs_path}/{doc}', 'r', encoding='utf-8') as f:
                    doc_data = f.read()
                text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
                texts = text_splitter.split_text(doc_data)
                Qdrant.from_texts(texts, embeddings,
                                  metadatas=[{"source": f"{i}-doc"} for i in range(len(texts))],
                                  host="localhost",
                                  prefer_grpc=False,
                                  collection_name="Finance"
                                  )

    from qdrant_client import QdrantClient

    client = QdrantClient(host="localhost", port=6333)
    embeddings = HuggingFaceEmbeddings(model_name='/home/ubuntu/models/GanymedeNil_text2vec-large-chinese')

    qdrant = Qdrant(client, 'Finance', embeddings.embed_query)

    documents = qdrant.similarity_search("test", 4)

    for doc in documents:
        print(doc.page_content)
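
One thing that may be worth checking in code like this: `Qdrant.from_texts` is called once per file inside the loop. Assuming the installed LangChain version recreates the collection on every `from_texts` call (its documentation at the time warned about this, but it is an assumption about this particular setup), only the chunks of the last file would remain in "Finance". A possible workaround is to gather the chunks from all files first and index them in a single call. The sketch below covers only the gathering step, using the standard library. `collect_chunks` is a hypothetical helper, and the fixed-width split is a placeholder for a real text splitter.

```python
import os
import tempfile


def collect_chunks(docs_path, chunk_size=1000):
    """Gather chunks and per-chunk metadata from every .txt file in docs_path."""
    texts, metadatas = [], []
    for name in sorted(os.listdir(docs_path)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(docs_path, name), "r", encoding="utf-8") as f:
            data = f.read()
        # Fixed-width split as a placeholder for a real text splitter.
        for i in range(0, len(data), chunk_size):
            texts.append(data[i:i + chunk_size])
            metadatas.append({"source": f"{name}-{i // chunk_size}"})
    return texts, metadatas


# Demo with two small files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name, body in [("a.txt", "第一份文件"), ("b.txt", "第二份文件内容")]:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
        f.write(body)

texts, metadatas = collect_chunks(demo_dir, chunk_size=4)
print(len(texts), metadatas[0])
# texts and metadatas would then be passed to one Qdrant.from_texts(...) call.
```

Separately, the query string "test" is English while the indexed documents are Chinese, so trying a Chinese query is another quick check.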

@thomas-yanxin
Member Author

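I am currently trying Qdrant and will do a more detailed investigation later.

References:

  1. Vector Database Showdown | A Benchmark on Million-Scale Data (original title: 向量数据库大PK|来自百万级数据的基准测试)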

@zhugexinxin

Is there an API for incrementally updating collections?


6 participants