add store and index for cookbook

LazyAGI · Dec 9, 2024 · ce1ec7f
1 parent ec0aead
commit ce1ec7f
Showing 2 changed files with 227 additions and 6 deletions.
diff --git a/docs/zh/Best Practice/rag.md b/docs/zh/Best Practice/rag.md
@@ -101,16 +101,17 @@ docs.create_node_group(name='sentence-len',
 
 The configuration parameter `store_conf` is a dict with the following fields:
 
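-* `type`: the storage backend type. The currently supported storage backends are:
+* `type`: the storage backend type. The currently supported storage backends and the `kwargs` each one accepts are:
     - `map`: in-memory key/value storage;
     - `chroma`: stores data in Chroma;
+        - `dir` (required): the directory where the data is stored.
     - `milvus`: stores data in Milvus.
+        - `uri` (required): the Milvus storage address, either a file path or a URL in `ip:port` format;
+        - `embedding_index_type` (optional): the embedding index type supported by Milvus; defaults to `HNSW`;
+        - `embedding_metric_type` (optional): the metric type used for retrieval, configured to match the embedding index type; defaults to `COSINE`.
 * `indices`: a dict whose keys are index type names and whose values are the parameters required by that index type. The currently supported index types are:
     - `smart_embedding_index`: provides embedding retrieval. The supported backends are:
-        - `milvus`: uses Milvus as the embedding retrieval backend. The available `kwargs` are:
-            - `uri`: the Milvus storage address, either a file path or a URL in `ip:port` format;
-            - `embedding_index_type`: the embedding index type supported by Milvus; defaults to `HNSW`;
-            - `embedding_metric_type`: the metric type used for retrieval, configured to match the embedding index type; defaults to `COSINE`.
+        - `milvus`: uses Milvus as the embedding retrieval backend. The available `kwargs` are the same as when Milvus is used as the storage backend.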
 
 Below is a sample configuration that uses Chroma as the storage backend and Milvus as the retrieval backend:
 
@@ -197,7 +198,7 @@ def dummy_similarity_func(query: List[float], node: DocNode, **kwargs) -> float:
 
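 When using a `Retriever` instance, pass in the `query` string to search for, plus an optional `filters` argument for field filtering. `filters` is a dict in which each key is a field to filter on and each value is a list of allowed values, meaning the field matches if its value equals any value in the list. A node is returned only when all conditions are satisfied.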
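To make the matching rule concrete, here is a minimal Python sketch of the semantics described above. It is not LazyLLM's implementation: `matches_filters` and the field names `author` and `year` are hypothetical, and each node is modeled as a plain dict of field values.

```python
from typing import Any, Dict, List


def matches_filters(fields: Dict[str, Any], filters: Dict[str, List[Any]]) -> bool:
    """Return True when every filtered field holds one of its allowed values."""
    # A field absent from `fields` yields None, which only matches if None is allowed.
    return all(fields.get(key) in allowed for key, allowed in filters.items())


# Hypothetical node metadata, for illustration only.
nodes = [
    {'author': 'A', 'year': 2023},
    {'author': 'B', 'year': 2024},
    {'author': 'A', 'year': 2020},
]
filters = {'author': ['A', 'C'], 'year': [2023, 2024]}

kept = [node for node in nodes if matches_filters(node, filters)]
print(kept)
```

Only the first node is kept, because it is the only one whose `author` and `year` both equal one of the allowed values.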
 
-Below is an example that uses filters:
+Below is an example that uses filters (for configuring `doc_fields`, see the [introduction to Document](../Best%20Practice/rag.md#Document)):
 
 ```python
 filters = {

diff --git a/docs/zh/Cookbook/rag.md b/docs/zh/Cookbook/rag.md
@@ -322,3 +322,223 @@ my_reranker = Reranker(name="MyReranker")
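 Of course, the returned results may look rather odd :)
 
 This section only briefly shows how to use `LazyLLM`'s registration mechanism for extensions. See the [Retriever](../Best%20Practice/rag.md#Retriever) and [Reranker](../Best%20Practice/rag.md#Reranker) documentation; when the built-in options do not meet your needs, you can write your own similarity computation and ranking strategy to build your own application.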
+
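+## Version-5: Custom Storage Backend
+
+After the transformation rules for a Node Group are defined, `LazyLLM` saves the Node Group content produced by the transformations used during retrieval, so that later uses do not repeat the transformation. To make storing and accessing data convenient, `LazyLLM` lets users customize the storage backend.
+
+If none is specified, `LazyLLM` uses a dict-based key/value store as the storage backend by default. Users can specify another storage backend through the `store_conf` parameter of `Document`. For example, to use Milvus as the storage backend, we can configure it like this: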
+
+```python
+milvus_store_conf = {
+    'type': 'milvus',
+    'kwargs': {
+        'uri': store_file,
+        'embedding_index_type': 'HNSW',
+        'embedding_metric_type': 'COSINE',
+    },
+}
+```
+
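+Here `type` is the backend type and `kwargs` holds the parameters passed to the backend. The fields mean the following:
+
+* `type`: the backend type to use. Currently supported:
+    - `map`: a dict-based in-memory key/value backend;
+    - `milvus`: stores data in Milvus. Its `kwargs` include:
+        - `uri` (required): the location of the Milvus backend, either a string in `ip:port` form or a file path;
+        - `embedding_index_type` (optional): the embedding index type supported by Milvus; defaults to `HNSW`;
+        - `embedding_metric_type` (optional): the metric type used for retrieval, configured to match the embedding index type; defaults to `COSINE`.
+    - `chroma`: stores data in Chroma. Its `kwargs` include:
+        - `dir` (required): the directory where the data is stored.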
+
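The required `kwargs` listed above can be checked before a `Document` is built. The following is a rough sketch under that assumption; `REQUIRED_KWARGS` and `check_store_conf` are hypothetical helpers written for this illustration, not part of LazyLLM, and they encode only the required fields named in the list.

```python
# Required kwargs per storage backend, taken from the list above.
REQUIRED_KWARGS = {
    'map': set(),
    'chroma': {'dir'},
    'milvus': {'uri'},
}


def check_store_conf(store_conf: dict) -> None:
    """Raise ValueError if the backend is unknown or a required kwarg is missing."""
    backend = store_conf.get('type')
    if backend not in REQUIRED_KWARGS:
        raise ValueError(f'unsupported store type: {backend!r}')
    given = set(store_conf.get('kwargs', {}))
    missing = REQUIRED_KWARGS[backend] - given
    if missing:
        raise ValueError(f'{backend} store is missing required kwargs: {sorted(missing)}')


# '/tmp/milvus.db' is a placeholder path used only for this example.
check_store_conf({'type': 'milvus', 'kwargs': {'uri': '/tmp/milvus.db'}})
```

Failing early like this tends to give a clearer error than one raised deep inside the backend, though the real backends may validate their arguments differently.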
+
+If we use Milvus, we also need to pass the `doc_fields` parameter to `Document` to specify the fields to store, along with their types and other information. For example:
+
+```python
+doc_fields = {
+    'comment': DocField(data_type=DocField.DTYPE_VARCHAR, max_size=65535, default_value=' '),
+    'signature': DocField(data_type=DocField.DTYPE_VARCHAR, max_size=32, default_value=' '),
+}
+```
+
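+This configures two fields, `comment` and `signature`. `comment` is a string with a maximum length of 65535, and `signature` is a string with a maximum length of 32; both default to a single space (`' '`).
+
+Below is a complete example that uses Milvus as the storage backend: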
+
+<details>
+
+<summary>Full code (click to expand):</summary>
+
+```python
+# -*- coding: utf-8 -*-
+
+import os
+import lazyllm
+from lazyllm import bind, config
+from lazyllm.tools.rag import DocField
+import shutil
+
+class TmpDir:
+    def __init__(self):
+        self.root_dir = os.path.expanduser(os.path.join(config['home'], 'rag_for_example_ut'))
+        self.rag_dir = os.path.join(self.root_dir, 'rag_master')
+        os.makedirs(self.rag_dir, exist_ok=True)
+        self.store_file = os.path.join(self.root_dir, "milvus.db")
+
+    def __del__(self):
+        shutil.rmtree(self.root_dir)
+
+tmp_dir = TmpDir()
+
+milvus_store_conf = {
+    'type': 'milvus',
+    'kwargs': {
+        'uri': tmp_dir.store_file,
+        'embedding_index_type': 'HNSW',
+        'embedding_metric_type': 'COSINE',
+    },
+}
+
+doc_fields = {
+    'comment': DocField(data_type=DocField.DTYPE_VARCHAR, max_size=65535, default_value=' '),
+    'signature': DocField(data_type=DocField.DTYPE_VARCHAR, max_size=32, default_value=' '),
+}
+
+prompt = 'You will play the role of an AI Q&A assistant and complete a dialogue task.'\
+    ' In this task, you need to provide your answer based on the given context and question.'
+
+documents = lazyllm.Document(dataset_path=tmp_dir.rag_dir,
+                             embed=lazyllm.TrainableModule("bge-large-zh-v1.5"),
+                             manager=False,
+                             store_conf=milvus_store_conf,
+                             doc_fields=doc_fields)
+
+documents.create_node_group(name="block", transform=lambda s: s.split("\n") if s else '')
+
+with lazyllm.pipeline() as ppl:
+    ppl.retriever = lazyllm.Retriever(doc=documents, group_name="block", topk=3)
+
+    ppl.reranker = lazyllm.Reranker(name='ModuleReranker',
+                                    model="bge-reranker-large",
+                                    topk=1,
+                                    output_format='content',
+                                    join=True) | bind(query=ppl.input)
+
+    ppl.formatter = (
+        lambda nodes, query: dict(context_str=nodes, query=query)
+    ) | bind(query=ppl.input)
+
+    ppl.llm = lazyllm.TrainableModule('internlm2-chat-7b').prompt(
+        lazyllm.ChatPrompter(instruction=prompt, extro_keys=['context_str']))
+
+if __name__ == '__main__':
+    rag = lazyllm.ActionModule(ppl)
+    rag.start()
+    res = rag('何为天道？')
+    print(f'answer: {res}')
+```
+
+</details>
+
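+## Version-6: Custom Index Backend
+
+To speed up data retrieval and meet different retrieval needs, `LazyLLM` also supports specifying index backends for different storage backends, via the `indices` field in the `store_conf` parameter of `Document`. An index type configured in `indices` can then be used by a `Retriever` (selected with the `index` parameter).
+
+For example, to use dict-based key/value storage with Milvus as the retrieval backend for that storage, we can configure it like this: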
+
+```python
+milvus_store_conf = {
+    'type': 'map',
+    'indices': {
+        'smart_embedding_index': {
+            'backend': 'milvus',
+            'kwargs': {
+                'uri': store_file,
+                'embedding_index_type': 'HNSW',
+                'embedding_metric_type': 'COSINE',
+            },
+        },
+    },
+}
+```
+
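+The `type` parameter was introduced in Version-5 and is not repeated here. `indices` is a dict whose keys are index types and whose values are dicts whose contents depend on the index type.
+
+Currently `indices` supports only `smart_embedding_index`, which takes the following parameters:
+
+* `backend`: the index backend type used for embedding retrieval. Currently only `milvus` is supported;
+* `kwargs`: the parameters passed to the index backend. In this example, the parameters passed to the `milvus` backend are the same as those of the `milvus` storage backend described in the Version-5 section.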
+
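Because the index backend takes the same `kwargs` as the Milvus storage backend, one set of Milvus parameters can be reused for either role. A minimal sketch, assuming a hypothetical helper `with_milvus_index` (not a LazyLLM function) and a placeholder path `/tmp/milvus.db`:

```python
import copy


def with_milvus_index(store_conf: dict, milvus_kwargs: dict) -> dict:
    """Return a copy of store_conf with a Milvus-backed smart_embedding_index added."""
    conf = copy.deepcopy(store_conf)  # leave the caller's dict untouched
    conf.setdefault('indices', {})['smart_embedding_index'] = {
        'backend': 'milvus',
        'kwargs': dict(milvus_kwargs),
    }
    return conf


# Placeholder values for illustration only.
milvus_kwargs = {
    'uri': '/tmp/milvus.db',
    'embedding_index_type': 'HNSW',
    'embedding_metric_type': 'COSINE',
}

base_conf = {'type': 'map'}
map_conf = with_milvus_index(base_conf, milvus_kwargs)
print(map_conf)
```

The result has the same shape as the `milvus_store_conf` shown earlier in this section: a `map` store with a `smart_embedding_index` whose backend is `milvus`.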
+Below is a complete example that uses `milvus` as the index backend:
+
+<details>
+
+<summary>Full code (click to expand):</summary>
+
+```python
+# -*- coding: utf-8 -*-
+
+import os
+import lazyllm
+from lazyllm import bind
+import tempfile
+
+def run(query):
+    _, store_file = tempfile.mkstemp(suffix=".db")
+
+    milvus_store_conf = {
+        'type': 'map',
+        'indices': {
+            'smart_embedding_index': {
+                'backend': 'milvus',
+                'kwargs': {
+                    'uri': store_file,
+                    'embedding_index_type': 'HNSW',
+                    'embedding_metric_type': 'COSINE',
+                },
+            },
+        },
+    }
+
+    documents = lazyllm.Document(dataset_path="rag_master",
+                                 embed=lazyllm.TrainableModule("bge-large-zh-v1.5"),
+                                 manager=False,
+                                 store_conf=milvus_store_conf)
+
+    documents.create_node_group(name="sentences",
+                                transform=lambda s: s.split('。'))
+
+    prompt = 'You will play the role of an AI Q&A assistant and complete a dialogue task.'\
+        ' In this task, you need to provide your answer based on the given context and question.'
+
+    with lazyllm.pipeline() as ppl:
+        ppl.retriever = lazyllm.Retriever(doc=documents, group_name="sentences", topk=3,
+                                          index='smart_embedding_index')
+
+        ppl.reranker = lazyllm.Reranker(name='ModuleReranker',
+                                        model="bge-reranker-large",
+                                        topk=1,
+                                        output_format='content',
+                                        join=True) | bind(query=ppl.input)
+
+        ppl.formatter = (
+            lambda nodes, query: dict(context_str=nodes, query=query)
+        ) | bind(query=ppl.input)
+
+        ppl.llm = lazyllm.TrainableModule('internlm2-chat-7b').prompt(
+            lazyllm.ChatPrompter(instruction=prompt, extro_keys=['context_str']))
+
+        rag = lazyllm.ActionModule(ppl)
+        rag.start()
+        res = rag(query)
+
+    os.remove(store_file)
+
+    return res
+
+if __name__ == '__main__':
+    res = run('何为天道？')
+    print(f'answer: {res}')
+```
+
+</details>