Commit ec0aead

ouonline committed Dec 5, 2024
1 parent 2b5b15b commit ec0aead
Showing 3 changed files with 157 additions and 15 deletions.
62 changes: 59 additions & 3 deletions docs/en/Best Practice/rag.md
@@ -28,6 +28,10 @@ The Document constructor has the following parameters:
* `embed`: Uses the specified model to generate text embeddings. If you need to generate multiple embeddings for the text, specify them as a dictionary, where the key identifies the name of the embedding and the value is the corresponding embedding model.
* `manager`: Whether to use the UI interface, which affects the internal processing logic of `Document`; the default is `False`.
* `launcher`: The method of launching the service; it is used in cluster applications and can be ignored for single-machine applications.
* `store_conf`: Configures which storage backend and index backend to use.
* `doc_fields`: Configures the fields and corresponding types that need to be stored and retrieved (currently only used by the Milvus backend).

#### Node and NodeGroup

A `Document` instance may be further subdivided, according to specified rules (referred to as `Transformer`s in `LazyLLM`), into several sets of `Node`s with different granularities, known as `Node Group`s. Besides the document content, each `Node` also records which `Node` it was split from and which finer-grained `Node`s it was itself split into. Users can create their own `Node Group`s by using the `Document.create_node_group()` method.
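
For example, a sentence-level group can be created as follows (a minimal sketch; `SentenceSplitter` is one of the transformers bundled with `LazyLLM`, and `docs` is the `Document` instance constructed above):

```python
# Split each document into sentence-level nodes of about 512 tokens,
# with an overlap of 50 tokens between adjacent nodes.
docs.create_node_group(name='sentences', transform=SentenceSplitter,
                       chunk_size=512, chunk_overlap=50)
```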

@@ -91,6 +95,52 @@ The relationship of these `Node Group`s is shown in the diagram below:

These `Node Group`s have different granularities and rules, reflecting various characteristics of the document. In subsequent processing, we use these characteristics in different contexts to better judge the relevance between the document and the user's query content.

#### Store and Index

`LazyLLM` provides configurable storage and index backends, which can satisfy different storage and retrieval needs.

The configuration parameter `store_conf` is a `dict` containing the following fields:

* `type`: The type of storage backend. Currently supported storage backends include:
    - `map`: In-memory key/value storage.
    - `chroma`: Uses Chroma for data storage.
    - `milvus`: Uses Milvus for data storage.
* `indices`: A dictionary where the key is the name of the index type and the value is the parameters required by that index type. The currently supported index types are:
    - `smart_embedding_index`: Provides embedding retrieval functionality. The supported backends include:
        - `milvus`: Uses Milvus as the backend for embedding retrieval. The available parameters (`kwargs`) include:
            - `uri`: The Milvus storage address, which can be a file path or a URL in the format `ip:port`.
            - `embedding_index_type`: The type of embedding index supported by Milvus; the default is `HNSW`.
            - `embedding_metric_type`: The similarity metric used for retrieval, which depends on the embedding index type; the default is `COSINE`.

Here is an example configuration using Chroma as the storage backend and Milvus as the retrieval backend:

```python
store_conf = {
    'type': 'chroma',  # Chroma handles storage
    'indices': {
        'smart_embedding_index': {
            'backend': 'milvus',  # Milvus handles embedding retrieval
            'kwargs': {
                'uri': store_file,  # a file path or a URL in the format `ip:port`
                'embedding_index_type': 'HNSW',
                'embedding_metric_type': 'COSINE',
            },
        },
    },
}
```

Note: If using Milvus as the storage backend or index backend, you also need to provide a description of the fields to be stored or retrieved, passed in through the `doc_fields` parameter. `doc_fields` is a dictionary where the key is the name of the field to be stored or retrieved, and the value is a structure of type `GlobalMetadataDesc` (imported as `DocField` in the examples below), which includes information such as the field type.

For example, to store each document's author and publication year, you can configure it as follows:

```python
doc_fields = {
    'author': DocField(data_type=DataType.VARCHAR, max_size=128, default_value=' '),
    'public_year': DocField(data_type=DataType.INT32),
}
```

### Retriever

The documents in the document collection may not all be relevant to the content the user wants to query. Therefore, next we use the `Retriever` to select, from the `Document`, the documents relevant to the user's query.
@@ -109,7 +159,7 @@ The constructor of the `Retriever` has the following parameters:
* `group_name`: Specifies which `Node Group` of the document to use for retrieval. Use `LAZY_ROOT_NAME` to indicate that the retrieval should be performed on the original document content.
* `similarity`: Specifies the name of the function to calculate the similarity between a `Node` and the user's query content. The similarity calculation functions built into `LazyLLM` include `bm25`, `bm25_chinese`, and `cosine`. Users can also define their own calculation functions.
* `similarity_cut_off`: Discards results with a similarity less than the specified value. The default is `-inf`, which means no results are discarded. In a multi-embedding scenario, if you need to specify different values for different embeddings, this parameter needs to be specified in a dictionary format, where the key indicates which embedding is specified and the value indicates the corresponding threshold. If all embeddings use the same threshold, this parameter only needs to pass a single value.
* `index`: Specifies which index to search on; currently `default` and `smart_embedding_index` are supported (see the example after this list).
* `topk`: Specifies the number of most relevant documents to return. The default value is 6.
* `embed_keys`: Indicates which embeddings to use for retrieval. If not specified, all embeddings will be used for retrieval.
* `similarity_kw`: Parameters that need to be passed through to the `similarity` function.
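
For example, to search on the Milvus-backed `smart_embedding_index` configured earlier (a minimal sketch; it assumes embedding retrieval is delegated to the index backend, so no `similarity` function is passed):

```python
retriever = Retriever(documents, group_name='sentence',
                      index='smart_embedding_index', topk=3)
```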
@@ -145,10 +195,16 @@ def dummy_similarity_func(query: List[float], nodes: List[DocNode], **kwargs) ->
def dummy_similarity_func(query: List[float], node: DocNode, **kwargs) -> float:
```
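
For reference, a function with the second signature above could be registered roughly as follows. This is only a sketch: the import paths are assumptions, and it presumes `register_similarity` can be used as a decorator with the `mode` option described above.

```python
from typing import List

from lazyllm.tools.rag import register_similarity  # assumed import path
from lazyllm.tools.rag import DocNode              # assumed import path

@register_similarity(mode='embedding')  # the query arrives as an embedding vector
def dummy_similarity_func(query: List[float], node: DocNode, **kwargs) -> float:
    # Placeholder: a real implementation would compare `query` with the
    # node's embedding and return a similarity score.
    return 0.0
```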

When called, a `Retriever` instance takes the query string along with an optional `filters` argument for field-based filtering. `filters` is a dictionary whose keys are the fields to filter on and whose values are lists of acceptable values: a condition is satisfied if the field's value matches any value in its list, and a node is returned only when all conditions are satisfied.

Here is an example of using filters:

```python
filters = {
"author": ["A", "B", "C"],
"public_year": [2002, 2003, 2004],
}
doc_list = retriever(query=query, filters=filters)
```

### Reranker
68 changes: 61 additions & 7 deletions docs/zh/Best Practice/rag.md
@@ -27,7 +27,11 @@ docs = Document(dataset_path='/path/to/doc/dir', embed=MyEmbeddingModule(), mana
* `dataset_path`:指定从哪个文件目录构建;
* `embed`:使用指定的模型来对文本进行 embedding。 如果需要对文本生成多个 embedding,此处需要通过字典的方式指定,key 标识 embedding 的名字,value 为对应的 embedding 模型;
* `manager`:是否使用 ui 界面,会影响 `Document` 内部的处理逻辑;默认为 `False`;
* `launcher`:启动服务的方式,集群应用会用到这个参数,单机应用可以忽略;
* `store_conf`:配置使用哪种存储后端及索引后端;
* `doc_fields`:配置需要存储和检索的字段及对应的类型(目前只有 Milvus 后端会用到)。

#### Node 和 NodeGroup

一个 `Document` 实例可能会按照指定的规则(在 `LazyLLM` 中被称为 `Transformer`),被进一步细分成若干粒度不同的、被称为 `Node` 的节点集合(`Node Group`)。这些 `Node` 除了包含文档内容外,还记录了自己是从哪一个 `Node` 拆分而来,以及本身又被拆分成哪些更细粒度的 `Node`。用户可以通过 `Document.create_node_group()` 来创建自己的 `Node Group`。

@@ -91,6 +95,52 @@ docs.create_node_group(name='sentence-len',

这些 `Node Group` 的拆分粒度和规则各不相同,反映了文档不同方面的特征。在后续的处理中,我们通过在不同的场合使用这些特征,从而更好地判断文档和用户输入的查询内容的相关性。

#### 存储和索引

`LazyLLM` 提供了可配置存储和索引后端的功能,可以满足不同的存储和检索需求。

配置项参数 `store_conf` 是一个 dict,包含的字段如下:

* `type`:存储后端类型。目前支持的存储后端有:
    - `map`:内存 key/value 存储;
    - `chroma`:使用 Chroma 存储数据;
    - `milvus`:使用 Milvus 存储数据。
* `indices`:是一个 dict,key 是索引类型名称,value 是该索引类型所需要的参数。索引类型目前支持:
    - `smart_embedding_index`:提供 embedding 检索功能。支持的后端有:
        - `milvus`:使用 Milvus 作为 embedding 检索的后端。可供使用的参数 `kwargs` 有:
            - `uri`:Milvus 存储地址,可以是一个文件路径或者如 `ip:port` 格式的 url;
            - `embedding_index_type`:Milvus 支持的 embedding 索引类型,默认是 `HNSW`;
            - `embedding_metric_type`:检索所用的相似度度量,随 embedding 索引类型而定,默认是 `COSINE`。

下面是一个使用 Chroma 作为存储后端,Milvus 作为检索后端的配置样例:

```python
store_conf = {
    'type': 'chroma',  # 使用 Chroma 存储数据
    'indices': {
        'smart_embedding_index': {
            'backend': 'milvus',  # 使用 Milvus 做 embedding 检索
            'kwargs': {
                'uri': store_file,  # 文件路径或 `ip:port` 格式的地址
                'embedding_index_type': 'HNSW',
                'embedding_metric_type': 'COSINE',
            },
        },
    },
}
```

注意:如果使用 Milvus 作为存储后端或者索引后端,还需要提供需要存储或检索的字段说明,通过 `doc_fields` 这个参数传入。`doc_fields` 是一个 dict,其中 key 为需要存储或检索的字段名称,value 是一个 `GlobalMetadataDesc` 类型的结构体(在下面的示例中以 `DocField` 引入),包含字段类型等信息。

例如,如果需要存储文档的作者信息和发表年份,可以这样配置:

```python
doc_fields = {
    'author': DocField(data_type=DataType.VARCHAR, max_size=128, default_value=' '),
    'public_year': DocField(data_type=DataType.INT32),
}
```

### Retriever

文档集合中的文档不一定都和用户要查询的内容相关,因此接下来我们要使用 `Retriever` 从 `Document` 中筛选出和用户查询相关的文档。
@@ -109,9 +159,9 @@ retriever = Retriever(documents, group_name="sentence", similarity="cosine", top
* `group_name`:要使用文档的哪个 `Node Group` 来检索,使用 `LAZY_ROOT_NAME` 表示在原始文档内容中进行检索;
* `similarity`:指定用来计算 `Node` 和用户查询内容之间的相似度的函数名称,`LazyLLM` 内置的相似度计算函数有 `bm25``bm25_chinese``cosine`,用户也可以自定义自己的计算函数;
* `similarity_cut_off`:丢弃相似度小于指定值的结果,默认为 `-inf`,表示不丢弃。 在多 embedding 场景下,如果需要对不同的 embedding 指定不同的值,则该参数需要以字典的方式指定,key 表示指定的是哪个 embedding, value 表示相应的阈值。如果所有 embedding 使用同一个阈值,则此参数只传一个数值即可;
* `index`:在哪个索引上进行查找,目前支持 `default` 和 `smart_embedding_index`(见列表后的示例);
* `topk`:表示返回最相关的文档数,默认值为 6;
* `embed_keys`:表示通过哪些 embedding 做检索,不指定表示用全部 embedding 进行检索;
* `similarity_kw`:需要透传给 `similarity` 函数的参数。
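
例如,要在前面配置的 Milvus 索引后端上通过 `smart_embedding_index` 进行检索,可以这样构造(仅为示意,假设 embedding 检索交由索引后端完成,因此不传入 `similarity`):

```python
retriever = Retriever(documents, group_name='sentence',
                      index='smart_embedding_index', topk=3)
```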

用户可以通过使用 `LazyLLM` 提供的 `register_similarity()` 函数来注册自己的相似度计算函数。`register_similarity()` 有以下参数:
@@ -145,14 +195,18 @@ def dummy_similarity_func(query: List[float], nodes: List[DocNode], **kwargs) ->
def dummy_similarity_func(query: List[float], node: DocNode, **kwargs) -> float:
```
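
作为参考,上面第二种签名的函数大致可以这样注册。以下仅为示意:导入路径为假设,并假定 `register_similarity` 可以像前文所述那样以装饰器方式使用并接收 `mode` 参数。

```python
from typing import List

from lazyllm.tools.rag import register_similarity  # 假设的导入路径
from lazyllm.tools.rag import DocNode              # 假设的导入路径

@register_similarity(mode='embedding')  # query 以 embedding 向量形式传入
def dummy_similarity_func(query: List[float], node: DocNode, **kwargs) -> float:
    # 占位实现:真实实现应比较 query 与节点 embedding 并返回相似度分数
    return 0.0
```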

`Retriever` 实例使用时需要传入要查询的 `query` 字符串,以及可选的过滤器 `filters` 用于字段过滤。`filters` 是一个 dict,其中 key 是要过滤的字段,value 是一个可取值列表:只要该字段的值匹配列表中的任意一个值,即满足该条件;只有当所有条件都满足时,该 node 才会被返回。

下面是使用 filters 的例子:

```python
filters = {
"author": ["A", "B", "C"],
"public_year": [2002, 2003, 2004],
}
doc_list = retriever(query=query, filters=filters)
```


### Reranker

当我们从最初的文档集合筛选出和用户查询相关性比较高的文档后,接下来就可以进一步对这些文档进行排序,选出更贴合用户查询内容的文档。这一步工作由 `Reranker` 来完成。
42 changes: 37 additions & 5 deletions lazyllm/docs/tools.py
@@ -26,6 +26,8 @@
embed (Optional[Union[Callable, Dict[str, Callable]]]): The object used to generate document embeddings. If you need to generate multiple embeddings for the text, you need to specify multiple embedding models in a dictionary format. The key identifies the name corresponding to the embedding, and the value is the corresponding embedding model.
manager (bool, optional): A flag indicating whether to create a user interface for the document module. Defaults to False.
launcher (optional): An object or function responsible for launching the server module. If not provided, the default asynchronous launcher from `lazyllm.launchers` is used (`sync=False`).
store_conf (optional): Configure which storage backend and index backend to use.
doc_fields (optional): Configure the fields that need to be stored and retrieved along with their corresponding types (currently only used by the Milvus backend).
''')

add_chinese_doc('Document', '''\
@@ -36,8 +38,10 @@
Args:
dataset_path (str): 数据集目录的路径。此目录应包含要由文档模块管理的文档。
embed (Optional[Union[Callable, Dict[str, Callable]]]): 用于生成文档 embedding 的对象。如果需要对文本生成多个 embedding,此处需要通过字典的方式指定多个 embedding 模型,key 标识 embedding 对应的名字, value 为对应的 embedding 模型。
manager (bool, optional): 指示是否为文档模块创建用户界面的标志。默认为 False。
launcher (optional): 负责启动服务器模块的对象或函数。如果未提供,则使用 `lazyllm.launchers` 中的默认异步启动器 (`sync=False`)。
store_conf (optional): 配置使用哪种存储后端和索引后端。
doc_fields (optional): 配置需要存储和检索的字段及对应的类型(目前只有 Milvus 后端会用到)。
''')

add_example('Document', '''\
@@ -47,6 +51,25 @@
>>> documents = Document(dataset_path='your_doc_path', embed=m, manager=False) # or documents = Document(dataset_path='your_doc_path', embed={"key": m}, manager=False)
>>> m1 = lazyllm.TrainableModule("bge-large-zh-v1.5").start()
>>> document1 = Document(dataset_path='your_doc_path', embed={"online": m, "local": m1}, manager=False)
>>> store_conf = {
...     'type': 'chroma',
...     'indices': {
...         'smart_embedding_index': {
...             'backend': 'milvus',
...             'kwargs': {
...                 'uri': '/tmp/tmp.db',
...                 'embedding_index_type': 'HNSW',
...                 'embedding_metric_type': 'COSINE',
...             },
...         },
...     },
... }
>>> doc_fields = {
...     'author': DocField(data_type=DataType.VARCHAR, max_size=128, default_value=' '),
...     'public_year': DocField(data_type=DataType.INT32),
... }
>>> document2 = Document(dataset_path='your_doc_path', embed={"online": m, "local": m1}, store_conf=store_conf, doc_fields=doc_fields)
''')

add_english_doc('Document.create_node_group', '''
@@ -152,7 +175,7 @@
... with open(file, 'r') as f:
... data = f.read()
... return [DocNode(text=data, metadata=extra_info or {})]
...
>>> doc1 = Document(dataset_path="your_files_path", create_ui=False)
>>> doc2 = Document(dataset_path="your_files_path", create_ui=False)
>>> files = ["your_yml_files"]
@@ -191,7 +214,7 @@
... data = yaml.safe_load(f)
... print("Call the class YmlReader.")
... return [DocNode(text=data, metadata=extra_info or {})]
...
>>> def processYml(file, extra_info=None):
... with open(file, 'r') as f:
... data = f.read()
@@ -248,7 +271,7 @@
... data = yaml.safe_load(f)
... print("Call the class YmlReader.")
... return [DocNode(text=data, metadata=extra_info or {})]
...
>>> files = ["your_yml_files"]
>>> doc = Document(dataset_path="your_files_path", create_ui=False)
>>> reader = doc._impl._reader.load_data(input_files=files)
@@ -329,7 +352,7 @@
doc: 文档模块实例。该文档模块可以是单个实例,也可以是一个实例的列表。如果是单个实例,表示对单个Document进行检索,如果是实例的列表,则表示对多个Document进行检索。
group_name: 在哪个 node group 上进行检索。
similarity: 用于设置文档检索的相似度函数。默认为 'dummy'。候选集包括 ["bm25", "bm25_chinese", "cosine"]。
similarity_cut_off: 当相似度低于指定值时丢弃该文档。在多 embedding 场景下,如果需要对不同的 embedding 指定不同的值,则需要使用字典的方式指定,key 表示指定的是哪个 embedding,value 表示相应的阈值。如果所有的 embedding 使用同一个阈值,则只指定一个数值即可。
index: 用于文档检索的索引类型。目前支持 'default' 和 'smart_embedding_index'。
topk: 表示取相似度最高的多少篇文档。
embed_keys: 表示通过哪些 embedding 做检索,不指定表示用全部 embedding 进行检索。
@@ -359,6 +382,15 @@
>>> document2.create_node_group(name='sentences', transform=SentenceSplitter, chunk_size=512, chunk_overlap=50)
>>> retriever2 = Retriever([document1, document2], group_name='sentences', similarity='cosine', similarity_cut_off=0.4, embed_keys=['local'], topk=3)
>>> print(retriever2("user query"))
>>> filters = {
...     "author": ["A", "B", "C"],
...     "public_year": [2002, 2003, 2004],
... }
>>> document3 = Document(dataset_path='/path/to/user/data', embed={'online':m , 'local': m1}, manager=False)
>>> document3.create_node_group(name='sentences', transform=SentenceSplitter, chunk_size=512, chunk_overlap=50)
>>> retriever3 = Retriever([document1, document3], group_name='sentences', similarity='cosine', similarity_cut_off=0.4, embed_keys=['local'], topk=3)
>>> print(retriever3(query="user query", filters=filters))
''')

# ---------------------------------------------------------------------------- #
Expand Down
