[Feature][Transform] Add LLM transform #7303

Merged: 4 commits, Aug 7, 2024
122 changes: 122 additions & 0 deletions docs/en/transform-v2/llm.md
@@ -0,0 +1,122 @@
# LLM

> LLM transform plugin

## Description

Process data with a large language model (LLM) by sending it to the LLM and receiving the generated
results. Use the LLM to label, clean, or enrich data, perform data inference, and more.

## Options

| name | type | required | default value |
|------------------|--------|----------|--------------------------------------------|
| model_provider | enum | yes | |
| output_data_type | enum | no | STRING |
| prompt | string | yes | |
| model | string | yes | |
| api_key | string | yes | |
| openai.api_path | string | no | https://api.openai.com/v1/chat/completions |

### model_provider

The model provider to use. Currently the only available option is `OPENAI`.

### output_data_type

The data type of the output data. The available options are:
`STRING`, `INT`, `BIGINT`, `DOUBLE`, `BOOLEAN`.
The default value is `STRING`.
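
As a minimal sketch of setting this option, the transform block below asks for a boolean result. The prompt wording is an assumption made for illustration, and how the model's reply is converted to the target type is not described here, so the prompt should steer the model toward a reply that fits the chosen type.

```hocon
transform {
  LLM {
    model_provider = OPENAI
    model = gpt-4o-mini
    api_key = sk-xxx
    # Hypothetical prompt: request a bare true/false answer to match BOOLEAN
    prompt = "Answer only true or false: is this person 18 or older?"
    output_data_type = BOOLEAN
  }
}
```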

### prompt

The prompt to send to the LLM. This parameter defines how the LLM processes and returns data. For example:

The data read from the source is a table like this:

| name | age |
|---------------|-----|
| Jia Fan | 20 |
| Hailin Wang | 20 |
| Eric | 20 |
| Guangdong Liu | 20 |

The prompt can be:

```
Determine whether someone is Chinese or American by their name
```

The result will be:

| name | age | llm_output |
|---------------|-----|------------|
| Jia Fan | 20 | Chinese |
| Hailin Wang | 20 | Chinese |
| Eric | 20 | American |
| Guangdong Liu | 20 | Chinese |

### model

The model to use. Different model providers offer different models. For example, an OpenAI model can be `gpt-4o-mini`.
If you use an OpenAI model, see https://platform.openai.com/docs/models/model-endpoint-compatibility for the models supported by the `/v1/chat/completions` endpoint.

### api_key

The API key to use for the model provider.
If you use an OpenAI model, see https://platform.openai.com/docs/api-reference/api-keys for how to get an API key.

### openai.api_path

The API path to use for the OpenAI model provider. In most cases, you do not need to change this setting. If you are using an API proxy service, you may need to set it to the proxy's API address.
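
A minimal sketch of overriding the path, assuming a proxy that exposes an OpenAI-compatible chat completions endpoint. The host `llm-proxy.example.com` is a placeholder, not a real service. The URL is quoted because an unquoted `//` would start a comment in HOCON.

```hocon
transform {
  LLM {
    model_provider = OPENAI
    model = gpt-4o-mini
    api_key = sk-xxx
    prompt = "Determine whether someone is Chinese or American by their name"
    # Hypothetical proxy address; replace with your proxy's endpoint
    openai.api_path = "https://llm-proxy.example.com/v1/chat/completions"
  }
}
```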

### common options [string]

Transform plugin common parameters. See [Transform Plugin](common-options.md) for details.

## Example

Determine the user's country through an LLM.

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    row.num = 5
    schema = {
      fields {
        id = "int"
        name = "string"
      }
    }
    rows = [
      {fields = [1, "Jia Fan"], kind = INSERT}
      {fields = [2, "Hailin Wang"], kind = INSERT}
      {fields = [3, "Tomas"], kind = INSERT}
      {fields = [4, "Eric"], kind = INSERT}
      {fields = [5, "Guangdong Liu"], kind = INSERT}
    ]
  }
}

transform {
  LLM {
    model_provider = OPENAI
    model = gpt-4o-mini
    api_key = sk-xxx
    prompt = "Determine whether someone is Chinese or American by their name"
  }
}

sink {
  console {
  }
}
```

120 changes: 120 additions & 0 deletions docs/zh/transform-v2/llm.md
@@ -0,0 +1,120 @@
# LLM

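> LLM transform plugin

## Description

Process data with a large language model (LLM) by sending it to the LLM and receiving the generated results. Use the LLM to label, clean, or enrich data, perform data inference, and more.

## Options

| name | type | required | default value |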
|------------------|--------|------|--------------------------------------------|
| model_provider | enum | yes | |
| output_data_type | enum | no | STRING |
| prompt | string | yes | |
| model | string | yes | |
| api_key | string | yes | |
| openai.api_path | string | no | https://api.openai.com/v1/chat/completions |

### model_provider

The model provider to use. Currently the only available option is `OPENAI`.

### output_data_type

The data type of the output data. The available options are:
`STRING`, `INT`, `BIGINT`, `DOUBLE`, `BOOLEAN`.
The default value is `STRING`.

### prompt

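The prompt to send to the LLM. This parameter defines how the LLM processes and returns data. For example:

The data read from the source is a table like this: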

| name | age |
|---------------|-----|
| Jia Fan | 20 |
| Hailin Wang | 20 |
| Eric | 20 |
| Guangdong Liu | 20 |

We can use the following prompt:

```
Determine whether someone is Chinese or American by their name
```

This returns:

| name | age | llm_output |
|---------------|-----|------------|
| Jia Fan | 20 | Chinese |
| Hailin Wang | 20 | Chinese |
| Eric | 20 | American |
| Guangdong Liu | 20 | Chinese |

### model

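The model to use. Different model providers offer different models. For example, an OpenAI model can be `gpt-4o-mini`.
If you use an OpenAI model, see https://platform.openai.com/docs/models/model-endpoint-compatibility for the models supported by the `/v1/chat/completions` endpoint.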

### api_key

The API key to use for the model provider.
If you use an OpenAI model, see https://platform.openai.com/docs/api-reference/api-keys for how to get an API key.

### openai.api_path

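The API path to use for the OpenAI model provider. In most cases, you do not need to change this setting. If you are using an API proxy service, you may need to set it to the proxy's API address.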

### common options [string]

Transform plugin common parameters. See [Transform Plugin](common-options.md) for details.

## Example

Determine the user's country through an LLM.

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    row.num = 5
    schema = {
      fields {
        id = "int"
        name = "string"
      }
    }
    rows = [
      {fields = [1, "Jia Fan"], kind = INSERT}
      {fields = [2, "Hailin Wang"], kind = INSERT}
      {fields = [3, "Tomas"], kind = INSERT}
      {fields = [4, "Eric"], kind = INSERT}
      {fields = [5, "Guangdong Liu"], kind = INSERT}
    ]
  }
}

transform {
  LLM {
    model_provider = OPENAI
    model = gpt-4o-mini
    api_key = sk-xxx
    prompt = "Determine whether someone is Chinese or American by their name"
  }
}

sink {
  console {
  }
}
```

@@ -0,0 +1,90 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.seatunnel.e2e.transform;

import org.apache.seatunnel.e2e.common.TestResource;
import org.apache.seatunnel.e2e.common.container.TestContainer;

import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.TestTemplate;
import org.testcontainers.containers.Container;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.output.Slf4jLogConsumer;
import org.testcontainers.containers.wait.strategy.HttpWaitStrategy;
import org.testcontainers.lifecycle.Startables;
import org.testcontainers.utility.DockerImageName;
import org.testcontainers.utility.DockerLoggerFactory;
import org.testcontainers.utility.MountableFile;

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Optional;
import java.util.stream.Stream;

public class TestLLMIT extends TestSuiteBase implements TestResource {
    private static final String TMP_DIR = "/tmp";
    private GenericContainer<?> mockserverContainer;
    private static final String IMAGE = "mockserver/mockserver:5.14.0";

    @BeforeAll
    @Override
    public void startUp() {
        // Locate the MockServer expectations file on the test classpath.
        Optional<URL> resource =
                Optional.ofNullable(TestLLMIT.class.getResource("/mockserver-config.json"));
        // Start MockServer on the shared test network under the alias "mockserver".
        this.mockserverContainer =
                new GenericContainer<>(DockerImageName.parse(IMAGE))
                        .withNetwork(NETWORK)
                        .withNetworkAliases("mockserver")
                        .withExposedPorts(1080)
                        .withCopyFileToContainer(
                                MountableFile.forHostPath(
                                        new File(
                                                        resource.orElseThrow(
                                                                        () ->
                                                                                new IllegalArgumentException(
                                                                                        "Can not get config file of mockServer"))
                                                                .getPath())
                                                .getAbsolutePath()),
                                TMP_DIR + "/mockserver-config.json")
                        // Load the expectations from the copied JSON file at startup.
                        .withEnv(
                                "MOCKSERVER_INITIALIZATION_JSON_PATH",
                                TMP_DIR + "/mockserver-config.json")
                        .withEnv("MOCKSERVER_LOG_LEVEL", "WARN")
                        .withLogConsumer(new Slf4jLogConsumer(DockerLoggerFactory.getLogger(IMAGE)))
                        // MockServer returns 404 for requests that match no expectation,
                        // so a 404 on "/" means the server is up.
                        .waitingFor(new HttpWaitStrategy().forPath("/").forStatusCode(404));
        Startables.deepStart(Stream.of(mockserverContainer)).join();
    }

    @AfterAll
    @Override
    public void tearDown() throws Exception {
        if (mockserverContainer != null) {
            mockserverContainer.stop();
        }
    }

    @TestTemplate
    public void testLLMWithOpenAI(TestContainer container)
            throws IOException, InterruptedException {
        // Run the job and expect a clean exit.
        Container.ExecResult execResult = container.executeJob("/llm_openai_transform.conf");
        Assertions.assertEquals(0, execResult.getExitCode());
    }
}
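
The job file `/llm_openai_transform.conf` is not shown in this excerpt. As a rough sketch only, a transform block that routes requests to the mock server instead of OpenAI might look like the following. The host alias `mockserver` and port `1080` come from the test above; the URL path, the `api_key` value, and the prompt are assumptions.

```hocon
transform {
  LLM {
    model_provider = OPENAI
    model = gpt-4o-mini
    # Assumed placeholder; a mock server would not need a real key
    api_key = sk-xxx
    prompt = "Determine whether someone is Chinese or American by their name"
    # Assumed path; host alias and port taken from the container setup
    openai.api_path = "http://mockserver:1080/v1/chat/completions"
  }
}
```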