
Testing 产品生产统计表.xlsx: the agent cannot correctly extract the table headers #124

Closed
AngGaGim opened this issue Nov 28, 2024 · 9 comments
Labels: bug (Something isn't working)

@AngGaGim

No description provided.

@AngGaGim AngGaGim added the bug Something isn't working label Nov 28, 2024
@edwardzjl
Contributor

Could you provide the data file you used, or a screenshot of part of the data?

@AngGaGim
Author

> Could you provide the data file you used, or a screenshot of part of the data?

Hi! The test data is the one provided in this repository.

[image]

@weekenthralling
Contributor

First, please confirm the version of tablegpt-agent you are using; it needs to be the latest release.

Then take a look at the Normalize Datasets documentation.

Note that handling irregular tables partly depends on the capability of the model itself.
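For context on what "irregular table" means here: the file in this issue has a header spanning two rows (本日产量 and 耗费工时 each split into sub-columns on the next row), which is why a plain read yields `Unnamed: N` columns. A minimal, hypothetical pandas sketch (not tablegpt-agent's actual normalization code) of flattening such a two-row header:

```python
import pandas as pd

# Mimic what a plain Excel read returns for such a file: rows 0-1 are the
# split header (e.g. 本日产量 over 预计/实际), the rest is data.
raw = pd.DataFrame(
    [
        ["生产日期", "制造编号", "本日产量", None],
        [None, None, "预计", "实际"],
        ["2007-08-10", "FK-001", 40000, 45000],
        ["2007-08-11", "FK-002", 40000, 44000],
    ]
)

# Forward-fill the top header row so spanned cells repeat, then join the
# two header rows into flat column names.
top = raw.iloc[0].ffill()
sub = raw.iloc[1]
columns = [t if pd.isna(s) else f"{t}{s}" for t, s in zip(top, sub)]

df = raw.iloc[2:].reset_index(drop=True)
df.columns = columns
print(df.columns.tolist())  # ['生产日期', '制造编号', '本日产量预计', '本日产量实际']
```

This is only a manual illustration of the problem; the normalize-datasets flow discussed below asks an LLM to generate equivalent code, which is why its success depends on model capability.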

@AngGaGim
Author

> First, please confirm the version of tablegpt-agent you are using; it needs to be the latest release.
>
> Then take a look at the Normalize Datasets documentation.
>
> Note that handling irregular tables partly depends on the capability of the model itself.

Thanks!!

@AngGaGim
Author

AngGaGim commented Nov 29, 2024

> First, please confirm the version of tablegpt-agent you are using; it needs to be the latest release.
> Then take a look at the Normalize Datasets documentation.
> Note that handling irregular tables partly depends on the capability of the model itself.

Hi, running normalize-datasets.ipynb raises a langchain parsing error:

Failed to generate normalization code: Could not parse output: 生产日期,制造编号,产品名称,预定产量,预计本日产量,实际本日产量,累计产量,本日耗费工时,累计耗费工时
2007-08-10,FK-001,猕猴桃果肉饮料,100000,40000,45000,83000,10,20
2007-08-11,FK-002,西瓜果肉饮料,100000,40000,44000,82000,9,18
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': '/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx'}]}, response_metadata={}, id='4fee9d-415f-be95-36b80b48d3ec'), AIMessage(content="我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据ad_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)\n```", additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查input': "# Load the data into a DataFrame\ndf = read_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in colues\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'model_type': None}, response_metadata={}, id='abf1eb50-67ef-4ad6-83eb-41d1662a3f79', tool_calls=[{'name': 'python', 'args': {'query': "# Load the data into a DataFrame\ndf = read_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in column names\nmns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'id': '88dd1283-3c79-4909-959b-52a89880dcd3', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': "```pycon\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 28 entries, 0 to 27\nData columns (total 9 columns):\n #   Column      Non-Null 
Count  Dtype \n---  ------      --------------  ----- \n 0   产品生产统计表3 non-null     object\n 1   Unnamed: 1  26 non-null     object\n 2   Unnamed: 2  21 non-null     object\n 3   Unnamed: 3  21 non-null     object\n 4   Unnamed: 4  23 non-null     object\n 5   Unnamed: 5  22 non-null     object\n 6   Unnamed: 6  26 non-null     object\n 7   Unnamed: 7  22 non-null     object\n 8   Unnamed: 8  21 non-null     object\ndtypes: object(9)\n```"}], name='python', id='347b8bc2-117a-4302-9058-ad44ba768865', tool_call_id='88dd1283-3c79-4909-959b-52a89880dcd3', artifact=[]), AIMessage(content='接下来我将用 `df.head(5)` 来查 行。\n```python\n# Show the first 5 rows to understand the structure\ndf.head(5)\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看, 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='6f19beea-5405-45e4-a726-9e1802d88ebb', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'id': 'fc7e2cfa-662e-47c2-98b7-758ea8627a0c', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': "```pycon\n---------------------------------------------------------------------------\nNameError                                 Traceback (most recent call last)\nCell In[1], line 2\n      1 # Show the first 5 rows to understand the structure\n----> 2 df.head(5)\n\nNameError: name 'df' is not defined\n```"}], name='python', id='8c442d16-b741-469f-9b5e-16512a631d87', tool_call_id='fc7e2cfa-662e-47c2-98b7-758ea8627a0c', artifact=[]), AIMessage(content='我已经了解了数据集 /data/develolegpt2/examples/datasets/产品生产统计表_origin.xlsx 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='607abcda-a6d1-4a47-a51f-7b884651b6e9')]

The code I ran:

import asyncio
from datetime import date
from typing import TypedDict
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from pybox import LocalPyBoxManager
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
from tablegpt.agent import create_tablegpt_graph
from tablegpt.agent.file_reading import Stage

print(DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
class Attachment(TypedDict):
    """Contains at least one dictionary with the key filename."""

    filename: str
    """The dataset uploaded in this session can be a filename, file path, or object storage address."""


# tablegpt-agent fully supports async invocation
async def main() -> None:
    llm = ChatOpenAI(
        openai_api_base="http://localhost:11111/v1",
        openai_api_key="sk-zZD420F648F1826355455eEaD881",
        model_name="TableGPT2-7B",
    )
    normalize_llm = ChatOpenAI(
        openai_api_base="https://myart/v1",
        openai_api_key="sk-c79c",
        model_name="gpt-4o"
    )

    # Use local pybox manager for development and testing
    pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

    agent = create_tablegpt_graph(
        llm=llm,
        pybox_manager=pybox_manager,
        normalize_llm=normalize_llm,
    )

    attachment_msg = HumanMessage(
        content="",
        # Please make sure your iPython kernel can access your filename.
        additional_kwargs={"attachments": [Attachment(filename="/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx")]},
    )

    response = await agent.ainvoke(
        input={
            "entry_message": attachment_msg,
            "processing_stage": Stage.UPLOADED,
            "messages": [attachment_msg],
            "parent_id": "some-parent-id1",
            "date": date.today(),
        },
        config={
            # Using checkpointer requires binding thread_id at runtime.
            "configurable": {"thread_id": "some-thread-id"},
        },
    )

    print(response["messages"])

asyncio.run(main())

Versions: tablegpt-agent 0.2.15, langchain-core 0.3.21, langchain-openai 0.2.10, langchain-qdrant 0.2.0, langgraph 0.2.53, langgraph-checkpoint 2.0.6, langgraph-sdk 0.1.39

@weekenthralling
Contributor

weekenthralling commented Nov 29, 2024

@AngGaGim I noticed a plaintext API key in the code snippet you posted; please take care to protect your credentials.

As I mentioned at the start, normalize-datasets relies on the capability of a large model, and it is currently experimental.

@weekenthralling
Contributor

weekenthralling commented Dec 2, 2024

> (quoting AngGaGim's previous comment above in full)

Sorry, I may have slightly misunderstood you. Looking closely at your output, there is this error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 2
      1 # Show the first 5 rows to understand the structure
----> 2 df.head(5)

NameError: name 'df' is not defined

To resolve this error, pass session_id='xxxxx' when calling create_tablegpt_graph; it is a required parameter for the file-reading graph.

As for whether the data can be converted into a standard format, that depends on the capability of the model.

@jianpugh

I used this example with both normalize_llm and llm set to the TableGPT2 7B model from Hugging Face, and it does not seem to extract the table correctly.

The code is as follows:
from pathlib import Path
from langchain_openai import ChatOpenAI
from pybox import LocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR

llm = ChatOpenAI(base_url="http://xxx:xxx/v1", api_key="TableGPT2-7B_api_key", model_name="TableGPT2-7B")
normalize_llm = ChatOpenAI(base_url="http://xxx:xxx/v1", api_key="TableGPT2-7B_api_key", model_name="TableGPT2-7B")
pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

agent = create_tablegpt_graph(
    llm=llm,
    pybox_manager=pybox_manager,
    normalize_llm=normalize_llm,
    session_id="some-session-id",  # This is required when using file-reading
)

from typing import TypedDict
from langchain_core.messages import HumanMessage

class Attachment(TypedDict):
    """Contains at least one dictionary with the key filename."""
    filename: str

attachment_msg = HumanMessage(
    content="",
    # Please make sure your iPython kernel can access your filename.
    additional_kwargs={"attachments": [Attachment(filename=r"examples\datasets\产品生产统计表.xlsx")]},
)

from datetime import date
from tablegpt.agent.file_reading import Stage

# Reading and processing files.
response = await agent.ainvoke(
    input={
        "entry_message": attachment_msg,
        "processing_stage": Stage.UPLOADED,
        "messages": [attachment_msg],
        "parent_id": "some-parent-id1",
        "date": date.today(),
    },
    config={
        # Using checkpointer requires binding thread_id at runtime.
        "configurable": {"thread_id": "some-thread-id"},
    },
)

response["messages"]

The console printed the following:

Failed to generate normalization code: Could not parse output: ['产品生产统计表', '生产日期', '制造编号', '产品名称', '预定产量', '本日产量预计', '本日产量实际', '累计产量', '耗费工时本日', '耗费工时累计', '2007-08-10', 'FK-001', '猕猴桃果肉饮料', 100000, 40000, 45000, 83000, 10, 20, '2007-08-11', 'FK-002', '西瓜果肉饮料', 100000, 40000, 44000, 82000, 9, 18]
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE
[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': 'examples\datasets\产品生产统计表.xlsx'}]}, response_metadata={}, id='6739e0ae-19ea-4aba-b286-b239f9ef8e7c'),
AIMessage(content="我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 df 变量中,并通过 df.info 查看 NaN 情况和数据类型。\npython\n# Load the data into a DataFrame\ndf = read_df('examples\\datasets\\产品生产统计表.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)\n", additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 df 变量中,并通过 df.info 查看 NaN 情况和数据类型。', 'action': {'tool': 'python', 'tool_input': "# Load the data into a DataFrame\ndf = read_df('examples\datasets\产品生产统计表.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'model_type': None}, response_metadata={}, id='6c536010-04aa-463c-b95b-78f2c5c39436', tool_calls=[{'name': 'python', 'args': {'query': "# Load the data into a DataFrame\ndf = read_df('examples\datasets\产品生产统计表.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'id': '85fd9d17-1b87-4f92-857e-afcfa072a985', 'type': 'tool_call'}]),
ToolMessage(content=[{'type': 'text', 'text': "pycon\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 28 entries, 0 to 27\nData columns (total 9 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 产品生产统计表 23 non-null object\n 1 Unnamed: 1 26 non-null object\n 2 Unnamed: 2 21 non-null object\n 3 Unnamed: 3 21 non-null object\n 4 Unnamed: 4 23 non-null object\n 5 Unnamed: 5 22 non-null object\n 6 Unnamed: 6 26 non-null object\n 7 Unnamed: 7 22 non-null object\n 8 Unnamed: 8 21 non-null object\ndtypes: object(9)\n"}], name='python', id='fd378ff2-246b-475d-80a8-3a453eaf2c1b', tool_call_id='85fd9d17-1b87-4f92-857e-afcfa072a985', artifact=[]),
AIMessage(content='接下来我将用 df.head(5) 来查看数据集的前 5 行。\npython\n# Show the first 5 rows to understand the structure\ndf.head(5)\n', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 df.head(5) 来查看数据集的前 5 行。', 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='d15c7071-fb3e-411a-87da-37f5635ebd93', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'id': '1adcddaf-e235-4ba5-9fad-3261235c8f2a', 'type': 'tool_call'}]),
ToolMessage(content=[{'type': 'text', 'text': 'pycon\n 产品生产统计表 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 \\\n0 生产日期 制造编号 产品名称 预定产量 本日产量 NaN \n1 NaN NaN NaN NaN 预计 实际 \n2 2007-08-10 00:00:00 FK-001 猕猴桃果肉饮料 100000 40000 45000 \n3 2007-08-11 00:00:00 FK-002 西瓜果肉饮料 100000 40000 44000 \n4 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 \n\n Unnamed: 6 Unnamed: 7 Unnamed: 8 \n0 累计产量 耗费工时 NaN \n1 NaN 本日 累计 \n2 83000 10 20 \n3 82000 9 18 \n4 83000 9 18 \n'}], name='python', id='d8206743-4575-4467-8f56-ae82dbb468d2', tool_call_id='1adcddaf-e235-4ba5-9fad-3261235c8f2a', artifact=[]),
AIMessage(content='我已经了解了数据集 examples\datasets\产品生产统计表.xlsx 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='3004ec5b-31ce-4c02-bea4-38e658b97a75')]

@zzzcccxx

zzzcccxx commented Jan 7, 2025

> (quoting jianpugh's comment above in full)

@jianpugh Hi, I am running into the same problem. Did you manage to solve it in the end?

5 participants