
Testing 产品生产统计表.xlsx: the agent cannot correctly extract the table headers #124

Closed
AngGaGim opened this issue Nov 28, 2024 · 9 comments
Labels: bug (Something isn't working)

@AngGaGim

No description provided.

@AngGaGim AngGaGim added the bug Something isn't working label Nov 28, 2024
@edwardzjl
Contributor

Could you provide the data file you used, or a screenshot of part of the data?

@AngGaGim
Author

> Could you provide the data file you used, or a screenshot of part of the data?

Hi! The test data is the one provided in this repository.

[image]

@weekenthralling
Contributor

First, please confirm the version of tablegpt-agent you are using; it needs to be the latest release.

Then take a look at the Normalize Datasets documentation.

Note that handling irregular tables partly depends on the capability of the model itself.
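For context on what "irregular table" means here: the file in this issue has a header spanning two rows (本日产量 and 耗费工时 each split into sub-columns on the next row), which is why a plain read yields `Unnamed: N` columns. A minimal, hypothetical pandas sketch (not tablegpt-agent's actual normalization code) of flattening such a two-row header:

```python
import pandas as pd

# Mimic what a plain Excel read returns for such a file: rows 0-1 are the
# split header (e.g. 本日产量 over 预计/实际), the rest is data.
raw = pd.DataFrame(
    [
        ["生产日期", "制造编号", "本日产量", None],
        [None, None, "预计", "实际"],
        ["2007-08-10", "FK-001", 40000, 45000],
        ["2007-08-11", "FK-002", 40000, 44000],
    ]
)

# Forward-fill the top header row so spanned cells repeat, then join the
# two header rows into flat column names.
top = raw.iloc[0].ffill()
sub = raw.iloc[1]
columns = [t if pd.isna(s) else f"{t}{s}" for t, s in zip(top, sub)]

df = raw.iloc[2:].reset_index(drop=True)
df.columns = columns
print(df.columns.tolist())  # ['生产日期', '制造编号', '本日产量预计', '本日产量实际']
```

This is only a manual illustration of the problem; the normalize-datasets flow discussed below asks an LLM to generate equivalent code, which is why its success depends on model capability.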

@AngGaGim
Author

> First, please confirm the version of tablegpt-agent you are using; it needs to be the latest release.
>
> Then take a look at the Normalize Datasets documentation.
>
> Note that handling irregular tables partly depends on the capability of the model itself.

Thanks!!

@AngGaGim
Author

AngGaGim commented Nov 29, 2024

> First, please confirm the version of tablegpt-agent you are using; it needs to be the latest release.
> Then take a look at the Normalize Datasets documentation.
> Note that handling irregular tables partly depends on the capability of the model itself.

Hi, running normalize-datasets.ipynb raises a langchain parsing error:

Failed to generate normalization code: Could not parse output: 生产日期,制造编号,产品名称,预定产量,预计本日产量,实际本日产量,累计产量,本日耗费工时,累计耗费工时
2007-08-10,FK-001,猕猴桃果肉饮料,100000,40000,45000,83000,10,20
2007-08-11,FK-002,西瓜果肉饮料,100000,40000,44000,82000,9,18
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': '/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx'}]}, response_metadata={}, id='4fee9d-415f-be95-36b80b48d3ec'), AIMessage(content="我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据ad_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)\n```", additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查input': "# Load the data into a DataFrame\ndf = read_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in colues\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'model_type': None}, response_metadata={}, id='abf1eb50-67ef-4ad6-83eb-41d1662a3f79', tool_calls=[{'name': 'python', 'args': {'query': "# Load the data into a DataFrame\ndf = read_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in column names\nmns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'id': '88dd1283-3c79-4909-959b-52a89880dcd3', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': "```pycon\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 28 entries, 0 to 27\nData columns (total 9 columns):\n #   Column      Non-Null 
Count  Dtype \n---  ------      --------------  ----- \n 0   产品生产统计表3 non-null     object\n 1   Unnamed: 1  26 non-null     object\n 2   Unnamed: 2  21 non-null     object\n 3   Unnamed: 3  21 non-null     object\n 4   Unnamed: 4  23 non-null     object\n 5   Unnamed: 5  22 non-null     object\n 6   Unnamed: 6  26 non-null     object\n 7   Unnamed: 7  22 non-null     object\n 8   Unnamed: 8  21 non-null     object\ndtypes: object(9)\n```"}], name='python', id='347b8bc2-117a-4302-9058-ad44ba768865', tool_call_id='88dd1283-3c79-4909-959b-52a89880dcd3', artifact=[]), AIMessage(content='接下来我将用 `df.head(5)` 来查 行。\n```python\n# Show the first 5 rows to understand the structure\ndf.head(5)\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看, 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='6f19beea-5405-45e4-a726-9e1802d88ebb', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'id': 'fc7e2cfa-662e-47c2-98b7-758ea8627a0c', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': "```pycon\n---------------------------------------------------------------------------\nNameError                                 Traceback (most recent call last)\nCell In[1], line 2\n      1 # Show the first 5 rows to understand the structure\n----> 2 df.head(5)\n\nNameError: name 'df' is not defined\n```"}], name='python', id='8c442d16-b741-469f-9b5e-16512a631d87', tool_call_id='fc7e2cfa-662e-47c2-98b7-758ea8627a0c', artifact=[]), AIMessage(content='我已经了解了数据集 /data/develolegpt2/examples/datasets/产品生产统计表_origin.xlsx 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='607abcda-a6d1-4a47-a51f-7b884651b6e9')]

The code I ran:

import asyncio
from datetime import date
from typing import TypedDict
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from pybox import LocalPyBoxManager
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
from tablegpt.agent import create_tablegpt_graph
from tablegpt.agent.file_reading import Stage

print(DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
class Attachment(TypedDict):
    """Contains at least one dictionary with the key filename."""

    filename: str
    """The dataset uploaded in this session can be a filename, file path, or object storage address."""


# tablegpt-agent fully supports async invocation
async def main() -> None:
    llm = ChatOpenAI(
        openai_api_base="http://localhost:11111/v1",
        openai_api_key="sk-zZD420F648F1826355455eEaD881",
        model_name="TableGPT2-7B",
    )
    normalize_llm = ChatOpenAI(
        openai_api_base="https://myart/v1",
        openai_api_key="sk-c79c",
        model_name="gpt-4o"
    )

    # Use local pybox manager for development and testing
    pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

    agent = create_tablegpt_graph(
        llm=llm,
        pybox_manager=pybox_manager,
        normalize_llm=normalize_llm,
    )

    attachment_msg = HumanMessage(
        content="",
        # Please make sure your iPython kernel can access your filename.
        additional_kwargs={"attachments": [Attachment(filename="/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx")]},
    )

    response = await agent.ainvoke(
        input={
            "entry_message": attachment_msg,
            "processing_stage": Stage.UPLOADED,
            "messages": [attachment_msg],
            "parent_id": "some-parent-id1",
            "date": date.today(),
        },
        config={
            # Using checkpointer requires binding thread_id at runtime.
            "configurable": {"thread_id": "some-thread-id"},
        },
    )

    print(response["messages"])

asyncio.run(main())

Versions: tablegpt-agent 0.2.15, langchain-core 0.3.21, langchain-openai 0.2.10, langchain-qdrant 0.2.0, langgraph 0.2.53, langgraph-checkpoint 2.0.6, langgraph-sdk 0.1.39

@weekenthralling
Contributor

weekenthralling commented Nov 29, 2024

@AngGaGim I noticed a plaintext API key in the code snippet you posted; please take care to protect your credentials.

As I mentioned at the start, normalize-datasets relies on the capability of a large model, and it is currently experimental.

@weekenthralling
Contributor

weekenthralling commented Dec 2, 2024

> (quoting AngGaGim's previous comment above in full)

Sorry, I may have slightly misunderstood you. Looking closely at your output, there is this error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 2
      1 # Show the first 5 rows to understand the structure
----> 2 df.head(5)

NameError: name 'df' is not defined

To resolve this error, pass session_id='xxxxx' when calling create_tablegpt_graph; it is a required parameter for the file-reading graph.

As for whether the data can be converted into a standard format, that depends on the capability of the model.

@jianpugh

I used this example with both normalize_llm and llm set to the TableGPT2 7B model from Hugging Face, and it does not seem to extract the table correctly.

The code is as follows:
from pathlib import Path
from langchain_openai import ChatOpenAI
from pybox import LocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR

llm = ChatOpenAI(base_url="http://xxx:xxx/v1", api_key="TableGPT2-7B_api_key", model_name="TableGPT2-7B")
normalize_llm = ChatOpenAI(base_url="http://xxx:xxx/v1", api_key="TableGPT2-7B_api_key", model_name="TableGPT2-7B")
pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

agent = create_tablegpt_graph(
    llm=llm,
    pybox_manager=pybox_manager,
    normalize_llm=normalize_llm,
    session_id="some-session-id",  # This is required when using file-reading
)

from typing import TypedDict
from langchain_core.messages import HumanMessage

class Attachment(TypedDict):
    """Contains at least one dictionary with the key filename."""
    filename: str

attachment_msg = HumanMessage(
    content="",
    # Please make sure your iPython kernel can access your filename.
    additional_kwargs={"attachments": [Attachment(filename=r"examples\datasets\产品生产统计表.xlsx")]},
)

from datetime import date
from tablegpt.agent.file_reading import Stage

# Reading and processing files.
response = await agent.ainvoke(
    input={
        "entry_message": attachment_msg,
        "processing_stage": Stage.UPLOADED,
        "messages": [attachment_msg],
        "parent_id": "some-parent-id1",
        "date": date.today(),
    },
    config={
        # Using checkpointer requires binding thread_id at runtime.
        "configurable": {"thread_id": "some-thread-id"},
    },
)

response["messages"]

The console printed the following:

Failed to generate normalization code: Could not parse output: ['产品生产统计表', '生产日期', '制造编号', '产品名称', '预定产量', '本日产量预计', '本日产量实际', '累计产量', '耗费工时本日', '耗费工时累计', '2007-08-10', 'FK-001', '猕猴桃果肉饮料', 100000, 40000, 45000, 83000, 10, 20, '2007-08-11', 'FK-002', '西瓜果肉饮料', 100000, 40000, 44000, 82000, 9, 18]
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE
[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': 'examples\datasets\产品生产统计表.xlsx'}]}, response_metadata={}, id='6739e0ae-19ea-4aba-b286-b239f9ef8e7c'),
AIMessage(content="我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 df 变量中,并通过 df.info 查看 NaN 情况和数据类型。\npython\n# Load the data into a DataFrame\ndf = read_df('examples\\datasets\\产品生产统计表.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)\n", additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 df 变量中,并通过 df.info 查看 NaN 情况和数据类型。', 'action': {'tool': 'python', 'tool_input': "# Load the data into a DataFrame\ndf = read_df('examples\datasets\产品生产统计表.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'model_type': None}, response_metadata={}, id='6c536010-04aa-463c-b95b-78f2c5c39436', tool_calls=[{'name': 'python', 'args': {'query': "# Load the data into a DataFrame\ndf = read_df('examples\datasets\产品生产统计表.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'id': '85fd9d17-1b87-4f92-857e-afcfa072a985', 'type': 'tool_call'}]),
ToolMessage(content=[{'type': 'text', 'text': "pycon\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 28 entries, 0 to 27\nData columns (total 9 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 产品生产统计表 23 non-null object\n 1 Unnamed: 1 26 non-null object\n 2 Unnamed: 2 21 non-null object\n 3 Unnamed: 3 21 non-null object\n 4 Unnamed: 4 23 non-null object\n 5 Unnamed: 5 22 non-null object\n 6 Unnamed: 6 26 non-null object\n 7 Unnamed: 7 22 non-null object\n 8 Unnamed: 8 21 non-null object\ndtypes: object(9)\n"}], name='python', id='fd378ff2-246b-475d-80a8-3a453eaf2c1b', tool_call_id='85fd9d17-1b87-4f92-857e-afcfa072a985', artifact=[]),
AIMessage(content='接下来我将用 df.head(5) 来查看数据集的前 5 行。\npython\n# Show the first 5 rows to understand the structure\ndf.head(5)\n', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 df.head(5) 来查看数据集的前 5 行。', 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='d15c7071-fb3e-411a-87da-37f5635ebd93', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'id': '1adcddaf-e235-4ba5-9fad-3261235c8f2a', 'type': 'tool_call'}]),
ToolMessage(content=[{'type': 'text', 'text': 'pycon\n 产品生产统计表 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 \\\n0 生产日期 制造编号 产品名称 预定产量 本日产量 NaN \n1 NaN NaN NaN NaN 预计 实际 \n2 2007-08-10 00:00:00 FK-001 猕猴桃果肉饮料 100000 40000 45000 \n3 2007-08-11 00:00:00 FK-002 西瓜果肉饮料 100000 40000 44000 \n4 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 \n\n Unnamed: 6 Unnamed: 7 Unnamed: 8 \n0 累计产量 耗费工时 NaN \n1 NaN 本日 累计 \n2 83000 10 20 \n3 82000 9 18 \n4 83000 9 18 \n'}], name='python', id='d8206743-4575-4467-8f56-ae82dbb468d2', tool_call_id='1adcddaf-e235-4ba5-9fad-3261235c8f2a', artifact=[]),
AIMessage(content='我已经了解了数据集 examples\datasets\产品生产统计表.xlsx 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='3004ec5b-31ce-4c02-bea4-38e658b97a75')]

@zzzcccxx

zzzcccxx commented Jan 7, 2025

> (quoting jianpugh's comment above in full)

@jianpugh Hi, I am running into the same problem. Did you manage to solve it in the end?

5 participants