
Testing 产品生成统计表.xlsx: the agent cannot correctly extract the table header #124

Open
AngGaGim opened this issue Nov 28, 2024 · 6 comments
@AngGaGim
No description provided.

AngGaGim added the bug (Something isn't working) label on Nov 28, 2024
@edwardzjl (Contributor)
Could you share the data file you used, or a screenshot of part of the data?

@AngGaGim (Author)
> Could you share the data file you used, or a screenshot of part of the data?

Hi! The test data is the file provided in this repository.
[screenshot attached]

@weekenthralling (Contributor)
First, please confirm which version of tablegpt-agent you are using; it should be the latest release.

Then take a look at the Normalize Datasets documentation.

Note that handling irregular tables partly relies on the capabilities of the model itself.
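[Editor's note] As a rough illustration of what normalizing an irregular sheet involves, the sketch below is hypothetical and is not tablegpt-agent's implementation: sheets like this one often have a title row (and merged cells) above the real header, so one simple heuristic is to take the first fully populated row as the header and read data from there.

```python
# Hypothetical sketch: locate the real header row in an irregular sheet
# whose first row is a title spanning merged cells (e.g. "产品生产统计表").
# This is NOT tablegpt-agent's implementation, just the general idea.

def find_header_row(rows: list[list[str]]) -> int:
    """Return the index of the first row where every cell is non-empty."""
    for i, row in enumerate(rows):
        if row and all(cell.strip() for cell in row):
            return i
    return 0  # fall back to the first row

def normalize(rows: list[list[str]]) -> tuple[list[str], list[list[str]]]:
    """Split raw rows into (header, data) starting at the detected header."""
    h = find_header_row(rows)
    return rows[h], rows[h + 1:]

raw = [
    ["产品生产统计表", "", "", ""],                    # title row over merged cells
    ["生产日期", "制造编号", "产品名称", "预定产量"],      # the real header
    ["2007-08-10", "FK-001", "猕猴桃果肉饮料", "100000"],
]
header, data = normalize(raw)
print(header)  # ['生产日期', '制造编号', '产品名称', '预定产量']
```

Real sheets are messier (partially filled header rows, multi-level headers), which is why the project delegates this step to a model rather than a fixed heuristic.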

@AngGaGim (Author)

> First, please confirm which version of tablegpt-agent you are using; it should be the latest release.
>
> Then take a look at the Normalize Datasets documentation.
>
> Note that handling irregular tables partly relies on the capabilities of the model itself.

Thanks!!

@AngGaGim (Author)

AngGaGim commented Nov 29, 2024

> First, please confirm which version of tablegpt-agent you are using; it should be the latest release.
> Then take a look at the Normalize Datasets documentation.
> Note that handling irregular tables partly relies on the capabilities of the model itself.

Hi, running normalize-datasets.ipynb fails with a LangChain output-parsing error:

Failed to generate normalization code: Could not parse output: 生产日期,制造编号,产品名称,预定产量,预计本日产量,实际本日产量,累计产量,本日耗费工时,累计耗费工时
2007-08-10,FK-001,猕猴桃果肉饮料,100000,40000,45000,83000,10,20
2007-08-11,FK-002,西瓜果肉饮料,100000,40000,44000,82000,9,18
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': '/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx'}]}, response_metadata={}, id='4fee9d-415f-be95-36b80b48d3ec'), AIMessage(content="我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据ad_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)\n```", additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查input': "# Load the data into a DataFrame\ndf = read_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in colues\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'model_type': None}, response_metadata={}, id='abf1eb50-67ef-4ad6-83eb-41d1662a3f79', tool_calls=[{'name': 'python', 'args': {'query': "# Load the data into a DataFrame\ndf = read_df('/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx')\n\n# Remove leading and trailing whitespaces in column names\nmns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how='all').dropna(axis=1, how='all')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)"}, 'id': '88dd1283-3c79-4909-959b-52a89880dcd3', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': "```pycon\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 28 entries, 0 to 27\nData columns (total 9 columns):\n #   Column      Non-Null 
Count  Dtype \n---  ------      --------------  ----- \n 0   产品生产统计表3 non-null     object\n 1   Unnamed: 1  26 non-null     object\n 2   Unnamed: 2  21 non-null     object\n 3   Unnamed: 3  21 non-null     object\n 4   Unnamed: 4  23 non-null     object\n 5   Unnamed: 5  22 non-null     object\n 6   Unnamed: 6  26 non-null     object\n 7   Unnamed: 7  22 non-null     object\n 8   Unnamed: 8  21 non-null     object\ndtypes: object(9)\n```"}], name='python', id='347b8bc2-117a-4302-9058-ad44ba768865', tool_call_id='88dd1283-3c79-4909-959b-52a89880dcd3', artifact=[]), AIMessage(content='接下来我将用 `df.head(5)` 来查 行。\n```python\n# Show the first 5 rows to understand the structure\ndf.head(5)\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看, 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='6f19beea-5405-45e4-a726-9e1802d88ebb', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'id': 'fc7e2cfa-662e-47c2-98b7-758ea8627a0c', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': "```pycon\n---------------------------------------------------------------------------\nNameError                                 Traceback (most recent call last)\nCell In[1], line 2\n      1 # Show the first 5 rows to understand the structure\n----> 2 df.head(5)\n\nNameError: name 'df' is not defined\n```"}], name='python', id='8c442d16-b741-469f-9b5e-16512a631d87', tool_call_id='fc7e2cfa-662e-47c2-98b7-758ea8627a0c', artifact=[]), AIMessage(content='我已经了解了数据集 /data/develolegpt2/examples/datasets/产品生产统计表_origin.xlsx 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='607abcda-a6d1-4a47-a51f-7b884651b6e9')]

The code I ran:

import asyncio
from datetime import date
from typing import TypedDict
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from pybox import LocalPyBoxManager
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
from tablegpt.agent import create_tablegpt_graph
from tablegpt.agent.file_reading import Stage

print(DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
class Attachment(TypedDict):
    """Contains at least one dictionary with the key filename."""

    filename: str
    """The dataset uploaded in this session can be a filename, file path, or object storage address."""


# tablegpt-agent fully supports async invocation
async def main() -> None:
    llm = ChatOpenAI(
        openai_api_base="http://localhost:11111/v1",
        openai_api_key="sk-zZD420F648F1826355455eEaD881",
        model_name="TableGPT2-7B",
    )
    normalize_llm = ChatOpenAI(
        openai_api_base="https://myart/v1",
        openai_api_key="sk-c79c",
        model_name="gpt-4o"
    )

    # Use local pybox manager for development and testing
    pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

    agent = create_tablegpt_graph(
        llm=llm,
        pybox_manager=pybox_manager,
        normalize_llm=normalize_llm,
    )

    attachment_msg = HumanMessage(
        content="",
        # Please make sure your iPython kernel can access your filename.
        additional_kwargs={"attachments": [Attachment(filename="/data/develop/hjy/tablegpt2/examples/datasets/产品生产统计表_origin.xlsx")]},
    )

    response = await agent.ainvoke(
        input={
            "entry_message": attachment_msg,
            "processing_stage": Stage.UPLOADED,
            "messages": [attachment_msg],
            "parent_id": "some-parent-id1",
            "date": date.today(),
        },
        config={
            # Using checkpointer requires binding thread_id at runtime.
            "configurable": {"thread_id": "some-thread-id"},
        },
    )

    print(response["messages"])

asyncio.run(main())

Versions: tablegpt-agent 0.2.15, langchain-core 0.3.21, langchain-openai 0.2.10, langchain-qdrant 0.2.0, langgraph 0.2.53, langgraph-checkpoint 2.0.6, langgraph-sdk 0.1.39

@weekenthralling (Contributor)
weekenthralling commented Nov 29, 2024

@AngGaGim I noticed plaintext API keys in the code snippet you posted; please take care to protect your privacy.

As I mentioned at the start, normalize-datasets relies on the capabilities of a large model, and it is currently experimental.
