
[Bug] [KnowledgeGraph] Network keeps timing out near the end when building a knowledge graph from long text #2367

Open
4 of 15 tasks
JUNK803 opened this issue Feb 24, 2025 · 5 comments
Assignees
Labels
bug Something isn't working Waiting for reply

Comments

@JUNK803

JUNK803 commented Feb 24, 2025

Search before asking

  • I had searched in the issues and found no similar issues.

Operating system information

Linux

Python version information

3.10

DB-GPT version

main

Related scenes

  • Chat Data
  • Chat Excel
  • Chat DB
  • Chat Knowledge
  • Model Management
  • Dashboard
  • Plugins

Installation Information

Device information

GPU: 16 GB

Models information

LLM: gpt-35-turbo (via a Microsoft proxy)
embedding: DB-GPT/models/text2vec-large-chinese

What happened

For a short article of about 10,000 characters, clicking Knowledge Graph → automatic chunking builds the knowledge graph successfully, recall works, and the AI can answer questions.
But for a long text (The Three-Body Problem, Book 1), automatic chunking fails outright with this error:
CheckErrorInfo.**: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 267142 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}, i.e. the token count exceeds the limit.

So I switched to manual chunking: a chunk size of 512 tokens with a 50-token overlap. Processing then starts; watching the terminal output briefly, it goes through chunking, vectorization, and so on. The output scrolls too fast to read, and at the end a series of progress bars appears:
2025-02-24 16:30:27 deqing-gpu-249 dbgpt.storage.vector_store.chroma_store[3847629] INFO ChromaStore similar search with scores
Batches: 100%|███████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.40it/s]
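The manual chunking described above (512-token chunks, 50-token overlap) can be sketched as a sliding window over a token list. This is a minimal illustration of the splitting scheme, not DB-GPT's actual splitter:

```python
def split_with_overlap(tokens, chunk_size=512, overlap=50):
    # Slide a window of `chunk_size` tokens, stepping by chunk_size - overlap,
    # so each consecutive pair of chunks shares `overlap` tokens of context.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 1200 "tokens" -> 3 chunks covering [0:512], [462:974], [924:1200]
chunks = split_with_overlap(list(range(1200)))
```

A 267k-token document would thus become roughly 267000 / (512 - 50) ≈ 578 chunks, each small enough to fit the model's context window individually.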

But some of the bars stay at 0%, and then the log starts printing:
deqing-gpu-249 dbgpt.util.api_utils[3847629] WARNING Health check failed for http://127.0.0.1:5670, error: HTTPConnectionPool(host='127.0.0.1', port=5670): Read timed out. (read timeout=10)

My guess is that some progress bars stay at 0 and time out, so the program keeps repeating and printing timeouts until it stops responding altogether.
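The repeated warning looks like a health-check loop hitting its 10-second read timeout while the worker is busy with the graph build. A minimal sketch of such a retrying probe, with a fake callable standing in for the real HTTP request (all names here are hypothetical, not DB-GPT's actual implementation):

```python
import time

def check_health(fetch, retries=3, backoff=0.01):
    # `fetch` is a stand-in for the HTTP health probe; it should raise
    # TimeoutError when the endpoint is too busy to answer in time.
    for attempt in range(retries):
        try:
            return fetch()
        except TimeoutError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))

# A fake probe that times out twice, then recovers:
calls = {"n": 0}
def flaky_probe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("read timeout=10")
    return "ok"

status = check_health(flaky_probe)
```

If the worker never frees up, every retry fails the same way, which would produce exactly the endless stream of "Health check failed" warnings seen here.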

What you expected to happen

Two kinds of progress bars both seem to trigger the timeout warning. Since chunking runs in parallel, the terminal output is interleaved and messy. The other kind follows log lines like:
2025-02-24 16:33:39 deqing-gpu-249 dbgpt.model.proxy.llms.chatgpt[3847629] INFO Send request to openai(1.61.1), payload: {'stream': True, 'model': 'gpt-35-turbo'}

followed by a progress bar that also times out.
Again, my guess is that some progress bars stay at 0 and time out, so the program keeps printing timeouts until it stops responding.

How to reproduce

I'm not sure whether any sufficiently long text reproduces this, but I can provide my sample:

santi.md

Additional context

One more thing: during these runs the terminal also prints:
Expected str but got datetime with value datetime.datetime(2025, 2, 24, 11, 10, 6, 371925) - serialized value may not be as expected

What causes this, and how can I fix it? I keep seeing this warning that isn't quite an error.
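That warning typically means a serializer expected a `str` field but received a `datetime` object; it is usually harmless for logging. A common fix on the producing side, sketched here as a minimal helper (not DB-GPT's actual code), is to convert timestamps to ISO-8601 strings before serialization:

```python
from datetime import datetime

def to_serializable(value):
    # Coerce datetimes to ISO-8601 strings so a field declared as `str`
    # in the serialization schema receives the type it expects.
    if isinstance(value, datetime):
        return value.isoformat()
    return value

stamp = to_serializable(datetime(2025, 2, 24, 11, 10, 6, 371925))
# stamp == '2025-02-24T11:10:06.371925'
```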

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@JUNK803 JUNK803 added bug Something isn't working Waiting for reply labels Feb 24, 2025
@JUNK803
Author

JUNK803 commented Feb 24, 2025

I reproduced it myself; this is the final state:

(two screenshots attached)

@JUNK803
Author

JUNK803 commented Feb 25, 2025

A follow-up on my latest test: I manually split the text with a chunk size of 10,000 tokens and an overlap of 100, and this time the splitting works. The "read timeout=10" warnings still appear, but the knowledge graph does get built in the end. However, inspecting the finished graph shows the content is not very rich, and many entities are named "none_header_chunk" (see the screenshots below). Ultimately I cannot ask it any questions: it errors out, and the error looks like it is caused by retrieving too much content and the text being too long? I'm not very familiar with SQL, so I can't quite read it. The terminal error output is also pasted below. Could anyone help me analyze this? Many thanks.

(four screenshots attached)

@fanzhidongyzby
Collaborator

fanzhidongyzby commented Feb 25, 2025 via email

@JUNK803
Author

JUNK803 commented Feb 25, 2025

> Don't worry about none_header_chunk; it just denotes a text chunk. KNOWLEDGE_GRAPH_CHUNK_SEARCH_TOP_SIZE controls how many chunks are recalled per query; setting it too high easily overflows the context. Your split chunk size of 10k is already very large, which makes this problem likely.

Got it, thank you very much! So the chunk size can't be too large, but splitting too finely is also bad.
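The trade-off the maintainer describes can be made concrete with a rough token-budget check: the retrieved chunks plus the rest of the prompt must fit the model's 16,385-token window. A sketch, where `prompt_overhead` is an assumed allowance for the system prompt, question, and graph data rather than a DB-GPT constant:

```python
def fits_context(chunk_size, top_k, context_limit=16385, prompt_overhead=1500):
    # Tokens from recalled chunks plus an assumed fixed overhead must stay
    # within the model's context window (16385 for gpt-35-turbo here).
    return chunk_size * top_k + prompt_overhead <= context_limit

print(fits_context(10000, 5))  # False: 10k-token chunks overflow immediately
print(fits_context(512, 5))    # True: ~4k tokens fits comfortably
```

By this estimate, even a single 10k-token chunk leaves little headroom, while 512-token chunks allow a reasonable top-k.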

@JUNK803
Author

JUNK803 commented Feb 25, 2025

> Don't worry about none_header_chunk; it just denotes a text chunk. KNOWLEDGE_GRAPH_CHUNK_SEARCH_TOP_SIZE controls how many chunks are recalled per query; setting it too high easily overflows the context. Your split chunk size of 10k is already very large, which makes this problem likely.

Hello, based on your suggestion I ran two more tests with different parameters: first splitting at 1,000 tokens with an 80-token overlap, then at 800 tokens with a 70-token overlap. The good news is that both can build the knowledge graph, ending with "Finished". The bad news is that conversation still fails, with the same error as the 10k-token run; screenshot below.
Is 800 tokens still too large, or is it because my input is a novel with a very large amount of text? Could you tell me where in the code I can make changes to optimize the retrieval/recall logic so that conversation works? Thank you very much for your help!
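Per the maintainer's reply, one knob to turn is the per-query recall count. A hypothetical .env fragment (the variable name comes from the reply above; the value 3 is only an illustrative starting point, not a recommended default):

```shell
# Lower the number of chunks recalled per knowledge-graph query so that
# recalled text plus prompt stays within the model's context window.
KNOWLEDGE_GRAPH_CHUNK_SEARCH_TOP_SIZE=3
```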

Also, I would like to know what this terminal message means, and whether it has any impact:
Expected str but got datetime with value datetime.datetime(2025, 2, 24, 11, 10, 6, 371925) - serialized value may not be as expected

(screenshot attached)
