
fix bug: docker container process will crash when uploading a big PDF (100 MB). #473

Closed
wants to merge 1 commit

Conversation

@sszgwdk (Collaborator) commented Dec 6, 2024

The bug originates from TiDB's limit on the length of a TEXT field (6 MB). When inserting document data into the database, the SQL execution failed because the content exceeded 6 MB. To handle this, this PR truncates the stored content to meet the length limit while retaining the original content for index construction, so that subsequent processing is unaffected.
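A minimal sketch of the resulting helper, under assumed names (`truncate_for_storage` and `MAX_CONTENT_LENGTH` are illustrative; the PR's actual truncation logic appears in the review snippet further down):

```python
MAX_CONTENT_LENGTH = 5 * 1024 * 1024  # stay safely under TiDB's 6 MB field limit

def truncate_for_storage(content: str, max_length: int = MAX_CONTENT_LENGTH):
    """Return (stored_content, original_or_none).

    stored_content fits the DB column; the original is returned only
    when truncation happened, so it can be handed to the indexer.
    Note: this counts characters, not bytes; multibyte text may need
    a byte-based check to honor TiDB's limit exactly.
    """
    if len(content) > max_length:
        return content[:max_length] + "...", content
    return content, None
```

On upload, the first element is what gets written to the row, and the second, when not None, is passed along to build_index_for_document, as the diff below shows.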

vercel bot commented Dec 6, 2024

The latest updates on your projects:

Name              | Status  | Updated (UTC)
tidb-ai-preview   | ✅ Ready | Dec 6, 2024 6:06am
tidb-ai-storybook | ✅ Ready | Dec 6, 2024 6:06am

vercel bot commented Dec 6, 2024

@sszgwdk is attempting to deploy a commit to the pingcap Team on Vercel.

A member of the Team first needs to authorize it.

```diff
 session.add(document)
 session.commit()

-build_index_for_document.delay(kb_id, document.id)
+build_index_for_document.delay(kb_id, document.id, original_content)
```
Member commented:

It seems to increase Redis's memory overhead. Could we follow the approach of Git Large File Storage (LFS) and store references to large files in the database instead of the raw content?
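
A rough sketch of that reference-based alternative, using a hypothetical content-addressed blob store (`BLOB_DIR`, `store_blob`, and `load_blob` are illustrative names, not APIs from this repository):

```python
import hashlib
from pathlib import Path

BLOB_DIR = Path("/data/blobs")  # hypothetical local blob store; could be S3, etc.

def store_blob(content: bytes) -> str:
    """Write content to the blob store; return a short content-addressed key."""
    key = hashlib.sha256(content).hexdigest()
    BLOB_DIR.mkdir(parents=True, exist_ok=True)
    (BLOB_DIR / key).write_bytes(content)
    return key  # only this 64-character key goes into the TEXT column

def load_blob(key: str) -> bytes:
    """Resolve a reference key back to the original content."""
    return (BLOB_DIR / key).read_bytes()
```

The Celery task would then receive the key instead of the raw content, keeping both the TiDB row and the Redis payload small.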

Member commented:

We can set a max-upload-size parameter to protect Redis, and assume the user (admin) has enough sense that they SHOULD NOT set it to a very large value (somewhere bigger than Redis's RAM size).

So currently it looks good to me.
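
A minimal sketch of such a guard at the upload path (`MAX_UPLOAD_SIZE` and its default are assumptions; the real parameter name and wiring would depend on the app's settings):

```python
MAX_UPLOAD_SIZE = 100 * 1024 * 1024  # hypothetical admin-configurable cap (bytes)

def check_upload_size(file_size: int) -> None:
    """Reject uploads that would overload Redis when queued for indexing."""
    if file_size > MAX_UPLOAD_SIZE:
        raise ValueError(
            f"upload of {file_size} bytes exceeds max-upload-size "
            f"({MAX_UPLOAD_SIZE} bytes)"
        )
```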

@634750802 linked an issue Dec 9, 2024 that may be closed by this pull request
```python
# reconstructed control flow: the extracted snippet lost the surrounding `if`
if len(self.content) > max_length:
    original_content = self.content
    self.content = self.content[:max_length] + "..."
    return original_content
return None
```
Member commented:

If we truncate a big file, it will fail when we "retry" to index it, since in the current workflow retry re-reads the stored (already truncated) content rather than the original.

@IANTHEREAL (Contributor) commented:

I have one question: does truncated content matter?

@sszgwdk (Collaborator, Author) commented Dec 10, 2024

> I have one question: does truncated content matter?

For the backend, apart from the initial chunk splitting and the "retry" interface, the original content does not seem to be used anywhere else.
Users may want to view the original document in the front end, where the displayed content will be truncated; that is probably harmless when the text is very large.

@sszgwdk sszgwdk closed this Dec 13, 2024
@sszgwdk sszgwdk deleted the fix_upload_big_doc branch December 19, 2024 02:30

Successfully merging this pull request may close these issues.

bug: docker container process will crash when uploading a big PDF (100 MB)
5 participants