
Get exact token count in summarizer #348

Merged · 1 commit into containers:main · Apr 27, 2024

Conversation

MichaelClifford (Collaborator)
This PR uses the /extras/tokenize/ and /extras/detokenize/ endpoints of the llamacpp_python API server while chunking the text, so we get the exact number of tokens that will be used in each prompt. This is an improvement over the earlier approach, where we simply estimated the token count. Estimation is imperfect: it worked in many cases, but it could cause errors when documents contained a lot of rarer tokens (particularly noticeable in PDF docs).
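For context, here is a minimal sketch of what exact-count chunking against the server can look like. It assumes the llamacpp_python server exposes `POST /extras/tokenize` returning a `tokens` list; the URL, payload field names, and the `count_tokens`/`chunk_text` helpers are illustrative, not the PR's actual code:

```python
import requests

LLAMACPP_URL = "http://localhost:8001"  # hypothetical server address


def count_tokens(text: str) -> int:
    """Ask the llamacpp_python server for the exact token count of `text`.

    Assumes the server exposes POST /extras/tokenize and returns
    {"tokens": [...]}; field names may differ across versions.
    """
    resp = requests.post(f"{LLAMACPP_URL}/extras/tokenize", json={"input": text})
    resp.raise_for_status()
    return len(resp.json()["tokens"])


def chunk_text(text: str, max_tokens: int = 1024, delimiter: str = "\n") -> list[str]:
    """Greedily pack delimiter-separated pieces into chunks under max_tokens.

    A single piece that already exceeds max_tokens is emitted as its own
    (oversized) chunk rather than being split further.
    """
    chunks, current = [], ""
    for piece in text.split(delimiter):
        candidate = f"{current}{delimiter}{piece}" if current else piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Querying the endpoint per candidate chunk adds HTTP round trips, but it removes the guesswork of character-based estimates for documents with rarer tokens.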

I've also swapped the PDF reader package from pypdf to pymupdf, which resolved some PDF decoding issues that came up while adding the new text chunking method.
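A rough sketch of the reader side of that swap, assuming PyMuPDF's standard `fitz` API (the function name is illustrative):

```python
import fitz  # PyMuPDF (pymupdf)


def extract_pdf_text(path: str) -> str:
    """Return the plain text of a PDF, concatenating pages in order."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)
```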

Signed-off-by: Michael Clifford <[email protected]>

@Gregory-Pereira (Collaborator) left a comment:


After adding print("num_tokens: ", num_tokens) to line 44 of recipes/natural_language_processing/summarizer/app/summarizer.py and calling it with a PDF under 200 MB, I can confirm the token count works.

Summarizer concerns unrelated to this PR, found while testing it:

However, I would recommend enforcing stricter size limits on the PDF uploader to keep users from running workloads with unreasonable performance expectations. I ran this with a few simple PDFs averaging around 0.5 MB and 1,000 tokens, which took about 35 seconds on average.

Scaling up, I tried some bigger PDFs (~10 MB and 50,000 tokens). That took roughly 30+ minutes, and my PC felt like a kitchen burner in use. This is not scalable for larger PDF files, and I think enforcing a size limit at upload time would save users from uploading something unreasonably big and expecting high performance. I would also like to add a timing feature so performance testing doesn't require manual counting, but I will add that in a separate PR.
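One possible shape for that enforcement, assuming the app's uploader is Streamlit-based (the 2 MB cap and variable names here are hypothetical, to be tuned against observed performance):

```python
import streamlit as st

MAX_PDF_BYTES = 2 * 1024 * 1024  # hypothetical cap; adjust to acceptable run times

uploaded = st.file_uploader("Upload a PDF", type=["pdf"])
if uploaded is not None:
    if uploaded.size > MAX_PDF_BYTES:
        # Reject oversized files before any chunking or summarization starts.
        st.error(
            f"PDF is {uploaded.size / 1e6:.1f} MB; please upload a file under "
            f"{MAX_PDF_BYTES / 1e6:.0f} MB for reasonable summarization time."
        )
        st.stop()
    # ...proceed with text extraction, chunking, and summarization...
```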

@Gregory-Pereira Gregory-Pereira merged commit 9c2a779 into containers:main Apr 27, 2024
1 check passed