
Get exact token count in summarizer #348

Merged · 1 commit into containers:main · Apr 27, 2024

Conversation

MichaelClifford (Collaborator)
This PR uses the /extras/tokenize/ and /extras/detokenize/ endpoints of the llamacpp_python API server while chunking the text, so we get the exact number of tokens that will be used in each prompt. This is an improvement over the earlier approach, where we simply estimated the token count. Estimation is imperfect: it worked in many cases, but it could cause errors when documents contained a lot of rarer tokens (particularly noticeable in PDF docs).
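For context, here is a minimal sketch of what exact-count chunking against the server can look like. It assumes the llamacpp_python server exposes `POST /extras/tokenize` returning a `tokens` list; the URL, payload field names, and the `count_tokens`/`chunk_text` helpers are illustrative, not the PR's actual code:

```python
import requests

LLAMACPP_URL = "http://localhost:8001"  # hypothetical server address


def count_tokens(text: str) -> int:
    """Ask the llamacpp_python server for the exact token count of `text`.

    Assumes the server exposes POST /extras/tokenize and returns
    {"tokens": [...]}; field names may differ across versions.
    """
    resp = requests.post(f"{LLAMACPP_URL}/extras/tokenize", json={"input": text})
    resp.raise_for_status()
    return len(resp.json()["tokens"])


def chunk_text(text: str, max_tokens: int = 1024, delimiter: str = "\n") -> list[str]:
    """Greedily pack delimiter-separated pieces into chunks under max_tokens.

    A single piece that already exceeds max_tokens is emitted as its own
    (oversized) chunk rather than being split further.
    """
    chunks, current = [], ""
    for piece in text.split(delimiter):
        candidate = f"{current}{delimiter}{piece}" if current else piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Querying the endpoint per candidate chunk adds HTTP round trips, but it removes the guesswork of character-based estimates for documents with rarer tokens.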

I've also swapped the PDF reader package from pypdf to pymupdf, which resolved some PDF decoding issues that came up while adding the new text chunking method.
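A rough sketch of the reader side of that swap, assuming PyMuPDF's standard `fitz` API (the function name is illustrative):

```python
import fitz  # PyMuPDF (pymupdf)


def extract_pdf_text(path: str) -> str:
    """Return the plain text of a PDF, concatenating pages in order."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)
```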

Signed-off-by: Michael Clifford <[email protected]>

@Gregory-Pereira (Collaborator) left a comment:


After adding print("num_tokens: ", num_tokens) to line 44 of recipes/natural_language_processing/summarizer/app/summarizer.py and calling it with a PDF under 200 MB, I can confirm the token count works.

Summarizer concerns unrelated to this PR, found while testing it:

However, I would recommend enforcing stricter size limits on the PDF uploader to keep users from running workloads with unreasonable performance expectations. I ran this with a few simple PDFs averaging around 0.5 MB and 1,000 tokens, which took about 35 seconds on average.

Scaling up, I tried some bigger PDFs (~10 MB and 50,000 tokens). That took roughly 30+ minutes, and my PC felt like a kitchen burner in use. This is not scalable for larger PDF files, and I think enforcing a size limit at upload time would save users from uploading something unreasonably big and expecting high performance. I would also like to add a timing feature so performance testing doesn't require manual counting, but I will add that in a separate PR.
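One possible shape for that enforcement, assuming the app's uploader is Streamlit-based (the 2 MB cap and variable names here are hypothetical, to be tuned against observed performance):

```python
import streamlit as st

MAX_PDF_BYTES = 2 * 1024 * 1024  # hypothetical cap; adjust to acceptable run times

uploaded = st.file_uploader("Upload a PDF", type=["pdf"])
if uploaded is not None:
    if uploaded.size > MAX_PDF_BYTES:
        # Reject oversized files before any chunking or summarization starts.
        st.error(
            f"PDF is {uploaded.size / 1e6:.1f} MB; please upload a file under "
            f"{MAX_PDF_BYTES / 1e6:.0f} MB for reasonable summarization time."
        )
        st.stop()
    # ...proceed with text extraction, chunking, and summarization...
```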

@Gregory-Pereira Gregory-Pereira merged commit 9c2a779 into containers:main Apr 27, 2024
1 check passed