OCR on PDF #543

xuzeyu91 · 2024-04-10T04:06:45Z

Context / Scenario

I followed this example and wrote an OCR implementation. However, processing PDFs, including PDFs that contain images, did not trigger it. I'm not sure whether I did something wrong.

lecramr · 2024-04-12T11:37:46Z

Looks like this is currently not possible, see code:
https://github.com/microsoft/kernel-memory/blob/main/service/Core/DataFormats/Pdf/PdfDecoder.cs

That said, we already have IOcrEngine (https://github.com/microsoft/kernel-memory/blob/main/service/Abstractions/DataFormats/IOcrEngine.cs) in place, which would be enough for simple text extraction, and UglyToad.PdfPig can extract images as an experimental feature.

@dluc Would it be possible to extend "FileContent" with an array of the images found in the PDF, described by the GPT-4 Vision API if enabled?

marcominerva · 2024-04-12T11:47:40Z

I think you will be able to support this scenario once issue #379 is completed (there is currently a PR in preview).

With that, you will be able to inject a custom decoder for PDF files.

dluc · 2024-04-16T00:54:41Z

Maintainer

Now that custom content decoders can be injected, I would first try creating one that replaces the default PDF decoder and internally does all the work of extracting both the regular text and the text inside images. E.g. you can create a decoder that depends on the existing image decoder to parse images, and returns all the text at the end, without the need to revisit the FileContent class (for now).
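To illustrate the composition described above, here is a minimal sketch in Python. It is not Kernel Memory code and does not reproduce its C# interfaces: OcrEngine, PdfPage, OcrPdfDecoder, and the parse_pages callback are hypothetical names invented for this example, and the fake parser and fake OCR engine exist only so the sketch runs without a PDF library. The point is the shape of the decoder: it takes embedded text and images from each page, runs the images through an OCR dependency, and returns the combined text per page.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class OcrEngine(Protocol):
    """Hypothetical stand-in for an OCR dependency (similar in role to IOcrEngine)."""

    def extract_text(self, image: bytes) -> str: ...


@dataclass
class PdfPage:
    """Hypothetical parsed page: embedded text plus the raw bytes of each image."""

    text: str
    images: List[bytes] = field(default_factory=list)


class OcrPdfDecoder:
    """Hypothetical decoder that merges embedded text with OCR'd image text."""

    def __init__(self, parse_pages: Callable[[bytes], List[PdfPage]], ocr: OcrEngine):
        self._parse_pages = parse_pages  # assumed PDF parser, injected
        self._ocr = ocr                  # assumed OCR engine, injected

    def decode(self, pdf: bytes) -> List[str]:
        sections = []
        for page in self._parse_pages(pdf):
            # Embedded text first, then the text recognized in each image.
            parts = [page.text.strip()]
            parts += [self._ocr.extract_text(img).strip() for img in page.images]
            # Drop empty pieces so fully scanned pages yield only OCR text.
            sections.append("\n".join(p for p in parts if p))
        return sections


# Fakes used only to exercise the sketch; a real setup would plug in
# an actual PDF parser and OCR engine here.
class FakeOcr:
    def extract_text(self, image: bytes) -> str:
        return image.decode("utf-8")  # pretend the image bytes are its text


def fake_parser(pdf: bytes) -> List[PdfPage]:
    return [
        PdfPage(text="Page one text", images=[b"scanned caption"]),
        PdfPage(text="", images=[b"fully scanned page"]),
    ]


decoder = OcrPdfDecoder(fake_parser, FakeOcr())
sections = decoder.decode(b"%PDF-placeholder")
```

Because the parser and the OCR engine are injected, the same decoder works for PDFs with mixed content and for fully scanned PDFs, and nothing outside the decoder has to know that OCR happened.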

TaffarelJr · 2024-06-12T16:24:00Z

Is any work being done on this?

My company desperately needs this functionality, and my quick solution would be to simply extract the images from the PDF first, then send PDF + images to KernelMemory. But this sounds exactly like the solution @dluc is proposing above (only, outside of KM). I'd much rather help contribute to KernelMemory than create my own one-off solution.

mhackermsft · 2024-09-25T17:04:27Z

Any update on OCR extraction for PDFs? A customer has a bunch of PDF documents generated by a scanning solution.
