Extract text from existing PDF? #641

rondonjon · 2020-10-21T01:06:30Z

I am trying to extract the plain text from existing PDF documents and came across this library after pdfjs-dist turned out not to be as portable as my project needs.

Could you share a few quick pointers on what might be the best approach to find the text nodes and extract the values with your library?

I have already browsed the API docs, but while they cover the creation and extension of PDFs very extensively, the information on processing existing PDFs is rather scarce. I am guessing that I should iterate over all pages and then descend into the .node trees? I tried that but quickly hit another problem: most of the types in these trees (PDFDict, PDFObject, ...) seem to be missing from the d.ts file, which makes the drill-down pretty cumbersome and leaves me unsure about the actual chances of success.
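To make the drill-down concrete, here is a minimal sketch of the step that would follow once a page's content stream has been located and decoded to a string. It uses no library calls at all: `extractShownText` is a hypothetical helper, not part of this library's API. It assumes the stream uses the standard text-showing operators (`Tj`, `TJ`, `'`, `"`) with literal `( ... )` strings and a font whose character codes are ASCII-compatible. Hex strings, octal escapes, and font encoding maps are deliberately ignored, so real-world PDFs will often need more than this.

```typescript
// Hypothetical helper (an assumption, not a library function): collects the
// string operands of text-showing operators from a decoded content stream.
function extractShownText(content: string): string[] {
  const shown: string[] = [];
  let operands: string[] = []; // literal strings seen since the last operator
  let i = 0;
  while (i < content.length) {
    const ch = content.charAt(i);
    if (ch === "(") {
      // Literal string: balanced parentheses with backslash escapes.
      let depth = 1;
      let text = "";
      i++;
      while (i < content.length && depth > 0) {
        const c = content.charAt(i);
        if (c === "\\") {
          // Simplification: keep the escaped character verbatim (covers \( \) \\).
          text += content.charAt(i + 1);
          i += 2;
          continue;
        }
        if (c === "(") depth++;
        if (c === ")") depth--;
        if (depth > 0) text += c;
        i++;
      }
      operands.push(text);
    } else if (ch === "<") {
      // Hex strings and dictionaries are skipped in this sketch.
      const end = content.indexOf(">", i);
      i = end === -1 ? content.length : end + 1;
    } else if (/[A-Za-z'"]/.test(ch)) {
      // Operator or keyword: read it up to the next delimiter.
      let j = i;
      while (j < content.length && /[A-Za-z0-9*'"]/.test(content.charAt(j))) j++;
      const op = content.slice(i, j);
      if (op === "Tj" || op === "TJ" || op === "'" || op === '"') {
        shown.push(operands.join(""));
      }
      operands = []; // every operator consumes the operands before it
      i = j;
    } else {
      i++; // numbers, whitespace, array brackets, name slashes
    }
  }
  return shown;
}

// Usage with a hand-written content stream (illustrative input only):
const sample = "BT /F1 12 Tf 72 712 Td (Hello) Tj [(Wor) -120 (ld)] TJ ET";
const pieces = extractShownText(sample);
```

The scanner clears pending operands on every operator, which is why the font name and the numeric positioning arguments never leak into the output. Joining the pieces of a `TJ` array without spaces is a design choice of this sketch: the numbers between them are kerning adjustments, not word breaks.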

Thanks in advance.

rondonjon · 2020-10-21T06:51:10Z

Duplicate of #93

rondonjon closed this as completed Oct 21, 2020
