Extract text from existing PDF? #641

Closed
rondonjon opened this issue Oct 21, 2020 · 1 comment

Comments

Hello @Hopding,

I am trying to extract the plain text from existing PDF documents and came across this library after it turned out that pdfjs-dist is not as portable as my project needs.

Could you share a few quick pointers on what might be the best approach to find the text nodes and extract the values with your library?

I have already browsed the API docs, but while they cover creating and extending PDFs very extensively, the information on processing existing PDFs is rather scarce. I am guessing that I should iterate over all pages and then descend into the .node trees? I tried that but quickly hit another problem: most of the types (PDFDict, PDFObject, ...) in these trees seem to be missing from the d.ts file, which makes the drill-down pretty cumbersome and leaves me unsure about the actual chances of success.
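
For illustration, here is a minimal sketch of the last step of such a drill-down. It assumes a page's content stream has already been located and decoded into a plain string; how to get there is library-specific and not shown. The sketch scans for the PDF text-showing operators `Tj` and `TJ` and collects their literal string operands. The names `readLiteral` and `extractShownText` are hypothetical, and hex strings, octal escapes, comments, font encodings, and text positioning are all ignored, so this is only likely to work on simple, Latin-encoded content.

```typescript
// Hypothetical helpers, not part of any library.
const ESCAPES: Record<string, string> = { n: "\n", r: "\r", t: "\t", b: "\b", f: "\f" };

// Reads a PDF literal string starting at the "(" at `start`.
// Returns the unescaped text and the index just past the closing ")".
function readLiteral(src: string, start: number): [string, number] {
  let out = "";
  let depth = 0;
  let i = start;
  while (i < src.length) {
    const ch = src.charAt(i);
    if (ch === "\\") {
      // Backslash escape: map known letters, otherwise keep the char as-is.
      const next = src.charAt(i + 1);
      out += ESCAPES[next] ?? next;
      i += 2;
      continue;
    }
    if (ch === "(") {
      depth++;
      if (depth > 1) out += ch; // balanced inner parentheses are literal
    } else if (ch === ")") {
      depth--;
      if (depth === 0) return [out, i + 1];
      out += ch;
    } else {
      out += ch;
    }
    i++;
  }
  return [out, i]; // unterminated string: return what was read
}

// Collects the text shown by each Tj / TJ operator, in stream order.
function extractShownText(content: string): string[] {
  const shown: string[] = [];
  let pending: string[] = []; // string operands waiting for their operator
  let i = 0;
  while (i < content.length) {
    const ch = content.charAt(i);
    if (ch === "(") {
      const [text, next] = readLiteral(content, i);
      pending.push(text);
      i = next;
    } else if (/\s/.test(ch) || ch === "[" || ch === "]" || ch === ")") {
      i++; // whitespace, array brackets, or a stray ")"
    } else {
      // Read a bare token up to whitespace or a delimiter.
      let j = i;
      while (j < content.length && !/[\s()\[\]]/.test(content.charAt(j))) j++;
      const token = content.slice(i, j);
      i = j;
      if (token === "Tj" || token === "TJ") {
        if (pending.length > 0) shown.push(pending.join(""));
        pending = [];
      } else if (!/^[+-]?(\d+\.?\d*|\.\d+)$/.test(token)) {
        pending = []; // another operator or name: drop unrelated operands
      }
    }
  }
  return shown;
}

// Usage on a hand-written content stream fragment:
const sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET";
console.log(extractShownText(sample));
```

Numbers inside a `TJ` array are kerning adjustments, so the sketch simply skips them and joins the string pieces. Real documents often use fonts with custom encodings, in which case the collected strings are glyph codes rather than readable text.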

Thanks in advance.

rondonjon commented Oct 21, 2020

Duplicate of #93
