Page#text does not return all the text #518

Open
3ynm opened this issue Jun 16, 2023 · 3 comments
Comments

3ynm commented Jun 16, 2023

For some reason, PDF::Reader::Page#text does not return all the text in a PDF file I'm scanning, although I can get the text by reading the runs directly. Here is the file: https://hacktivista.org/tmp/2700968.pdf

The text I'm unable to get through #text is "LECTURA ACTUAL 15-MAY-2023".
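
To narrow down which runs the formatted output drops, something like the following might help. It is only a sketch: `missing_run_texts` is a hypothetical helper, not part of pdf-reader, and it assumes a run counts as "missing" when its whitespace-normalized text appears nowhere in the formatted page text.

```ruby
# Hypothetical helper (not part of pdf-reader): returns the run strings
# that do not appear anywhere in the formatted page text.
def missing_run_texts(formatted_text, run_texts)
  # Collapse whitespace so padding added by the layout step does not
  # cause false positives.
  normalized = formatted_text.gsub(/\s+/, ' ')

  run_texts
    .map { |t| t.gsub(/\s+/, ' ').strip }
    .reject(&:empty?)
    .reject { |t| normalized.include?(t) }
    .uniq
end

# Usage against a real file (assumes the pdf-reader gem is installed and
# provides Page#runs, as used elsewhere in this thread):
#
#   require 'pdf/reader'
#   PDF::Reader.new('2700968.pdf').pages.each do |page|
#     p missing_run_texts(page.text, page.runs.map(&:text))
#   end
```

A substring check like this is crude: a run whose text also occurs elsewhere on the page would not be reported, so an empty result does not prove that nothing was dropped.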

3ynm commented Jun 16, 2023

For the time being I just monkey-patched the class to add an :unformatted option. I'll leave it here:

require 'pdf/reader'

module PDF
  class Reader
    # PDF::Reader::Page monkey patches.
    class Page
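      # Keep the original implementation reachable as #_text.
      alias_method :_text, :text
      remove_method :text

      # @param [Hash] opts Same options as the original #text, plus
      #   :unformatted to return the raw run text joined by spaces.
      def text(opts = {})
        return runs.map(&:text).join(' ') if opts[:unformatted]

        _text(opts)
      end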
    end
  end
end
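
The same idea could probably be written with Module#prepend instead of alias_method, so the original method stays reachable through super. This is an untested-against-the-gem sketch: `UnformattedText`, `FakeRun` and `FakePage` are hypothetical names, and the stand-in page only imitates an object that responds to `runs` and `text` so the example runs without pdf-reader.

```ruby
# Sketch: the :unformatted option added via Module#prepend.
module UnformattedText
  def text(opts = {})
    # Bypass the layout step and join the raw run text.
    return runs.map(&:text).join(' ') if opts[:unformatted]

    # Fall through to the original #text with the same arguments.
    super
  end
end

# With pdf-reader loaded, this would be applied as:
#   PDF::Reader::Page.prepend(UnformattedText)

# Stand-in objects (hypothetical, for demonstration only).
FakeRun = Struct.new(:text)

class FakePage
  def runs
    [FakeRun.new('LECTURA ACTUAL'), FakeRun.new('15-MAY-2023')]
  end

  # Simulates a formatter that drops the second run.
  def text(_opts = {})
    'LECTURA ACTUAL'
  end
end

FakePage.prepend(UnformattedText)
page = FakePage.new
puts page.text                     # formatted path, handled by the original method
puts page.text(unformatted: true)  # raw run text joined by spaces
```

Joining with a single space loses line breaks and column layout, so the :unformatted output is mainly useful for searching, not for display.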

@pbernery

I had the same issue and am looking forward to seeing a fix merged into the library.
In the meantime, thanks @HACKTIVISTA for this monkey patch.

@mochetts

Having the same issue!
