Page#text does not return all the text #518

3ynm · 2023-06-16T04:43:56Z

For some reason, PDF::Reader::Page#text does not return all the text in a PDF file I'm scanning, although I can get the missing text by looking at the runs directly. Here is the file: https://hacktivista.org/tmp/2700968.pdf

The text I'm unable to get through #text is LECTURA ACTUAL 15-MAY-2023
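To narrow down which runs are being dropped, the comparison described above can be sketched roughly as follows. This is only a diagnostic sketch: `missing_run_texts` is a hypothetical helper (not part of pdf-reader), and it assumes only that the page object responds to `text` and `runs`, and that each run responds to `text`. Whitespace is stripped before comparing because the formatted output may space characters differently from the raw runs.

```ruby
# Hypothetical helper, not part of pdf-reader.
# Returns the text of every run that cannot be found in the formatted
# page text. `page` only needs to respond to #text and #runs.
def missing_run_texts(page)
  # Compare without whitespace, since the layout step may re-space text.
  formatted = page.text.gsub(/\s+/, "")

  page.runs
      .map { |run| run.text.strip }
      .reject(&:empty?)
      .reject { |run_text| formatted.include?(run_text.gsub(/\s+/, "")) }
end
```

With the pdf-reader gem loaded, calling `missing_run_texts(page)` for each page of the file above should list the runs that `#text` leaves out, assuming the runs themselves are extracted correctly.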


3ynm · 2023-06-16T18:01:54Z

For the time being I just monkey-patched the class to add an :unformatted option. I'll leave it here:

require 'pdf/reader'

module PDF
  class Reader
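    # PDF::Reader::Page monkey patches.
    class Page
      # Keep the original implementation reachable as #_text.
      alias_method :_text, :text
      # Drop the original definition before redefining #text below.
      remove_method :text

      # @param [Hash] opts Adds :unformatted option.
      def text(opts = {})
        # Skip the layout step and join the raw text runs directly.
        return runs.map(&:text).join(' ') if opts[:unformatted]

        _text(opts)
      end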
    end
  end
end
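One limitation of joining every run with a single space is that line breaks are lost. A possible refinement, sketched here under assumptions, is to group runs into rough lines by their vertical position. `runs_to_lines` and the `tolerance` value are hypothetical, and the sketch assumes each run responds to `x`, `y`, and `text`, with `y` growing upward as in PDF user space.

```ruby
# Hypothetical helper, not part of pdf-reader.
# Rebuilds rough lines from runs that respond to #x, #y and #text.
# `tolerance` (in PDF points) is an assumed value; tune it per document.
def runs_to_lines(runs, tolerance: 2.0)
  # Top of the page first (largest y), then left to right.
  sorted = runs.sort_by { |run| [-run.y, run.x] }

  lines = []
  sorted.each do |run|
    # A run joins the current line if its y is close to that line's first run.
    if lines.any? && (lines.last.first.y - run.y).abs <= tolerance
      lines.last << run
    else
      lines << [run]
    end
  end

  lines.map { |line| line.sort_by(&:x).map(&:text).join(" ") }.join("\n")
end
```

Inside the patch above, `runs_to_lines(runs)` could stand in for `runs.map(&:text).join(' ')` if preserving line structure matters more than a flat string.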

pbernery · 2023-10-24T13:20:55Z

I had the same issue as well and am looking forward to seeing a fix merged into the library.
In the meantime, thanks @HACKTIVISTA for this monkey patch.

mochetts · 2024-04-22T17:39:46Z

Having the same issue!
