This repository has been archived by the owner on Jan 20, 2021. It is now read-only.

Merge pull request #89 from tabulapdf/feature/warnAboutScans
notes first page that has text elements in pages JSON

@jazzido: Let me know if you think this introduces a regression.
jeremybmerrill committed Jan 25, 2015
2 parents 95df965 + 0fc899f commit 00a15c3
Showing 2 changed files with 20 additions and 5 deletions.
6 changes: 5 additions & 1 deletion lib/tabula/entities/page.rb
@@ -191,6 +191,10 @@ def number(indexing_base=:one_indexed)
       end
     end

+    def has_text?
+      !self.texts.empty?
+    end
+
     # TODO no need for this, let's choose one name
     def ruling_lines
       get_ruling_lines!
@@ -258,7 +262,7 @@ def to_json(options={})
         :height => self.height,
         :number => self.number,
         :rotation => self.rotation,
-        :texts => self.texts
+        :hasText => self.has_text?
       }.to_json(options)
     end
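
To make the effect of this change on the pages JSON concrete, here is a minimal sketch. `SketchPage` is a hypothetical stand-in for `Tabula::Page`, reduced to the fields visible in the hunk above; it is not the library's class, and the constructor signature is an assumption made for illustration.

```ruby
require 'json'

# Hypothetical stand-in for Tabula::Page, reduced to the fields in the hunk.
class SketchPage
  attr_reader :height, :number, :rotation, :texts

  def initialize(height, rotation, number, texts = [])
    @height = height
    @rotation = rotation
    @number = number
    @texts = texts
  end

  # True when at least one text element is present on the page.
  def has_text?
    !texts.empty?
  end

  # Emit a boolean flag instead of the full list of text elements.
  def to_json(*args)
    {
      :height => height,
      :number => number,
      :rotation => rotation,
      :hasText => has_text?
    }.to_json(*args)
  end
end

scanned = SketchPage.new(792.0, 0, 1)                    # no text elements
digital = SketchPage.new(792.0, 0, 2, ['Total', '42'])   # some text elements

puts scanned.to_json
puts digital.to_json
```

A consumer of the JSON can then warn about likely scans by checking `hasText` rather than receiving every text element for every page.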

19 changes: 15 additions & 4 deletions lib/tabula/extraction.rb
@@ -371,26 +371,37 @@ def debugPath(path)


     class PagesInfoExtractor
-      def initialize(pdf_filename, password='')
-        @pdf_filename = pdf_filename
-        @pdf_file = Extraction.openPDF(pdf_filename, password)
+      def initialize(pdf_file_path, password='')
+        @pdf_filename = pdf_file_path
+        @pdf_file = Extraction.openPDF(pdf_file_path, password)
         @all_pages = @pdf_file.getDocumentCatalog.getAllPages
+
+        @extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all )
       end

       def pages
+        found_page_with_texts = false
         Enumerator.new do |y|
           begin
             @all_pages.each_with_index do |page, i|
               contents = page.getContents

-              y.yield Tabula::Page.new(@pdf_filename,
+              if found_page_with_texts
+                page = Tabula::Page.new(@pdf_filename,
                                        page.findCropBox.width,
                                        page.findCropBox.height,
                                        page.getRotation.to_i,
                                        i+1) #remember, these are one-indexed
+              else
+                page = @extractor.extract_page(i+1)
+                found_page_with_texts = page.has_text?
+              end
+
+              y.yield page
             end
           ensure
             @pdf_file.close
+            @extractor.close!
           end
         end
       end
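
The control flow in `pages` can be sketched in isolation: run the full extraction on each page until one with text is found, then fall back to cheap metadata-only pages. Everything named below (`StubPage`, `FakeExtractor`, `pages_info`) is hypothetical and exists only for this illustration; the real code uses `Tabula::Page` and `Tabula::Extraction::ObjectExtractor` against an open PDF.

```ruby
# Hypothetical page record; the real code yields Tabula::Page objects.
StubPage = Struct.new(:number, :texts) do
  def has_text?
    !texts.empty?
  end
end

# Hypothetical extractor that records which pages needed full extraction.
class FakeExtractor
  attr_reader :extracted

  def initialize(texts_by_page)
    @texts_by_page = texts_by_page
    @extracted = []
  end

  def extract_page(number)
    @extracted << number
    StubPage.new(number, @texts_by_page.fetch(number, []))
  end
end

# Mirrors the shape of PagesInfoExtractor#pages under the assumptions above.
def pages_info(page_count, extractor)
  found_page_with_texts = false
  Enumerator.new do |y|
    (1..page_count).each do |number|
      page = if found_page_with_texts
               StubPage.new(number, [])          # cheap: metadata only
             else
               extractor.extract_page(number).tap do |extracted|
                 found_page_with_texts = extracted.has_text?
               end
             end
      y.yield page
    end
  end
end

extractor = FakeExtractor.new(3 => ['some text'])
flags = pages_info(5, extractor).map(&:has_text?)
p flags
p extractor.extracted
```

In this sketch, pages after the first one with text are never extracted, so they report no text themselves. A consumer would therefore treat "some page has text" as the signal, which matches the commit message's "notes first page that has text elements".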

3 comments on commit 00a15c3

@jazzido
Contributor

Looks good. Did you run the tests? Also, a test for this feature would be awesome.

@jeremybmerrill
Member Author

Good call. Tests do pass (or don't fail any more than normal... 😁 ). I'll add a new one.

@jeremybmerrill
Member Author

test added: a5ee307
