Detect tables automagically when Stream is used #102

vinayak-mehta · 2018-09-11T22:59:11Z

By default, Stream currently treats the whole page as a table, which fails when there are two or more Stream-type tables on the same page with different numbers of columns. Whole-page extraction should become the fallback instead of the default.

12s0324.pdf and how Tabula does this should be a good start.

vinayak-mehta · 2018-10-30T20:35:52Z

Tabula has an implementation based on Anssi Nurminen's master's thesis, so I'm starting from there.

imri · 2018-11-22T09:51:20Z

Hi there, thanks for this library! :)

Regarding table detection algorithms - I know that Tabula uses Nurminen's algorithm, but I was wondering - is it the best algorithm that's out there? Do you guys know of any other / better ones?

Thanks a lot - you guys rock!

vinayak-mehta · 2018-11-22T12:00:55Z

Hey @imri!

There has been a lot of research on detecting and extracting tables from PDFs. All the approaches I've seen are heuristics, and no single one gives 100% table detection accuracy. pdf2table is where I started some time back, and I followed the citations and links from there.

I've seen Tabula and Nurminen's algorithm work really well on tables that don't have both vertical and horizontal ruling lines and instead rely on spaces to form the grid. As Tabula's author states in this comment tabulapdf/tabula-java#49 (comment), it got really good results in ICDAR 2013 and passed most of the table detection tests that Tabula has. The ones it didn't pass were just corner cases, which could be extracted by specifying table areas or column separators.
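
For the corner cases mentioned above, here is a minimal sketch of how table areas and column separators could be prepared for Camelot's Stream flavor. The helper names `to_table_area` and `to_columns` are hypothetical and only for illustration. The sketch assumes that `table_areas` takes strings of the form "x1,y1,x2,y2" (top-left, then bottom-right, in PDF coordinate space) and that `columns` takes comma-separated x coordinates.

```python
def to_table_area(bbox):
    """Convert a (x0, y0, x1, y1) box given as bottom-left / top-right
    corners into an "x1,y1,x2,y2" string (top-left, bottom-right)."""
    x0, y0, x1, y1 = bbox
    return f"{x0},{y1},{x1},{y0}"


def to_columns(separators):
    """Join column separator x coordinates into one comma-separated string."""
    return ",".join(str(x) for x in sorted(separators))


area = to_table_area((316, 337, 566, 499))
cols = to_columns([442, 327, 529])

# Possible usage, assuming Camelot is installed and "foo.pdf" exists:
#   import camelot
#   tables = camelot.read_pdf("foo.pdf", flavor="stream",
#                             table_areas=[area], columns=[cols])
```

Each entry in `columns` is meant to line up with the entry at the same position in `table_areas`.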

Since that was 5 years ago, it's possible that other well-performing approaches have been devised in the meantime. If you come across one, please let us know!

imri · 2018-11-24T16:15:34Z

Thanks for your reply.

Weird, because when I use it on the simplest document I have, it's not working well.
As far as I understand, Nurminen's algorithm is part of Tabula's Autodetect Tables feature. When using it on this document:

It resulted in the following:

As you can see, it unified the 'From:' and the right table into one selection area, which is wrong.

Is this a corner case? Isn't it the simplest case there is?

Thanks,
Imri

vinayak-mehta · 2018-11-24T17:19:53Z

Nurminen's master's thesis states that after calculating left, middle and right text edges (which are vertical lines that pass through similarly aligned text), each text row is assigned a probability of being part of a table. The comment here NurminenDetectionAlgorithm.java#L237 says that Tabula uses a general heuristic instead, by trying to find the text edge type that intersects most horizontal text rows and then generating table areas using those text edges.
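
The general heuristic described above can be sketched roughly as follows. This is not Tabula's or Camelot's code. The function name, the `(x0, x1, y)` word format, and the `tol` and `min_rows` parameters are all assumptions made for illustration, and text rows are identified simply by an exact shared `y` value.

```python
def dominant_text_edges(words, tol=2.0, min_rows=2):
    """Find the alignment type (left, middle or right) whose text edges
    intersect the most text rows.

    words: list of (x0, x1, y) tuples, where y identifies the text row.
    Returns (edge_type, sorted x positions of the edges of that type).
    """
    # For each alignment type, keep a list of [x, set_of_rows] edges.
    edges = {"left": [], "middle": [], "right": []}
    for x0, x1, y in words:
        coords = {"left": x0, "middle": (x0 + x1) / 2, "right": x1}
        for kind, x in coords.items():
            for edge in edges[kind]:
                if abs(edge[0] - x) <= tol:
                    edge[1].add(y)  # word is aligned with an existing edge
                    break
            else:
                edges[kind].append([x, {y}])  # start a new candidate edge

    # Score each type by the total rows crossed by its relevant edges.
    kept, scores = {}, {}
    for kind, found in edges.items():
        kept[kind] = [(x, len(rows)) for x, rows in found if len(rows) >= min_rows]
        scores[kind] = sum(n for _, n in kept[kind])

    best = max(scores, key=scores.get)
    return best, sorted(x for x, _ in kept[best])


# Two left-aligned columns over three text rows, with varying word widths.
words = [
    (10, 30, 100), (60, 75, 100),
    (10, 42, 90), (60, 90, 90),
    (10, 25, 80), (60, 82, 80),
]
edge_type, positions = dominant_text_edges(words)
```

In this example only the left edges line up across rows, so the left type wins and its edges would be used to generate the table area.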

I'm guessing that Tabula extends the table areas by including text rows that share a y-axis overlap with the table areas, because that is what I've done in #206 (see comment here L180) and the image that you showed is the kind of result I expect using stream. I went with this approach to include the cases where table columns share different alignments. For example: if a table has 4 columns that are left aligned and 3 columns that are right aligned, the left aligned columns would win the majority due to the sum of their intersections with horizontal text rows, leading to the table area being limited to only left aligned columns.
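
A minimal sketch of the y-axis overlap extension described above, to show why text such as 'From:' ends up inside the table area when it sits on the same rows as the table. This is an illustration only, not the code from #206. The function name and the `(x0, y0, x1, y1)` box format (bottom-left and top-right corners) are assumptions.

```python
def extend_area_by_row_overlap(area, textlines):
    """Widen a table area so that it covers every text line whose
    vertical span overlaps the area's vertical span.

    area and each text line are (x0, y0, x1, y1) with y0 < y1.
    """
    ax0, ay0, ax1, ay1 = area
    for tx0, ty0, tx1, ty1 in textlines:
        if ty0 < ay1 and ty1 > ay0:  # the two y ranges overlap
            ax0 = min(ax0, tx0)
            ax1 = max(ax1, tx1)
    return (ax0, ay0, ax1, ay1)


table_area = (300, 500, 500, 700)
textlines = [
    (50, 650, 90, 662),   # a label to the left, on the same rows as the table
    (50, 100, 200, 112),  # a paragraph far below the table
]
extended = extend_area_by_row_overlap(table_area, textlines)
```

The label shares rows with the table, so the area grows leftwards to include it, while the paragraph below is ignored.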

However, in the case that you showed, lattice should be able to work perfectly since the table cells are separated by lines.

imri · 2018-11-25T08:39:06Z

Interesting.
Is it possible to detect (and extract) tables using both Stream and Lattice together?

When you say that lattice should work perfectly - I sort of wish to create a generic way to detect and extract tables without having to know which detection method (lattice / stream) is best for a given document - I want to decouple them as much as possible.

Imri

vinayak-mehta · 2018-12-01T01:40:21Z

> Is it possible to detect (and extract) tables using both Stream and Lattice together?

I get your use-case and it is not possible currently through the library itself. But I see two possibilities which can be implemented (both heuristics):

1. As far as I can tell from NurminenDetectionAlgorithm.java, Tabula first filters out all Lattice-type tables from the document and then looks for Stream-type tables until it cannot find any more. Similarly, we can "couple" both flavors into a single one inside Camelot.
2. We can create a flavor called guess which automatically chooses between Lattice and Stream.
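
Until something like this exists in the library, a very rough user-side approximation of the guess idea could look like the sketch below. The function `read_with_guess` is hypothetical, and the "try Lattice first, fall back to Stream if nothing is found" rule is an assumption for illustration, not a heuristic taken from Camelot or Tabula. The reader is passed in as a callable, assumed to behave like `camelot.read_pdf(path, flavor=...)`.

```python
def read_with_guess(path, read_pdf, **kwargs):
    """Try the Lattice flavor first and fall back to Stream when it
    finds no tables. Returns (flavor_used, tables).

    read_pdf: a callable assumed to accept (path, flavor=..., **kwargs)
    and return a sized collection of tables.
    """
    tables = read_pdf(path, flavor="lattice", **kwargs)
    if len(tables) > 0:
        return "lattice", tables
    return "stream", read_pdf(path, flavor="stream", **kwargs)


# Stand-in reader for demonstration: pretends the PDF has no ruling lines.
def fake_read_pdf(path, flavor="lattice", **kwargs):
    return [] if flavor == "lattice" else ["table-1", "table-2"]


flavor, tables = read_with_guess("example.pdf", fake_read_pdf)

# Possible usage, assuming Camelot is installed:
#   import camelot
#   flavor, tables = read_with_guess("foo.pdf", camelot.read_pdf, pages="1")
```

This chooses a single flavor per call, so it does not handle a page that mixes Lattice-type and Stream-type tables.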

vinayak-mehta · 2018-12-01T01:43:48Z

@imri Let's continue the conversation on the issue I just opened.

abhilashabhardwaj · 2020-06-12T02:03:03Z

@vinayak-mehta, I was wondering whether any code has been merged for the 'guess' flavor?
I'm having trouble getting the following identified as a table.
Stream gives the entire page as a table.
edge_tol doesn't work.
I can't use visual debugging to identify coordinates at run-time.

ShanksDS · 2020-10-15T06:36:44Z

Hi @vinayak-mehta
Given that Camelot is the best for these things, I am trying to pull a huge set of pdfs which look like these:

There's a table in red, then one in blue, and then the 3rd table in green starts and extends onto the next page. Some pages have 2 tables, some have 1, and so on. The layout inside one table (like blue) doesn't necessarily match the others (like red), either.
I have to automate this and run it through several hundred docs.
I've tried several iterations of the options that are provided, but none seem to work. Does Camelot have any option that I could use? Presently I'm using edge_tol, Col_tol, row_tol and flavor = 'stream'.

vinayak-mehta added the enhancement label Sep 11, 2018

vinayak-mehta mentioned this issue Oct 30, 2018

How to manage such table? #182

Closed

vinayak-mehta self-assigned this Oct 30, 2018

vinayak-mehta mentioned this issue Oct 30, 2018

Add public interface to access detected geometries #186

Closed

vinayak-mehta mentioned this issue Nov 22, 2018

[MRG] Add implementation of Anssi Nurminen's table detection algorithm #206

Merged

5 tasks

vinayak-mehta closed this as completed in #206 Nov 23, 2018

vinayak-mehta mentioned this issue Dec 1, 2018

Automatically choose flavor based on type of table in PDF #211

Closed

vinayak-mehta mentioned this issue Jul 4, 2019

Automatically choose flavor based on type of table in PDF camelot-dev/camelot#19

Open
