Detect tables automagically when Stream is used #102
Tabula has an implementation based on Anssi Nurminen's master's thesis. Starting from there.
Hi there, thanks for this library! :) Regarding table detection algorithms: I know that Tabula uses Nurminen's algorithm, but I was wondering whether it is the best algorithm out there. Do you guys know of any other / better ones? Thanks a lot - you guys rock!
Hey @imri! There has been a lot of research on detecting and extracting tables from PDFs. All the approaches I've seen are heuristics, and no single one gives 100% table detection accuracy. pdf2table is where I started some time back, and I followed the citations and links from there. I've seen Tabula and Nurminen's algorithm work really well on tables that don't have both vertical and horizontal ruling lines and instead rely on spaces to form the grid. As Tabula's author states in this comment tabulapdf/tabula-java#49 (comment), it got really good results in ICDAR 2013 and passed most of the table detection tests that Tabula has. The ones it didn't pass were just corner cases, which could be extracted by specifying table areas or column separators. Since that was 5 years ago, it's possible that other performant approaches have been devised since. If you come across one, please let us know!
Nurminen's master's thesis states that after calculating left, middle and right text edges (which are vertical lines that pass through similarly aligned text), each text row is assigned a probability of being part of a table. The comment here NurminenDetectionAlgorithm.java#L237 says that Tabula uses a general heuristic instead, by trying to find the text edge type that intersects most horizontal text rows and then generating table areas using those text edges. I'm guessing that Tabula extends the table areas by including text rows that share a y-axis overlap with the table areas, because that is what I've done in #206 (see comment here L180) and the image that you showed is the kind of result I expect using However, in the case that you showed,
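To make the text-edge idea above concrete, here is a minimal sketch in Python. It is not Tabula's or Camelot's actual code: the `Word` tuple, the function names, and the `tol` and `min_rows` parameters are all hypothetical, and the clustering is deliberately naive. It groups left, middle and right x-coordinates into vertical edges, then picks the edge type whose strongest edge crosses the most text rows.

```python
from collections import namedtuple

# Hypothetical input: horizontal extent of a word plus the index of its text row.
Word = namedtuple("Word", "x0 x1 row")


def cluster_edges(coords, tol=2.0):
    """Group (x, row) pairs whose x lies within `tol` of the cluster's first x."""
    edges = []  # each edge is [anchor_x, set_of_rows_it_crosses]
    for x, row in sorted(coords):
        if edges and x - edges[-1][0] <= tol:
            edges[-1][1].add(row)
        else:
            edges.append([x, {row}])
    return edges


def dominant_edge_type(words, tol=2.0, min_rows=2):
    """Return the alignment type whose best edge crosses the most text rows."""
    candidates = {
        "left": [(w.x0, w.row) for w in words],
        "middle": [((w.x0 + w.x1) / 2, w.row) for w in words],
        "right": [(w.x1, w.row) for w in words],
    }
    best_type, best_edges, best_score = None, [], 0
    for kind, coords in candidates.items():
        # Keep only edges that pass through at least `min_rows` rows.
        edges = [e for e in cluster_edges(coords, tol) if len(e[1]) >= min_rows]
        score = max((len(rows) for _, rows in edges), default=0)
        if score > best_score:
            best_type, best_edges, best_score = kind, edges, score
    return best_type, best_edges


# Three rows, two left-aligned columns with varying word widths.
words = [
    Word(10, 30, 0), Word(100, 140, 0),
    Word(10, 50, 1), Word(100, 120, 1),
    Word(10, 40, 2), Word(100, 160, 2),
]
kind, edges = dominant_edge_type(words)
table_rows = sorted(set().union(*(rows for _, rows in edges)))
```

In this sketch the rows crossed by the winning edges (`table_rows`) would bound a candidate table area; extending that area with rows that overlap it on the y-axis would be a separate step.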
Interesting. When you say that lattice should work perfectly: I'd like a generic way to detect and extract tables without having to know which detection method (lattice / stream) is best for a given document. I want to decouple them as much as possible. Imri
I get your use-case and it is not possible currently through the library itself. But I see two possibilities which can be implemented (both heuristics):
@imri Let's continue the conversation on the issue I just opened.
@vinayak-mehta, I was wondering if any code has been merged regarding the 'guess' flavor?
Hi @vinayak-mehta, there's a table in red, then one in blue, and then the 3rd table in green starts and extends onto the next page. Some pages have 2 tables, some have 1, and so on. The insides of a particular table (like blue) wouldn't necessarily match those of the others (like red), either.
By default, Stream treats the whole page as a table right now, which fails when there are two or more Stream-type tables on the same page with different numbers of columns. Change the default to fallback.
12s0324.pdf and how Tabula does this should be a good start.
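As a rough illustration of the failure mode described above, the sketch below splits a page's text rows into separate candidate tables wherever the column count changes. This is only a naive heuristic for discussion, not what Tabula or Camelot does; the function name and the input shape (a list of rows, each a list of cell strings, ordered top to bottom) are assumptions.

```python
def split_rows_by_column_count(rows):
    """Return (start, end) index ranges, end exclusive, of consecutive rows
    that share the same number of columns."""
    groups = []
    start = 0
    for i in range(1, len(rows) + 1):
        # Close the current group at the end of the page or when the
        # column count differs from the group's first row.
        if i == len(rows) or len(rows[i]) != len(rows[start]):
            groups.append((start, i))
            start = i
    return groups


# Two stacked Stream-type tables: a 2-column one followed by a 3-column one.
page_rows = [
    ["a", "b"],
    ["c", "d"],
    ["x", "y", "z"],
    ["1", "2", "3"],
]
groups = split_rows_by_column_count(page_rows)
```

Treating the whole page as one table would force these four rows into a single grid; splitting first lets each range be parsed with its own column separators.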