Commit

Fix table box snapping
Signed-off-by: Christoph Auer <[email protected]>
cau-git committed Dec 13, 2024
1 parent 3f854bd commit dd4f72e
Showing 41 changed files with 386 additions and 394 deletions.
16 changes: 8 additions & 8 deletions docling/utils/layout_postprocessor.py
@@ -323,21 +323,21 @@ def _process_special_clusters(self) -> List[Cluster]:
 contained = self._sort_clusters(contained)
 special.children = contained

-# Adjust bbox only for wrapper types
-if special.label in self.WRAPPER_TYPES:
+# Adjust bbox only for Form and Key-Value-Region, not Table or Picture
+if special.label in [DocItemLabel.FORM, DocItemLabel.KEY_VALUE_REGION]:
 special.bbox = BoundingBox(
 l=min(c.bbox.l for c in contained),
 t=min(c.bbox.t for c in contained),
 r=max(c.bbox.r for c in contained),
 b=max(c.bbox.b for c in contained),
 )

-# Collect all cells from children
-all_cells = []
-for child in contained:
-all_cells.extend(child.cells)
-special.cells = self._deduplicate_cells(all_cells)
-special.cells = self._sort_cells(special.cells)
+# Collect all cells from children
+all_cells = []
+for child in contained:
+all_cells.extend(child.cells)
+special.cells = self._deduplicate_cells(all_cells)
+special.cells = self._sort_cells(special.cells)

 picture_clusters = [
 c for c in special_clusters if c.label == DocItemLabel.PICTURE
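The hunk above limits the bounding-box adjustment to Form and Key-Value-Region clusters, so Table and Picture clusters keep the box they were detected with. A minimal, self-contained sketch of that idea follows. `Box`, `Region`, `SNAP_LABELS`, and `snap_to_children` are hypothetical stand-ins written for illustration, not docling's classes or API, and the sketch assumes a top-left origin so that `t` is the smaller coordinate.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for illustration; these are not docling's classes.
@dataclass
class Box:
    l: float
    t: float
    r: float
    b: float


@dataclass
class Region:
    label: str
    bbox: Box
    children: List["Region"] = field(default_factory=list)


# Assumed label names: only these region types are snapped to their children.
SNAP_LABELS = {"form", "key_value_region"}


def snap_to_children(region: Region) -> Region:
    """Set region.bbox to the union of its children's boxes, for snap labels only."""
    if region.label in SNAP_LABELS and region.children:
        region.bbox = Box(
            l=min(c.bbox.l for c in region.children),
            t=min(c.bbox.t for c in region.children),
            r=max(c.bbox.r for c in region.children),
            b=max(c.bbox.b for c in region.children),
        )
    return region


kids = [Region("text", Box(10, 10, 50, 20)), Region("text", Box(30, 25, 90, 40))]
form = snap_to_children(Region("form", Box(0, 0, 100, 100), list(kids)))
table = snap_to_children(Region("table", Box(0, 0, 100, 100), list(kids)))
```

In this sketch the form's box tightens to the union of its children, while the table's box is left untouched, which mirrors the intent stated in the new code comment.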
47 changes: 23 additions & 24 deletions tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tests/data/groundtruth/docling_v1/2203.01017v2.json

Large diffs are not rendered by default.

24 changes: 11 additions & 13 deletions tests/data/groundtruth/docling_v1/2203.01017v2.md
@@ -12,14 +12,12 @@

The occurrence of tables in documents is ubiquitous. They often summarise quantitative or factual data, which is cumbersome to describe in verbose text but nevertheless extremely valuable. Unfortunately, this compact representation is often not easy to parse by machines. There are many implicit conventions used to obtain a compact table representation. For example, tables often have complex columnand row-headers in order to reduce duplicated cell content. Lines of different shapes and sizes are leveraged to separate content or indicate a tree structure. Additionally, tables can also have empty/missing table-entries or multi-row textual table-entries. Fig. 1 shows a table which presents all these issues.

Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-toend deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.

| | 3 |
|----|-----|
| 2 | |
<!-- image -->

Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-toend deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.


<!-- image -->

- b. Red-annotation of bounding boxes, Blue-predictions by TableFormer

@@ -29,16 +27,16 @@ Tables organize valuable content in a concise and compact representation. This c
- c. Structure predicted by TableFormer:



| 0 | 1 2 | 1 |
|--------|-------|-----|
| 3 4 | 5 3 | 6 |
| 9 | 10 | 11 |
| 8 13 2 | 14 | 15 |
| 17 | 18 | 19 |
<!-- image -->

Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
<!-- image -->

| 0 | 1 | 1 | 2 1 | 2 1 | |
|-----|-----|-----|-------|-------|----|
| 3 | 4 | 5 3 | 6 | 7 | |
| 8 | 9 | 10 | 11 | 12 | 2 |
| | 13 | 14 | 15 | 16 | 2 |
| | 17 | 18 | 19 | 20 | 2 |

Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.

2 changes: 1 addition & 1 deletion tests/data/groundtruth/docling_v1/2203.01017v2.pages.json

Large diffs are not rendered by default.
