Fix table box snapping

Signed-off-by: Christoph Auer <[email protected]>
DS4SD · Dec 13, 2024 · dd4f72e
1 parent 3f854bd

Showing 41 changed files with 386 additions and 394 deletions.
diff --git a/docling/utils/layout_postprocessor.py b/docling/utils/layout_postprocessor.py
@@ -323,21 +323,21 @@ def _process_special_clusters(self) -> List[Cluster]:
                 contained = self._sort_clusters(contained)
                 special.children = contained
 
-                # Adjust bbox only for wrapper types
-                if special.label in self.WRAPPER_TYPES:
+                # Adjust bbox only for Form and Key-Value-Region, not Table or Picture
+                if special.label in [DocItemLabel.FORM, DocItemLabel.KEY_VALUE_REGION]:
                     special.bbox = BoundingBox(
                         l=min(c.bbox.l for c in contained),
                         t=min(c.bbox.t for c in contained),
                         r=max(c.bbox.r for c in contained),
                         b=max(c.bbox.b for c in contained),
                     )
 
-                    # Collect all cells from children
-                    all_cells = []
-                    for child in contained:
-                        all_cells.extend(child.cells)
-                    special.cells = self._deduplicate_cells(all_cells)
-                    special.cells = self._sort_cells(special.cells)
+                # Collect all cells from children
+                all_cells = []
+                for child in contained:
+                    all_cells.extend(child.cells)
+                special.cells = self._deduplicate_cells(all_cells)
+                special.cells = self._sort_cells(special.cells)
 
         picture_clusters = [
             c for c in special_clusters if c.label == DocItemLabel.PICTURE
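
The behavioral change in this hunk can be illustrated with a small standalone sketch. It is not the docling implementation: `Box`, `Cluster`, `SNAP_LABELS`, and `absorb_children` are hypothetical stand-ins, cells are modelled as plain integers, and `sorted(set(...))` stands in for `_deduplicate_cells` and `_sort_cells`. The sketch only shows the intent visible in the diff: the bounding box is snapped to the children for Form and Key-Value-Region clusters, while cells are collected from the children for every special cluster, including Table and Picture.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    # Hypothetical stand-in for BoundingBox (left, top, right, bottom).
    l: float
    t: float
    r: float
    b: float


@dataclass
class Cluster:
    # Hypothetical stand-in for the Cluster type; cells are plain ints here.
    label: str
    bbox: Box
    cells: List[int] = field(default_factory=list)
    children: List["Cluster"] = field(default_factory=list)


# Assumption: string labels instead of DocItemLabel members.
SNAP_LABELS = {"form", "key_value_region"}


def absorb_children(special: Cluster, contained: List[Cluster]) -> Cluster:
    """Attach children; snap the bbox only for labels in SNAP_LABELS."""
    special.children = contained
    if not contained:
        return special

    if special.label in SNAP_LABELS:
        # Shrink or grow the wrapper to the union of its children.
        special.bbox = Box(
            l=min(c.bbox.l for c in contained),
            t=min(c.bbox.t for c in contained),
            r=max(c.bbox.r for c in contained),
            b=max(c.bbox.b for c in contained),
        )

    # After the change, cells are gathered for every special cluster.
    merged: List[int] = []
    for child in contained:
        merged.extend(child.cells)
    special.cells = sorted(set(merged))  # simplified dedup + sort
    return special


def make_children() -> List[Cluster]:
    return [
        Cluster("text", Box(10, 10, 20, 20), cells=[2, 1]),
        Cluster("text", Box(30, 30, 40, 40), cells=[2, 3]),
    ]


table = absorb_children(Cluster("table", Box(0, 0, 100, 100)), make_children())
form = absorb_children(Cluster("form", Box(0, 0, 100, 100)), make_children())
```

In this sketch `table` keeps its predicted box while still inheriting the children's cells, and `form` is snapped to the union of its children.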

diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt b/tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt
diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.json b/tests/data/groundtruth/docling_v1/2203.01017v2.json
diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.md b/tests/data/groundtruth/docling_v1/2203.01017v2.md
@@ -12,14 +12,12 @@
 
 The occurrence of tables in documents is ubiquitous. They often summarise quantitative or factual data, which is cumbersome to describe in verbose text but nevertheless extremely valuable. Unfortunately, this compact representation is often not easy to parse by machines. There are many implicit conventions used to obtain a compact table representation. For example, tables often have complex columnand row-headers in order to reduce duplicated cell content. Lines of different shapes and sizes are leveraged to separate content or indicate a tree structure. Additionally, tables can also have empty/missing table-entries or multi-row textual table-entries. Fig. 1 shows a table which presents all these issues.
 
-Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-toend deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.
 
-|    | 3   |
-|----|-----|
-|  2 |     |
+<!-- image -->
+
+Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-toend deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.
 
 
-<!-- image -->
 
 - b. Red-annotation of bounding boxes, Blue-predictions by TableFormer
 
@@ -29,16 +27,16 @@ Tables organize valuable content in a concise and compact representation. This c
 - c. Structure predicted by TableFormer:
 
 
-
-| 0      | 1 2   |   1 |
-|--------|-------|-----|
-| 3 4    | 5 3   |   6 |
-| 9      | 10    |  11 |
-| 8 13 2 | 14    |  15 |
-| 17     | 18    |  19 |
+<!-- image -->
 
 Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
-<!-- image -->
+
+| 0   |   1 | 1   |   2 1 |   2 1 |    |
+|-----|-----|-----|-------|-------|----|
+| 3   |   4 | 5 3 |     6 |     7 |    |
+| 8   |   9 | 10  |    11 |    12 | 2  |
+|     |  13 | 14  |    15 |    16 | 2  |
+|     |  17 | 18  |    19 |    20 | 2  |
 
 Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.
 

diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.pages.json b/tests/data/groundtruth/docling_v1/2203.01017v2.pages.json