Fix cells with duplicated indexes #341

Merged
badGarnet merged 2 commits into main from fix-returning-cells-with-same-coordinates
May 1, 2024

Conversation

plutasnyy
Contributor

@plutasnyy plutasnyy commented Apr 30, 2024

The post-processing in table-transformer can return multiple cells with the same indexes. This happens when a single cell is covered by more than 50% by two different spanning cells: the cell is then assigned as a subcell of both spanning cells instead of only the one with the highest probability.

Example: a simple 2x2 table

# +-----------+----------+
# |    one    |   two    |
# |-----------+----------|
# |    three  |   four   |
# +-----------+----------+

Suppose there are two spanning cells: one over cells 'one' and 'three' (a column-spanning cell) and another over 'one' and 'two' (a row-spanning cell). In this case, cell 'one' is assigned to both spanning cells. Reproduction:

from pprint import pprint

from unstructured_inference.models.tables import structure_to_cells

table_structure = {
    "rows": [
        {"bbox": [0, 0, 10, 20]},
        {"bbox": [10, 0, 20, 20]},
    ],
    "columns": [
        {"bbox": [0, 0, 20, 10]},
        {"bbox": [0, 10, 20, 20]},
    ],
    "spanning cells": [
        {"bbox": [0, 0, 20, 10], "score": 0.9, "projected row header": False},
        {"bbox": [0, 0, 10, 20], "score": 0.8, "projected row header": False},
    ],
}
tokens = [
    {"text": "one", "bbox": [0, 0, 10, 10], "span_num": 1, "line_num": 1, "block_num": 1},
    {"text": "two", "bbox": [0, 10, 10, 20], "span_num": 1, "line_num": 1, "block_num": 1},
    {"text": "three", "bbox": [10, 0, 20, 10], "span_num": 1, "line_num": 1, "block_num": 1},
    {"text": "four", "bbox": [10, 10, 20, 20], "span_num": 1, "line_num": 1, "block_num": 1},
]

predicted_cells, _ = structure_to_cells(table_structure, tokens=tokens)
pprint(predicted_cells)

This yields:

[
....
 {'cell text': 'one three',
  'column_nums': [0],
  'row_nums': [0, 1]
},
 {'cell text': 'two',
  'column_nums': [0, 1],
  'row_nums': [0]
}]

Note that the coordinates (0, 0), i.e. row 0 and column 0, are included in both spanning cells.
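A quick way to check for this kind of overlap is to count how many cells claim each (row, column) pair. The helper below is only an illustrative sketch: `find_duplicate_coordinates` is a hypothetical name and not part of `unstructured_inference`, and it assumes each cell is a dict with the `row_nums` and `column_nums` keys shown in the output above.

```python
from collections import Counter
from itertools import product


def find_duplicate_coordinates(cells):
    """Return the (row, column) pairs claimed by more than one cell.

    Hypothetical helper for illustration; assumes each cell dict has
    'row_nums' and 'column_nums' lists, as in the output above.
    """
    counts = Counter(
        coord
        for cell in cells
        for coord in product(cell["row_nums"], cell["column_nums"])
    )
    return sorted(coord for coord, n in counts.items() if n > 1)


# The two spanning cells from the output above, reduced to their indexes.
cells = [
    {"cell text": "one three", "row_nums": [0, 1], "column_nums": [0]},
    {"cell text": "two", "row_nums": [0], "column_nums": [0, 1]},
]
print(find_duplicate_coordinates(cells))  # [(0, 0)]
```

An empty result means no two cells share a coordinate, which is the behaviour expected once the fix is applied.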

This PR fixes the issue by assigning each cell only to the most probable spanning cell.
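The idea can be sketched as follows. This is not the code from this PR, only a minimal illustration under stated assumptions: `overlap_fraction` and `pick_spanning_cell` are hypothetical names, bounding boxes are assumed to be `[x1, y1, x2, y2]`, and the 0.5 threshold mirrors the "more than 50%" coverage rule described above.

```python
def overlap_fraction(cell_bbox, span_bbox):
    """Fraction of cell_bbox's area that lies inside span_bbox."""
    x1 = max(cell_bbox[0], span_bbox[0])
    y1 = max(cell_bbox[1], span_bbox[1])
    x2 = min(cell_bbox[2], span_bbox[2])
    y2 = min(cell_bbox[3], span_bbox[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area = (cell_bbox[2] - cell_bbox[0]) * (cell_bbox[3] - cell_bbox[1])
    return intersection / area if area > 0 else 0.0


def pick_spanning_cell(cell_bbox, spanning_cells, threshold=0.5):
    """Return the index of the highest-scoring spanning cell that covers
    more than `threshold` of the cell, or None if no spanning cell does."""
    candidates = [
        (index, span)
        for index, span in enumerate(spanning_cells)
        if overlap_fraction(cell_bbox, span["bbox"]) > threshold
    ]
    if not candidates:
        return None
    # Keep only the most probable candidate instead of assigning to all.
    return max(candidates, key=lambda item: item[1]["score"])[0]


# Spanning cells taken from the reproduction above.
spanning_cells = [
    {"bbox": [0, 0, 20, 10], "score": 0.9},
    {"bbox": [0, 0, 10, 20], "score": 0.8},
]
# The box [0, 0, 10, 10] is fully covered by both spanning cells,
# so it goes to the one with the higher score (index 0).
print(pick_spanning_cell([0, 0, 10, 10], spanning_cells))  # 0
```

With this rule a cell covered by several spanning cells ends up in exactly one of them, so no coordinate is reported twice.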

@plutasnyy plutasnyy self-assigned this Apr 30, 2024
@plutasnyy plutasnyy requested a review from badGarnet April 30, 2024 17:07
@plutasnyy plutasnyy marked this pull request as ready for review April 30, 2024 17:07
@plutasnyy plutasnyy changed the title from "Fix cells with duplicated indxes" to "Fix cells with duplicated indexes" Apr 30, 2024
@badGarnet badGarnet merged commit a381155 into main May 1, 2024
5 of 7 checks passed
@badGarnet badGarnet deleted the fix-returning-cells-with-same-coordinates branch May 1, 2024 20:13