fix: disable table_as_cells output by default (#3093)
This PR changes the output of table elements: by default, a table
element's `metadata.table_as_cells` is now `None`. The data is only
populated when the environment variable `EXTRACT_TABLE_AS_CELLS` is set to `true`.

`table_as_cells` was originally designed for evaluating table
extraction performance. The format itself is not as readable as the
`text_as_html` metadata for human or RAG consumption, so this data is
not needed by default.

Since this output is meant for evaluation, this PR uses an
environment variable to control whether it is present in the
partitioned results. This approach avoids adding a parameter to the
`partition` function call. A new parameter would make the `partition`
interface more complex and raise maintenance cost, because it would
have to be passed down a long chain of function calls to the place
where it is needed.
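
As an illustration of the environment-variable approach described above, here is a minimal sketch of gating an optional metadata field on `EXTRACT_TABLE_AS_CELLS`. The helper names `extract_table_as_cells_enabled` and `build_table_metadata` are hypothetical and are not taken from the `unstructured` codebase; the actual implementation may read and apply the variable differently.

```python
import os
from typing import Any, Optional


def extract_table_as_cells_enabled() -> bool:
    """Hypothetical helper: True only when the env var is set to "true"."""
    value = os.environ.get("EXTRACT_TABLE_AS_CELLS", "false")
    return value.strip().lower() == "true"


def build_table_metadata(
    html: str, cells: Optional[list]
) -> dict[str, Any]:
    """Hypothetical helper: attach cell data only when the flag is on."""
    metadata: dict[str, Any] = {"text_as_html": html, "table_as_cells": None}
    if extract_table_as_cells_enabled():
        metadata["table_as_cells"] = cells
    return metadata
```

Because the flag is read from the process environment, no caller along the call chain has to accept or forward an extra argument.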

## test

Run the following code snippet on main and on this PR:

```python
from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[])
# Collect the cell-level data (or None) for every Table element.
table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"]
```

On the main branch, `table_cells` contains structured cell data, but on
this branch it is a list of `None`.

However, if we first set the variable in the terminal:

```bash
export EXTRACT_TABLE_AS_CELLS=true
```

and then run the same code again with this PR, `table_cells`
contains actual data, the same as on the main branch.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
3 people authored May 24, 2024
1 parent 809c7e5 commit 32df4ee
Showing 5 changed files with 16 additions and 559 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -9,11 +9,13 @@

### Fixes

* **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
to avoid text being dynamically injected into the XML document.
* **Add backward compatibility for the deprecated pdf_infer_table_structure parameter**.
* **Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call**.
* **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
to avoid text being dynamically injected into the XML document.
* **Chromadb change from Add to Upsert using element_id to make idempotent**
* **Disable `table_as_cells` output by default** to reduce overhead in partition; now `table_as_cells` is only produced when the env `EXTRACT_TABLE_AS_CELLS` is `true`
* **Reduce excessive logging** Change per-page OCR info-level logging into detail-level trace logging
* **Replace try block in `document_to_element_list` for handling HTMLDocument** Use `getattr(element, "type", "")` to get the `type` attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute errors from being silenced by the try block

@@ -49,176 +49,6 @@
"text": "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century \u2018TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents",
"metadata": {
"text_as_html": "<table><thead><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></thead><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></table>",
"table_as_cells": [
{
"x": 0,
"y": 0,
"w": 1,
"h": 1,
"content": "Dataset"
},
{
"x": 0,
"y": 1,
"w": 1,
"h": 1,
"content": "PubLayNet [33]"
},
{
"x": 0,
"y": 2,
"w": 1,
"h": 1,
"content": "PRImA [3]"
},
{
"x": 0,
"y": 3,
"w": 1,
"h": 1,
"content": "Newspaper [17]"
},
{
"x": 0,
"y": 4,
"w": 1,
"h": 1,
"content": "TableBank [18]"
},
{
"x": 0,
"y": 5,
"w": 1,
"h": 1,
"content": "HIDataset [31]"
},
{
"x": 1,
"y": 0,
"w": 1,
"h": 1,
"content": "| Base Model!|"
},
{
"x": 1,
"y": 1,
"w": 1,
"h": 1,
"content": "P/M"
},
{
"x": 1,
"y": 2,
"w": 1,
"h": 1,
"content": "M"
},
{
"x": 1,
"y": 3,
"w": 1,
"h": 1,
"content": "P"
},
{
"x": 1,
"y": 4,
"w": 1,
"h": 1,
"content": "P"
},
{
"x": 1,
"y": 5,
"w": 1,
"h": 1,
"content": "P/M"
},
{
"x": 2,
"y": 0,
"w": 1,
"h": 1,
"content": "Large Model"
},
{
"x": 2,
"y": 1,
"w": 1,
"h": 1,
"content": "M"
},
{
"x": 2,
"y": 2,
"w": 1,
"h": 1,
"content": ""
},
{
"x": 2,
"y": 3,
"w": 1,
"h": 1,
"content": ""
},
{
"x": 2,
"y": 4,
"w": 1,
"h": 1,
"content": ""
},
{
"x": 2,
"y": 5,
"w": 1,
"h": 1,
"content": ""
},
{
"x": 3,
"y": 0,
"w": 1,
"h": 1,
"content": "| Notes"
},
{
"x": 3,
"y": 1,
"w": 1,
"h": 1,
"content": "Layouts of modern scientific documents"
},
{
"x": 3,
"y": 2,
"w": 1,
"h": 1,
"content": "Layouts of scanned modern magazines and scientific reports"
},
{
"x": 3,
"y": 3,
"w": 1,
"h": 1,
"content": "Layouts of scanned US newspapers from the 20th century"
},
{
"x": 3,
"y": 4,
"w": 1,
"h": 1,
"content": "Table region on modern scientific and business document"
},
{
"x": 3,
"y": 5,
"w": 1,
"h": 1,
"content": "Layouts of history Japanese documents"
}
],
"filetype": "image/jpeg",
"languages": [
"eng"
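
The `table_as_cells` entries in the fixture above are flat records with `x`, `y`, `w`, `h`, and `content` keys. As a rough sketch of how such records could be turned back into a row-by-row grid, the function below assumes that `x` is the column index, `y` is the row index, and `w`/`h` are column and row spans. That reading matches the fixture data, but it is an interpretation rather than documented behaviour, and `cells_to_grid` is a hypothetical helper, not part of `unstructured`.

```python
from __future__ import annotations

from typing import Any


def cells_to_grid(cells: list[dict[str, Any]]) -> list[list[str]]:
    """Hypothetical helper: rebuild a 2D grid from flat cell records.

    Assumes x = column index, y = row index, w/h = column/row span.
    Only the top-left position of a spanning cell receives its content.
    """
    if not cells:
        return []
    n_cols = max(cell["x"] + cell["w"] for cell in cells)
    n_rows = max(cell["y"] + cell["h"] for cell in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["y"]][cell["x"]] = cell["content"]
    return grid


# A four-cell subset of the fixture: the first two rows of the first two columns.
sample = [
    {"x": 0, "y": 0, "w": 1, "h": 1, "content": "Dataset"},
    {"x": 0, "y": 1, "w": 1, "h": 1, "content": "PubLayNet [33]"},
    {"x": 1, "y": 0, "w": 1, "h": 1, "content": "| Base Model!|"},
    {"x": 1, "y": 1, "w": 1, "h": 1, "content": "P/M"},
]
grid = cells_to_grid(sample)
```

The extra reconstruction step is one reason the cell format is less convenient for direct reading than the HTML in `text_as_html`.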