fix: disable table_as_cells output by default
- now requires env EXTRACT_TABLE_AS_CELLS to be true to output
  table_as_cells in Table elements' metadata
badGarnet committed May 23, 2024
1 parent 31a53c8 commit dcd7103
Showing 4 changed files with 18 additions and 10 deletions.
7 changes: 4 additions & 3 deletions CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.14.3-dev2
+## 0.14.3-dev3

### Enhancements

@@ -9,11 +9,12 @@

### Fixes

-**Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
+* **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
to avoid text being dynamically injected into the XML document.
* Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call.

* **Chromadb change from Add to Upsert using element_id to make idempotent**
+* **Disable `table_as_cells` output by default** to reduce overhead in partition; now `table_as_cells` is only produced when the env `EXTRACT_TABLE_AS_CELLS` is `true`


## 0.14.2

2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.14.3-dev2"  # pragma: no cover
+__version__ = "0.14.3-dev3"  # pragma: no cover
14 changes: 8 additions & 6 deletions unstructured/partition/pdf_image/ocr.py
@@ -253,7 +253,7 @@ def supplement_element_with_table_extraction(
     """Supplement the existing layout with table extraction. Any Table elements
     that are extracted will have a metadata fields "text_as_html" where
     the table's text content is rendered into a html string and "table_as_cells"
-    with the raw table cells output from table agent
+    with the raw table cells output from table agent if env_config.EXTRACT_TABLE_AS_CELLS is True
     """
     from unstructured_inference.models.tables import cells_to_html
 
@@ -279,13 +279,15 @@ def supplement_element_with_table_extraction(
             tatr_cells = tables_agent.predict(
                 cropped_image, ocr_tokens=table_tokens, result_format="cells"
             )
-            text_as_html = cells_to_html(tatr_cells)
-            simple_table_cells = [
-                SimpleTableCell.from_table_transformer_cell(cell).to_dict() for cell in tatr_cells
-            ]
 
+            text_as_html = cells_to_html(tatr_cells)
             element.text_as_html = text_as_html
-            element.table_as_cells = simple_table_cells
+
+            if env_config.EXTRACT_TABLE_AS_CELLS:
+                simple_table_cells = [
+                    SimpleTableCell.from_table_transformer_cell(cell).to_dict() for cell in tatr_cells
+                ]
+                element.table_as_cells = simple_table_cells
 
     return elements
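
For reviewers, the effect of this hunk can be sketched in isolation: the HTML rendering is always attached, while the per-cell output is attached only when the flag is on. This is a simplified illustration, not the project's code. `TableElementSketch`, `attach_table_outputs`, and the placeholder HTML rendering are hypothetical stand-ins for the real element class, `cells_to_html`, and `SimpleTableCell`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TableElementSketch:
    """Hypothetical stand-in for a Table element's relevant fields."""

    text_as_html: Optional[str] = None
    table_as_cells: Optional[List[Dict[str, Any]]] = None


def attach_table_outputs(
    element: TableElementSketch,
    cells: List[Dict[str, Any]],
    extract_cells: bool,
) -> TableElementSketch:
    # The HTML output is always produced (placeholder rendering, one cell per row).
    rows = "".join(f"<tr><td>{cell['text']}</td></tr>" for cell in cells)
    element.text_as_html = f"<table>{rows}</table>"

    # The per-cell output is opt-in, so the conversion cost is skipped by default.
    if extract_cells:
        element.table_as_cells = [dict(cell) for cell in cells]
    return element


cells = [{"text": "a"}, {"text": "b"}]
default_result = attach_table_outputs(TableElementSketch(), cells, extract_cells=False)
opt_in_result = attach_table_outputs(TableElementSketch(), cells, extract_cells=True)
```

With the flag off, `table_as_cells` stays `None`, so any downstream consumer that reads it needs to handle the missing value after this change.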

5 changes: 5 additions & 0 deletions unstructured/partition/utils/config.py
@@ -116,6 +116,11 @@ def EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD(self) -> int:
         """
         return self._get_int("EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD", 0)
 
+    @property
+    def EXTRACT_TABLE_AS_CELLS(self) -> bool:
+        """adds `table_as_cells` to a Table element's metadata when it is True"""
+        return self._get_bool("EXTRACT_TABLE_AS_CELLS", False)
+
     @property
     def OCR_LAYOUT_SUBREGION_THRESHOLD(self) -> float:
         """threshold to determine if an OCR region is a sub-region of a given block
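
The new property reads a boolean from the environment and defaults to `False`. A minimal standalone sketch of that pattern follows. `EnvFlags` is a hypothetical stand-in for the config class in this file, and the set of strings treated as true is an assumption; the project's `_get_bool` may parse values differently.

```python
import os


class EnvFlags:
    """Hypothetical stand-in for an env-backed config object."""

    def _get_bool(self, var: str, default: bool) -> bool:
        raw = os.environ.get(var)
        if raw is None or raw == "":
            # Unset or empty falls back to the default.
            return default
        # Assumption: these case-insensitive strings count as enabled.
        return raw.strip().lower() in ("true", "1", "t")

    @property
    def EXTRACT_TABLE_AS_CELLS(self) -> bool:
        # Off by default; must be enabled explicitly through the environment.
        return self._get_bool("EXTRACT_TABLE_AS_CELLS", False)


flags = EnvFlags()

os.environ.pop("EXTRACT_TABLE_AS_CELLS", None)
value_when_unset = flags.EXTRACT_TABLE_AS_CELLS

os.environ["EXTRACT_TABLE_AS_CELLS"] = "true"
value_when_enabled = flags.EXTRACT_TABLE_AS_CELLS

# Restore the environment so the sketch has no lasting side effects.
os.environ.pop("EXTRACT_TABLE_AS_CELLS", None)
```

Because the property is evaluated on each access, the variable can be set at any point before partitioning runs, for example in the shell that launches the process.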