From fec811d493dad5c547b95f1066a89a77d073c2a9 Mon Sep 17 00:00:00 2001
From: cragwolfe
Date: Tue, 26 Nov 2024 10:34:08 -0800
Subject: [PATCH 1/4] chore: small tweak to tables utility script to add border for another HTML table pattern

---
 scripts/user/u-tables-inspect.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/user/u-tables-inspect.sh b/scripts/user/u-tables-inspect.sh
index 7b2263a45d..990dbc123d 100755
--- a/scripts/user/u-tables-inspect.sh
+++ b/scripts/user/u-tables-inspect.sh
@@ -45,6 +45,7 @@ jq -c '.[] | select(.type == "Table") | .metadata.text_as_html' "$JSON_FILE" | w
   HTML_CONTENT=${HTML_CONTENT%\"}
   # add a border and padding to clearly see cell definition
   # shellcheck disable=SC2001
+  HTML_CONTENT=$(echo "$HTML_CONTENT" | sed 's///')
   # add newlines for readability in the html
   # shellcheck disable=SC2001

From 1f8a102dbe2314362a4cdc89e46d36221b2936da Mon Sep 17 00:00:00 2001
From: cragwolfe
Date: Tue, 26 Nov 2024 10:45:04 -0800
Subject: [PATCH 2/4] changelog tweaks

---
 CHANGELOG.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8bcfb375d2..d2fa4c5729 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,7 +1,7 @@
 ## 0.16.7
 
 ### Enhancements
-- **Add image_alt_mode to partition_html** Adds an `image_alt_mode` parameter to `partition_html()` to control how alt text is extracted from images in HTML documents. The parameter can be set to `to_text` to extract alt text as text from html tags
+- **Add image_alt_mode to partition_html** Adds an `image_alt_mode` parameter to `partition_html()` to control how alt text is extracted from images in HTML documents for `html_parser_version=v2` . The parameter can be set to `to_text` to extract alt text as text from `<img>` html tags
 
 ### Features
 
@@ -10,8 +10,8 @@
 ## 0.16.6
 
 ### Enhancements
-- **Every <table> tag is considered to be ontology.Table** Added special handling for tables in HTML partitioning. This change is made to improve the accuracy of table extraction from HTML documents.
-- **Every HTML has default ontology class assigned** When parsing HTML to ontology each defined HTML in the Ontology has assigned default ontology class. This way it is possible to assign ontology class instead of UncategorizedText when the HTML tag is predicted correctly without class assigned class
+- **Every `<table>` tag is considered to be ontology.Table** Added special handling for tables in HTML partitioning (`html_parser_version=v2`. This change is made to improve the accuracy of table extraction from HTML documents.
+- **Every HTML has default ontology class assigned** When parsing HTML with `html_parser_version=v2` to ontology each defined HTML in the Ontology has assigned default ontology class. This way it is possible to assign ontology class instead of UncategorizedText when the HTML tag is predicted correctly without class assigned class
 - **Use (number of actual table) weighted average for table metrics** In evaluating table metrics the mean aggregation now uses the actual number of tables in a document to weight the metric scores
 
 ### Features

From 9068451e1b6e7bcb9ddd9cf8dfb7cdf0f84459f4 Mon Sep 17 00:00:00 2001
From: cragwolfe
Date: Tue, 26 Nov 2024 10:46:42 -0800
Subject: [PATCH 3/4] dev version

---
 CHANGELOG.md                | 8 ++++++++
 unstructured/__version__.py | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index d2fa4c5729..1486db6d4d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,11 @@
+## 0.16.8-dev0
+
+### Enhancements
+
+### Features
+
+### Fixes
+
 ## 0.16.7
 
 ### Enhancements
diff --git a/unstructured/__version__.py b/unstructured/__version__.py
index 8685b152b7..b517a990ae 100644
--- a/unstructured/__version__.py
+++ b/unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.16.7" # pragma: no cover
+__version__ = "0.16.8-dev0" # pragma: no cover

From c55e7af8e6f5ed1cf70f6465d037eff3f41f828c Mon Sep 17 00:00:00 2001
From: cragwolfe
Date: Tue, 26 Nov 2024 10:50:22 -0800
Subject: [PATCH 4/4] shell check

---
 scripts/user/u-tables-inspect.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/user/u-tables-inspect.sh b/scripts/user/u-tables-inspect.sh
index 990dbc123d..d9f22de941 100755
--- a/scripts/user/u-tables-inspect.sh
+++ b/scripts/user/u-tables-inspect.sh
@@ -46,6 +46,7 @@ jq -c '.[] | select(.type == "Table") | .metadata.text_as_html' "$JSON_FILE" | w
   # add a border and padding to clearly see cell definition
   # shellcheck disable=SC2001
   HTML_CONTENT=$(echo "$HTML_CONTENT" | sed 's///')
   # add newlines for readability in the html
   # shellcheck disable=SC2001
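
Reviewer note on the 0.16.6 changelog entry "Use (number of actual table) weighted average for table metrics": the entry says the mean aggregation weights each document's metric score by the number of tables in that document. A minimal sketch of that kind of aggregation follows, assuming one score and one table count per document. The function name `table_weighted_mean` and the sample values are hypothetical and are not taken from the library's evaluation code.

```python
from typing import Sequence


def table_weighted_mean(scores: Sequence[float], table_counts: Sequence[int]) -> float:
    """Mean of per-document scores, weighted by each document's table count.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if len(scores) != len(table_counts):
        raise ValueError("scores and table_counts must have the same length")
    total_tables = sum(table_counts)
    if total_tables == 0:
        raise ValueError("at least one document must contain a table")
    # Each document contributes in proportion to how many tables it holds.
    return sum(s * n for s, n in zip(scores, table_counts)) / total_tables


# Made-up sample data: document A has 1 table, document B has 3 tables.
doc_scores = [0.5, 1.0]
doc_table_counts = [1, 3]

weighted = table_weighted_mean(doc_scores, doc_table_counts)
unweighted = sum(doc_scores) / len(doc_scores)
print(f"weighted={weighted} unweighted={unweighted}")
```

With these sample values the document holding three tables pulls the weighted mean above the plain per-document mean, which is the effect the changelog entry describes.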