Merge branch 'main' into fix/invalid-evaluation-doctype-deduction
micmarty-deepsense authored May 29, 2024
2 parents 614e4f5 + f445724 commit d5587d6
Showing 8 changed files with 31 additions and 15 deletions.
14 changes: 12 additions & 2 deletions CHANGELOG.md
@@ -1,4 +1,14 @@
## 0.14.3-dev6
## 0.14.4-dev0

### Enhancements

### Features

### Fixes

* **Fix document type deduction during evaluation** There was a bug that caused the document type for a file with more than 2 extensions (or dot symbols) to be inferred incorrectly.
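
The kind of bug this entry describes can be illustrated with a minimal sketch. It is not the library's implementation: the helper names `deduce_doctype` and `naive_doctype`, and the assumption that evaluation outputs are named `<original filename>.json`, are hypothetical and chosen only for illustration.

```python
from pathlib import PurePath


def naive_doctype(filename: str) -> str:
    # Hypothetical buggy approach: assumes the name contains exactly two dots
    return filename.split(".")[1]


def deduce_doctype(filename: str, output_suffix: str = ".json") -> str:
    # Hypothetical helper: strip the assumed output suffix, then read only
    # the last remaining extension, however many dots precede it
    name = PurePath(filename).name
    if name.endswith(output_suffix):
        name = name[: -len(output_suffix)]
    return PurePath(name).suffix.lstrip(".").lower()


print(naive_doctype("annual.report.v2.pdf.json"))   # report
print(deduce_doctype("annual.report.v2.pdf.json"))  # pdf
```

Reading the extension from the right-hand side keeps the result stable when the file stem itself contains dots.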

## 0.14.3

### Enhancements

@@ -10,6 +20,7 @@

### Fixes

* **Fix `partition_pdf()` to keep spaces in the text**. The control character `\t` is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
* **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
to avoid text being dynamically injected into the XML document.
* **Add backward compatibility for the deprecated pdf_infer_table_structure parameter**.
@@ -19,7 +30,6 @@
* **Disable `table_as_cells` output by default** to reduce overhead in partition; now `table_as_cells` is only produced when the environment variable `EXTRACT_TABLE_AS_CELLS` is `true`
* **Reduce excessive logging** Change per-page OCR info-level logging into detail-level trace logging
* **Replace try block in `document_to_element_list` for handling HTMLDocument** Use `getattr(element, "type", "")` to get the `type` attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute errors from being silenced by the try block
* **Fix document type deduction during evaluation** There was a bug that caused the document type for a file with more than 2 extensions (or dot symbols) to be inferred incorrectly.
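
The `getattr(element, "type", "")` pattern named in the list above can be sketched in isolation. The classes `HtmlElement` and `PlainElement` and the function `element_type` are hypothetical stand-ins, not the library's own types; only the `getattr` call with a default comes from the changelog entry.

```python
class HtmlElement:
    # Hypothetical stand-in for an element that carries a `type` attribute
    type = "text/html"


class PlainElement:
    # Hypothetical stand-in for an element with no `type` attribute
    pass


def element_type(element: object) -> str:
    # Return the element's `type` when present, else an empty string
    return getattr(element, "type", "")


print(element_type(HtmlElement()))         # text/html
print(repr(element_type(PlainElement())))  # ''
```

Because the default covers only this one attribute lookup, an `AttributeError` raised elsewhere in the surrounding code still surfaces instead of being caught by a broad `try` block.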

## 0.14.2

@@ -347,7 +347,7 @@ def test_annotate_layout_elements_file_not_found_error():

@pytest.mark.parametrize(
("text", "expected"),
[("c\to\x0cn\ftrol\ncharacter\rs\b", "control characters"), ("\"'\\", "\"'\\")],
[("test\tco\x0cn\ftrol\ncharacter\rs\b", "test control characters"), ("\"'\\", "\"'\\")],
)
def test_remove_control_characters(text, expected):
assert pdf_image_utils.remove_control_characters(text) == expected
7 changes: 7 additions & 0 deletions test_unstructured_ingest/dest/mongodb.sh
@@ -33,6 +33,13 @@ function cleanup() {

trap cleanup EXIT

# NOTE(robinson) - per pymongo docs, pymongo ships with its own version of the bson library,
# which is incompatible with the bson installed from pypi. bson is installed as part of the
# astra dependencies.
# ref: https://pymongo.readthedocs.io/en/stable/installation.html
pip uninstall -y bson pymongo
make install-ingest-mongodb

python "$SCRIPT_DIR"/python/test-ingest-mongodb.py \
--uri "$MONGODB_URI" \
--database "$MONGODB_DATABASE_NAME" \
7 changes: 6 additions & 1 deletion test_unstructured_ingest/src/mongodb.sh
@@ -20,7 +20,12 @@ if [ -z "$MONGODB_URI" ] && [ -z "$MONGODB_DATABASE_NAME" ]; then
exit 8
fi

# trap cleanup EXIT
# NOTE(robinson) - per pymongo docs, pymongo ships with its own version of the bson library,
# which is incompatible with the bson installed from pypi. bson is installed as part of the
# astra dependencies.
# ref: https://pymongo.readthedocs.io/en/stable/installation.html
pip uninstall -y bson pymongo
make install-ingest-mongodb

PYTHONPATH=. ./unstructured/ingest/main.py \
mongodb \
5 changes: 1 addition & 4 deletions test_unstructured_ingest/test-ingest-dest.sh
@@ -34,10 +34,7 @@ all_tests=(
'sqlite.sh'
'vectara.sh'
'weaviate.sh'
# NOTE(robinson) - mongo conflicts with astra because it ships with its
# own version of bson, and installing bson from pip causes mongo to fail
# ref: https://pymongo.readthedocs.io/en/stable/installation.html
# 'mongodb.sh'
'mongodb.sh'
)

full_python_matrix_tests=(
7 changes: 2 additions & 5 deletions test_unstructured_ingest/test-ingest-src.sh
@@ -42,7 +42,7 @@ all_tests=(
'confluence-diff.sh'
'confluence-large.sh'
'airtable-diff.sh'
# NOTE(ryan): This test is disabled because it is triggering too many requests to the API
# # NOTE(ryan): This test is disabled because it is triggering too many requests to the API
# 'airtable-large.sh'
'local-single-file.sh'
'local-single-file-basic-chunking.sh'
@@ -62,10 +62,7 @@ all_tests=(
'local-embed-voyageai.sh'
'sftp.sh'
'opensearch.sh'
# NOTE(robinson) - mongo conflicts with astra because it ships with its
# own version of bson, and installing bson from pip causes mongo to fail
# ref: https://pymongo.readthedocs.io/en/stable/installation.html
# 'mongodb.sh'
'mongodb.sh'
)

full_python_matrix_tests=(
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
__version__ = "0.14.3-dev6" # pragma: no cover
__version__ = "0.14.4-dev0" # pragma: no cover
2 changes: 1 addition & 1 deletion unstructured/partition/pdf_image/pdf_image_utils.py
@@ -427,7 +427,7 @@ def remove_control_characters(text: str) -> str:
"""Removes control characters from text."""

# Replace newline character with a space
text = text.replace("\n", " ")
text = text.replace("\t", " ").replace("\n", " ")
# Remove other control characters
out_text = "".join(c for c in text if unicodedata.category(c)[0] != "C")
return out_text
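
The effect of this one-line change can be checked in isolation. The sketch below is a standalone copy of the patched function body from this hunk, with only the `unicodedata` import added, run against the input used in the updated test.

```python
import unicodedata


def remove_control_characters(text: str) -> str:
    """Standalone copy of the patched function from this diff."""
    # Replace tab and newline characters with a space
    text = text.replace("\t", " ").replace("\n", " ")
    # Remove other control characters (Unicode categories starting with "C")
    return "".join(c for c in text if unicodedata.category(c)[0] != "C")


print(remove_control_characters("test\tco\x0cn\ftrol\ncharacter\rs\b"))
# prints: test control characters
```

Before the change, the tab fell into the "other control characters" branch and was dropped, so the two words on either side of it would have been joined.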
