Releases: Unstructured-IO/unstructured
0.16.3
0.16.2
Enhancements
Features
- Whitespace-invariant CCT distance metric. CCT Levenshtein distance for strings is by default computed with standardized whitespaces.
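As a rough illustration of what a whitespace-invariant edit distance means, the sketch below standardizes whitespace before computing a plain Levenshtein distance. It is not the library's CCT implementation; the function names `normalize_whitespace`, `levenshtein`, and `whitespace_invariant_distance` are hypothetical and chosen only for this example.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def whitespace_invariant_distance(a: str, b: str) -> int:
    """Levenshtein distance after standardizing whitespace (illustrative only)."""
    return levenshtein(normalize_whitespace(a), normalize_whitespace(b))
```

With this sketch, two strings that differ only in spacing or trailing newlines compare as identical, while the raw distance would count each extra whitespace character.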
Fixes
- Fixed retry config settings for the partition_via_api function. If the SDK's default retry config is not set, the retry config getter function no longer fails.
0.16.1
Enhancements
- Bump unstructured-inference to 0.7.39 and upgrade other dependencies.
- Round coordinates. Round coordinates when computing bounding box overlaps in pdfminer_processing.py to nearest machine precision. This can help reduce nondeterministic behavior from machine precision that affects which bounding boxes to combine.
- Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.
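To see why rounding coordinates can stabilize overlap decisions, consider the minimal sketch below. It assumes boxes are `(x1, y1, x2, y2)` tuples and rounds to a fixed number of decimals; both the `intersection_area` helper and the `ndigits=6` default are assumptions made for illustration, not what pdfminer_processing.py actually does.

```python
def intersection_area(box_a, box_b, ndigits=6):
    """Overlap area of two (x1, y1, x2, y2) boxes.

    Coordinates are rounded first so floating-point noise cannot turn
    two merely touching boxes into overlapping ones (or the reverse).
    """
    ax1, ay1, ax2, ay2 = (round(v, ndigits) for v in box_a)
    bx1, by1, bx2, by2 = (round(v, ndigits) for v in box_b)
    width = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    height = max(0.0, min(ay2, by2) - max(ay1, by1))
    return width * height


# 0.1 + 0.2 is slightly larger than 0.3 in binary floating point, so without
# rounding these two boxes would appear to overlap by a sliver.
left = (0.0, 0.0, 0.1 + 0.2, 1.0)
right = (0.3, 0.0, 1.0, 1.0)
area = intersection_area(left, right)
```

The trade-off is that any real overlap smaller than the rounding step is also discarded, so the number of digits has to sit well below the coordinate resolution that matters.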
Features
- Parsing HTML to Unstructured Elements and back
Fixes
- Remove unsupported chipper model
- Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
- Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
- Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
- Minify text_as_html from XLSX. Previously .metadata.text_as_html for XLSX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
- Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
- Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.
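The minification described in the text_as_html items above can be pictured with a small regex-based sketch. This is an assumption about the general idea (drop attributes and inter-tag whitespace, keep text), not the library's actual code; `minify_table_html` is a hypothetical helper, and regexes are only adequate here because the input is simple generated table markup.

```python
import re


def minify_table_html(html: str) -> str:
    """Shrink generated table HTML while keeping every cell's text."""
    html = re.sub(r"<(\w+)\s[^>]*>", r"<\1>", html)  # strip attributes from opening tags
    html = re.sub(r">\s+<", "><", html)             # drop whitespace between adjacent tags
    html = re.sub(r">\s+", ">", html)               # trim whitespace at the start of a cell
    html = re.sub(r"\s+<", "<", html)               # trim whitespace at the end of a cell
    return html.strip()


bloated = (
    '<table border="1" class="dataframe">\n'
    "  <tbody>\n"
    "    <tr>\n"
    '      <td style="text-align: right;"> Name </td>\n'
    "      <td>Age</td>\n"
    "    </tr>\n"
    "  </tbody>\n"
    "</table>"
)
compact = minify_table_html(bloated)
```

Fewer characters per table means a chunker working with a character budget spends that budget on cell text rather than markup, which is the "semantic density" concern the notes mention.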
0.16.0
Enhancements
- Remove ingest implementation. The deprecated ingest functionality has been removed, as it is now maintained in the separate unstructured-ingest repository.
  - Replace extras in requirements/ingest directory with a new ingest.txt extra for installing the unstructured-ingest library.
  - Remove the unstructured.ingest submodule.
  - Delete all shell scripts previously used for destination ingest tests.
Features
Fixes
- Add language parameter to OCRAgentGoogleVision. Introduces an optional language parameter in the OCRAgentGoogleVision constructor to serve as a language hint for document_text_detection. This ensures compatibility with the OCRAgent's get_instance method and resolves errors when parsing PDFs with Google Cloud Vision as the OCR agent.
0.15.14
Enhancements
Features
- Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like .filename, .filetype and .languages. This will be installed in a closely following PR to replace the four currently being used for this purpose.
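A post-partitioning decorator of this kind could look roughly like the sketch below. Everything here is hypothetical: the `Element` class, the `apply_common_metadata` name, its signature, and the `["eng"]` default are illustration-only assumptions and do not reflect the decorator actually added to the library.

```python
import functools
import os
from dataclasses import dataclass, field


@dataclass
class Element:
    """Stand-in for a partitioned element (hypothetical, not the library class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def apply_common_metadata(file_type: str):
    """Hypothetical decorator: stamp file-level metadata on every returned element."""
    def decorator(partition_func):
        @functools.wraps(partition_func)
        def wrapper(filename: str, languages=None, **kwargs):
            elements = partition_func(filename, **kwargs)
            for element in elements:
                element.metadata["filename"] = os.path.basename(filename)
                element.metadata["filetype"] = file_type
                # Assumed default language for this sketch only.
                element.metadata["languages"] = list(languages or ["eng"])
            return elements
        return wrapper
    return decorator


@apply_common_metadata("text/plain")
def partition_lines(filename: str, text: str = ""):
    """Toy partitioner: one element per non-empty line of the given text."""
    return [Element(line) for line in text.splitlines() if line.strip()]


elements = partition_lines("/tmp/notes.txt", text="alpha\n\nbeta", languages=["deu"])
```

Handling shared metadata in one decorator, applied after partitioning, keeps each partitioner focused on producing elements and avoids repeating the same bookkeeping per file type.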
Fixes
- Update Python SDK usage in partition_via_api. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
- Remove "unused" date_from_file_object parameter. As part of simplifying the partitioning parameter set, remove the date_from_file_object parameter. A file object does not have a last-modified date attribute, so it can never give a useful value. When a file object is used as the document source (such as in Unstructured API), the last-modified date must come from the metadata_last_modified argument.
- Fix occasional KeyError when mapping parent ids to hash ids. Occasionally the input elements into assign_and_map_hash_ids can contain duplicated element instances, which leads to an error when mapping parent ids.
- Allow empty text files. Fixes an issue where text files with only whitespace would fail to be partitioned.
- Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
- Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
- Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
- Remove obsolete min_partition/max_partition args from TXT and EML. The legacy min_partition and max_partition parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from partition_text() and partition_email().
- Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new @apply_metadata() decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
- Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
- Quick-fix CI error in auto test-filetype. Better fix to follow shortly.
0.15.13
Enhancements
- Improve pdfminer image cleanup process. Optimized the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances overall processing speed of PDF documents.
Features
Fixes
- Fixes high memory overhead for intersection area computation. Use numpy.float32 for coordinates and remove intermediate variables to reduce memory usage when computing intersection areas.
- Fixes the arm64 image build. arm64 builds are now fixed and will be available again starting with the 0.15.13 release.
0.15.12
Enhancements
- Improve pdfminer element processing. Implemented splitting of pdfminer elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated pdfminer text.
0.15.10
Enhancements
- Enhance pdfminer element cleanup. Expand removal of pdfminer elements to include those inside all non-pdfminer elements, not just tables.
- Modified analysis drawing tools to dump to files and draw from dumps. If the parameter analysis of the partition_pdf function is set to True, the layout for Object Detection, Pdfminer Extraction, OCR and final layouts will be dumped as JSON files. The drawers now accept dict (dump) objects instead of internal class instances.
- Vectorize pdfminer elements deduplication computation. Use numpy operations to compute IOU and sub-region membership instead of a simple loop. This improves the speed of deduplicating elements for pages with a lot of elements.
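The vectorization idea can be sketched with numpy broadcasting: instead of looping over every pair of boxes, compute all pairwise intersections in one array expression. The `pairwise_iou` function and the `(x1, y1, x2, y2)` box layout below are assumptions for illustration and are not taken from the library's code.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """IoU for every pair of (x1, y1, x2, y2) boxes; result shape is (len(a), len(b))."""
    a = np.asarray(boxes_a, dtype=float)[:, None, :]  # shape (n, 1, 4)
    b = np.asarray(boxes_b, dtype=float)[None, :, :]  # shape (1, m, 4)

    # Broadcasting compares each box in `a` against each box in `b` at once.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    intersection = inter_w * inter_h

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - intersection

    # Leave IoU at 0 where the union is empty instead of dividing by zero.
    return np.divide(intersection, union, out=np.zeros_like(intersection), where=union > 0)


iou = pairwise_iou([[0, 0, 2, 2]], [[1, 1, 3, 3], [0, 0, 2, 2], [5, 5, 6, 6]])
```

The cost is an intermediate array of size n × m, so on pages with very many elements the speedup comes at the price of more memory.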
Features
Fixes
0.15.9
Enhancements
Features
- Add support for encoding parameter in partition_csv
0.15.8
Enhancements
- Bump unstructured.paddleocr to 2.8.1.0.
Features
- Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.
Fixes
- Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
- Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
- Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
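The extension-based fallback for OLE files can be pictured as follows. OLE compound files (DOC, XLS, PPT, MSG) share the same leading signature bytes, so content alone may not distinguish them. The `detect_ole_file_type` function, its `sniffed_type` parameter, and the extension table are hypothetical names for this sketch, not the library's detection code.

```python
import os

# Signature shared by all OLE compound files (DOC, XLS, PPT, MSG, ...).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical lookup used only when content sniffing is inconclusive.
OLE_TYPES_BY_EXTENSION = {".doc": "DOC", ".xls": "XLS", ".ppt": "PPT", ".msg": "MSG"}


def detect_ole_file_type(filename, head, sniffed_type=None):
    """Return a file type for an OLE file, falling back to the filename extension.

    `head` is the first bytes of the file; `sniffed_type` is whatever a
    content-based detector concluded, or None if it could not decide.
    """
    if not head.startswith(OLE_MAGIC):
        return None  # not an OLE container at all
    if sniffed_type is not None:
        return sniffed_type  # content-based detection wins when it succeeds
    extension = os.path.splitext(filename)[1].lower()
    return OLE_TYPES_BY_EXTENSION.get(extension)


ole_head = OLE_MAGIC + b"\x00" * 8
```

In this sketch, returning `None` for an unknown extension lets the caller decide what to do next rather than guessing a type such as MSG.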