Releases: Unstructured-IO/unstructured

0.14.8

24 Jun 13:54
ab88e20

Enhancements

  • Move arm64 image to wolfi-base. The arm64 image now runs on wolfi-base. The arm64 build for wolfi-base does not yet include libreoffice, so arm64 does not currently support processing .doc, .ppt, or .xls files. If you need to process those files on arm64, use the legacy rockylinux image.

Features

Fixes

  • Bump unstructured-inference==0.7.36. Fixes a ValueError when converting cells to HTML.

  • partition() now forwards strategy arg to partition_docx(), partition_ppt(), and partition_pptx(). A strategy argument passed to partition() (or the default value "auto" assigned by partition()) is now forwarded to partition_docx(), partition_ppt(), and partition_pptx() when those filetypes are detected.

  • Fix missing sensitive field markers for embedders

0.14.7

20 Jun 18:11
80abbcd

Enhancements

  • Pull from wolfi-base image. The amd64 image now pulls from the unstructured wolfi-base image to avoid duplication of dependency setup steps.
  • Fix windows temp file. Make the creation of a temp file in unstructured/partition/pdf_image/ocr.py windows compatible.

Features

  • Expose conversion functions for tables. Adds public functions to convert tables from HTML to the Deckerd format and back.

Fixes

  • Fix an error publishing docker images. Update user in docker-smoke-test to reflect changes made by the amd64 image pull from the "unstructured" "wolfi-base" image.
  • Fix an IndexError when partitioning a PDF with values for both extract_image_block_types and starting_page_number.

0.14.6

14 Jun 18:56
9552fbb

Enhancements

  • Bump unstructured-inference==0.7.35. Fixes syntax for generated HTML tables.

Features

  • tqdm ingest support. Adds an optional flag to the ingest flow to print a progress bar for each step in the process.

Fixes

  • Remove deprecated overwrite_schema kwarg from Delta Table connector. The overwrite_schema kwarg is deprecated in deltalake>=0.18.0; schema_mode= should be used instead. schema_mode="overwrite" is equivalent to overwrite_schema=True and schema_mode="merge" is equivalent to overwrite_schema=False. schema_mode defaults to None. You can also now specify engine, which defaults to "pyarrow". You need to specify engine="rust" to use schema_mode.
  • Fix passing parameters to python-client. List arguments are no longer converted to strings when passing arguments to python-client in the Ingest workflow and partition_via_api.
  • Table metric bug fix. get_element_level_alignment() now finds all matched indices in the predicted table data instead of returning only the first match when there are multiple matches for the same ground-truth (gt) string.
  • fsspec connector path/permissions bug. V2 fsspec connectors were failing when the defined relative filepaths had a leading slash. The fix strips that slash to guarantee the relative path never has it.
  • Dropbox connector internal file path bugs. The Dropbox source connector raised exceptions when indexing files due to two issues: a path formatting idiosyncrasy of the Dropbox library, and a divergence in the definition of the Dropbox library's fs.info method, which expects a 'url' parameter rather than 'path'.
  • Update table metric evaluation to handle corrected HTML syntax for tables. This change is connected to the update in unstructured-inference; it fixes transforming an HTML table to the deckerd and internal cells formats.
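
The Delta Table kwarg mapping in the first fix above can be sketched as a small translation helper. This is only an illustration under stated assumptions: the function name schema_mode_kwargs is hypothetical and is not part of unstructured or deltalake. The equivalences, the None default, the "pyarrow" default engine, and the need for engine="rust" are taken from the note itself.

```python
from typing import Any, Dict, Optional


def schema_mode_kwargs(overwrite_schema: Optional[bool] = None) -> Dict[str, Any]:
    """Translate the deprecated overwrite_schema flag into the newer kwargs.

    Hypothetical helper for illustration only; not the connector's actual code.
    """
    if overwrite_schema is None:
        # Nothing requested: keep the defaults named in the release note.
        return {"schema_mode": None, "engine": "pyarrow"}
    # True maps to "overwrite", False maps to "merge" (per the release note).
    mode = "overwrite" if overwrite_schema else "merge"
    # The note states that schema_mode requires the rust engine.
    return {"schema_mode": mode, "engine": "rust"}
```

Calling schema_mode_kwargs(True) yields the kwargs a caller would use in place of overwrite_schema=True.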

0.14.5

10 Jun 13:50
b4876f1

Enhancements

  • Filtering for tar extraction. Adds tar filtering to the compression module for connectors to avoid decompressing malicious content in .tar.gz files. Extraction filters were added to the Python tarfile lib in Python 3.12, so the change only applies when using Python 3.12 and above.
  • Use python-oxmsg for partition_msg(). Outlook MSG emails are now partitioned using the python-oxmsg package which resolves some shortcomings of the prior MSG parser.
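
The tar filtering enhancement above relies on the extraction filters in the standard-library tarfile module. The following is a minimal sketch of that technique, not the code in the compression module: the function names are hypothetical, and the choice of the "data" filter is an assumption, since the note does not say which filter is used.

```python
import io
import sys
import tarfile
import tempfile
from pathlib import Path


def build_demo_archive() -> bytes:
    """Create a tiny in-memory .tar.gz holding one regular file."""
    payload = b"hello"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="docs/readme.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def extract_filtered(data: bytes, destination: str) -> None:
    """Extract a .tar.gz, applying the stdlib "data" filter on Python 3.12+."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        if sys.version_info >= (3, 12):
            # The "data" filter refuses members that would land outside
            # the destination directory, among other checks.
            archive.extractall(destination, filter="data")
        else:
            # Older interpreters: fall back to unfiltered extraction.
            archive.extractall(destination)


def demo() -> bytes:
    """Round-trip the demo archive through a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        extract_filtered(build_demo_archive(), tmp)
        return (Path(tmp) / "docs" / "readme.txt").read_bytes()
```

The version check mirrors the note's statement that the change only applies on Python 3.12 and above.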

Features

Fixes

  • 8-bit string Outlook MSG files are parsed. partition_msg() is now able to parse non-unicode Outlook MSG emails.
  • Attachments to Outlook MSG files are extracted intact. partition_msg() is now able to extract attachments without corruption.

0.14.4

03 Jun 21:16
1dede50

Enhancements

  • Move logger error to debug level when PDFminer fails to extract text, which includes the error message for Invalid dictionary construct.
  • Add support for Pinecone serverless. Adds Pinecone serverless to the connector tests. Pinecone
    serverless works with versions >=0.14.2, but hadn't been tested until now.

Features

  • Allow configuration of the Google Vision API endpoint. Add an environment variable to select the Google Vision API in the US or the EU.

Fixes

  • Address the issue of unrecognized tables in UnstructuredTableTransformerModel. When a table is not recognized, the element.metadata.text_as_html attribute is set to an empty string.
  • Remove root handlers in ingest logger. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.
  • Fix V2 S3 Destination Connector authentication. Fixes bugs with the S3 Destination Connector where the connection config was neither registered nor properly deserialized.
  • Clarified dependence on particular version of python-docx. Pinned the python-docx version to ensure a particular method unstructured uses is included.
  • Ingest preserves original file extension. Ingest V2 introduced a change that dropped the original extension for upgraded connectors. This reverts that change.
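
The root-handler fix above can be pictured with the standard logging module. This is a sketch of the general technique only: the function name remove_root_handlers is hypothetical, and the notes do not show how the ingest logger actually does it.

```python
import logging


def remove_root_handlers() -> None:
    """Detach every handler currently attached to the root logger."""
    root = logging.getLogger()
    # Iterate over a copy because removeHandler mutates root.handlers.
    for handler in list(root.handlers):
        root.removeHandler(handler)
```

Once the root logger has no handlers of its own, records that propagate up from other loggers are no longer echoed by handlers the environment attached there.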

0.14.3

29 May 06:10
f445724

Enhancements

  • Move category field from Text class to Element class.
  • partition_docx() now supports pluggable picture sub-partitioners. A sub-partitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization to the image.
  • Add VoyageAI embedder. Adds VoyageAI embeddings to support embedding via Voyage AI.

Features

Fixes

  • Fix partition_pdf() to keep spaces in the text. The control character \t is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
  • Turn off XML resolve entities. Sets resolve_entities=False for XML parsing with lxml
    to avoid text being dynamically injected into the XML document.
  • Add backward compatibility for the deprecated pdf_infer_table_structure parameter.
  • Add the missing form_extraction_skip_tables argument to the partition_pdf_or_image call.
  • Chromadb: change from Add to Upsert, using element_id, to make writes idempotent.
  • Disable table_as_cells output by default to reduce overhead in partition; table_as_cells is now only produced when the env EXTRACT_TABLE_AS_CELLS is true.
  • Reduce excessive logging. Change per-page OCR info-level logging into detail-level trace logging.
  • Replace try block in document_to_element_list for handling HTMLDocument. Use getattr(element, "type", "") to get the type attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other kinds of attribute errors from being silenced by the try block.
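
The last fix above swaps a try block for getattr with a default. A small sketch of why that matters follows. The three element classes are hypothetical stand-ins, and neither function is the actual body of document_to_element_list.

```python
class HtmlElement:
    """Hypothetical stand-in for an element that carries a type attribute."""

    type = "Title"


class PlainElement:
    """Hypothetical stand-in for an element without a type attribute."""


class MisconfiguredElement:
    """Hypothetical stand-in with a bug: type is None instead of a string."""

    type = None


def type_with_try(element: object) -> str:
    try:
        # Any AttributeError raised in this block is swallowed,
        # including the one caused by calling .lower() on None.
        return element.type.lower()
    except AttributeError:
        return ""


def type_with_getattr(element: object) -> str:
    # Only the missing-attribute case falls back to ""; a None value
    # still raises, so the underlying bug stays visible.
    return getattr(element, "type", "").lower()
```

Both functions return "" for PlainElement, but only the getattr version raises on MisconfiguredElement rather than hiding the problem.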

0.14.2

22 May 23:27
18428f2

Enhancements

  • Bump unstructured-inference==0.7.33.

Features

  • Add attribution to the pinecone connector.

0.14.1

21 May 22:52
30e5a0c

Enhancements

  • Refactor code related to embedded text extraction. The embedded text extraction code is moved from unstructured-inference to unstructured.

Features

  • Large improvements to the ingest process:
    • Support for multiprocessing and async, with limits for both.
    • Streamlined the process of mapping CLI invocations to the underlying code
    • More granular steps introduced to give better control over process (i.e. dedicated step to uncompress files already in the local filesystem, new optional staging step before upload)
    • Use the python client when calling the unstructured api for partitioning or chunking
    • Saving the final content is now a dedicated destination connector (local) set as the default if none are provided. Avoids adding new files locally if uploading elsewhere.
    • Leverage last modified date when deciding if new files should be downloaded and reprocessed.
    • Add attribution to the pinecone connector
  • Add support for Python 3.12. unstructured now works with Python 3.12!
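
The "last modified date" item in the ingest list above amounts to a freshness check before downloading. The following is a rough sketch under assumptions: the function name needs_download and its parameters are hypothetical, and the notes do not describe how the ingest code actually compares dates.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def needs_download(local_path: Path, remote_modified: datetime) -> bool:
    """Return True when the local copy is missing or older than the remote file.

    remote_modified must be timezone-aware (hypothetical contract).
    """
    if not local_path.exists():
        return True
    local_modified = datetime.fromtimestamp(
        local_path.stat().st_mtime, tz=timezone.utc
    )
    return remote_modified > local_modified


def demo() -> tuple:
    """Exercise the check against a temp file stamped 2024-01-01 UTC."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "report.pdf"
        path.write_bytes(b"cached copy")
        stamp = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
        os.utime(path, (stamp, stamp))
        newer = datetime(2024, 6, 1, tzinfo=timezone.utc)
        older = datetime(2023, 1, 1, tzinfo=timezone.utc)
        missing = Path(tmp) / "absent.pdf"
        return (
            needs_download(path, newer),
            needs_download(path, older),
            needs_download(missing, older),
        )
```

A file is fetched again only when the remote timestamp is strictly newer than the local one, or when no local copy exists.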

0.14.0

17 May 22:15
76831f1

BREAKING CHANGES

  • Turn table extraction for PDFs and images off by default. Reverting the default behavior for table extraction to "off" for PDFs and images. A number of users didn't realize we made the change and were impacted by slower processing times due to the extra model call for table extraction.

Enhancements

  • Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
  • Faster evaluation. Support for concurrent processing of documents during evaluation.
  • Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
  • Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.

Features

  • Add form extraction basics (document elements and placeholder code in partition). This lays the groundwork for the future. Form extraction models are not currently available in the library. An attempt to use this functionality raises a NotImplementedError.

Fixes

  • Add missing starting_page_num param to partition_image
  • Make the filename and file params for partition_image and partition_pdf match the other partitioners
  • Fix include_slide_notes and include_page_breaks params in partition_ppt
  • Re-apply: skip accuracy calculation feature. It was overwritten by mistake.
  • Fix type hint for paragraph_grouper param. paragraph_grouper can be set to False, but the type hint did not reflect this previously.
  • Remove links param from partition_pdf. links is extracted during partitioning and is not needed as a parameter in partition_pdf.
  • Improve CSV delimiter detection. partition_csv() would raise on CSV files with very long lines.
  • Fix disk-space leak in partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().
  • Fix possible SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
  • Fix disk-space leak in partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().
  • AstraDB: option to prevent indexing metadata
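
The regex fix above comes down to writing patterns as raw strings so that backslash sequences such as \d are not treated as (invalid) string escapes. The sketch below shows the difference. The helper count_escape_warnings is hypothetical and exists only to make the warning observable; the warning category depends on the Python version.

```python
import re
import warnings


def count_escape_warnings(source: str) -> int:
    """Compile a snippet of Python source and count the warnings it triggers."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<demo>", "exec")
    return len(caught)


# Raw string: the backslash reaches the regex engine untouched.
DIGITS = re.compile(r"\d+")

# Source text of two one-line programs, one raw and one not.
RAW_SOURCE = 'pattern = r"\\d+"'
PLAIN_SOURCE = 'pattern = "\\d+"'
```

Compiling PLAIN_SOURCE triggers an invalid-escape warning while RAW_SOURCE compiles cleanly, and DIGITS matches runs of digits as intended.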

0.13.7

08 May 17:28
b64a484

Enhancements

  • Remove page_number metadata fields for HTML partition until we have a better strategy to decide page counting.
  • Extract OCRAgent.get_agent(). Generalize access to the configured OCRAgent instance beyond its use for PDFs.
  • Add calculation of table related metrics which take into account colspans and rowspans

Features

  • Add the ability to get the ratio of cid characters in embedded text extracted by pdfminer.

Fixes

  • partition_docx() handles short table rows. The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate .text and .metadata.text_as_html for these tables.
  • Remedy macOS test failure not triggered by CI. Generalize temp-file detection beyond hard-coded Linux-specific prefix.
  • Remove unnecessary warning log for using default layout model.
  • Add chunking to partition_tsv. Even though partition_tsv() produces a single Table element, chunking is made available because the Table element is often larger than the desired chunk size and must be divided into smaller chunks.