Releases: Unstructured-IO/unstructured

0.7.0

31 May 20:13
d3600dd

Enhancements

  • Installing detectron2 from source is no longer required when using the local-inference extra.
  • Updates .pptx parsing to include text in tables.

Features

Fixes

  • Fixes an issue in _add_element_metadata that caused all elements to have page_number=1
    in the element metadata.
  • Adds .log as a file extension for TXT files.
  • Adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
  • Allows a passed encoding to be used in replace_mime_encodings.
  • Fixes page metadata for partition_html when include_metadata=False
  • partition_via_api now raises a ValueError if file_filename is not specified when it is
    called with a file-like object.
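
The encoding fallback described in the .eml fix above can be sketched roughly as follows. This is a minimal standard-library illustration, not the library's implementation: the function name decode_with_fallback and the list of candidate encodings are assumptions made for the example.

```python
from typing import Optional, Tuple

# Hypothetical candidate list; the encodings the library actually tries may differ.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used).

    If the caller names an encoding, use only that one and let any
    UnicodeDecodeError propagate. Otherwise try each candidate in order.
    """
    if encoding is not None:
        return data.decode(encoding), encoding

    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    # Not reachable with the list above (latin-1 accepts any byte), but kept
    # so the function stays correct if the candidate list is changed.
    raise ValueError("none of the candidate encodings could decode the data")
```

The fallback only runs when no encoding is passed, which mirrors the condition stated in the fix: an explicit user choice is never silently overridden.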

0.6.11

30 May 13:47
66058e7

Enhancements

  • Supports epub tests now that pandoc is updated in the base image.

Features

Fixes

0.6.10

26 May 08:57
c5d9469

Enhancements

  • Adds XLS support to auto partition.

Features

Fixes

0.6.9

24 May 22:31
c82bad1

Enhancements

  • The fast strategy for PDFs now keeps element bounding box data.
  • Refactors setup.py.

Features

Fixes

  • Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds additional MIME types for CSV

0.6.8

19 May 19:58
21c821d

Enhancements

Features

  • Add partition_csv for CSV files.

Fixes

0.6.7

19 May 17:31
046af73

Enhancements

  • Deprecate --s3-url in favor of --remote-url in CLI
  • Refactor out non-connector-specific config variables
  • Add file_directory to metadata
  • Add page_name to metadata. Currently used for the sheet name in XLSX documents.
  • Added a --partition-strategy parameter to unstructured-ingest so that users can specify
    partition strategy in CLI. For example, --partition-strategy fast.
  • Added metadata for filetype.
  • Add Discord connector to pull messages from a list of channels
  • Refactor unstructured/file_utils/filetype.py to make better use of a hash map when returning the MIME type.
  • Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.
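
The hash map lookup mentioned in the filetype refactor above might look something like the sketch below. The mapping contents and the function name mime_type_for are assumptions for illustration only; the library's real table is larger and may be keyed differently.

```python
import os
from typing import Optional

# Illustrative subset only; not the library's actual mapping.
EXTENSION_TO_MIME = {
    ".csv": "text/csv",
    ".html": "text/html",
    ".json": "application/json",
    ".txt": "text/plain",
    ".xml": "application/xml",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def mime_type_for(filename: str) -> Optional[str]:
    """Look up a MIME type from the file extension, case-insensitively."""
    _, extension = os.path.splitext(filename)
    # A dict lookup replaces a chain of if/elif comparisons on the extension.
    return EXTENSION_TO_MIME.get(extension.lower())
```

A dictionary keeps the extension table in one place, so supporting a new file type is a one-line change rather than another branch.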

Features

  • Add partition_xml for XML files.
  • Add partition_xlsx for Microsoft Excel documents.

Fixes

  • Supports hml filetype for partition as a variation of html filetype.
  • Makes pytesseract a function level import in partition_pdf so you can use the "fast"
    or "hi_res" strategies if pytesseract is not installed. Also adds the
    required_dependencies decorator for the "hi_res" and "ocr_only" strategies.
  • Fix to ensure filename is tracked in metadata for docx tables.

0.6.6

12 May 17:47
727d366

Enhancements

  • Adds an "auto" strategy that chooses the partitioning strategy based on document
    characteristics and function kwargs. This is the new default strategy for partition_pdf
    and partition_image. Users can maintain existing behavior by explicitly setting
    strategy="hi_res".
  • Added an additional trace logger for NLP debugging.
  • Add get_date method to ElementMetadata for converting the datestring to a datetime object.
  • Clean up the filename attribute on ElementMetadata to remove the full filepath.

Features

  • Added table reading as HTML with URL parsing to partition_docx.
  • Added metadata field for text_as_html for docx files

Fixes

  • The fileutils/file_type check for JSON and EML files now ignores decode errors.
  • partition_email was updated to more flexibly handle deviations from the RFC-2822 standard.
    The time in the metadata returns None if the time does not match RFC-2822 at all.
  • Include all metadata fields when converting to dataframe or CSV
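
The None-on-mismatch behavior described for partition_email above can be approximated with the standard library's RFC 2822 date parser. This sketch covers only that one aspect, not the more flexible handling the fix mentions, and the function name parse_email_date is an assumption.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_email_date(value: str) -> Optional[datetime]:
    """Parse an RFC 2822 date header, returning None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Depending on the Python version, an unparseable value surfaces as
        # either TypeError or ValueError, so both are treated as "no date".
        return None
```

Returning None instead of raising lets a malformed Date header degrade to missing metadata rather than failing the whole partitioning call.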

0.6.5

10 May 04:40
b52638f

Enhancements

  • Added support for SpooledTemporaryFile file argument.

Features

Fixes

0.6.4

08 May 17:57
3d3f3df

Enhancements

  • Added an "ocr_only" strategy for partition_pdf. Refactored the strategy decision
    logic into its own module.

Features

Fixes

0.6.3

04 May 20:25
392cccd
Compare
Choose a tag to compare

0.6.3

Enhancements

  • Add an "ocr_only" strategy for partition_image.

Features

  • Added partition_multiple_via_api for partitioning multiple documents in a single REST
    API call.
  • Added stage_for_baseplate function to prepare outputs for ingestion into Baseplate.
  • Added partition_odt for processing Open Office documents.

Fixes

  • Updates the grouping logic in the partition_pdf fast strategy to group together text
    in the same bounding box.
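
One way to picture the grouping described in the fix above is to merge text fragments that share a bounding box. The data model here (a list of bounding box and text pairs) and the function name group_by_bbox are assumptions for illustration; the library's actual grouping logic is not shown in these notes and may work differently.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def group_by_bbox(fragments: List[Tuple[BBox, str]]) -> List[Tuple[BBox, str]]:
    """Join text fragments that share a bounding box, keeping first-seen order."""
    grouped: Dict[BBox, List[str]] = {}
    for bbox, text in fragments:
        # dicts preserve insertion order, so boxes stay in document order.
        grouped.setdefault(bbox, []).append(text.strip())
    return [(bbox, " ".join(parts)) for bbox, parts in grouped.items()]
```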