Releases: Unstructured-IO/unstructured
0.7.0
Enhancements
- Installing `detectron2` from source is no longer required when using the `local-inference` extra.
- Updates `.pptx` parsing to include text in tables.
Features
Fixes
- Fixes an issue in `_add_element_metadata` that caused all elements to have `page_number=1` in the element metadata.
- Adds `.log` as a file extension for TXT files.
- Adds functionality to try other common encodings for email (`.eml`) files if an error related to the encoding is raised and the user has not specified an encoding.
- Allow passed encoding to be used in the `replace_mime_encodings`
- Fixes page metadata for `partition_html` when `include_metadata=False`
- A `ValueError` now raises if `file_filename` is not specified when you use `partition_via_api` with a file-like object.
0.6.11
Enhancements
- Supports epub tests since pandoc is updated in base image
Features
Fixes
0.6.10
0.6.9
Enhancements
- fast strategy for pdf now keeps element bounding box data
- setup.py refactor
Features
Fixes
- Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
- Adds additional MIME types for CSV
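
The encoding fallback described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `decode_with_fallback` and the candidate list `COMMON_ENCODINGS` are assumptions made for illustration, and the encodings the library actually tries may differ.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list, tried in order. latin-1 maps every byte,
# so it acts as a last resort that cannot fail.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode bytes and return (text, encoding_used)."""
    if encoding is not None:
        # The caller specified an encoding: honor it and let any error surface.
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue  # encoding-related error: try the next candidate
    raise UnicodeError("none of the candidate encodings could decode the data")
```

The fallback only runs when no encoding is passed, which mirrors the condition in the release note: an explicit choice by the user is never silently overridden.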
0.6.8
Enhancements
Features
- Add `partition_csv` for CSV files.
Fixes
0.6.7
Enhancements
- Deprecate `--s3-url` in favor of `--remote-url` in CLI
- Refactor out non-connector-specific config variables
- Add `file_directory` to metadata
- Add `page_name` to metadata. Currently used for the sheet name in XLSX documents.
- Added a `--partition-strategy` parameter to unstructured-ingest so that users can specify partition strategy in CLI. For example, `--partition-strategy fast`.
- Added metadata for filetype.
- Add Discord connector to pull messages from a list of channels
- Refactor `unstructured/file-utils/filetype.py` to better utilise hashmap to return mime type.
- Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for `test_filetype.py`.
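
As a rough illustration of how a `file_directory` field can sit alongside a bare filename in metadata, the sketch below splits a path into the two parts. The helper `path_metadata` is hypothetical and is not part of the library; `posixpath` is used so the result does not depend on the host operating system.

```python
import posixpath
from typing import Dict


def path_metadata(path: str) -> Dict[str, str]:
    """Split a POSIX-style path into directory and filename metadata fields."""
    directory, filename = posixpath.split(path)
    return {"file_directory": directory, "filename": filename}


fields = path_metadata("/data/reports/q1.xlsx")
```

Here `fields["file_directory"]` holds the directory portion and `fields["filename"]` holds only the final path component.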
Features
- Add `partition_xml` for XML files.
- Add `partition_xlsx` for Microsoft Excel documents.
Fixes
- Supports `hml` filetype for partition as a variation of html filetype.
- Makes `pytesseract` a function level import in `partition_pdf` so you can use the `"fast"` or `"hi_res"` strategies if `pytesseract` is not installed. Also adds the `required_dependencies` decorator for the `"hi_res"` and `"ocr_only"` strategies.
- Fix to ensure `filename` is tracked in metadata for `docx` tables.
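
The combination of a function-level import and a dependency-checking decorator can be sketched as follows. This is an illustration of the general pattern only: the decorator name `requires` and its signature are assumptions and do not reproduce the library's `required_dependencies` decorator.

```python
import functools
import importlib.util
from typing import Callable, Sequence


def requires(dependencies: Sequence[str]) -> Callable:
    """Hypothetical decorator: fail with a clear error if a package is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Check at call time, so importing the module never fails.
            missing = [name for name in dependencies
                       if importlib.util.find_spec(name) is None]
            if missing:
                raise ImportError(
                    f"{func.__name__} requires missing packages: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires(["json"])
def parse(payload: str) -> dict:
    import json  # function-level import: only loaded when parse() runs

    return json.loads(payload)


@requires(["surely_not_installed_pkg_xyz"])  # placeholder for an optional dependency
def needs_optional_package() -> None:
    import surely_not_installed_pkg_xyz  # never reached if the package is absent
```

Because the import lives inside the function body, code paths that never call the function work even when the optional package is not installed.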
0.6.6
Enhancements
- Adds an `"auto"` strategy that chooses the partitioning strategy based on document characteristics and function kwargs. This is the new default strategy for `partition_pdf` and `partition_image`. Users can maintain existing behavior by explicitly setting `strategy="hi_res"`.
- Added an additional trace logger for NLP debugging.
- Add `get_date` method to `ElementMetadata` for converting the datestring to a `datetime` object.
- Cleanup the `filename` attribute on `ElementMetadata` to remove the full filepath.
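
A `get_date`-style accessor might look like the sketch below. The `Metadata` class is a hypothetical stand-in, not the library's `ElementMetadata`, and it assumes the datestring is stored in ISO 8601 form; the real method may accept other formats.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Metadata:  # hypothetical stand-in for illustration only
    date: Optional[str] = None

    def get_date(self) -> Optional[datetime]:
        """Convert the stored datestring to a datetime, or None if unset."""
        if self.date is None:
            return None
        # Assumption: the datestring is ISO 8601.
        return datetime.fromisoformat(self.date)


parsed = Metadata(date="2023-05-01T12:30:00").get_date()
```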
Features
- Added table reading as html with URL parsing to `partition_docx` in docx
- Added metadata field for text_as_html for docx files
Fixes
- `fileutils/file_type` check json and eml decode ignore error
- `partition_email` was updated to more flexibly handle deviations from the RFC-2822 standard. The time in the metadata returns `None` if the time does not match RFC-2822 at all.
- Include all metadata fields when converting to dataframe or CSV
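
The "return `None` when the date is not RFC-2822" behavior can be approximated with the standard library alone. This sketch is not taken from `partition_email`; the helper name `rfc2822_to_iso` is an assumption, and the library's own parsing may be more lenient.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def rfc2822_to_iso(value: str) -> Optional[str]:
    """Return an ISO 8601 string for an RFC-2822 date, or None if unparseable."""
    try:
        return parsedate_to_datetime(value).isoformat()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


ok = rfc2822_to_iso("Mon, 01 May 2023 12:00:00 +0000")
bad = rfc2822_to_iso("not a date")
```

Returning `None` instead of raising lets the rest of the email's metadata be populated even when the date header is malformed.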
0.6.5
Enhancements
- Added support for SpooledTemporaryFile file argument.
Features
Fixes
0.6.4
Enhancements
- Added an "ocr_only" strategy for `partition_pdf`. Refactored the strategy decision logic into its own module.
Features
Fixes
0.6.3
Enhancements
- Add an "ocr_only" strategy for `partition_image`.
Features
- Added `partition_multiple_via_api` for partitioning multiple documents in a single REST API call.
- Added `stage_for_baseplate` function to prepare outputs for ingestion into Baseplate.
- Added `partition_odt` for processing Open Office documents.
Fixes
- Updates the grouping logic in the `partition_pdf` fast strategy to group together text in the same bounding box.