Commit

Merge branch 'main' into fix/invalid-evaluation-doctype-deduction
micmarty-deepsense authored Jun 3, 2024
2 parents d5587d6 + 1b43102 commit 2405a2a
Showing 167 changed files with 869 additions and 176 deletions.
5 changes: 4 additions & 1 deletion .github/workflows/ci.yml
@@ -391,6 +391,8 @@ jobs:
env:
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
S3_INGEST_TEST_ACCESS_KEY: ${{ secrets.S3_INGEST_TEST_ACCESS_KEY }}
S3_INGEST_TEST_SECRET_KEY: ${{ secrets.S3_INGEST_TEST_SECRET_KEY }}
AZURE_SEARCH_ENDPOINT: ${{ secrets.AZURE_SEARCH_ENDPOINT }}
AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
BOX_APP_CONFIG: ${{ secrets.BOX_APP_CONFIG }}
@@ -504,4 +506,5 @@ jobs:
uses: anchore/scan-action@v3
with:
image: "unstructured:dev"
severity-cutoff: medium
# NOTE(robinson) - revert this to medium when we bump libreoffice
severity-cutoff: high
34 changes: 0 additions & 34 deletions .github/workflows/create_issue.yml

This file was deleted.

14 changes: 12 additions & 2 deletions CHANGELOG.md
@@ -1,12 +1,22 @@
## 0.14.4-dev0
## 0.14.4-dev7

### Enhancements

* **Move logger error to debug level when PDFminer fails to extract text** which includes error message for Invalid dictionary construct.
* **Add support for Pinecone serverless** Adds Pinecone serverless to the connector tests. Pinecone
serverless works with versions >=0.14.2, but hadn't been tested until now.

### Features

- **Allow configuration of the Google Vision API endpoint** Add an environment variable to select the Google Vision API in the US or the EU.
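
As a rough illustration of the feature above, the sketch below picks a regional Vision endpoint from an environment variable. The variable name `EXAMPLE_VISION_REGION`, the helper `resolve_vision_endpoint`, and the endpoint hostnames are assumptions made for this example; they are not taken from this commit, and the library's actual variable name and behavior may differ.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping of region codes to endpoints (assumed values for illustration).
VISION_ENDPOINTS = {
    "us": "us-vision.googleapis.com",
    "eu": "eu-vision.googleapis.com",
}
DEFAULT_ENDPOINT = "vision.googleapis.com"


def resolve_vision_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the endpoint for the region named in EXAMPLE_VISION_REGION.

    Falls back to the default endpoint when the variable is unset or unknown.
    """
    env = os.environ if env is None else env
    region = env.get("EXAMPLE_VISION_REGION", "").strip().lower()
    return VISION_ENDPOINTS.get(region, DEFAULT_ENDPOINT)
```

Passing the environment as an argument keeps the helper easy to exercise without mutating `os.environ`.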

### Fixes

* **Fix document type deduction during evaluation** There was a bug that caused the document type for file with more than 2 extensions (or dot symbols) to be inferred incorrectly.
* **Remove root handlers in ingest logger**. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.
* **Fix V2 S3 Destination Connector authentication** Fixes bugs with S3 Destination Connector where the connection config was neither registered nor properly deserialized.
* **Clarified dependence on particular version of `python-docx`** Pinned `python-docx` version to ensure a particular method `unstructured` uses is included.
* **Ingest preserves original file extension** Ingest V2 introduced a change that dropped the original extension for upgraded connectors. This reverts that change.
* * **Fix document type deduction during evaluation** There was a bug that caused the document type for file with more than 2 extensions (or dot symbols) to be inferred incorrectly.
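
The document type deduction fix listed above concerns filenames that contain several dots. A minimal sketch of the general idea follows, assuming evaluation outputs are named like `<original filename>.json`; the helper `deduce_doctype` and that naming convention are hypothetical and are not the code from this commit.

```python
from pathlib import Path


def deduce_doctype(output_filename: str, output_suffix: str = ".json") -> str:
    """Return the source document type encoded in an output filename.

    Only the last extension of the original filename is used, so extra
    dots earlier in the name do not affect the result.
    """
    name = Path(output_filename).name
    # Drop the suffix added by the (assumed) output naming convention.
    if output_suffix and name.endswith(output_suffix):
        name = name[: -len(output_suffix)]
    # Path.suffix yields only the final extension, e.g. ".pdf".
    return Path(name).suffix.lstrip(".").lower()


# A naive approach such as name.split(".")[1] would pick "v2" here.
example = "report.v2.final.pdf.json"
doctype = deduce_doctype(example)
```

Taking the extension from the right-hand side of the name, rather than indexing into a split from the left, is what makes the result independent of how many dots the filename contains.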

## 0.14.3

50 changes: 15 additions & 35 deletions README.md
@@ -37,21 +37,7 @@
<p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h2>

The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://unstructured-io.github.io/unstructured/core.html#partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

<h3 align="center">
<p>API Announcement!</p>
</h3>

We are thrilled to announce our newly launched [Unstructured API](https://unstructured-io.github.io/unstructured/api.html), providing the Unstructured capabilities from `unstructured` as an API. Check out the [`unstructured-api` GitHub repository](https://github.com/Unstructured-IO/unstructured-api) to start making API calls. You’ll also find instructions about how to host your own API version.

While access to the hosted Unstructured API will remain free, API Keys are required to make requests. To prevent disruption, get yours [here](https://unstructured.io/api-key) and start using it today! Check out the [`unstructured-api` README](https://github.com/Unstructured-IO/unstructured-api#--) to start making API calls.</p>

#### :rocket: Beta Feature: Chipper Model

We are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To start using the Chipper model in your API request, you can utilize the `hi_res_model_name=chipper` parameter. Please refer to the documentation [here](https://unstructured-io.github.io/unstructured/api.html#beta-version-hi-res-strategy-with-chipper-model).

As the Chipper model is in beta version, we welcome feedback and suggestions. For those interested in testing the Chipper model, we encourage you to connect with us on [Slack community](https://short.unstructured.io/pzw05l7).
The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://docs.unstructured.io/open-source/core-functionality/partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

## :eight_pointed_black_star: Quick Start

@@ -182,29 +168,23 @@ This starts a docker container with your local repo mounted to `/mnt/local_unstr
## :clap: Quick Tour

### Documentation
This README overviews how to install, use and develop the library. For more comprehensive documentation, visit https://unstructured-io.github.io/unstructured/ .
For more comprehensive documentation, visit https://docs.unstructured.io . You can also learn
more about our other products on the documentation page, including our SaaS API.

### Concepts Guide
Here are a few pages from the [Open Source documentation page](https://docs.unstructured.io/open-source/introduction/overview)
that are helpful for new users to review:

The `unstructured` library includes core functionality for partitioning, chunking, cleaning, and
staging raw documents for NLP tasks.
You can see a complete list of available functions and how to use them from the [Core Functionality documentation](https://unstructured-io.github.io/unstructured/core.html).
- [Quick Start](https://docs.unstructured.io/open-source/introduction/quick-start)
- [Using the `unstructured` open source package](https://docs.unstructured.io/open-source/core-functionality/overview)
- [Connectors](https://docs.unstructured.io/open-source/ingest/overview)
- [Concepts](https://docs.unstructured.io/open-source/concepts/document-elements)
- [Integrations](https://docs.unstructured.io/open-source/integrations)

In general, these functions fall into several categories:
- *Partitioning* functions break raw documents into standard, structured elements.
- *Cleaning* functions remove unwanted text from documents, such as boilerplate and sentence fragments.
- *Staging* functions format data for downstream tasks, such as ML inference and data labeling.
- *Chunking* functions split documents into smaller sections for use in RAG apps and similarity
search.
- *Embedding* encoder classes provide an interfaces for easily converting preprocessed text to
vectors.

The **Connectors** 🔗 in `unstructured` serve as vital links between the pre-processing pipeline and various data storage platforms. They allow for the batch processing of documents across various sources, including cloud services, repositories, and local directories. Each connector is tailored to a specific platform, such as Azure, Google Drive, or Github, and comes with unique commands and dependencies. To see the list of Connectors available in `unstructured` library, please check out the [Connectors GitHub folder](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest/connector) and [documentation](https://unstructured-io.github.io/unstructured/ingest/index.html)

### PDF Document Parsing Example
The following examples show how to get started with the `unstructured` library. You can parse over a dozen document types with one line of code! Use this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the example below.

The easiest way to parse a document in unstructured is to use the `partition` function. If you use `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. If you are using the `partition` function, you may need to install additional parameters via `pip install unstructured[local-inference]`. Ensure you first install `libmagic` using the instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection) `partition` will always apply the default arguments. If you need advanced features, use a document-specific partitioning function.
The following examples show how to get started with the `unstructured` library. The easiest way to parse a document in `unstructured` is to use the `partition` function, which detects the file type and routes the document to the appropriate file-specific partitioning function. Depending on the document type, you may need to install additional dependencies.
For example, to install docx dependencies you need to run `pip install "unstructured[docx]"`.
See our [installation guide](https://docs.unstructured.io/open-source/installation/full-installation) for more details.

```python
from unstructured.partition.auto import partition
@@ -245,7 +225,7 @@ Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
including document image classification [11,
```

See the [partitioning](https://unstructured-io.github.io/unstructured/core.html#partitioning)
See the [partitioning](https://docs.unstructured.io/open-source/core-functionality/partitioning)
section in our documentation for a full list of options and instructions on how to use
file-specific partitioning functions.

@@ -263,7 +243,7 @@ Encountered a bug? Please create a new [GitHub issue](https://github.com/Unstruc
| Section | Description |
|-|-|
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |
| [Documentation](https://unstructured-io.github.io/unstructured) | Full API documentation |
| [Documentation](https://docs.unstructured.io/) | Full API documentation |
| [Batch Processing](unstructured/ingest/README.md) | Ingesting batches of documents through Unstructured |

## :chart_with_upwards_trend: Analytics
15 changes: 5 additions & 10 deletions docs/source/404.rst
@@ -1,11 +1,6 @@
.. _404:

404 Error
=========

.. raw:: html

<script type="text/javascript">
window.location.href = "index.html";
</script>
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/api.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/api_parameters.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/api_sdks.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/aws_marketplace.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/azure_marketplace.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/saas_api.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/usage_methods.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/validation_errors.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices/models.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices/strategies.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices/table_extraction_pdf.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/chunking.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/cleaning.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/embedding.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/extracting.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/partition.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/staging.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples/chroma.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples/databricks.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples/dict_to_elements.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/chunking_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/embedding_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/fsspec_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/partition_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/permissions_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/processor_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/read_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.