Commit

Merge branch 'main' into feat/configure-googlevision-api-endpoint
MthwRobinson authored May 30, 2024
2 parents d61522f + 23e570f commit 0bc68f2
Showing 125 changed files with 727 additions and 44 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/ci.yml
@@ -504,4 +504,5 @@ jobs:
uses: anchore/scan-action@v3
with:
image: "unstructured:dev"
severity-cutoff: medium
# NOTE(robinson) - revert this to medium when we bump libreoffice
severity-cutoff: high
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,11 +1,12 @@
## 0.14.4-dev1
## 0.14.4-dev2

### Enhancements

### Features

### Fixes

* **Clarified dependence on particular version of `python-docx`** Pinned `python-docx` version to ensure a particular method `unstructured` uses is included.
* **Ingest preserves original file extension** Ingest V2 introduced a change that dropped the original extension for upgraded connectors. This reverts that change.

## 0.14.3
50 changes: 15 additions & 35 deletions README.md
@@ -37,21 +37,7 @@
<p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h2>

The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://unstructured-io.github.io/unstructured/core.html#partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

<h3 align="center">
<p>API Announcement!</p>
</h3>

We are thrilled to announce our newly launched [Unstructured API](https://unstructured-io.github.io/unstructured/api.html), providing the Unstructured capabilities from `unstructured` as an API. Check out the [`unstructured-api` GitHub repository](https://github.com/Unstructured-IO/unstructured-api) to start making API calls. You’ll also find instructions about how to host your own API version.

While access to the hosted Unstructured API will remain free, API Keys are required to make requests. To prevent disruption, get yours [here](https://unstructured.io/api-key) and start using it today! Check out the [`unstructured-api` README](https://github.com/Unstructured-IO/unstructured-api#--) to start making API calls.</p>

#### :rocket: Beta Feature: Chipper Model

We are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To start using the Chipper model in your API request, you can utilize the `hi_res_model_name=chipper` parameter. Please refer to the documentation [here](https://unstructured-io.github.io/unstructured/api.html#beta-version-hi-res-strategy-with-chipper-model).

As the Chipper model is in beta version, we welcome feedback and suggestions. For those interested in testing the Chipper model, we encourage you to connect with us on [Slack community](https://short.unstructured.io/pzw05l7).
The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://docs.unstructured.io/open-source/core-functionality/partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

## :eight_pointed_black_star: Quick Start

@@ -182,29 +168,23 @@ This starts a docker container with your local repo mounted to `/mnt/local_unstr
## :clap: Quick Tour

### Documentation
This README overviews how to install, use and develop the library. For more comprehensive documentation, visit https://unstructured-io.github.io/unstructured/ .
For more comprehensive documentation, visit https://docs.unstructured.io . You can also learn
more about our other products on the documentation page, including our SaaS API.

### Concepts Guide
Here are a few pages from the [Open Source documentation page](https://docs.unstructured.io/open-source/introduction/overview)
that are helpful for new users to review:

The `unstructured` library includes core functionality for partitioning, chunking, cleaning, and
staging raw documents for NLP tasks.
You can see a complete list of available functions and how to use them from the [Core Functionality documentation](https://unstructured-io.github.io/unstructured/core.html).
- [Quick Start](https://docs.unstructured.io/open-source/introduction/quick-start)
- [Using the `unstructured` open source package](https://docs.unstructured.io/open-source/core-functionality/overview)
- [Connectors](https://docs.unstructured.io/open-source/ingest/overview)
- [Concepts](https://docs.unstructured.io/open-source/concepts/document-elements)
- [Integrations](https://docs.unstructured.io/open-source/integrations)

In general, these functions fall into several categories:
- *Partitioning* functions break raw documents into standard, structured elements.
- *Cleaning* functions remove unwanted text from documents, such as boilerplate and sentence fragments.
- *Staging* functions format data for downstream tasks, such as ML inference and data labeling.
- *Chunking* functions split documents into smaller sections for use in RAG apps and similarity
search.
- *Embedding* encoder classes provide an interface for easily converting preprocessed text to
vectors.

The **Connectors** 🔗 in `unstructured` serve as vital links between the pre-processing pipeline and various data storage platforms. They allow for the batch processing of documents across various sources, including cloud services, repositories, and local directories. Each connector is tailored to a specific platform, such as Azure, Google Drive, or Github, and comes with unique commands and dependencies. To see the list of Connectors available in `unstructured` library, please check out the [Connectors GitHub folder](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest/connector) and [documentation](https://unstructured-io.github.io/unstructured/ingest/index.html)

### PDF Document Parsing Example
The following examples show how to get started with the `unstructured` library. You can parse over a dozen document types with one line of code! Use this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the example below.

The easiest way to parse a document in `unstructured` is to use the `partition` function. If you use the `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. If you are using the `partition` function, you may need to install additional parameters via `pip install unstructured[local-inference]`. Ensure you first install `libmagic` using the instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection). `partition` will always apply the default arguments. If you need advanced features, use a document-specific partitioning function.
The following examples show how to get started with the `unstructured` library. The easiest way to parse a document in `unstructured` is to use the `partition` function, which detects the file type and routes the document to the appropriate file-specific partitioning function. Depending on the document type, you may need to install additional dependencies.
For example, to install the docx dependencies, run `pip install "unstructured[docx]"`.
See our [installation guide](https://docs.unstructured.io/open-source/installation/full-installation) for more details.

```python
from unstructured.partition.auto import partition
@@ -245,7 +225,7 @@ Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
including document image classification [11,
```

See the [partitioning](https://unstructured-io.github.io/unstructured/core.html#partitioning)
See the [partitioning](https://docs.unstructured.io/open-source/core-functionality/partitioning)
section in our documentation for a full list of options and instructions on how to use
file-specific partitioning functions.
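
The routing idea described above, where `partition` detects the file type and hands the document to a file-specific function, can be sketched in plain Python. This is an illustrative sketch only and not how `unstructured` is implemented: the handlers `handle_text` and `handle_html`, the `ROUTES` table, the `route_by_type` function, and the use of the standard-library `mimetypes` module for detection are all assumptions made for the example.

```python
import mimetypes
from typing import Callable, Dict, List


def handle_text(content: str) -> List[str]:
    """Hypothetical handler: split plain text into non-empty paragraphs."""
    return [block.strip() for block in content.split("\n\n") if block.strip()]


def handle_html(content: str) -> List[str]:
    """Hypothetical handler: placeholder that returns the markup as one element."""
    return [content.strip()]


# Hypothetical routing table: detected MIME type -> file-specific handler.
ROUTES: Dict[str, Callable[[str], List[str]]] = {
    "text/plain": handle_text,
    "text/html": handle_html,
}


def route_by_type(filename: str, content: str) -> List[str]:
    """Guess the file type from the filename and dispatch to the matching handler."""
    mime_type, _ = mimetypes.guess_type(filename)
    handler = ROUTES.get(mime_type)
    if handler is None:
        raise ValueError(f"No handler registered for {filename!r} ({mime_type})")
    return handler(content)
```

A single entry point keeps the calling code short, while the routing table makes it clear where type-specific options would have to go: a caller who needs them would call the specific handler directly instead of the router.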

@@ -263,7 +243,7 @@ Encountered a bug? Please create a new [GitHub issue](https://github.com/Unstruc
| Section | Description |
|-|-|
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |
| [Documentation](https://unstructured-io.github.io/unstructured) | Full API documentation |
| [Documentation](https://docs.unstructured.io/) | Full API documentation |
| [Batch Processing](unstructured/ingest/README.md) | Ingesting batches of documents through Unstructured |

## :chart_with_upwards_trend: Analytics
6 changes: 6 additions & 0 deletions docs/source/404.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/api.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/api_parameters.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/api_sdks.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/aws_marketplace.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/azure_marketplace.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/saas_api.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/usage_methods.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/apis/validation_errors.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices/models.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices/strategies.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/best_practices/table_extraction_pdf.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/chunking.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/cleaning.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/embedding.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/extracting.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/partition.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/core/staging.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples/chroma.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples/databricks.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/examples/dict_to_elements.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/chunking_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/embedding_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/fsspec_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/partition_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/permissions_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/processor_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/read_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/configs/retry_strategy_config.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/destination_connectors.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/destination_connectors/astra.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/destination_connectors/azure.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/destination_connectors/box.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/destination_connectors/chroma.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/destination_connectors/clarifai.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.
6 changes: 6 additions & 0 deletions docs/source/ingest/destination_connectors/delta_table.rst
@@ -0,0 +1,6 @@
Unstructured Documentation
==========================

The Unstructured documentation page has moved! Check out our new and improved docs page at
`https://docs.unstructured.io <https://docs.unstructured.io>`_ to learn more about our
products and tools.