[Win][Config] Enable full support of UnstructuredIO API features on Windows (#2)

## PR Summary
1. Merge changes from upstream.
2. Update `unstructuredio_api.spec`.
3. Update `unstructuredio_api.py`.
4. Add additional setup dependencies to `docs/Windows.md`.

* build(deps): version bumps for maintenance (Unstructured-IO#424)

### Summary
Version bumps for regular maintenance and to address moderate CVEs from
security scans.
- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`

* build: replace rockylinux with chainguard/wolfi as a base image (Unstructured-IO#423)

### Summary
Updates the Dockerfile to use the Chainguard wolfi-base image to reduce
CVEs. Also adds a step in the docker publish job that scans the images
and checks for CVEs before publishing.

### Testing
Run `make docker-build` and `make docker-start-api`, then try:
```python
from unstructured.partition.api import partition_via_api

# Path to any local document to partition (placeholder; substitute your own).
filename = "<PATH-TO-FILE>"

elements = partition_via_api(
    filename=filename,
    api_url="http://localhost:8000/general/v0/general",
    api_key="<API-KEY>",
    strategy="hi_res",
)

print("\n\n".join([str(el) for el in elements]))
```

* fix: build and push workflow failing due to missing `-f` option in `buildx build` command (Unstructured-IO#425)

I noticed that images on the main branch are failing to build (and push)
because the `-f` parameter is missing from `docker buildx build`. By
default the command expects a file named `Dockerfile` to exist, but we
only have `Dockerfile-amd64` and `Dockerfile-arm64`.
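
For illustration only, the per-architecture invocation can be sketched in Python as a function that assembles the argument list. The function name `buildx_command` and its parameters are hypothetical and are not part of the workflow; only the `docker buildx build` flags (`--load`, `-f`, `--platform`, `-t`) are real.

```python
from typing import List


def buildx_command(arch: str, repository: str, short_sha: str) -> List[str]:
    """Assemble a `docker buildx build` argument list for one architecture.

    Hypothetical helper: it only mirrors the idea that each architecture
    has its own Dockerfile, so the file must be named explicitly with -f.
    """
    return [
        "docker", "buildx", "build", "--load",
        # Without -f, buildx looks for a file named exactly "Dockerfile".
        "-f", f"Dockerfile-{arch}",
        f"--platform=linux/{arch}",
        "-t", f"{repository}:{arch}-{short_sha}",
        ".",  # build context
    ]
```

Passing the result to `subprocess.run` would execute the build; the sketch stops at building the list so that it can be inspected without Docker installed.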


![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/4527165a-909e-498d-b0ee-8bba4b1a13e4)

---------

Co-authored-by: christinestraub <[email protected]>

* fix: update SHA for the base images (both architectures) after `base-images` repo update (Unstructured-IO#427)

The build and publish CI steps are failing because the SHAs of the base
images in Quay have changed.

![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fc4e9aac-0820-4c90-9ad9-68cc6d9aad03)


![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fafe2ca4-dab2-4610-a26b-a7a4d56723a5)

* fix: revert to rockylinux SHA that works (arm64) (Unstructured-IO#428)

The SHA update introduced in Unstructured-IO#427 was unnecessary and
needs to be reverted.

* fix: re-add `DOCKER_IMAGE` env var in `Test image` step (Unstructured-IO#429)

A shell syntax error occurs in the `docker-publish.yml` workflow.

* fix: invalid env var setting in `docker-publish` workflow (Unstructured-IO#430)

A bug introduced in the previous PR causes the build to fail on main.

* fix: `docker-publish` workflow failing on main due to nonexistent `ARCH` env var (Unstructured-IO#431)

* build(deps): bump dependency versions (Unstructured-IO#434)

### Summary

Bumps dependency versions for the API. Closes Unstructured-IO#432.

* fix/Fix MS Office filetype errors and harden docker smoketest (Unstructured-IO#436)

# Changes
**Fix for docx and other office files returning `{"detail":"File type
None is not supported."}`**
After moving to the wolfi base image, the `mimetypes` lib no longer
knows about these file extensions. To avoid issues like this, let's add
an explicit mapping for all the file extensions we care about. I added a
`filetypes.py` and moved `get_validated_mimetype` over. When this file
is imported, we'll call `mimetypes.add_type` for all file extensions we
support.
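
A minimal sketch of the idea, assuming a small illustrative mapping: the names `EXTENSION_TO_MIMETYPE`, `register_mimetypes`, and `guess_mimetype` are hypothetical and do not reproduce the actual contents of `filetypes.py`. Only `mimetypes.add_type` and `mimetypes.guess_type` are real standard-library calls.

```python
import mimetypes
from typing import Optional

# Illustrative subset only; the real mapping in `filetypes.py` may differ.
EXTENSION_TO_MIMETYPE = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def register_mimetypes() -> None:
    """Register every extension explicitly instead of relying on the OS database."""
    for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
        mimetypes.add_type(mimetype, extension)


# Called at import time, so any module that imports this one gets the mapping.
register_mimetypes()


def guess_mimetype(filename: str) -> Optional[str]:
    """Return the MIME type guessed from the file extension, or None if unknown."""
    return mimetypes.guess_type(filename)[0]
```

With the explicit registration in place, a lookup such as `guess_mimetype("fake.docx")` no longer depends on which system MIME database the base image ships.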

**Update smoke test coverage**
This bug snuck past because we were already providing the mimetype in
the docker smoke test. I updated `test_happy_path` to test against the
container with and without passing `content_type`. I added some missing
filetypes, and sorted the test params by extension so we can see when
new types are missing.
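
The parameter layout described above could look roughly like the following sketch. The helper `smoke_test_params` and the sample file names in `SAMPLE_FILES` are assumptions for illustration; the real `test_happy_path` parametrization is not shown in this commit message.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Hypothetical sample documents and their expected MIME types.
SAMPLE_FILES: Dict[str, str] = {
    "fake.docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "fake-text.txt": "text/plain",
    "fake.doc": "application/msword",
}


def smoke_test_params(sample_files: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Build (filename, content_type) pairs, sorted by file extension.

    Each file appears twice: once with an explicit content type and once
    with None, so the server-side mimetype lookup is exercised as well.
    """
    ordered = sorted(sample_files.items(), key=lambda item: (Path(item[0]).suffix, item[0]))
    params: List[Tuple[str, Optional[str]]] = []
    for filename, content_type in ordered:
        params.append((filename, content_type))
        params.append((filename, None))
    return params
```

A list like this could be handed to `pytest.mark.parametrize`; sorting by extension keeps the cases grouped so a missing filetype is easier to spot.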

# Testing
The new smoke test verifies that all filetypes are working. You can
also run `make docker-build && make docker-start-api` and test the docx
in the sample docs dir. On `main`, this file gives you the error
above.
```
curl 'http://localhost:8000/general/v0/general' \
--form 'files=@"fake.docx"'
```

* merge main; validated formats: xml, txv, csv, json, html, docs, docx, ppt, pptx, xlsx, xls, pdf

* compilable setting

* update Windows markdown

* Disable debug mode in `unstructuredio_api.spec`

* Enable pdf test case in `test_app.py`

---------

Co-authored-by: Christine Straub <[email protected]>
Co-authored-by: Michał Martyniak <[email protected]>
Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
6 people authored Jul 11, 2024
1 parent 4c22810 commit 392a12e
Showing 23 changed files with 965 additions and 351 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/ci.yml
@@ -112,3 +112,9 @@ jobs:
         source .venv/bin/activate
         make docker-build
         make docker-test
+    - name: Scan image
+      uses: anchore/scan-action@v3
+      with:
+        image: "pipeline-family-${{ env.PIPELINE_FAMILY }}-dev"
+        # NOTE(robinson) - revert this to medium when we bump libreoffice
+        severity-cutoff: high
22 changes: 10 additions & 12 deletions .github/workflows/docker-publish.yml
@@ -45,16 +45,17 @@ jobs:
   build-images:
     strategy:
       matrix:
-        docker-platform: ["linux/arm64", "linux/amd64"]
+        arch: ["arm64", "amd64"]
     runs-on: ubuntu-latest-m
     needs: [setup, set-short-sha]
     env:
       SHORT_SHA: ${{ needs.set-short-sha.outputs.short_sha }}
+      DOCKER_PLATFORM: linux/${{ matrix.arch }}
     steps:
     - name: Set up Docker Buildx
       uses: docker/setup-buildx-action@v3
       with:
-        driver: ${{ matrix.docker-platform == 'linux/amd64' && 'docker' || 'docker-container' }}
+        driver: ${{ matrix.arch == 'amd64' && 'docker' || 'docker-container' }}
     - name: Checkout code
       uses: actions/checkout@v4
     - name: Login to Quay.io
@@ -68,15 +69,15 @@ jobs:
         # Clear some space (https://github.com/actions/runner-images/issues/2840)
         sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/share/boost
-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
-        DOCKER_BUILDKIT=1 docker buildx build --platform=$ARCH --load \
+        DOCKER_BUILDKIT=1 docker buildx build --load -f Dockerfile-${{ matrix.arch }} \
+          --platform=$DOCKER_PLATFORM \
           --build-arg PIP_VERSION=$PIP_VERSION \
           --build-arg BUILDKIT_INLINE_CACHE=1 \
           --build-arg PIPELINE_PACKAGE=${{ env.PIPELINE_FAMILY }} \
           --provenance=false \
           --progress plain \
-          --cache-from $DOCKER_BUILD_REPOSITORY:$ARCH \
-          -t $DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA .
+          --cache-from $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }} \
+          -t $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA .
     - name: Set virtualenv cache
       uses: actions/cache@v4
       id: virtualenv-cache
@@ -88,20 +89,17 @@ jobs:
       uses: docker/setup-qemu-action@v3
     - name: Test image
       run: |
-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
         source .venv/bin/activate
-        if [ "${{ matrix.docker-platform }}" == "linux/arm64" ]; then
-          DOCKER_PLATFORM="${{ matrix.docker-platform }}" DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA" \
+        export DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA"
+        if [ "$DOCKER_PLATFORM" == "linux/arm64" ]; then
           SKIP_INFERENCE_TESTS=true make docker-test
         else
-          DOCKER_PLATFORM="${{ matrix.docker-platform }}" DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA" \
           make docker-test
         fi
     - name: Push image
       run: |
         # write to the build repository to cache for the publish-images job
-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
-        docker push $DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA
+        docker push $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA
   publish-images:
     runs-on: ubuntu-latest-m
     needs: [setup, set-short-sha, build-images]
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,16 @@
+## 0.0.72
+
+* Fix certain filetypes failing mimetype lookup in the new base image
+
+## 0.0.71
+
+* replace rockylinux with chainguard/wolfi as a base image for `amd64`
+
+## 0.0.70
+
+* Bump to `unstructured` 0.14.6
+* Bump to `unstructured-inference` 0.7.35
+
 ## 0.0.69

 * Bump to `unstructured` 0.14.4
43 changes: 43 additions & 0 deletions Dockerfile-amd64
@@ -0,0 +1,43 @@
+# syntax=docker/dockerfile:experimental
+FROM quay.io/unstructured-io/base-images:wolfi-base@sha256:7c3af225a39f730f4feee705df6cd8d1570739dc130456cf589ac53347da0f1d as base
+
+# NOTE(crag): NB_USER ARG for mybinder.org compat:
+# https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
+ARG NB_USER=notebook-user
+ARG NB_UID=1000
+ARG PIP_VERSION
+ARG PIPELINE_PACKAGE
+ARG PYTHON_VERSION="3.11"
+
+# Set up environment
+ENV PYTHON python${PYTHON_VERSION}
+ENV PIP ${PYTHON} -m pip
+
+WORKDIR ${HOME}
+USER ${NB_USER}
+
+ENV PYTHONPATH="${PYTHONPATH}:${HOME}"
+ENV PATH="/home/${NB_USER}/.local/bin:${PATH}"
+
+FROM base as python-deps
+COPY --chown=${NB_USER}:${NB_USER} requirements/base.txt requirements-base.txt
+RUN ${PIP} install pip==${PIP_VERSION}
+RUN ${PIP} install --no-cache -r requirements-base.txt
+
+FROM python-deps as model-deps
+RUN ${PYTHON} -c "import nltk; nltk.download('punkt')" && \
+${PYTHON} -c "import nltk; nltk.download('averaged_perceptron_tagger')" && \
+${PYTHON} -c "from unstructured.partition.model_init import initialize; initialize()"
+
+FROM model-deps as code
+COPY --chown=${NB_USER}:${NB_USER} CHANGELOG.md CHANGELOG.md
+COPY --chown=${NB_USER}:${NB_USER} logger_config.yaml logger_config.yaml
+COPY --chown=${NB_USER}:${NB_USER} prepline_${PIPELINE_PACKAGE}/ prepline_${PIPELINE_PACKAGE}/
+COPY --chown=${NB_USER}:${NB_USER} exploration-notebooks exploration-notebooks
+COPY --chown=${NB_USER}:${NB_USER} scripts/app-start.sh scripts/app-start.sh
+
+ENTRYPOINT ["scripts/app-start.sh"]
+# Expose a default port of 8000. Note: The EXPOSE instruction does not actually publish the port,
+# but some tooling will inspect containers and perform work contingent on networking support declared.
+
+EXPOSE 8000
2 changes: 1 addition & 1 deletion Dockerfile → Dockerfile-arm64
@@ -46,4 +46,4 @@ COPY --chown=${NB_USER}:${NB_USER} scripts/app-start.sh scripts/app-start.sh
 ENTRYPOINT ["scripts/app-start.sh"]
 # Expose a default port of 8000. Note: The EXPOSE instruction does not actually publish the port,
 # but some tooling will inspect containers and perform work contingent on networking support declared.
-EXPOSE 8000
+EXPOSE 8000
52 changes: 52 additions & 0 deletions _internal/config/logger_config.yaml
@@ -0,0 +1,52 @@
+version: 1
+disable_existing_loggers: False
+formatters:
+  default_format:
+    "()": uvicorn.logging.DefaultFormatter
+    format: '%(asctime)s %(name)s %(levelname)s %(message)s'
+  access:
+    "()": uvicorn.logging.AccessFormatter
+    format: '%(asctime)s %(client_addr)s %(request_line)s - %(status_code)s'
+handlers:
+  access_handler:
+    formatter: access
+    class: logging.StreamHandler
+    stream: ext://sys.stderr
+  standard_handler:
+    formatter: default_format
+    class: logging.StreamHandler
+    stream: ext://sys.stderr
+loggers:
+  uvicorn.error:
+    level: INFO
+    handlers:
+      - standard_handler
+    propagate: no
+    # disable logging for uvicorn.error by not having a handler
+  uvicorn.access:
+    level: INFO
+    handlers:
+      - access_handler
+    propagate: no
+    # disable logging for uvicorn.access by not having a handler
+  unstructured:
+    level: INFO
+    handlers:
+      - standard_handler
+    propagate: no
+  unstructured.trace:
+    level: CRITICAL
+    handlers:
+      - standard_handler
+    propagate: no
+  unstructured_inference:
+    level: DEBUG
+    handlers:
+      - standard_handler
+    propagate: no
+  unstructured_api:
+    level: DEBUG
+    handlers:
+      - standard_handler
+    propagate: no
