Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Win][Config] Enable full support of UnstructuredIO API features on W…
…indows (#2) ## PR Summary 1. Merged changes from upstream. 2. Update `unstructuredio_api.spec`. 3. Update `unstructuredio_api.py`. 4. Add additional setup dependencies to the `docs/Windows.md`. * build(deps): version bumps for maintenance (Unstructured-IO#424) ### Summary Version bumps for regular maintenance and to address moderate CVEs from security scans. - bump `unstructured` to `0.14.6` - bump `unstructured-inference` to `0.7.35` * build: replace rockylinux with chainguard/wolfi as a base image (Unstructured-IO#423) ### Summary Updates the Dockerfile to use the Chainguard wolfi-base image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing. ### Testing Run `make docker-build` and `make docker-start-api`, then try: ``` from unstructured.partition.api import partition_via_api elements = partition_via_api( filename=filename, api_url="http://localhost:8000/general/v0/general", api_key="<API-KEY>", strategy="hi_res", ) print("\n\n".join([str(el) for el in elements])) ``` * fix: build and push workflow failing due to missing `-f` option `buildx build` command (Unstructured-IO#425) I noticed that images on main branch are failing to build (and push) due to missing `-f` parameter in `docker buildx build`. By default it expects `Dockerfile` to exist, but we only have `Dockerfile-amd64` and `Dockerfile-arm64` ![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/4527165a-909e-498d-b0ee-8bba4b1a13e4) --------- Co-authored-by: christinestraub <[email protected]> * fix: update SHA for the base images (both architectures) after `base-images` repo update (Unstructured-IO#427) build and publish CI steps are failing, because the base images have changed in quay (their SHAs) ![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fc4e9aac-0820-4c90-9ad9-68cc6d9aad03) ![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fafe2ca4-dab2-4610-a26b-a7a4d56723a5) * fix: revert to rockylinux SHA that works (arm64) (Unstructured-IO#428) unnecessary SHA update introduced in Unstructured-IO#427 that needs to be reverted * fix: re-add `DOCKER_IMAGE` env var in `Test image` step (Unstructured-IO#429) shell syntax error occurs in docker-publish.yml workflow * fix: invalid env var setting in `docker-publish` workflow (Unstructured-IO#430) bug introduced in previous PR causing build failure on main * fix: `docker-publish` workflow failing on main due to inexisting `ARCH` env var (Unstructured-IO#431) * build(deps): bump dependency versions (Unstructured-IO#434) ### Summary Bumps dependency versions for the API. Closes Unstructured-IO#432. * fix/Fix MS Office filetype errors and harden docker smoketest (Unstructured-IO#436) # Changes **Fix for docx and other office files returning `{"detail":"File type None is not supported."}`** After moving to the wolfi base image, the `mimetypes` lib no longer knows about these file extensions. To avoid issues like this, let's add an explicit mapping for all the file extensions we care about. I added a `filetypes.py` and moved `get_validated_mimetype` over. When this file is imported, we'll call `mimetypes.add_type` for all file extensions we support. **Update smoke test coverage** This bug snuck past because we were already providing the mimetype in the docker smoke test. I updated `test_happy_path` to test against the container with and without passing `content_type`. I added some missing filetypes, and sorted the test params by extension so we can see when new types are missing. # Testing The new smoke test will verify that all filetypes are working. You can also `make docker-build && make docker-start-api`, and test out the docx in the sample docs dir. On `main`, this file will give you the error above. ``` curl 'http://localhost:8000/general/v0/general' \ --form 'files=@"fake.docx"' ``` * merge main; validated format: xml, txv, csv, xml, json, html, docs, docx, ppt, pptx, xlsx, xls, pdf * compilable setting * update Windows markdown * Disable debug mode in `unstructuredio_api.spec` * Enable pdf test case in `test_app.py` --------- Co-authored-by: Christine Straub <[email protected]> Co-authored-by: Michał Martyniak <[email protected]> Co-authored-by: Matt Robinson <[email protected]> Co-authored-by: Austin Walker <[email protected]> Co-authored-by: tjtanaa <[email protected]>
- Loading branch information