IndexError: list index out of range while extracting images from pdf? #432

Closed
alimoezzi opened this issue Jun 23, 2024 · 4 comments · Fixed by #434

Comments

@alimoezzi

I'm using the latest Docker image along with the latest JS client to partition a PDF and extract images.
When I include extractImageBlockTypes: ['Image'] in the partition parameters, the whole partitioning fails with the following error in the logs:

2024-06-23 13:50:11,086 127.0.0.1:60040 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2024-06-23 13:50:11,087 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/prepline_general/api/general.py", line 850, in general_partition
    list(response_generator(is_multipart=False))[0]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/prepline_general/api/general.py", line 785, in response_generator
    response = pipeline_api(
               ^^^^^^^^^^^^^
  File "/home/notebook-user/prepline_general/api/general.py", line 440, in pipeline_api
    elements = partition_pdf_splits(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/prepline_general/api/general.py", line 220, in partition_pdf_splits
    return partition(
           ^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/auto.py", line 426, in partition
    elements = _partition_pdf(
               ^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 192, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 288, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 676, in _partition_pdf_or_image_local
    save_elements(
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py", line 195, in save_elements
    image_path = image_paths[page_number - 1]
                 ~~~~~~~~~~~^^^^^^^^^^^^^^^^^
IndexError: list index out of range
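
The last frame indexes `image_paths` with `page_number - 1`, so the lookup can only fail when an element reports a page number larger than the number of page images that were produced. Below is a minimal Python sketch of that failure mode next to a bounds-checked lookup. It assumes (the traceback does not confirm this) that the page numbers and the image list got out of step; `lookup_image_path` and the sample data are hypothetical and are not part of `unstructured`.

```python
from typing import Optional, Sequence


def lookup_image_path(image_paths: Sequence[str], page_number: int) -> Optional[str]:
    """Return the image path for a 1-based page number, or None if out of range."""
    index = page_number - 1
    if 0 <= index < len(image_paths):
        return image_paths[index]
    return None


# Hypothetical data: one rendered page image, but an element claiming page 3.
image_paths = ["page-1.jpg"]

try:
    image_paths[3 - 1]  # same shape as the failing line in the traceback
except IndexError as exc:
    error_message = str(exc)

# The guarded lookup returns None instead of raising.
safe_result = lookup_image_path(image_paths, 3)
```

This only illustrates why the index can run past the end of the list; it says nothing about how the library actually resolved it.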

The client call looks like this:

const { elements } = await client.general.partition({
  partitionParameters: {
    files: {
      fileName: filename,
      content: data,
    },
    strategy: Strategy.Auto,
    skipInferTableTypes: ['jpg+png'],
    extractImageBlockTypes: ['Image'],
  },
});

@awalker4
Collaborator

Hi there, we recently identified a bug with extractImageBlockTypes which was fixed in the core library here: Unstructured-IO/unstructured#3246

We'll just need to bump the library version in the api requirements which should be quick.
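
For anyone running their own image, a rough way to check which `unstructured` version is installed inside the container is sketched below in Python. The thread does not state which release contains the fix, so the required version is left as a parameter; the helper names (`parse_version`, `installed_version`, `is_at_least`) are hypothetical, and the parsing is deliberately simple (numeric dotted parts only).

```python
import itertools
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.14.6' into (0, 14, 6); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        leading = "".join(itertools.takewhile(str.isdigit, piece))
        if not leading:
            break
        parts.append(int(leading))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(installed: str, required: str) -> bool:
    """Compare two dotted versions numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

Usage would be along the lines of `is_at_least(installed_version("unstructured") or "0", required)`, where `required` is whatever version the fix PR's release notes name.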

@awalker4
Collaborator

cc @MthwRobinson can you get a version bump going?

@MthwRobinson
Contributor

Will do!

@MthwRobinson
Contributor

@awalker4 - Version bumps are in #434!

tjtanaa added a commit to EmbeddedLLM/unstructured-api-executable that referenced this issue Jul 11, 2024
…indows (#2)

## PR Summary
1. Merged changes from upstream.
2. Update `unstructuredio_api.spec`.
3. Update `unstructuredio_api.py`.
4. Add additional setup dependencies to the `docs/Windows.md`. 

* build(deps): version bumps for maintenance (Unstructured-IO#424)

### Summary
Version bumps for regular maintenance and to address moderate CVEs from
security scans.
- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`

* build: replace rockylinux with chainguard/wolfi as a base image (Unstructured-IO#423)

### Summary
Updates the Dockerfile to use the Chainguard wolfi-base image to reduce
CVEs. Also adds a step in the docker publish job that scans the images
and checks for CVEs before publishing.

### Testing
Run `make docker-build` and  `make docker-start-api`, then try:
```
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename=filename,
    api_url="http://localhost:8000/general/v0/general",
    api_key="<API-KEY>",
    strategy="hi_res",
)

print("\n\n".join([str(el) for el in elements]))
```

* fix: build and push workflow failing due to missing `-f` option `buildx build` command (Unstructured-IO#425)

I noticed that images on main branch are failing to build (and push) due
to missing `-f` parameter in `docker buildx build`. By default it
expects `Dockerfile` to exist, but we only have `Dockerfile-amd64` and
`Dockerfile-arm64`


![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/4527165a-909e-498d-b0ee-8bba4b1a13e4)

---------

Co-authored-by: christinestraub

* fix: update SHA for the base images (both architectures) after `base-images` repo update (Unstructured-IO#427)

build and publish CI steps are failing, because the base images have
changed in quay (their SHAs)

![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fc4e9aac-0820-4c90-9ad9-68cc6d9aad03)


![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fafe2ca4-dab2-4610-a26b-a7a4d56723a5)

* fix: revert to rockylinux SHA that works (arm64) (Unstructured-IO#428)

unnecessary SHA update introduced in
Unstructured-IO#427 that needs
to be reverted

* fix: re-add `DOCKER_IMAGE` env var in `Test image` step (Unstructured-IO#429)

shell syntax error occurs in docker-publish.yml workflow

* fix: invalid env var setting in `docker-publish` workflow (Unstructured-IO#430)

bug introduced in previous PR causing build failure on main

* fix: `docker-publish` workflow failing on main due to inexisting `ARCH` env var (Unstructured-IO#431)

* build(deps): bump dependency versions (Unstructured-IO#434)

### Summary

Bumps dependency versions for the API. Closes Unstructured-IO#432.

* fix/Fix MS Office filetype errors and harden docker smoketest (Unstructured-IO#436)

# Changes
**Fix for docx and other office files returning `{"detail":"File type
None is not supported."}`**
After moving to the wolfi base image, the `mimetypes` lib no longer
knows about these file extensions. To avoid issues like this, let's add
an explicit mapping for all the file extensions we care about. I added a
`filetypes.py` and moved `get_validated_mimetype` over. When this file
is imported, we'll call `mimetypes.add_type` for all file extensions we
support.
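
The explicit mapping described above might look roughly like the following Python sketch. `mimetypes.add_type` and `mimetypes.guess_type` are standard-library calls; the dictionary name, the `register_mimetypes` helper, and the three-extension subset are assumptions for illustration and are not taken from the actual `filetypes.py`.

```python
import mimetypes

# Illustrative subset only; the real module would list every supported extension.
EXTENSION_TO_MIMETYPE = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def register_mimetypes(mapping=EXTENSION_TO_MIMETYPE):
    """Register each extension so lookups do not depend on the base image's mime database."""
    for extension, mimetype in mapping.items():
        mimetypes.add_type(mimetype, extension)


# Running this at import time makes the mapping available to every later lookup.
register_mimetypes()

docx_type, _ = mimetypes.guess_type("fake.docx")
```

Registering at import time means the result no longer depends on which system mime files the container happens to ship.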

**Update smoke test coverage**
This bug snuck past because we were already providing the mimetype in
the docker smoke test. I updated `test_happy_path` to test against the
container with and without passing `content_type`. I added some missing
filetypes, and sorted the test params by extension so we can see when
new types are missing.

# Testing
The new smoke test will verify that all filetypes are working. You can
also `make docker-build && make docker-start-api`, and test out the docx
in the sample docs dir. On `main`, this file will give you the error
above.
```
curl 'http://localhost:8000/general/v0/general' \
--form 'files=@"fake.docx"'
```

* merge main; validated format: xml, txv, csv, xml, json, html, docs, docx, ppt, pptx, xlsx, xls, pdf

* compilable setting

* update Windows markdown

* Disable debug mode in `unstructuredio_api.spec`

* Enable pdf test case in `test_app.py`

---------

Co-authored-by: Christine Straub
Co-authored-by: Michał Martyniak
Co-authored-by: Matt Robinson
Co-authored-by: Austin Walker
Co-authored-by: tjtanaa