feat: batch multiple files in a single Unstructured API request #4525

Merged

MthwRobinson
Contributor

Submit Multiple Files to the Unstructured API

Enables batching multiple files into a single Unstructured API request. Support for multi-file requests was added to both UnstructuredAPIFileLoader and UnstructuredAPIFileIOLoader. Note that if you submit multiple files in "single" mode, the results are concatenated into a single document, so we recommend using this feature in "elements" mode.
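To illustrate the difference between the two modes, here is a minimal sketch in plain Python. It does not call the Unstructured API or the loaders from this PR: `combine` is a hypothetical helper, the nested lists of strings stand in for the parsed elements returned for each file, and the blank-line separator is an assumption.

```python
from typing import List


def combine(elements_per_file: List[List[str]], mode: str = "single") -> List[str]:
    """Hypothetical stand-in for how a loader turns parsed elements into documents."""
    # Flatten the elements from every file in the batch into one list.
    flat = [el for elements in elements_per_file for el in elements]
    if mode == "elements":
        # One document per element.
        return flat
    if mode == "single":
        # Everything is joined into one document, regardless of source file.
        return ["\n\n".join(flat)]
    raise ValueError(f"unsupported mode: {mode!r}")


batch = [["Title of the paper", "Abstract text"], ["Chat line 1"]]
single_docs = combine(batch, mode="single")
element_docs = combine(batch, mode="elements")
```

With two files in the batch, "single" mode yields one combined document while "elements" mode keeps each element separate, which is why "elements" is the recommended mode for batches.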

Testing

The following should load both documents, using two of the example docs from the integration tests folder.

    from langchain.document_loaders import UnstructuredAPIFileLoader

    file_paths = ["examples/layout-parser-paper.pdf", "examples/whatsapp_chat.txt"]

    loader = UnstructuredAPIFileLoader(
        file_paths=file_paths,
        api_key="FAKE_API_KEY",
        strategy="fast",
        mode="elements",
    )
    docs = loader.load()

@dev2049 dev2049 requested a review from eyurtsev May 11, 2023 18:15
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
api_key: str = "",
files: Optional[List[IO]] = None,
file_filename: Optional[str] = "",
Contributor

could these be part of unstructured_kwargs?

Contributor

(as in file_filename and file_filenames)

Contributor Author

Yep those can just be unstructured_kwargs. I'll update that.

Contributor Author

Updated!
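As a rough sketch of what moving these options into unstructured_kwargs could look like, the hypothetical class below collects any extra keyword arguments instead of naming them in the signature. `SketchAPILoader` is made up for illustration and is not the loader from this PR; only the parameter names shown in the diff above are taken from the source.

```python
from typing import Any, Dict


class SketchAPILoader:
    """Hypothetical loader; not the real UnstructuredAPIFileLoader."""

    def __init__(
        self,
        mode: str = "single",
        url: str = "https://api.unstructured.io/general/v0/general",
        api_key: str = "",
        **unstructured_kwargs: Any,
    ) -> None:
        self.mode = mode
        self.url = url
        self.api_key = api_key
        # Options such as file_filename are not named in the signature;
        # they are collected here so they can be forwarded later.
        self.unstructured_kwargs: Dict[str, Any] = unstructured_kwargs


loader = SketchAPILoader(
    api_key="FAKE_API_KEY", file_filename="report.pdf", strategy="fast"
)
```

The signature stays short, and new API options can be passed through without changing the loader.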

mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
api_key: str = "",
file_paths: Optional[List[str]] = None,
Collaborator

Could we add the * operator?

Contributor Author

Updated file_path to a Union, so this kwarg doesn't exist anymore.
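For context on the * suggestion: in a Python signature, a bare * makes every parameter after it keyword-only. A minimal sketch, using a hypothetical `load` function rather than the loader's actual signature:

```python
def load(file_path: str, *, mode: str = "single", api_key: str = "") -> dict:
    """Hypothetical function: arguments after the bare * are keyword-only."""
    return {"file_path": file_path, "mode": mode, "api_key": api_key}


ok = load("examples/whatsapp_chat.txt", mode="elements")

try:
    # Passing mode positionally is rejected because of the bare *.
    load("examples/whatsapp_chat.txt", "elements")
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Keyword-only parameters let optional arguments be added or reordered later without breaking positional callers.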

class UnstructuredAPIFileLoader(UnstructuredFileLoader):
    """Loader that uses the unstructured web API to load files."""

    def __init__(
        self,
-       file_path: str,
+       file_path: str = "",
Collaborator

would it make sense to union this field instead?

file_path: Union[str, Sequence[str]]

and ideally

FileLike = Union[str, Path, io.BytesIO]
file_path: Union[FileLike, Sequence[FileLike]]

Contributor Author

Made file_path and file both Unions.
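One way a Union-typed file_path could be normalized internally is sketched below. The `PathLike` alias and `as_path_list` helper are hypothetical and not taken from the PR. The point to watch is that a str is itself a Sequence[str], so the single-path case has to be checked first.

```python
from pathlib import Path
from typing import List, Sequence, Union

PathLike = Union[str, Path]  # hypothetical alias for this sketch


def as_path_list(file_path: Union[PathLike, Sequence[PathLike]]) -> List[str]:
    """Return path strings for either a single path or a sequence of paths."""
    # A str is itself a Sequence[str], so test for the single-path case first;
    # otherwise "a.pdf" would be split into individual characters.
    if isinstance(file_path, (str, Path)):
        return [str(file_path)]
    return [str(p) for p in file_path]


one = as_path_list("examples/layout-parser-paper.pdf")
many = as_path_list(["examples/layout-parser-paper.pdf", "examples/whatsapp_chat.txt"])
```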

mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
api_key: str = "",
files: Optional[List[IO]] = None,
Collaborator

Favor Sequence instead of List on inputs

Contributor Author

Updated this to Sequence.
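A short sketch of why Sequence is the friendlier annotation for inputs: a function that only reads from its argument can accept lists and tuples alike. `count_files` is a hypothetical helper, not code from the PR.

```python
import io
from typing import IO, List, Sequence


def count_files(files: Sequence[IO[bytes]]) -> int:
    """Hypothetical helper that only reads its input, so Sequence suffices."""
    return len(files)


as_list: List[IO[bytes]] = [io.BytesIO(b"a"), io.BytesIO(b"b")]
as_tuple = (io.BytesIO(b"a"), io.BytesIO(b"b"))

# Both a list and a tuple satisfy Sequence; with a List annotation,
# a static type checker would reject the tuple.
list_count = count_files(as_list)
tuple_count = count_files(as_tuple)
```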


def test_unstructured_api_file_loader() -> None:
    """Test unstructured loader."""
    file_path = str(Path(__file__).parent.parent / "examples/layout-parser-paper.pdf")
Collaborator

It might make sense to lift the path into the global namespace. The code that finds the file path is duplicated, which will make it more difficult to refactor the locations of data fixtures.

Contributor Author

Yeah, that makes sense to me. I'll make that update.
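A sketch of the suggested refactor, assuming a module-level constant shared by all tests. The names `examples_directory` and `EXAMPLE_DOCS_DIRECTORY` and the module path passed in are illustrative assumptions, not taken from the diff.

```python
from pathlib import Path


def examples_directory(test_module_file: str) -> Path:
    """Hypothetical helper: the examples folder one level above the test module's folder."""
    return Path(test_module_file).parent.parent / "examples"


# In a real test module this would be computed once at import time, e.g.
#     EXAMPLE_DOCS_DIRECTORY = examples_directory(__file__)
# A made-up module path is used here so the sketch runs anywhere.
EXAMPLE_DOCS_DIRECTORY = examples_directory("tests/document_loaders/test_unstructured.py")

# Each test builds its fixture path from the shared constant.
pdf_path = str(EXAMPLE_DOCS_DIRECTORY / "layout-parser-paper.pdf")
chat_path = str(EXAMPLE_DOCS_DIRECTORY / "whatsapp_chat.txt")
```

If the fixtures move, only the constant needs to change.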

@dev2049 dev2049 requested a review from eyurtsev May 17, 2023 22:29
@hwchase17 hwchase17 merged commit bf3f554 into langchain-ai:master May 22, 2023
@danielchalef danielchalef mentioned this pull request Jun 5, 2023