feat: batch multiple files in a single Unstructured API request #4525
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
api_key: str = "",
files: Optional[List[IO]] = None,
file_filename: Optional[str] = "",
Could these be part of unstructured_kwargs?
(as in file_filename and file_filenames)
Yep, those can just be unstructured_kwargs. I'll update that.
Updated!
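To illustrate the suggestion above, here is a minimal sketch of how optional settings such as file_filename could travel through `**unstructured_kwargs` instead of being named parameters. `SketchLoader` is a hypothetical stand-in written for this example, not the actual LangChain class, and the real loader may handle these values differently.

```python
from typing import Any, Dict


class SketchLoader:
    """Hypothetical stand-in for the loader; not the real LangChain class."""

    def __init__(
        self,
        file_path: str = "",
        mode: str = "single",
        url: str = "https://api.unstructured.io/general/v0/general",
        api_key: str = "",
        **unstructured_kwargs: Any,
    ) -> None:
        self.file_path = file_path
        self.mode = mode
        self.url = url
        self.api_key = api_key
        # Anything not named in the signature is collected here unchanged,
        # so new options need no change to the constructor.
        self.unstructured_kwargs: Dict[str, Any] = unstructured_kwargs


loader = SketchLoader(file_path="report.pdf", file_filename="report.pdf")
print(loader.unstructured_kwargs)
```

The trade-off is that the signature stays short, but misspelled option names are no longer caught by the constructor.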
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
api_key: str = "",
file_paths: Optional[List[str]] = None,
Could we add the `*` operator?
Updated file_path to a Union, so this kwarg doesn't exist anymore.
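The review comment does not say what the `*` is for. Assuming it refers to Python's bare `*` separator, which makes every parameter after it keyword-only, the idea can be sketched as follows. `build_request_options` is a hypothetical helper invented for this illustration.

```python
from typing import Any, Dict


def build_request_options(
    file_path: str = "",
    *,  # everything after this marker must be passed by keyword
    mode: str = "single",
    api_key: str = "",
) -> Dict[str, Any]:
    """Hypothetical helper showing keyword-only parameters."""
    return {"file_path": file_path, "mode": mode, "api_key": api_key}


# Keyword use works as expected.
options = build_request_options("example.pdf", mode="elements")

# Positional use of a keyword-only parameter raises TypeError.
try:
    build_request_options("example.pdf", "elements")
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Keyword-only parameters let a signature gain or lose options later without silently shifting the meaning of positional arguments at existing call sites.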
class UnstructuredAPIFileLoader(UnstructuredFileLoader):
    """Loader that uses the unstructured web API to load files."""

    def __init__(
        self,
        file_path: str,
        file_path: str = "",
Would it make sense to union this field instead? file_path: Union[str, Sequence[str]], and ideally:
FileLike: Union[str, Path, bytes.IO]
file_path: Union[FileLike, Sequence[FileLike]]
Made file_path and file both Unions.
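A parameter that accepts either one path or several usually needs a normalization step before the request is built. The sketch below shows one way to do that. `PathLike` and `normalize_paths` are hypothetical names chosen for this example, and the merged loader may implement this differently.

```python
from pathlib import Path
from typing import List, Sequence, Union

# Hypothetical alias for this sketch: a single path given as str or Path.
PathLike = Union[str, Path]


def normalize_paths(file_path: Union[PathLike, Sequence[PathLike]]) -> List[str]:
    """Return a list of string paths for either a single path or a sequence."""
    # A str is itself a sequence of characters, so single values must be
    # checked first; otherwise "a.pdf" would be split into letters.
    if isinstance(file_path, (str, Path)):
        return [str(file_path)]
    return [str(p) for p in file_path]


single = normalize_paths("a.pdf")
several = normalize_paths(["a.pdf", Path("b.txt")])
```

Callers that already pass a single string keep working, while new callers can pass a list or tuple.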
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
api_key: str = "",
files: Optional[List[IO]] = None,
Favor Sequence instead of List on inputs.
Updated this to Sequence.
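The reason for preferring `Sequence` on inputs is that it is read-only and accepts any ordered container, so callers are not forced to build a `list`. A small sketch follows. `read_all` is a hypothetical function for illustration, not part of the loader.

```python
import io
from typing import IO, List, Sequence


def read_all(files: Sequence[IO[bytes]]) -> List[bytes]:
    """Read every file object; the input is only iterated, never mutated."""
    return [f.read() for f in files]


# Both a list and a tuple satisfy Sequence[IO[bytes]].
from_list = read_all([io.BytesIO(b"first")])
from_tuple = read_all((io.BytesIO(b"first"), io.BytesIO(b"second")))
```

Returning a concrete `List` while accepting an abstract `Sequence` keeps the function permissive about what it takes and specific about what it gives back.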

def test_unstructured_api_file_loader() -> None:
    """Test unstructured loader."""
    file_path = str(Path(__file__).parent.parent / "examples/layout-parser-paper.pdf")
It might make sense to lift the path into the global namespace: the code that finds the file path is duplicated, which will make it more difficult to refactor the locations of data fixtures.
Yeah that makes sense to me, I'll make that update
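One way to apply this suggestion is a module-level constant plus a small helper, so the fixture location is defined once. The names `EXAMPLE_DOCS_DIRECTORY` and `example_path` are assumptions made for this sketch; the directory layout mirrors the path expression in the test above.

```python
from pathlib import Path

# Single place to edit if the data fixtures move. __file__ is defined when
# this code runs as part of a test module or script, not in a bare REPL.
EXAMPLE_DOCS_DIRECTORY = Path(__file__).parent.parent / "examples"


def example_path(name: str) -> str:
    """Hypothetical helper: build the path to a fixture by file name."""
    return str(EXAMPLE_DOCS_DIRECTORY / name)


pdf_path = example_path("layout-parser-paper.pdf")
```

Each test can then call `example_path(...)` instead of repeating the `Path(__file__)` expression.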
Submit Multiple Files to the Unstructured API
Enables batching multiple files into a single Unstructured API request. Support for requests with multiple files was added to both UnstructuredAPIFileLoader and UnstructuredAPIFileIOLoader. Note that if you submit multiple files in "single" mode, the results will be concatenated into a single document. We recommend using this feature in "elements" mode.
Testing
The following should load both documents, using two of the example docs from the integration tests folder.