-
Notifications
You must be signed in to change notification settings - Fork 15.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community: DocumentLoaderAsParser wrapper #27749
base: master
Are you sure you want to change the base?
community: DocumentLoaderAsParser wrapper #27749
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
…ed for extended tests. despite @pytest.mark.requires("unstructured")
…neDriveLoader (#27716) ## What this PR does? ### Currently `O365BaseLoader` (and consequently both derived loaders) are limited to `pdf`, `doc`, `docx` files. - **Solution: here we introduce _handlers_ attribute that allows for custom handlers to be passed in. This is done in _dict_ form:** **Example:** ```python from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser # PR for DocumentLoaderAsParser here: #27749 from langchain_community.document_loaders.excel import UnstructuredExcelLoader xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged") # create dictionary mapping file types to handlers (parsers) handlers = { "doc": MsWordParser() "pdf": PDFMinerParser() "txt": TextParser() "xlsx": xlsx_parser } loader = SharePointLoader(document_library_id="...", handlers=handlers # pass handlers to SharePointLoader ) documents = loader.load() # works the same in OneDriveLoader loader = OneDriveLoader(document_library_id="...", handlers=handlers ) ``` This dictionary is then passed to `MimeTypeBasedParser` same as in the [current implementation](https://github.com/langchain-ai/langchain/blob/5a2cfb49e045988d290a1c7e3a0c589d6b371694/libs/community/langchain_community/document_loaders/parsers/registry.py#L13). ### Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader` However both of these implement the same functionality. The only differences are: - `SharePointLoader` requires argument `document_library_id` whereas `OneDriveLoader` requires `drive_id`. These are just different names for the same thing. - `SharePointLoader` implements significantly more features. - **Solution: `OneDriveLoader` is replaced with an empty shell just renaming `drive_id` to `document_library_id` and inheriting from `SharePointLoader`** **Dependencies:** None **Twitter handle:** @martintriska1 If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
I've written down am extra explanation of why one might need to mutate Document Loader into parser. If you're unsure about that, please keep reading:
So the idea is to be able to do:
Hope this makes sense. |
…neDriveLoader (langchain-ai#27716) ## What this PR does? ### Currently `O365BaseLoader` (and consequently both derived loaders) are limited to `pdf`, `doc`, `docx` files. - **Solution: here we introduce _handlers_ attribute that allows for custom handlers to be passed in. This is done in _dict_ form:** **Example:** ```python from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser # PR for DocumentLoaderAsParser here: langchain-ai#27749 from langchain_community.document_loaders.excel import UnstructuredExcelLoader xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged") # create dictionary mapping file types to handlers (parsers) handlers = { "doc": MsWordParser() "pdf": PDFMinerParser() "txt": TextParser() "xlsx": xlsx_parser } loader = SharePointLoader(document_library_id="...", handlers=handlers # pass handlers to SharePointLoader ) documents = loader.load() # works the same in OneDriveLoader loader = OneDriveLoader(document_library_id="...", handlers=handlers ) ``` This dictionary is then passed to `MimeTypeBasedParser` same as in the [current implementation](https://github.com/langchain-ai/langchain/blob/5a2cfb49e045988d290a1c7e3a0c589d6b371694/libs/community/langchain_community/document_loaders/parsers/registry.py#L13). ### Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader` However both of these implement the same functionality. The only differences are: - `SharePointLoader` requires argument `document_library_id` whereas `OneDriveLoader` requires `drive_id`. These are just different names for the same thing. - `SharePointLoader` implements significantly more features. - **Solution: `OneDriveLoader` is replaced with an empty shell just renaming `drive_id` to `document_library_id` and inheriting from `SharePointLoader`** **Dependencies:** None **Twitter handle:** @martintriska1 If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
@vbarda Please take a look :) |
""" | ||
Use underlying DocumentLoader to lazily parse the blob. | ||
""" | ||
doc_loader = self.DocumentLoaderClass( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just create this on init?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because on init, you don't know the file_path
. The file path is pulled from blob
when lazy_parse
is called. This is consistent with all other parsers.
"can be morphed into a parser." | ||
) | ||
|
||
def lazy_parse(self, blob: Blob) -> Iterator[Document]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should add parse
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parse
method is created inside BaseBlobParser
here. calling .parse()
will end up calling list()
on .lazy_parse()
|
||
# Ensure the document loader class has a `file_path` parameter | ||
init_signature = inspect.signature(document_loader_class.__init__) | ||
if "file_path" not in init_signature.parameters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a sufficient condition for being able to convert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the only condition we're adding. Of course any number of problems can arise if the provided document loader class cannot load provided file, however that would be handled inside of given document loader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies for the delay -- looks good high level, added some questions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this.
- Can we add tests?
- Thinking through alternatives, does it make sense to add a "sub" document loader class as an optional attribute of
O365BaseLoader
orSharePointLoader
? Introducing a new abstraction brings some risk if we don't get the interface right (or it's not as generic / extendable as we thought). - If we were to introduce a new class or method, consider decorating it with a
@beta
parameter if we're not 100% on the API (see example here).
Regarding unstructured: I would favor use of the langchain-unstructured
package over the integrations in community. The community integrations are older and could be deprecated but we have not put in the work to verify that we don't lose functionality in langchain-unstructured
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answered code comments. Please take a look
""" | ||
Use underlying DocumentLoader to lazily parse the blob. | ||
""" | ||
doc_loader = self.DocumentLoaderClass( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because on init, you don't know the file_path
. The file path is pulled from blob
when lazy_parse
is called. This is consistent with all other parsers.
|
||
# Ensure the document loader class has a `file_path` parameter | ||
init_signature = inspect.signature(document_loader_class.__init__) | ||
if "file_path" not in init_signature.parameters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the only condition we're adding. Of course any number of problems can arise if the provided document loader class cannot load provided file, however that would be handled inside of given document loader.
"can be morphed into a parser." | ||
) | ||
|
||
def lazy_parse(self, blob: Blob) -> Iterator[Document]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parse
method is created inside BaseBlobParser
here. calling .parse()
will end up calling list()
on .lazy_parse()
@ccurme Please take a look. Thanks!
I've had tests in previous commits that I had rolled back. There is a problem that
This was my original thought. However after spending some time on this issue, I found this approach much better. Updating O365BaseLoader to accept sub-loaders would fix the issue for O365. The same issue will surface in other loaders as well. The conceptual problem is that we're missing parsers. This class would take care of the root problem.
I can add
|
Description
This pull request introduces the
DocumentLoaderAsParser
class, which acts as an adapter to transform document loaders into parsers within the LangChain framework. The class enables document loaders that accept afile_path
parameter to be utilized as blob parsers. This is particularly useful for integrating various document loading capabilities seamlessly into the LangChain ecosystem.When merged in together with PR #27716 It opens options for
SharePointLoader
/OneDriveLoader
to process any filetype that has a document loader.Features
DocumentLoaderAsParser
class can adapt any document loader that meets the criteria of accepting afile_path
argument, allowing for lazy parsing of documents.Usage Example
To use the
DocumentLoaderAsParser
, you would initialize it with a suitable document loader class and any required parameters. Here’s an example of how to do this with theUnstructuredExcelLoader
:If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.