community: Allow other than default parsers in SharePointLoader and OneDriveLoader #27716
…ub.com/MacanPN/langchain into triska/SharePoint-allow_custom_parsers
Thanks for your contribution!
Looks good overall, but my biggest question is: can't we just allow a single optional MimeTypeBasedParser
as an input to this document loader? I think it could simplify a lot of this logic. The user could then do something like:
blob_parser = MimeTypeBasedParser({
    "application/msword": MsWordParser(),
    "application/pdf": PDFMinerParser(),
    "audio/mpeg": OpenAIWhisperParser()
})
loader = OneDriveLoader(..., blob_parser=blob_parser)
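For readers unfamiliar with the idea behind this suggestion, here is a minimal, library-free sketch of dispatching on MIME type. It is not LangChain's `MimeTypeBasedParser`; the class `MimeDispatchParser`, the `Parser` alias, and the two stub parsers are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a blob parser: raw bytes in, extracted text out.
Parser = Callable[[bytes], str]


class MimeDispatchParser:
    """Toy dispatcher that picks a parser by MIME type (illustration only)."""

    def __init__(
        self, handlers: Dict[str, Parser], fallback: Optional[Parser] = None
    ) -> None:
        self.handlers = dict(handlers)
        self.fallback = fallback

    def parse(self, data: bytes, mime_type: str) -> str:
        # Use the registered parser, or the fallback if one was given.
        parser = self.handlers.get(mime_type, self.fallback)
        if parser is None:
            raise ValueError(f"Unsupported MIME type: {mime_type}")
        return parser(data)


def parse_text(data: bytes) -> str:
    return data.decode("utf-8")


def parse_pdf_stub(data: bytes) -> str:
    # Placeholder: a real parser would extract text from the PDF bytes.
    return f"<pdf: {len(data)} bytes>"


dispatcher = MimeDispatchParser(
    {"text/plain": parse_text, "application/pdf": parse_pdf_stub}
)
```

Calling `dispatcher.parse(b"hello", "text/plain")` routes the bytes to `parse_text`; a MIME type with no registered handler and no fallback raises `ValueError`.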
libs/community/langchain_community/document_loaders/onedrive.py
This is a good point. I've been thinking about this but consider:
Please also consider that one could not simply pass in general |
In general the extra logic implemented in this PR is not about instantiation of |
libs/community/langchain_community/document_loaders/base_o365.py
Makes sense, thanks for the follow up! Approved pending two small nits. IMO this is the key argument in favor of this change:
@vbarda Minor nitpicks addressed. Are we good to merge?
@vbarda I can't merge it myself. Can you please merge the PR?
…neDriveLoader (langchain-ai#27716)

## What this PR does?

### Currently `O365BaseLoader` (and consequently both derived loaders) is limited to `pdf`, `doc`, `docx` files.

- **Solution: here we introduce a _handlers_ attribute that allows custom handlers to be passed in. This is done in _dict_ form:**

**Example:**

```python
from langchain_community.document_loaders import OneDriveLoader, SharePointLoader
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
# PR for DocumentLoaderAsParser here: langchain-ai#27749
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_community.document_loaders.parsers.txt import TextParser

xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "txt": TextParser(),
    "xlsx": xlsx_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()

# works the same in OneDriveLoader, which takes drive_id (see below)
loader = OneDriveLoader(
    drive_id="...",
    handlers=handlers,
)
```

This dictionary is then passed to `MimeTypeBasedParser`, same as in the [current implementation](https://github.com/langchain-ai/langchain/blob/5a2cfb49e045988d290a1c7e3a0c589d6b371694/libs/community/langchain_community/document_loaders/parsers/registry.py#L13).

### Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader`

However, both of these implement the same functionality. The only differences are:

- `SharePointLoader` requires argument `document_library_id` whereas `OneDriveLoader` requires `drive_id`. These are just different names for the same thing.
- `SharePointLoader` implements significantly more features.
- **Solution: `OneDriveLoader` is replaced with an empty shell just renaming `drive_id` to `document_library_id` and inheriting from `SharePointLoader`**

**Dependencies:** None

**Twitter handle:** @martintriska1

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
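The two changes described above can be sketched without LangChain as follows. This is only an illustration under stated assumptions, not the merged implementation: `SharePointLoaderSketch`, `OneDriveLoaderSketch`, `to_mime_handlers`, and the `EXTENSION_TO_MIME` table are hypothetical names, and the real loader's extension-to-MIME mapping may differ.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a blob parser: raw bytes in, extracted text out.
Parser = Callable[[bytes], str]

# Assumed extension -> MIME type table; the real loader's table may differ.
EXTENSION_TO_MIME: Dict[str, str] = {
    "doc": "application/msword",
    "pdf": "application/pdf",
    "txt": "text/plain",
}


def to_mime_handlers(handlers: Dict[str, Parser]) -> Dict[str, Parser]:
    """Re-key an extension-keyed handlers dict by MIME type."""
    result: Dict[str, Parser] = {}
    for extension, parser in handlers.items():
        key = extension.lower().lstrip(".")
        if key not in EXTENSION_TO_MIME:
            raise ValueError(f"Unknown file type: {extension!r}")
        result[EXTENSION_TO_MIME[key]] = parser
    return result


class SharePointLoaderSketch:
    """Hypothetical stand-in for SharePointLoader; not the real class."""

    def __init__(self, document_library_id: str, handlers: Dict[str, Parser]) -> None:
        self.document_library_id = document_library_id
        # Stored by MIME type, ready to hand to a MIME-based dispatcher.
        self.mime_handlers = to_mime_handlers(handlers)


class OneDriveLoaderSketch(SharePointLoaderSketch):
    """Thin shell: accepts drive_id and forwards it as document_library_id."""

    def __init__(self, drive_id: str, **kwargs: Any) -> None:
        super().__init__(document_library_id=drive_id, **kwargs)


def decode_text(data: bytes) -> str:
    return data.decode("utf-8")


loader = OneDriveLoaderSketch(drive_id="drive-123", handlers={"txt": decode_text})
```

In this sketch the subclass adds no behavior of its own; everything beyond the argument rename is inherited, which is the point of the "empty shell" design.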
Thanks for your contribution! Looks very good.