community: Allow other than default parsers in SharePointLoader and OneDriveLoader #27716

MacanPN · 2024-10-29T17:02:17Z

What this PR does?

Currently `O365BaseLoader` (and consequently both derived loaders) are limited to `pdf`, `doc`, `docx` files.

Solution: this PR introduces a `handlers` attribute that lets custom handlers (parsers) be passed in as a dict mapping file types to parsers:

Example:

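from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
# PR for DocumentLoaderAsParser here: https://github.com/langchain-ai/langchain/pull/27749
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_community.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders.sharepoint import SharePointLoader
from langchain_community.document_loaders.onedrive import OneDriveLoader

# wrap a document loader so it can be used as a parser
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "txt": TextParser(),
    "xlsx": xlsx_parser,
}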
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()

# works the same in OneDriveLoader, which takes drive_id instead of document_library_id
loader = OneDriveLoader(
    drive_id="...",
    handlers=handlers,
)

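This dictionary is then passed to `MimeTypeBasedParser`, the same as in the current implementation (https://github.com/langchain-ai/langchain/blob/5a2cfb49e045988d290a1c7e3a0c589d6b371694/libs/community/langchain_community/document_loaders/parsers/registry.py#L13).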
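As a rough illustration of that step, the sketch below turns a handlers dict whose keys may be file extensions or mime types into a dict keyed purely by mime type, which is the shape a mime-type-based dispatcher expects. The helper name `normalize_handlers` and the small `EXTENSION_TO_MIME` table are hypothetical names made up for this example, not the code in this PR, and plain strings stand in for parser objects.

```python
from typing import Any, Dict

# Hypothetical lookup table for this sketch only; the real loader may use a
# different (and larger) mapping.
EXTENSION_TO_MIME: Dict[str, str] = {
    "doc": "application/msword",
    "docx": (
        "application/vnd.openxmlformats-officedocument"
        ".wordprocessingml.document"
    ),
    "pdf": "application/pdf",
    "txt": "text/plain",
    "xlsx": (
        "application/vnd.openxmlformats-officedocument"
        ".spreadsheetml.sheet"
    ),
}


def normalize_handlers(handlers: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of ``handlers`` keyed by mime type.

    Keys containing "/" are treated as mime types and kept as-is; any other
    key is treated as a file extension, with or without a leading dot.
    """
    normalized: Dict[str, Any] = {}
    for key, parser in handlers.items():
        if "/" in key:
            mime_type = key
        else:
            extension = key.lower().lstrip(".")
            if extension not in EXTENSION_TO_MIME:
                raise ValueError(f"Unknown file extension: {key!r}")
            mime_type = EXTENSION_TO_MIME[extension]
        normalized[mime_type] = parser
    return normalized


# Strings stand in for parser instances here.
by_mime = normalize_handlers(
    {"pdf": "pdf-parser", ".xlsx": "xlsx-parser", "audio/mpeg": "audio-parser"}
)
```

Raising on an unknown extension is a design choice of this sketch; a loader could equally fall back to a default parser.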

Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader`.

However, both implement the same functionality. The only differences are:

- `SharePointLoader` requires the argument `document_library_id`, whereas `OneDriveLoader` requires `drive_id`. These are just different names for the same thing.
- `SharePointLoader` implements significantly more features.

Solution: `OneDriveLoader` is replaced with an empty shell that just renames `drive_id` to `document_library_id` and inherits from `SharePointLoader`.
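The "empty shell" idea can be sketched with two stand-in classes. These are hypothetical stand-ins written for this example (`SharePointLoaderSketch`, `OneDriveLoaderSketch`), not the real LangChain loaders, and they only show the argument renaming, assuming the parent takes `document_library_id` as a keyword argument.

```python
from typing import Any


class SharePointLoaderSketch:
    """Stand-in for the feature-rich loader; not the real SharePointLoader."""

    def __init__(self, document_library_id: str, **kwargs: Any) -> None:
        self.document_library_id = document_library_id
        # Remaining options (e.g. handlers) are simply stored in this sketch.
        self.options = kwargs


class OneDriveLoaderSketch(SharePointLoaderSketch):
    """Thin shell: accepts ``drive_id`` and forwards it under the parent's name."""

    def __init__(self, drive_id: str, **kwargs: Any) -> None:
        super().__init__(document_library_id=drive_id, **kwargs)
        self.drive_id = drive_id


loader = OneDriveLoaderSketch(drive_id="drive-123", handlers={"pdf": "pdf-parser"})
```

With this shape, every feature added to the parent becomes available to the OneDrive variant without duplicated code.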

Dependencies: None
Twitter handle: @martintriska1

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

vbarda

Thanks for your contribution!

Looks good overall, but my biggest question is: can't we just allow a single optional MimeTypeBasedParser as an input to this document loader? I think it could simplify a lot of this logic. The user could then do something like:

blob_parser = MimeTypeBasedParser({
    "application/msword": MsWordParser(),
    "application/pdf": PDFMinerParser(),
    "audio/mpeg": OpenAIWhisperParser()
})
loader = OneDriveLoader(..., blob_parser=blob_parser)

MacanPN · 2024-11-04T09:55:05Z

> Thanks for your contribution!
>
> Looks good overall, but my biggest question is: can't we just allow a single optional MimeTypeBasedParser as an input to this document loader? I think it could simplify a lot of this logic. The user could then do something like:
>
> blob_parser = MimeTypeBasedParser({
>     "application/msword": MsWordParser(),
>     "application/pdf": PDFMinerParser(),
>     "audio/mpeg": OpenAIWhisperParser()
> })
> loader = OneDriveLoader(..., blob_parser=blob_parser)

This is a good point. I've been thinking about this but consider:

- We still need to implement functions for file extension <-> mime type conversions.
- We still need to set up the `_file_types` and `_mime_types` properties (for interface compatibility).
- We still need to provide a default parser.

So we wouldn't really simplify the logic that much. What this implementation adds on top is the option to pass in file types (file extensions) instead of mime types, which I think is a real quality-of-life improvement. Some mime types are not as straightforward as the ones in the example above, such as:

- docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- xlsx: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Please also consider that one could not simply pass in a general `BaseBlobParser`; it would have to be a `MimeTypeBasedParser`, because we look at the handlers to determine whether a file should be downloaded and parsed or not.
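To make the last point concrete, here is a minimal sketch of that filtering step: because the handler keys can be inspected, files without a matching parser can be skipped before any download happens, which an opaque parser object would not allow. The function name `select_supported` and the extension-keyed `handlers` dict are assumptions for this example, not the loader's actual code.

```python
from typing import Any, Dict, Iterable, List


def select_supported(file_names: Iterable[str], handlers: Dict[str, Any]) -> List[str]:
    """Keep only the files whose extension has a registered handler."""
    supported = {extension.lower().lstrip(".") for extension in handlers}
    selected: List[str] = []
    for name in file_names:
        # rpartition yields an empty separator when the name has no dot.
        _, dot, extension = name.rpartition(".")
        if dot and extension.lower() in supported:
            selected.append(name)
    return selected


to_download = select_supported(
    ["report.pdf", "photo.PNG", "budget.xlsx", "README"],
    {"pdf": "pdf-parser", "xlsx": "xlsx-parser"},
)
```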

MacanPN · 2024-11-04T10:58:13Z

In general, the extra logic implemented in this PR is not about instantiating `MimeTypeBasedParser` (that is a one-line thing) but rather about supporting arbitrary file types (rather than just the three enumerated types).

vbarda · 2024-11-05T15:10:59Z

Makes sense, thanks for the follow up! Approved pending two small nits. IMO this is the key argument in favor of this change:

> What the implementation has on top is option to pass in file types (file extensions) instead of mime types, which I think is a real quality of life improvement. Some mime types can be not as straightforward as the ones in the example above, such as:
>
> - docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document
> - xlsx: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

MacanPN · 2024-11-06T08:37:35Z

@vbarda Minor nitpicks addressed. Are we good to merge?

MacanPN · 2024-11-06T16:13:48Z

@vbarda I can't merge it myself. Can you please merge the PR?

rasc0der · 2024-11-24T20:25:44Z

Thanks for your contribution! Looks very good.

MacanPN added 3 commits October 29, 2024 17:23

- allow for custom handlers in o365 loader (39e2cac)
- formatting (0b03a1a)
- linting (9f48c04)

MacanPN marked this pull request as ready for review October 29, 2024 17:23

dosubot bot added the size:L (this PR changes 100-499 lines, ignoring generated files), community, and Ɑ: doc loader labels Oct 29, 2024

MacanPN added 4 commits October 30, 2024 13:41

- stricter type checking for handlers (275c8fc)
- Merge branch 'master' into triska/SharePoint-allow_custom_parsers (de361bd)
- reverting handlers type to Any since BaseBlobParser is not pydantic class (c7ce4c7)
- Merge branch 'triska/SharePoint-allow_custom_parsers' of https://github.com/MacanPN/langchain into triska/SharePoint-allow_custom_parsers (befefd7)

MacanPN mentioned this pull request Oct 30, 2024

community: DocumentLoaderAsParser wrapper #27749

Open

efriis assigned vbarda Oct 31, 2024

MacanPN added 2 commits October 31, 2024 13:45

- handler now accepts either file types or mime types (54fa6a8)
- docs + linting (71d6904)

vbarda reviewed Oct 31, 2024

vbarda approved these changes Nov 5, 2024

dosubot bot added the lgtm label (PR looks good; used to confirm that a PR is ready for merging) Nov 5, 2024

- nitpick updates (edecf6a)
- line too long fix (d19e7c5)
- Merge branch 'master' into triska/SharePoint-allow_custom_parsers (8fb15f7)
- Merge branch 'master' into triska/SharePoint-allow_custom_parsers (ef4355d)

vbarda merged commit 90189f5 into langchain-ai:master Nov 6, 2024
20 checks passed

MacanPN mentioned this pull request Nov 7, 2024

[community] [feature]: Implementation of excel parser and including it in o365 loader #27103

Closed
