Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: Allow other than default parsers in SharePointLoader and OneDriveLoader #27716

Merged

Conversation

MacanPN
Copy link
Contributor

@MacanPN MacanPN commented Oct 29, 2024

What this PR does?

Currently O365BaseLoader (and consequently both derived loaders) are limited to pdf, doc, docx files.

  • Solution: here we introduce handlers attribute that allows for custom handlers to be passed in. This is done in dict form:

Example:

from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
# PR for DocumentLoaderAsParser here: https://github.com/langchain-ai/langchain/pull/27749
from langchain_community.document_loaders.excel import UnstructuredExcelLoader

xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser()
    "pdf": PDFMinerParser()
    "txt": TextParser()
    "xlsx": xlsx_parser
}
loader = SharePointLoader(document_library_id="...",
                            handlers=handlers # pass handlers to SharePointLoader
                            )
documents = loader.load()

# works the same in OneDriveLoader
loader = OneDriveLoader(document_library_id="...",
                            handlers=handlers
                            )

This dictionary is then passed to MimeTypeBasedParser same as in the current implementation.

Currently SharePointLoader and OneDriveLoader are separate loaders that both inherit from O365BaseLoader

However both of these implement the same functionality. The only differences are:

  • SharePointLoader requires argument document_library_id whereas OneDriveLoader requires drive_id. These are just different names for the same thing.
  • SharePointLoader implements significantly more features.
  • Solution: OneDriveLoader is replaced with an empty shell just renaming drive_id to document_library_id and inheriting from SharePointLoader

Dependencies: None
Twitter handle: @martintriska1

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

Copy link

vercel bot commented Oct 29, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 6, 2024 4:10pm

@MacanPN MacanPN marked this pull request as ready for review October 29, 2024 17:23
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Oct 29, 2024
Copy link
Contributor

@vbarda vbarda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution!

Looks good overall, but my biggest question is: can't we just allow a single optional MimeTypeBasedParser as an input to this document loader? I think it could simplify a lot of this logic. The user could then do something like:

blob_parser = MimeTypeBasedParser({
    "application/msword": MsWordParser(),
    "application/pdf": PDFMinerParser(),
    ""audio/mpeg": OpenAIWhisperParser()
})
loader = OneDriveLoader(..., blob_parser=blob_parser)

@MacanPN
Copy link
Contributor Author

MacanPN commented Nov 4, 2024

Thanks for your contribution!

Looks good overall, but my biggest question is: can't we just allow a single optional MimeTypeBasedParser as an input to this document loader? I think it could simplify a lot of this logic. The user could then do something like:

blob_parser = MimeTypeBasedParser({
    "application/msword": MsWordParser(),
    "application/pdf": PDFMinerParser(),
    ""audio/mpeg": OpenAIWhisperParser()
})
loader = OneDriveLoader(..., blob_parser=blob_parser)

This is a good point. I've been thinking about this but consider:

  • We still need to implement functions for file extension <-> mime type conversions.
  • We still need to set up _file_types and _mime_types properties (for interface compatibility)
  • We still need to provide default parser.
    So we wouldn't really simplify the logic that much. What the implementation has on top is option to pass in file types (file extensions) instead of mime types, which I think is a real quality of life improvement. Some mime types can be not as straight forward as the ones in the example above, such as:
  • .docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • xlsx: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Please also consider that one could not simply pass in general BaseBlobParser but it would have to be MimeTypeBasedParser (because we are looking at handlers to determine whether the file should be downloaded and parsed or not).

@MacanPN
Copy link
Contributor Author

MacanPN commented Nov 4, 2024

In general the extra logic implemented in this PR is not about instantiation of MimeTypeBasedParser (that is 1 line thing) but rather about supporting arbitrary file types (rather than just 3 enumerated types).

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Nov 5, 2024
@vbarda
Copy link
Contributor

vbarda commented Nov 5, 2024

Thanks for your contribution!
Looks good overall, but my biggest question is: can't we just allow a single optional MimeTypeBasedParser as an input to this document loader? I think it could simplify a lot of this logic. The user could then do something like:

blob_parser = MimeTypeBasedParser({
    "application/msword": MsWordParser(),
    "application/pdf": PDFMinerParser(),
    ""audio/mpeg": OpenAIWhisperParser()
})
loader = OneDriveLoader(..., blob_parser=blob_parser)

This is a good point. I've been thinking about this but consider:

  • We still need to implement functions for file extension <-> mime type conversions.
  • We still need to set up _file_types and _mime_types properties (for interface compatibility)
  • We still need to provide default parser.
    So we wouldn't really simplify the logic that much. What the implementation has on top is option to pass in file types (file extensions) instead of mime types, which I think is a real quality of life improvement. Some mime types can be not as straight forward as the ones in the example above, such as:
  • .docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • xlsx: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Please also consider that one could not simply pass in general BaseBlobParser but it would have to be MimeTypeBasedParser (because we are looking at handlers to determine whether the file should be downloaded and parsed or not).

Makes sense, thanks for the follow up! Approved pending two small nits. IMO this is the key argument in favor of this change:

What the implementation has on top is option to pass in file types (file extensions) instead of mime types, which I think is a real quality of life improvement. Some mime types can be not as straight forward as the ones in the example above, such as:
.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document
xlsx: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

@MacanPN
Copy link
Contributor Author

MacanPN commented Nov 6, 2024

@vbarda Minor nitpicks addressed. Are we good to merge?

@MacanPN
Copy link
Contributor Author

MacanPN commented Nov 6, 2024

@vbarda I can't merge it myself. Can you please merge the PR?
image

@vbarda vbarda merged commit 90189f5 into langchain-ai:master Nov 6, 2024
20 checks passed
yanomaly pushed a commit to yanomaly/langchain that referenced this pull request Nov 8, 2024
…neDriveLoader (langchain-ai#27716)

## What this PR does?

### Currently `O365BaseLoader` (and consequently both derived loaders)
are limited to `pdf`, `doc`, `docx` files.
- **Solution: here we introduce _handlers_ attribute that allows for
custom handlers to be passed in. This is done in _dict_ form:**

**Example:**
```python
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
# PR for DocumentLoaderAsParser here: langchain-ai#27749
from langchain_community.document_loaders.excel import UnstructuredExcelLoader

xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser()
    "pdf": PDFMinerParser()
    "txt": TextParser()
    "xlsx": xlsx_parser
}
loader = SharePointLoader(document_library_id="...",
                            handlers=handlers # pass handlers to SharePointLoader
                            )
documents = loader.load()

# works the same in OneDriveLoader
loader = OneDriveLoader(document_library_id="...",
                            handlers=handlers
                            )
```
This dictionary is then passed to `MimeTypeBasedParser` same as in the
[current
implementation](https://github.com/langchain-ai/langchain/blob/5a2cfb49e045988d290a1c7e3a0c589d6b371694/libs/community/langchain_community/document_loaders/parsers/registry.py#L13).


### Currently `SharePointLoader` and `OneDriveLoader` are separate
loaders that both inherit from `O365BaseLoader`
However both of these implement the same functionality. The only
differences are:
- `SharePointLoader` requires argument `document_library_id` whereas
`OneDriveLoader` requires `drive_id`. These are just different names for
the same thing.
  - `SharePointLoader` implements significantly more features.
- **Solution: `OneDriveLoader` is replaced with an empty shell just
renaming `drive_id` to `document_library_id` and inheriting from
`SharePointLoader`**

**Dependencies:** None
**Twitter handle:** @martintriska1

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
@rasc0der
Copy link

rasc0der commented Nov 24, 2024

Thanks for your contribution! Looks very good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants