
community: DocumentLoaderAsParser wrapper #27749

Open · wants to merge 6 commits into master
Conversation

@MacanPN MacanPN commented Oct 30, 2024

Description

This pull request introduces the DocumentLoaderAsParser class, which acts as an adapter to transform document loaders into parsers within the LangChain framework. The class enables document loaders that accept a file_path parameter to be utilized as blob parsers. This is particularly useful for integrating various document loading capabilities seamlessly into the LangChain ecosystem.

When merged together with PR #27716, it enables SharePointLoader / OneDriveLoader to process any file type that has a document loader.

Features

  • Flexible Parsing: The DocumentLoaderAsParser class can adapt any document loader that meets the criteria of accepting a file_path argument, allowing for lazy parsing of documents.
  • Compatibility: The class has been designed to work with various document loaders, making it versatile for different use cases.

Usage Example

To use the DocumentLoaderAsParser, you would initialize it with a suitable document loader class and any required parameters. Here’s an example of how to do this with the UnstructuredExcelLoader:

```python
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.excel import UnstructuredExcelLoader

# Initialize the parser adapter with UnstructuredExcelLoader
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# Use the parser, e.g. by passing it to MimeTypeBasedParser
parser = MimeTypeBasedParser(
    handlers={
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": xlsx_parser
    }
)
```
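The adapter's behavior can be sketched with a dependency-free toy version. `Blob`, `Document`, and `FakeLoader` below are illustrative stand-ins, not the real LangChain classes; the point is only the shape of the adaptation (loader class captured at init, instantiated per blob at parse time):

```python
# Minimal, dependency-free sketch of the adapter idea. Blob, Document and
# FakeLoader are stand-ins, not the real LangChain classes.
from dataclasses import dataclass
from typing import Any, Iterator, Type


@dataclass
class Blob:
    path: str


@dataclass
class Document:
    page_content: str


class DocumentLoaderAsParser:
    """Adapt a loader that takes `file_path` into a blob parser."""

    def __init__(self, document_loader_class: Type, **kwargs: Any) -> None:
        self.DocumentLoaderClass = document_loader_class
        self.kwargs = kwargs

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # The file path is only known at parse time, so the loader is
        # instantiated per blob rather than in __init__.
        loader = self.DocumentLoaderClass(file_path=blob.path, **self.kwargs)
        yield from loader.lazy_load()


class FakeLoader:
    """Stand-in loader: yields one Document naming the file it was given."""

    def __init__(self, file_path: str, mode: str = "single") -> None:
        self.file_path = file_path
        self.mode = mode

    def lazy_load(self) -> Iterator[Document]:
        yield Document(page_content=f"loaded {self.file_path} ({self.mode})")


parser = DocumentLoaderAsParser(FakeLoader, mode="paged")
docs = list(parser.lazy_parse(Blob(path="report.xlsx")))
print(docs[0].page_content)  # → loaded report.xlsx (paged)
```

Note that the loader-specific keyword arguments (like `mode="paged"`) are captured once at construction and reused for every blob parsed.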
  • Dependencies: None
  • Twitter handle: @martintriska1

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.


@MacanPN MacanPN marked this pull request as ready for review October 30, 2024 17:48
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Oct 30, 2024
@MacanPN MacanPN changed the title document loader as parser community: DocumentLoaderAsParser wrapper Oct 30, 2024
vbarda pushed a commit that referenced this pull request Nov 6, 2024
…neDriveLoader (#27716)

## What does this PR do?

### Currently `O365BaseLoader` (and consequently both derived loaders) are limited to `pdf`, `doc`, `docx` files.
- **Solution: here we introduce a _handlers_ attribute that allows custom handlers to be passed in. This is done in _dict_ form:**

**Example:**
```python
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
# PR for DocumentLoaderAsParser here: #27749
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_community.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders.sharepoint import SharePointLoader
from langchain_community.document_loaders.onedrive import OneDriveLoader

xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "txt": TextParser(),
    "xlsx": xlsx_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()

# works the same in OneDriveLoader
loader = OneDriveLoader(
    document_library_id="...",
    handlers=handlers,
)
```
This dictionary is then passed to `MimeTypeBasedParser` same as in the
[current
implementation](https://github.com/langchain-ai/langchain/blob/5a2cfb49e045988d290a1c7e3a0c589d6b371694/libs/community/langchain_community/document_loaders/parsers/registry.py#L13).
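As a side note, translating the extension-keyed dict into the mime-type keys that `MimeTypeBasedParser` expects can be done with the standard library. This is an illustrative sketch only (placeholder strings stand in for real parser objects), not the PR's actual code:

```python
# Sketch: mapping an extension-keyed handlers dict to mime-type keys.
# Placeholder strings stand in for real parser objects.
import mimetypes

handlers = {
    "doc": "MsWordParser()",
    "pdf": "PDFMinerParser()",
    "txt": "TextParser()",
}

# guess_type keys off the file suffix, so a dummy filename suffices
mime_handlers = {
    mimetypes.guess_type(f"f.{ext}")[0]: parser
    for ext, parser in handlers.items()
}
print(mime_handlers["application/pdf"])  # → PDFMinerParser()
```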


### Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader`.
However, both implement the same functionality. The only differences are:
- `SharePointLoader` requires argument `document_library_id` whereas
`OneDriveLoader` requires `drive_id`. These are just different names for
the same thing.
  - `SharePointLoader` implements significantly more features.
- **Solution: `OneDriveLoader` is replaced with an empty shell just
renaming `drive_id` to `document_library_id` and inheriting from
`SharePointLoader`**

**Dependencies:** None
**Twitter handle:** @martintriska1

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

MacanPN commented Nov 7, 2024

I've written down an extra explanation of why one might need to turn a document loader into a parser. If you're unsure about that, please keep reading:

  • Document loaders function as <whatever> -> documents.
  • Blob parsers process blob -> documents.
  • Many document loaders actually implement their logic only up to the point where they have a blob; the actual parsing is then done by calling a parser with a similar name. As an example, see AmazonTextractPDFLoader and AmazonTextractPDFParser.
  • Some document loaders may want to process numerous file types. An example is SharePointLoader, which is supposed to fetch all parsable files, parse them, and produce documents. Until recently it processed only three file types (doc, docx, pdf) and ignored everything else. Recently I got a PR merged that enables a user to pass in a dict mapping file types (or mime types) to parsers. Now one can easily use SharePointLoader with any file for which a suitable parser is available.
  • Unfortunately, many document loaders do not define a BlobParser but rather handle everything directly inside the loader class. UnstructuredLoader is one of those.
  • Originally I went ahead and implemented a separate ExcelParser (this PR). However, this duplicated some of the code. That could be mitigated by replacing the parsing code inside UnstructuredExcelLoader with a call to the parser, which would bring it in line with how things are done, for example, in the PDF loaders.
  • However, I found out that I'd have to do this shuffling for a fairly long list of document loaders. Given how much time such an effort would require, I opted to create the DocumentLoaderAsParser wrapper instead. It can wrap any DocumentLoader that accepts a file_path argument and turn it into a BaseBlobParser.

So the idea is to be able to do:

```python
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")
mp3_parser = DocumentLoaderAsParser(GoogleSpeechToTextLoader, project_id="...")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "xlsx": xlsx_parser,
    "mp3": mp3_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()
```

Hope this makes sense.

yanomaly pushed a commit to yanomaly/langchain that referenced this pull request Nov 8, 2024
…neDriveLoader (langchain-ai#27716)


MacanPN commented Nov 12, 2024

@vbarda Please take a look :)


MacanPN commented Nov 18, 2024

@efriis Would you mind taking a look yourself, please? It seems @vbarda might not have the bandwidth right now. Thank you, and please let me know if there is anything I can do on my part to move this forward.

Review thread on:

```python
"""
Use underlying DocumentLoader to lazily parse the blob.
"""
doc_loader = self.DocumentLoaderClass(
```

Contributor: why not just create this on init?

Contributor Author: Because on init, you don't know the `file_path`. The file path is pulled from the blob when `lazy_parse` is called. This is consistent with all other parsers.

Review thread on:

```python
    "can be morphed into a parser."
)

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
```

Contributor: i think we should add parse as well

Contributor Author: The `parse` method is created inside `BaseBlobParser`; calling `.parse()` will end up calling `list()` on `.lazy_parse()`.
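The delegation being described can be shown with a minimal, dependency-free stand-in (this is not the actual `BaseBlobParser` code, just the pattern it uses):

```python
# Sketch of a base class whose eager `parse` simply materializes the
# subclass's `lazy_parse` iterator; classes here are illustrative.
from typing import Iterator, List


class BaseBlobParserSketch:
    def lazy_parse(self, blob) -> Iterator[str]:
        raise NotImplementedError

    def parse(self, blob) -> List[str]:
        # Eager parse is just the lazy iterator collected into a list,
        # so subclasses only need to implement lazy_parse.
        return list(self.lazy_parse(blob))


class Upper(BaseBlobParserSketch):
    def lazy_parse(self, blob) -> Iterator[str]:
        yield blob.upper()


print(Upper().parse("hello"))  # → ['HELLO']
```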


Review thread on:

```python
# Ensure the document loader class has a `file_path` parameter
init_signature = inspect.signature(document_loader_class.__init__)
if "file_path" not in init_signature.parameters:
```

Contributor: is this a sufficient condition for being able to convert?

Contributor Author: Yes, that is the only condition we're adding. Of course, any number of problems can arise if the provided document loader class cannot load a given file, but that would be handled inside the document loader itself.
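The signature check under discussion can be reproduced in isolation; `GoodLoader` and `BadLoader` below are illustrative stand-ins, not classes from the PR:

```python
# Sketch of the `file_path` signature check using inspect.
import inspect


class GoodLoader:
    def __init__(self, file_path: str, mode: str = "single") -> None:
        self.file_path = file_path


class BadLoader:
    def __init__(self, url: str) -> None:
        self.url = url


def accepts_file_path(cls) -> bool:
    # True iff the class's __init__ declares a `file_path` parameter
    return "file_path" in inspect.signature(cls.__init__).parameters


print(accepts_file_path(GoodLoader), accepts_file_path(BadLoader))  # → True False
```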

@vbarda (Contributor) left a comment:
apologies for the delay -- looks good high level, added some questions

@ccurme (Collaborator) left a comment:

Thanks for this.

  1. Can we add tests?
  2. Thinking through alternatives, does it make sense to add a "sub" document loader class as an optional attribute of O365BaseLoader or SharePointLoader? Introducing a new abstraction brings some risk if we don't get the interface right (or it's not as generic / extendable as we thought).
  3. If we were to introduce a new class or method, consider decorating it with the `@beta` decorator if we're not 100% on the API (see example here).

Regarding unstructured: I would favor use of the langchain-unstructured package over the integrations in community. The community integrations are older and could be deprecated but we have not put in the work to verify that we don't lose functionality in langchain-unstructured.

@MacanPN (Contributor Author) left a comment:
Answered code comments. Please take a look


MacanPN commented Nov 22, 2024

@ccurme Please take a look. Thanks!

> Thanks for this.
>
> 1. Can we add tests?

I've had tests in previous commits that I have since rolled back. There is a problem: unstructured requires Python <3.13, while langchain wants tests to run on 3.13. This is the error I'm getting:

```
The current project's supported Python range (>=3.9,<4.0) is not compatible with some of the required packages Python requirement:
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
```

> 2. Thinking through alternatives, does it make sense to add a "sub" document loader class as an optional attribute of `O365BaseLoader` or `SharePointLoader`? Introducing a new abstraction brings some risk if we don't get the interface right (or it's not as generic / extendable as we thought).

This was my original thought. However, after spending some time on this issue, I found the current approach much better. Updating `O365BaseLoader` to accept sub-loaders would fix the issue for O365, but the same issue will surface in other loaders as well. The conceptual problem is that we're missing parsers; this class takes care of the root problem.

> 3. If we were to introduce a new class or method, consider decorating it with a `@beta` parameter if we're not 100% on the API (see example [here](https://github.com/langchain-ai/langchain/blob/f5f53d1101ea73d8465deb7d73b0a4e70bb556e7/libs/community/langchain_community/graph_vectorstores/base.py#L708)).

I can add the `@beta` decorator. What version should I say it will be added in?

> Regarding unstructured: I would favor use of the langchain-unstructured package over the integrations in community. The community integrations are older and could be deprecated but we have not put in the work to verify that we don't lose functionality in langchain-unstructured.

Got it, I will be using langchain-unstructured going forward. It should not have a material influence on this PR, since DocumentLoaderAsParser can be used with either (and many other loaders).

Labels: community · Ɑ: doc loader · size:M
Status: In review
3 participants