
community: DocumentLoaderAsParser wrapper #27749

Open · wants to merge 6 commits into master
Conversation

@MacanPN MacanPN commented Oct 30, 2024

Description

This pull request introduces the DocumentLoaderAsParser class, which acts as an adapter to transform document loaders into parsers within the LangChain framework. The class enables document loaders that accept a file_path parameter to be utilized as blob parsers. This is particularly useful for integrating various document loading capabilities seamlessly into the LangChain ecosystem.

When merged together with PR #27716, it enables SharePointLoader / OneDriveLoader to process any file type that has a document loader.

Features

  • Flexible Parsing: The DocumentLoaderAsParser class can adapt any document loader that meets the criteria of accepting a file_path argument, allowing for lazy parsing of documents.
  • Compatibility: The class has been designed to work with various document loaders, making it versatile for different use cases.

Usage Example

To use the DocumentLoaderAsParser, you would initialize it with a suitable document loader class and any required parameters. Here’s an example of how to do this with the UnstructuredExcelLoader:

```python
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.excel import UnstructuredExcelLoader

# Initialize the parser adapter with UnstructuredExcelLoader
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# Use the parser, e.g. by passing it to MimeTypeBasedParser
parser = MimeTypeBasedParser(
    handlers={
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": xlsx_parser
    }
)
```
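The adapter's behavior can be sketched with a dependency-free toy version. `Blob`, `Document`, and `FakeLoader` below are illustrative stand-ins, not the real LangChain classes; the point is only the shape of the adaptation (loader class captured at init, instantiated per blob at parse time):

```python
# Minimal, dependency-free sketch of the adapter idea. Blob, Document and
# FakeLoader are stand-ins, not the real LangChain classes.
from dataclasses import dataclass
from typing import Any, Iterator, Type


@dataclass
class Blob:
    path: str


@dataclass
class Document:
    page_content: str


class DocumentLoaderAsParser:
    """Adapt a loader that takes `file_path` into a blob parser."""

    def __init__(self, document_loader_class: Type, **kwargs: Any) -> None:
        self.DocumentLoaderClass = document_loader_class
        self.kwargs = kwargs

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # The file path is only known at parse time, so the loader is
        # instantiated per blob rather than in __init__.
        loader = self.DocumentLoaderClass(file_path=blob.path, **self.kwargs)
        yield from loader.lazy_load()


class FakeLoader:
    """Stand-in loader: yields one Document naming the file it was given."""

    def __init__(self, file_path: str, mode: str = "single") -> None:
        self.file_path = file_path
        self.mode = mode

    def lazy_load(self) -> Iterator[Document]:
        yield Document(page_content=f"loaded {self.file_path} ({self.mode})")


parser = DocumentLoaderAsParser(FakeLoader, mode="paged")
docs = list(parser.lazy_parse(Blob(path="report.xlsx")))
print(docs[0].page_content)  # → loaded report.xlsx (paged)
```

Note that the loader-specific keyword arguments (like `mode="paged"`) are captured once at construction and reused for every blob parsed.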
  • Dependencies: None
  • Twitter handle: @martintriska1

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.


@MacanPN MacanPN marked this pull request as ready for review October 30, 2024 17:48
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Oct 30, 2024
@MacanPN MacanPN changed the title document loader as parser community: DocumentLoaderAsParser wrapper Oct 30, 2024
vbarda pushed a commit that referenced this pull request Nov 6, 2024
…neDriveLoader (#27716)

## What does this PR do?

### Currently `O365BaseLoader` (and consequently both derived loaders) are limited to `pdf`, `doc`, `docx` files.
- **Solution: here we introduce a _handlers_ attribute that allows custom handlers to be passed in. This is done in _dict_ form:**

**Example:**
```python
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
# PR for DocumentLoaderAsParser here: #27749
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_community.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders.sharepoint import SharePointLoader
from langchain_community.document_loaders.onedrive import OneDriveLoader

xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "txt": TextParser(),
    "xlsx": xlsx_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()

# works the same in OneDriveLoader
loader = OneDriveLoader(
    document_library_id="...",
    handlers=handlers,
)
```
This dictionary is then passed to `MimeTypeBasedParser` same as in the
[current
implementation](https://github.com/langchain-ai/langchain/blob/5a2cfb49e045988d290a1c7e3a0c589d6b371694/libs/community/langchain_community/document_loaders/parsers/registry.py#L13).
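As a side note, translating the extension-keyed dict into the mime-type keys that `MimeTypeBasedParser` expects can be done with the standard library. This is an illustrative sketch only (placeholder strings stand in for real parser objects), not the PR's actual code:

```python
# Sketch: mapping an extension-keyed handlers dict to mime-type keys.
# Placeholder strings stand in for real parser objects.
import mimetypes

handlers = {
    "doc": "MsWordParser()",
    "pdf": "PDFMinerParser()",
    "txt": "TextParser()",
}

# guess_type keys off the file suffix, so a dummy filename suffices
mime_handlers = {
    mimetypes.guess_type(f"f.{ext}")[0]: parser
    for ext, parser in handlers.items()
}
print(mime_handlers["application/pdf"])  # → PDFMinerParser()
```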


### Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader`.
However, both implement the same functionality. The only differences are:
- `SharePointLoader` requires argument `document_library_id` whereas
`OneDriveLoader` requires `drive_id`. These are just different names for
the same thing.
  - `SharePointLoader` implements significantly more features.
- **Solution: `OneDriveLoader` is replaced with an empty shell just
renaming `drive_id` to `document_library_id` and inheriting from
`SharePointLoader`**

**Dependencies:** None
**Twitter handle:** @martintriska1

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

MacanPN commented Nov 7, 2024

I've written down an extra explanation of why one might need to turn a document loader into a parser. If you're unsure about that, please keep reading:

  • Document loaders function as <whatever> -> documents.
  • Blob parsers process blob -> documents.
  • Many document loaders actually implement their logic only up to the point where they have a blob; the actual parsing is then done by calling a parser with a similar name. As an example, see AmazonTextractPDFLoader and AmazonTextractPDFParser.
  • Some document loaders may want to process numerous file types. An example is SharePointLoader, which is supposed to fetch all parsable files, parse them, and produce documents. Until recently it processed only three file types (doc, docx, pdf) and ignored everything else. Recently I got a PR merged that enables a user to pass in a dict mapping file types (or mime types) to parsers. Now one can easily use SharePointLoader with any file for which a suitable parser is available.
  • Unfortunately, many document loaders do not define a BlobParser but rather handle everything directly inside the loader class. UnstructuredLoader is one of those.
  • Originally I went ahead and implemented a separate ExcelParser (this PR). However, this duplicated some of the code. That could be mitigated by replacing the parsing code inside UnstructuredExcelLoader with a call to the parser, which would bring it in line with how things are done, for example, in the PDF loaders.
  • However, I found out that I'd have to do this shuffling for a fairly long list of document loaders. Given how much time such an effort would require, I opted to create the DocumentLoaderAsParser wrapper instead. It can wrap any DocumentLoader that accepts a file_path argument and turn it into a BaseBlobParser.

So the idea is to be able to do:

```python
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")
mp3_parser = DocumentLoaderAsParser(GoogleSpeechToTextLoader, project_id="...")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "xlsx": xlsx_parser,
    "mp3": mp3_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()
```

Hope this makes sense.

yanomaly pushed a commit to yanomaly/langchain that referenced this pull request Nov 8, 2024
…neDriveLoader (langchain-ai#27716)


MacanPN commented Nov 12, 2024

@vbarda Please take a look :)


MacanPN commented Nov 18, 2024

@efriis Would you mind taking a look yourself, please? It seems @vbarda might not have the bandwidth right now. Thank you, and please let me know if there is anything I can do on my part to move this forward.

Review thread on:

```python
"""
Use underlying DocumentLoader to lazily parse the blob.
"""
doc_loader = self.DocumentLoaderClass(
```

Contributor: why not just create this on init?

Contributor Author: Because on init, you don't know the `file_path`. The file path is pulled from the blob when `lazy_parse` is called. This is consistent with all other parsers.

Review thread on:

```python
    "can be morphed into a parser."
)

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
```

Contributor: i think we should add parse as well

Contributor Author: The `parse` method is created inside `BaseBlobParser`; calling `.parse()` will end up calling `list()` on `.lazy_parse()`.
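The delegation being described can be shown with a minimal, dependency-free stand-in (this is not the actual `BaseBlobParser` code, just the pattern it uses):

```python
# Sketch of a base class whose eager `parse` simply materializes the
# subclass's `lazy_parse` iterator; classes here are illustrative.
from typing import Iterator, List


class BaseBlobParserSketch:
    def lazy_parse(self, blob) -> Iterator[str]:
        raise NotImplementedError

    def parse(self, blob) -> List[str]:
        # Eager parse is just the lazy iterator collected into a list,
        # so subclasses only need to implement lazy_parse.
        return list(self.lazy_parse(blob))


class Upper(BaseBlobParserSketch):
    def lazy_parse(self, blob) -> Iterator[str]:
        yield blob.upper()


print(Upper().parse("hello"))  # → ['HELLO']
```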


Review thread on:

```python
# Ensure the document loader class has a `file_path` parameter
init_signature = inspect.signature(document_loader_class.__init__)
if "file_path" not in init_signature.parameters:
```

Contributor: is this a sufficient condition for being able to convert?

Contributor Author: Yes, that is the only condition we're adding. Of course, any number of problems can arise if the provided document loader class cannot load a given file, but that would be handled inside the document loader itself.
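The signature check under discussion can be reproduced in isolation; `GoodLoader` and `BadLoader` below are illustrative stand-ins, not classes from the PR:

```python
# Sketch of the `file_path` signature check using inspect.
import inspect


class GoodLoader:
    def __init__(self, file_path: str, mode: str = "single") -> None:
        self.file_path = file_path


class BadLoader:
    def __init__(self, url: str) -> None:
        self.url = url


def accepts_file_path(cls) -> bool:
    # True iff the class's __init__ declares a `file_path` parameter
    return "file_path" in inspect.signature(cls.__init__).parameters


print(accepts_file_path(GoodLoader), accepts_file_path(BadLoader))  # → True False
```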

@vbarda (Contributor) left a comment:
apologies for the delay -- looks good high level, added some questions

@ccurme (Collaborator) left a comment:

Thanks for this.

  1. Can we add tests?
  2. Thinking through alternatives, does it make sense to add a "sub" document loader class as an optional attribute of O365BaseLoader or SharePointLoader? Introducing a new abstraction brings some risk if we don't get the interface right (or it's not as generic / extendable as we thought).
  3. If we were to introduce a new class or method, consider decorating it with the `@beta` decorator if we're not 100% on the API (see example here).

Regarding unstructured: I would favor use of the langchain-unstructured package over the integrations in community. The community integrations are older and could be deprecated but we have not put in the work to verify that we don't lose functionality in langchain-unstructured.

@MacanPN (Contributor Author) left a comment:
Answered code comments. Please take a look


MacanPN commented Nov 22, 2024

@ccurme Please take a look. Thanks!

> Thanks for this.
>
> 1. Can we add tests?

I've had tests in previous commits that I have since rolled back. There is a problem: unstructured requires Python <3.13, while langchain wants tests to run on 3.13. This is the error I'm getting:

```
The current project's supported Python range (>=3.9,<4.0) is not compatible with some of the required packages Python requirement:
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
```

> 2. Thinking through alternatives, does it make sense to add a "sub" document loader class as an optional attribute of `O365BaseLoader` or `SharePointLoader`? Introducing a new abstraction brings some risk if we don't get the interface right (or it's not as generic / extendable as we thought).

This was my original thought. However, after spending some time on this issue, I found the current approach much better. Updating `O365BaseLoader` to accept sub-loaders would fix the issue for O365, but the same issue will surface in other loaders as well. The conceptual problem is that we're missing parsers; this class takes care of the root problem.

> 3. If we were to introduce a new class or method, consider decorating it with a `@beta` parameter if we're not 100% on the API (see example [here](https://github.com/langchain-ai/langchain/blob/f5f53d1101ea73d8465deb7d73b0a4e70bb556e7/libs/community/langchain_community/graph_vectorstores/base.py#L708)).

I can add the `@beta` decorator. What version should I say it will be added in?

> Regarding unstructured: I would favor use of the langchain-unstructured package over the integrations in community. The community integrations are older and could be deprecated but we have not put in the work to verify that we don't lose functionality in langchain-unstructured.

Got it, I will be using langchain-unstructured going forward. It should not have a material influence on this PR, since DocumentLoaderAsParser can be used with either (and many other loaders).

Labels: community · Ɑ: doc loader · size:M
Status: In review
3 participants