Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[community] [feature]: Implementation of excel parser and including it in o365 loader #27103

Conversation

MacanPN
Copy link
Contributor

@MacanPN MacanPN commented Oct 4, 2024

Description

Currently there is no implementation of .xlsx parser and consequently FileSystemBlobLoader as well as all loaders derived from O365BaseLoader are limited to .doc, .docx, .pdf and .txt. This PR uses unstructured to implement ExcelParser parser very similar to MsWordParser (with just an extra bit of post processing).

Dependencies:

  • htmltabletomd

Tests

I've implemented a unit test for the .xlsx parser however it depends on unstructured module and I have a problem getting any version of unstructured to pyproject.toml that would work and didn't break something somewhere. Please let me know if anyone is willing to help me with that.

Twitter handle: @martintriska1

Is anyone willing to review this please? Thanks! @baskaryan, @eyurtsev, @ccurme

Copy link

vercel bot commented Oct 4, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Oct 8, 2024 8:08am

@MacanPN MacanPN marked this pull request as ready for review October 4, 2024 16:58
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Oct 4, 2024
@MacanPN MacanPN force-pushed the triska/add_xlsx_to_sharepoint_loader branch from 3af4aab to 5a13580 Compare October 4, 2024 22:11
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Oct 4, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Oct 8, 2024
@petergoldstein
Copy link

XLSX files are important for a variety of enterprise use cases. Having a parser will be very handy for some of our current challenges.

@avalon-swain
Copy link

This will be helpful


"""
try:
from unstructured.partition.xlsx import partition_xlsx
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. How do you feel about using langchain-unstructured here? Are there important differences in the resulting Document objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ccurme Actually in the meantime, I've written up this wrapper that allows to turn a DocumentLoader into Parser. Excel parser is already implemented here so I'd abandon this PR in favor of merging the DocumentLoaderAsParser wrapper together with some more changes to UnstructuredExcelLoader (or possibly langchain-unstructured.document_loaders.UnstructuredLoader). What's the current thinking? Is the implementation in community being deprecated? (In community there is implementation with inheritance UnstructuredExcelLoader : UnstructuredFileLoader : UnstructuredBaseLoader : BaseLoader where as in langchain-unstructured there is independent implementation only inheriting directly from BaseLoader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ccurme I see that the DocumentLoaderAsParser was assigned status needs support. I would appreciate if you could take a look at that one in context of being substitute for this PR. Thanks!

Copy link
Contributor Author

@MacanPN MacanPN Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ccurme I'm thinking I might not have made clear enough about why having BlobParser rather than DocumentLoader is important. Please let me breifly explain:

  • Document loaders function as <whatever> -> documents
  • BlobParsers process blob -> documents.
  • Many document loaders actually implement logic until they have a blob. Then the actual parsing is done by calling parser with similar name. As an example you can look at AmazonTextractPDFLoader and AmazonTextractPDFParser.
  • Some document loaders may want to process numerous filetypes. An example is a SharePointLoader that is supposed to fetch all parsable files, parse them and produce documents. Until recently it processed only 3 file types doc, docx, pdf and ignored everything else. Recently I got this pr merged that enables a user to pass in dict mapping file types (or mime types) to parsers. Now one can easily use SharePointLoader with any files where suitable parser is available.
  • Unfortunatelly many document loaders do not define BlobParser but rather handle everything directly inside the loader class. UnstructuredLoader is one of those.
  • Originally I went ahead and implemented separate ExcelParser (this PR). However this duplicated some of the code. That could be mitigated by replacing the part of the code inside UnstructuredExcelLoader responsible for parsing the file with calling the Parser. This would basically bring it to agreement with how things are done for ex. in PDF Loaders.
  • However I found out that I'd have to do this shuffling for a fairly long list of document loaders. Given how much time such effort would require, I opted to create DocumentLoaderAsParser wrapper. This wrapper can be used on any DocumentLoader that accepts file_path argument and turns it into BaseBlobParser.

So the idea is to be able to do:

xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")
mp3_parser = DocumentLoaderAsParser(GoogleSpeechToTextLoader, project_id="...")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "xlsx": xlsx_parser,
    "mp3": mp3_parser
}
loader = SharePointLoader(document_library_id="...",
                            handlers=handlers # pass handlers to SharePointLoader
                            )
documents = loader.load()

Hope this makes sense.

@ccurme
Copy link
Collaborator

ccurme commented Nov 21, 2024

Closing in favor of #27749, let me know if I misunderstood the intent.

@ccurme ccurme closed this Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.

4 participants