[community] [feature]: Implementation of excel parser and including it in o365 loader #27103

MacanPN · 2024-10-04T14:03:00Z

Description

Currently there is no .xlsx parser, so FileSystemBlobLoader and all loaders derived from O365BaseLoader are limited to .doc, .docx, .pdf and .txt. This PR uses unstructured to implement an ExcelParser very similar to MsWordParser (with just an extra bit of post-processing).

Dependencies:

htmltabletomd

Tests

I've implemented a unit test for the .xlsx parser. However, it depends on the unstructured module, and I couldn't find any version of unstructured to add to pyproject.toml that works without breaking something elsewhere. Please let me know if anyone is willing to help me with that.

Twitter handle: @martintriska1

Is anyone willing to review this please? Thanks! @baskaryan, @eyurtsev, @ccurme

petergoldstein · 2024-10-09T14:55:15Z

XLSX files are important for a variety of enterprise use cases. Having a parser will be very handy for some of our current challenges.

avalon-swain · 2024-10-09T19:36:30Z

This will be helpful

ccurme · 2024-10-31T15:13:25Z

libs/community/langchain_community/document_loaders/parsers/excel.py

+
+        """
+        try:
+            from unstructured.partition.xlsx import partition_xlsx


Thanks for this. How do you feel about using langchain-unstructured here? Are there important differences in the resulting Document objects?

@ccurme Actually, in the meantime I've written this wrapper, which turns a DocumentLoader into a Parser. Excel parser is already implemented here, so I'd abandon this PR in favor of merging the DocumentLoaderAsParser wrapper together with some more changes to UnstructuredExcelLoader (or possibly langchain-unstructured.document_loaders.UnstructuredLoader). What's the current thinking? Is the implementation in community being deprecated? (In community the inheritance chain is UnstructuredExcelLoader : UnstructuredFileLoader : UnstructuredBaseLoader : BaseLoader, whereas langchain-unstructured has an independent implementation that inherits directly from BaseLoader.)

@ccurme I see that DocumentLoaderAsParser was assigned the status "needs support". I would appreciate it if you could take a look at that one as a substitute for this PR. Thanks!

@ccurme I think I might not have been clear enough about why having a BlobParser rather than a DocumentLoader is important. Please let me briefly explain:

Document loaders function as <whatever> -> documents

BlobParsers process blob -> documents.

Many document loaders only implement the logic needed to obtain a blob. The actual parsing is then done by calling a parser with a similar name. As an example, see AmazonTextractPDFLoader and AmazonTextractPDFParser.
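To make the split concrete, here is a minimal sketch that uses stand-in classes rather than the real langchain_core types. `Blob`, `Document`, `TextParser` and `DictStoreLoader` are hypothetical names for illustration only; the point is just the shape: the loader fetches bytes from somewhere, the parser turns a blob into documents.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class Blob:
    """Stand-in for raw bytes plus their origin (hypothetical, not langchain_core's Blob)."""
    data: bytes
    source: str


@dataclass
class Document:
    """Stand-in for a parsed document (hypothetical)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class TextParser:
    """Parser: blob -> documents. It does not care where the blob came from."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(blob.data.decode("utf-8"), {"source": blob.source})


class DictStoreLoader:
    """Loader: <whatever> -> documents. The 'whatever' here is a dict standing in for a remote store."""

    def __init__(self, store: Dict[str, bytes], parser: TextParser) -> None:
        self.store = store
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        for name, data in self.store.items():
            # Fetching is the loader's job; parsing is delegated to the parser.
            yield from self.parser.lazy_parse(Blob(data, name))
```

Because the parser only sees a blob, the same parser could be reused by any loader that can produce one.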

Some document loaders may want to process numerous file types. An example is SharePointLoader, which is supposed to fetch all parsable files, parse them and produce documents. Until recently it processed only 3 file types (doc, docx, pdf) and ignored everything else. Recently I got this PR merged, which enables a user to pass in a dict mapping file types (or MIME types) to parsers. Now one can easily use SharePointLoader with any file type for which a suitable parser is available.
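Roughly, the dispatch idea looks like the sketch below. This is not the SharePointLoader code: `parse_all`, `parse_txt`, `parse_csv` and the bytes-in, strings-out parser signature are assumptions made up for illustration, and only extension-based lookup is shown (no MIME types).

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical parser signature for this sketch: raw bytes in, text chunks out.
Parser = Callable[[bytes], List[str]]


def parse_txt(data: bytes) -> List[str]:
    return [data.decode("utf-8")]


def parse_csv(data: bytes) -> List[str]:
    # One chunk per non-empty line; a placeholder for a real spreadsheet parser.
    return [line for line in data.decode("utf-8").splitlines() if line]


def parse_all(files: Iterable[Tuple[str, bytes]], handlers: Dict[str, Parser]) -> List[str]:
    """Route each file to the parser registered for its extension; skip the rest."""
    chunks: List[str] = []
    for name, data in files:
        ext = PurePosixPath(name).suffix.lstrip(".").lower()
        parser = handlers.get(ext)
        if parser is None:
            continue  # no handler registered for this type, so the file is ignored
        chunks.extend(parser(data))
    return chunks
```

Adding support for a new file type then means adding one entry to the dict, with no change to the loader itself.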

Unfortunately, many document loaders do not define a BlobParser but rather handle everything directly inside the loader class. UnstructuredLoader is one of those.

Originally I went ahead and implemented a separate ExcelParser (this PR). However, this duplicated some of the code. That could be mitigated by replacing the parsing code inside UnstructuredExcelLoader with a call to the parser. This would basically bring it in line with how things are done, e.g., in the PDF loaders.

However, I found out that I'd have to do this reshuffling for a fairly long list of document loaders. Given how much time such an effort would require, I opted to create the DocumentLoaderAsParser wrapper. This wrapper can be used on any DocumentLoader that accepts a file_path argument and turns it into a BaseBlobParser.
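As a rough illustration of the adapter idea (not the code in #27749, whose details may differ), a wrapper along these lines could spill the bytes to a temporary file and hand the path to the wrapped loader. `LoaderAsParser` and `LineLoader` are hypothetical names, and the sketch takes raw bytes instead of a real Blob to stay self-contained.

```python
import os
import tempfile
from typing import Any, Iterator, List, Type


class LoaderAsParser:
    """Hypothetical adapter: lets a loader class that takes `file_path` parse raw bytes."""

    def __init__(self, loader_cls: Type[Any], **loader_kwargs: Any) -> None:
        self.loader_cls = loader_cls
        self.loader_kwargs = loader_kwargs

    def lazy_parse(self, data: bytes, suffix: str = "") -> Iterator[Any]:
        # Assumption: the wrapped loader can only read from disk, so write a temp file first.
        fd, path = tempfile.mkstemp(suffix=suffix)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(data)
            loader = self.loader_cls(file_path=path, **self.loader_kwargs)
            yield from loader.load()
        finally:
            os.remove(path)  # clean up once the documents have been produced


class LineLoader:
    """Toy loader used only to exercise the adapter; not a real LangChain class."""

    def __init__(self, file_path: str, upper: bool = False) -> None:
        self.file_path = file_path
        self.upper = upper

    def load(self) -> List[str]:
        with open(self.file_path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
        return [line.upper() for line in lines] if self.upper else lines
```

Extra keyword arguments are stored at construction time and forwarded to the loader on every parse, which is what lets one wrapper class serve loaders with different options.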

So the idea is to be able to do:

xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")
mp3_parser = DocumentLoaderAsParser(GoogleSpeechToTextLoader, project_id="...")
# create dictionary mapping file types to handlers (parsers)
handlers = {
    "xlsx": xlsx_parser,
    "mp3": mp3_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()

Hope this makes sense.

ccurme · 2024-11-21T14:56:04Z

Closing in favor of #27749, let me know if I misunderstood the intent.

implementation of excel parser and including it in o365 loader

3a6e502

MacanPN added 2 commits October 4, 2024 16:47

lint

2adcbb6

small refactor

5a13580

MacanPN marked this pull request as ready for review October 4, 2024 16:58

MacanPN force-pushed the triska/add_xlsx_to_sharepoint_loader branch from 3af4aab to 5a13580 Compare October 4, 2024 22:11

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Oct 4, 2024

expanded class docstring and slight changes to output format

44dc527

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Oct 8, 2024

efriis assigned ccurme Oct 31, 2024

ccurme reviewed Oct 31, 2024

MacanPN mentioned this pull request Nov 7, 2024

community: DocumentLoaderAsParser wrapper #27749

Open

ccurme closed this Nov 21, 2024
