[community] [feature]: Implementation of excel parser and including it in o365 loader #27103
XLSX files are important for a variety of enterprise use cases. Having a parser will be very handy for some of our current challenges.
This will be helpful
"""
try:
    from unstructured.partition.xlsx import partition_xlsx
Thanks for this. How do you feel about using langchain-unstructured here? Are there important differences in the resulting Document objects?
@ccurme Actually, in the meantime I've written up this wrapper that turns a DocumentLoader into a Parser. An Excel parser is already implemented here, so I'd abandon this PR in favor of merging the DocumentLoaderAsParser wrapper together with some more changes to UnstructuredExcelLoader (or possibly langchain-unstructured.document_loaders.UnstructuredLoader). What's the current thinking? Is the implementation in community being deprecated? (In community the implementation uses the inheritance chain UnstructuredExcelLoader : UnstructuredFileLoader : UnstructuredBaseLoader : BaseLoader, whereas in langchain-unstructured there is an independent implementation inheriting directly from BaseLoader.)
@ccurme I see that DocumentLoaderAsParser was assigned the status "needs support". I would appreciate it if you could take a look at it as a substitute for this PR. Thanks!
@ccurme I think I might not have made it clear enough why having a BlobParser rather than a DocumentLoader is important. Please let me briefly explain:
- Document loaders function as <whatever> -> documents.
- BlobParsers process blob -> documents.
- Many document loaders actually implement logic only until they have a blob. Then the actual parsing is done by calling a parser with a similar name. As an example, you can look at AmazonTextractPDFLoader and AmazonTextractPDFParser.
- Some document loaders may want to process numerous file types. An example is SharePointLoader, which is supposed to fetch all parsable files, parse them, and produce documents. Until recently it processed only 3 file types (doc, docx, pdf) and ignored everything else. Recently I got this PR merged that enables a user to pass in a dict mapping file types (or MIME types) to parsers. Now one can easily use SharePointLoader with any files for which a suitable parser is available.
- Unfortunately, many document loaders do not define a BlobParser but rather handle everything directly inside the loader class. UnstructuredLoader is one of those.
- Originally I went ahead and implemented a separate ExcelParser (this PR). However, this duplicated some of the code. That could be mitigated by replacing the part of the code inside UnstructuredExcelLoader responsible for parsing the file with a call to the Parser. This would basically bring it into agreement with how things are done, e.g., in the PDF loaders.
- However, I found out that I'd have to do this shuffling for a fairly long list of document loaders. Given how much time such an effort would require, I opted to create the DocumentLoaderAsParser wrapper. This wrapper can be used on any DocumentLoader that accepts a file_path argument and turns it into a BaseBlobParser.
So the idea is to be able to do:
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")
mp3_parser = DocumentLoaderAsParser(GoogleSpeechToTextLoader, project_id="...")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "xlsx": xlsx_parser,
    "mp3": mp3_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()
Hope this makes sense.
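To make the wrapper idea above concrete, here is a minimal, self-contained sketch of how such an adapter might work. It is not the code from the DocumentLoaderAsParser PR: `Document`, `Blob`, `LoaderAsParserSketch`, `UpperCaseTextLoader` and `parse_with_handlers` are hypothetical stand-ins written for illustration, and the sketch assumes the wrapped loader takes a `file_path` constructor argument and exposes `load()`.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class Document:
    """Simplified, hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Blob:
    """Simplified, hypothetical stand-in for a Blob: raw bytes plus a source name."""
    data: bytes
    source: str = ""


class LoaderAsParserSketch:
    """Adapts a loader class that takes `file_path` into a blob -> documents parser."""

    def __init__(self, loader_cls: Callable[..., Any], **loader_kwargs: Any) -> None:
        self.loader_cls = loader_cls
        self.loader_kwargs = loader_kwargs

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Keep the original extension so loaders that inspect the suffix still work.
        suffix = os.path.splitext(blob.source)[1]
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = os.path.join(tmp_dir, "blob" + suffix)
            with open(path, "wb") as fh:
                fh.write(blob.data)
            loader = self.loader_cls(file_path=path, **self.loader_kwargs)
            for doc in loader.load():
                # Point the metadata back at the blob, not at the temporary file.
                doc.metadata["source"] = blob.source
                yield doc


class UpperCaseTextLoader:
    """Toy loader for the demo only: reads a text file and upper-cases it."""

    def __init__(self, file_path: str, encoding: str = "utf-8") -> None:
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> List[Document]:
        with open(self.file_path, encoding=self.encoding) as fh:
            return [Document(fh.read().upper(), {"source": self.file_path})]


def parse_with_handlers(
    blob: Blob, handlers: Dict[str, LoaderAsParserSketch]
) -> List[Document]:
    """Dispatch on file extension; blobs without a matching handler are skipped."""
    ext = os.path.splitext(blob.source)[1].lstrip(".").lower()
    parser = handlers.get(ext)
    return list(parser.lazy_parse(blob)) if parser else []


handlers = {"txt": LoaderAsParserSketch(UpperCaseTextLoader, encoding="utf-8")}
docs = parse_with_handlers(Blob(b"hello", "notes/a.txt"), handlers)
```

Writing the blob to a temporary file is just one way to bridge the gap between a blob and a loader that only accepts a path; it trades some I/O overhead for not having to touch each loader's internals, which is the motivation given in the list above.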
Closing in favor of #27749, let me know if I misunderstood the intent.
Description
Currently there is no implementation of an .xlsx parser, and consequently FileSystemBlobLoader as well as all loaders derived from O365BaseLoader are limited to .doc, .docx, .pdf and .txt. This PR uses unstructured to implement an ExcelParser very similar to MsWordParser (with just an extra bit of post-processing).
Dependencies:
Tests
I've implemented a unit test for the .xlsx parser; however, it depends on the unstructured module, and I have a problem adding any version of unstructured to pyproject.toml that works and doesn't break something somewhere. Please let me know if anyone is willing to help me with that.
Twitter handle: @martintriska1
Is anyone willing to review this please? Thanks! @baskaryan, @eyurtsev, @ccurme