Abstractions for document processing #2833
Conversation
some high-level questions:
- why is the difference between blob and document important? is it because there is common logic to be factored out for processing? If so, why can't those just be helper functions to transform str -> str or Document -> Document?
- If blobs are so great, should lazy_load just always return a generator of blobs?
This PR is experimental, trying to see if introducing another data type (blob) will help reduce complexity / make code more re-usable. See the section below for more details. There's code duplication between the loaders, some loaders hard-code parsers that should be configurable, and a lot of loaders hard-code the assumption that the file resides on the local file system.
With option (2), the loaders listed below (s3, azure, directory loader) would yield blobs, allowing one to specify which parser should be used to convert the raw data into a document, and avoiding re-implementation of file handling code.
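For illustration, option (2) might compose roughly like the sketch below; every name here (GenericLoader, yield_blobs, lazy_parse) is an assumption for illustration, not a settled API:

from typing import Iterator, Protocol

class BlobLoader(Protocol):
    # Fetches raw data from some storage system (local fs, s3, azure, ...).
    def yield_blobs(self) -> Iterator["Blob"]: ...

class BlobParser(Protocol):
    # Converts a blob's raw data into one or more documents.
    def lazy_parse(self, blob: "Blob") -> Iterator["Document"]: ...

class GenericLoader:
    """Pairs any blob loader with any parser, so neither is hard-coded."""

    def __init__(self, blob_loader: BlobLoader, blob_parser: BlobParser) -> None:
        self.blob_loader = blob_loader
        self.blob_parser = blob_parser

    def lazy_load(self) -> Iterator["Document"]:
        for blob in self.blob_loader.yield_blobs():
            yield from self.blob_parser.lazy_parse(blob)

The same PDF parsing strategy could then be reused behind the s3, azure, or directory blob loader without re-implementing the fetching code.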
@hwchase17 Zooming out for context; the 2 main issues I see:
Here are a few loaders that should be generic but hard-code the unstructured parser (and also load content eagerly): Directory loader. But langchain already supports different strategies for loading PDFs -- one just can't use them with the s3 loader or the azure loader etc.
Here's an example of the PDF loader implementing utility code to resolve file vs. URLs to load the content: https://github.com/hwchase17/langchain/blob/main/langchain/document_loaders/pdf.py#L37-L37
Proposal (Version 1)
Option #2: Instead of introducing a new
Option #3: Minimal changes for now. Perhaps only add an
TODO
Current
very cool, have a few noob questions
items = p.rglob(self.glob) if self.recursive else p.glob(self.glob)
for item in items:
    if item.is_file():
        if _is_visible(item.relative_to(p)) or self.load_hidden:
can we flip these, since checking load_hidden is (ever so slightly) faster?
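i.e., something like this (taking the flip literally; load_hidden is a plain attribute check, so short-circuiting skips the _is_visible work when it's True):

# Cheap attribute check first; _is_visible(item.relative_to(p)) only runs when needed.
if self.load_hidden or _is_visible(item.relative_to(p)):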
yeah can update -- this was copied from an existing implementation
from langchain.docstore.document import Document
from langchain.document_loaders.blob_loaders import Blob
is it weird that base imports from something more nested?
perhaps BlobParser should live in blob_loaders?
Yeah, we can definitely move the BlobParser; we'll need to decide what's the most logical place.
One thing I wasn't so sure about: the document loaders right now are basically content fetchers + parsers, and the majority of their work should be focused on parsing, so it felt like a parser abstraction could be accommodated at this level of the hierarchy. I'll start diffing smaller, more careful changes now, and we can iron out all the namespaces to make sure everything is in the most logical place.
base (aka typedefs) modules ideally don't import much at all (except for other typedefs), so there's that...
IMO it's generally OK to import more nested code, since the nested code is owned by the importing code. The dangerous imports are when reaching into parent paths or sibling paths since that means that we're reaching into nested code that doesn't belong to the given module/package
self,
path: str,
glob: str = "**/[!.]*",
*,
what're the pros of including this?
"this" being * or the default values?
*
class FileSystemLoader(BlobLoader):
    """Loading logic for loading documents from a directory."""

    def __init__(
@hwchase17 how do we think about when to use pydantic (and let it handle constructors) vs. not?
Good catch -- this probably should've been pydantic; I was moving code around carelessly.
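For illustration, the pydantic version might look roughly like this (a sketch; the fields are taken from the __init__ signature and the load_hidden/recursive attributes used in the glob logic above, and their defaults are assumptions; yield_blobs() is omitted):

from pydantic import BaseModel

class FileSystemLoader(BlobLoader, BaseModel):
    """Loading logic for loading documents from a directory."""

    path: str
    glob: str = "**/[!.]*"
    load_hidden: bool = False
    recursive: bool = False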
def load(self) -> List[Document]:
    """Load all documents eagerly."""
    return list(self.lazy_load())
could this be implemented on BaseLoader?
maybe not super clean, since if users don't want to implement lazy_load they need to override load
Could do either way -- one possibility is another base class that provides a default implementation
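e.g., a minimal sketch of that idea (BaseLazyLoader is a hypothetical name):

from abc import ABC, abstractmethod
from typing import Iterator, List

from langchain.schema import Document

class BaseLazyLoader(ABC):
    """Base class that derives an eager load() from lazy_load()."""

    @abstractmethod
    def lazy_load(self) -> Iterator[Document]:
        """Yield documents one at a time."""

    def load(self) -> List[Document]:
        """Materialize all documents by draining the lazy iterator."""
        return list(self.lazy_load())

Subclasses then only implement lazy_load, and users who want the old eager behavior keep calling load.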
* A blob parser provides a way to parse a blob into one or more documents
"""

@abc.abstractmethod
@abstractmethod
import abc
del
    loader_kwargs: Keyword arguments to pass to loader_cls.
    recursive: If True, will recursively load files.
"""
self.loader = FileSystemLoader(
nit: would something like blob_loader or file_loader be a clearer name?
from langchain.schema import Document


class BaseDocumentProcessor(ABC):
we now have a DocumentTransformer abstraction that seems very similar, should we try combining them?
probably yes -- will need to take a look at those. These abstractions will go in last in the sequencing order.
"""Initialize with path.""" | ||
self.file_path = path | ||
self.loader = FileSystemLoader(path, glob="**/*.md") |
what're your thoughts on having BlobLoaders as instance vars versus inheriting from them? e.g. an alternative implementation could be

class RoamLoader(BaseLoader, FileSystemLoader):
    def __init__(self, path):
        super().__init__(path, glob="**/*.md")

    def lazy_load(self) -> ...:
        for blob in self.yield_blobs():
            yield ...

i don't have a good way of deciding between the two approaches, curious if you do
I would prefer encouraging a compositional pattern, so the loader doesn't end up getting tied to a particular storage system (e.g., the file system); i.e., if the md files are stored on s3 rather than the local file system, we'd want to swap out the loader instead of implementing another RoamLoader class that's specialized for s3.
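i.e., roughly this shape (a sketch: the Blob.as_string() call and the parsing step are placeholders, and "/notes" is a made-up path):

class RoamLoader(BaseLoader):
    """Holds a blob loader rather than inheriting from one."""

    def __init__(self, blob_loader: BlobLoader) -> None:
        self.blob_loader = blob_loader

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blob_loader.yield_blobs():
            # Real code would delegate to a parser here.
            yield Document(page_content=blob.as_string())

# Local file system today, s3 tomorrow -- only the blob loader changes:
loader = RoamLoader(FileSystemLoader("/notes", glob="**/*.md"))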
# Represents a location on the local file system.
# Useful for situations where downstream code assumes it must work with file paths
# rather than in-memory content.
path: Optional[PathLike] = None
So if this was pulled from, e.g., a URL, this would be None? A temp dir?
Undecided -- we can set it to None, or set it to the source URL if known, or set it to a temp location on disk if the content was downloaded to the file system and not stored in memory. Any opinions?
We could extend it to support URLs, but supporting a driver for loading content feels a bit complex; e.g., there's more than one way to fetch an HTML file (one with requests and another with something like playwright to execute the js).
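To make the options concrete, a sketch (assuming Blob carries optional in-memory data alongside path; the constructor shape and html_bytes are assumptions for illustration):

# Option A: fetched from a URL, kept in memory -- no file path exists.
blob = Blob(data=html_bytes, path=None)

# Option B: record the source URL, stretching path beyond local file paths.
blob = Blob(data=html_bytes, path="https://example.com/page.html")

# Option C: spill to a temp file and point path at it, leaving data unset.
blob = Blob(data=None, path="/tmp/tmpdir/page.html")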
I don't have opinions here. You're right, that sounds complex
Returns:
    Blob instance
"""
mimetype = mimetypes.guess_type(path)[0] if guess_type else None
ooc: would it make sense to make the mimetype required and use a pattern like

@classmethod
def from_path(
    cls,
    path: Union[str, PurePath],
    *,
    encoding: str = "utf-8",
    mimetype: Optional[str] = None,
) -> "Blob":
    ...
    if mimetype is None:
        mimetype = mimetypes.guess_type(path)[0]
    ...

?
Good idea
@vowelparrot @dev2049 Thank you for the feedback!
    yield from sub_docs
except Exception as e:
    if self.silent_errors:
        logger.warning(e)
yeah, we should probably update to logger.error, or allow controlling the level
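e.g., one way to make the level controllable (error_level and _load_items are hypothetical names; error_level would default to logging.ERROR):

import logging

logger = logging.getLogger(__name__)

def lazy_load(self):
    try:
        yield from self._load_items()  # hypothetical helper
    except Exception as e:
        if self.silent_errors:
            # Callers pick logging.WARNING, logging.ERROR, etc.
            logger.log(self.error_level, e)
        else:
            raise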
This PR introduces a Blob data type and a Blob loader interface. This is the first of a sequence of PRs that follows this proposal: #2833. The primary goals of these abstractions are: * Decouple content loading from content parsing code. * Help remove duplicated content loading code from document loaders. * Make lazy loading a default for langchain.
Adding lazy iteration for document loaders, following the plan here: #2833. Keeping the `load` method as is for backwards compatibility: `load` returns a materialized list of documents, and downstream users may rely on that fact. A new method that returns an iterable is introduced for handling lazy loading. --------- Co-authored-by: Zander Chase <[email protected]>
This PR adds the BlobParser abstraction. It follows the proposal described here: #2833 (comment)
Closing PR as most of the stuff has already been committed!
WIP