community: Implement DirectoryLoader lazy_load function #19537
lazy_load needs to be at feature parity with load(), or it needs to fail loudly where it is not.
For example, we should add support for the progress bar. This can work by first running the glob to quickly count the files.
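A minimal sketch of the count-first idea, not the code from this PR: run the glob once to collect and count the matching paths, then hand the files out lazily while reporting progress. The function name `iter_files_with_progress` and the `on_progress` callback are hypothetical; the callback stands in for whatever progress bar the loader actually uses.

```python
from pathlib import Path
from typing import Callable, Iterator, Optional


def iter_files_with_progress(
    root: Path,
    glob: str = "**/*",
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> Iterator[Path]:
    """Yield matching files one at a time, reporting (done, total) as we go."""
    # First pass: run the glob eagerly so the total is known up front.
    # Only the paths are held in memory, not the file contents.
    items = [p for p in sorted(root.glob(glob)) if p.is_file()]
    total = len(items)

    # Second pass: yield lazily. Progress is reported when the consumer
    # comes back for the next item, i.e. after it has finished with this one.
    for done, item in enumerate(items, start=1):
        yield item
        if on_progress is not None:
            on_progress(done, total)
```

The trade-off is that the directory listing is no longer streamed, but a list of paths is usually small compared with the documents themselves.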
@@ -197,3 +198,55 @@ def load(self) -> List[Document]:
            pbar.close()

        return docs

    def lazy_load(self) -> Iterator[Document]:
Great! If we're upgrading to lazy_load(), could you delete the load() implementation? (It will proxy to lazy_load automatically.)
Deleted the previous load_file function, and made load proxy to lazy_load
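For readers following along, here is a rough sketch of what "load proxies to lazy_load" means. `SketchLoader` and `LinesLoader` are hypothetical stand-ins, not LangChain classes, and plain strings stand in for `Document` objects; the assumption is only that the base class defines `load()` by draining `lazy_load()`.

```python
from typing import Iterator, List


class SketchLoader:
    """Hypothetical stand-in for a document loader base class."""

    def lazy_load(self) -> Iterator[str]:
        raise NotImplementedError

    def load(self) -> List[str]:
        # Eager loading is just the lazy iterator drained into a list,
        # so subclasses only need to implement lazy_load().
        return list(self.lazy_load())


class LinesLoader(SketchLoader):
    """Hypothetical subclass that overrides lazy_load() only."""

    def __init__(self, text: str) -> None:
        self.text = text

    def lazy_load(self) -> Iterator[str]:
        for line in self.text.splitlines():
            yield line
```

With this shape, a subclass that keeps its own `load()` duplicates logic that the base class would otherwise provide.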
        for i in items:
            yield from self.lazy_load_file(i, p)

    def lazy_load_file(self, item: Path, path: Path) -> Iterator[Document]:
Could you make this private? This is a non-standard part of the document loader API, so we generally do not expect users to rely on it.
Could you also update load_file to proxy to this method, so we can avoid code duplication?
Renamed lazy_load_file to _lazy_load_file. Made load proxy to lazy_load. load_file wasn't used anymore, so I deleted it.
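A rough sketch of the resulting shape, under stated assumptions: every class below is hypothetical, strings stand in for `Document` objects, and the fallback to `load()` for sub-loaders without `lazy_load` is an assumption, since this thread does not show how the PR handles that case.

```python
from pathlib import Path
from typing import Iterator, List


class WholeFileLoader:
    """Hypothetical sub-loader: eager only, one document per file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> List[str]:
        return [Path(self.path).read_text()]


class LineLoader:
    """Hypothetical sub-loader with lazy_load: one document per line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def lazy_load(self) -> Iterator[str]:
        with open(self.path) as fh:
            for line in fh:
                yield line.rstrip("\n")


class SketchDirectoryLoader:
    """Hypothetical directory loader; not the LangChain implementation."""

    def __init__(self, path: str, glob: str, loader_cls: type) -> None:
        self.path = path
        self.glob = glob
        self.loader_cls = loader_cls

    def load(self) -> List[str]:
        # load() proxies to lazy_load(), so there is one code path.
        return list(self.lazy_load())

    def lazy_load(self) -> Iterator[str]:
        for item in sorted(Path(self.path).glob(self.glob)):
            if item.is_file():
                yield from self._lazy_load_file(item)

    def _lazy_load_file(self, item: Path) -> Iterator[str]:
        # Private helper: not part of the public loader API.
        loader = self.loader_cls(str(item))
        lazy = getattr(loader, "lazy_load", None)
        if callable(lazy):
            # Sub-loader is lazy too: stream its sub-documents.
            yield from lazy()
        else:
            # Assumed fallback: use the eager load() result.
            yield from loader.load()
```

Keeping the per-file logic in one private method means `load()` and `lazy_load()` cannot drift apart.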
For the documentation, file directory loader is documented here:
3 options:
@DasDingoCodes I reverted the doc changes for now so we can merge. The doc changes in the mdx file didn't look right, and we don't want the other notebook present given that there's already a place for this documentation. If you want, you can follow up with a PR to replace the mdx file with your original notebook (or somehow combine the content).
…hain-ai#19537)

Thank you for contributing to LangChain!

- [x] **PR title**: "community: Implement DirectoryLoader lazy_load function"
- [x] **Description**: The `lazy_load` function of the `DirectoryLoader` yields each document separately. If the given `loader_cls` of the `DirectoryLoader` also implemented `lazy_load`, it will be used to yield subdocuments of the file.
- [x] **Add tests and docs**: If you're adding a new integration, please include
  1. a test for the integration, preferably unit tests that do not rely on network access: `libs/community/tests/unit_tests/document_loaders/test_directory_loader.py`
  2. an example notebook showing its use. It lives in `docs/docs/integrations` directory: `docs/docs/integrations/document_loaders/directory.ipynb`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17.

---------

Co-authored-by: Eugene Yurtsev <[email protected]>