Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: Implement DirectoryLoader lazy_load function #19537

Merged
merged 23 commits into from
Mar 29, 2024

Conversation

DasDingoCodes
Copy link
Contributor

Thank you for contributing to LangChain!

  • PR title: "community: Implement DirectoryLoader lazy_load function"

  • Description: The lazy_load function of the DirectoryLoader yields each document separately. If the given loader_cls of the DirectoryLoader also implemented lazy_load, it will be used to yield subdocuments of the file.

  • Add tests and docs: If you're adding a new integration, please include

    1. a test for the integration, preferably unit tests that do not rely on network access: libs/community/tests/unit_tests/document_loaders/test_directory_loader.py
    2. an example notebook showing its use. It lives in docs/docs/integrations directory: docs/docs/integrations/document_loaders/directory.ipynb
  • Lint and test: Run make format, make lint and make test from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:

  • Make sure optional dependencies are imported within a function.
  • Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
  • Most PRs should not touch more than one package.
  • Changes should be backwards compatible.
  • If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17.

Copy link

vercel bot commented Mar 25, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 29, 2024 2:46pm

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. Ɑ: doc loader Related to document loader module (not documentation) 🔌: openai Primarily related to OpenAI integrations 🤖:improvement Medium size change to existing code to handle new use-cases labels Mar 25, 2024
Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lazy_load needs to be on feature parity with load() or needs to fail loudly if it's not.

e.g., we should add support for the progress bar -- this can work by first running the glob to quickly count the files

@@ -197,3 +198,55 @@ def load(self) -> List[Document]:
pbar.close()

return docs

def lazy_load(self) -> Iterator[Document]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great if we're upgrading to lazy_load() could you delete the load() implementation? (it will proxy to lazy_load automatically)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted the previous load_file function, and made load proxy to lazy_load

for i in items:
yield from self.lazy_load_file(i, p)

def lazy_load_file(self, item: Path, path: Path) -> Iterator[Document]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this private? This is a non standard part of the document loader API so generally we do not expect users to be relying on it.

Could you also update load_file to proxy to this method, so we can avoid code duplication?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed lazy_load_file to _lazy_load_file. Made load proxy to lazy_load. load_file wasn't used anymore, so I deleted it.

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Mar 27, 2024
@eyurtsev
Copy link
Collaborator

For the documentation, file directory loader is documented here:

docs/docs/modules/data_connection/document_loaders/file_directory.mdx

3 options:

  1. Replace the contents of the original documentation with the new documentation (potentially include any missing stuff) in this PR
  2. same as above but in a separate PR
  3. Don't do documentation since it's already documented

@eyurtsev
Copy link
Collaborator

@DasDingoCodes i reverted doc changes for now so we can merge. The doc changes in the mdx file didn't look right, and we don't want the other notebook present given that there's already a place for this documentation.

If you want you can follow up with a PR to replace the mdx file with your original notebook (or somehow combine the content together)

@eyurtsev eyurtsev enabled auto-merge (squash) March 29, 2024 14:39
@eyurtsev eyurtsev merged commit 73eb3f8 into langchain-ai:master Mar 29, 2024
59 checks passed
gkorland pushed a commit to FalkorDB/langchain that referenced this pull request Mar 30, 2024
…hain-ai#19537)

Thank you for contributing to LangChain!

- [x] **PR title**: "community: Implement DirectoryLoader lazy_load
function"

- [x] **Description**: The `lazy_load` function of the `DirectoryLoader`
yields each document separately. If the given `loader_cls` of the
`DirectoryLoader` also implemented `lazy_load`, it will be used to yield
subdocuments of the file.

- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access:
`libs/community/tests/unit_tests/document_loaders/test_directory_loader.py`
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory:
`docs/docs/integrations/document_loaders/directory.ipynb`


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
@DasDingoCodes DasDingoCodes deleted the directoryloader-lazy-load branch March 30, 2024 10:28
hinthornw pushed a commit that referenced this pull request Apr 26, 2024
Thank you for contributing to LangChain!

- [x] **PR title**: "community: Implement DirectoryLoader lazy_load
function"

- [x] **Description**: The `lazy_load` function of the `DirectoryLoader`
yields each document separately. If the given `loader_cls` of the
`DirectoryLoader` also implemented `lazy_load`, it will be used to yield
subdocuments of the file.

- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access:
`libs/community/tests/unit_tests/document_loaders/test_directory_loader.py`
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory:
`docs/docs/integrations/document_loaders/directory.ipynb`


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging. 🔌: openai Primarily related to OpenAI integrations size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants