* feat: Add support for the LlamaIndex Document type (COG-337)
* docs: Add a Jupyter notebook demonstrating cognee with the LlamaIndex Document type (COG-337)
* feat: Migrate metadata from LlamaIndex documents so it can be used during ingestion (COG-337)
* refactor: Rename the LlamaIndex migration function (COG-337)
* chore: Add the llama-index-core dependency; downgrade tenacity and instructor to versions that support llama-index (COG-337)
* feat: Add an ingest_data_with_metadata task that has access to metadata when data is provided by other data ingestion tools (COG-337)
* docs: Explain why exact type checking is used instead of isinstance, since isinstance also returns True for child classes (COG-337)
* fix: Add a missing parameter to a function call (COG-337)
* refactor: Move data storing from an async to a sync function (COG-337)
* refactor: Modify the ingest_data file in place instead of keeping two ingestion tasks (COG-337)
* refactor: Keep the old name for data ingestion with metadata, merging the new and old ingestion tasks into one (COG-337)
* refactor: Restore the ingest_data and save_data_to_storage tasks (COG-337)
* refactor: Restore the previous ingestion tasks in the add function (COG-337)
* fix: Use a string instead of a dict for the search query in the notebook and the simple example (COG-337)
* refactor: Apply the changes requested in review: add the synchronize trigger label, make the if-statement syntax in the workflow uniform, fix the instructor dependency, and make llama-index optional (COG-337)
* fix: Resolve llama-index being mandatory to run cognee (COG-337)
* fix: Remove remaining llama-index references from the core cognee lib and install llama-index-core from the notebook instead (COG-337)
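Taken together, these changes let cognee ingest LlamaIndex documents directly. A minimal usage sketch, assuming llama-index-core is installed and an LLM API key is configured; the dataset name and sample text are hypothetical, and the entry points are used the way the bundled notebook uses them:

import asyncio

import cognee
from llama_index.core import Document

async def main():
    # Wrap raw text in a LlamaIndex Document, as the demo notebook does.
    documents = [Document(text = "Cognee turns documents into a knowledge graph.")]

    # cognee.add now accepts LlamaIndex Document objects (and their
    # metadata) alongside file paths and plain strings.
    await cognee.add(documents, "llama_index_demo")

    # Downstream processing is unchanged; note that search queries are
    # now plain strings rather than dicts (see the fix above).
    await cognee.cognify(["llama_index_demo"])

asyncio.run(main())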
Showing 16 changed files with 588 additions and 37 deletions.
@@ -0,0 +1,63 @@
name: test | llama index notebook

on:
  workflow_dispatch:
  pull_request:
    branches:
      - main
    types: [labeled, synchronize]

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

env:
  RUNTIME__LOG_LEVEL: ERROR

jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml

  run_notebook_test:
    name: test
    needs: get_docs_changes
    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && github.event.label.name == 'run-checks'
    runs-on: ubuntu-latest
    defaults:
      run:
        shell: bash
    steps:
      - name: Check out
        uses: actions/checkout@master

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11.x'

      - name: Install Poetry
        uses: snok/[email protected]
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      - name: Install dependencies
        run: |
          poetry install --no-interaction --all-extras --no-root
          poetry add jupyter --no-interaction

      - name: Execute Jupyter Notebook
        env:
          ENV: 'dev'
          LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }}
          GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }}
        run: |
          poetry run jupyter nbconvert \
            --to notebook \
            --execute notebooks/cognee_llama_index.ipynb \
            --output executed_notebook.ipynb \
            --ExecutePreprocessor.timeout=1200
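The notebook check can be reproduced locally without the workflow. A rough equivalent of the nbconvert step using its Python API, with the path and the 1200-second timeout taken from the workflow above:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Load and execute the notebook the way the CI step does,
# failing loudly if any cell raises.
nb = nbformat.read("notebooks/cognee_llama_index.ipynb", as_version = 4)
ExecutePreprocessor(timeout = 1200).preprocess(
    nb, {"metadata": {"path": "notebooks/"}}
)

# Write out the executed copy, mirroring --output executed_notebook.ipynb.
with open("notebooks/executed_notebook.ipynb", "w") as file:
    nbformat.write(nb, file)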
@@ -1,2 +1,4 @@
from .ingest_data import ingest_data
from .save_data_to_storage import save_data_to_storage
from .save_data_item_to_storage import save_data_item_to_storage
from .save_data_item_with_metadata_to_storage import save_data_item_with_metadata_to_storage
@@ -0,0 +1,92 @@
import dlt
import cognee.modules.ingestion as ingestion
from typing import Any
from cognee.shared.utils import send_telemetry
from cognee.modules.users.models import User
from cognee.infrastructure.databases.relational import get_relational_engine
from cognee.modules.data.methods import create_dataset
from cognee.modules.users.permissions.methods import give_permission_on_document
from .get_dlt_destination import get_dlt_destination
from .save_data_item_with_metadata_to_storage import save_data_item_with_metadata_to_storage


async def ingest_data_with_metadata(data: Any, dataset_name: str, user: User):
    destination = get_dlt_destination()

    pipeline = dlt.pipeline(
        pipeline_name = "file_load_from_filesystem",
        destination = destination,
    )

    @dlt.resource(standalone = True, merge_key = "id")
    async def data_resources(data: Any, user: User):
        if not isinstance(data, list):
            # Convert data to a list as we work with lists further down.
            data = [data]

        # Process data
        for data_item in data:
            file_path = save_data_item_with_metadata_to_storage(data_item, dataset_name)

            # Ingest data and add metadata
            with open(file_path.replace("file://", ""), mode = "rb") as file:
                classified_data = ingestion.classify(file)

                data_id = ingestion.identify(classified_data)

                file_metadata = classified_data.get_metadata()

                from sqlalchemy import select
                from cognee.modules.data.models import Data

                db_engine = get_relational_engine()

                async with db_engine.get_async_session() as session:
                    dataset = await create_dataset(dataset_name, user.id, session)

                    data_point = (await session.execute(
                        select(Data).filter(Data.id == data_id)
                    )).scalar_one_or_none()

                    if data_point is not None:
                        data_point.name = file_metadata["name"]
                        data_point.raw_data_location = file_metadata["file_path"]
                        data_point.extension = file_metadata["extension"]
                        data_point.mime_type = file_metadata["mime_type"]

                        await session.merge(data_point)
                        await session.commit()
                    else:
                        data_point = Data(
                            id = data_id,
                            name = file_metadata["name"],
                            raw_data_location = file_metadata["file_path"],
                            extension = file_metadata["extension"],
                            mime_type = file_metadata["mime_type"],
                        )

                        dataset.data.append(data_point)
                        await session.commit()

                yield {
                    "id": data_id,
                    "name": file_metadata["name"],
                    "file_path": file_metadata["file_path"],
                    "extension": file_metadata["extension"],
                    "mime_type": file_metadata["mime_type"],
                }

                await give_permission_on_document(user, data_id, "read")
                await give_permission_on_document(user, data_id, "write")

    send_telemetry("cognee.add EXECUTION STARTED", user_id = user.id)
    run_info = pipeline.run(
        data_resources(data, user),
        table_name = "file_metadata",
        dataset_name = dataset_name,
        write_disposition = "merge",
    )
    send_telemetry("cognee.add EXECUTION COMPLETED", user_id = user.id)

    return run_info
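The shape of this task, an async dlt resource yielding one row per file and merged on id, is plain dlt usage. A self-contained sketch of the same merge pattern, assuming dlt is installed with the duckdb extra; the table and row fields here are illustrative, not cognee's schema:

import dlt

@dlt.resource(standalone = True, merge_key = "id")
def file_rows(files):
    # Yield one metadata row per file; with the merge write disposition,
    # rows that share an id replace each other instead of accumulating.
    for file in files:
        yield {"id": file["id"], "name": file["name"]}

pipeline = dlt.pipeline(pipeline_name = "merge_example", destination = "duckdb")

# Running this twice with the same id updates the row rather than duplicating it.
run_info = pipeline.run(
    file_rows([{"id": "1", "name": "a.txt"}]),
    table_name = "file_metadata",
    write_disposition = "merge",
)
print(run_info)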
@@ -0,0 +1,20 @@
from typing import Union, BinaryIO
from cognee.modules.ingestion import save_data_to_file


def save_data_item_to_storage(data_item: Union[BinaryIO, str], dataset_name: str) -> str:
    # data is a file object coming from an upload.
    if hasattr(data_item, "file"):
        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)

    elif isinstance(data_item, str):
        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
        # data is text
        else:
            file_path = save_data_to_file(data_item, dataset_name)
    else:
        raise ValueError(f"Data type not supported: {type(data_item)}")

    return file_path
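For reference, the three branches accept three shapes of input; a hypothetical illustration of each (the returned paths depend on cognee's storage configuration):

from cognee.tasks.ingestion import save_data_item_to_storage

# 1. A file path, with or without the file:// scheme, is passed
#    through with the scheme stripped.
path = save_data_item_to_storage("file:///data/report.pdf", "my_dataset")

# 2. Raw text is written to a new file inside the dataset's storage.
path = save_data_item_to_storage("some raw text to ingest", "my_dataset")

# 3. An upload-style object exposing .file and .filename (for example
#    FastAPI's UploadFile) has its stream saved under its filename.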
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (28 additions, 0 deletions)
@@ -0,0 +1,28 @@
from typing import Union, BinaryIO, Any
from cognee.modules.ingestion import save_data_to_file


def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
    # Dynamic import is used because the llama_index module is optional.
    # For the same reason, Any is accepted as a data item.
    from llama_index.core import Document
    from .transform_data import get_data_from_llama_index

    # Check if data is of type Document or any of its subclasses.
    if isinstance(data_item, Document):
        file_path = get_data_from_llama_index(data_item, dataset_name)

    # data is a file object coming from an upload.
    elif hasattr(data_item, "file"):
        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)

    elif isinstance(data_item, str):
        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
        # data is text
        else:
            file_path = save_data_to_file(data_item, dataset_name)
    else:
        raise ValueError(f"Data type not supported: {type(data_item)}")

    return file_path
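The import happens inside the function so llama_index stays optional, but as committed the call still raises ImportError if an install without llama-index ever reaches this code path. A guarded variant of the same lazy-import pattern, offered as a sketch rather than the committed behavior:

from typing import Any

def is_llama_index_document(data_item: Any) -> bool:
    # Import inside the function so the dependency remains optional.
    try:
        from llama_index.core import Document
    except ImportError:
        # llama-index is not installed, so the item cannot be a Document.
        return False
    return isinstance(data_item, Document)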
@@ -0,0 +1,18 @@
from llama_index.core import Document
from llama_index.core.schema import ImageDocument
from cognee.modules.ingestion import save_data_to_file
from typing import Union


def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
    # Exact type checking is used to ensure the value is not a child class of Document.
    if type(data_point) == Document:
        file_path = data_point.metadata.get("file_path")
        if file_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return file_path
    elif type(data_point) == ImageDocument:
        if data_point.image_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return data_point.image_path
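The exact-type checks matter because ImageDocument subclasses Document, so an isinstance check alone could never reach the image branch. A quick demonstration, assuming llama-index-core is installed; the sample text and path are hypothetical:

from llama_index.core import Document
from llama_index.core.schema import ImageDocument

image_doc = ImageDocument(text = "a caption", image_path = "/tmp/picture.png")

# isinstance is True for subclasses, so it cannot tell the two apart.
print(isinstance(image_doc, Document))    # True

# Exact type comparison distinguishes Document from its children.
print(type(image_doc) is Document)        # False
print(type(image_doc) is ImageDocument)   # True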