Cog 337 llama index support (#186)
* feat: Add support for LlamaIndex Document type

Added support for LlamaIndex Document type

Feature #COG-337

* docs: Add Jupyter Notebook for cognee with llama index document type

Added a Jupyter notebook which demonstrates using cognee with the LlamaIndex Document type (a minimal usage sketch follows this commit message)

Docs #COG-337

* feat: Add metadata migration from LlamaIndex document type

Allow usage of metadata from LlamaIndex documents

Feature #COG-337

* refactor: Change llama index migration function name

Change name of llama index function

Refactor #COG-337

* chore: Add llama index core dependency

Downgrade needed on tenacity and instructor modules to support llama index

Chore #COG-337

* Feature: Add ingest_data_with_metadata task

Added a task that has access to metadata when data is provided by other data ingestion tools

Feature #COG-337

* docs: Add description on why specific type checking is done

Explained why specific type checking is used instead of isinstance, as isinstance returns True for child classes as well

Docs #COG-337

* fix: Add missing parameter to function call

Added missing parameter to function call

Fix #COG-337

* refactor: Move storing of data from async to sync function

Moved data storing from async to sync

Refactor #COG-337

* refactor: Pretend ingest_data was changed instead of having two tasks

Refactored so that the ingest_data file is modified instead of having two separate ingest tasks

Refactor #COG-337

* refactor: Use old name for data ingestion with metadata

Merged new and old data ingestion tasks into one

Refactor #COG-337

* refactor: Return ingest_data and save_data_to_storage Tasks

Returned ingest_data and save_data_to_storage tasks

Refactor #COG-337

* refactor: Return previous ingestion Tasks to add function

Returned previous ingestion tasks to add function

Refactor #COG-337

* fix: Remove dict and use string for search query

Remove dictionary and use string for query in notebook and simple example

Fix COG-337

* refactor: Add changes requested in pull request

Added the following changes that were requested in the pull request:

Added synchronize label,
made the if-statement syntax in the workflow uniform,
fixed the instructor dependency,
made llama-index optional

Refactor COG-337

* fix: Resolve issue with llama-index being mandatory

Resolve issue with llama-index being mandatory to run cognee

Fix COG-337

* fix: Add install of llama-index to notebook

Removed additional references to llama-index from core cognee lib.
Added llama-index-core install to the notebook

Fix COG-337

---------
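For orientation, a minimal sketch of the flow the notebook added by this PR walks through — passing LlamaIndex documents to cognee and querying the result. Import paths and call signatures are assumptions drawn from the repo's examples (the `query=` keyword matches the simple_example.py change in this diff); the notebook itself is authoritative.

```python
import cognee
from cognee.api.v1.search import SearchType
from llama_index.core import Document

# A couple of LlamaIndex documents standing in for real loader output.
documents = [
    Document(text="LlamaIndex is a framework for building LLM applications."),
    Document(text="Cognee turns ingested data into a queryable knowledge graph."),
]

async def demo():
    await cognee.add(documents)   # ingest the LlamaIndex documents
    await cognee.cognify()        # build the knowledge graph
    results = await cognee.search(SearchType.INSIGHTS, query="Tell me about LlamaIndex")
    print(results)
```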
dexters1 authored Nov 17, 2024
1 parent a63490b commit d30adb5
Showing 16 changed files with 588 additions and 37 deletions.
Binary file removed .DS_Store
63 changes: 63 additions & 0 deletions .github/workflows/test_cognee_llama_index_notebook.yml
@@ -0,0 +1,63 @@
name: test | llama index notebook

on:
  workflow_dispatch:
  pull_request:
    branches:
      - main
    types: [labeled, synchronize]

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

env:
  RUNTIME__LOG_LEVEL: ERROR

jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml

  run_notebook_test:
    name: test
    needs: get_docs_changes
    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && github.event.label.name == 'run-checks'
    runs-on: ubuntu-latest
    defaults:
      run:
        shell: bash
    steps:
      - name: Check out
        uses: actions/checkout@master

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11.x'

      - name: Install Poetry
        uses: snok/[email protected]
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      - name: Install dependencies
        run: |
          poetry install --no-interaction --all-extras --no-root
          poetry add jupyter --no-interaction

      - name: Execute Jupyter Notebook
        env:
          ENV: 'dev'
          LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }}
          GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }}
        run: |
          poetry run jupyter nbconvert \
            --to notebook \
            --execute notebooks/cognee_llama_index.ipynb \
            --output executed_notebook.ipynb \
            --ExecutePreprocessor.timeout=1200
Binary file removed cognee/.DS_Store
2 changes: 1 addition & 1 deletion cognee/api/v1/add/add_v2.py
@@ -21,4 +21,4 @@ async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_nam
pipeline = run_tasks(tasks, data, "add_pipeline")

async for result in pipeline:
print(result)
print(result)
@@ -33,4 +33,4 @@ async def check_permission_on_documents(user: User, permission_type: str, docume
has_permissions = all(document_id in resource_ids for document_id in document_ids)

if not has_permissions:
raise PermissionDeniedException(f"User {user.username} does not have {permission_type} permission on documents")
raise PermissionDeniedException(f"User {user.email} does not have {permission_type} permission on documents")
2 changes: 2 additions & 0 deletions cognee/tasks/ingestion/__init__.py
@@ -1,2 +1,4 @@
from .ingest_data import ingest_data
from .save_data_to_storage import save_data_to_storage
from .save_data_item_to_storage import save_data_item_to_storage
from .save_data_item_with_metadata_to_storage import save_data_item_with_metadata_to_storage
2 changes: 1 addition & 1 deletion cognee/tasks/ingestion/ingest_data.py
@@ -3,7 +3,7 @@

from cognee.shared.utils import send_telemetry
from cognee.modules.users.models import User
from cognee.infrastructure.databases.relational import get_relational_config, get_relational_engine
from cognee.infrastructure.databases.relational import get_relational_engine
from cognee.modules.data.methods import create_dataset
from cognee.modules.users.permissions.methods import give_permission_on_document
from .get_dlt_destination import get_dlt_destination
92 changes: 92 additions & 0 deletions cognee/tasks/ingestion/ingest_data_with_metadata.py
@@ -0,0 +1,92 @@
import dlt
import cognee.modules.ingestion as ingestion
from typing import Any
from cognee.shared.utils import send_telemetry
from cognee.modules.users.models import User
from cognee.infrastructure.databases.relational import get_relational_engine
from cognee.modules.data.methods import create_dataset
from cognee.modules.users.permissions.methods import give_permission_on_document
from .get_dlt_destination import get_dlt_destination
from .save_data_item_with_metadata_to_storage import save_data_item_with_metadata_to_storage

async def ingest_data_with_metadata(data: Any, dataset_name: str, user: User):
    destination = get_dlt_destination()

    pipeline = dlt.pipeline(
        pipeline_name = "file_load_from_filesystem",
        destination = destination,
    )

    @dlt.resource(standalone = True, merge_key = "id")
    async def data_resources(data: Any, user: User):
        if not isinstance(data, list):
            # Convert data to a list as we work with lists further down.
            data = [data]

        # Process data
        for data_item in data:

            file_path = save_data_item_with_metadata_to_storage(data_item, dataset_name)

            # Ingest data and add metadata
            with open(file_path.replace("file://", ""), mode = "rb") as file:
                classified_data = ingestion.classify(file)

                data_id = ingestion.identify(classified_data)

                file_metadata = classified_data.get_metadata()

                from sqlalchemy import select
                from cognee.modules.data.models import Data

                db_engine = get_relational_engine()

                async with db_engine.get_async_session() as session:
                    dataset = await create_dataset(dataset_name, user.id, session)

                    data_point = (await session.execute(
                        select(Data).filter(Data.id == data_id)
                    )).scalar_one_or_none()

                    if data_point is not None:
                        data_point.name = file_metadata["name"]
                        data_point.raw_data_location = file_metadata["file_path"]
                        data_point.extension = file_metadata["extension"]
                        data_point.mime_type = file_metadata["mime_type"]

                        await session.merge(data_point)
                        await session.commit()
                    else:
                        data_point = Data(
                            id = data_id,
                            name = file_metadata["name"],
                            raw_data_location = file_metadata["file_path"],
                            extension = file_metadata["extension"],
                            mime_type = file_metadata["mime_type"],
                        )

                        dataset.data.append(data_point)
                        await session.commit()

                yield {
                    "id": data_id,
                    "name": file_metadata["name"],
                    "file_path": file_metadata["file_path"],
                    "extension": file_metadata["extension"],
                    "mime_type": file_metadata["mime_type"],
                }

                await give_permission_on_document(user, data_id, "read")
                await give_permission_on_document(user, data_id, "write")

    send_telemetry("cognee.add EXECUTION STARTED", user_id = user.id)
    run_info = pipeline.run(
        data_resources(data, user),
        table_name = "file_metadata",
        dataset_name = dataset_name,
        write_disposition = "merge",
    )
    send_telemetry("cognee.add EXECUTION COMPLETED", user_id = user.id)

    return run_info
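For illustration, a rough sketch of invoking the new task directly from an async context, using the signature shown above. The `get_default_user` helper is an assumption made for the example (any `User` instance would do), and a working dlt destination is assumed to be configured.

```python
import asyncio
from llama_index.core import Document
from cognee.tasks.ingestion.ingest_data_with_metadata import ingest_data_with_metadata

async def main():
    # Assumption: cognee exposes a default-user helper; substitute any User instance.
    from cognee.modules.users.methods import get_default_user
    user = await get_default_user()

    document = Document(
        text="Cognee supports LlamaIndex documents.",
        metadata={"source": "llama-index"},  # hypothetical metadata for illustration
    )

    run_info = await ingest_data_with_metadata(document, "llama_index_dataset", user)
    print(run_info)

asyncio.run(main())
```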
20 changes: 20 additions & 0 deletions cognee/tasks/ingestion/save_data_item_to_storage.py
@@ -0,0 +1,20 @@
from typing import Union, BinaryIO
from cognee.modules.ingestion import save_data_to_file

def save_data_item_to_storage(data_item: Union[BinaryIO, str], dataset_name: str) -> str:

    # data is a file object coming from upload.
    if hasattr(data_item, "file"):
        file_path = save_data_to_file(data_item.file, dataset_name, filename = data_item.filename)

    elif isinstance(data_item, str):
        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
        # data is text
        else:
            file_path = save_data_to_file(data_item, dataset_name)
    else:
        raise ValueError(f"Data type not supported: {type(data_item)}")

    return file_path
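A quick sketch of how the supported inputs map to storage paths (the paths shown are hypothetical, and the raw-text case depends on cognee's storage configuration):

```python
from cognee.tasks.ingestion import save_data_item_to_storage

# A "file://" or absolute path is returned as a plain filesystem path.
path = save_data_item_to_storage("file:///tmp/report.txt", "my_dataset")
# -> "/tmp/report.txt"

# Raw text is written to a new file in the dataset's storage location.
path = save_data_item_to_storage("Some raw text to ingest.", "my_dataset")
# -> path of the newly created file
```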
28 changes: 28 additions & 0 deletions cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py
@@ -0,0 +1,28 @@
from typing import Union, BinaryIO, Any
from cognee.modules.ingestion import save_data_to_file

def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
    # Dynamic import is used because the llama_index module is optional.
    # For the same reason Any is accepted as a data item.
    from llama_index.core import Document
    from .transform_data import get_data_from_llama_index

    # Check if data is of type Document or any of its subclasses
    if isinstance(data_item, Document):
        file_path = get_data_from_llama_index(data_item, dataset_name)

    # data is a file object coming from upload.
    elif hasattr(data_item, "file"):
        file_path = save_data_to_file(data_item.file, dataset_name, filename = data_item.filename)

    elif isinstance(data_item, str):
        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
        # data is text
        else:
            file_path = save_data_to_file(data_item, dataset_name)
    else:
        raise ValueError(f"Data type not supported: {type(data_item)}")

    return file_path
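A sketch of the LlamaIndex branch, assuming llama-index-core is installed; plain strings and upload file objects still follow the same handling shown in save_data_item_to_storage above. The metadata values are hypothetical.

```python
from llama_index.core import Document
from cognee.tasks.ingestion import save_data_item_with_metadata_to_storage

doc = Document(
    text="Cognee builds knowledge graphs from your documents.",
    metadata={"author": "example"},  # hypothetical metadata for illustration
)

# The Document's text (or its metadata "file_path", if present) ends up as a file on disk.
file_path = save_data_item_with_metadata_to_storage(doc, "my_dataset")
print(file_path)
```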
18 changes: 3 additions & 15 deletions cognee/tasks/ingestion/save_data_to_storage.py
@@ -1,5 +1,5 @@
from typing import Union, BinaryIO
from cognee.modules.ingestion import save_data_to_file
from cognee.tasks.ingestion.save_data_item_to_storage import save_data_item_to_storage

def save_data_to_storage(data: Union[BinaryIO, str], dataset_name) -> list[str]:
if not isinstance(data, list):
@@ -9,19 +9,7 @@ def save_data_to_storage(data: Union[BinaryIO, str], dataset_name) -> list[str]:
file_paths = []

for data_item in data:
# data is a file object coming from upload.
if hasattr(data_item, "file"):
file_path = save_data_to_file(data_item.file, dataset_name, filename = data_item.filename)
file_paths.append(file_path)

if isinstance(data_item, str):
# data is a file path
if data_item.startswith("file://") or data_item.startswith("/"):
file_paths.append(data_item.replace("file://", ""))

# data is text
else:
file_path = save_data_to_file(data_item, dataset_name)
file_paths.append(file_path)
file_path = save_data_item_to_storage(data_item, dataset_name)
file_paths.append(file_path)

return file_paths
18 changes: 18 additions & 0 deletions cognee/tasks/ingestion/transform_data.py
@@ -0,0 +1,18 @@
from llama_index.core import Document
from llama_index.core.schema import ImageDocument
from cognee.modules.ingestion import save_data_to_file
from typing import Union

def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
    # Specific type checking is used to ensure it's not a child class from Document
    if type(data_point) == Document:
        file_path = data_point.metadata.get("file_path")
        if file_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return file_path
    elif type(data_point) == ImageDocument:
        if data_point.image_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return data_point.image_path
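The exact-type check above matters because ImageDocument subclasses Document, so isinstance alone could not tell the two apart. A small sketch of the distinction, assuming llama-index-core is installed and a hypothetical image path:

```python
from llama_index.core import Document
from llama_index.core.schema import ImageDocument

image_doc = ImageDocument(text="a caption", image_path="/tmp/example.png")

print(isinstance(image_doc, Document))    # True  - isinstance also matches the parent class
print(type(image_doc) == Document)        # False - exact type check distinguishes the subclass
print(type(image_doc) == ImageDocument)   # True
```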
3 changes: 1 addition & 2 deletions examples/python/simple_example.py
@@ -27,8 +27,7 @@ async def main():

# Query cognee for insights on the added text
search_results = await cognee.search(
SearchType.INSIGHTS,
{'query': 'Tell me about NLP'}
SearchType.INSIGHTS, query='Tell me about NLP'
)

# Display search results
