
Cog 170 pgvector adapter #158

Merged: 34 commits, Oct 22, 2024
Commits
d68a3be
feat: Add config support for pgvector
dexters1 Oct 11, 2024
c62dfdd
feat: Add PGVectorAdapter
dexters1 Oct 11, 2024
268396a
feature: Checkpoint during pgvector integration development
dexters1 Oct 11, 2024
9fbf2d8
feat: Add PGVector support
dexters1 Oct 17, 2024
9b9ae6c
refactor: Remove unused env parameter
dexters1 Oct 17, 2024
aa26eab
refactor: Remove echo for database
dexters1 Oct 17, 2024
02cd240
feat: Add batch search to PGVectorAdapter
dexters1 Oct 17, 2024
58e5854
Merge branch 'main' of github.com:topoteretes/cognee into COG-170-PGv…
dexters1 Oct 18, 2024
325e6cd
refactor: Rewrite search query
dexters1 Oct 18, 2024
2cd2557
refactor: Add formatting to PGVector Adapter
dexters1 Oct 18, 2024
7f7b015
refactor: Add formatting to create_vector_engine
dexters1 Oct 18, 2024
d2772d2
refactor: Formatting change for create_vector_engine
dexters1 Oct 18, 2024
240c660
refactor: Change raw SQL queries to SQLalchemy ORM for PGVectorAdapter
dexters1 Oct 21, 2024
05e4ef3
fix: Fix pruning of postgres database
dexters1 Oct 21, 2024
f8babba
test: Add test for PGVectorAdapter
dexters1 Oct 21, 2024
9f4b8f2
test: Add github action workflow to run PGVectorAdapter integration test
dexters1 Oct 21, 2024
4c381a3
chore: Add pgvector dependency
dexters1 Oct 21, 2024
9461ba0
chore: Add psycopg2 dependency
dexters1 Oct 21, 2024
2cedcbe
refactor: Change database name in PGVectorAdapter test and workflow
dexters1 Oct 21, 2024
71c1374
refactor: Move serialize_datetime function
dexters1 Oct 22, 2024
4a73505
refactor: Move create_db_and_tables module from vectors to pgvector
dexters1 Oct 22, 2024
a358168
refactor: Add setting of database configs through dictionary
dexters1 Oct 22, 2024
7b2022e
refactor: Move psycopg2 to an optional dependency
dexters1 Oct 22, 2024
88ded6e
Merge branch 'main' of github.com:topoteretes/cognee into COG-170-PGv…
dexters1 Oct 22, 2024
8002db7
chore: Add installing of dependencies along with postgres group
dexters1 Oct 22, 2024
dbc86e2
chore: Add pgvector back to mandatory dependencies
dexters1 Oct 22, 2024
c7ed46d
fix: Change to new syntax for vector_engine_provider
dexters1 Oct 22, 2024
6b9a142
refactor: Fix spacing, remove unused config methods
dexters1 Oct 22, 2024
c78627f
chore: Remove postgres group from pyproject.toml install postgres dep…
dexters1 Oct 22, 2024
d30c337
refactor: Use SQLAlchemyAdapter create_database
dexters1 Oct 22, 2024
0e1533a
chore: Update how postgres dependencies are installed in integration …
dexters1 Oct 22, 2024
195929e
refactor: Fix typo
dexters1 Oct 22, 2024
dc46304
fix: Add missing await statement to LanceDBAdapter and PGVectorAdapter
dexters1 Oct 22, 2024
0c6f019
refactor: Remove broad exception handling from PGVectorAdapter
dexters1 Oct 22, 2024
14 changes: 8 additions & 6 deletions .env.template
@@ -7,24 +7,26 @@ GRAPHISTRY_PASSWORD=

 SENTRY_REPORTING_URL=

-GRAPH_DATABASE_PROVIDER="neo4j" # or "networkx"
+# "neo4j" or "networkx"
+GRAPH_DATABASE_PROVIDER="neo4j"
 # Not needed if using networkx
 GRAPH_DATABASE_URL=
 GRAPH_DATABASE_USERNAME=
 GRAPH_DATABASE_PASSWORD=

-VECTOR_DB_PROVIDER="qdrant" # or "weaviate" or "lancedb"
-# Not needed if using "lancedb"
+# "qdrant", "pgvector", "weaviate" or "lancedb"
+VECTOR_DB_PROVIDER="qdrant"
+# Not needed if using "lancedb" or "pgvector"
 VECTOR_DB_URL=
 VECTOR_DB_KEY=

-# Database provider
-DB_PROVIDER="sqlite" # or "postgres"
+# Relational Database provider "sqlite" or "postgres"
+DB_PROVIDER="sqlite"

 # Database name
 DB_NAME=cognee_db

-# Postgres specific parameters (Only if Postgres is run)
+# Postgres specific parameters (Only if Postgres or PGVector is used)
 DB_HOST=127.0.0.1
 DB_PORT=5432
 DB_USERNAME=cognee
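For context, here is a minimal sketch of supplying the same pgvector-related settings programmatically instead of through the .env file. The variable names mirror the template above; the values are placeholders.

```python
import os

# Hypothetical local setup: the relational store and the vector store share
# the same Postgres instance, which must have the pgvector extension available.
os.environ["DB_PROVIDER"] = "postgres"
os.environ["DB_NAME"] = "cognee_db"
os.environ["DB_HOST"] = "127.0.0.1"
os.environ["DB_PORT"] = "5432"
os.environ["DB_USERNAME"] = "cognee"
os.environ["DB_PASSWORD"] = "cognee"  # placeholder credential

os.environ["VECTOR_DB_PROVIDER"] = "pgvector"
# VECTOR_DB_URL and VECTOR_DB_KEY are not needed for "lancedb" or "pgvector".
```

Setting these before cognee is imported is the safest way to make sure the configuration objects pick them up.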
67 changes: 67 additions & 0 deletions .github/workflows/test_pgvector.yml
@@ -0,0 +1,67 @@
name: test | pgvector

on:
  pull_request:
    branches:
      - main
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

env:
  RUNTIME__LOG_LEVEL: ERROR

jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml

  run_pgvector_integration_test:
    name: test
    needs: get_docs_changes
    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true'
    runs-on: ubuntu-latest
    defaults:
      run:
        shell: bash
    services:
      postgres:
        image: pgvector/pgvector:pg17
        env:
          POSTGRES_USER: cognee
          POSTGRES_PASSWORD: cognee
          POSTGRES_DB: cognee_db
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432

    steps:
      - name: Check out
        uses: actions/checkout@master

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11.x'

      - name: Install Poetry
        uses: snok/[email protected]
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      - name: Install dependencies
        run: poetry install -E postgres --no-interaction

      - name: Run default PGVector
        env:
          ENV: 'dev'
          LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: poetry run python ./cognee/tests/test_pgvector.py
4 changes: 2 additions & 2 deletions README.md
@@ -190,11 +190,11 @@ Cognee supports a variety of tools and services for different operations:

 - **Local Setup**: By default, LanceDB runs locally with NetworkX and OpenAI.

-- **Vector Stores**: Cognee supports Qdrant and Weaviate for vector storage.
+- **Vector Stores**: Cognee supports LanceDB, Qdrant, PGVector and Weaviate for vector storage.

 - **Language Models (LLMs)**: You can use either Anyscale or Ollama as your LLM provider.

-- **Graph Stores**: In addition to LanceDB, Neo4j is also supported for graph storage.
+- **Graph Stores**: In addition to NetworkX, Neo4j is also supported for graph storage.

 - **User management**: Create individual user graphs and manage permissions

2 changes: 1 addition & 1 deletion cognee/api/client.py
@@ -374,7 +374,7 @@ class LLMConfigDTO(InDTO):
     api_key: str

 class VectorDBConfigDTO(InDTO):
-    provider: Union[Literal["lancedb"], Literal["qdrant"], Literal["weaviate"]]
+    provider: Union[Literal["lancedb"], Literal["qdrant"], Literal["weaviate"], Literal["pgvector"]]
     url: str
     api_key: str

7 changes: 5 additions & 2 deletions cognee/api/v1/add/add.py
@@ -8,15 +8,18 @@
 from cognee.modules.ingestion import get_matched_datasets, save_data_to_file
 from cognee.shared.utils import send_telemetry
 from cognee.base_config import get_base_config
-from cognee.infrastructure.databases.relational import get_relational_engine, create_db_and_tables
+from cognee.infrastructure.databases.relational import get_relational_engine
 from cognee.modules.users.methods import get_default_user
 from cognee.tasks.ingestion import get_dlt_destination
 from cognee.modules.users.permissions.methods import give_permission_on_document
 from cognee.modules.users.models import User
 from cognee.modules.data.methods import create_dataset
+from cognee.infrastructure.databases.relational import create_db_and_tables as create_relational_db_and_tables
+from cognee.infrastructure.databases.vector.pgvector import create_db_and_tables as create_pgvector_db_and_tables

 async def add(data: Union[BinaryIO, List[BinaryIO], str, List[str]], dataset_name: str = "main_dataset", user: User = None):
-    await create_db_and_tables()
+    await create_relational_db_and_tables()
+    await create_pgvector_db_and_tables()
Comment on lines +21 to +22
🛠️ Refactor suggestion

Optimize Database Initialization Based on Configuration

Initializing both databases every time add is called may introduce unnecessary overhead, especially if only one database is in use. Consider initializing databases conditionally based on the configuration or the specific requirements of the operation.

Would you like assistance in modifying the code to initialize databases conditionally based on configuration settings?


⚠️ Potential issue

Consider Adding Error Handling for Database Initialization

Initializing both databases without error handling might lead to unhandled exceptions if a database fails to initialize. Consider wrapping the initialization calls in try-except blocks to handle potential exceptions gracefully.

Apply this diff to include exception handling:

 async def add(data: Union[BinaryIO, List[BinaryIO], str, List[str]], dataset_name: str = "main_dataset", user: User = None):
     try:
-        await create_relational_db_and_tables()
+        await create_relational_db_and_tables()
     except Exception as e:
+        # Handle or log the exception for the relational database
+        logger.error(f"Failed to initialize relational database: {e}")
+        raise

     try:
-        await create_pgvector_db_and_tables()
+        await create_pgvector_db_and_tables()
     except Exception as e:
+        # Handle or log the exception for the pgvector database
+        logger.error(f"Failed to initialize pgvector database: {e}")
+        raise
📝 Committable suggestion
Suggested change
-    await create_relational_db_and_tables()
-    await create_pgvector_db_and_tables()
+    try:
+        await create_relational_db_and_tables()
+    except Exception as e:
+        # Handle or log the exception for the relational database
+        logger.error(f"Failed to initialize relational database: {e}")
+        raise
+    try:
+        await create_pgvector_db_and_tables()
+    except Exception as e:
+        # Handle or log the exception for the pgvector database
+        logger.error(f"Failed to initialize pgvector database: {e}")
+        raise


     if isinstance(data, str):
         if "data://" in data:
6 changes: 4 additions & 2 deletions cognee/api/v1/add/add_v2.py
@@ -3,10 +3,12 @@
 from cognee.modules.users.methods import get_default_user
 from cognee.modules.pipelines import run_tasks, Task
 from cognee.tasks.ingestion import save_data_to_storage, ingest_data
-from cognee.infrastructure.databases.relational import create_db_and_tables
+from cognee.infrastructure.databases.relational import create_db_and_tables as create_relational_db_and_tables
+from cognee.infrastructure.databases.vector.pgvector import create_db_and_tables as create_pgvector_db_and_tables

 async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = "main_dataset", user: User = None):
-    await create_db_and_tables()
+    await create_relational_db_and_tables()
+    await create_pgvector_db_and_tables()

     if user is None:
         user = await get_default_user()
24 changes: 24 additions & 0 deletions cognee/api/v1/config/config.py
@@ -95,6 +95,30 @@ def set_vector_db_provider(vector_db_provider: str):
         vector_db_config = get_vectordb_config()
         vector_db_config.vector_db_provider = vector_db_provider

+    @staticmethod
+    def set_relational_db_config(config_dict: dict):
+        """
+        Updates the relational db config with values from config_dict.
+        """
+        relational_db_config = get_relational_config()
+        for key, value in config_dict.items():
+            if hasattr(relational_db_config, key):
+                object.__setattr__(relational_db_config, key, value)
+            else:
+                raise AttributeError(f"'{key}' is not a valid attribute of the config.")
+
+    @staticmethod
+    def set_vector_db_config(config_dict: dict):
+        """
+        Updates the vector db config with values from config_dict.
+        """
+        vector_db_config = get_vectordb_config()
+        for key, value in config_dict.items():
+            if hasattr(vector_db_config, key):
+                object.__setattr__(vector_db_config, key, value)
+            else:
+                raise AttributeError(f"'{key}' is not a valid attribute of the config.")
+
     @staticmethod
     def set_vector_db_key(db_key: str):
         vector_db_config = get_vectordb_config()
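A hedged usage sketch of the new dictionary-based setters follows. It assumes the config class is exposed as cognee.config (as elsewhere in the cognee API) and uses the attribute names referenced in this PR (db_host, db_port, vector_db_provider, and so on); any unknown key raises AttributeError.

```python
import cognee

# Assumed attribute names; any key that is not already an attribute of the
# corresponding config object raises AttributeError.
cognee.config.set_relational_db_config({
    "db_provider": "postgres",
    "db_host": "127.0.0.1",
    "db_port": "5432",
    "db_username": "cognee",
    "db_password": "cognee",  # placeholder credential
    "db_name": "cognee_db",
})

cognee.config.set_vector_db_config({
    "vector_db_provider": "pgvector",
})
```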
@@ -119,6 +119,8 @@ async def delete_database(self):
             self.db_path = None
         else:
             async with self.engine.begin() as connection:
+                # Load the schema information into the MetaData object
+                await connection.run_sync(Base.metadata.reflect)
                 for table in Base.metadata.sorted_tables:
                     drop_table_query = text(f"DROP TABLE IF EXISTS {table.name} CASCADE")
                     await connection.execute(drop_table_query)
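For context: without the reflect step, Base.metadata only knows about ORM-mapped tables, so tables created outside the ORM (for example, staging tables produced during ingestion) would survive the drop. A standalone sketch of the same pattern, under an assumed connection URL:

```python
from sqlalchemy import MetaData, text
from sqlalchemy.ext.asyncio import create_async_engine

async def drop_all_tables(database_url: str) -> None:
    """Drop every table in the target database, not just ORM-mapped ones."""
    engine = create_async_engine(database_url)
    metadata = MetaData()
    async with engine.begin() as connection:
        # reflect() is synchronous, so it has to run through run_sync()
        await connection.run_sync(metadata.reflect)
        for table in metadata.sorted_tables:
            await connection.execute(text(f'DROP TABLE IF EXISTS "{table.name}" CASCADE'))
```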
21 changes: 21 additions & 0 deletions cognee/infrastructure/databases/vector/create_vector_engine.py
@@ -1,5 +1,7 @@
 from typing import Dict

+from ..relational.config import get_relational_config
+
 class VectorConfig(Dict):
     vector_db_url: str
     vector_db_key: str
@@ -26,6 +28,25 @@ def create_vector_engine(config: VectorConfig, embedding_engine):
             api_key = config["vector_db_key"],
             embedding_engine = embedding_engine
         )
+    elif config["vector_db_provider"] == "pgvector":
+        from .pgvector.PGVectorAdapter import PGVectorAdapter
+
+        # Get configuration for postgres database
+        relational_config = get_relational_config()
+        db_username = relational_config.db_username
+        db_password = relational_config.db_password
+        db_host = relational_config.db_host
+        db_port = relational_config.db_port
+        db_name = relational_config.db_name
+
+        connection_string: str = (
+            f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
+        )
Comment on lines +42 to +44

🛠️ Refactor suggestion

Use SQLAlchemy's URL object for safer connection strings

Leveraging SQLAlchemy's URL object improves readability and ensures proper handling of credentials and connection parameters.

Refactor the code as follows:

+from sqlalchemy.engine import URL
+
         # Get configuration for postgres database
         relational_config = get_relational_config()
         db_username = relational_config.db_username
         db_password = relational_config.db_password
         db_host = relational_config.db_host
         db_port = relational_config.db_port
         db_name = relational_config.db_name

-        connection_string: str = (
-            f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
-        )
+        connection_url = URL.create(
+            drivername="postgresql+asyncpg",
+            username=db_username,
+            password=db_password,
+            host=db_host,
+            port=db_port,
+            database=db_name
+        )

Update the adapter initialization:

-        return PGVectorAdapter(connection_string,
+        return PGVectorAdapter(connection_url,
                               config["vector_db_key"],
                               embedding_engine
        )

Committable suggestion was skipped due to low confidence.


⚠️ Potential issue

Ensure special characters in credentials are URL-encoded

When constructing the connection string, special characters in db_username and db_password could lead to connection issues or security vulnerabilities if not properly URL-encoded.

Apply this diff to URL-encode the credentials:

+from urllib.parse import quote_plus
+
         # Get configuration for postgres database
         relational_config = get_relational_config()
         db_username = relational_config.db_username
         db_password = relational_config.db_password
         db_host = relational_config.db_host
         db_port = relational_config.db_port
         db_name = relational_config.db_name

         connection_string: str = (
-            f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
+            f"postgresql+asyncpg://{quote_plus(db_username)}:{quote_plus(db_password)}@{db_host}:{db_port}/{db_name}"
         )

Committable suggestion was skipped due to low confidence.


+        return PGVectorAdapter(connection_string,
+            config["vector_db_key"],
+            embedding_engine
+        )
     else:
         from .lancedb.LanceDBAdapter import LanceDBAdapter

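For illustration, a hedged sketch of calling the factory with a pgvector configuration. In practice cognee invokes this internally; the embedding engine below is a stand-in, and the DB_* settings are assumed to point at a Postgres instance with the pgvector extension enabled.

```python
from cognee.infrastructure.databases.vector.create_vector_engine import create_vector_engine

class StubEmbeddingEngine:
    """Placeholder for illustration; cognee supplies its real embedding engine."""

config = {
    "vector_db_provider": "pgvector",
    "vector_db_url": "",  # ignored for pgvector; the connection is built from the relational config
    "vector_db_key": "",
}

# Returns a PGVectorAdapter wired to the Postgres settings (DB_HOST, DB_PORT, ...).
vector_engine = create_vector_engine(config, StubEmbeddingEngine())
```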