AI Research Assistant with Semantic Document Search System

This is a submission for the Open Source AI Challenge with pgai and Ollama.

What I Built

This is an AI-based research assistant with a semantic document search system that stores documents and retrieves them with natural language queries. Ollama is integrated into the assistant to summarise the provided content and to generate sentiment analysis, key points, and related topics for it. Streamlit provides a minimal user interface.

System Architecture

flowchart TD
    subgraph Frontend["User Interface Layer"]
        UI[Streamlit Interface]
        Upload[Document Upload]
        Search[Natural Language Search]
        Analysis[Document Analysis]
        style Frontend fill:#f5f5f5,stroke:#333,stroke-width:2px
    end

    subgraph ML["Machine Learning Layer"]
        ST[Sentence Transformer<br/>all-MiniLM-L6-v2]
        OL[Ollama LLM<br/>Document Analysis]
        style ML fill:#e8f4f8,stroke:#333,stroke-width:2px
    end

    subgraph Core["Processing Layer"]
        DP[Document Processor]
        EP[Embedding Generator<br/>384-dimensional]
        SP[Search Engine<br/>Cosine Similarity]
        AP[Analysis Engine]
        style Core fill:#f0f9ef,stroke:#333,stroke-width:2px
    end

    subgraph DB["Database Layer"]
        PG[(TimescaleDB/PostgreSQL)]
        subgraph Extensions["Database Extensions"]
            PV[pgvector<br/>Vector Operations]
            PA[pgai<br/>AI Features]
            IV[IVFFlat Index<br/>Search Optimization]
            JB[JSONB<br/>Metadata Storage]
            style Extensions fill:#fff3e6,stroke:#333,stroke-width:2px
        end
        style DB fill:#fcf2f2,stroke:#333,stroke-width:2px
    end

    %% Connections
    UI --> Upload & Search & Analysis
    Upload --> DP
    Search --> SP
    Analysis --> AP
    
    DP --> EP
    EP --> ST --> EP
    SP --> ST
    AP --> OL --> AP
    
    EP & SP & AP --> PG
    
    PV & PA & IV & JB -.-> PG

classDef primary fill:#4a90e2,stroke:#357abd,stroke-width:2px,color:white
classDef secondary fill:#57b894,stroke:#45937a,stroke-width:2px,color:white
classDef ai fill:#f39c12,stroke:#d68910,stroke-width:2px,color:white
classDef db fill:#e74c3c,stroke:#c0392b,stroke-width:2px,color:white

class UI,Upload,Search,Analysis primary
class DP,EP,SP,AP secondary
class ST,OL ai
class PG,PV,PA,IV,JB db

System Components

1. User Interface Layer

Streamlit Interface: User-friendly web interface for all operations
Document Upload: Supports single files and batch CSV uploads
Natural Language Search: Semantic search interface
Document Analysis: AI-powered content analysis tools

2. Machine Learning Layer

Sentence Transformer: Generates 384-dimensional embeddings using all-MiniLM-L6-v2 model
Ollama LLM: Performs advanced document analysis
- Document summarization
- Sentiment analysis
- Key points extraction
- Related topics generation
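
The four analysis tasks above could be sent to a local Ollama server roughly as follows. This is a minimal sketch, not the project's actual code: the prompt wording, the `PROMPTS` table, and the function names are assumptions made for illustration. It relies only on Ollama's `/api/generate` endpoint on the default port 11434 and the `mistral` model pulled in the installation steps.

```python
import json
import urllib.request

# Default endpoint of a locally running `ollama serve`.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical prompt templates, one per analysis task listed above.
PROMPTS = {
    "summary": "Summarize the following document in a short paragraph:\n\n{text}",
    "sentiment": "Classify the sentiment of the following document as "
                 "positive, negative, or neutral, and explain briefly:\n\n{text}",
    "key_points": "List the key points of the following document:\n\n{text}",
    "related_topics": "Suggest related topics for the following document:\n\n{text}",
}


def build_request(task, text, model="mistral"):
    """Build the JSON payload for a non-streaming /api/generate call."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task}")
    return {
        "model": model,
        "prompt": PROMPTS[task].format(text=text),
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def analyze(task, text, model="mistral"):
    """Send the request to a running Ollama server and return the generated text."""
    data = json.dumps(build_request(task, text, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With `ollama serve` running, `analyze("summary", document_text)` would return the model's summary as a string. `build_request` can be inspected on its own without a server.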

3. Processing Layer

Document Processor: Handles document parsing and preparation
Embedding Generator: Creates and manages vector embeddings
Search Engine: Implements cosine similarity search
Analysis Engine: Coordinates AI-powered document analysis
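
To illustrate what the Search Engine's cosine similarity ranking does, here is a small pure-Python sketch. It is not the project's implementation (the database performs this step through pgvector); the function names and the toy 3-dimensional vectors are assumptions, standing in for the 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings.
docs = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 1.0, 0.0]}
ranking = rank_documents([1.0, 0.0, 0.0], docs)
```

Here `ranking` lists document `a` first (identical direction to the query), then `c`, then `b` (orthogonal to the query).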

4. Database Layer

TimescaleDB/PostgreSQL: Primary database system
Database Extensions:
- pgvector: Vector similarity operations
- pgai: Database AI capabilities
- IVFFlat Index: Search optimization
- JSONB: Flexible metadata storage
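
A table combining these pieces might look like the sketch below. The table name `documents`, its column names, and the `lists = 100` setting are assumptions for illustration, not the project's actual schema. The `vector(384)` column type, the `ivfflat` index method, and the `vector_cosine_ops` operator class come from pgvector.

```python
def schema_statements(table="documents", dim=384, lists=100):
    """Return SQL statements for a hypothetical document table with a vector index."""
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    return [
        "CREATE EXTENSION IF NOT EXISTS vector",
        (
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "id BIGSERIAL PRIMARY KEY, "
            "title TEXT, "
            "content TEXT NOT NULL, "
            "metadata JSONB DEFAULT '{}'::jsonb, "  # flexible metadata storage
            f"embedding vector({int(dim)})"          # one embedding per document
            ")"
        ),
        (
            f"CREATE INDEX IF NOT EXISTS {table}_embedding_idx ON {table} "
            f"USING ivfflat (embedding vector_cosine_ops) WITH (lists = {int(lists)})"
        ),
    ]


def apply_schema(conn, **kwargs):
    """Run the statements on an open psycopg2 connection (passed in by the caller)."""
    with conn.cursor() as cur:
        for statement in schema_statements(**kwargs):
            cur.execute(statement)
    conn.commit()
```

An IVFFlat index generally gives better recall when it is built after the table already holds data, so in practice the index statement may be better run after the first batch upload.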

You can search data stored in the PostgreSQL database using natural language. The system uses pgvector for vector similarity search and pgai, through TimescaleDB, for AI features. This is most helpful when you have to manage and search large collections of documents by meaning rather than by keywords alone.
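
A natural language search of this kind typically embeds the query and orders rows by pgvector's cosine distance operator `<=>`. The sketch below only builds the SQL and its parameters. The table and column names (`documents`, `embedding`, and so on) are assumptions, and the query embedding is expected to come from the same all-MiniLM-L6-v2 model used at indexing time.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_query(query_embedding, limit=5, table="documents"):
    """Return (sql, params) for a cosine-distance search over a hypothetical table."""
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    vec = to_vector_literal(query_embedding)
    sql = (
        "SELECT id, title, content, metadata, "
        "1 - (embedding <=> %s::vector) AS similarity "  # distance -> similarity
        f"FROM {table} "
        "ORDER BY embedding <=> %s::vector "             # nearest first
        "LIMIT %s"
    )
    return sql, (vec, vec, int(limit))
```

With psycopg2, the result can be passed straight to a cursor: `cur.execute(*build_search_query(embedding))`, where `embedding` is the query vector as a plain list of floats.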

Key Features:

Uses Ollama to summarise documents and to generate sentiment analysis, key points, and related topics
Semantic search capability using document embeddings, powered by pgai
Batch document processing (directly upload CSV files)
User-friendly interface built with Streamlit
Document addition and indexing from GUI
Rich metadata support for categorization
Simple table view and a detailed view for data
Scalable vector search using both pgvector's IVFFlat indexing and the pgvectorscale extension
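
As an illustration of the batch CSV upload feature, the following sketch turns CSV text into document records ready for embedding. The column names `title` and `content` are assumptions, since the expected CSV layout is not specified here. Any other column is treated as metadata.

```python
import csv
import io


def parse_documents_csv(csv_text):
    """Parse CSV text into a list of {title, content, metadata} dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        title = (row.get("title") or "").strip()
        content = (row.get("content") or "").strip()
        if not content:
            continue  # nothing to embed, so skip the row
        # Every remaining non-empty column becomes a metadata field.
        metadata = {
            key: value
            for key, value in row.items()
            if key is not None and key not in ("title", "content") and value
        }
        docs.append({"title": title, "content": content, "metadata": metadata})
    return docs


sample = "title,content,category\nIntro,Vectors encode meaning.,notes\nEmpty,,misc\n"
parsed = parse_documents_csv(sample)
```

In this example `parsed` holds a single record, because the second row has no content. The `metadata` dict maps naturally onto the JSONB metadata column.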

The initial idea was to develop a semantic document search system. I later extended it into an AI research assistant that combines the same document search system with Ollama integration.

Demo

Because of problems with hosting Ollama along with the assistant app, only the semantic document search tool demo is hosted.

Tools Used

Ollama + pgvector + pgai + Streamlit

Ollama to summarise provided content and to generate sentiment analysis, key points, and related topics
TimescaleDB (PostgreSQL) as the primary database (can also be configured for self-hosted PostgreSQL)
pgvector for efficient vector similarity search
pgai through TimescaleDB for AI features
Streamlit for the web interface

Key Technologies

Database Layer
- pgvector extension for vector operations
- pgai extension for AI features
- IVFFlat indexing for efficient similarity search
- JSONB data type for flexible metadata storage
Machine Learning
- Sentence-Transformers (all-MiniLM-L6-v2 model)
- 384-dimensional embeddings for semantic representation
Backend
- Python 3.12+
- psycopg2 for PostgreSQL interaction
- Vector similarity calculations using cosine distance
Frontend
- Streamlit for the web interface
- Pandas for data display
- Download data as CSV files

Installation

Using Timescale Cloud

Create a Timescale Service
- Open Timescale Cloud Console and create a service
- In the AI tab, enable the ai and vector extensions
- Select the Python app option and copy the database connection URL

Configure Environment

Edit the src/.env file with the copied URL:

PSQL_URL=postgres://username:password@hostname:port/dbname?sslmode=require

Install Ollama and pull a model for the assistant (make sure the model name is set in the script)

curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral
ollama serve

Install Requirements

pip install -r requirements.txt
# or if you have poetry
poetry install && poetry shell

Run the Assistant

cd src
streamlit run assistant.py

Run the Document Search Tool

cd src
streamlit run main.py

Self-Hosted PostgreSQL

Install PostgreSQL and Extensions

# Install PostgreSQL
sudo apt-get install postgresql postgresql-common

# Install pgvector (replace 12 with your installed PostgreSQL major version)
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-12-pgvector

# Install pgai
# https://github.com/timescale/pgai/tree/main?tab=readme-ov-file#install-from-source

Configure Database

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS ai CASCADE;

Configure Environment

PSQL_URL=postgresql://user:password@localhost:5432/dbname

Alternatively, configure the connection within the script:

 db_params = {
     'dbname': 'dbname',
     'user': 'postgres',
     'password': 'your_password',
     'host': 'localhost',
     'port': '5432'
 }
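
The two configuration options could be reconciled with a small helper like the one below. It is a sketch under assumptions: the function name `resolve_db_params` and the rule that `PSQL_URL` takes precedence over the in-script dictionary are illustrative, not taken from the project.

```python
import os
from urllib.parse import parse_qsl, unquote, urlsplit

# Fallback values mirroring the in-script configuration.
DEFAULT_DB_PARAMS = {
    "dbname": "dbname",
    "user": "postgres",
    "password": "your_password",
    "host": "localhost",
    "port": "5432",
}


def resolve_db_params(env=None, defaults=DEFAULT_DB_PARAMS):
    """Return connection parameters from PSQL_URL if set, else the defaults."""
    env = os.environ if env is None else env
    url = env.get("PSQL_URL")
    if not url:
        return dict(defaults)
    parts = urlsplit(url)
    params = {
        "dbname": parts.path.lstrip("/"),
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),  # decode %-escapes such as %40
        "host": parts.hostname or "localhost",
        "port": str(parts.port or 5432),
    }
    # Carry over URL options such as sslmode=require.
    params.update(parse_qsl(parts.query))
    return params
```

The returned dictionary uses libpq parameter names, so it can be passed to `psycopg2.connect(**resolve_db_params())`.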

Final Thoughts

This project is about integrating AI vector search features with traditional databases, which can be hard to get used to. The same search tool is the basis for an AI research assistant with Ollama integration. It is useful for content management systems that need to manage and search large collections of documents. Integrating pgvector and pgai keeps storage, vector search, and AI features together in one PostgreSQL database.

TODO

Better visualization of results using charts
Batch document processing (import CSV)
Delete and update functionality for documents
Filtering based on metadata
More use cases of pgai

rahulsamant37/AI-Research-Assistant
