feat: Telegram and Text Files connectors (vana-com#20)

vana-com · Feb 28, 2024 · 02cc326
1 parent c9d3913
commit 02cc326

Showing 21 changed files with 354 additions and 53 deletions.
diff --git a/README.md b/README.md
@@ -10,17 +10,20 @@
 
 Selfie personalizes text generation, augmenting both local and hosted Large Language Models (LLMs) with your personal data. Selfie exposes an OpenAI-compatible API that wraps the LLM of your choice, and automatically injects relevant context into each text generation request.
 
+<img alt="selfie-augmentation" src="./docs/images/playground-augmentation.png" width="100%">
+
 ## Features
 
 * Automatically mix your data into chat and text completions using OpenAI-compatible clients like [OpenAI SDKs](https://platform.openai.com/docs/libraries), [SillyTavern](https://sillytavernai.com), and [Instructor](https://github.com/jxnl/instructor)* (untested).
 * Quickly drop in personal messaging data exported from WhatsApp and Google Messages.
 * Runs locally by default to keep your data private.
 * Unopinionated compatibility with your LLM or provider of choice.
 * Easily switch to vanilla text generation modes.
+* Directly and selectively query loaded data.
 
 On the roadmap:
-* Load data using any [LlamaHub loader](https://llamahub.ai/?tab=loaders) (partial support is available through the API).
-* Directly and selectively query loaded data.
+
+[//]: # (* Load data using any [LlamaHub loader]&#40;https://llamahub.ai/?tab=loaders&#41;.)
 * Easy deployment with Docker and pre-built executables.
 
 ## Overview
@@ -60,56 +63,29 @@ This starts a local web server and should launch the UI in your browser at http:
 
 > Note: You can host selfie at a publicly-accessible URL with [ngrok](https://ngrok.com). Add your ngrok token (and optionally, ngrok domain) in `selfie/.env` and run `poetry run python -m selfie --share`.
 
-### Step 1:  Gather Messaging Data
-
-Future versions of Selfie will support loading any text data. For now, you can import chat logs from popular messaging platforms.
-
-> Note: If you don't have any chat logs or want to try the app first, you can use the example chat logs provided in the `example-chats` directory.)
-
-Export chats that you use frequently and contain information you want the LLM to know.
-
-#### Export Instructions
-
-The following links provide instructions for exporting chat logs from popular messaging platforms:
-
-* [WhatsApp](https://faq.whatsapp.com/1180414079177245/?cms_platform=android)
-* [Google](https://takeout.google.com/settings/takeout) (select Messages from the list)
-
-These platforms are not yet supported, but you can create a parser in selfie/parsers/chats to support them (please contribute!):
-
-* [Instagram](https://help.instagram.com/181231772500920)
-* [Facebook Messenger](https://www.facebook.com/help/messenger-app/713635396288741/?cms_platform=iphone-app&helpref=platform_switcher)
-* [Telegram](https://www.maketecheasier.com/export-telegram-chat-history/)
+### Step 1: Import Your Data
 
-Ensure you ask permission of the friends who are also in the chats you export.
+Selfie supports importing text data, with special processing for certain data formats, like chat logs from WhatsApp and ChatGPT.
 
-[//]: # (You can also redact their name, messages, and other personal information in later steps.)
+> Note: If you don't have data in a particular format ready to import, you can try the example files in the `example-chats` directory.
 
-### Step 2: Import Messaging Data
+To import data into Selfie:
 
-1. Place your exported chat logs in a directory on your computer, e.g. `/home/alice/chats`.
-2. Open the UI at http://localhost:8181.
-3. Add your directory as a Data Source. Give it a name (e.g. My Chats), enter the **absolute** path, and click `Add Directory`. This must be a directory (i.e. folder), not a file. Example absolute path would be: `/Users/{you}/Projects/selfie/example-chats`
-4. In the Documents table, select the exported chat logs you want to import, and click `Index`.
+1. **Open the Add Data Page**: Access the UI and locate the Add Data section.
+2. **Select Data Source**: Choose the type of data you are uploading (e.g., WhatsApp, Text Files), picking the one that most closely matches your data format.
+3. **Upload Files**: Choose your files and submit them for upload.
 
-If this process is successful, your selected chat logs will show as indexed in the table. You can now use the API to connect to your LLM and generate personalized text completions.
+Ensure you obtain consent from participants in the chats you wish to export.
 
-[//]: # (1. Open http://localhost:8181/docs)
-[//]: # (2. Find `POST /v1/index_documents/chat-processor`)
-[//]: # (3. Upload one or more exported chat log files. To get these files, export them from platforms that you use frequently and contain information you want the LLM to know. Exports: [WhatsApp]&#40;https://faq.whatsapp.com/1180414079177245/?cms_platform=android&#41; | [Google]&#40;https://takeout.google.com/settings/takeout&#41; | [Instagram]&#40;https://help.instagram.com/181231772500920&#41; | [Facebook Messenger]&#40;https://www.facebook.com/help/messenger-app/713635396288741/?cms_platform=iphone-app&helpref=platform_switcher&#41; | [Telegram]&#40;https://www.maketecheasier.com/export-telegram-chat-history/&#41;. Ensure you ask permission of the friend who is also in the chat you export. You can also redact their name, messages, and other personal information in later steps.)
-[//]: # (4. Copy, paste, and edit the example parser_configs JSON. Include one configuration object in the list for each file you upload.)
-[//]: # ()
-[//]: # (![chat-processor.png]&#40;docs/images/chat-processor.png&#41;)
-[//]: # ()
-[//]: # (Setting `extract_importance` to `true` will give you better query results, but usually causes the import to take a while.)
+Support for new types of data can be added by creating new data connectors in `selfie/connectors/` (instructions [here](./selfie/connectors/README.md), please contribute!).
 
-### Step 3: Generate Personalized Text
+### Step 2: Engage with Your Data
 
-You can quickly verify if everything is in order by visiting the summarization endpoint in your browser: http://localhost:8181/v1/index_documents/summary?topic=travel ([docs](http://localhost:8181/docs#/default/get_index_documents_summary_v1_index_documents_summary_get)).
+The Playground page includes a chat interface and a Search feature. Write an LLM persona by entering a name and bio, and try interacting with your data through conversation. You can also search your data for specific topics under Search.
 
-Next, scroll down to the Playground section in the UI. Enter your name and a simple bio, and try asking some questions whose answers are in your chat logs.
+You can also interact with your data directly through the API. For instance, try viewing this link in your web browser: http://localhost:8181/v1/index_documents/summary?topic=travel. Detailed API documentation is available [here](http://localhost:8181/docs).
 
-## Usage Guide
+## API Usage Guide
 
 By default, Selfie augments text completions with local models using llama.cpp and a local txtai embeddings database.
 

diff --git a/docs/images/playground-augmentation.png b/docs/images/playground-augmentation.png
diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -7,6 +7,7 @@ readme = "README.md"
 
 [tool.poetry.dependencies]
 python = ">=3.11,<3.12"
+beautifulsoup4 = "^4.12.3"
 fastapi = "^0.109.0"
 uvicorn = "^0.27.0"
 humanize = "^4.9.0"

diff --git a/selfie-ui/src/app/components/Markdown.tsx b/selfie-ui/src/app/components/Markdown.tsx
@@ -5,7 +5,7 @@ import rehypeSanitize from 'rehype-sanitize';
 export const Markdown = ({ content }: { content: string }) => {
   return (
     <ReactMarkdown
-      className="prose prose-sm"
+      className="prose prose-sm max-w-full"
       rehypePlugins={[rehypeRaw, rehypeSanitize]}
       // rehypePlugins={[rehypeHighlight]}
     >

diff --git a/selfie/connectors/factory.py b/selfie/connectors/factory.py
@@ -1,12 +1,18 @@
+from selfie.connectors.text_files.connector import TextFilesConnector
+from selfie.connectors.google_messages.connector import GoogleMessagesConnector
+from selfie.connectors.telegram.connector import TelegramConnector
 from selfie.connectors.whatsapp.connector import WhatsAppConnector
 from selfie.connectors.chatgpt.connector import ChatGPTConnector
 
 
 class ConnectorFactory:
     # Register all document connectors here
     connector_registry = [
+        ChatGPTConnector,
+        GoogleMessagesConnector,
+        TelegramConnector,
+        TextFilesConnector,
         WhatsAppConnector,
-        ChatGPTConnector
     ]
 
     connector_map = {}
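
The hunk above shows only the registry list and an empty `connector_map`; how the map is filled is outside this diff. One plausible way to turn such a registry into a lookup keyed by connector `id` is sketched below. The stub classes and `build_connector_map` are hypothetical stand-ins, not Selfie's actual code. Only the `google_messages` id and name are taken from this commit.

```python
from typing import Dict, List, Type


class StubConnector:
    """Stand-in for BaseConnector; models only the id and name attributes."""

    def __init__(self) -> None:
        self.id = ""
        self.name = ""


class GoogleMessagesStub(StubConnector):
    def __init__(self) -> None:
        super().__init__()
        self.id = "google_messages"  # id and name as set in this commit
        self.name = "Google Messages"


class ExampleStub(StubConnector):
    def __init__(self) -> None:
        super().__init__()
        self.id = "example"  # hypothetical connector, for illustration only
        self.name = "Example"


def build_connector_map(registry: List[Type[StubConnector]]) -> Dict[str, StubConnector]:
    """Instantiate each registered class and index the instance by its id."""
    connector_map: Dict[str, StubConnector] = {}
    for connector_class in registry:
        connector = connector_class()
        if connector.id in connector_map:
            raise ValueError(f"Duplicate connector id: {connector.id}")
        connector_map[connector.id] = connector
    return connector_map
```

Rejecting duplicate ids at build time surfaces a copy-paste mistake in a new connector early, instead of silently replacing an existing one.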

diff --git a/selfie/connectors/google_messages/connector.py b/selfie/connectors/google_messages/connector.py
@@ -0,0 +1,53 @@
+from abc import ABC
+from typing import Any, List
+
+from selfie.connectors.base_connector import BaseConnector
+from selfie.database import BaseModel
+from selfie.embeddings import EmbeddingDocumentModel, DataIndex
+from selfie.parsers.chat import ChatFileParser
+from selfie.types.documents import DocumentDTO
+from selfie.utils import data_uri_to_string
+
+
+class GoogleMessagesConfiguration(BaseModel):
+    files: List[str]
+
+
+class GoogleMessagesConnector(BaseConnector, ABC):
+    def __init__(self):
+        super().__init__()
+        self.id = "google_messages"
+        self.name = "Google Messages"
+
+    def load_document(self, configuration: dict[str, Any]) -> List[DocumentDTO]:
+        config = GoogleMessagesConfiguration(**configuration)
+
+        return [
+            DocumentDTO(
+                content=data_uri_to_string(data_uri),
+                content_type="application/json",
+                name="todo",
+                size=len(data_uri_to_string(data_uri).encode('utf-8'))
+            )
+            for data_uri in config.files
+        ]
+
+    def validate_configuration(self, configuration: dict[str, Any]):
+        # TODO: check if file can be read from path
+        pass
+
+    def transform_for_embedding(self, configuration: dict[str, Any], documents: List[DocumentDTO]) -> List[EmbeddingDocumentModel]:
+        return [
+            embeddingDocumentModel
+            for document in documents
+            for embeddingDocumentModel in DataIndex.map_share_gpt_data(
+                ChatFileParser().parse_document(
+                    document=document.content,
+                    parser_type="google_messages",
+                    mask=False,
+                    document_name=document.name,
+                ).conversations,
+                source="google_messages",
+                source_document_id=document.id
+            )
+        ]
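
`load_document` above relies on `data_uri_to_string` from `selfie.utils`, which this diff does not show. The sketch below shows what such a helper might do, assuming uploads arrive as RFC 2397 data URIs holding UTF-8 text. The names `decode_data_uri` and `document_fields` are hypothetical. Decoding once and reusing the result also avoids the double decode in the `size=` expression.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(data_uri: str) -> str:
    """Decode a data URI to text (hypothetical stand-in for data_uri_to_string)."""
    if not data_uri.startswith("data:"):
        raise ValueError("Not a data URI")
    header, separator, payload = data_uri[len("data:"):].partition(",")
    if not separator:
        raise ValueError("Data URI has no comma separating header and payload")
    if header.endswith(";base64"):
        raw = base64.b64decode(payload)
    else:
        raw = unquote_to_bytes(payload)  # percent-encoded payload
    return raw.decode("utf-8")  # assumes the upload is UTF-8 text


def document_fields(data_uri: str) -> dict:
    """Decode once, then derive both content and size from the same string."""
    content = decode_data_uri(data_uri)
    return {"content": content, "size": len(content.encode("utf-8"))}
```
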
diff --git a/selfie/connectors/google_messages/documentation.md b/selfie/connectors/google_messages/documentation.md
@@ -0,0 +1,9 @@
+## Export Instructions
+
+Google Takeout is a service that allows you to download a copy of your data stored within Google products. To export your Google Messages chat history, follow the instructions below.
+
+1. Go to <a href="https://takeout.google.com" target="_blank">Google Takeout</a> and log in to your Google account.
+2. Select "Deselect all" and then scroll down to select "Messages" from the list of Google products. (note: `Messages` may not appear in the list if you have not used Google Messages in the past)
+3. Click "Next step" and choose your delivery method, frequency, and file type.
+4. Click "Create export" to start the process. Once completed, you will receive an email with a link to download your exported data.
+5. Download the .zip file and extract the `.json` files in the `Messages` folder to access your chat files.
diff --git a/selfie/connectors/google_messages/schema.json b/selfie/connectors/google_messages/schema.json
@@ -0,0 +1,14 @@
+{
+  "title": "Upload Google Messages Conversations",
+  "type": "object",
+  "properties": {
+    "files": {
+      "type": "array",
+      "title": "Files",
+      "description": "Upload .json files exported from Google Messages",
+      "items": {
+        "type": "object"
+      }
+    }
+  }
+}
diff --git a/selfie/connectors/google_messages/uischema.json b/selfie/connectors/google_messages/uischema.json
@@ -0,0 +1,8 @@
+{
+  "files": {
+    "ui:widget": "nativeFile",
+    "ui:options": {
+      "accept": ".json"
+    }
+  }
+}
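
For reference, here is a rough sketch of a configuration payload that would satisfy `GoogleMessagesConfiguration` (`files: List[str]`). It assumes, without confirmation from this diff, that the `nativeFile` widget submits each upload as a base64 data URI. `to_data_uri` and `build_configuration` are hypothetical helper names.

```python
import base64
from typing import Dict, List


def to_data_uri(raw: bytes, content_type: str = "application/json") -> str:
    """Encode raw file bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{content_type};base64,{encoded}"


def build_configuration(files: List[bytes]) -> Dict[str, List[str]]:
    """Build a dict shaped like GoogleMessagesConfiguration: one data URI per file."""
    return {"files": [to_data_uri(raw) for raw in files]}
```

A payload built this way can be passed as the `configuration` argument of `load_document`, provided the data URI assumption holds.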