feat: Telegram and Text Files connectors #20

Merged
merged 4 commits into from
Feb 28, 2024
Changes from 1 commit
59 changes: 17 additions & 42 deletions README.md
Expand Up @@ -10,17 +10,19 @@

Selfie personalizes text generation, augmenting both local and hosted Large Language Models (LLMs) with your personal data. Selfie exposes an OpenAI-compatible API that wraps the LLM of your choice, and automatically injects relevant context into each text generation request.

<img alt="selfie-augmentation" src="./docs/images/playground-augmentation.png" width="100%">

## Features

* Automatically mix your data into chat and text completions using OpenAI-compatible clients like [OpenAI SDKs](https://platform.openai.com/docs/libraries), [SillyTavern](https://sillytavernai.com), and [Instructor](https://github.com/jxnl/instructor)* (untested).
* Quickly drop in personal messaging data exported from WhatsApp and Google Messages.
* Runs locally by default to keep your data private.
* Unopinionated compatibility with your LLM or provider of choice.
* Easily switch to vanilla text generation modes.
* Directly and selectively query loaded data.

On the roadmap:
* Load data using any [LlamaHub loader](https://llamahub.ai/?tab=loaders) (partial support is available through the API).
* Directly and selectively query loaded data.
* Load data using any [LlamaHub loader](https://llamahub.ai/?tab=loaders).
Member


Is this on the roadmap? I thought we were going with the approach of "one connector for each data source" vs "support all llamahub loaders", because the E2E flow for each connector varies.

Member Author


I think a general-purpose LlamaHub loader should be possible, probably with some limitations (some loaders need additional python dependencies), but I've removed it until we can clarify further.

tnunamak marked this conversation as resolved.
* Easy deployment with Docker and pre-built executables.

## Overview
Expand Down Expand Up @@ -60,56 +62,29 @@ This starts a local web server and should launch the UI in your browser at http:

> Note: You can host selfie at a publicly-accessible URL with [ngrok](https://ngrok.com). Add your ngrok token (and optionally, ngrok domain) in `selfie/.env` and run `poetry run python -m selfie --share`.

### Step 1: Gather Messaging Data

Future versions of Selfie will support loading any text data. For now, you can import chat logs from popular messaging platforms.

> Note: If you don't have any chat logs or want to try the app first, you can use the example chat logs provided in the `example-chats` directory.

Export chats that you use frequently and contain information you want the LLM to know.

#### Export Instructions

The following links provide instructions for exporting chat logs from popular messaging platforms:

* [WhatsApp](https://faq.whatsapp.com/1180414079177245/?cms_platform=android)
* [Google](https://takeout.google.com/settings/takeout) (select Messages from the list)

These platforms are not yet supported, but you can create a parser in selfie/parsers/chats to support them (please contribute!):

* [Instagram](https://help.instagram.com/181231772500920)
* [Facebook Messenger](https://www.facebook.com/help/messenger-app/713635396288741/?cms_platform=iphone-app&helpref=platform_switcher)
* [Telegram](https://www.maketecheasier.com/export-telegram-chat-history/)
### Step 1: Import Your Data

Ensure you ask permission of the friends who are also in the chats you export.
Selfie supports importing text data, with special processing for certain data formats, like chat logs from WhatsApp and ChatGPT.

[//]: # (You can also redact their name, messages, and other personal information in later steps.)
> Note: You can try the example files in the `example-chats` directory if you want to try a specific data format that you don't have ready for import.

### Step 2: Import Messaging Data
To import data into Selfie:

1. Place your exported chat logs in a directory on your computer, e.g. `/home/alice/chats`.
2. Open the UI at http://localhost:8181.
3. Add your directory as a Data Source. Give it a name (e.g. My Chats), enter the **absolute** path, and click `Add Directory`. This must be a directory (i.e. folder), not a file. Example absolute path would be: `/Users/{you}/Projects/selfie/example-chats`
4. In the Documents table, select the exported chat logs you want to import, and click `Index`.
1. **Open the Add Data Page**: Access the UI and locate the Add Data section.
2. **Select Data Source**: Choose the type of data you are uploading (e.g., WhatsApp, Text Files). Choose the type that most closely matches your data format.
3. **Upload Files**: Choose your files and submit them for upload.

If this process is successful, your selected chat logs will show as indexed in the table. You can now use the API to connect to your LLM and generate personalized text completions.
Ensure you obtain consent from participants in the chats you wish to export.

[//]: # (1. Open http://localhost:8181/docs)
[//]: # (2. Find `POST /v1/index_documents/chat-processor`)
[//]: # (3. Upload one or more exported chat log files. To get these files, export them from platforms that you use frequently and contain information you want the LLM to know. Exports: [WhatsApp]&#40;https://faq.whatsapp.com/1180414079177245/?cms_platform=android&#41; | [Google]&#40;https://takeout.google.com/settings/takeout&#41; | [Instagram]&#40;https://help.instagram.com/181231772500920&#41; | [Facebook Messenger]&#40;https://www.facebook.com/help/messenger-app/713635396288741/?cms_platform=iphone-app&helpref=platform_switcher&#41; | [Telegram]&#40;https://www.maketecheasier.com/export-telegram-chat-history/&#41;. Ensure you ask permission of the friend who is also in the chat you export. You can also redact their name, messages, and other personal information in later steps.)
[//]: # (4. Copy, paste, and edit the example parser_configs JSON. Include one configuration object in the list for each file you upload.)
[//]: # ()
[//]: # (![chat-processor.png]&#40;docs/images/chat-processor.png&#41;)
[//]: # ()
[//]: # (Setting `extract_importance` to `true` will give you better query results, but usually causes the import to take a while.)
Support for new types of data can be added by creating new data connectors in `selfie/connectors/` (instructions [here](./selfie/connectors/README.md), please contribute!).

### Step 3: Generate Personalized Text
### Step 2: Engage with Your Data

You can quickly verify if everything is in order by visiting the summarization endpoint in your browser: http://localhost:8181/v1/index_documents/summary?topic=travel ([docs](http://localhost:8181/docs#/default/get_index_documents_summary_v1_index_documents_summary_get)).
The Playground page includes a chat interface and a Search feature. Write an LLM persona by entering a name and bio, and try interacting with your data through conversation. You can also search your data for specific topics under Search.

Next, scroll down to the Playground section in the UI. Enter your name and a simple bio, and try asking some questions whose answers are in your chat logs.
You can interact with your data via the API directly, for instance, try viewing this link in your web browser: http://localhost:8181/v1/index_documents/summary?topic=travel. Detailed API documentation is available [here](http://localhost:8181/docs).

## Usage Guide
## API Usage Guide

By default, Selfie augments text completions with local models using llama.cpp and a local txtai embeddings database.

Expand Down
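The summarization endpoint mentioned in the README can also be queried from a script. The following is a minimal sketch using only the Python standard library. It assumes the server is running at the default `http://localhost:8181` and that the endpoint returns JSON (the response format is not shown in this diff). `summary_url` and `fetch_summary` are hypothetical helper names, not part of Selfie.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8181"  # default address used in the README


def summary_url(topic: str, base_url: str = BASE_URL) -> str:
    """Build the summary endpoint URL, URL-encoding the topic."""
    return f"{base_url}/v1/index_documents/summary?{urlencode({'topic': topic})}"


def fetch_summary(topic: str, base_url: str = BASE_URL):
    """Call the endpoint. Assumes (unverified) that the response body is JSON."""
    with urlopen(summary_url(topic, base_url)) as response:
        return json.load(response)
```

With Selfie running and some data imported, `fetch_summary("travel")` requests the same URL as the browser link in the README.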
Binary file added docs/images/playground-augmentation.png
13 changes: 4 additions & 9 deletions poetry.lock


1 change: 1 addition & 0 deletions pyproject.toml
Expand Up @@ -7,6 +7,7 @@ readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.11,<3.12"
beautifulsoup4 = "^4.12.3"
fastapi = "^0.109.0"
uvicorn = "^0.27.0"
humanize = "^4.9.0"
Expand Down
2 changes: 1 addition & 1 deletion selfie-ui/src/app/components/Markdown.tsx
Expand Up @@ -5,7 +5,7 @@ import rehypeSanitize from 'rehype-sanitize';
 export const Markdown = ({ content }: { content: string }) => {
   return (
     <ReactMarkdown
-      className="prose prose-sm"
+      className="prose prose-sm max-w-full"
       rehypePlugins={[rehypeRaw, rehypeSanitize]}
       // rehypePlugins={[rehypeHighlight]}
     >
Expand Down
8 changes: 7 additions & 1 deletion selfie/connectors/factory.py
@@ -1,12 +1,18 @@
+from selfie.connectors.text_files.connector import TextFilesConnector
+from selfie.connectors.google_messages.connector import GoogleMessagesConnector
+from selfie.connectors.telegram.connector import TelegramConnector
 from selfie.connectors.whatsapp.connector import WhatsAppConnector
 from selfie.connectors.chatgpt.connector import ChatGPTConnector


 class ConnectorFactory:
     # Register all document connectors here
     connector_registry = [
+        ChatGPTConnector,
+        GoogleMessagesConnector,
+        TelegramConnector,
+        TextFilesConnector,
         WhatsAppConnector,
-        ChatGPTConnector
     ]

     connector_map = {}
Expand Down
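The rest of `ConnectorFactory` is collapsed in this view, so how `connector_map` gets populated is not shown. The following is a sketch of one way a registry list could be turned into an id-to-class map, with a guard against registering the same connector twice. It uses stand-in classes rather than the real Selfie imports. `build_connector_map` is a hypothetical helper, and the `"whatsapp"` id is assumed (only `"google_messages"` appears in this diff).

```python
class BaseConnector:
    """Stand-in for selfie.connectors.base_connector.BaseConnector."""

    def __init__(self):
        self.id = None
        self.name = None


class GoogleMessagesConnector(BaseConnector):
    def __init__(self):
        super().__init__()
        self.id = "google_messages"  # id taken from the diff
        self.name = "Google Messages"


class WhatsAppConnector(BaseConnector):
    def __init__(self):
        super().__init__()
        self.id = "whatsapp"  # assumed id, not shown in the diff
        self.name = "WhatsApp"


def build_connector_map(registry):
    """Map each connector's id to its class, rejecting duplicate ids."""
    connector_map = {}
    for connector_class in registry:
        # Ids are set in __init__, so an instance is needed to read them.
        connector_id = connector_class().id
        if connector_id in connector_map:
            raise ValueError(f"Duplicate connector id: {connector_id!r}")
        connector_map[connector_id] = connector_class
    return connector_map
```

Failing fast on a duplicate id means a class listed twice in the registry is caught at startup instead of silently overwriting an entry.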
53 changes: 53 additions & 0 deletions selfie/connectors/google_messages/connector.py
@@ -0,0 +1,53 @@
from abc import ABC
from typing import Any, List

from selfie.connectors.base_connector import BaseConnector
from selfie.database import BaseModel
from selfie.embeddings import EmbeddingDocumentModel, DataIndex
from selfie.parsers.chat import ChatFileParser
from selfie.types.documents import DocumentDTO
from selfie.utils import data_uri_to_string


class GoogleMessagesConfiguration(BaseModel):
    files: List[str]


class GoogleMessagesConnector(BaseConnector, ABC):
    def __init__(self):
        super().__init__()
        self.id = "google_messages"
        self.name = "Google Messages"

    def load_document(self, configuration: dict[str, Any]) -> List[DocumentDTO]:
        config = GoogleMessagesConfiguration(**configuration)

        return [
            DocumentDTO(
                content=data_uri_to_string(data_uri),
                content_type="text/plain",
Member


Should this be application/json?

tnunamak marked this conversation as resolved.
                name="todo",
                size=len(data_uri_to_string(data_uri).encode('utf-8'))
            )
            for data_uri in config.files
        ]

    def validate_configuration(self, configuration: dict[str, Any]):
        # TODO: check if file can be read from path
        pass

    def transform_for_embedding(self, configuration: dict[str, Any], documents: List[DocumentDTO]) -> List[EmbeddingDocumentModel]:
        return [
            embeddingDocumentModel
            for document in documents
            for embeddingDocumentModel in DataIndex.map_share_gpt_data(
                ChatFileParser().parse_document(
                    document=document.content,
                    parser_type="google_messages",
                    mask=False,
                    document_name=document.name,
                ).conversations,
                source="google_messages",
                source_document_id=document.id
            )
        ]
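`load_document` relies on `selfie.utils.data_uri_to_string`, whose implementation is not part of this diff. The following is a sketch of what decoding an uploaded data URI could look like, with the text decoded once and reused for both the content and the size. `decode_data_uri` and `content_and_size` are hypothetical stand-ins, not the project's helpers, and UTF-8 payloads are assumed.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(data_uri: str) -> str:
    """Decode a data URI (base64 or percent-encoded) to text, assuming UTF-8."""
    header, separator, payload = data_uri.partition(",")
    if not header.startswith("data:") or not separator:
        raise ValueError("not a data URI")
    if header.endswith(";base64"):
        raw = base64.b64decode(payload)
    else:
        raw = unquote_to_bytes(payload)
    return raw.decode("utf-8")


def content_and_size(data_uri: str) -> tuple[str, int]:
    """Return the decoded text and its size in UTF-8 bytes, decoding only once."""
    text = decode_data_uri(data_uri)
    return text, len(text.encode("utf-8"))
```

Measuring the encoded bytes rather than `len(text)` matches the connector's `size` calculation, which differs from the character count whenever the text contains non-ASCII characters.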
9 changes: 9 additions & 0 deletions selfie/connectors/google_messages/documentation.md
@@ -0,0 +1,9 @@
## Export Instructions

Google Takeout is a service that allows you to download a copy of your data stored within Google products. To export your Google Messages chat history, follow the instructions below.

1. Go to <a href="https://takeout.google.com" target="_blank">Google Takeout</a> and log in to your Google account.
2. Select "Deselect all" and then scroll down to select "Messages" from the list of Google products. (note: `Messages` may not appear in the list if you have not used Google Messages in the past)
3. Click "Next step" and choose your delivery method, frequency, and file type.
4. Click "Create export" to start the process. Once completed, you will receive an email with a link to download your exported data.
5. Download the .zip file and extract the `.json` files in the `Messages` folder to access your chat files.
14 changes: 14 additions & 0 deletions selfie/connectors/google_messages/schema.json
@@ -0,0 +1,14 @@
{
  "title": "Upload Google Messages Conversations",
  "type": "object",
  "properties": {
    "files": {
      "type": "array",
      "title": "Files",
      "description": "Upload .json files exported from Google Messages",
      "items": {
        "type": "object"
      }
    }
  }
}
8 changes: 8 additions & 0 deletions selfie/connectors/google_messages/uischema.json
@@ -0,0 +1,8 @@
{
  "files": {
    "ui:widget": "nativeFile",
    "ui:options": {
      "accept": ".json"
    }
  }
}
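The connector's `validate_configuration` is still a TODO in this PR. The following is a sketch of a basic check on the `files` array described by the schema. It assumes each uploaded file reaches the connector as a data URI string, which is what `files: List[str]` and `data_uri_to_string` in `connector.py` suggest. The function returns a list of problems instead of raising, which is an illustrative choice and not the project's convention.

```python
from typing import Any


def validate_configuration(configuration: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the config looks usable."""
    files = configuration.get("files")
    if not isinstance(files, list) or not files:
        return ["'files' must be a non-empty list"]

    errors = []
    for index, item in enumerate(files):
        # Assumption: uploads are data URI strings of the form "data:<type>,<payload>".
        if not isinstance(item, str) or not item.startswith("data:") or "," not in item:
            errors.append(f"files[{index}] is not a data URI string")
    return errors
```

A check like this could run before `load_document`, so a malformed upload is reported with its index rather than failing during parsing.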