Research and analysis software for Telegram
- Project Description
- Frontend
- Backend (this repository)
Rename .env.example to .env and edit it according to your needs. The defaults are listed below:
Name | Description |
---|---|
COMPOSE_PROJECT_NAME | Project name. |
MONGO_HOST | Hostname for MongoDB. Default: "mongo" (forwards to the "mongo" docker container). |
MONGO_USER | Username for MongoDB. |
MONGO_PASSWORD | Password for MongoDB. |
MONGO_DB_NAME | Name of MongoDB database. |
FLOWER_HOST | Hostname for Flower (for debugging). Default: "flower" (forwards to the "flower" docker container) |
FLOWER_PORT | Port for Flower. Default: "5555" |
FLOWER_USER | User to access Flower. |
FLOWER_PASSWORD | Password to access Flower. |
SCRAPE_CHATS_MAX_DAYS | Number of days to scrape backwards in each chat when scraping for the first time. Set to 0 to scrape all content. Warning: Scraping all content of a chat can take several days. Default: "7" |
SCRAPE_CHATS_INTERVAL_MINUTES | Interval in minutes at which new messages in chats are scraped. Default: "30" |
SAVE_ATTACHMENT_TYPES | Attachment types that will be downloaded and stored. Default: ["photo","audio","document","animation","video","voice","video_note","sticker"] |
KEEP_ATTACHMENT_FILES_DAYS | Number of days after which attachments are deleted automatically. Set to 0 to keep files. |
STORAGE_ENDPOINT | Endpoint for S3-compatible object storage, e.g. MinIO. Default: "host.docker.internal:9000" (forwards to the minio docker container) |
STORAGE_ACCESS_KEY | API username for object storage. |
STORAGE_SECRET_KEY | API key for object storage. |
JWT_SECRET | JWT secret (token) |
JWT_LIFETIME_SECONDS | Lifetime of JWT in seconds. Default: "3600" (1 hour) |
OCR_ASR_FALLBACK_LANGUAGE | Fallback language code (ISO 639-1) for text and speech recognition if the language of a chat can't be detected automatically. Default: "en" |
OCR_ENABLED | Whether text recognition (OCR) for images using tesseract is enabled. If enabled, make sure SAVE_ATTACHMENT_TYPES includes "photo". Pretrained models are available for 120+ languages and will be downloaded automatically if OCR_MODEL_TYPE is "fast" or "best". Default: "fast" |
OCR_MODEL_TYPE | The model type used for OCR. Can be "fast", "best" or "custom". Fast models are fast and need fewer resources but are less accurate. Best models need more resources and take longer but are more accurate. |
ASR_ENABLED | Whether speech recognition (ASR) using vosk is enabled. If enabled, make sure SAVE_ATTACHMENT_TYPES includes "voice". |
ASR_LANGUAGE | Language code (ISO 639-1) of the language in which speech recognition (ASR) is performed. Currently only one language is supported at a time. Default: "en" |
ASR_MODEL_NAME | Model name for speech recognition. Pretrained models are available for 20+ languages and will be downloaded automatically. Most of these languages have a "small" and a "big" model. Small models are fast and need fewer resources but are less accurate. Big models need more resources and take longer but are more accurate. Note: Big models require up to 16 GB memory. Default: "vosk-model-small-en-us-0.15" (small English model) |
API_ALLOW_ORIGINS | Origins from which the API is accessible. Default: ["http://localhost:3000"] (only accessible from localhost) |
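To illustrate how the values in the table above relate to each other, here is a minimal Python sketch that parses a few of them and flags the two combinations the table warns about. It is not taken from the Teledash code base: the function names, the boolean spelling ("true"/"1"/"yes") and the "false" fallback for OCR_ENABLED and ASR_ENABLED are assumptions made for this example.

```python
import json
import os
from typing import List, Mapping

# Default taken from the SAVE_ATTACHMENT_TYPES row of the table.
DEFAULT_ATTACHMENT_TYPES = (
    '["photo","audio","document","animation","video","voice","video_note","sticker"]'
)


def _as_bool(value: str) -> bool:
    # Assumption: booleans are written as "true", "1" or "yes" (case-insensitive).
    return value.strip().lower() in {"1", "true", "yes"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Parse a subset of the documented variables into typed values."""
    return {
        # 0 means "scrape all content" on the first run.
        "scrape_chats_max_days": int(env.get("SCRAPE_CHATS_MAX_DAYS", "7")),
        "scrape_chats_interval_minutes": int(env.get("SCRAPE_CHATS_INTERVAL_MINUTES", "30")),
        "jwt_lifetime_seconds": int(env.get("JWT_LIFETIME_SECONDS", "3600")),
        # List-valued settings are written as JSON arrays.
        "save_attachment_types": json.loads(
            env.get("SAVE_ATTACHMENT_TYPES", DEFAULT_ATTACHMENT_TYPES)
        ),
        # The "false" fallbacks are an assumption; the table lists no default.
        "ocr_enabled": _as_bool(env.get("OCR_ENABLED", "false")),
        "asr_enabled": _as_bool(env.get("ASR_ENABLED", "false")),
    }


def find_conflicts(settings: dict) -> List[str]:
    """Return a message for each setting combination the table advises against."""
    problems = []
    types = settings["save_attachment_types"]
    if settings["ocr_enabled"] and "photo" not in types:
        problems.append('OCR_ENABLED requires "photo" in SAVE_ATTACHMENT_TYPES')
    if settings["asr_enabled"] and "voice" not in types:
        problems.append('ASR_ENABLED requires "voice" in SAVE_ATTACHMENT_TYPES')
    return problems
```

Calling find_conflicts(load_settings()) before starting the containers would surface a missing "photo" or "voice" entry early, rather than after a long first scrape.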
Note: Scraping for the first time can take several hours to days to download and process all content, depending on the configuration settings, available resources and number of Telegram clients used.
- Object Storage (S3 compatible), e.g. MinIO Server (see below)
- Docker
Clone this repository, adjust the configuration accordingly (see above) and run docker-compose up.
Make sure to change JWT_SECRET, FLOWER_USER, FLOWER_PASSWORD, MONGO_USER and MONGO_PASSWORD, as the MongoDB instance will be accessible remotely by default.
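One way to produce strong values for the secret variables is Python's standard secrets module. The following is only a sketch: the helper name generate_secrets and the choice of 32 random bytes are assumptions for this example, and the user names (FLOWER_USER, MONGO_USER) are left for you to choose.

```python
import secrets

# The secret-valued variables among those that should be changed before deployment.
SECRET_KEYS = ["JWT_SECRET", "FLOWER_PASSWORD", "MONGO_PASSWORD"]


def generate_secrets(nbytes: int = 32) -> dict:
    """Return a fresh URL-safe random value for each secret variable."""
    return {key: secrets.token_urlsafe(nbytes) for key in SECRET_KEYS}


if __name__ == "__main__":
    # Print KEY=VALUE lines that can be pasted into the .env file.
    for key, value in generate_secrets().items():
        print(f"{key}={value}")
```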
The database can be exported in different ways:
There's a bash script that takes a collection name and exports that collection to a specified location ("outfile path") as a JSON file. Usage:
$ ./scripts/export/export.sh <collection name> <outfile path>
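To work with an exported file in Python, a small loader like the one below may help. It is a sketch under assumptions: the exact layout written by export.sh is not documented here, so the loader accepts either a single JSON array or one compact JSON document per line (the two layouts commonly produced by MongoDB export tools). Pretty-printed multi-line documents are not handled. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_exported_documents(path: str) -> Iterator[dict]:
    """Yield documents from an exported collection file."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return
    if text.startswith("["):
        # Layout 1: the whole file is one JSON array of documents.
        yield from json.loads(text)
    else:
        # Layout 2: one compact JSON document per line.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
```

For example, sum(1 for _ in iter_exported_documents("messages.json")) counts the exported documents, where "messages.json" stands for whatever outfile path you passed to the script.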
Alternatively, you can export the database via the GUI MongoDB Compass as JSON or CSV.
Use the command-line MinIO client (e.g. mc cp --recursive minio/[SOURCE] [TARGET]), the web interface MinIO console (by default available at http://localhost:9001) or Cyberduck (using a generic S3 profile) if you want to download all attachments from the object storage.
We use VS Code Remote-Containers for the development setup. To start developing in a Docker container, run the "Remote-Containers: Open Folder in Container" command from the Command Palette and select the worker or api folder.
$ uvicorn api.main:app --reload
Once the API is up and running you can open the interactive documentation of the API at http://127.0.0.1:8000/docs.
The default username is [email protected] (password: 12345678). Make sure to change this after deployment.
Start Celery by running the VSCode task "Celery workers".
To test the scraping tasks, execute in a python shell:
>>> from worker.tasks import init_scrapers
>>> init_scrapers.delay()
We use MinIO for object storage in development.
To set up a local MinIO instance with persistent storage, run:
$ mkdir -p ~/minio/data
$ docker run \
-p 9000:9000 \
-p 9001:9001 \
--name minio1 \
-v ~/minio/data:/data \
-e "MINIO_ROOT_USER=username" \
-e "MINIO_ROOT_PASSWORD=password" \
quay.io/minio/minio server /data --console-address ":9001"
Don't forget to update MINIO_ROOT_USER and MINIO_ROOT_PASSWORD. They must have the same values as STORAGE_ACCESS_KEY and STORAGE_SECRET_KEY in the .env file.
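To keep the MinIO credentials and the .env values from drifting apart, the docker run arguments can be derived from the .env file instead of being typed twice. The sketch below is an illustration only: parse_env is a deliberately simplified KEY=VALUE parser (no multi-line values, no variable expansion), and both function names are made up for this example.

```python
import os


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def minio_run_command(env: dict, data_dir: str = "~/minio/data") -> list:
    """Build the docker run command shown above, using the .env credentials."""
    return [
        "docker", "run",
        "-p", "9000:9000",
        "-p", "9001:9001",
        "--name", "minio1",
        "-v", f"{os.path.expanduser(data_dir)}:/data",
        # Reuse the storage credentials so both sides always match.
        "-e", f"MINIO_ROOT_USER={env['STORAGE_ACCESS_KEY']}",
        "-e", f"MINIO_ROOT_PASSWORD={env['STORAGE_SECRET_KEY']}",
        "quay.io/minio/minio", "server", "/data", "--console-address", ":9001",
    ]
```

The returned list can be passed to subprocess.run, or printed with shlex.join to review the command before running it.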
For more info check out the MinIO Quickstart Guide.
If you want to access MinIO via CLI, follow the MinIO Client Guide.
Then:
$ mc alias set <ALIAS> <YOUR-S3-ENDPOINT> [YOUR-ACCESS-KEY] [YOUR-SECRET-KEY] [--api API-SIGNATURE]
The alias is simply a short name for your cloud storage service.
$ mc rb minio/[BUCKET_NAME] minio/[BUCKET_NAME] ... --force
$ scripts/open-mongosh.sh
The file scripts/mongo-init.js is executed once when MongoDB starts for the first time. To recreate all indexes (e.g. after dropping a collection), run:
$ scripts/rerun-mongo-init.sh
$ docker exec -it <container id or name> /bin/bash
As root:
$ docker exec -it --user root <container id or name> /bin/bash
For scraping you need to register at least one (Telegram) client and obtain api_id and api_hash.
There's no official Vosk package for Apple Silicon (arm64) CPUs, but there is a workaround. To make Vosk work on M1/M2, open /worker/requirements.txt and replace vosk with https://github.com/alphacep/vosk-api/releases/download/v0.3.32/vosk-0.3.32-py3-none-linux_aarch64.whl
- Add documentation for setup and deployment instructions
- Add documentation for frontend
- Improve full text search (implement fuzzy search)
- Create JWT refresh endpoint for API
- Implement Role Based Access Control (RBAC)
- Handle failed downloads and clean up tmp directory
- Extract and store meta data for media files
- Extract and store text from documents
- Write tests
Please cite Teledash in your publications if you used it for your research:
@misc{teledash_2022,
title={Teledash – analysis and research software for Telegram},
url={https://github.com/democ-de/teledash},
author={Weichbrodt, Gregor and Stanjek, Grischa},
year={2022}
}