Skip to content

maemo-leste-extras/kotki

Repository files navigation

kotki

License: MPL v2

High-performance language translations without using the cloud.

  • C/C++ 17 implementation
  • x86_64, ARM
  • Runs on the CPU
  • AVX intrinsics support for x86 architectures
  • NEON intrinsics support for ARM architectures
  • Language models from the Mozilla extension Firefox Translations
  • FOSS (OpenBLAS)
  • Linux only

Quick start

Requirements

For Ubuntu:

sudo apt update && sudo apt upgrade
sudo apt install -y cmake ccache build-essential git pkg-config rapidjson-dev pybind11-dev libyaml-cpp-dev python3-dev python3-virtualenv libopenblas-dev libpcre2-dev libprotobuf-dev protobuf-compiler libsqlite3-dev

Python

  1. pip install kotki -v
  2. Install language translation models

Programmatically

import kotki
kotki.scan()  # auto-find language translation models
# kotki.scan("/path/to/registry.json")  # or supply the path

# English -> German
kotki.translate("Whenever I am at the office, I like to drink coffee.", "ende")
'Wann immer ich im büro bin, trinke ich gerne kaffee.'

# Bulgarian -> English
kotki.translate("Румънците получиха дълго чакани новини: пенсиите и минималната заплата ще бъдат увеличени от 2023 г.", "bgen")
'Romanians have received long-awaited news: pensions and minimum wages will be increased from 2023'

# Dutch -> English
>>> kotki.translate("Auto begeeft het nadat man benzine steelt in Breda, blijkt dieselauto te zijn", "nlen")
'Car breaks after man steals gas in Breda, turns out to be diesel car'

# English -> Polish
>>> kotki.translate("I am going outside to buy some Pierogi.", "enpl")
'Jadę na zewnątrz, żeby kupić Pierogi.'

CLI

$ kotki-cli --help
Usage: kotki-cli [OPTIONS]

  Translate some text.

Options:
  -i, --input TEXT         Text to translate  [required]
  -m, --model TEXT         Model names. Use -l to list. Leave empty to guess
                           the input language automatically.
  -r, --registry FILENAME  Path to registry.json. Leave empty for auto-
                           detection of translation models.
  -l, --list               List available models.
  -d, --debug              Print debug log.
  --help                   Show this message and exit.

Self-hosted web-interface

Example: kotki.kroket.io

$ kotki-web --help
Usage: kotki-web [OPTIONS]

  Exposes kotki via HTTP web interface and provide an API.

Options:
  -h, --host TEXT          bind host (default: 127.0.0.1)  [required]
  -p, --port INTEGER       bind port (default: 7000)  [required]
  -d, --debug              run Quart web-framework in debug
  -r, --registry FILENAME  Path to registry.json. Leave empty for auto-
                           detection of translation models.
  --help                   Show this message and exit.

C++

Link against kotki-lib (CMake target, see src/demo/ for reference).

#include <string>
#include "kotki/kotki.h"

using namespace std;
int main(int argc, char *argv[]) {
  auto *kotki = new Kotki();
  kotki->scan();
  // auto loadedModels = kotki->listModels();  // show currently loaded language models
  cout << kotki->translate("This should work, in theory.", "ende");  // English to German
  return 0;
}

why

Kotki is aimed at developers who "just want to translate some text" in their C++ or Python applications without too much headache, as other translation frameworks are often big, difficult to compile, non-performant, etc.

Producing libkotki

libkotki.so or libkotki.a

Via CMake

Install marian-lite (and its dependencies) manually (and if you are lazy, you can let kotki download the dependencies automatically via -DVENDORED_LIBS=ON - though your mileage may vary).

  • STATIC - Produce static binary (TODO: doesn't work yet)
  • SHARED - Produce shared binary
  • BUILD_DEMO - Produce example demo application(s)
cmake -DBUILD_DEMO=ON -DSTATIC=OFF -DSHARED=ON -Bbuild .
make -Cbuild -j6
sudo make -Cbuild install  # install into /usr/local/...

Via debian packaging

sudo apt install -y debhelper
dpkg-buildpackage -b -uc

Library usage (CMake)

cmake_minimum_required(VERSION 3.16)
find_package(kotki REQUIRED)
target_link_libraries(my_app PRIVATE kotki::kotki-lib)

Models

The translation models are borrowed from the Mozilla Firefox Translations extension. You need to manually download these models. They are conveniently packaged into a single archive that can be downloaded over at kotki/releases.

Extract to ~/.config/kotki/models/ for automatic detection:

mkdir -p ~/.config/kotki/models/
wget https://github.com/kroketio/kotki/releases/download/v0.4.5/kotki_models_0.3.3.zip
unzip kotki_models_0.3.3.zip -d ~/.config/kotki/models

Or supply your own path scan("/path/to/registry.json").

Performance / footprint

Translations are fast - Translating a simple sentence is generally under 10ms (except the first time, due to model loading). Note that translation models are loaded on-demand. This means that model loading does not happen during scan() but during the first use of translate(). In addition, translations are done synchronously (and thus 'blocking').

Acknowledgements

This project was made possible through the combined effort of all researchers and partners in the Bergamot project (Jerin Philip, et al). The translation models are prepared as part of the Mozilla project. The translation engine used is bergamot-translator which is based on marian.

Bergamot-Translator

Kotki differs from Bergamot-Translator. The changes are specified below:

  • Removed async/blocking worker pools
  • Removed async/callback style translations
  • Removed code related to parsing of HTML
  • Work from a single JSON config file (registry.json)
  • Dynamically generate marian configs 'on-the-fly'
  • Simplified the example C++ CLI program (src/demo/kotki.cpp).
  • Switch from marian-dev to marian-lite
  • Simplified Python bindings
  • Simplified the build system (cleaned up various CMakeLists.txt)
  • Introduced automatic use of ccache for compilations
  • Supply CMake configs for kotki (and its dependencies)
  • Supply debian packaging for kotki (and its dependencies)
  • Removed support for Apple, Microsoft, WASM (rip)
  • Removed usage of proprietary libraries like CUDA, Intel MKL
  • Removed unit tests
  • Removed CI/CD definitions
  • Introduced new dependency: rapidjson
  • Doxygen, and other documentation removed

License

MPL 2.0