- `src` ‒ main source code: model and dataset implementations, plus code to train, test, and run inference.
- `notebooks` ‒ notebooks with experiments and visualizations.
- `scripts` ‒ assorted utility scripts, e.g. printing dataset examples or evaluating existing models.
- `tests` ‒ unit tests.
Create a virtual environment with `venv` or `conda` and install the requirements:

```shell
pip install -r requirements.txt
```

To contribute, also install the development requirements:

```shell
pip install -r requirements-dev.txt
```
The data is stored in `.jsonl` format: each line is a JSON object with two fields, `text` and `label`. For example:

```json
{"text": "скотина! что сказать", "label": 1}
```
To train a tokenizer on your data, use the `scripts.train_tokenizer` script:

```shell
python -m scripts.train_tokenizer \
    --data resources/data/dataset.jsonl \
    --save-to resources/tokenizer
```
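For reference, the core of such a script might look like the sketch below, which uses the Hugging Face `tokenizers` library purely as an illustration; the actual backend, vocabulary size, and special tokens used by `scripts.train_tokenizer` may differ.

```python
import json
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative only: train a BPE tokenizer on the `text` field of the .jsonl data.
# Vocabulary size and special tokens below are assumptions, not the repo's settings.
def iter_texts(path: str):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
tokenizer.train_from_iterator(iter_texts("resources/data/dataset.jsonl"), trainer=trainer)
tokenizer.save("resources/tokenizer/tokenizer.json")
```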
To generate a noisy dataset, i.e. replace characters with visually similar ones, use the `scripts.generate_noisy_dataset` script (see the script for details about its arguments):

```shell
python -m scripts.generate_noisy_dataset \
    --data resources/data/dataset.jsonl \
    --save-to resources/data/noisy_dataset.jsonl
```

In the noisy dataset, each sample additionally stores the class of every word. For example:

```json
{"text": [["cкотина", 0], ["!", 0], ["что", 0], ["сказать", 0]], "label": 1}
```
There are 4 levels of replacements:

- Replace characters with visually similar digits, e.g. "o" -> "0". Full mapping: `letters1.json`.
- Replace characters with visually similar symbols or characters from another alphabet, e.g. "a" -> "@". Full mapping: `letters2.json`.
- Replace characters with a sequence of symbols, e.g. "ж" -> "}|{". Full mapping: `letters3.json`.
- Replace characters with another character from the same cluster. Clustering is based on visual similarity between characters in the specified font; use `scripts.clusterization` to build the clusters before applying this augmentation. A toy example of mapping-based replacement is shown after this list.
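To illustrate the first three levels, a mapping could be applied roughly like this. This is a minimal sketch: the file format of `letters*.json` (assumed here to be a flat `{"character": "replacement"}` dictionary), the file path, and the replacement probability are assumptions; the real logic lives in `scripts.generate_noisy_dataset`.

```python
import json
import random

# Minimal sketch of mapping-based noising. The mapping file format and the
# replacement probability are assumptions for illustration only.
def noise_text(text: str, mapping: dict[str, str], prob: float = 0.3) -> str:
    return "".join(
        mapping[ch] if ch in mapping and random.random() < prob else ch
        for ch in text
    )

with open("resources/letters1.json", encoding="utf-8") as f:  # path is illustrative
    mapping = json.load(f)
print(noise_text("скотина! что сказать", mapping))
```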
Download the dataset from Kaggle: Toxic Russian Comments. It is best to put it into the `resources/data` folder.

Use `scripts.prepare_ok_dataset` to convert the dataset to `.jsonl` format:

```shell
python -m scripts.prepare_ok_dataset \
    --data resources/data/dataset.txt \
    --save-to resources/data/dataset.jsonl
```
Example:

From: `__label__INSULT скотина! что сказать`
To: `{"text": "скотина! что сказать", "label": 1}`
For now, 2 models are supported:

- Vanilla BERT model, see `src.models.transformer_encoder` for implementation details.
- VTR-based Transformer model, see `src.models.vtr` for implementation details. This model uses convolutions to extract features from visual token representations and passes them as embeddings to the vanilla Transformer.

Each model has 2 variants:

- Sequence classification via the `[CLS]` token and an MLP.
- Sequence labeling (suffix `SL`), where each token is passed to the MLP.

You can study how the convolutions over visual tokens work in `src.models.vtr.embedder`; a toy sketch of the idea is shown below.
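For intuition only, here is a minimal sketch of the idea: each token is rendered as a small image, a CNN compresses it into a vector, and the resulting vectors are used as token embeddings for a standard Transformer encoder. PyTorch is assumed; layer sizes, image dimensions, and names are illustrative and do not mirror `src.models.vtr.embedder`.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedder(nn.Module):
    """Toy embedder: compress rendered token images into embedding vectors."""

    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one 64-dim feature vector per token image
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, seq_len, height, width) rendered token images
        b, s, h, w = images.shape
        feats = self.conv(images.view(b * s, 1, h, w)).flatten(1)  # (b*s, 64)
        return self.proj(feats).view(b, s, -1)                     # (b, s, emb_dim)

# The embeddings are then fed to a vanilla Transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
dummy = torch.rand(2, 10, 16, 16)          # 2 sequences of 10 rendered tokens
out = encoder(ConvTokenEmbedder()(dummy))  # (2, 10, 256)
```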
We use `wandb` to log metrics and artifacts, so first create an account and log in:

```shell
wandb login
```

To run training, use `src.main`:

```shell
python -m src.main --vtr --sl \
    --train-data resources/data/noisy_dataset.jsonl \
    --tokenizer resources/tokenizer
```

See:

- `src.main` for basic arguments, e.g. data paths and model type.
- `src.utils.config` for training, model, and VTR configurations.
- Sequence classification:

  | Model | Accuracy | F1 | Precision | Recall |
  |---|---|---|---|---|
  | BERT | | | | |
  | VTR | | | | |

- Sequence labeling:

  | Model | Accuracy | F1 | Precision | Recall |
  |---|---|---|---|---|
  | BERT-SL | | | | |
  | VTR-SL | | | | |