Skip to content

Latest commit

 

History

History
93 lines (68 loc) · 4.32 KB

README.md

File metadata and controls

93 lines (68 loc) · 4.32 KB

© [2023] Ubisoft Entertainment. All Rights Reserved

ToxBuster

We aim to use language models to identify and classify toxicity inside in-game chat.

Project Standards

For better collaboration and understanding of the project code and what has been done, the following sections outline what standards / loose rules are followed.

Project File Layout

Follow something similar to under src packaging-projects.

> src
   > module_1
      __init__.py
   > module_2
   __init__.py
> tests
main.py
poetry.lock

Code Simple Guide

  1. Keep it simple stupid.
  2. Don't add features / over-engineer until you need it.

Code Standards

Name Value Links Notes
Language Python 3.x
Package Manager Poetry Docs ; Useful TLDR On Windows, you may have to restart comp after installing to work with VSCode.
Python Env conda
Code Linter Pep8 Enable for VSCode
Docstring Pep8 Follow Numpy's Style
Unit Tests Python 3.x Sample

Poetry

  1. Test if poetry package manager is up to date:

    poetry run python .\main.py Train --config ".\train\train_on_CONDA_no_context.json" --max_epochs_to_train 1

    Note: Poetry installs torch with CPU support and no CUDA support.

    Issue 4231 -> User may have to separately install PyTorch with CUDA.

  2. Use poetry add to add missing packages to pyproject.toml & poetry.lock

  3. Use poetry export -f requirements.txt > requirements.txt to update requirements.txt.

Understanding our model

We want our model to be able to classify span of words as non-toxic / specific categories of toxicity.   For this use case, the model is currently a token classification.  

Basic information:

Collate Function

Trainer Logic / Terminology

  1. Epoch: One run of the training dataset.
  2. Batch Size: Number of samples to train on limited by memory size of CPU / GPU.
    • per_gpu_batch_size: number of samples to run on each gpu if more than one. Batch size will be num_gpu * per_gpu_batch_size
  3. Global Step: Number of batches before the model will calculate gradient & perform back propagation.
  4. Validation Loop:
    • We run validation at every X epochs. If we follow the paper, it was run 10 times per epoch.
    • Push metrics to TensorBoard
    • In normal ML models, we run validation every epoch or even every X epochs.
  5. Save model at the end of every X epoch:
    • changed from global step since this is dependent on two config variables and can be inconsistent.
    • can be changed back if we save all the configs.

Other Useful Links / Info

  1. Trainer Logic Code samples:

© [2023] Ubisoft Entertainment. All Rights Reserved