A template for SLU projects at skit.ai.
- Built using dialogy
- This template is used to automatically create an SLU microservice.
- Don't clone this project to build microservices; use `dialogy create <project_name>` to create projects.
- XLMRWorkflow uses "xlm-roberta-base" for both classification and ner tasks.
- Flask APIs.
- Sentry for error monitoring.
| File | Description |
|---|---|
| config | A directory that contains yaml files. |
| data | Version controlled by dvc. |
| data/0.0.1 | A directory that would contain these directories: datasets, metrics, models. |
| slu/dev | Programs not required in production; useful during development. |
| slu/src | Houses the prediction API. |
| slu/utils | Programs that offer assistance in either dev or src belong here. |
| tests/ | Test cases for your project. |
| CHANGELOG.md | Track changes in the code, datasets, etc. |
| Dockerfile | Containerize the application for production use. |
| LICENSE | Depending on your usage, choose the correct license; don't keep the default! |
| Makefile | Helps maintain hygiene before deploying code. |
| pyproject.toml | Track dependencies here. Also, this means you would be using poetry. |
| README.md | This must ring a bell. |
| uwsgi.ini | Modify as per use. |
Make sure you have `git`, `python==^3.8`, and `poetry` installed, preferably within a virtual environment. You would also need `cmake` installed if you are on macOS; run `brew install cmake` to install it.
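If you are setting up from scratch, here is a minimal sketch of the prerequisites, assuming Python's built-in `venv` and pip (adapt to your preferred environment manager):

```bash
# Create and activate a virtual environment, then install poetry inside it.
python3 -m venv .venv
source .venv/bin/activate
pip install poetry

# macOS only: cmake is needed for some dependencies.
brew install cmake
```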
To create a project using this template, run:
```bash
pip install dialogy
dialogy create hello-world
```
The questions here help:

- Populate your `pyproject.toml`, since we use `poetry` for managing dependencies.
- Create a repository and python package with the scaffolding you need.
- Remove `poetry.lock`, as it may affect package installation. It will be removed in a future version.
```bash
cd hello-world
poetry install
make lint
git init
git add .
git commit -m "add: initial commit."
```
Please look at the "languages" key in `config.yaml`. Update it with the supported languages to prevent hiccups!
The `poetry install` step takes care of dvc installation. You need to create a project on GitHub, GitLab, Bitbucket, etc. and set the remote. Once the installation is done, you can run `slu -h`.
```
> slu -h
usage: slu [-h] {setup-dirs,split-data,combine-data,train,test,release,repl} ...

positional arguments:
  {setup-dirs,split-data,combine-data,train,test,release,repl}
                        Project utilities.
    setup-dirs          Create base directory structure.
    setup-prompts       Create mapping between nls_labels and prompts.
    split-data          Split a dataset into train-test datasets for given ratio.
    combine-data        Combine datasets into a single file.
    train               Train a workflow.
    test                Test a workflow.
    release             Release a version of the project.
    repl                Read Eval Print Loop for a trained workflow.

optional arguments:
  -h, --help            show this help message and exit
```
Let's start with the dataset, model and report management command: `slu setup-dirs --version=0.0.1`.
```
slu setup-dirs -h
usage: slu setup-dirs [-h] [--version VERSION]

optional arguments:
  -h, --help         show this help message and exit
  --version VERSION  The version of the dataset, model, metrics to use. Defaults to the latest version.
```
This creates a data directory with the following structure:
```
data
+---0.0.1
    +---classification
        +---datasets
        +---metrics
        +---models
```
We use `dvc` for dataset and model versioning. S3 is the preferred remote for saving project-level data that is not fit for tracking via git.
```bash
# from project root.
dvc init
dvc add data
dvc remote add -d myremote s3://bucket/path/to/some/dir
git add data.dvc
```
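With the remote configured, a typical follow-up is to commit the dvc metadata and push the tracked data. A sketch, assuming your s3 credentials are already set up for the bucket:

```bash
# Commit the dvc pointer files to git and upload the data to the s3 remote.
git commit -m "add: track data with dvc."
dvc push
```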
Assuming we have a labeled dataset, we are ready to execute the next command, `slu split-data`. This puts a `train.csv` and a `test.csv` at the desired `--dest`, or by default within `data/0.0.1/classification/datasets`.
```
slu split-data -h
usage: slu split-data [-h] [--version VERSION] --file FILE (--train-size TRAIN_SIZE | --test-size TEST_SIZE)
                      [--stratify STRATIFY] [--dest DEST]

optional arguments:
  -h, --help            show this help message and exit
  --version VERSION     The version for dataset paths.
  --file FILE           A dataset to be split into train, test datasets.
  --train-size TRAIN_SIZE
                        The proportion of the dataset to include in the train split.
  --test-size TEST_SIZE
                        The proportion of the dataset to include in the test split.
  --stratify STRATIFY   Data is split in a stratified fashion, using the class labels. Provide the column-name in
                        the dataset that contains class names.
  --dest DEST           The destination directory for the split data.
```
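An example invocation; this is a sketch where the dataset path and the stratify column (here `intent`) are placeholders for your own labeled data:

```bash
# Split a labeled dataset into 80% train / 20% test, stratified on the label column.
slu split-data --version=0.0.1 --file=path/to/labeled.csv --train-size=0.8 --stratify=intent
```

After the split, the datasets directory looks like this: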
```
data
+---0.0.1
    +---classification
        +---datasets
        |   +---train.csv
        |   +---test.csv
        +---metrics
        +---models
```
To train a classifier, we run `slu train`.
```
slu train -h
usage: slu train [-h] [--file FILE] [--lang LANG] [--project PROJECT] [--version VERSION]

optional arguments:
  -h, --help         show this help message and exit
  --file FILE        A csv dataset containing utterances and labels.
  --lang LANG        The language of the dataset.
  --project PROJECT  The project scope to which the dataset belongs.
  --version VERSION  The dataset version, which will also be the model's version.
```
Not providing the `--file` argument will pick a `train.csv` from `data/0.0.1/classification/datasets`.
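For example, a sketch assuming an English dataset at the default location:

```bash
# Train the classifier on the default train.csv for version 0.0.1.
slu train --lang=en --version=0.0.1
```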
Once the training is complete, you would notice that the models directory is populated:
```
data
+---0.0.1
    +---classification
        +---datasets
        |   +---train.csv
        |   +---test.csv
        +---metrics
        +---models
            +---config.json
            +---eval_results.txt
            +---labelencoder.pkl
            +---model_args.json
            +---pytorch_model.bin
            +---sentencepiece.bpe.model
            +---special_tokens_map.json
            +---tokenizer_config.json
            +---training_args.bin
            +---training_progress_scores.csv
```
We evaluate all the plugins in the workflow using `slu test --lang=LANG`. Not providing the `--file` argument will pick a `test.csv` from `data/0.0.1/classification/datasets`.
```
slu test -h
usage: slu test [-h] [--file FILE] --lang LANG [--project PROJECT] [--version VERSION]

optional arguments:
  -h, --help         show this help message and exit
  --file FILE        A csv dataset containing utterances and labels.
  --lang LANG        The language of the dataset.
  --project PROJECT  The project scope to which the dataset belongs.
  --version VERSION  The dataset version, which will also be the report's version.
```
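For example, a sketch assuming the default `test.csv` and an English model:

```bash
# Evaluate the trained workflow on the default test.csv.
slu test --lang=en --version=0.0.1
```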
Reports are saved in the `data/0.0.1/classification/metrics` directory. We save:

- A classification report that shows the f1-score for all the labels in the `test.csv` or `--file`.
- A confusion matrix between selected intents.
- A collection of all the data-points where the predictions don't match the ground truth.
To run your models and see how they perform on live inputs, you have two options:

- `slu repl`

  ```
  slu repl -h
  usage: slu repl [-h] [--version VERSION] [--lang LANG]

  optional arguments:
    -h, --help         show this help message and exit
    --version VERSION  The version of the dataset, model, metrics to use. Defaults to the latest version.
    --lang LANG        Run the models and pre-processing for the given language code.
  ```

  The multi-line input catches people off-guard: press `ESC` + `ENTER` to submit an input to the repl.

- `task serve`

  This is a uwsgi server that provides the same interface as your production applications.
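For a quick interactive check, a sketch assuming an English model trained for version 0.0.1:

```bash
# Start the read-eval-print loop against the trained workflow.
slu repl --lang=en --version=0.0.1
```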
Once the model performance achieves a satisfactory metric, we want to release and persist the dataset, models and reports. To do this, we use the final command, `slu release --version VERSION`.
```
slu release -h
usage: slu release [-h] --version VERSION

optional arguments:
  -h, --help         show this help message and exit
  --version VERSION  The version of the dataset, model, metrics to use. Defaults to the latest version.
```
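For example, releasing the artifacts produced above:

```bash
# Tag and persist the dataset, models and reports for version 0.0.1.
slu release --version=0.0.1
```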
This command takes care of the following:

- Stages the `data` dir for dvc.
- Requires a changelog input.
- Stages changes within CHANGELOG.md, data.dvc, config.yaml and pyproject.toml for content updates and version changes.
- Creates a commit.
- Creates a tag for the given `--version=VERSION`.
- Pushes the data to the dvc remote.
- Pushes the code and tag to the git remote.
Finally, we are ready to build a Docker image of our service for production runs. We use Makefiles to enforce a few hygiene checks. Run `make <image-name>` to check if the image builds in your local environment. If you have CI/CD enabled, that should do it for you.

CI/CD automates the Docker image build and deployment to staging and production. The pipeline is triggered whenever a new tag is released (the recommended way to create and push tags is `slu release --version VERSION`).
The .gitlab-ci.yml pipeline includes the following stages:

- `publish_image`: build the docker image and push it to the registry.
- `update_chart_and_deploy_to_staging`: deploy the tagged docker image to the staging cluster.
- `update_chart_and_deploy_to_production`: deploy the tagged docker image to the production cluster.

The `update_chart_and_deploy_to_production` stage requires manual approval to run.
For a clean CI/CD setup, the following conditions should be met:

- The project name should be the same for the GitLab repository and the Amazon ECR folder.
- The k8s-configs/ai/clients project folder should follow this file structure:
  - values-staging.yaml # values for staging
  - values-production.yaml # values for prod
  - application-production.yaml # deploys app to prod
  - application-staging.yaml # deploys app to staging
- dvc shouldn't be a dev-dependency. In pyproject.toml, replace this:

  ```toml
  [tool.poetry.dev-dependencies.dvc]
  extras = [ "s3",]
  version = "^2.6.4"
  ```

  with:

  ```toml
  [tool.poetry.dependencies.dvc]
  extras = [ "s3",]
  version = "^2.6.4"
  ```

- poetry.lock should be a git-tracked file. Ensure it is not present inside `.gitignore`.
- Remove `.dvc` if present inside `.dockerignore` and replace it with `.dvc/cache/`.
The config manages paths for artifacts, arguments for models and rules for plugins.
```yaml
calibration: {}
languages:
  - en
model_name: slu
slots: # Arbitrary slot filling rule to serve as an example.
  _cancel_:
    number_slot:
      - number
tasks:
  classification:
    alias: {}
    format: ''
    model_args:
      production:
        best_model_dir: data/0.0.1/classification/models
        dynamic_quantize: true
        eval_batch_size: 1
        max_seq_length: 128
        no_cache: true
        output_dir: data/0.0.1/classification/models
        reprocess_input_data: true
        silent: true
        thread_count: 1
        use_multiprocessing: false
      test:
        best_model_dir: data/0.0.1/classification/models
        output_dir: data/0.0.1/classification/models
        reprocess_input_data: true
        silent: true
      train:
        best_model_dir: data/0.0.1/classification/models
        early_stopping_consider_epochs: true
        early_stopping_delta: 0.01
        early_stopping_metric: eval_loss
        early_stopping_metric_minimize: true
        early_stopping_patience: 3
        eval_batch_size: 8
        evaluate_during_training_steps: 1080
        fp16: false
        num_train_epochs: 1
        output_dir: data/0.0.1/classification/models
        overwrite_output_dir: true
        reprocess_input_data: true
        save_eval_checkpoints: false
        save_model_every_epoch: false
        save_steps: -1
        use_early_stopping: true
    skip: # Remove these intents from training data.
      - silence
      - audio_noisy
    threshold: 0.1
    use: true
version: 0.0.1
```
Model args help maintain the configuration of models in a single place; here is a full list of options for classification or NER model configuration.
These are the APIs being used:

- Health check: to check if the service is running.

  ```python
  @app.route("/", methods=["GET"])
  def health_check():
      return jsonify(
          status="ok",
          response={"message": "Server is up."},
      )
  ```

- Predict: the main production API.

  ```python
  @app.route("/predict/<lang>/slu/", methods=["POST"])
  ```
We have already covered commands for training, evaluating and interacting with an intent classifier using this project. Here we cover the types of entities that are supported by the project.
| Entity Type | Plugin | Entity Description |
|---|---|---|
| NumericalEntity | DucklingPlugin | Numbers and numerals, like: 4, four, 35th and sixth |
| TimeEntity | DucklingPlugin | Now, Today, Tomorrow, Yesterday, 25th September, four January, 3 o'clock, 5 pm |
| DurationEntity | DucklingPlugin | for 2h |
| TimeIntervalEntity | DucklingPlugin | after 8 pm, before 6 am, 2 to 3 pm |
| PeopleEntity | DucklingPlugin | 5 people, a couple |
| CurrencyEntity | DucklingPlugin | $45, 80 rupees |
| KeywordEntity | ListEntityPlugin | Any pattern-based entity, e.g. r"(pine)?apple" |
We have provided both DucklingPlugin and ListEntityPlugin readily initialized as processors, but they are not included in the list of plugin objects that the function returns. To use these plugins:
```python
# If no entities are required:
def get_plugins(purpose, config: Config, debug=False) -> List[Plugin]:
    ...
    return [merge_asr_output, xlmr_clf, slot_filler]  # this list must change

# If only the duckling plugin is required:
def get_plugins(purpose, config: Config, debug=False) -> List[Plugin]:
    ...
    return [merge_asr_output, duckling_plugin, xlmr_clf, slot_filler]  # this list must change

# If only the list entity plugin is required:
def get_plugins(purpose, config: Config, debug=False) -> List[Plugin]:
    ...
    return [merge_asr_output, list_entity_plugin, xlmr_clf, slot_filler]  # this list must change

# If both the duckling plugin and the list entity plugin are required:
def get_plugins(purpose, config: Config, debug=False) -> List[Plugin]:
    ...
    return [merge_asr_output, duckling_plugin, list_entity_plugin, xlmr_clf, slot_filler]  # this list must change
```
These plugins come with scoring and aggregation logic that can be utilised via their threshold property. The threshold here is the proportion of transcripts in which the entity appears.

- If an entity is detected in only one of 3 transcripts, its score is 0.33. As long as `score > threshold`, the entity is produced.
- If entities with the same value and type are produced multiple times in the same transcript, they are counted only once, assuming the speaker is repeating the entity.
- If entities with the same value and type are produced across different transcripts, they are counted once per transcript.