Construe: An LLM Benchmark Utility

An LLM inferencing benchmark tool focusing on device-specific latency and memory usage.

Quick Start

This package is intended to be installed with pip, which creates a command line program construe on your $PATH to execute benchmarking commands:

$ pip install construe
$ which construe
$ construe --help

There are several top-level configurations that you can specify either as an environment variable or a command line option before the command. The environment variables are as follows:

  • $CONSTRUE_ENV or $ENV: specify the name of the experimental environment for comparison purposes.
  • $CONSTRUE_DEVICE or $TORCH_DEVICE: specify the name of the default device to use with PyTorch e.g. cpu, mps, or cuda.
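
For example, to benchmark on Apple silicon using the MPS device (the environment name below is illustrative):

$ export CONSTRUE_ENV="MacBook Pro 2022 M1"
$ export TORCH_DEVICE=mps
$ construe basic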

The command line utility help is as follows:

Usage: construe [OPTIONS] COMMAND [ARGS]...

Options:
  --version          Show the version and exit.
  -d, --device TEXT  specify the pytorch device to run on e.g. cpu, mps or
                     cuda
  -e, --env TEXT     name of the experimental environment for comparison
                     (default is hostname)
  -h, --help         Show this message and exit.

Commands:
  basic
  moondream

Basic Benchmarks

The basic benchmarks implement the dot product examples from the PyTorch benchmark documentation. These benchmarks can be run using construe basic; for example:

$ construe -e "MacBook Pro 2022 M1" basic -o results-macbook.pickle

The -e flag specifies the environment for comparison purposes and the -o flag saves the measurements out to disk as a Pickle file that can be loaded for comparison to other environments later.
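
For reference, the dot product measurement in the PyTorch benchmark documentation looks roughly like the following. This is a sketch of the kind of measurement construe automates, not its exact code:

import torch
import torch.utils.benchmark as benchmark

x = torch.randn(10000, 64)

# Time a batched dot product implemented as an elementwise multiply and sum
t = benchmark.Timer(
    stmt="x.mul(x).sum(-1)",
    globals={"x": x},
    num_threads=1,
)
print(t.timeit(100))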

Command usage is as follows:

Usage: construe basic [OPTIONS]

Options:
  -e, --env TEXT             name of the experimental environment for
                             comparison (default is hostname)
  -o, --saveto TEXT          path to write the measurements pickle data to
  -t, --num-threads INTEGER  specify number of threads for benchmark (default
                             to maximum)
  -F, --fuzz / --no-fuzz     fuzz the tensor sizes of the inputs to the
                             benchmark
  -S, --seed INTEGER         set the random seed for random generation
  -h, --help                 Show this message and exit.
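
Saved measurements can be loaded later to compare environments. A minimal sketch, assuming each pickle file holds a list of torch.utils.benchmark Measurement objects (the exact structure depends on the construe version):

import pickle
from torch.utils.benchmark import Compare

measurements = []
for path in ("results-macbook.pickle", "results-linux.pickle"):  # hypothetical result files
    with open(path, "rb") as f:
        measurements.extend(pickle.load(f))

# Print a side-by-side comparison of the collected measurements
Compare(measurements).print()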

Moondream Benchmarks

The moondream package contains small image-to-text computer vision models that can be used in the first step of a content moderation workflow (e.g. image to text, then moderate the text). This benchmark runs the model's encoding and inference steps on a small number of images, reporting the average time for each operation and the line-by-line memory usage of the model.

It can be run as follows:

$ construe moondream

Command usage is as follows:

Usage: construe moondream [OPTIONS]

Options:
  -h, --help  Show this message and exit.
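
For reference, the moondream2 usage documented on Hugging Face separates image encoding from question answering, which is why the benchmark reports both. A minimal sketch (not construe's exact code; the image path is illustrative, and the model's API has varied across revisions):

from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")

image = Image.open("example.jpg")  # illustrative input image

# The encoding and question-answering steps are what the benchmark times
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))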

Model References

  1. Image to Text: Moondream (vikhyatk/moondream2)
  2. Speech to Text: Whisper (openai/whisper-tiny.en)
  3. Image Classification: MobileNet (google/mobilenet_v2_1.0_224)
  4. Object Detection: MobileViT (apple/mobilevit-xx-small)
  5. NSFW Image Classification: Fine-Tuned Vision Transformer (ViT) (Falconsai/nsfw_image_detection)
  6. Image Enhancement: LoL MIRNet (keras-io/lowlight-enhance-mirnet)
  7. Text Classification: Offensive Speech Detector (KoalaAI/OffensiveSpeechDetector)
  8. Token Classification: GLiNER (knowledgator/gliner-bi-small-v1.0)

Dataset References

  1. AEGIS AI Content Safety v1.0: Text data that provides examples of content safety categories (e.g. harmful text) as described by Nvidia's content safety taxonomy.
  2. LoL (Low-Light) Dataset: Contains 500 low-light and normal-light image pairs for image enhancement.
  3. English Dialects: Contains 31 hours of audio from 120 individuals speaking with different accents of the British Isles and is used for speech to text.
  4. Reddit Posts Comments: A text dataset of comments on Reddit posts that can be used for NER and content moderation tasks on short-form text.
  5. Student and LLM Essays: A text dataset of essays written by students (and LLMs) that can be used for NER and content moderation tasks on longer-form text.
  6. NSFW Detection: An image dataset that contains NSFW and SFW images used for content moderation.
  7. Movie Scenes: An image dataset that contains stills from commercial movies and can be used for image classification and content moderation tasks.

Releases

To release the construe library and deploy it to PyPI, run the following commands:

$ python -m build
$ twine upload dist/*
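
Note that build and twine are not dependencies of construe itself; if they are missing, they can be installed with pip:

$ pip install --upgrade build twine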
