Merge pull request #26 from ShiboSoftwareDev/main: updated readme

Showing 2 changed files with 72 additions and 140 deletions.

# tscircuit Prompt Benchmarks

[Docs](https://docs.tscircuit.com) · [Website](https://tscircuit.com) · [Twitter](https://x.com/tscircuit) · [Discord](https://tscircuit.com/community/join-redirect) · [Quickstart](https://docs.tscircuit.com/quickstart) · [Online Playground](https://tscircuit.com/playground)

This repository contains benchmarks for evaluating and improving the quality of system prompts used to generate tscircuit code. It includes components for:

- **Code Runner** (in `lib/code-runner`): Safely transpiles, evaluates, and renders TSX code for circuit generation.
- **AI Integration** (in `lib/ai`): Interfaces with Anthropic's Claude models for prompt completions and error correction.
- **Utility Modules** (in `lib/utils`): Provide logging, snapshot management, and type-checking of generated circuits.
- **Prompt Templates** (in `lib/prompt-templates`): Define various prompt structures for generating different circuit types.
- **Benchmarking & Scoring** (using evalite and custom scorers in `benchmarks/scorers`): Run multiple tests to ensure circuit validity and quality.

## Running Benchmarks

To run the benchmarks using evalite, use:

```bash
bun start
```

A single prompt takes about 10-15 seconds to run with `sonnet`. The benchmarks run against a set of samples (see the [tests/samples](./tests/samples) directory). When you change a prompt, rerun the benchmark for that prompt to update its snapshot; this is how degradation or improvement in response quality is recorded. Each prompt is processed multiple times to test:

1. Whether the output compiles without errors.
2. Whether the output meets the expected circuit specifications.

The benchmark reports the percentage of samples that pass both checks; a sketch of one possible compile check follows.
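
As a rough illustration of check (1), a compile check could be sketched with Bun's built-in transpiler. This is an assumption about one way to do it, not the actual `lib/code-runner` implementation:

```ts
// Hypothetical compile check: transpile the generated TSX and treat a
// transpiler error as a failure of check (1). The real code runner also
// evaluates and renders the circuit, which this sketch does not attempt.
const transpiler = new Bun.Transpiler({ loader: "tsx" })

export const compilesWithoutErrors = (code: string): boolean => {
  try {
    transpiler.transformSync(code)
    return true
  } catch {
    return false
  }
}
```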

After modifying prompts or system components, evalite reruns automatically; skip any benchmarks you don't want to rerun.
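
For orientation, an eval file in this setup might look roughly like the sketch below. It assumes evalite's `evalite(name, { data, task, scorers })` shape; the sample data, import path, and `generateCircuit` helper are illustrative stand-ins, not this repo's actual code:

```ts
import { evalite } from "evalite"

// Hypothetical import: the repo's scorers live in benchmarks/scorers,
// but this export name is an assumption.
import { circuitScorer } from "./scorers/circuit-scorer"

// Stand-in for the repo's AI integration in lib/ai (hypothetical);
// a real implementation would call a Claude model.
const generateCircuit = async (prompt: string): Promise<string> =>
  `// generated tscircuit code for: ${prompt}`

evalite("Local circuit generation", {
  // Each sample pairs a user prompt with the expectations used for scoring.
  data: async () => [
    {
      input: "Create a circuit with a 1k resistor between VCC and GND",
      expected: { compiles: true },
    },
  ],
  // The task runs the prompt through the system prompt under test and
  // returns the generated code for the scorers to evaluate.
  task: async (input) => generateCircuit(input),
  scorers: [circuitScorer],
})
```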

### Problem Sets

This project uses TOML files to define problem sets for circuit generation. Each problem is defined using a TOML array of tables with the following format:

```toml
[[problems]]
prompt = """
Your circuit prompt description goes here.
"""
title = "Sample Problem Title"
questions = [
  { text = "Question text", answer = true },
  { text = "Another question text", answer = false }
]
```

In each problem:

- The `prompt` field contains the circuit description that instructs the AI.
- The `title` field gives a short title for the problem.
- The `questions` array contains objects with a `text` property (the question) and a boolean `answer` property used to validate the generated circuit.

To add a new problem set, create a new TOML file in the `problem-sets` directory following this format. Each file can contain one or more problems, each introduced with a `[[problems]]` header; a sketch of how such a file might be loaded follows.
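
For illustration, a problem-set file could be loaded as sketched below. Bun parses `.toml` imports natively; the file name and the type names here are assumptions, not this repo's actual code:

```ts
// Sketch: loading and typing a problem set. The file name is hypothetical.
import problemSet from "../problem-sets/example-problems.toml"

interface Question {
  text: string
  answer: boolean
}

interface Problem {
  prompt: string
  title: string
  questions: Question[]
}

// Each [[problems]] table in the TOML file becomes one entry in this array.
const { problems } = problemSet as { problems: Problem[] }

for (const problem of problems) {
  console.log(`${problem.title}: ${problem.questions.length} validation questions`)
}
```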

## Build, Test, and Start

- **Build**: `bun run build`
- **Test**: `bun run test`
- **Start**: `bun start`

## Benchmarks Directory

The benchmarks directory contains various files to help evaluate and score circuit-generating prompts:

- `benchmarks/prompt-logs/`: Text files (e.g., `prompt-2025-02-05T14-07-18-242Z.txt`, `prompt-2025-02-05T14-10-53-144Z.txt`) that log each prompt attempt and its output. They serve as a history of interactions.
- `benchmarks/benchmark-local-circuit-error-correction.eval.ts`: Runs local circuit evaluation with an error-correction workflow. It repeatedly calls the AI (up to a set maximum) until the circuit output meets expectations, logging each attempt; a sketch of this retry loop follows the list.
- `benchmarks/benchmark-local-circuit.eval.ts`: Evaluates a local circuit by running a specific user prompt and checking that the generated circuit compiles and meets expected behaviors.
- `benchmarks/benchmark-local-circuit-random.eval.ts`: Generates random prompts using an AI-powered prompt generator and evaluates their corresponding circuit outputs. Useful for stress-testing and assessing the robustness of circuit generation.
- `benchmarks/scorers/ai-circuit-scorer.ts`: Uses an AI model to assign a score (from 0 to 1) based on correctness, appropriate use of components, circuit complexity, and code quality.
- `benchmarks/scorers/circuit-scorer.ts`: A basic scorer that checks each generated circuit against predefined questions and answers from the problem sets; a sketch of such a scorer appears after the retry-loop sketch.
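
The error-correction workflow described above might be structured like the following sketch; every name here is hypothetical and the attempt budget is an assumption:

```ts
// Hypothetical retry loop: ask the model for a circuit, feed compile
// errors back into the prompt, and stop after a set maximum of attempts.
const MAX_ATTEMPTS = 5 // assumption; the real maximum lives in the eval file

async function generateWithErrorCorrection(
  prompt: string,
  generate: (prompt: string) => Promise<string>,
  compile: (code: string) => { ok: boolean; error?: string },
): Promise<string | null> {
  let currentPrompt = prompt
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const code = await generate(currentPrompt)
    const result = compile(code)
    console.log(`attempt ${attempt}: ${result.ok ? "ok" : result.error}`)
    if (result.ok) return code
    // Feed the error back so the next attempt can correct it.
    currentPrompt = `${prompt}\n\nThe previous attempt failed with:\n${result.error}`
  }
  return null // expectations never met within the attempt budget
}
```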
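
And a question-and-answer scorer in the spirit of `circuit-scorer.ts` might look like this sketch, assuming evalite's `createScorer` helper; the output shape and the circuit-inspection logic are placeholders, not the repo's real code:

```ts
import { createScorer } from "evalite"

// Hypothetical output shape: the eval task is assumed to return the rendered
// circuit JSON along with the problem's questions and expected answers.
interface ScoredOutput {
  circuitJson: unknown[]
  questions: { text: string; answer: boolean }[]
}

// Placeholder for real circuit inspection (e.g. checking circuitJson for the
// component or connection a question asks about).
const answerQuestion = (circuitJson: unknown[], questionText: string): boolean =>
  circuitJson.length > 0 // stand-in logic only

export const circuitScorer = createScorer({
  name: "circuit-scorer",
  description: "Fraction of problem-set questions answered correctly",
  scorer: ({ output }) => {
    const { circuitJson, questions } = output as ScoredOutput
    if (questions.length === 0) return 0
    const correct = questions.filter(
      (q) => answerQuestion(circuitJson, q.text) === q.answer,
    ).length
    return correct / questions.length
  },
})
```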

## License

MIT License