Merge pull request #26 from ShiboSoftwareDev/main

updated readme
tscircuit · Feb 6, 2025 · 2695c84
2 parents ff4d967 + 3b3bf15
commit 2695c84

Showing 2 changed files with 72 additions and 140 deletions.
diff --git a/README.md b/README.md
@@ -1,20 +1,79 @@
-# Prompt Benchmarks
+# tscircuit Prompt Benchmarks
 
-[Docs](https://docs.tscircuit.com) &middot; [Website](https://tscircuit.com) &middot; [Twitter](https://x.com/tscircuit) &middot; [discord](https://tscircuit.com/community/join-redirect) &middot; [Quickstart](https://docs.tscircuit.com/quickstart) &middot; [Online Playground](https://tscircuit.com/playground)
+[Docs](https://docs.tscircuit.com) · [Website](https://tscircuit.com) · [Twitter](https://x.com/tscircuit) · [Discord](https://tscircuit.com/community/join-redirect) · [Quickstart](https://docs.tscircuit.com/quickstart) · [Online Playground](https://tscircuit.com/playground)
+
+This repository contains benchmarks for evaluating and improving the quality of system prompts used to generate tscircuit code. It includes the following components:
+
+- **Code Runner** (in `lib/code-runner`): Safely transpiles, evaluates, and renders TSX code for circuit generation.
+- **AI Integration** (in `lib/ai`): Interfaces with Anthropic’s Claude models for prompt completions and error correction.
+- **Utility Modules** (in `lib/utils`): Provide logging, snapshot management, and type-checking of generated circuits.
+- **Prompt Templates** (in `lib/prompt-templates`): Define various prompt structures for generating different circuit types.
+- **Benchmarking & Scoring** (using evalite and custom scorers in `benchmarks/scorers`): Run multiple tests to ensure circuit validity and quality.
 
-This repo contains benchmarks for tscircuit system prompts used for
-automatically generating tscircuit code.
 
 ## Running Benchmarks
 
-You can use `bun run benchmark` to select and run a benchmark. A single prompt takes about 10s-15s to
-run when run with `sonnet`. We have a set of samples (see the [tests/samples](./tests/samples) directory)
-that the benchmarks run against. When you change a prompt, you must run the benchmark
-for that prompt to update the benchmark snapshot. This is how we record degradation
-or improvement in the response quality. Each sample is run 5 times and two tests
-are run:
+To run the benchmarks using evalite, use:
+```bash
+bun start
+```
+Each prompt is processed multiple times to test:
+1. Whether the output compiles without errors.
+2. Whether the output meets the expected circuit specifications.
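The two checks above can be aggregated into pass rates over repeated runs. The following is a minimal sketch, not the repository's implementation: the `RunResult` shape and the `summarizeRuns` name are hypothetical, and it assumes a run that fails to compile is never counted as meeting the specification.

```typescript
// Hypothetical per-run record; the field names are illustrative only.
interface RunResult {
  compiles: boolean
  meetsSpec: boolean
}

// Returns the fraction of runs passing each check (both values in [0, 1]).
function summarizeRuns(runs: RunResult[]): { compileRate: number; specRate: number } {
  if (runs.length === 0) return { compileRate: 0, specRate: 0 }
  const compiled = runs.filter((r) => r.compiles).length
  // Assumption: a run that does not compile cannot meet the specification.
  const passed = runs.filter((r) => r.compiles && r.meetsSpec).length
  return { compileRate: compiled / runs.length, specRate: passed / runs.length }
}
```

Reporting both rates separately makes it possible to tell a prompt that produces invalid code apart from one that produces valid code for the wrong circuit.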
+
+After you modify prompts or system components, evalite reruns the benchmarks automatically, so skip any benchmarks you don't want to run.
+
+### Problem Sets
+
+This project uses TOML files to define problem sets for circuit generation. Each problem is defined using a TOML array of tables with the following format:
+
+```toml
+[[problems]]
+prompt = """
+Your circuit prompt description goes here.
+"""
+title = "Sample Problem Title"
+questions = [
+  { text = "Question text", answer = true },
+  { text = "Another question text", answer = false }
+]
+```
+
+In each problem:
+- The `prompt` field must contain the circuit description that instructs the AI.
+- The `title` gives a short title for the problem.
+- The `questions` array contains objects with a `text` property (the question) and an `answer` property (a boolean) used to validate the generated circuit.
+
+To add a new problem set, create a new TOML file in the `problem-sets` directory following this format. Each new file can contain one or more problems defined with the `[[problems]]` header.
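A new problem set can be sanity-checked against the format described above before it is benchmarked. This sketch assumes the TOML file has already been parsed into a plain object by a TOML library of your choice. The `Question` and `Problem` types and both function names are hypothetical and are not taken from the repository.

```typescript
// Hypothetical types mirroring the TOML format; not the repository's own definitions.
interface Question {
  text: string
  answer: boolean
}

interface Problem {
  prompt: string
  title: string
  questions: Question[]
}

// Returns human-readable errors; an empty list means the value matches the format.
function validateProblemSet(parsed: unknown): string[] {
  const errors: string[] = []
  const problems = (parsed as { problems?: unknown } | null)?.problems
  if (!Array.isArray(problems)) {
    return ["expected a top-level `problems` array"]
  }
  problems.forEach((problem, i) => {
    const p = (problem ?? {}) as Record<string, unknown>
    if (typeof p.prompt !== "string" || p.prompt.trim() === "") {
      errors.push(`problems[${i}]: prompt must be a non-empty string`)
    }
    if (typeof p.title !== "string") {
      errors.push(`problems[${i}]: title must be a string`)
    }
    if (!Array.isArray(p.questions)) {
      errors.push(`problems[${i}]: questions must be an array`)
      return
    }
    p.questions.forEach((question, j) => {
      const q = (question ?? {}) as Record<string, unknown>
      if (typeof q.text !== "string" || typeof q.answer !== "boolean") {
        errors.push(`problems[${i}].questions[${j}]: needs a string text and a boolean answer`)
      }
    })
  })
  return errors
}

// Type guard built on the validator, for callers that want a typed value.
function isProblemSet(parsed: unknown): parsed is { problems: Problem[] } {
  return validateProblemSet(parsed).length === 0
}
```

Collecting every error rather than stopping at the first one lets an author fix a whole file in a single pass.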
+
+## Build, Test, and Start
+
+- **Build**: `bun run build`
+- **Test**: `bun run test`
+- **Start**: `bun start`
+
+## Benchmarks Directory
+
+The benchmarks directory contains various files to help evaluate and score circuit-generating prompts:
+
+- `benchmarks/prompt-logs/`  
+  Text files (e.g., `prompt-2025-02-05T14-07-18-242Z.txt`, `prompt-2025-02-05T14-10-53-144Z.txt`) that log each prompt attempt and its output. They serve as a history of interactions.
+
+- `benchmarks/benchmark-local-circuit-error-correction.eval.ts`  
+  Runs local circuit evaluation with an error-correction workflow. It calls the AI repeatedly, up to a set maximum number of attempts, until the circuit output meets expectations, and logs each attempt.
+
+- `benchmarks/benchmark-local-circuit.eval.ts`  
+  Evaluates a local circuit by running a specific user prompt and checking that the generated circuit compiles and meets expected behaviors.
+
+- `benchmarks/benchmark-local-circuit-random.eval.ts`  
+  Generates random prompts using an AI-powered prompt generator and evaluates their corresponding circuit outputs. This file is useful for stress-testing and assessing the robustness of circuit generation.
+
+- `benchmarks/scorers/ai-circuit-scorer.ts`  
+  Uses an AI model to assign a score (from 0 to 1) based on correctness, appropriate use of components, circuit complexity, and code quality.
+
+- `benchmarks/scorers/circuit-scorer.ts`  
+  A basic scorer that checks each generated circuit against predefined questions and answers from problem sets.
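The question-based scoring idea can be illustrated as follows. This is a sketch under stated assumptions, not the code in `benchmarks/scorers/circuit-scorer.ts`: it assumes that some earlier step has already produced one observed boolean answer per question for the generated circuit, and the `scoreAnswers` name is hypothetical.

```typescript
// Same shape as the `questions` entries in a problem set.
interface Question {
  text: string
  answer: boolean
}

// Returns the fraction of questions whose observed answer matches the expected one.
// Assumption: observed[i] is the answer found for questions[i]; a missing entry counts as a mismatch.
function scoreAnswers(questions: Question[], observed: boolean[]): number {
  if (questions.length === 0) return 0
  const matches = questions.filter((q, i) => observed[i] === q.answer).length
  return matches / questions.length
}
```

A score on the 0 to 1 scale keeps this basic scorer comparable with the AI-based scorer, which uses the same range.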
 
-1. Does the output from the prompt compile?
-2. Does the output produce the expected circuit?
+## License
 
-The benchmark shows the percentage of samples that pass (1) and (2)
+MIT License
diff --git a/scripts/run-benchmark.ts b/scripts/run-benchmark.ts