Legal Case Outcome Prediction Benchmark

Overview

This project implements a benchmark for evaluating how well large language models (LLMs) predict legal case outcomes in contract law. The benchmark uses a dataset of 100 contract dispute cases set in California, each involving a marketing service agreement between a tech startup and a marketing firm. The dataset was generated using Claude, and the benchmark was run against GPT-4o-Mini.

Dataset Structure

Contract Case Summaries

The dataset (contract_case_summaries.json) contains detailed case information including:

Case titles and jurisdictions
Party information (tech startups vs marketing firms)
Contract details
- Delivery dates
- Payment terms
- Performance obligations
Timeline of events
Legal issues
Arguments from both parties
Actual outcomes and reasoning
Metadata (complexity, statutes, precedents)

Example of dataset structure:

{
        "case_title": "Case Title 4",
        "jurisdiction": "United States District Court for California",
        "facts": {
            "parties": {
                "plaintiff": "Plaintiff Company 4, a tech startup.",
                "defendant": "Defendant Agency 4, a marketing firm."
            },
            "contract_details": {
                "type": "Service Agreement for Marketing Campaign",
                "key_terms": {
                    "delivery_date": "December 5, 2022",
                    "payment": "$30000 upon delivery of services",
                    "performance_obligations": "Execution of a targeted advertising campaign for product 4."
                }
            },
            "timeline_of_events": [
                {
                    "date": "July 5, 2022",
                    "event": "Contract signed."
                },
                {
                    "date": "September 14, 2022",
                    "event": "Defendant requests delay; plaintiff refuses."
                },
                {
                    "date": "December 5, 2022",
                    "event": "Defendant fails to deliver services."
                },
                {
                    "date": "December 9, 2022",
                    "event": "Plaintiff files lawsuit."
                }
            ]
        },
        "legal_issues": {
            "primary_issues": [
                "Did Defendant Agency 4 breach the contract by failing to deliver services by the specified date?",
                "Is Plaintiff Company 4 entitled to damages for losses incurred?"
            ],
            "secondary_issues": [
                "Does the defendant's staffing issues qualify as a valid excuse under force majeure?"
            ]
        },
        "plaintiff_arguments": [
            "Defendant failed to deliver on time, constituting a material breach.",
            "Plaintiff seeks damages for business losses incurred.",
            "Rejection of extension was reasonable given the time sensitivity."
        ],
        "defendant_arguments": [
            "Force majeure excuses delays due to unforeseen circumstances.",
            "Plaintiff failed to mitigate damages.",
            "Losses were not directly caused by the breach."
        ],
        "outcome": {
            "decision": "Judgment in favor of Plaintiff Company 4.",
            "legal_reasoning": [
                "Defendant breached the contract by failing to deliver.",
                "Force majeure was deemed inapplicable.",
                "Losses were foreseeable and directly linked to the breach."
            ],
            "remedies_awarded": {
                "damages": "$90000 for lost business opportunities.",
                "attorney_fees": "Defendant to pay legal costs.",
                "interest": "Pre-judgment interest applied."
            }
        },
        "metadata": {
            "complexity": "Moderate",
            "relevant_statutes": [
                "California Commercial Code \u00a7 3300"
            ],
            "precedents_cited": [
                "Precedent Case 4"
            ],
            "date": "February 14, 2023"
        }
}
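
The remedy and payment amounts in the record above are stored as free text (for example "$90000 for lost business opportunities."), so comparing predicted and actual damages needs a numeric value first. The following is a minimal sketch of that conversion, assuming every amount is written as a dollar sign followed by digits with optional thousands separators. The helper name parse_dollar_amount is hypothetical and is not taken from the repository.

```python
import re
from typing import Optional

# A dollar sign, optional spaces, then digits with optional commas and decimals.
_AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def parse_dollar_amount(text: str) -> Optional[float]:
    """Return the first dollar amount found in text, or None if there is none."""
    match = _AMOUNT.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Strings copied from the example record.
damages = parse_dollar_amount("$90000 for lost business opportunities.")
payment = parse_dollar_amount("$30000 upon delivery of services")
fees = parse_dollar_amount("Defendant to pay legal costs.")
print(damages, payment, fees)
```

Fields without a dollar figure, such as attorney_fees in the example, come back as None rather than raising, so they can be skipped when scoring.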

Project Components

1. Environment Setup

The project uses Poetry for dependency management with the following key dependencies:

[tool.poetry.dependencies]
python = "^3.13"
python-dotenv = "^1.0.1"
openai = "^1.56.0"
pandas = "^2.2.3"
matplotlib = "^3.9.3"
seaborn = "^0.13.2"
scikit-learn = "^1.5.2"

2. Data Processing Pipeline

The benchmark notebook (benchmark.ipynb) implements the following workflow:

Data loading and preprocessing
Prompt generation
Model evaluation
Results analysis and visualization
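
The prompt generation step has to show the model the facts and arguments of a case without leaking the ground truth. The sketch below illustrates one way to do that, assuming each record follows the example structure shown earlier. The function build_prompt, the HIDDEN_FIELDS tuple, and the instruction wording are illustrative assumptions, not the contents of generate_prompts.py.

```python
import json

# Fields withheld from the model. Assumption: metadata is hidden along with the
# outcome, because it lists the statutes and precedents the model is scored on.
HIDDEN_FIELDS = ("outcome", "metadata")


def build_prompt(case: dict) -> str:
    """Return a prediction prompt that exposes the case but not its answer."""
    visible = {key: value for key, value in case.items() if key not in HIDDEN_FIELDS}
    instructions = (
        "You are given a contract dispute. Predict which party prevails, "
        "explain the legal reasoning, and estimate any damages awarded."
    )
    return instructions + "\n\n" + json.dumps(visible, indent=2)


# Abbreviated record using values from the example case.
example_case = {
    "case_title": "Case Title 4",
    "plaintiff_arguments": ["Plaintiff seeks damages for business losses incurred."],
    "defendant_arguments": ["Plaintiff failed to mitigate damages."],
    "outcome": {"decision": "Judgment in favor of Plaintiff Company 4."},
    "metadata": {"complexity": "Moderate"},
}
prompt = build_prompt(example_case)
print(prompt)
```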

3. Evaluation Metrics

The benchmark evaluates LLM performance on multiple dimensions:

Decision accuracy (binary outcome prediction)
Legal reasoning alignment
Damages amount prediction accuracy
Statutory citation relevance
Precedent application accuracy
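
Two of these dimensions reduce to simple arithmetic once predictions are normalized. The sketch below shows one possible scoring, assuming decisions have already been mapped to the labels "plaintiff" and "defendant" and damages to plain numbers. The function names and the choice of relative error for damages are assumptions made for illustration, not the repository's implementation.

```python
def decision_accuracy(predicted: list[str], actual: list[str]) -> float:
    """Fraction of cases where the predicted winner matches the actual winner."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def damages_relative_error(predicted: float, actual: float) -> float:
    """Absolute difference between prediction and award, as a share of the award."""
    if actual == 0:
        raise ValueError("relative error is undefined when the actual award is zero")
    return abs(predicted - actual) / abs(actual)


# Made-up predictions for four cases, three of which match.
accuracy = decision_accuracy(
    ["plaintiff", "plaintiff", "defendant", "plaintiff"],
    ["plaintiff", "defendant", "defendant", "plaintiff"],
)
# A hypothetical prediction of $72000 against the $90000 award in the example case.
error = damages_relative_error(72000, 90000)
print(accuracy, error)
```

Relative error keeps small and large awards comparable; an absolute dollar difference would let the largest cases dominate the score.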

Case Pattern Analysis

Common Characteristics

Timeline Pattern:
- Contracts signed: July 2022
- Delay requests: September 2022
- Service failures: December 2022
- Lawsuits filed: Within 4-5 days of failure
Dispute Pattern:
- Consistent force majeure defense
- Similar breach patterns
- Standardized damage calculations
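
The timeline pattern can be checked directly from the timeline_of_events entries. The sketch below measures the gap between the service failure and the lawsuit filing, assuming dates use the "Month D, YYYY" format and that events are worded as in the example record ("fails to deliver", "files lawsuit"). The function name is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%B %d, %Y"  # matches strings such as "December 5, 2022"


def days_from_failure_to_filing(timeline: list[dict]) -> int:
    """Return the number of days between the service failure and the lawsuit."""
    failure = filing = None
    for entry in timeline:
        event = entry["event"].lower()
        when = datetime.strptime(entry["date"], DATE_FORMAT)
        if "fails to deliver" in event:
            failure = when
        elif "files lawsuit" in event:
            filing = when
    if failure is None or filing is None:
        raise ValueError("timeline lacks a failure or filing event")
    return (filing - failure).days


# Timeline copied from the example record.
timeline = [
    {"date": "July 5, 2022", "event": "Contract signed."},
    {"date": "September 14, 2022", "event": "Defendant requests delay; plaintiff refuses."},
    {"date": "December 5, 2022", "event": "Defendant fails to deliver services."},
    {"date": "December 9, 2022", "event": "Plaintiff files lawsuit."},
]
gap = days_from_failure_to_filing(timeline)
print(gap)  # 4
```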

Legal Principles Tested

Contract breach determination
Force majeure applicability
Damage calculation methodology
Duty to mitigate
Foreseeability of losses

Technical Implementation

Data Processing

The project uses pandas for data manipulation and analysis:

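import json

import matplotlib.pyplot as plt
import openai
import pandas as pd
import seaborn as sns

# Load the ground-truth cases (assumed to be a JSON array of case records)
with open('contract_case_summaries.json', 'r', encoding='utf-8') as file:
    ground_truth_cases = json.load(file)

# One row per case; nested objects such as "facts" remain dict-valued columns
ground_truth_df = pd.DataFrame(ground_truth_cases)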
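
Because each record is deeply nested, building a DataFrame from the raw records leaves columns such as facts and outcome holding whole dictionaries. The sketch below flattens a record into dotted column names using only the standard library, so that the nested values become individually addressable. The helper flatten is a hypothetical illustration and is not part of the repository.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into one level with dotted keys.

    Lists are kept as-is, so fields like plaintiff_arguments stay list-valued.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Abbreviated record using values from the example case.
record = {
    "case_title": "Case Title 4",
    "outcome": {
        "decision": "Judgment in favor of Plaintiff Company 4.",
        "remedies_awarded": {"damages": "$90000 for lost business opportunities."},
    },
    "metadata": {"complexity": "Moderate", "precedents_cited": ["Precedent Case 4"]},
}
flat_record = flatten(record)
print(sorted(flat_record))
```

A list of such flat dictionaries can be passed to pd.DataFrame to get one scalar column per leaf field. pandas also provides pd.json_normalize, which performs a comparable flattening of nested dictionaries.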

Visualization

The project uses matplotlib and seaborn to visualize results:

Decision distribution plots
Damages correlation analysis
Timeline pattern visualization
Legal reasoning consistency metrics

Research Applications

Primary Use Cases

Legal AI Development:
- Training data for legal prediction models
- Benchmark for legal reasoning capabilities
- Testing for bias in legal AI systems
Legal Education:
- Case study material
- Pattern recognition training
- Legal reasoning assessment
Legal Practice Analysis:
- Contract dispute pattern identification
- Risk assessment metrics
- Settlement value estimation

Limitations and Considerations

Dataset Constraints

Jurisdictional Limitation:
- California-specific cases
- Single industry focus (tech/marketing)
- Limited time period (2022-2023)
Case Complexity:
- Moderate complexity level
- Similar fact patterns
- Limited variety in legal issues

Ethical Considerations

AI Decision Making:
- Not intended for autonomous legal decisions
- Supplementary tool only
- Requires human oversight
Bias Mitigation:
- Regular dataset audits
- Diversity in case selection
- Transparent evaluation metrics

Future Development

Planned Enhancements

Dataset Expansion:
- Multiple jurisdictions
- Diverse industry sectors
- Various complexity levels
Model Integration:
- Multiple LLM support
- Custom model training
- Hybrid evaluation systems
Analysis Tools:
- Advanced visualization
- Statistical analysis
- Pattern recognition
