diff --git a/docs/main_index.html b/docs/main_index.html deleted file mode 100644 index 62cff96..0000000 --- a/docs/main_index.html +++ /dev/null @@ -1,575 +0,0 @@ - - - - - - - - DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ - Hand-Drawn Math Images - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
-
-
-

- Logo - DrawEduMath -

-

- Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images -

-
- - Sami Baral*1, - - Lucy Li*2, - - Ryan Knight3, - - Alice Ng4, - - Luca Soldaini5, - - Neil Heffernan1, - - Kyle Lo5 -
- -
- 1Worcester Polytechnic Institute, - 2University of California, Berkeley,
- 3Insource Services Inc, - 4Teaching Lab, - 5Allen Institute for AI
- NeurIPS 2024, Math AI Workshop -
- - -
-
-
-
-
- - - - - - -
-
- -
-
- - DrawEduMath dataset creation -

Logo - DrawEduMath is a dataset of images of students' handwritten responses to math problems, each with a teacher's description. - Each image in our dataset is a concatenation of a math problem on the left with a student response on the right. Teachers describe the student's response to the problem, and then a model, such as GPT-4o shown here, writes QA pairs extracted from facets of the description. -

-

Introduction

- -
-

- In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. - For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. - To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath, - an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems. -

- -

- Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. - These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs, - as well as 4,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs). - We show that even state-of-the-art VLMs leave much room for improvement on DrawEduMath questions. - We also find that synthetic QAs, though imperfect, can yield model rankings similar to those from teacher-written QAs. - - We release DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind. -

-
-
-
- -
-
- - -
-
- -
-
- -

Leaderboard on DrawEduMath

-
-

Accuracy Scores on the - Logo - DrawEduMath dataset. -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#  Model              Date        Synthetic QA  Teacher QA
1  GPT-4o             2024-10-15  0.722         0.628
2  Claude 3.5 Sonnet  2024-10-15  0.715         0.657
3  Gemini 1.5 Pro     2024-10-11  0.646         0.490
4  Llama 3.2-11B V    2024-10-15  0.388         0.296
- - - -
-

Leaderboard scores are based on judgments made by the Mixtral 8x22B model.
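As a rough check on how closely the two QA sources agree on model ordering, the leaderboard scores can be compared with a rank correlation. The sketch below is illustrative only and is not the authors' analysis: the helper name `kendall_tau` is hypothetical, and it implements a simplified Kendall tau that assumes no tied scores (true for the four rows listed here).

```python
from itertools import combinations

# Scores copied from the leaderboard: (Synthetic QA, Teacher QA) accuracy.
scores = {
    "GPT-4o": (0.722, 0.628),
    "Claude 3.5 Sonnet": (0.715, 0.657),
    "Gemini 1.5 Pro": (0.646, 0.490),
    "Llama 3.2-11B V": (0.388, 0.296),
}


def kendall_tau(pairs):
    """Simplified Kendall tau over (x, y) pairs; assumes no ties (hypothetical helper)."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        # A pair of models is concordant when both QA sources order them the same way.
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


tau = kendall_tau(list(scores.values()))
print(f"Kendall tau between synthetic and teacher QA rankings: {tau:.3f}")
```

Only one pair of models (GPT-4o and Claude 3.5 Sonnet) swaps order between the two columns, so five of the six model pairs are concordant.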

-

🚨 To submit your results to the leaderboard, please email your result JSON files to this address.

-

-
-
- -
-
- -
-
- - -
-
-

- Logo - DrawEduMath Dataset -

-
-
- -
-
-
- -
-

Overview

-
-

- Logo - DrawEduMath consists of 2,030 images of U.S.-based students’ handwritten math responses to - 188 math problems spanning Grade 2 through high school. - - These images were initially collected on the ASSISTments - online learning platform, where students receive feedback from teachers on assigned work. - The problems that accompany each student response are drawn from three overlapping open educational resources (OER): Eureka Math, Open Up - Resources, and Illustrative Math. - -

- - - - -

- You can download the dataset on Hugging Face Dataset. -

- -
-
-
-
-
-
- data-overview -

- Key data statistics pertaining to students' math images
- included in Logo - DrawEduMath.
-

-
-
-
-
- data-composition -

- Key data statistics pertaining to the collection of
- teachers’ language for Logo - DrawEduMath. Word counts
- and text lengths are determined using white-space delineated tokens. -
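The whitespace-delimited counting mentioned in the caption above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' script: the function name `whitespace_token_count` and the sample sentence are hypothetical.

```python
def whitespace_token_count(text: str) -> int:
    """Count tokens separated by runs of whitespace (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


# Hypothetical teacher description, used only to exercise the helper.
sample = "The student shaded four of the six parts."
print(whitespace_token_count(sample))
```

Punctuation stays attached to the neighboring word under this scheme, so "parts." counts as a single token.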

-
-
-
- -
-
-

Examples

-

Examples of teachers’ answers to a question asking about possible errors in students’ responses to math - problems. All three examples of students’ hand-drawn responses are for the same math problem, shown on the left, which asks students to - draw and shade units on fraction strips to show 4 thirds. -

- Example of teachers' answers to question about erro - - -
-
- -
-
-

Statistics

- Overall question types in our VQA benchmark -

The most common question types in our Logo - DrawEduMath benchmark, along with examples of questions - categorized within each type.
- The percentages shown are the proportion of questions across all images within each - QA-writing (Claude-generated, GPT-4o-generated,
or teacher-written) workflow.

-
-
- -
-
- - -
-
-

Experiment Results

-
-
- -
-
- -
-
-

Results on Existing Vision Language Models

- -
-
- -
-
- - - - - - -
-
-

BibTeX

-
@inproceedings{baral2024drawedumath,
-  author    = {Baral, Sami and Li, Lucy and Knight, Ryan and Ng, Alice and Soldaini, Luca and Heffernan, Neil and Lo, Kyle},
-  title     = {DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images},
-  booktitle = {The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24},
-  year      = {2024}
-}
-
-
- -
-
- - - - - - - - - - - - - - - -
-
- - - - - - diff --git a/plots/bar_vlm_performance.py b/plots/bar_vlm_performance.py index ed91f1d..7c79c5d 100644 --- a/plots/bar_vlm_performance.py +++ b/plots/bar_vlm_performance.py @@ -10,7 +10,6 @@ import matplotlib.pyplot as plt import numpy as np -import pandas as pd # Create DataFrame from the data metrics = [