-
- In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts.
- For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work.
- To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath,
- an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems.
-
- Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs.
- These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs,
- as well as 4,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs).
- We show that even state-of-the-art VLMs leave much room for improvement on DrawEduMath questions.
- We also find that synthetic QA pairs, though imperfect, can yield model rankings similar to those obtained with teacher-written QA pairs.
-
- We release DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind.
-
-