This is the codebase for the paper *MathScape: Evaluating MLLMs in Multimodal Math Scenarios through a Hierarchical Benchmark*.
To set up the environment, use the following commands:

```shell
conda create -n MathScape python=3.10
conda activate MathScape
```

Then, install the required Python packages:

```shell
pip install -r requirements.txt
```
The original dataset comprised 1,369 entries. After manually removing erroneous data, 1,325 entries remain. The images in these 1,325 entries have been renumbered sequentially from 1 to 1,325.
- `math_data.json`: all the math data, in JSON form.
- `math_question_solution_ans.json`: math questions along with their corresponding image IDs, detailed solution processes, and reference standard answers.
- `math_with_class.jsonl`: decomposes each question into multiple sub-questions (a single question may consist of 2-3 sub-questions), with a type label, knowledge-point labels, a solution process, and a reference standard answer for each sub-question.
- `question_knowledge.json`: maps the image IDs of the questions to their corresponding knowledge-point classifications.

You can download the dataset from this link.
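The JSON and JSONL files can be read with Python's standard library. Below is a minimal loading sketch; it assumes UTF-8 encoding and one JSON object per line in the `.jsonl` files:

```python
import json

def load_json(path):
    """Load a regular JSON file such as math_data.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_jsonl(path):
    """Load a JSON-Lines file such as math_with_class.jsonl
    (one JSON object per line, blank lines skipped)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```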
To evaluate multiple models, configure them in the `model.py` file. This script is responsible for loading the models, processing the questions, and generating the results.
To evaluate the models' performance, run:

```shell
python eval.py
```
This script compares the model-generated solutions against the standard answers, evaluates the results, and saves the evaluation data in a jsonl file.
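Saving results in JSON-Lines form (one record per line) can be sketched as follows; the field names here are illustrative, not taken from `eval.py`:

```python
import json

def save_jsonl(records, path):
    """Write a list of dicts as a .jsonl file: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps any non-ASCII math text readable
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Example record shape (hypothetical field names):
# {"image_id": 1, "model_answer": "...", "reference_answer": "..."}
```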
Once all models have been evaluated and the jsonl files generated, you can assess the performance of each model.
To assess overall performance, run:

```shell
python judge_all.py
```
This script determines the correctness of each answer and calculates the overall average accuracy for each model. It includes functions such as `judge_by_xx` to read and evaluate the results from the jsonl files, and it also supports evaluating accuracy along various dimensions, such as knowledge points, educational stages, and question types.
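A minimal sketch of that kind of per-dimension accuracy computation, assuming each jsonl record carries a correctness flag and a dimension label (both field names are assumptions, not confirmed from `judge_all.py`):

```python
import json
from collections import defaultdict

def accuracy_by_dimension(jsonl_path, dimension="knowledge_point"):
    """Group evaluation records by a label (e.g. knowledge point,
    educational stage, or question type) and return per-group accuracy.
    Field names 'is_correct' and the dimension key are assumptions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            key = rec[dimension]
            total[key] += 1
            if rec["is_correct"]:
                correct[key] += 1
    return {k: correct[k] / total[k] for k in total}
```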
To evaluate performance based on specific knowledge points, run:

```shell
python judge_by_knowledge.py
```

To evaluate performance across different educational stages, run:

```shell
python judge_by_stage.py
```

To evaluate performance based on question types, run:

```shell
python judge_by_type.py
```