🏆 Korean SAT LLM Leaderboard

📊 Test Your Own Model on the 2023 Korean SAT Sample Dataset

🗂️ Index

Jump to the Korean description

Take advantage of this unique opportunity to compare human academic ability with the performance of large language models (LLMs) based on the highly reputable College Scholastic Ability Test (CSAT)!

Test how well your fine-tuned Korean LLM performs on a 10-year benchmark of the Korean CSAT and see what score it would achieve right now!

🚨 Notice: 2025 SAT (1-Year) Model Performance Comparison Results

The current benchmark has been completed using GPT-based models. Future performance evaluations for other models will be conducted as additional resources and funding become available.
Please note that the current grade thresholds are estimated, and they will be updated once the official thresholds are announced.

Rank	Model Name	Standard Score	Raw Score	Common Subject Score	Elective Subject Score	Estimated Grade Cut (CruxTable Standard)
🥇 1st	o1-Preview	133	97	73	24	Grade 1
🥈 2nd	o1-mini	115	78	57	21	Grade 4
🥉 3rd	gemini_2.0_experimental_advanced	114	77	55	22	Grade 4
4th	gpt-4o	112	75	56	19	Grade 4
5th	claude-3-5-sonnet-20241022	108	70	54	16	Grade 4
6th	HyperClovaX	108	61	48	24	Grade 4
7th	gpt-4o-mini	97	59	44	15	Grade 5
8th	gpt-3.5-turbo	56	16	10	6	Grade 9

Note:

The only question o1-preview answered incorrectly was question 8 (3 points), a non-literary reading question in the CSAT Korean section.
For an analysis of that question and a detailed explanation of the experiment, please refer to this link.

🎯 What is the Korean SAT LLM Leaderboard?

The Korean SAT LLM Leaderboard benchmarks LLMs on 10 years of Korean CSAT (College Scholastic Ability Test) exams. The exams are developed by the reputable KICE (Korea Institute for Curriculum and Evaluation).
The CSAT includes a wide range of question types at varying difficulty levels, designed to assess reading comprehension, critical thinking, and sentence interpretation skills.

If you want to know more about the Korean SAT (the Korean college entrance exam), please refer to this link!

💯 Leaderboard

Leaderboard Rank	Model Name	Submitter Name	Avg. std Score	Avg. Grade	2024 SAT	2023 SAT	2022 SAT	2021 SAT	2020 SAT	2019 SAT	2018 SAT	2017 SAT	2016 SAT	2015 SAT	URL
🥇 1st	gpt-4o-2024-08-06	OpenAI	114.9	3.6	65 (4)	81 (4)	70 (4)	69 (4)	76 (4)	74 (3)	77 (4)	86 (2)	84 (3)	77 (4)	Link
🥈 2nd	Meta-Llama-3.1-405B-Instruct-Turbo	meta-llama	113.8	3.8	77 (3)	87 (3)	69 (4)	70 (4)	65 (5)	68 (4)	78 (4)	80 (3)	87 (3)	68 (5)	Link
🥉 3rd	Qwen2.5-72B-Instruct-Turbo	Qwen	105.8	4.6	61 (5)	78 (4)	52 (6)	60 (5)	60 (5)	64 (4)	74 (4)	70 (5)	74 (4)	79 (4)	Link
4th	Meta-Llama-3.1-70B-Instruct-Turbo	meta-llama	103.7	4.8	50 (6)	72 (5)	73 (3)	61 (5)	79 (3)	51 (5)	58 (6)	66 (5)	71 (5)	70 (5)	Link
5th	claude-3-5-sonnet-20241022	Anthropic	102.6	5	60 (5)	61 (6)	69 (4)	58 (5)	72 (4)	63 (4)	71 (5)	70 (5)	58 (6)	55 (6)	Link
6th	Qwen2-72B-Instruct	Qwen	98	5.2	53 (5)	57 (6)	59 (5)	45 (6)	57 (5)	56 (5)	76 (4)	69 (5)	58 (6)	63 (5)	Link
7th	gpt-4o-mini-2024-07-18	OpenAI	93.9	5.6	57 (5)	53 (6)	50 (6)	55 (5)	50 (6)	46 (6)	62 (5)	58 (6)	64 (5)	57 (6)	Link
8th	gemma-2-27b-it	Google	91	5.9	51 (6)	54 (6)	51 (6)	51 (6)	50 (6)	37 (7)	50 (6)	71 (4)	54 (6)	56 (6)	Link
9th	solar-mini-ja	Upstage	85.9	6.2	46 (6)	58 (6)	43 (6)	41 (7)	46 (6)	51 (5)	49 (6)	48 (7)	40 (7)	52 (6)	Link
10th	solar-mini	Upstage	85.5	6.4	33 (7)	57 (6)	48 (6)	42 (7)	46 (6)	50 (6)	43 (7)	55 (6)	42 (7)	56 (6)	Link
11th	Mixtral-8x22B-Instruct-v0.1	MistralAI	83.4	6.6	40 (7)	44 (7)	47 (6)	31 (8)	38 (7)	35 (7)	65 (5)	57 (6)	50 (6)	44 (7)	Link
12th	WizardLM-2-8x22B	Microsoft	83.3	6.6	37 (7)	56 (6)	47 (6)	30 (8)	52 (6)	29 (8)	51 (6)	47 (7)	51 (6)	53 (6)	Link
13th	Qwen2.5-7B-Instruct-Turbo	Qwen	80.3	6.8	40 (7)	40 (7)	39 (7)	35 (7)	35 (7)	35 (7)	58 (6)	53 (6)	44 (7)	42 (7)	Link
14th	Meta-Llama-3.1-8B-Instruct-Turbo	meta-llama	74.7	7.1	46 (6)	31 (8)	36 (7)	34 (7)	36 (7)	24 (8)	38 (7)	38 (7)	37 (7)	45 (7)	Link
15th	gpt-3.5-turbo-0125	OpenAI	68.7	7.7	29 (8)	39 (7)	26 (8)	17 (9)	36 (7)	24 (8)	38 (7)	25 (8)	45 (7)	27 (8)	Link
16th	Mixtral-8x7B-Instruct-v0.1	MistralAI	63.4	8.3	19 (9)	25 (8)	40 (7)	20 (9)	27 (8)	19 (9)	37 (7)	16 (9)	30 (8)	19 (9)	Link
17th	gemma-2-9b-it	Google	61.2	8.4	24 (8)	20 (9)	16 (9)	22 (9)	17 (9)	29 (8)	24 (8)	25 (8)	25 (8)	29 (8)	Link
18th	Llama-3.2-3B-Instruct-Turbo	meta-llama	60.6	8.7	28 (8)	18 (9)	27 (8)	23 (9)	16 (9)	17 (9)	21 (9)	29 (8)	22 (9)	23 (9)	Link
19th	Mistral-7B-Instruct-v0.3	MistralAI	57.2	8.9	17 (9)	11 (9)	22 (9)	12 (9)	18 (9)	21 (9)	19 (9)	27 (8)	23 (9)	21 (9)	Link

Ranking Criteria: Average of the standard scores over 10 years of CSAT exams. (Standard scores account for the difficulty of each year's exam.)
Avg. Std Score: The average of standard scores calculated using the KICE (Korea Institute for Curriculum and Evaluation) method.
Avg. Grade: The average of the grades received across the 10 years.
CSAT Scores by Year: Raw score (grade)
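
As a rough illustration of how the per-year cells relate to the Avg. Grade column, the sketch below parses "raw (grade)" cells and averages the grades. The cell values are copied from the gpt-4o-2024-08-06 row above; the function names are hypothetical and this is not the project's own tooling. The Avg. Std Score column cannot be reproduced this way, because per-year standard scores are not listed in the table.

```python
import re
from statistics import mean

# Cells copied from the gpt-4o-2024-08-06 row (2024 SAT down to 2015 SAT).
GPT_4O_CELLS = [
    "65 (4)", "81 (4)", "70 (4)", "69 (4)", "76 (4)",
    "74 (3)", "77 (4)", "86 (2)", "84 (3)", "77 (4)",
]

# A cell looks like "65 (4)": raw score, then a one-digit grade in parentheses.
CELL_PATTERN = re.compile(r"^\s*(\d+)\s*\((\d)\)\s*$")


def parse_cell(cell):
    """Split a 'raw (grade)' cell into (raw_score, grade) integers."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def average_grade(cells):
    """Average the per-year grades, rounded to one decimal place."""
    grades = [parse_cell(cell)[1] for cell in cells]
    return round(mean(grades), 1)


if __name__ == "__main__":
    print(average_grade(GPT_4O_CELLS))
```

For this row the grades sum to 36 over 10 years, which matches the 3.6 shown in the Avg. Grade column.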

Click here for details on the scoring system

Note:

Donations of GPU resources would be greatly appreciated and would help with the evaluations. Thank you!
Any and all feedback is welcome! Feel free to share your thoughts anytime!

📗 Notes: Performance comparison of models on the 2024 CSAT (1-year)

o1-preview: 88 points (Grade 1, Top 4%)
o1-mini: 60 points (Grade 5)

Note: The o1 benchmark is scheduled to be updated when the official version of o1 is released!

⭐️ Purpose of the Korean SAT LLM Leaderboard Project

Sharing Benchmark Data: Provide benchmarks that allow for the comparison of human performance and LLM performance.
Reliable Evaluation Dataset: Utilize the highly authoritative benchmark dataset curated by KICE (Korea Institute for Curriculum and Evaluation) to assess Korean language proficiency.
Annual Updates with Leakage Prevention: Update the CSAT Korean benchmark dataset annually to prevent data leakage.
Open-Source LLM Advancement: Enable open-source LLMs, independent of any specific country or company, to achieve top-tier (1st-grade) performance on the Korean CSAT.

🏅 Submission Guidelines

If you prefer to keep your model’s performance private and not appear on the public leaderboard, feel free to leave a note in the "Comments" section.
⭐️ Your model must have a minimum context length of 8K tokens to solve the Korean SAT questions!

Model Submission:
- Submit via the Form: Fill in the survey form to submit your model!
  - Link: https://moaform.com/q/X6xfGE
- Email Submission: Send the Hugging Face URL of your fine-tuned model along with your nickname.
  - Submission email: [email protected]
- Submit via GitHub Issue: Post your model’s Hugging Face URL and nickname in a GitHub issue.
```
<Example for email or GitHub issue submission>
Submitter Name: Elon musk
Hugging Face Submission URL: https://huggingface.co/Elon_model
Comments: Let's go Mars!
```
Check the Leaderboard: View your rank on GitHub or Hugging Face.
Climb the Ranks: Improve your score and claim the Slayer Champion title!

Notice: Evaluation may take 1-3 weeks depending on available GPU resources and submission volume.
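
If you want to check a submission before sending it, the template above can be read as simple "Field: value" lines. The sketch below is only an illustration of that idea: `parse_submission` and `REQUIRED_FIELDS` are hypothetical names, and treating the name and URL as required (with Comments optional) is an assumption, not a rule stated by the project.

```python
# Assumption: the submitter name and model URL are mandatory, Comments is optional.
REQUIRED_FIELDS = ("Submitter Name", "Hugging Face Submission URL")


def parse_submission(text):
    """Read 'Field: value' lines from a submission message into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs keep their own colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return fields


if __name__ == "__main__":
    example = """Submitter Name: Elon musk
Hugging Face Submission URL: https://huggingface.co/Elon_model
Comments: Let's go Mars!"""
    print(parse_submission(example))
```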

🪑 Benchmark Dataset

This competition utilizes CSAT Korean questions from the 10-year period between 2015 and 2024.
For the elective subjects introduced in 2022, Speech and Composition will be used as the elective subject for benchmarking.
The key evaluation metrics for the benchmark dataset include: Language Comprehension, Core Content Understanding, Logical Reasoning, Critical Thinking, Creative Thinking, and Multimedia Interpretation Skills. (Source: 2024 KICE CSAT Korean Evaluation Criteria)
The comprehension of LLMs can be assessed by examining their performance on problems from various fields such as humanities, social sciences, science, technology, and the arts.

📑 Category of Benchmark Dataset

(1) 📚 Reading Comprehension

Content: Passages are drawn from a wide range of fields, including humanities, social sciences, science, technology, and the arts.

Humanities passages: Cover topics such as philosophy, argumentation, and history.
Social sciences passages: Focus on subjects like economics, politics, law, and culture.
Science passages: Include content related to physics, chemistry, biology, and earth sciences.
Technology passages: Feature topics like computers, machinery, and everyday science.
Arts passages: Encompass a variety of artistic themes, including visual arts, music, architecture, and film.
Integrated passages: Complex, long passages that combine multiple fields are also frequently included.
Evaluation focus: Assessment of understanding across diverse domains, evaluation of reasoning and critical thinking skills.

(2) 🧑‍🎤 Literature

Content: Covers a variety of literary genres such as classical and modern novels, classical and modern poetry, and classical essays.

Evaluation focus: Assessment of emotional and stylistic comprehension, evaluation of understanding of various literary periods and genres.

(3) 🗣️ Speech and Writing

Content: Includes problems dealing with dialogue and writing.

Evaluation focus: Assessment of understanding in conversational contexts and evaluation of writing skills.

♾️ Metric

Evaluation Method

In this competition, the answers submitted by each model are evaluated based on whether they match the actual correct answers.
Scores are graded for each year's questions, and the final ranking is determined by the average of the standardized scores.
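
The two steps above (exact-match grading per exam, then ranking by the average standardized score) can be sketched roughly as follows. This is a minimal illustration and not the project's evaluation code: the function names, the five-question answer key, the point values, and the standard scores are all made up for the example.

```python
from statistics import mean


def score_exam(model_answers, answer_key, points):
    """Return the raw score: the sum of points for answers that match the key."""
    if not (len(model_answers) == len(answer_key) == len(points)):
        raise ValueError("answers, key, and points must have the same length")
    return sum(
        point
        for answer, correct, point in zip(model_answers, answer_key, points)
        if answer == correct
    )


def rank_models(standard_scores):
    """Order model names by their mean standard score, highest first."""
    return sorted(
        standard_scores,
        key=lambda name: mean(standard_scores[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical five-question exam; the model misses the 3-point question 3.
    key = [3, 1, 4, 2, 5]
    points = [2, 2, 3, 2, 3]
    answers = [3, 1, 2, 2, 5]
    print(score_exam(answers, key, points))

    # Hypothetical per-year standard scores for two models.
    print(rank_models({"model_a": [100, 110], "model_b": [120, 115]}))
```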

Explanation of Leaderboard Score

Raw Score: The score achieved out of a maximum of 100 points on the test.
Standardized Score: A score that measures how far the test taker's raw score deviates from the average of all test takers, so that results from exams of different difficulty can be compared.
Grade: Based on standardized scores, test takers are classified into 9 grades. For the Korean, Math, and Inquiry subjects, the top 4% of all test takers receive a grade of 1, the next 7% (cumulative 11%) receive a grade of 2, and the next 12% (cumulative 23%) receive a grade of 3. Refer to EBSI for details.
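
To make these definitions concrete, here is a simplified sketch of both calculations. Several parts are assumptions rather than anything stated in this README: the standard score is modeled as a z-score rescaled to a mean of 100 and a standard deviation of 20, the cumulative bounds for grades 4 to 9 follow the usual nine-band split (only the first three bounds are given above), and the function names and sample numbers are hypothetical. The leaderboard's actual figures for 2022 onward rely on the Crux Table, which this sketch does not reproduce.

```python
# Cumulative upper bounds (percent of test takers, counted from the top) for grades 1-9.
# Only 4, 11, and 23 come from the text above; the remaining bounds are assumed.
CUMULATIVE_PERCENT = [4, 11, 23, 40, 60, 77, 89, 96, 100]


def standard_score(raw, mean_raw, std_raw, center=100, scale=20):
    """Rescale a raw score's z-score to the assumed mean-100, sd-20 scale."""
    if std_raw <= 0:
        raise ValueError("std_raw must be positive")
    return round(center + scale * (raw - mean_raw) / std_raw)


def grade_from_top_percent(top_percent):
    """Map a position such as 'top 3.5%' to a grade from 1 to 9."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    for grade, bound in enumerate(CUMULATIVE_PERCENT, start=1):
        if top_percent <= bound:
            return grade


if __name__ == "__main__":
    # Hypothetical exam: population mean 60 and standard deviation 20.
    print(standard_score(80, mean_raw=60, std_raw=20))
    print(grade_from_top_percent(3.5))
```

A raw score one standard deviation above the mean maps to 120 under these assumptions, and a test taker in the top 3.5% falls in grade 1.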

📗 Helpful Reference

Nomadamas Experimental Records

🤷 More About CSAT (Korean college entrance exam)

The Korean Language section of the Korean CSAT (College Scholastic Ability Test) is part of the university entrance exam. Its questions are meticulously selected by renowned scholars from KICE (Korea Institute for Curriculum and Evaluation), who are confined to a hotel for this process. It is one of the most reputable exams in Korea. Evaluating LLMs (Large Language Models) with the same metrics used to assess test-takers provides an excellent opportunity to compare human proficiency with the language capabilities of LLMs.

📰 Notice

Due to copyright concerns, the CSAT benchmark dataset will not be made publicly available. The evaluation data spans from the 2015 CSAT to the 2024 CSAT, and elective subjects for the years 2022 to 2024 will be limited to Speech and Composition.
To ensure fairness, the prompts will not be disclosed.
In the future, the leaderboard will be updated to reflect models submitted for all subjects, including Korean, English, Math, Science, and Social Studies, on the day of the CSAT.
This Korean-SAT benchmarking system is powered by AutoRAG. AutoRAG is awesome!! (AutoRAG is an automatic RAG optimization tool that can also be used for LLM performance comparison and prompt engineering.)
For exams since the introduction of elective subjects in 2022, the standard score formula is derived using the Crux Table provided by Crux Consulting.

📬 Contact Us

For questions, errors, or support, feel free to reach out:
Email: [email protected]

Are you ready to become the next KO-SAT Slayer Champion? 💪

License

This dataset is sourced from the Korea Institute for Curriculum and Evaluation (KICE).

This benchmark leaderboard is a non-profit project that aims to provide information on LLM performance on CSAT benchmarks!

Marker-Inc-Korea/Korean-SAT-LLM-Leaderboard
