- What is the Korean SAT LLM Leaderboard?
- Leaderboard
- Submission Guidelines
- About the Benchmark Dataset
- Metric
- Helpful Reference
- More About CSAT (Korean college entrance exam)
- Notice
- Contact Us
Take advantage of this unique opportunity to compare human academic ability with the performance of large language models (LLMs) based on the highly reputable College Scholastic Ability Test (CSAT)!
Test how well your fine-tuned Korean LLM performs on a 10-year benchmark of the Korean CSAT and see what score it would achieve right now!
The current benchmark has been completed using GPT-based models. Future performance evaluations for other models will be conducted as additional resources and funding become available.
Please note that the current grade thresholds are estimated, and they will be updated once the official thresholds are announced.
Rank | Model Name | Standard Score | Raw Score | Common Subject Score | Elective Subject Score | Estimated Grade Cut (CruxTable Standard) |
---|---|---|---|---|---|---|
🥇 1st | o1-Preview | 133 | 97 | 73 | 24 | Grade 1 |
🥈 2nd | o1-mini | 115 | 78 | 57 | 21 | Grade 4 |
🥉 3rd | gemini_2.0_experimental_advanced | 114 | 77 | 55 | 22 | Grade 4 |
4th | gpt-4o | 112 | 75 | 56 | 19 | Grade 4 |
5th | claude-3-5-sonnet-20241022 | 108 | 70 | 54 | 16 | Grade 4 |
6th | HyperClovaX | 108 | 61 | 48 | 24 | Grade 4 |
7th | gpt-4o-mini | 97 | 59 | 44 | 15 | Grade 5 |
8th | gpt-3.5-turbo | 56 | 16 | 10 | 6 | Grade 9 |
Note: The question that o1-preview answered incorrectly was question 8 (3 points), a non-literary text question in the CSAT Korean section!
For those curious about the analysis of the incorrect question and a detailed explanation of the experiment, please refer to this link.
The Korean SAT LLM Leaderboard benchmarks LLMs on 10 years of Korean CSAT (College Scholastic Ability Test)
exams, which are developed by KICE (Korea Institute for Curriculum and Evaluation).
The CSAT consists of a wide range of question types depending on the difficulty level, designed to assess reading
comprehension, critical thinking, and sentence interpretation skills.
If you want to know more about the Korean SAT (Korean college entrance exam), please refer to this!
Leaderboard Rank | Model Name | Submitter Name | Avg. std Score | Avg. Grade | 2024 SAT | 2023 SAT | 2022 SAT | 2021 SAT | 2020 SAT | 2019 SAT | 2018 SAT | 2017 SAT | 2016 SAT | 2015 SAT | URL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
🥇 1st | gpt-4o-2024-08-06 | OpenAI | 114.9 | 3.6 | 65 (4) | 81 (4) | 70 (4) | 69 (4) | 76 (4) | 74 (3) | 77 (4) | 86 (2) | 84 (3) | 77 (4) | Link |
🥈 2nd | Meta-Llama-3.1-405B-Instruct-Turbo | meta-llama | 113.8 | 3.8 | 77 (3) | 87 (3) | 69 (4) | 70 (4) | 65 (5) | 68 (4) | 78 (4) | 80 (3) | 87 (3) | 68 (5) | Link |
🥉 3rd | Qwen2.5-72B-Instruct-Turbo | Qwen | 105.8 | 4.6 | 61 (5) | 78 (4) | 52 (6) | 60 (5) | 60 (5) | 64 (4) | 74 (4) | 70 (5) | 74 (4) | 79 (4) | Link |
4th | Meta-Llama-3.1-70B-Instruct-Turbo | meta-llama | 103.7 | 4.8 | 50 (6) | 72 (5) | 73 (3) | 61 (5) | 79 (3) | 51 (5) | 58 (6) | 66 (5) | 71 (5) | 70 (5) | Link |
5th | claude-3-5-sonnet-20241022 | Anthropic | 102.6 | 5 | 60 (5) | 61 (6) | 69 (4) | 58 (5) | 72 (4) | 63 (4) | 71 (5) | 70 (5) | 58 (6) | 55 (6) | Link |
6th | Qwen2-72B-Instruct | Qwen | 98 | 5.2 | 53 (5) | 57 (6) | 59 (5) | 45 (6) | 57 (5) | 56 (5) | 76 (4) | 69 (5) | 58 (6) | 63 (5) | Link |
7th | gpt-4o-mini-2024-07-18 | OpenAI | 93.9 | 5.6 | 57 (5) | 53 (6) | 50 (6) | 55 (5) | 50 (6) | 46 (6) | 62 (5) | 58 (6) | 64 (5) | 57 (6) | Link |
8th | gemma-2-27b-it | | 91 | 5.9 | 51 (6) | 54 (6) | 51 (6) | 51 (6) | 50 (6) | 37 (7) | 50 (6) | 71 (4) | 54 (6) | 56 (6) | Link |
9th | solar-mini-ja | Upstage | 85.9 | 6.2 | 46 (6) | 58 (6) | 43 (6) | 41 (7) | 46 (6) | 51 (5) | 49 (6) | 48 (7) | 40 (7) | 52 (6) | Link |
10th | solar-mini | Upstage | 85.5 | 6.4 | 33 (7) | 57 (6) | 48 (6) | 42 (7) | 46 (6) | 50 (6) | 43 (7) | 55 (6) | 42 (7) | 56 (6) | Link |
11th | Mixtral-8x22B-Instruct-v0.1 | MistralAI | 83.4 | 6.6 | 40 (7) | 44 (7) | 47 (6) | 31 (8) | 38 (7) | 35 (7) | 65 (5) | 57 (6) | 50 (6) | 44 (7) | Link |
12th | WizardLM-2-8x22B | Microsoft | 83.3 | 6.6 | 37 (7) | 56 (6) | 47 (6) | 30 (8) | 52 (6) | 29 (8) | 51 (6) | 47 (7) | 51 (6) | 53 (6) | Link |
13th | Qwen2.5-7B-Instruct-Turbo | Qwen | 80.3 | 6.8 | 40 (7) | 40 (7) | 39 (7) | 35 (7) | 35 (7) | 35 (7) | 58 (6) | 53 (6) | 44 (7) | 42 (7) | Link |
14th | Meta-Llama-3.1-8B-Instruct-Turbo | meta-llama | 74.7 | 7.1 | 46 (6) | 31 (8) | 36 (7) | 34 (7) | 36 (7) | 24 (8) | 38 (7) | 38 (7) | 37 (7) | 45 (7) | Link |
15th | gpt-3.5-turbo-0125 | OpenAI | 68.7 | 7.7 | 29 (8) | 39 (7) | 26 (8) | 17 (9) | 36 (7) | 24 (8) | 38 (7) | 25 (8) | 45 (7) | 27 (8) | Link |
16th | Mixtral-8x7B-Instruct-v0.1 | MistralAI | 63.4 | 8.3 | 19 (9) | 25 (8) | 40 (7) | 20 (9) | 27 (8) | 19 (9) | 37 (7) | 16 (9) | 30 (8) | 19 (9) | Link |
17th | gemma-2-9b-it | | 61.2 | 8.4 | 24 (8) | 20 (9) | 16 (9) | 22 (9) | 17 (9) | 29 (8) | 24 (8) | 25 (8) | 25 (8) | 29 (8) | Link |
18th | Llama-3.2-3B-Instruct-Turbo | meta-llama | 60.6 | 8.7 | 28 (8) | 18 (9) | 27 (8) | 23 (9) | 16 (9) | 17 (9) | 21 (9) | 29 (8) | 22 (9) | 23 (9) | Link |
19th | Mistral-7B-Instruct-v0.3 | MistralAI | 57.2 | 8.9 | 17 (9) | 11 (9) | 22 (9) | 12 (9) | 18 (9) | 21 (9) | 19 (9) | 27 (8) | 23 (9) | 21 (9) | Link |
- Ranking Criteria: Average of the standard scores over 10 years of CSAT exams. (Standard scores reflect the difficulty level of each year's exam.)
- Avg. Std Score: The average of standard scores calculated using the KICE (Korea Institute for Curriculum and Evaluation) method.
- Avg. Grade: Average grade
- CSAT Scores by Year: Raw score (grade)
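The averaged columns can be checked against the per-year cells. The sketch below recomputes Avg. Grade from two rows of the table and orders three models by their Avg. Std Score. The function and variable names are hypothetical and are not taken from the leaderboard's own code; the numbers are copied from the table above.

```python
from statistics import mean

# "raw (grade)" cells copied from the leaderboard table, 2024 down to 2015.
ROWS = {
    "gpt-4o-2024-08-06":
        "65 (4) | 81 (4) | 70 (4) | 69 (4) | 76 (4) | 74 (3) | 77 (4) | 86 (2) | 84 (3) | 77 (4)",
    "Meta-Llama-3.1-405B-Instruct-Turbo":
        "77 (3) | 87 (3) | 69 (4) | 70 (4) | 65 (5) | 68 (4) | 78 (4) | 80 (3) | 87 (3) | 68 (5)",
}

# Avg. Std Score values copied from the same table.
AVG_STD_SCORE = {
    "Qwen2.5-72B-Instruct-Turbo": 105.8,
    "gpt-4o-2024-08-06": 114.9,
    "Meta-Llama-3.1-405B-Instruct-Turbo": 113.8,
}


def parse_cells(row):
    """Split a row of 'raw (grade)' cells into (raw, grade) integer pairs."""
    pairs = []
    for cell in row.split("|"):
        raw, grade = cell.strip().rstrip(")").split("(")
        pairs.append((int(raw), int(grade)))
    return pairs


def average_grade(row):
    """Average the yearly grades, rounded to one decimal like the table."""
    return round(mean(grade for _, grade in parse_cells(row)), 1)


def rank_models(avg_std_score):
    """Order model names from highest to lowest average standard score."""
    return sorted(avg_std_score, key=avg_std_score.get, reverse=True)


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name, average_grade(row))
    print(rank_models(AVG_STD_SCORE))
```

Because a lower grade is better while a higher standard score is better, the ranking sorts in descending order of standard score and uses the grade only as a reported column.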
Click here for details on the scoring system
Note:
- Donating GPU resources would be greatly appreciated and would help with the evaluations. Thank you!
- Welcome any and all feedback! Feel free to share your thoughts anytime!
- o1-preview: 88 points (Grade 1, Top 4%)
- o1-mini: 60 points (Grade 5)
Note: The o1 model is scheduled for a benchmark update once the official version of o1 is released!
- Sharing Benchmark Data: Provide benchmarks that allow for the comparison of human performance and LLM performance.
- Reliable Evaluation Dataset: Utilize the highly authoritative benchmark dataset curated by KICE (Korea Institute for Curriculum and Evaluation) to assess Korean language proficiency.
- Annual Updates with Leakage Prevention: Update the CSAT Korean benchmark dataset annually to prevent data leakage.
- Open-Source LLM Advancement: Enable open-source LLMs, independent of any specific country or company, to achieve top-tier (1st-grade) performance on the Korean CSAT.
- If you prefer to keep your model's performance private and not appear on the public leaderboard, feel free to leave a note in the "Comments" section.
- Your model must have a minimum context length of 8K tokens to solve the Korean SAT questions!
- Model Submission:
  - Submit via the Form: Fill in the survey form to submit your model!
  - Email Submission: Send the Hugging Face URL of your fine-tuned model along with your nickname.
    - Submission email: [email protected]
  - Submit via GitHub Issue: Post your model's Hugging Face URL and nickname in a GitHub issue.
  - Example for email or GitHub issue submission:
    - Submitter Name: Elon musk
    - Hugging Face Submission URL: https://huggingface.co/Elon_model
    - Comments: Let's go Mars!
- Check the Leaderboard: View your rank on GitHub or Hugging Face.
- Climb the Ranks: Improve your score and claim the Slayer Champion title!
Notice: Evaluation may take 1-3 weeks depending on available GPU resources and submission volume.
- This competition utilizes CSAT Korean questions from the 10-year period between 2015 and 2024.
- For the elective subjects introduced in 2022, Speech and Composition will be used as the elective subject for benchmarking.
- The key evaluation metrics for the benchmark dataset include: Language Comprehension, Core Content Understanding, Logical Reasoning, Critical Thinking, Creative Thinking, and Multimedia Interpretation Skills. <Source: 2024 KICE CSAT Korean Evaluation Criteria>
- The comprehension of LLMs can be assessed by examining their performance on problems from various fields such as humanities, social sciences, science, technology, and the arts.
Content: Passages are drawn from a wide range of fields, including humanities, social sciences, science, technology, and the arts.
- Humanities passages: Cover topics such as philosophy, argumentation, and history.
- Social sciences passages: Focus on subjects like economics, politics, law, and culture.
- Science passages: Include content related to physics, chemistry, biology, and earth sciences.
- Technology passages: Feature topics like computers, machinery, and everyday science.
- Arts passages: Encompass a variety of artistic themes, including visual arts, music, architecture, and film.
- Integrated passages: Complex, long passages that combine multiple fields are also frequently included.
- Evaluation focus: Assessment of understanding across diverse domains, evaluation of reasoning and critical thinking skills.
Content: Covers a variety of literary genres such as classical and modern novels, classical and modern poetry, and classical essays.
- Evaluation focus: Assessment of emotional and stylistic comprehension, evaluation of understanding of various literary periods and genres.
Content: Includes problems dealing with dialogue and writing.
- Evaluation focus: Assessment of understanding in conversational contexts and evaluation of writing skills.
- In this competition, the answers submitted by each model are evaluated based on whether they match the actual correct answers.
- Scores are graded for each year's questions, and the final ranking is determined by the average of the standardized scores.
- Raw Score: The score achieved out of a maximum of 100 points on the test.
- Standardized Score: A score that measures how far the test taker's raw score deviates from the average.
- Grade: Based on standardized scores, test takers are classified into 9 grades. For the Korean, Math, and Inquiry subjects, the top 4% of all test takers receive a grade of 1, the next 7% (cumulative 11%) receive a grade of 2, and the next 12% (cumulative 23%) receive a grade of 3. Refer to EBSI
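The three quantities above can be sketched in a few lines. This is an illustration under stated assumptions, not the leaderboard's scoring code: all function names are hypothetical, the `scale=20` and `center=100` defaults are the commonly cited values for the Korean section and are assumed here, and only the first three grade boundaries (4%, 11%, 23%) come from the text above, with the remaining ones following the usual nine-grade split. It also does not reproduce the Crux Table derivation used for exams with elective subjects.

```python
from bisect import bisect_left

# Cumulative share of test takers (percent) at the lower edge of grades 1..9.
# 4, 11, 23 are stated in the text; the rest are an assumption (usual 9-grade split).
CUMULATIVE_PERCENT = [4, 11, 23, 40, 60, 77, 89, 96, 100]


def raw_score(model_answers, answer_key, points):
    """Sum the points of every question whose answer matches the key."""
    return sum(
        point
        for answer, key, point in zip(model_answers, answer_key, points)
        if answer == key
    )


def standard_score(raw, mean, std, scale=20.0, center=100.0):
    """Rescale the raw score's deviation from the mean (scale/center assumed)."""
    return center + scale * (raw - mean) / std


def grade_from_top_percent(top_percent):
    """Map a position such as 'top 4% of test takers' to a grade from 1 to 9."""
    return bisect_left(CUMULATIVE_PERCENT, top_percent) + 1


if __name__ == "__main__":
    # Hypothetical three-question exam worth 2, 3, and 2 points.
    print(raw_score([1, 3, 5], [1, 2, 5], [2, 3, 2]))
    # Hypothetical cohort mean and standard deviation.
    print(standard_score(raw=80, mean=60, std=20))
    print(grade_from_top_percent(4), grade_from_top_percent(11.5))
```

Keeping the three steps as separate functions mirrors the definitions: matching answers give the raw score, the cohort statistics turn it into a standard score, and the position within the cohort determines the grade.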
The Korean CSAT (College Scholastic Ability Test) Korean Language section is a university entrance exam, with questions meticulously selected by renowned scholars from KICE (Korea Institute for Curriculum and Evaluation), who are confined to a hotel for this process. It is one of the most reputable exams in Korea. Evaluating the performance of LLMs (Large Language Models) based on the metrics used to assess test-takers provides an excellent opportunity to compare human proficiency with the language capabilities of LLMs.
- Due to copyright concerns, the CSAT benchmark dataset will not be made publicly available. The evaluation data spans from the 2015 CSAT to the 2024 CSAT, and elective subjects for the years 2022 to 2024 will be limited to Speech and Composition.
- To ensure fairness, the prompts will not be disclosed.
- In the future, the leaderboard will be updated to reflect models submitted for all subjects, including Korean, English, Math, Science, and Social Studies, on the day of the CSAT.
- This Korean SAT benchmarking system is powered by AutoRAG. AutoRAG is awesome!! (AutoRAG is an automatic RAG optimization tool that can also be used for LLM performance comparison and prompt engineering.)
- Since the introduction of elective subjects in 2022, the standard score formula has been derived using the Crux Table provided by Crux Consulting.
- For questions, errors, or support, feel free to reach out:
- Email: [email protected]
Are you ready to become the next KO-SAT Slayer Champion?
- This dataset is sourced from the Korea Institute for Curriculum and Evaluation (KICE).
This benchmark leaderboard is a non-profit project that aims to provide information on LLM performance with SAT benchmarks!