
# Rankings of LLMs as a function of the rating algorithm used

Note: The mElo2k learning rates are the same across all datasets and all values of k
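mElo2k (introduced by Balduzzi et al. in "Re-evaluating Evaluation", NeurIPS 2018) augments each model's scalar Elo rating with a 2k-dimensional vector that captures cyclic, non-transitive interactions. A minimal sketch of the online update is below; the learning rates `eta_r` and `eta_c` are illustrative placeholders, not the shared values used in these experiments:

```python
import numpy as np

def melo2k_update(r, C, i, j, outcome, eta_r=1.0, eta_c=0.1):
    """One online mElo2k update for a match between models i and j.

    r: (n,) scalar ratings; C: (n, 2k) cyclic-interaction vectors;
    outcome: 1.0 if i wins, 0.5 for a draw, 0.0 if i loses.
    eta_r and eta_c are illustrative learning rates, not the repo's values.
    """
    dim = C.shape[1]
    # Omega: block-diagonal antisymmetric pairing matrix (k blocks of
    # [[0, 1], [-1, 0]]) defining the cyclic term c_i^T Omega c_j.
    omega = np.zeros((dim, dim))
    for b in range(0, dim, 2):
        omega[b, b + 1] = 1.0
        omega[b + 1, b] = -1.0

    # Predicted probability that i beats j.
    z = r[i] - r[j] + C[i] @ omega @ C[j]
    p = 1.0 / (1.0 + np.exp(-z))
    delta = outcome - p  # gradient of the log-likelihood w.r.t. z

    # Gradients w.r.t. the cyclic vectors, computed before either update.
    grad_ci = omega @ C[j]
    grad_cj = omega.T @ C[i]

    r[i] += eta_r * delta
    r[j] -= eta_r * delta
    C[i] += eta_c * delta * grad_ci
    C[j] += eta_c * delta * grad_cj
```

With k = 1 this is the mElo2 column below; when C is all zeros the cyclic term vanishes and the update reduces to plain logistic Elo.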

## Dataset 1

Data: https://drive.google.com/file/d/1I4Q_Mzs43jg2vSHnWWkQA0xGpY7RWolu/view?usp=sharing, exported via the notebook at https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing#scrollTo=cWr1RMsqxWGg, originally from https://lmsys.org/blog/2023-05-03-arena/

83938 matches

| Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko-2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (W-L) | Wins | Draws | Losses | Total played |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4 | 212.7 | 8470.0 | 235.4 | 280.1 | 185.6 | 29.5 | 0.04797 | 0.06507 | 0.03813 | 0.04818 | 64.7 | 24.5 | 10.8 | 4235 | 5080 | 1924 | 845 | 7849 |
| claude-v1 | 171.6 | 5550.0 | 159.5 | 220.7 | 145.3 | 27.9 | 0.03543 | 0.04362 | 0.02918 | 0.02834 | 57.6 | 27.3 | 15.1 | 2775 | 3759 | 1783 | 984 | 6526 |
| claude-instant-v1 | 148.4 | 3452.0 | 177.1 | 185.6 | 120.6 | 28.9 | 0.02475 | 0.0331 | 0.02002 | 0.02158 | 52.8 | 30.1 | 17.1 | 1726 | 2550 | 1453 | 824 | 4827 |
| gpt-3.5-turbo | 128.7 | 5048.0 | 120.9 | 169.7 | 111.3 | 26.9 | 0.02637 | 0.04147 | 0.02152 | 0.02071 | 52.4 | 27.8 | 19.7 | 2524 | 4048 | 2148 | 1524 | 7720 |
| vicuna-13b | 50.0 | 12014.0 | 69.3 | 141.1 | 93.6 | 27.1 | 0.0506 | 0.04165 | 0.02344 | 0.02702 | 48.9 | 29.4 | 21.7 | 6007 | 10809 | 6488 | 4802 | 22099 |
| vicuna-33b | 103.3 | 1132.0 | 114.2 | 109.2 | 65.5 | 27.5 | 0.01692 | 0.01215 | 0.01762 | 0.01223 | 42.8 | 35.5 | 21.7 | 566 | 1151 | 954 | 585 | 2690 |
| mpt-30b-chat | 59.6 | 332.0 | 64.5 | 34.6 | 16.9 | 26.7 | 0.00849 | 0.00165 | 0.00491 | 0.0011 | 34.6 | 37.5 | 27.9 | 166 | 861 | 933 | 695 | 2489 |
| palm-2 | 19.7 | -268.0 | 41.3 | -13.2 | -7.3 | 26.2 | -0.00645 | -0.00121 | -0.00669 | -0.00239 | 32.1 | 33.2 | 34.7 | -134 | 1694 | 1751 | 1828 | 5273 |
| wizardlm-13b | 48.6 | 484.0 | 38.5 | 28.0 | 15.4 | 25.7 | 0.00653 | 0.00765 | -0.00077 | 0.00198 | 31.1 | 43.2 | 25.7 | 242 | 1394 | 1938 | 1152 | 4484 |
| koala-13b | -7.0 | -1972.0 | -15.2 | -40.3 | -25.4 | 25.1 | -0.0048 | -0.00608 | -0.00533 | 0.00111 | 29.7 | 32.9 | 37.4 | -986 | 3773 | 4179 | 4759 | 12711 |
| guanaco-33b | 42.8 | 42.0 | 42.1 | 2.5 | 1.3 | 26.1 | 0.00904 | -0.0053 | 0.00314 | 0.00186 | 29.5 | 41.5 | 29.0 | 21 | 1286 | 1806 | 1265 | 4357 |
| vicuna-7b | 9.3 | -948.0 | -9.4 | -43.2 | -25.6 | 25.0 | -0.01066 | -0.01042 | -0.00613 | -0.0034 | 28.1 | 35.4 | 36.5 | -474 | 1603 | 2016 | 2077 | 5696 |
| chatglm-6b | -62.1 | -2712.0 | -64.6 | -95.8 | -61.3 | 23.9 | -0.01641 | -0.01343 | -0.01629 | -0.01323 | 24.5 | 32.6 | 42.9 | -1356 | 1801 | 2393 | 3157 | 7351 |
| alpaca-13b | -65.1 | -3620.0 | -86.5 | -102.9 | -66.6 | 23.6 | -0.0205 | -0.01541 | -0.01262 | -0.01366 | 24.5 | 31.2 | 44.3 | -1810 | 2238 | 2849 | 4048 | 9135 |
| mpt-7b-chat | -39.9 | -2106.0 | -38.5 | -93.2 | -58.8 | 25.0 | -0.01364 | -0.01249 | -0.01284 | -0.01032 | 23.5 | 35.1 | 41.4 | -1053 | 1377 | 2057 | 2430 | 5864 |
| RWKV-4-Raven-14B | -52.1 | -3080.0 | -83.9 | -110.5 | -71.2 | 23.8 | -0.01601 | -0.0274 | -0.00577 | -0.02157 | 22.0 | 34.7 | 43.3 | -1540 | 1594 | 2508 | 3134 | 7236 |
| oasst-pythia-12b | -75.3 | -4510.0 | -98.8 | -123.4 | -80.5 | 23.4 | -0.02352 | -0.03063 | -0.02149 | -0.02212 | 21.6 | 32.9 | 45.4 | -2255 | 2054 | 3126 | 4309 | 9489 |
| oasst-sft-1-pythia-12b | -130.5 | -30.0 | -24.4 | -203.9 | -4.7 | 22.3 | -0.001 | -0.00109 | -0.00046 | -0.0011 | 19.4 | 19.4 | 61.1 | -15 | 7 | 7 | 22 | 36 |
| gpt4all-13b-snoozy | -23.6 | -1158.0 | -37.4 | -97.1 | -58.3 | 24.5 | -0.01291 | -0.01323 | -0.00027 | -0.00713 | 18.3 | 44.8 | 37.0 | -579 | 565 | 1385 | 1144 | 3094 |
| fastchat-t5-3b | -93.1 | -3736.0 | -99.3 | -155.1 | -100.9 | 23.5 | -0.02403 | -0.03152 | -0.01309 | -0.01849 | 17.6 | 34.9 | 47.5 | -1868 | 1100 | 2184 | 2968 | 6252 |
| stablelm-tuned-alpha-7b | -127.8 | -3486.0 | -152.9 | -174.8 | -113.6 | 22.3 | -0.01882 | -0.02451 | -0.01604 | -0.01602 | 17.0 | 32.3 | 50.7 | -1743 | 881 | 1671 | 2624 | 5176 |
| dolly-v2-12b | -150.3 | -4712.0 | -171.3 | -206.8 | -135.6 | 22.1 | -0.03045 | -0.02367 | -0.01953 | -0.02246 | 15.3 | 29.6 | 55.1 | -2356 | 904 | 1751 | 3260 | 5915 |
| llama-13b | -167.8 | -4186.0 | -180.5 | -225.4 | -147.6 | 22.2 | -0.02691 | -0.02997 | -0.02065 | -0.01221 | 14.3 | 27.9 | 57.8 | -2093 | 691 | 1344 | 2784 | 4819 |
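The Sequential Elo column presumably corresponds to the classic online Elo update applied one match at a time; a minimal sketch, where the zero initial ratings, the K-factor, and the 400-point logistic scale are conventional defaults rather than the repository's confirmed settings:

```python
def sequential_elo(matches, k_factor=32.0, scale=400.0):
    """Online Elo over (model_a, model_b, score_a) triples, where score_a
    is 1.0 for a win by model_a, 0.5 for a draw, and 0.0 for a loss.
    k_factor and scale are conventional defaults, assumed for illustration.
    """
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, 0.0)
        rb = ratings.setdefault(b, 0.0)
        # Expected score for model a from the logistic curve on the gap.
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / scale))
        ratings[a] = ra + k_factor * (score_a - expected_a)
        ratings[b] = rb + k_factor * (expected_a - score_a)
    return ratings
```

For example, `sequential_elo([("gpt-4", "vicuna-13b", 1.0), ("vicuna-13b", "llama-13b", 0.5)])` returns the ratings after those two (hypothetical) matches. Unlike BayesElo, which fits all matches jointly, this estimate depends on match order.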
| Rating method | L2 distance to true winrates on test set |
| --- | --- |
| Sequential Elo | 232.6 |
| BayesElo | 236.4 |
| Glicko | 296.0 |
| Glicko-2 | 203.7 |
| TrueSkill | 694.2 |
| mElo2 | 188.9 |
| mElo4 | 182.2 |
| mElo10 | 186.4 |
| mElo20 | 176.9 |
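The repository does not spell out the distance computation, but a plausible reading (an assumption, for illustration only) is: turn each rating gap into a predicted winrate, compare it with the empirical winrate of each ordered model pair on the held-out test matches, and take the L2 norm of the differences in percentage points:

```python
import math
from collections import defaultdict

def l2_distance(ratings, test_matches, scale=400.0):
    """L2 distance between predicted and empirical pairwise winrates.

    ratings: dict model -> rating; test_matches: (a, b, score_a) triples
    with score_a in {1.0, 0.5, 0.0}. The logistic link and the
    percentage-point scaling are assumptions, not the repo's exact formula.
    """
    score_sum = defaultdict(float)
    n_matches = defaultdict(int)
    for a, b, score_a in test_matches:
        score_sum[(a, b)] += score_a
        n_matches[(a, b)] += 1

    sq_err = 0.0
    for (a, b), n in n_matches.items():
        empirical = 100.0 * score_sum[(a, b)] / n
        predicted = 100.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / scale))
        sq_err += (predicted - empirical) ** 2
    return math.sqrt(sq_err)
```

Lower is better: on this dataset the mElo variants and Glicko-2 come out ahead, while TrueSkill is a clear outlier.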

## Dataset 2

https://huggingface.co/datasets/lmsys/chatbot_arena_conversations

33000 matches

| Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko-2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (W-L) | Wins | Draws | Losses | Total played |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4 | 229.2 | 4336.0 | 216.0 | 293.4 | 192.7 | 29.3 | 0.07884 | 0.06513 | 0.04558 | 0.02122 | 68.2 | 20.2 | 11.6 | 2168 | 2614 | 775 | 446 | 3835 |
| claude-v1 | 185.3 | 3250.0 | 198.7 | 238.0 | 154.9 | 28.9 | 0.0502 | 0.0529 | 0.03434 | 0.02702 | 61.2 | 23.5 | 15.3 | 1625 | 2168 | 833 | 543 | 3544 |
| claude-instant-v1 | 163.3 | 1892.0 | 176.6 | 205.8 | 130.8 | 28.6 | 0.0271 | 0.03909 | 0.01967 | 0.01721 | 57.1 | 25.5 | 17.4 | 946 | 1361 | 609 | 415 | 2385 |
| gpt-3.5-turbo | 132.4 | 2808.0 | 143.9 | 173.1 | 111.6 | 27.9 | 0.05612 | 0.03937 | 0.0239 | 0.01997 | 54.4 | 24.6 | 21.0 | 1404 | 2290 | 1034 | 886 | 4210 |
| vicuna-13b | 54.0 | 1974.0 | 74.4 | 96.2 | 60.5 | 26.8 | 0.03968 | 0.02743 | 0.01627 | 0.01036 | 44.6 | 29.3 | 26.1 | 987 | 2378 | 1559 | 1391 | 5328 |
| guanaco-33b | 69.4 | 212.0 | 77.4 | 58.4 | 21.0 | 26.7 | 0.00535 | 0.00387 | 0.00479 | 0.00669 | 41.6 | 28.1 | 30.3 | 106 | 391 | 264 | 285 | 940 |
| wizardlm-13b | 48.8 | 160.0 | 44.3 | 41.3 | 14.1 | 26.0 | 0.00403 | 0.00529 | -0.00105 | 0.00119 | 38.0 | 32.0 | 30.0 | 80 | 381 | 321 | 301 | 1003 |
| palm-2 | 44.5 | 292.0 | 36.4 | 28.6 | 13.9 | 25.7 | 0.00401 | 0.00575 | 0.00431 | 0.00595 | 37.8 | 30.0 | 32.3 | 146 | 1000 | 794 | 854 | 2648 |
| koala-13b | 0.4 | 20.0 | -2.5 | 1.0 | 0.6 | 25.1 | 0.00633 | 0.00108 | 0.00523 | 0.00045 | 34.3 | 31.6 | 34.1 | 10 | 1730 | 1596 | 1720 | 5046 |
| vicuna-7b | 10.2 | -356.0 | 27.5 | -35.9 | -17.8 | 25.5 | -0.00503 | -0.00859 | -0.00481 | -0.00135 | 30.0 | 33.1 | 36.9 | -178 | 770 | 851 | 948 | 2569 |
| alpaca-13b | -72.4 | -1558.0 | -68.9 | -101.4 | -62.7 | 23.9 | -0.02915 | -0.02281 | -0.01774 | -0.00812 | 25.1 | 30.3 | 44.6 | -779 | 999 | 1210 | 1778 | 3987 |
| RWKV-4-Raven-14B | -55.4 | -1362.0 | -74.9 | -106.8 | -65.3 | 23.4 | -0.02028 | -0.02391 | -0.01237 | -0.0128 | 24.0 | 31.4 | 44.6 | -681 | 794 | 1040 | 1475 | 3309 |
| mpt-7b-chat | -46.9 | -1048.0 | -48.6 | -106.3 | -63.2 | 24.1 | -0.018 | -0.01506 | -0.00843 | -0.00078 | 23.4 | 32.8 | 43.9 | -524 | 598 | 838 | 1122 | 2558 |
| oasst-pythia-12b | -73.4 | -1896.0 | -72.0 | -111.5 | -70.0 | 23.8 | -0.0285 | -0.02978 | -0.01326 | -0.00631 | 23.4 | 31.8 | 44.8 | -948 | 1031 | 1404 | 1979 | 4414 |
| gpt4all-13b-snoozy | -46.0 | -408.0 | -42.8 | -109.2 | -51.8 | 24.1 | -0.00815 | -0.00731 | -0.00504 | -0.01292 | 23.3 | 32.2 | 44.4 | -204 | 226 | 312 | 430 | 968 |
| chatglm-6b | -101.9 | -1678.0 | -126.4 | -146.5 | -91.7 | 22.8 | -0.03316 | -0.02288 | -0.02208 | -0.00877 | 19.3 | 33.3 | 47.5 | -839 | 572 | 988 | 1411 | 2971 |
| fastchat-t5-3b | -89.6 | -1610.0 | -91.8 | -144.7 | -90.3 | 23.6 | -0.02401 | -0.02558 | -0.017 | -0.01196 | 19.0 | 34.1 | 46.9 | -805 | 548 | 985 | 1353 | 2886 |
| stablelm-tuned-alpha-7b | -134.5 | -1740.0 | -150.6 | -179.9 | -113.4 | 22.2 | -0.03479 | -0.02908 | -0.01249 | -0.01204 | 16.8 | 31.7 | 51.5 | -870 | 422 | 795 | 1292 | 2509 |
| dolly-v2-12b | -148.7 | -1848.0 | -143.0 | -193.6 | -122.7 | 22.6 | -0.03678 | -0.02826 | -0.02101 | -0.01789 | 15.9 | 30.9 | 53.2 | -924 | 394 | 764 | 1318 | 2476 |
| llama-13b | -168.7 | -1440.0 | -173.8 | -205.9 | -128.6 | 21.7 | -0.03383 | -0.02666 | -0.01882 | -0.01713 | 15.1 | 30.1 | 54.8 | -720 | 274 | 546 | 994 | 1814 |
| Rating method | L2 distance to true winrates on test set |
| --- | --- |
| Sequential Elo | 211.6 |
| BayesElo | 202.7 |
| Glicko | 249.1 |
| Glicko-2 | 189.7 |
| TrueSkill | 580.7 |
| mElo2 | 190.2 |
| mElo4 | 201.0 |
| mElo10 | 190.2 |
| mElo20 | 199.1 |

## Dataset 3

https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/viewer/default/human

3355 matches

| Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko-2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (W-L) | Wins | Draws | Losses | Total played |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4 | 172.0 | 870.0 | 169.3 | 244.1 | 146.4 | 28.2 | 0.02039 | 0.01436 | 0.00936 | 0.01089 | 62.9 | 21.2 | 15.8 | 435 | 581 | 196 | 146 | 923 |
| claude-v1 | 114.6 | 564.0 | 123.8 | 164.7 | 89.0 | 27.8 | 0.01312 | 0.01015 | 0.00435 | 0.00662 | 54.5 | 22.9 | 22.7 | 282 | 483 | 203 | 201 | 887 |
| gpt-3.5-turbo | 93.8 | 802.0 | 110.4 | 156.1 | 90.4 | 27.3 | 0.01838 | 0.0148 | 0.01374 | 0.0034 | 51.6 | 27.0 | 21.5 | 401 | 687 | 359 | 286 | 1332 |
| vicuna-13b-v1.2 | -14.2 | -82.0 | -18.9 | -21.2 | -6.8 | 24.4 | -0.00274 | -0.0023 | -0.00016 | 0.00345 | 34.7 | 26.5 | 38.8 | -41 | 347 | 265 | 388 | 1000 |
| alpaca-13b | -125.5 | -786.0 | -128.8 | -222.5 | -130.9 | 23.0 | -0.01651 | -0.01207 | -0.01108 | -0.00918 | 16.9 | 23.2 | 59.9 | -393 | 155 | 212 | 548 | 915 |
| llama-13b | -240.7 | -1368.0 | -255.8 | -361.3 | -229.0 | 20.4 | -0.03264 | -0.02495 | -0.0162 | -0.01519 | 6.8 | 16.6 | 76.6 | -684 | 67 | 163 | 751 | 981 |
| Rating method | L2 distance to true winrates on test set |
| --- | --- |
| Sequential Elo | 52.3 |
| BayesElo | 52.4 |
| Glicko | 68.6 |
| Glicko-2 | 49.3 |
| TrueSkill | 133.8 |
| mElo2 | 117.3 |
| mElo4 | 82.3 |
| mElo10 | 51.6 |
| mElo20 | 55.8 |

## Dataset 4

https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/viewer/default/gpt4_pair

2400 matches

| Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko-2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (W-L) | Wins | Draws | Losses | Total played |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4 | 304.0 | 1044.0 | 274.8 | 373.3 | 232.7 | 30.6 | 0.02336 | 0.02335 | 0.0124 | 0.0151 | 77.6 | 16.9 | 5.5 | 522 | 562 | 122 | 40 | 724 |
| claude-v1 | 181.7 | 612.0 | 145.4 | 221.6 | 124.9 | 27.4 | 0.01275 | 0.00966 | 0.01007 | 0.00726 | 56.9 | 29.0 | 14.1 | 306 | 407 | 207 | 101 | 715 |
| gpt-3.5-turbo | 89.7 | 328.0 | 77.3 | 116.8 | 50.0 | 26.2 | 0.00357 | 0.00865 | 0.0037 | 0.00584 | 47.0 | 28.5 | 24.5 | 164 | 342 | 207 | 178 | 727 |
| vicuna-13b-v1.2 | -33.2 | -132.0 | -23.6 | -47.8 | -13.6 | 24.2 | -0.00193 | -0.0032 | -0.00363 | -0.00109 | 29.9 | 30.9 | 39.2 | -66 | 214 | 221 | 280 | 715 |
| alpaca-13b | -218.4 | -766.0 | -179.5 | -277.3 | -164.8 | 21.5 | -0.01563 | -0.01604 | -0.01206 | -0.01057 | 12.3 | 21.8 | 65.9 | -383 | 88 | 156 | 471 | 715 |
| llama-13b | -323.9 | -1086.0 | -294.3 | -388.3 | -243.1 | 18.6 | -0.02212 | -0.02243 | -0.01048 | -0.01654 | 2.6 | 19.8 | 77.6 | -543 | 19 | 143 | 562 | 724 |
| Rating method | L2 distance to true winrates on test set |
| --- | --- |
| Sequential Elo | 42.0 |
| BayesElo | 49.0 |
| Glicko | 60.4 |
| Glicko-2 | 43.5 |
| TrueSkill | 124.2 |
| mElo2 | 92.6 |
| mElo4 | 89.7 |
| mElo10 | 31.7 |
| mElo20 | 31.7 |

## Sum of all datasets

The four datasets are concatenated as-is, without checking for matches that appear in more than one dataset.

122693 matches

| Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko-2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (W-L) | Wins | Draws | Losses | Total played |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4 | 223.2 | 14630.0 | 248.5 | 285.7 | 190.1 | 29.5 | 0.05627 | 0.06043 | 0.04287 | 0.03176 | 66.2 | 22.7 | 11.2 | 7315 | 8798 | 3014 | 1483 | 13295 |
| claude-v1 | 176.4 | 9924.0 | 162.6 | 220.3 | 146.0 | 27.7 | 0.0526 | 0.04219 | 0.04484 | 0.02953 | 58.1 | 26.1 | 15.7 | 4962 | 6800 | 3057 | 1838 | 11695 |
| claude-instant-v1 | 151.0 | 5274.0 | 168.4 | 190.3 | 125.0 | 27.8 | 0.03308 | 0.04236 | 0.02888 | 0.01618 | 54.0 | 28.6 | 17.4 | 2637 | 3886 | 2057 | 1249 | 7192 |
| gpt-3.5-turbo | 131.2 | 9064.0 | 107.3 | 168.8 | 111.7 | 26.2 | 0.0386 | 0.02774 | 0.0241 | 0.02168 | 52.8 | 26.8 | 20.3 | 4532 | 7366 | 3739 | 2834 | 13939 |
| vicuna-13b | 49.0 | 14086.0 | 63.9 | 133.3 | 88.5 | 26.3 | 0.03242 | 0.02728 | 0.02019 | 0.0114 | 48.2 | 29.3 | 22.5 | 7043 | 13216 | 8052 | 6173 | 27441 |
| vicuna-33b | 103.8 | 1136.0 | 121.1 | 110.1 | 66.1 | 27.1 | 0.01052 | 0.01453 | 0.01714 | 0.01347 | 42.7 | 35.7 | 21.5 | 568 | 1144 | 957 | 576 | 2677 |
| palm-2 | 28.0 | 96.0 | 36.1 | 3.1 | 1.8 | 25.3 | -0.0008 | 0.00196 | 0.00539 | -0.00293 | 34.3 | 32.0 | 33.7 | 48 | 2716 | 2528 | 2668 | 7912 |
| mpt-30b-chat | 58.0 | 312.0 | 42.0 | 32.5 | 15.8 | 25.4 | 0.00636 | -0.00004 | 0.00363 | 0.00327 | 34.2 | 37.8 | 28.0 | 156 | 852 | 940 | 696 | 2488 |
| wizardlm-13b | 47.2 | 632.0 | 35.0 | 29.9 | 17.1 | 25.3 | 0.00849 | -0.00046 | 0.00274 | 0.00156 | 32.3 | 41.2 | 26.5 | 316 | 1771 | 2263 | 1455 | 5489 |
| guanaco-33b | 47.8 | 302.0 | 55.7 | 14.8 | 8.2 | 25.7 | 0.00234 | 0.00948 | 0.00224 | 0.00247 | 31.9 | 39.0 | 29.1 | 151 | 1689 | 2063 | 1538 | 5290 |
| vicuna-13b-v1.2 | 26.8 | -244.0 | 22.8 | -37.2 | -16.1 | 25.3 | -0.00097 | 0.00366 | 0.00598 | -0.00205 | 31.7 | 29.3 | 38.9 | -122 | 540 | 499 | 662 | 1701 |
| koala-13b | -7.3 | -1970.0 | -3.4 | -28.8 | -18.4 | 24.7 | -0.00881 | 0.0055 | -0.00023 | -0.00542 | 31.0 | 32.4 | 36.6 | -985 | 5501 | 5743 | 6486 | 17730 |
| vicuna-7b | 10.0 | -1232.0 | -1.6 | -38.6 | -23.6 | 24.7 | -0.00088 | 0.00161 | -0.00932 | -0.00449 | 29.0 | 34.6 | 36.4 | -616 | 2402 | 2864 | 3018 | 8284 |
| alpaca-13b | -74.0 | -6790.0 | -102.3 | -119.1 | -78.4 | 22.5 | -0.02437 | -0.02931 | -0.02321 | -0.01455 | 23.6 | 29.9 | 46.5 | -3395 | 3491 | 4425 | 6886 | 14802 |
| mpt-7b-chat | -43.2 | -3134.0 | -39.3 | -96.4 | -62.1 | 24.1 | -0.02003 | -0.01251 | -0.01253 | -0.00255 | 23.5 | 34.5 | 42.1 | -1567 | 1980 | 2908 | 3547 | 8435 |
| chatglm-6b | -75.2 | -4358.0 | -59.9 | -109.2 | -71.1 | 23.6 | -0.01488 | -0.02358 | -0.0147 | -0.00333 | 23.1 | 32.8 | 44.1 | -2179 | 2392 | 3395 | 4571 | 10358 |
| RWKV-4-Raven-14B | -55.5 | -4454.0 | -75.7 | -109.8 | -71.6 | 23.6 | -0.01199 | -0.02313 | -0.01378 | -0.00526 | 22.6 | 33.7 | 43.7 | -2227 | 2379 | 3546 | 4606 | 10531 |
| oasst-pythia-12b | -77.0 | -6402.0 | -109.4 | -119.8 | -78.8 | 22.4 | -0.02957 | -0.03031 | -0.02215 | -0.01148 | 22.2 | 32.5 | 45.3 | -3201 | 3077 | 4512 | 6278 | 13867 |
| gpt4all-13b-snoozy | -30.4 | -1602.0 | -58.5 | -101.7 | -63.0 | 23.3 | -0.01544 | -0.0096 | -0.0165 | -0.00133 | 19.4 | 41.5 | 39.0 | -801 | 795 | 1697 | 1596 | 4088 |
| oasst-sft-1-pythia-12b | -132.5 | -30.0 | -25.5 | -203.9 | -4.7 | 21.6 | -0.00087 | -0.00053 | -0.00097 | -0.00063 | 19.4 | 19.4 | 61.1 | -15 | 7 | 7 | 22 | 36 |
| fastchat-t5-3b | -94.7 | -5400.0 | -92.3 | -152.8 | -100.3 | 23.1 | -0.02959 | -0.02534 | -0.01233 | -0.01453 | 18.0 | 34.5 | 47.5 | -2700 | 1655 | 3161 | 4355 | 9171 |
| stablelm-tuned-alpha-7b | -132.0 | -5200.0 | -146.6 | -175.6 | -115.2 | 21.8 | -0.02248 | -0.02722 | -0.01989 | -0.01842 | 17.0 | 32.2 | 50.8 | -2600 | 1307 | 2473 | 3907 | 7687 |
| dolly-v2-12b | -152.4 | -6596.0 | -178.0 | -203.4 | -134.2 | 21.4 | -0.02594 | -0.02444 | -0.02394 | -0.02013 | 15.4 | 30.0 | 54.6 | -3298 | 1297 | 2525 | 4595 | 8417 |
| llama-13b | -178.0 | -8044.0 | -170.9 | -250.9 | -166.1 | 21.6 | -0.03404 | -0.03028 | -0.02846 | -0.02421 | 12.6 | 26.4 | 60.9 | -4022 | 1050 | 2201 | 5072 | 8323 |
| Rating method | L2 distance to true winrates on test set |
| --- | --- |
| Sequential Elo | 238.8 |
| BayesElo | 244.4 |
| Glicko | 306.3 |
| Glicko-2 | 210.3 |
| TrueSkill | 724.8 |
| mElo2 | 208.9 |
| mElo4 | 200.6 |
| mElo10 | 188.6 |
| mElo20 | 204.0 |
