Note: the mElo2k learning rates are the same across all datasets and all values of k (see the update sketch below).
Full results: https://drive.google.com/file/d/1I4Q_Mzs43jg2vSHnWWkQA0xGpY7RWolu/view?usp=sharing, produced by https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing#scrollTo=cWr1RMsqxWGg from the data behind https://lmsys.org/blog/2023-05-03-arena/
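For context, mElo2k (Balduzzi et al., 2018) augments each scalar Elo rating with a 2k-dimensional feature vector so that non-transitive (cyclic) matchups can be modelled. A minimal online-update sketch in that spirit; the learning rates are placeholders, not the values used for the tables below, and the notebook's exact implementation may differ:

```python
import numpy as np

def melo_update(r, c, i, j, s, lr_r=1.0, lr_c=0.1):
    """One online mElo2k update for a match between players i and j.

    r: (n,) vector of scalar ratings; c: (n, 2k) matrix of cyclic features.
    s: observed score for player i (1 = win, 0.5 = draw, 0 = loss).
    lr_r, lr_c are illustrative learning rates, not the ones used here.
    """
    k = c.shape[1] // 2
    # Omega: block-diagonal antisymmetric matrix of k blocks [[0, 1], [-1, 0]].
    omega = np.kron(np.eye(k), np.array([[0.0, 1.0], [-1.0, 0.0]]))
    # Elo's transitive term plus a cyclic term that can capture A > B > C > A.
    p = 1.0 / (1.0 + np.exp(-(r[i] - r[j] + c[i] @ omega @ c[j])))
    delta = s - p
    r[i] += lr_r * delta
    r[j] -= lr_r * delta
    # Compute both feature updates from the pre-update values.
    ci_new = c[i] + lr_c * delta * (omega @ c[j])
    cj_new = c[j] - lr_c * delta * (omega.T @ c[i])
    c[i], c[j] = ci_new, cj_new
```

With c held at zero this reduces to a plain online Elo update, which is why the mElo columns below sit on the same kind of scale across all k.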
83938 matches
Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko 2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (wins - losses) | Wins | Draws | Losses | Total played |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gpt-4 | 212.7 | 8470.0 | 235.4 | 280.1 | 185.6 | 29.5 | 0.04797 | 0.06507 | 0.03813 | 0.04818 | 64.7 | 24.5 | 10.8 | 4235 | 5080 | 1924 | 845 | 7849 |
claude-v1 | 171.6 | 5550.0 | 159.5 | 220.7 | 145.3 | 27.9 | 0.03543 | 0.04362 | 0.02918 | 0.02834 | 57.6 | 27.3 | 15.1 | 2775 | 3759 | 1783 | 984 | 6526 |
claude-instant-v1 | 148.4 | 3452.0 | 177.1 | 185.6 | 120.6 | 28.9 | 0.02475 | 0.0331 | 0.02002 | 0.02158 | 52.8 | 30.1 | 17.1 | 1726 | 2550 | 1453 | 824 | 4827 |
gpt-3.5-turbo | 128.7 | 5048.0 | 120.9 | 169.7 | 111.3 | 26.9 | 0.02637 | 0.04147 | 0.02152 | 0.02071 | 52.4 | 27.8 | 19.7 | 2524 | 4048 | 2148 | 1524 | 7720 |
vicuna-13b | 50.0 | 12014.0 | 69.3 | 141.1 | 93.6 | 27.1 | 0.0506 | 0.04165 | 0.02344 | 0.02702 | 48.9 | 29.4 | 21.7 | 6007 | 10809 | 6488 | 4802 | 22099 |
vicuna-33b | 103.3 | 1132.0 | 114.2 | 109.2 | 65.5 | 27.5 | 0.01692 | 0.01215 | 0.01762 | 0.01223 | 42.8 | 35.5 | 21.7 | 566 | 1151 | 954 | 585 | 2690 |
mpt-30b-chat | 59.6 | 332.0 | 64.5 | 34.6 | 16.9 | 26.7 | 0.00849 | 0.00165 | 0.00491 | 0.0011 | 34.6 | 37.5 | 27.9 | 166 | 861 | 933 | 695 | 2489 |
palm-2 | 19.7 | -268.0 | 41.3 | -13.2 | -7.3 | 26.2 | -0.00645 | -0.00121 | -0.00669 | -0.00239 | 32.1 | 33.2 | 34.7 | -134 | 1694 | 1751 | 1828 | 5273 |
wizardlm-13b | 48.6 | 484.0 | 38.5 | 28.0 | 15.4 | 25.7 | 0.00653 | 0.00765 | -0.00077 | 0.00198 | 31.1 | 43.2 | 25.7 | 242 | 1394 | 1938 | 1152 | 4484 |
koala-13b | -7.0 | -1972.0 | -15.2 | -40.3 | -25.4 | 25.1 | -0.0048 | -0.00608 | -0.00533 | 0.00111 | 29.7 | 32.9 | 37.4 | -986 | 3773 | 4179 | 4759 | 12711 |
guanaco-33b | 42.8 | 42.0 | 42.1 | 2.5 | 1.3 | 26.1 | 0.00904 | -0.0053 | 0.00314 | 0.00186 | 29.5 | 41.5 | 29.0 | 21 | 1286 | 1806 | 1265 | 4357 |
vicuna-7b | 9.3 | -948.0 | -9.4 | -43.2 | -25.6 | 25.0 | -0.01066 | -0.01042 | -0.00613 | -0.0034 | 28.1 | 35.4 | 36.5 | -474 | 1603 | 2016 | 2077 | 5696 |
chatglm-6b | -62.1 | -2712.0 | -64.6 | -95.8 | -61.3 | 23.9 | -0.01641 | -0.01343 | -0.01629 | -0.01323 | 24.5 | 32.6 | 42.9 | -1356 | 1801 | 2393 | 3157 | 7351 |
alpaca-13b | -65.1 | -3620.0 | -86.5 | -102.9 | -66.6 | 23.6 | -0.0205 | -0.01541 | -0.01262 | -0.01366 | 24.5 | 31.2 | 44.3 | -1810 | 2238 | 2849 | 4048 | 9135 |
mpt-7b-chat | -39.9 | -2106.0 | -38.5 | -93.2 | -58.8 | 25.0 | -0.01364 | -0.01249 | -0.01284 | -0.01032 | 23.5 | 35.1 | 41.4 | -1053 | 1377 | 2057 | 2430 | 5864 |
RWKV-4-Raven-14B | -52.1 | -3080.0 | -83.9 | -110.5 | -71.2 | 23.8 | -0.01601 | -0.0274 | -0.00577 | -0.02157 | 22.0 | 34.7 | 43.3 | -1540 | 1594 | 2508 | 3134 | 7236 |
oasst-pythia-12b | -75.3 | -4510.0 | -98.8 | -123.4 | -80.5 | 23.4 | -0.02352 | -0.03063 | -0.02149 | -0.02212 | 21.6 | 32.9 | 45.4 | -2255 | 2054 | 3126 | 4309 | 9489 |
oasst-sft-1-pythia-12b | -130.5 | -30.0 | -24.4 | -203.9 | -4.7 | 22.3 | -0.001 | -0.00109 | -0.00046 | -0.0011 | 19.4 | 19.4 | 61.1 | -15 | 7 | 7 | 22 | 36 |
gpt4all-13b-snoozy | -23.6 | -1158.0 | -37.4 | -97.1 | -58.3 | 24.5 | -0.01291 | -0.01323 | -0.00027 | -0.00713 | 18.3 | 44.8 | 37.0 | -579 | 565 | 1385 | 1144 | 3094 |
fastchat-t5-3b | -93.1 | -3736.0 | -99.3 | -155.1 | -100.9 | 23.5 | -0.02403 | -0.03152 | -0.01309 | -0.01849 | 17.6 | 34.9 | 47.5 | -1868 | 1100 | 2184 | 2968 | 6252 |
stablelm-tuned-alpha-7b | -127.8 | -3486.0 | -152.9 | -174.8 | -113.6 | 22.3 | -0.01882 | -0.02451 | -0.01604 | -0.01602 | 17.0 | 32.3 | 50.7 | -1743 | 881 | 1671 | 2624 | 5176 |
dolly-v2-12b | -150.3 | -4712.0 | -171.3 | -206.8 | -135.6 | 22.1 | -0.03045 | -0.02367 | -0.01953 | -0.02246 | 15.3 | 29.6 | 55.1 | -2356 | 904 | 1751 | 3260 | 5915 |
llama-13b | -167.8 | -4186.0 | -180.5 | -225.4 | -147.6 | 22.2 | -0.02691 | -0.02997 | -0.02065 | -0.01221 | 14.3 | 27.9 | 57.8 | -2093 | 691 | 1344 | 2784 | 4819 |
Rating method | L2 distance to true winrates on test set |
---|---|
Sequential Elo | 232.6 |
BayesElo | 236.4 |
Glicko | 296.0 |
Glicko 2 | 203.7 |
TrueSkill | 694.2 |
mElo2 | 188.9 |
mElo4 | 182.2 |
mElo10 | 186.4 |
mElo20 | 176.9 |
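Here and in the summary tables below, each method is scored by the L2 distance between its predicted pairwise winrates and the empirical winrates on a held-out test set. A plausible reconstruction of the metric, where the percentage-point scaling and the half-credit for draws are assumptions (the notebook's exact normalisation may differ):

```python
import numpy as np

def l2_to_true_winrates(pred_winrate, test_matches):
    """L2 distance between predicted and empirical pairwise winrates.

    pred_winrate(a, b) -> predicted probability that model a beats model b
    (e.g. the Elo logistic 1 / (1 + 10 ** ((rb - ra) / 400))).
    test_matches: dict mapping (a, b) pairs to (wins_a, draws, losses_a)
    counted on the held-out test set. Draws count as half a win.
    """
    sq = 0.0
    for (a, b), (w, d, l) in test_matches.items():
        n = w + d + l
        if n == 0:
            continue
        true_rate = (w + 0.5 * d) / n
        # Scale to percentage points (assumption) before accumulating.
        sq += (100.0 * (pred_winrate(a, b) - true_rate)) ** 2
    return np.sqrt(sq)
```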
https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
33000 matches
Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko 2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (wins - losses) | Wins | Draws | Losses | Total played |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gpt-4 | 229.2 | 4336.0 | 216.0 | 293.4 | 192.7 | 29.3 | 0.07884 | 0.06513 | 0.04558 | 0.02122 | 68.2 | 20.2 | 11.6 | 2168 | 2614 | 775 | 446 | 3835 |
claude-v1 | 185.3 | 3250.0 | 198.7 | 238.0 | 154.9 | 28.9 | 0.0502 | 0.0529 | 0.03434 | 0.02702 | 61.2 | 23.5 | 15.3 | 1625 | 2168 | 833 | 543 | 3544 |
claude-instant-v1 | 163.3 | 1892.0 | 176.6 | 205.8 | 130.8 | 28.6 | 0.0271 | 0.03909 | 0.01967 | 0.01721 | 57.1 | 25.5 | 17.4 | 946 | 1361 | 609 | 415 | 2385 |
gpt-3.5-turbo | 132.4 | 2808.0 | 143.9 | 173.1 | 111.6 | 27.9 | 0.05612 | 0.03937 | 0.0239 | 0.01997 | 54.4 | 24.6 | 21.0 | 1404 | 2290 | 1034 | 886 | 4210 |
vicuna-13b | 54.0 | 1974.0 | 74.4 | 96.2 | 60.5 | 26.8 | 0.03968 | 0.02743 | 0.01627 | 0.01036 | 44.6 | 29.3 | 26.1 | 987 | 2378 | 1559 | 1391 | 5328 |
guanaco-33b | 69.4 | 212.0 | 77.4 | 58.4 | 21.0 | 26.7 | 0.00535 | 0.00387 | 0.00479 | 0.00669 | 41.6 | 28.1 | 30.3 | 106 | 391 | 264 | 285 | 940 |
wizardlm-13b | 48.8 | 160.0 | 44.3 | 41.3 | 14.1 | 26.0 | 0.00403 | 0.00529 | -0.00105 | 0.00119 | 38.0 | 32.0 | 30.0 | 80 | 381 | 321 | 301 | 1003 |
palm-2 | 44.5 | 292.0 | 36.4 | 28.6 | 13.9 | 25.7 | 0.00401 | 0.00575 | 0.00431 | 0.00595 | 37.8 | 30.0 | 32.3 | 146 | 1000 | 794 | 854 | 2648 |
koala-13b | 0.4 | 20.0 | -2.5 | 1.0 | 0.6 | 25.1 | 0.00633 | 0.00108 | 0.00523 | 0.00045 | 34.3 | 31.6 | 34.1 | 10 | 1730 | 1596 | 1720 | 5046 |
vicuna-7b | 10.2 | -356.0 | 27.5 | -35.9 | -17.8 | 25.5 | -0.00503 | -0.00859 | -0.00481 | -0.00135 | 30.0 | 33.1 | 36.9 | -178 | 770 | 851 | 948 | 2569 |
alpaca-13b | -72.4 | -1558.0 | -68.9 | -101.4 | -62.7 | 23.9 | -0.02915 | -0.02281 | -0.01774 | -0.00812 | 25.1 | 30.3 | 44.6 | -779 | 999 | 1210 | 1778 | 3987 |
RWKV-4-Raven-14B | -55.4 | -1362.0 | -74.9 | -106.8 | -65.3 | 23.4 | -0.02028 | -0.02391 | -0.01237 | -0.0128 | 24.0 | 31.4 | 44.6 | -681 | 794 | 1040 | 1475 | 3309 |
mpt-7b-chat | -46.9 | -1048.0 | -48.6 | -106.3 | -63.2 | 24.1 | -0.018 | -0.01506 | -0.00843 | -0.00078 | 23.4 | 32.8 | 43.9 | -524 | 598 | 838 | 1122 | 2558 |
oasst-pythia-12b | -73.4 | -1896.0 | -72.0 | -111.5 | -70.0 | 23.8 | -0.0285 | -0.02978 | -0.01326 | -0.00631 | 23.4 | 31.8 | 44.8 | -948 | 1031 | 1404 | 1979 | 4414 |
gpt4all-13b-snoozy | -46.0 | -408.0 | -42.8 | -109.2 | -51.8 | 24.1 | -0.00815 | -0.00731 | -0.00504 | -0.01292 | 23.3 | 32.2 | 44.4 | -204 | 226 | 312 | 430 | 968 |
chatglm-6b | -101.9 | -1678.0 | -126.4 | -146.5 | -91.7 | 22.8 | -0.03316 | -0.02288 | -0.02208 | -0.00877 | 19.3 | 33.3 | 47.5 | -839 | 572 | 988 | 1411 | 2971 |
fastchat-t5-3b | -89.6 | -1610.0 | -91.8 | -144.7 | -90.3 | 23.6 | -0.02401 | -0.02558 | -0.017 | -0.01196 | 19.0 | 34.1 | 46.9 | -805 | 548 | 985 | 1353 | 2886 |
stablelm-tuned-alpha-7b | -134.5 | -1740.0 | -150.6 | -179.9 | -113.4 | 22.2 | -0.03479 | -0.02908 | -0.01249 | -0.01204 | 16.8 | 31.7 | 51.5 | -870 | 422 | 795 | 1292 | 2509 |
dolly-v2-12b | -148.7 | -1848.0 | -143.0 | -193.6 | -122.7 | 22.6 | -0.03678 | -0.02826 | -0.02101 | -0.01789 | 15.9 | 30.9 | 53.2 | -924 | 394 | 764 | 1318 | 2476 |
llama-13b | -168.7 | -1440.0 | -173.8 | -205.9 | -128.6 | 21.7 | -0.03383 | -0.02666 | -0.01882 | -0.01713 | 15.1 | 30.1 | 54.8 | -720 | 274 | 546 | 994 | 1814 |
Rating method | L2 distance to true winrates on test set |
---|---|
Sequential Elo | 211.6 |
BayesElo | 202.7 |
Glicko | 249.1 |
Glicko 2 | 189.7 |
TrueSkill | 580.7 |
mElo2 | 190.2 |
mElo4 | 201.0 |
mElo10 | 190.2 |
mElo20 | 199.1 |
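The TrueSkill values in the tables cluster around 25, which matches the default mu = 25 of the Python `trueskill` package. A minimal sketch of sequential rating with that package; the `draw_probability` setting is an assumption loosely matched to the roughly 30% draw rates above, and the run's actual parameters are not known here:

```python
import trueskill

# draw_probability is an assumption, not the value used for these tables.
env = trueskill.TrueSkill(draw_probability=0.3)
ratings = {}

def rate_match(model_a, model_b, winner):
    """Update two models after one match.

    winner is model_a, model_b, or None for a draw.
    """
    ra = ratings.setdefault(model_a, env.create_rating())
    rb = ratings.setdefault(model_b, env.create_rating())
    if winner == model_b:
        new_b, new_a = env.rate_1vs1(rb, ra)  # rate_1vs1 expects the winner first
    else:
        new_a, new_b = env.rate_1vs1(ra, rb, drawn=(winner is None))
    ratings[model_a], ratings[model_b] = new_a, new_b
```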
https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/viewer/default/human
3355 matches
Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko 2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (wins - losses) | Wins | Draws | Losses | Total played |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gpt-4 | 172.0 | 870.0 | 169.3 | 244.1 | 146.4 | 28.2 | 0.02039 | 0.01436 | 0.00936 | 0.01089 | 62.9 | 21.2 | 15.8 | 435 | 581 | 196 | 146 | 923 |
claude-v1 | 114.6 | 564.0 | 123.8 | 164.7 | 89.0 | 27.8 | 0.01312 | 0.01015 | 0.00435 | 0.00662 | 54.5 | 22.9 | 22.7 | 282 | 483 | 203 | 201 | 887 |
gpt-3.5-turbo | 93.8 | 802.0 | 110.4 | 156.1 | 90.4 | 27.3 | 0.01838 | 0.0148 | 0.01374 | 0.0034 | 51.6 | 27.0 | 21.5 | 401 | 687 | 359 | 286 | 1332 |
vicuna-13b-v1.2 | -14.2 | -82.0 | -18.9 | -21.2 | -6.8 | 24.4 | -0.00274 | -0.0023 | -0.00016 | 0.00345 | 34.7 | 26.5 | 38.8 | -41 | 347 | 265 | 388 | 1000 |
alpaca-13b | -125.5 | -786.0 | -128.8 | -222.5 | -130.9 | 23.0 | -0.01651 | -0.01207 | -0.01108 | -0.00918 | 16.9 | 23.2 | 59.9 | -393 | 155 | 212 | 548 | 915 |
llama-13b | -240.7 | -1368.0 | -255.8 | -361.3 | -229.0 | 20.4 | -0.03264 | -0.02495 | -0.0162 | -0.01519 | 6.8 | 16.6 | 76.6 | -684 | 67 | 163 | 751 | 981 |
Rating method | L2 distance to true winrates on test set |
---|---|
Sequential Elo | 52.3 |
BayesElo | 52.4 |
Glicko | 68.6 |
Glicko 2 | 49.3 |
TrueSkill | 133.8 |
mElo2 | 117.3 |
mElo4 | 82.3 |
mElo10 | 51.6 |
mElo20 | 55.8 |
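To read the rating columns as winrates: on the standard Elo scale, a rating gap maps to an expected score via the logistic below (assuming the BayesElo values here use the conventional 400-point scale). For example, the roughly 413-point BayesElo gap between gpt-4 (172.0) and llama-13b (-240.7) in the table above implies an expected score of about 0.91 for gpt-4:

```python
def elo_expected_score(ra: float, rb: float) -> float:
    """Expected score of a player rated ra against one rated rb on the
    standard Elo scale (draws count as half a win)."""
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

print(elo_expected_score(172.0, -240.7))  # ~0.915
```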
https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/viewer/default/gpt4_pair
2400 matches
Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko 2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (wins - losses) | Wins | Draws | Losses | Total played |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gpt-4 | 304.0 | 1044.0 | 274.8 | 373.3 | 232.7 | 30.6 | 0.02336 | 0.02335 | 0.0124 | 0.0151 | 77.6 | 16.9 | 5.5 | 522 | 562 | 122 | 40 | 724 |
claude-v1 | 181.7 | 612.0 | 145.4 | 221.6 | 124.9 | 27.4 | 0.01275 | 0.00966 | 0.01007 | 0.00726 | 56.9 | 29.0 | 14.1 | 306 | 407 | 207 | 101 | 715 |
gpt-3.5-turbo | 89.7 | 328.0 | 77.3 | 116.8 | 50.0 | 26.2 | 0.00357 | 0.00865 | 0.0037 | 0.00584 | 47.0 | 28.5 | 24.5 | 164 | 342 | 207 | 178 | 727 |
vicuna-13b-v1.2 | -33.2 | -132.0 | -23.6 | -47.8 | -13.6 | 24.2 | -0.00193 | -0.0032 | -0.00363 | -0.00109 | 29.9 | 30.9 | 39.2 | -66 | 214 | 221 | 280 | 715 |
alpaca-13b | -218.4 | -766.0 | -179.5 | -277.3 | -164.8 | 21.5 | -0.01563 | -0.01604 | -0.01206 | -0.01057 | 12.3 | 21.8 | 65.9 | -383 | 88 | 156 | 471 | 715 |
llama-13b | -323.9 | -1086.0 | -294.3 | -388.3 | -243.1 | 18.6 | -0.02212 | -0.02243 | -0.01048 | -0.01654 | 2.6 | 19.8 | 77.6 | -543 | 19 | 143 | 562 | 724 |
Rating method | L2 distance to true winrates on test set |
---|---|
Sequential Elo | 42.0 |
BayesElo | 49.0 |
Glicko | 60.4 |
Glicko 2 | 43.5 |
TrueSkill | 124.2 |
mElo2 | 92.6 |
mElo4 | 89.7 |
mElo10 | 31.7 |
mElo20 | 31.7 |
All four datasets combined, without checking for redundancy: matches that appear in more than one source are counted each time (83938 + 33000 + 3355 + 2400 = 122693; a deduplication sketch follows the final table).
122693 matches
Model name | BayesElo | Elo | Sequential Elo | Glicko | Glicko 2 | TrueSkill | mElo2 | mElo4 | mElo10 | mElo20 | Winrate (%) | Drawrate (%) | Loserate (%) | WDL (wins - losses) | Wins | Draws | Losses | Total played |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gpt-4 | 223.2 | 14630.0 | 248.5 | 285.7 | 190.1 | 29.5 | 0.05627 | 0.06043 | 0.04287 | 0.03176 | 66.2 | 22.7 | 11.2 | 7315 | 8798 | 3014 | 1483 | 13295 |
claude-v1 | 176.4 | 9924.0 | 162.6 | 220.3 | 146.0 | 27.7 | 0.0526 | 0.04219 | 0.04484 | 0.02953 | 58.1 | 26.1 | 15.7 | 4962 | 6800 | 3057 | 1838 | 11695 |
claude-instant-v1 | 151.0 | 5274.0 | 168.4 | 190.3 | 125.0 | 27.8 | 0.03308 | 0.04236 | 0.02888 | 0.01618 | 54.0 | 28.6 | 17.4 | 2637 | 3886 | 2057 | 1249 | 7192 |
gpt-3.5-turbo | 131.2 | 9064.0 | 107.3 | 168.8 | 111.7 | 26.2 | 0.0386 | 0.02774 | 0.0241 | 0.02168 | 52.8 | 26.8 | 20.3 | 4532 | 7366 | 3739 | 2834 | 13939 |
vicuna-13b | 49.0 | 14086.0 | 63.9 | 133.3 | 88.5 | 26.3 | 0.03242 | 0.02728 | 0.02019 | 0.0114 | 48.2 | 29.3 | 22.5 | 7043 | 13216 | 8052 | 6173 | 27441 |
vicuna-33b | 103.8 | 1136.0 | 121.1 | 110.1 | 66.1 | 27.1 | 0.01052 | 0.01453 | 0.01714 | 0.01347 | 42.7 | 35.7 | 21.5 | 568 | 1144 | 957 | 576 | 2677 |
palm-2 | 28.0 | 96.0 | 36.1 | 3.1 | 1.8 | 25.3 | -0.0008 | 0.00196 | 0.00539 | -0.00293 | 34.3 | 32.0 | 33.7 | 48 | 2716 | 2528 | 2668 | 7912 |
mpt-30b-chat | 58.0 | 312.0 | 42.0 | 32.5 | 15.8 | 25.4 | 0.00636 | -4e-05 | 0.00363 | 0.00327 | 34.2 | 37.8 | 28.0 | 156 | 852 | 940 | 696 | 2488 |
wizardlm-13b | 47.2 | 632.0 | 35.0 | 29.9 | 17.1 | 25.3 | 0.00849 | -0.00046 | 0.00274 | 0.00156 | 32.3 | 41.2 | 26.5 | 316 | 1771 | 2263 | 1455 | 5489 |
guanaco-33b | 47.8 | 302.0 | 55.7 | 14.8 | 8.2 | 25.7 | 0.00234 | 0.00948 | 0.00224 | 0.00247 | 31.9 | 39.0 | 29.1 | 151 | 1689 | 2063 | 1538 | 5290 |
vicuna-13b-v1.2 | 26.8 | -244.0 | 22.8 | -37.2 | -16.1 | 25.3 | -0.00097 | 0.00366 | 0.00598 | -0.00205 | 31.7 | 29.3 | 38.9 | -122 | 540 | 499 | 662 | 1701 |
koala-13b | -7.3 | -1970.0 | -3.4 | -28.8 | -18.4 | 24.7 | -0.00881 | 0.0055 | -0.00023 | -0.00542 | 31.0 | 32.4 | 36.6 | -985 | 5501 | 5743 | 6486 | 17730 |
vicuna-7b | 10.0 | -1232.0 | -1.6 | -38.6 | -23.6 | 24.7 | -0.00088 | 0.00161 | -0.00932 | -0.00449 | 29.0 | 34.6 | 36.4 | -616 | 2402 | 2864 | 3018 | 8284 |
alpaca-13b | -74.0 | -6790.0 | -102.3 | -119.1 | -78.4 | 22.5 | -0.02437 | -0.02931 | -0.02321 | -0.01455 | 23.6 | 29.9 | 46.5 | -3395 | 3491 | 4425 | 6886 | 14802 |
mpt-7b-chat | -43.2 | -3134.0 | -39.3 | -96.4 | -62.1 | 24.1 | -0.02003 | -0.01251 | -0.01253 | -0.00255 | 23.5 | 34.5 | 42.1 | -1567 | 1980 | 2908 | 3547 | 8435 |
chatglm-6b | -75.2 | -4358.0 | -59.9 | -109.2 | -71.1 | 23.6 | -0.01488 | -0.02358 | -0.0147 | -0.00333 | 23.1 | 32.8 | 44.1 | -2179 | 2392 | 3395 | 4571 | 10358 |
RWKV-4-Raven-14B | -55.5 | -4454.0 | -75.7 | -109.8 | -71.6 | 23.6 | -0.01199 | -0.02313 | -0.01378 | -0.00526 | 22.6 | 33.7 | 43.7 | -2227 | 2379 | 3546 | 4606 | 10531 |
oasst-pythia-12b | -77.0 | -6402.0 | -109.4 | -119.8 | -78.8 | 22.4 | -0.02957 | -0.03031 | -0.02215 | -0.01148 | 22.2 | 32.5 | 45.3 | -3201 | 3077 | 4512 | 6278 | 13867 |
gpt4all-13b-snoozy | -30.4 | -1602.0 | -58.5 | -101.7 | -63.0 | 23.3 | -0.01544 | -0.0096 | -0.0165 | -0.00133 | 19.4 | 41.5 | 39.0 | -801 | 795 | 1697 | 1596 | 4088 |
oasst-sft-1-pythia-12b | -132.5 | -30.0 | -25.5 | -203.9 | -4.7 | 21.6 | -0.00087 | -0.00053 | -0.00097 | -0.00063 | 19.4 | 19.4 | 61.1 | -15 | 7 | 7 | 22 | 36 |
fastchat-t5-3b | -94.7 | -5400.0 | -92.3 | -152.8 | -100.3 | 23.1 | -0.02959 | -0.02534 | -0.01233 | -0.01453 | 18.0 | 34.5 | 47.5 | -2700 | 1655 | 3161 | 4355 | 9171 |
stablelm-tuned-alpha-7b | -132.0 | -5200.0 | -146.6 | -175.6 | -115.2 | 21.8 | -0.02248 | -0.02722 | -0.01989 | -0.01842 | 17.0 | 32.2 | 50.8 | -2600 | 1307 | 2473 | 3907 | 7687 |
dolly-v2-12b | -152.4 | -6596.0 | -178.0 | -203.4 | -134.2 | 21.4 | -0.02594 | -0.02444 | -0.02394 | -0.02013 | 15.4 | 30.0 | 54.6 | -3298 | 1297 | 2525 | 4595 | 8417 |
llama-13b | -178.0 | -8044.0 | -170.9 | -250.9 | -166.1 | 21.6 | -0.03404 | -0.03028 | -0.02846 | -0.02421 | 12.6 | 26.4 | 60.9 | -4022 | 1050 | 2201 | 5072 | 8323 |
Rating method | L2 distance to true winrates on test set |
---|---|
Sequential Elo | 238.8 |
BayesElo | 244.4 |
Glicko | 306.3 |
Glicko 2 | 210.3 |
TrueSkill | 724.8 |
mElo2 | 208.9 |
mElo4 | 200.6 |
mElo10 | 188.6 |
mElo20 | 204.0 |
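As flagged above, the combined run simply concatenates the four sources. A sketch of how one might instead drop matches that appear in more than one dataset; the key columns are an assumption (they follow the lmsys chatbot_arena_conversations schema and may differ for the other sources):

```python
import pandas as pd

def combine_deduplicated(frames, key=("question_id", "model_a", "model_b")):
    """Concatenate match datasets, keeping each match once even if it
    appears in several sources. The key columns are assumed; adjust them
    to whatever uniquely identifies a battle in each dataset."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset=list(key))
```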