Had the same question. Was able to pry this out of ChatGPT:
**Why Multiple Samples?**
When a model generates a single response, that output is subject to the randomness introduced by sampling methods like temperature and top-p. Because of this randomness, a single response may not reliably reflect the model's true performance on a task. By generating many responses (e.g., 64) per query, researchers average over that variability and obtain a much lower-variance estimate of the probability that any single sampled response is correct, which is exactly what pass@1 measures.
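A rough sketch of why averaging helps (my own notation, not from the paper): if a task's true per-sample success rate is $p$, a single response gives a 0-or-1 estimate with variance $p(1-p)$, while averaging $n$ independent samples, $c$ of which are correct, gives

$$
\widehat{\text{pass@1}} = \frac{c}{n},
\qquad
\operatorname{Var}\!\left(\frac{c}{n}\right) = \frac{p(1-p)}{n},
$$

so with $n = 64$ the per-task noise in the estimate shrinks by a factor of 64.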
**Supporting Research**
The paper "Evaluating Large Language Models Trained on Code" arXiv
discusses this approach in detail. The authors explain that to evaluate pass@k, they generate multiple samples per task, count the number of correct samples, and calculate an unbiased estimator for pass@k. This method helps in providing a more reliable measure of the model's performance.
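For concreteness, here is a minimal Python sketch of that unbiased estimator, following the formula $1 - \binom{n-c}{k}\big/\binom{n}{k}$ described in the paper (function and variable names are my own):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k.

    n: total number of samples generated for the task (e.g., 64)
    c: number of those samples that passed the tests
    k: the k in pass@k
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any set of k
        # samples must contain at least one correct one.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a running product to
    # avoid overflow in the binomial coefficients.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 64 samples per query, 20 of them correct.
print(pass_at_k(64, 20, 1))   # 0.3125 (= 20/64 for k = 1)
print(pass_at_k(64, 20, 10))  # much higher: 10 tries will likely include a correct one
```

Note that for k = 1 the estimator reduces to simply c/n, the fraction of correct samples, which is why papers typically generate many samples per query and report the average pass rate as pass@1.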
In general, a model generates just one response per query when estimating pass@1, so why generate 64 responses per query to estimate pass@1?