In eval, why generate 64 responses per query to estimate pass@1? #22

Open
Hunter-P opened this issue Jan 21, 2025 · 2 comments

Comments

@Hunter-P

In general, a model generates just one response per query to estimate pass@1, so why generate 64 responses per query to estimate it here?

@kaiyliu

kaiyliu commented Jan 21, 2025

Where does it say to generate 64 responses per query to estimate pass@1?

@bdytx5

bdytx5 commented Jan 21, 2025

Had the same question. Was able to pry this out of ChatGPT:

Why Multiple Samples?

When a model generates a single response, that output is influenced by the inherent randomness of sampling methods such as temperature and top-p. Because of this randomness, a single response might not reliably represent the model's true performance. By generating multiple responses (e.g., 64) for each query, researchers can average over this variability and more accurately estimate the probability that a single sampled response is correct, which is what pass@1 measures.
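
As a rough illustration (a minimal sketch with a made-up per-sample success probability, not numbers from any real model), averaging over 64 samples gives a far less noisy pass@1 estimate than judging from a single sample:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3   # hypothetical probability that one sampled response is correct
n = 64         # responses generated per query

# A single sample yields a 0-or-1 estimate of pass@1, so it is extremely noisy.
one_sample_estimate = float(rng.random() < p_true)

# Averaging over 64 samples yields the fraction correct, with much lower variance.
many_sample_estimate = float(np.mean(rng.random(n) < p_true))

print(one_sample_estimate, many_sample_estimate)  # e.g. 0.0 vs. a value near 0.3
```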

Supporting Research

The paper "Evaluating Large Language Models Trained on Code" (available on arXiv) discusses this approach in detail. The authors explain that to evaluate pass@k, they generate multiple samples per task, count the number of correct samples, and compute an unbiased estimator for pass@k. This method provides a more reliable measure of the model's performance.
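
For reference, the unbiased estimator described in that paper can be sketched like this (a transcription of the formula 1 - C(n - c, k) / C(n, k), not code taken from this repo):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples per task, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k) in a numerically stable product form.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 64 samples for one query, 20 of them correct, estimating pass@1
print(pass_at_k(64, 20, 1))  # 0.3125, i.e. 20/64
```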
