Question about Reproducing FactScore for Llama-3-8b-Instruct #1

LuckyyySTA · 2024-10-23T02:45:23Z

Hi, thanks for your inspiring work!

I would like to confirm whether the Llama-3-8b-Chat in Table 1 refers to the original "meta-llama/Meta-Llama-3-8B-Instruct" model. When I tried to reproduce the results in Table 1, I got a FactScore of 35.53 for Llama-3-8b-Instruct and 37.62 for Llama-3-8b-Instruct + FactAlign. Both are far below the values reported in Table 1 (Llama-3-8b-Chat = 54.96, Llama-3-8b-Chat + FactAlign = 62.84).

To compute FactScore, I used the evaluation script provided by [1]. To save costs, I evaluated with the "retrieval+llama+npm" model. Although this differs from your "retrieval+ChatGPT" approach, the FactScore authors' results suggest the difference between the two should not be large. I therefore suspect the decoding parameters. I used the default sampling decoding with temperature=1.0. What decoding strategy did you use? What other factors do you think could cause this discrepancy?
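To illustrate why the decoding temperature could matter here: sampling draws each token from a softmax over the logits divided by the temperature, so a lower temperature concentrates probability on the most likely token, and greedy decoding is the limit as the temperature approaches 0. The sketch below uses made-up logits and a hypothetical helper named `softmax_with_temperature`; it is not taken from either repository.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits (illustrative values only).
example_logits = [2.0, 1.0, 0.5, 0.0]

for t in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}: top-token probability = {max(probs):.3f}")
```

At temperature=1.0 a noticeable share of the probability mass stays on lower-ranked tokens, which is one plausible way for sampled long-form generations to differ in factuality from low-temperature or greedy ones.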

Thank you!

[1] https://github.com/shmsw25/FActScore

chaoweihuang · 2024-11-05T07:06:07Z

Hi,

Thank you for your interest in our work!
I suspect that a few discrepancies might have caused this difference.

- As you stated, we used GPT-3.5-turbo as the backbone for FactScore. The original FactScore used InstructGPT, which might behave quite differently from GPT-3.5.
- We left most decoding parameters at their defaults.
- We used an unofficial fork of FactScore that supports GPT-3.5 (https://github.com/wj210/factscore).
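For context on why the choice of judge matters: FactScore is, roughly, the fraction of atomic facts the evaluator marks as supported, averaged over responses, so a different backbone can change the score even for identical generations. The sketch below is a minimal illustration, not the code from either FactScore repository. The function name `factscore`, the example verdicts, and the length penalty with `gamma=10` (assumed here to mirror the reference implementation's default) are all assumptions.

```python
import math


def factscore(verdicts_per_response, gamma=10):
    """Average supported-fact precision over responses.

    verdicts_per_response: one list of booleans per response, where each
    boolean says whether the judge marked that atomic fact as supported.
    gamma: assumed length-penalty threshold; responses with fewer than
    gamma atomic facts are scaled by exp(1 - gamma / num_facts).
    """
    scores = []
    for verdicts in verdicts_per_response:
        if not verdicts:
            continue  # skip responses that produced no atomic facts
        precision = sum(verdicts) / len(verdicts)
        if gamma and len(verdicts) < gamma:
            precision *= math.exp(1 - gamma / len(verdicts))
        scores.append(precision)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical verdicts from two judges on the same two responses
# (12 atomic facts each); the numbers are illustrative only.
strict_verdicts = [[True] * 4 + [False] * 8, [True] * 5 + [False] * 7]
lenient_verdicts = [[True] * 7 + [False] * 5, [True] * 8 + [False] * 4]

print("strict judge: ", factscore(strict_verdicts))
print("lenient judge:", factscore(lenient_verdicts))
```

Because the per-fact verdicts come from the judge, scores computed with different evaluators (or different forks of the evaluation code) are not directly comparable, even when the generations are the same.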

We are rerunning the evaluation to check for bugs or discrepancies that might explain the difference. I'll share an update here.

Chao-Wei

LuckyyySTA · 2024-11-05T16:18:00Z

Thank you for your response!

Additionally, I have a question about the LongFact evaluation. I noticed a generate.sh script in the long-form-factuality directory of your repository (https://github.com/MiuLab/FactAlign/blob/main/long-form-factuality/generate.sh). Could you please confirm whether this is the script you used for benchmarking on LongFact? The prompts in your script seem to differ from those in the official evaluation script (https://github.com/google-deepmind/long-form-factuality/blob/main/main/pipeline.py), specifically at https://github.com/google-deepmind/long-form-factuality/blob/main/main/methods.py#L24. I am unsure which version to follow and would appreciate your guidance.

I am looking forward to your reply!
