Feat: Save and load contexts and answers for eval #462
Conversation
Walkthrough

This pull request enhances the evaluation process for QA datasets by introducing context and answer caching mechanisms. The modifications primarily affect `evals/eval_on_hotpot.py`, with supporting changes in `evals/promptfoo_metrics.py`, `evals/qa_eval_parameters.json`, and `evals/run_qa_eval.py`.
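The caching pattern examined in the review comments below keys each cached result by the instance's `_id` and persists the cache as a JSON file. A minimal standalone sketch of that pattern, with a synchronous `compute` callable (`get_with_cache` is an illustrative name, not a helper from the PR):

```python
import json
import os

def get_with_cache(instance, compute, cache_filename):
    """Return the cached result for this instance, computing and persisting it on a miss."""
    cache = {}
    if os.path.exists(cache_filename):
        with open(cache_filename, "r") as file:
            cache = json.load(file)

    if instance["_id"] not in cache:
        cache[instance["_id"]] = compute(instance)
        # persist immediately so an interrupted run keeps completed work
        with open(cache_filename, "w") as file:
            json.dump(cache, file)

    return cache[instance["_id"]]
```

The PR's versions are async, and as the review notes below, this read-modify-write cycle still needs file locking to be safe under concurrency.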
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)

- `evals/eval_on_hotpot.py` (4 hunks)
- `evals/promptfoo_metrics.py` (3 hunks)
- `evals/qa_eval_parameters.json` (1 hunk)
- `evals/run_qa_eval.py` (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
`evals/run_qa_eval.py` (1)

Line 25: LGTM! The `out_path` parameter is correctly propagated.

The addition of the `out_path` parameter to both evaluation functions enables proper saving of contexts and answers, aligning with the PR's objective.

Also applies to: line 30.
`evals/eval_on_hotpot.py` (2)

Lines 13-19: LGTM! Good practice setting a random seed for reproducibility.

The added imports and the fixed random seed ensure reproducible results across runs.

Lines 113-133: LGTM! Good improvements in sampling and file handling.

- Using `random.sample` is better than slicing for getting a random subset.
- File paths are properly constructed using `Path`.
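Since the approved hunk is not reproduced in the review, here is a minimal sketch of the two patterns being praised; the variable names are illustrative, not the PR's:

```python
import random
from pathlib import Path

random.seed(42)  # fixed seed so the sampled subset is reproducible across runs

instances = [{"_id": str(i)} for i in range(100)]  # stand-in for the loaded QA instances
num_samples = 10

# random.sample draws a uniform random subset without replacement;
# slicing (instances[:num_samples]) would always pick the same leading items
sampled = random.sample(instances, num_samples)

out_path = Path("out")
out_path.mkdir(parents=True, exist_ok=True)
contexts_filename = out_path / "contexts.json"  # Path division instead of string concatenation
```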
Actionable comments posted: 1
♻️ Duplicate comments (2)
`evals/eval_on_hotpot.py` (2)

Lines 21-35: ⚠️ Potential issue: fix race conditions and optimize file operations.

The implementation has two issues:

- File operations are not thread-safe, which could lead to race conditions.
- The file is written even when using a preloaded context, which is inefficient.

A previous review already suggested using `filelock`. Additionally, optimize the file operations:

```diff
+from filelock import FileLock
+
 async def answer_qa_instance(instance, context_provider, contexts_filename):
+    lock = FileLock(f"{contexts_filename}.lock")
     preloaded_contexts = {}
-    if os.path.exists(contexts_filename):
-        with open(contexts_filename, "r") as file:
-            preloaded_contexts = json.load(file)
+    with lock:
+        if os.path.exists(contexts_filename):
+            with open(contexts_filename, "r") as file:
+                preloaded_contexts = json.load(file)

     if instance["_id"] in preloaded_contexts:
         context = preloaded_contexts[instance["_id"]]
     else:
         context = await context_provider(instance)
         preloaded_contexts[instance["_id"]] = context
+        with lock:
+            with open(contexts_filename, "w") as file:
+                json.dump(preloaded_contexts, file)

-    with open(contexts_filename, "w") as file:
-        json.dump(preloaded_contexts, file)
```
Lines 69-89: ⚠️ Potential issue: fix race conditions in answer caching.

The implementation has the same thread-safety issues as `answer_qa_instance`.

A previous review already suggested using `filelock`. Apply similar optimizations:

```diff
+from filelock import FileLock
+
 async def deepeval_on_instances(
     instances, context_provider, eval_metrics, answers_filename, contexts_filename
 ):
+    lock = FileLock(f"{answers_filename}.lock")
     preloaded_answers = {}
-    if os.path.exists(answers_filename):
-        with open(answers_filename, "r") as file:
-            preloaded_answers = json.load(file)
+    with lock:
+        if os.path.exists(answers_filename):
+            with open(answers_filename, "r") as file:
+                preloaded_answers = json.load(file)

     answers = []
     for instance in tqdm(instances, desc="Getting answers"):
         if instance["_id"] in preloaded_answers:
             answer = preloaded_answers[instance["_id"]]
         else:
             answer = await answer_qa_instance(instance, context_provider, contexts_filename)
             preloaded_answers[instance["_id"]] = answer
+            with lock:
+                with open(answers_filename, "w") as file:
+                    json.dump(preloaded_answers, file)
         answers.append(answer)

-    with open(answers_filename, "w") as file:
-        json.dump(preloaded_answers, file)
```
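For reference, a self-contained illustration of the `filelock` pattern both suggestions rely on. `filelock` is a third-party package (`pip install filelock`); the file name and payload here are illustrative:

```python
import json
import os

from filelock import FileLock

cache_filename = "answers.json"
lock = FileLock(f"{cache_filename}.lock")

# Holding the lock across the whole read-modify-write cycle serializes
# concurrent workers, so no worker can overwrite another's new entries.
with lock:
    cache = {}
    if os.path.exists(cache_filename):
        with open(cache_filename, "r") as file:
            cache = json.load(file)
    cache["some_id"] = "some answer"
    with open(cache_filename, "w") as file:
        json.dump(cache, file)
```

Note that this sketch holds the lock over the entire cycle; acquiring it separately for the load and the dump, as in the diffs above, still leaves a window in which two workers read the same state and one update is lost.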
🧹 Nitpick comments (1)

`evals/qa_context_provider_utils.py` (1)

Lines 42-48: Refactor the description logic to improve readability and reduce duplication.

The nested conditional expressions make the code hard to read, and the logic is duplicated between the node1 and node2 descriptions.

Consider extracting the logic into a helper function:

```diff
+def _get_node_description(node: dict) -> str:
+    return node.get("description") or node.get("text") or "N/A"
+
 def _insight_to_string(triplet: tuple) -> str:
     # ... previous code ...
-    node1_description = (
-        node1["description"]
-        if "description" in node1
-        else node1["text"]
-        if "text" in node1
-        else "N/A"
-    )
+    node1_description = _get_node_description(node1)
     node1_string = f"name: {node1_name}, description: {node1_description}"
     node2_name = node2["name"] if "name" in node2 else "N/A"
-    node2_description = (
-        node2["description"]
-        if "description" in node2
-        else node2["text"]
-        if "text" in node2
-        else "N/A"
-    )
+    node2_description = _get_node_description(node2)
```

Also applies to: lines 51-57.
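One subtlety of the suggested helper worth noting: `dict.get(...) or ...` falls through on any falsy value, not just a missing key, so an empty `description` string would now fall back to `text`, whereas the original `in` checks return the empty string. A quick demonstration:

```python
def _get_node_description(node: dict) -> str:
    return node.get("description") or node.get("text") or "N/A"

print(_get_node_description({"description": "a city node"}))    # a city node
print(_get_node_description({"text": "raw chunk text"}))        # raw chunk text
print(_get_node_description({"description": "", "text": "t"}))  # t (empty string is falsy)
print(_get_node_description({}))                                # N/A
```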
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- `evals/eval_on_hotpot.py` (4 hunks)
- `evals/qa_context_provider_utils.py` (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms (19)
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: profiler
- GitHub Check: docker-compose-test
🔇 Additional comments (1)
`evals/eval_on_hotpot.py` (1)

Line 142: LGTM!

The changes correctly propagate the `out_path` parameter to `eval_on_QA_dataset`.

Also applies to: line 149.
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin