
Commit

Merge pull request langchain-ai#12 from langchain-ai/wfh/change_to_cot_qa

Better recommendations
hinthornw authored Aug 16, 2023
2 parents dc00382 + 09dde3b commit 0f92913
Showing 1 changed file with 2 additions and 3 deletions.
5 changes: 2 additions & 3 deletions docs/evaluation/evaluator-implementations.mdx
@@ -28,10 +28,10 @@ Many of the LLM-based evaluators return a binary score for a given data point, s

QA evaluators help to measure the correctness of a response to a user query or question. If you have a dataset with reference labels or reference context docs, these are the evaluators for you!

Three QA evaluators you can load are: `"qa"`, `"context_qa"`, and `"cot_qa"`:
Three QA evaluators you can load are: `"context_qa"`, `"qa"`, `"cot_qa"`. Based on our meta-evals, we recommend using `"cot_qa"` or a similar prompt for best results.

- The `"qa"` evaluator ([reference](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.QAEvalChain.html#langchain-evaluation-qa-eval-chain-qaevalchain)) instructs an LLMChain to directly grade a response as "correct" or "incorrect" based on the reference answer.
- The `"context_qa"` evaluator ([reference](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.ContextQAEvalChain.html#langchain.evaluation.qa.eval_chain.ContextQAEvalChain)) instructs the LLM chain to use reference "context" (provided throught the example outputs) in determining correctness. This is useful if you have a larger corpus of grounding docs but don't have ground truth answers to a query.
- The `"qa"` evaluator ([reference](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.QAEvalChain.html#langchain-evaluation-qa-eval-chain-qaevalchain)) instructs an LLMChain to directly grade a response as "correct" or "incorrect" based on the reference answer.
- The `"cot_qa"` evaluator ([reference](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.CotQAEvalChain.html#langchain.evaluation.qa.eval_chain.CotQAEvalChain)) is similar to the "context_qa" evaluator, expect it instructs the LLMChain to use chain of thought "reasoning" before determining a final verdict. This tends to lead to responses that better correlate with human labels, for a slightly higher token and runtime cost.

<CodeTabs
@@ -128,7 +128,6 @@ Your dataset may have ground truth labels or contextual information demonstratin
from langchain.smith import RunEvalConfig, run_on_dataset\n
evaluation_config = RunEvalConfig(
    evaluators=[
-       RunEvalConfig.LabeledCriteria("correctness"),
        # You can define an arbitrary criterion as a key: value pair in the criteria dict
        RunEvalConfig.LabeledCriteria(
            {
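
The diff view truncates the snippet above. Purely as an illustration of the shape a custom labeled criterion can take, it might continue along these lines; the "helpfulness" key and its wording are assumptions, not taken from the file:

```python
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    evaluators=[
        # An arbitrary criterion is a key: value pair in the criteria dict;
        # "Labeled" criteria also see the dataset's reference answer when grading.
        RunEvalConfig.LabeledCriteria(
            {
                "helpfulness": (  # hypothetical criterion name and prompt
                    "Is this submission helpful to the user,"
                    " taking into account the correct reference answer?"
                )
            }
        ),
    ]
)
```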
