Commit
cleanup and use average stats
Signed-off-by: Jack Luar <[email protected]>
luarss committed Nov 10, 2024
1 parent 61d54ca commit b3f05ef
Showing 2 changed files with 6 additions and 4 deletions.
8 changes: 6 additions & 2 deletions evaluation/auto_evaluation/dataset/preprocess.py
@@ -48,8 +48,12 @@ def read_deepeval_cache():
                 metric["metric_data"]["success"]
             )

-    print("Metric Scores: ", metric_scores)
-    print("Metric Passes: ", metric_passes)
+    print("Average Metric Scores: ")
+    for key, value in metric_scores.items():
+        print(key, sum(value) / len(value))
+    print("Metric Passrates: ")
+    for key, value in metric_passes.items():
+        print(key, value.count(True) / len(value))


if __name__ == "__main__":
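The aggregation added to `preprocess.py` can be exercised in isolation. The sketch below is a minimal stand-in: the dict shapes (metric name mapped to a list of per-test-case scores, and to a list of pass/fail booleans) are assumptions inferred from the diff, and the sample values are made up, not taken from the actual run:

```python
# Illustrative stand-ins for the accumulators built by read_deepeval_cache():
# metric name -> list of per-test-case scores, and -> list of pass booleans.
# These values are made up for demonstration only.
metric_scores = {
    "Contextual Precision": [1.0, 0.5, 0.75],
    "Hallucination": [0.4, 0.6],
}
metric_passes = {
    "Contextual Precision": [True, False, True],
    "Hallucination": [True, True],
}

print("Average Metric Scores: ")
for key, value in metric_scores.items():
    # Arithmetic mean of the scores across test cases.
    print(key, sum(value) / len(value))

print("Metric Passrates: ")
for key, value in metric_passes.items():
    # Fraction of test cases that passed the metric's threshold.
    print(key, value.count(True) / len(value))
```

This replaces the previous behavior of dumping the raw lists, which was hard to read for 100 test cases; a single mean and passrate per metric matches the summary printed at the end of the CI log below.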
2 changes: 0 additions & 2 deletions evaluation/auto_evaluation/src/models/vertex_ai.py
@@ -6,8 +6,6 @@
 import instructor

 from typing import Any
-
-# from langchain_google_vertexai import ChatVertexAI, HarmBlockThreshold, HarmCategory
 from vertexai.generative_models import GenerativeModel, HarmBlockThreshold, HarmCategory  # type: ignore
 from deepeval.models.base_model import DeepEvalBaseLLM
 from pydantic import BaseModel

1 comment on commit b3f05ef

@luarss luarss commented on b3f05ef Nov 10, 2024


===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 1.5.0 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(

Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 6.97it/s]

Evaluating: 100%|██████████| 100/100 [18:52<00:00, 11.33s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...

Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:33, 2.96test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7399523809523809
Contextual Recall 0.8846666666666667
Hallucination 0.4535991851285969
Metric Passrates:
Contextual Precision 0.73
Contextual Recall 0.83
Hallucination 0.64
