{"id":317,"date":"2026-01-26T04:40:11","date_gmt":"2026-01-25T20:40:11","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=317"},"modified":"2026-01-26T04:40:11","modified_gmt":"2026-01-25T20:40:11","slug":"a-coding-implementation-to-automating-llm-quality-assurance-with-deepeval-custom-retrievers-and-llm-as-a-judge-metrics","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=317","title":{"rendered":"A Coding Implementation to Automating LLM Quality Assurance with DeepEval, Custom Retrievers, and LLM-as-a-Judge Metrics"},"content":{"rendered":"<p>We initiate this tutorial by configuring a high-performance evaluation environment, specifically focused on integrating the <a href=\"https:\/\/github.com\/confident-ai\/deepeval\"><strong>DeepEval<\/strong><\/a> framework to bring unit-testing rigor to our LLM applications. By bridging the gap between raw retrieval and final generation, we implement a system that treats model outputs as testable code and uses LLM-as-a-judge metrics to quantify performance. We move beyond manual inspection by building a structured pipeline in which every query, retrieved context, and generated response is validated against rigorous academic-standard metrics. 
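Before reaching for a full framework, the idea of treating a model answer as a unit-testable artifact can be previewed with a toy scorer. The sketch below is a crude lexical stand-in, not an LLM judge: it only counts token overlap between query and answer, and `toy_relevancy` is an illustrative name rather than anything from DeepEval.

```python
import re

def toy_relevancy(query: str, answer: str) -> float:
    # Crude stand-in for an LLM-as-a-judge relevancy metric:
    # fraction of the query's content words that appear in the answer.
    q = {w.lower() for w in re.findall(r"[a-zA-Z]{4,}", query)}
    a = {w.lower() for w in re.findall(r"[a-zA-Z]+", answer)}
    return len(q & a) / len(q) if q else 0.0

# Treat one (query, answer) pair like a unit test with a pass threshold.
score = toy_relevancy(
    "What does faithfulness measure in RAG?",
    "Faithfulness checks whether the answer is supported by retrieved context.",
)
assert score >= 0.2, f"relevancy too low: {score:.2f}"
```

A real LLM-as-a-judge metric replaces the overlap count with a model-scored rubric, but the test shape (score, threshold, assertion) stays the same.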
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/rag_deepeval_quality_benchmarking_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import sys, os, textwrap, json, math, re\nfrom getpass import getpass\n\n\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/1f527.png\" alt=\"\ud83d\udd27\" class=\"wp-smiley\" \/> Hardening environment (prevents common Colab\/py3.12 numpy corruption)...\")\n\n\n!pip -q uninstall -y numpy || true\n!pip -q install --no-cache-dir --force-reinstall \"numpy==1.26.4\"\n\n\n!pip -q install -U deepeval openai scikit-learn pandas tqdm\n\n\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Packages installed.\")\n\n\n\n\nimport numpy as np\nimport pandas as pd\nfrom tqdm.auto import tqdm\n\n\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.metrics.pairwise import cosine_similarity\n\n\nfrom deepeval import evaluate\nfrom deepeval.test_case import LLMTestCase, LLMTestCaseParams\nfrom deepeval.metrics import (\n   AnswerRelevancyMetric,\n   FaithfulnessMetric,\n   ContextualRelevancyMetric,\n   ContextualPrecisionMetric,\n   
ContextualRecallMetric,\n   GEval,\n)\n\n\nprint(\"\u2705 Imports loaded successfully.\")\n\n\n\n\nOPENAI_API_KEY = getpass(\"\ud83d\udd11 Enter OPENAI_API_KEY (leave empty to run without OpenAI): \").strip()\nopenai_enabled = bool(OPENAI_API_KEY)\n\n\nif openai_enabled:\n   os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\nprint(f\"\ud83d\udd0c OpenAI enabled: {openai_enabled}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We initialize our environment by stabilizing core dependencies and installing the deepeval framework to ensure a robust testing pipeline. Next, we import specialized metrics like Faithfulness and Contextual Recall while configuring our API credentials to enable automated, high-fidelity evaluation of our LLM responses. 
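The optional-credential pattern above (read a key, export it only when present, branch on a boolean flag) can be factored into a small helper. This is a sketch; `configure_openai` is an illustrative name, not part of DeepEval's or OpenAI's API.

```python
import os

def configure_openai(key: str) -> bool:
    # Set OPENAI_API_KEY only when a non-empty key was supplied,
    # and return a flag the rest of the pipeline can branch on.
    key = (key or "").strip()
    if key:
        os.environ["OPENAI_API_KEY"] = key
    return bool(key)

openai_enabled = configure_openai("")  # empty key -> offline mode
```

Keeping the flag separate from the env var makes the "no key, run the offline baseline" branch explicit later in the notebook.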
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/rag_deepeval_quality_benchmarking_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">DOCS = [\n   {\n       \"id\": \"doc_01\",\n       \"title\": \"DeepEval Overview\",\n       \"text\": (\n           \"DeepEval is an open-source LLM evaluation framework for unit testing LLM apps. \"\n           \"It supports LLM-as-a-judge metrics, custom metrics like G-Eval, and RAG metrics \"\n           \"such as contextual precision and faithfulness.\"\n       ),\n   },\n   {\n       \"id\": \"doc_02\",\n       \"title\": \"RAG Evaluation: Why Faithfulness Matters\",\n       \"text\": (\n           \"Faithfulness checks whether the answer is supported by retrieved context. \"\n           \"In RAG, hallucinations occur when the model states claims not grounded in context.\"\n       ),\n   },\n   {\n       \"id\": \"doc_03\",\n       \"title\": \"Contextual Precision\",\n       \"text\": (\n           \"Contextual precision evaluates how well retrieved chunks are ranked by relevance \"\n           \"to a query. 
High precision means relevant chunks appear earlier in the ranked list.\"\n       ),\n   },\n   {\n       \"id\": \"doc_04\",\n       \"title\": \"Contextual Recall\",\n       \"text\": (\n           \"Contextual recall measures whether the retriever returns enough relevant context \"\n           \"to answer the query. Low recall means key information was missed in retrieval.\"\n       ),\n   },\n   {\n       \"id\": \"doc_05\",\n       \"title\": \"Answer Relevancy\",\n       \"text\": (\n           \"Answer relevancy measures whether the generated answer addresses the user's query. \"\n           \"Even grounded answers can be irrelevant if they don't respond to the question.\"\n       ),\n   },\n   {\n       \"id\": \"doc_06\",\n       \"title\": \"G-Eval (GEval) Custom Rubrics\",\n       \"text\": (\n           \"G-Eval lets you define evaluation criteria in natural language. \"\n           \"It uses an LLM judge to score outputs against your rubric (e.g., correctness, tone, policy).\"\n       ),\n   },\n   {\n       \"id\": \"doc_07\",\n       \"title\": \"What a DeepEval Test Case Contains\",\n       \"text\": (\n           \"A test case typically includes input (query), actual_output (model answer), \"\n           \"expected_output (gold answer), and retrieval_context (ranked retrieved passages) for RAG.\"\n       ),\n   },\n   {\n       \"id\": \"doc_08\",\n       \"title\": \"Common Pitfall: Missing expected_output\",\n       \"text\": (\n           \"Some RAG metrics require expected_output in addition to input and retrieval_context. 
\"\n           \"If expected_output is None, evaluation fails for metrics like contextual precision\/recall.\"\n       ),\n   },\n]\n\n\n\n\nEVAL_QUERIES = [\n   {\n       \"query\": \"What is DeepEval used for?\",\n       \"expected\": \"DeepEval is used to evaluate and unit test LLM applications using metrics like LLM-as-a-judge, G-Eval, and RAG metrics.\",\n   },\n   {\n       \"query\": \"What does faithfulness measure in a RAG system?\",\n       \"expected\": \"Faithfulness measures whether the generated answer is supported by the retrieved context and avoids hallucinations not grounded in that context.\",\n   },\n   {\n       \"query\": \"What does contextual precision mean?\",\n       \"expected\": \"Contextual precision evaluates whether relevant retrieved chunks are ranked higher than irrelevant ones for a given query.\",\n   },\n   {\n       \"query\": \"What does contextual recall mean in retrieval?\",\n       \"expected\": \"Contextual recall measures whether the retriever returns enough relevant context to answer the query, capturing key missing information issues.\",\n   },\n   {\n       \"query\": \"Why might an answer be relevant but still low quality in RAG?\",\n       \"expected\": \"An answer can address the question (relevant) but still be low quality if it is not grounded in retrieved context or misses important details.\",\n   },\n]\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define a structured knowledge base consisting of documentation snippets that serve as our ground-truth context for the RAG system. We also establish a set of evaluation queries and corresponding expected outputs to create a \u201cgold dataset,\u201d enabling us to assess how accurately our model retrieves information and generates grounded responses. 
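A gold dataset is only useful if every entry is complete, so it can help to sanity-check it before any evaluation runs. The sketch below assumes the `EVAL_QUERIES` shape above (dicts with `query` and `expected` keys); `validate_gold_set` is a hypothetical helper, not a DeepEval function.

```python
def validate_gold_set(items):
    # Every item needs a non-empty query and expected answer, and queries
    # should be unique so per-case scores are unambiguous.
    seen = set()
    for i, item in enumerate(items):
        for key in ("query", "expected"):
            value = item.get(key, "")
            assert isinstance(value, str) and value.strip(), f"item {i}: empty {key!r}"
        q = item["query"].strip().lower()
        assert q not in seen, f"item {i}: duplicate query {item['query']!r}"
        seen.add(q)
    return len(items)

sample = [
    {"query": "What is DeepEval used for?", "expected": "Evaluating and unit testing LLM apps."},
    {"query": "What does faithfulness measure?", "expected": "Whether answers are grounded in retrieved context."},
]
n_cases = validate_gold_set(sample)  # returns 2
```

Running such a check first avoids the "missing expected_output" pitfall described in doc_08, where some RAG metrics fail mid-evaluation.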
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/rag_deepeval_quality_benchmarking_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">class TfidfRetriever:\n   def __init__(self, docs):\n       self.docs = docs\n       self.texts = [f\"{d['title']}n{d['text']}\" for d in docs]\n       self.vectorizer = TfidfVectorizer(stop_words=\"english\", ngram_range=(1, 2))\n       self.matrix = self.vectorizer.fit_transform(self.texts)\n\n\n   def retrieve(self, query, k=4):\n       qv = self.vectorizer.transform([query])\n       sims = cosine_similarity(qv, self.matrix).flatten()\n       top_idx = np.argsort(-sims)[:k]\n       results = []\n       for i in top_idx:\n           results.append(\n               {\n                   \"id\": self.docs[i][\"id\"],\n                   \"score\": float(sims[i]),\n                   \"text\": self.texts[i],\n               }\n           )\n       return results\n\n\nretriever = TfidfRetriever(DOCS)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement a custom TF-IDF Retriever class that transforms our documentation into a searchable vector space using bigram-aware TF-IDF vectorization. 
This allows us to perform cosine similarity searches against the knowledge base, ensuring we can programmatically fetch the top-k most relevant text chunks for any given query. Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/rag_deepeval_quality_benchmarking_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def extractive_baseline_answer(query, retrieved_contexts):\n   \"\"\"\n   Offline fallback: we create a short answer by extracting the most relevant sentences.\n   This keeps the notebook runnable even without OpenAI.\n   \"\"\"\n   joined = \"\\n\".join(retrieved_contexts)\n   sents = re.split(r\"(?&lt;=[.!?])\\s+\", joined)\n   keywords = [w.lower() for w in re.findall(r\"[a-zA-Z]{4,}\", query)]\n   scored = []\n   for s in sents:\n       s_l = s.lower()\n       score = sum(1 for k in keywords if k in s_l)\n       if len(s.strip()) &gt; 20:\n           scored.append((score, s.strip()))\n   scored.sort(key=lambda x: (-x[0], -len(x[1])))\n   best = [s for sc, s in scored[:3] if sc &gt; 0]\n   if not best:\n       best = [s.strip() for s in sents[:2] if len(s.strip()) &gt; 20]\n   ans = \" \".join(best).strip()\n   if not ans:\n       ans = \"I could not find enough context to answer confidently.\"\n   
return ans\n\n\ndef openai_answer(query, retrieved_contexts, model=\"gpt-4.1-mini\"):\n   \"\"\"\n   Simple RAG prompt for demonstration. DeepEval metrics can still evaluate even if\n   your generation prompt differs; the key is we store retrieval_context separately.\n   \"\"\"\n   from openai import OpenAI\n   client = OpenAI()\n\n\n   context_block = \"\\n\\n\".join([f\"[CTX {i+1}]\\n{c}\" for i, c in enumerate(retrieved_contexts)])\n   prompt = f\"\"\"You are a concise technical assistant.\nUse ONLY the provided context to answer the query. If the answer is not in context, say you don't know.\n\n\nQuery:\n{query}\n\n\nContext:\n{context_block}\n\n\nAnswer:\"\"\"\n   resp = client.chat.completions.create(\n       model=model,\n       messages=[{\"role\": \"user\", \"content\": prompt}],\n       temperature=0.2,\n   )\n   return resp.choices[0].message.content.strip()\n\n\ndef rag_answer(query, retrieved_contexts):\n   if openai_enabled:\n       try:\n           return openai_answer(query, retrieved_contexts)\n       except Exception as e:\n           print(f\"\u26a0 OpenAI generation failed, falling back to extractive baseline. Error: {e}\")\n           return extractive_baseline_answer(query, retrieved_contexts)\n   else:\n       return extractive_baseline_answer(query, retrieved_contexts)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement a hybrid answering mechanism that prioritizes high-fidelity generation via OpenAI while maintaining a keyword-based extractive baseline as a reliable fallback. By isolating the retrieval context from the final generation, we ensure our DeepEval test cases remain consistent regardless of whether the answer is synthesized by an LLM or extracted programmatically. 
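The try-primary, fall-back-on-failure pattern inside `rag_answer` generalizes to any pair of generators. Here is a minimal sketch of that pattern in isolation; `answer_with_fallback`, `flaky_llm`, and `extractive` are hypothetical names used only for illustration.

```python
def answer_with_fallback(query, contexts, primary, fallback):
    # Try the primary generator; on any failure, degrade gracefully
    # to the cheaper offline one and report which path was taken.
    try:
        return primary(query, contexts), "primary"
    except Exception:
        return fallback(query, contexts), "fallback"

def flaky_llm(query, contexts):
    raise RuntimeError("simulated API outage")

def extractive(query, contexts):
    return contexts[0] if contexts else "No context available."

ans, source = answer_with_fallback(
    "What is DeepEval?",
    ["DeepEval is an evaluation framework."],
    flaky_llm,
    extractive,
)
# source is "fallback" because the primary generator raised
```

Returning the path taken alongside the answer makes it easy to log how often the pipeline actually fell back, which is itself a useful quality signal.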
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/rag_deepeval_quality_benchmarking_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/1f680.png\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" \/> Running RAG to create test cases...\")\n\n\ntest_cases = []\nK = 4\n\n\nfor item in tqdm(EVAL_QUERIES):\n   q = item[\"query\"]\n   expected = item[\"expected\"]\n\n\n   retrieved = retriever.retrieve(q, k=K)\n   retrieval_context = [r[\"text\"] for r in retrieved] \n\n\n   actual = rag_answer(q, retrieval_context)\n\n\n   tc = LLMTestCase(\n       input=q,\n       actual_output=actual,\n       expected_output=expected,\n       retrieval_context=retrieval_context,\n   )\n   test_cases.append(tc)\n\n\nprint(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Built {len(test_cases)} LLMTestCase objects.\")\n\n\nprint(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Metrics configured.\")\n\n\nmetrics = [\n   AnswerRelevancyMetric(threshold=0.5, 
model=\"gpt-4.1\", include_reason=True, async_mode=True),\n   FaithfulnessMetric(threshold=0.5, model=\"gpt-4.1\", include_reason=True, async_mode=True),\n   ContextualRelevancyMetric(threshold=0.5, model=\"gpt-4.1\", include_reason=True, async_mode=True),\n   ContextualPrecisionMetric(threshold=0.5, model=\"gpt-4.1\", include_reason=True, async_mode=True),\n   ContextualRecallMetric(threshold=0.5, model=\"gpt-4.1\", include_reason=True, async_mode=True),\n\n\n   GEval(\n       name=\"RAG Correctness Rubric (GEval)\",\n       criteria=(\n           \"Score the answer for correctness and usefulness. \"\n           \"The answer must directly address the query, must not invent facts not supported by context, \"\n           \"and should be concise but complete.\"\n       ),\n       evaluation_params=[\n           LLMTestCaseParams.INPUT,\n           LLMTestCaseParams.ACTUAL_OUTPUT,\n           LLMTestCaseParams.EXPECTED_OUTPUT,\n           LLMTestCaseParams.RETRIEVAL_CONTEXT,\n       ],\n       model=\"gpt-4.1\",\n       threshold=0.5,\n       async_mode=True,\n   ),\n]\n\n\nif not openai_enabled:\n   print(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/> You did NOT provide an OpenAI API key.\")\n   print(\"DeepEval's LLM-as-a-judge metrics (AnswerRelevancy\/Faithfulness\/Contextual* and GEval) require an LLM judge.\")\n   print(\"Re-run this cell and provide OPENAI_API_KEY to run DeepEval metrics.\")\n   print(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> However, your RAG pipeline + test case construction succeeded end-to-end.\")\n   rows = []\n   for i, tc in enumerate(test_cases):\n       rows.append({\n           \"id\": i,\n           \"query\": tc.input,\n           \"actual_output\": tc.actual_output[:220] + (\"...\" if len(tc.actual_output) &gt; 220 else \"\"),\n           
\"expected_output\": tc.expected_output[:220] + (\"...\" if len(tc.expected_output) &gt; 220 else \"\"),\n           \"contexts\": len(tc.retrieval_context or []),\n       })\n   display(pd.DataFrame(rows))\n   raise SystemExit(\"Stopped before evaluation (no OpenAI key).\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We execute the RAG pipeline to generate LLMTestCase objects by pairing our retrieved context with model-generated answers and ground-truth expectations. We then configure a comprehensive suite of DeepEval metrics, including G-Eval and specialized RAG indicators, to evaluate the system\u2019s performance using an LLM-as-a-judge approach. Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/rag_deepeval_quality_benchmarking_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/1f9ea.png\" alt=\"\ud83e\uddea\" class=\"wp-smiley\" \/> Running DeepEval evaluate(...) 
...\")\n\n\nresults = evaluate(test_cases=test_cases, metrics=metrics)\n\n\nsummary_rows = []\nfor idx, tc in enumerate(test_cases):\n   row = {\n       \"case_id\": idx,\n       \"query\": tc.input,\n       \"actual_output\": tc.actual_output[:200] + (\"...\" if len(tc.actual_output) &gt; 200 else \"\"),\n   }\n   for m in metrics:\n       row[m.__class__.__name__ if hasattr(m, \"__class__\") else str(m)] = None\n\n\n   summary_rows.append(row)\n\n\ndef try_extract_case_metrics(results_obj):\n   extracted = []\n   candidates = []\n   for attr in [\"test_results\", \"results\", \"evaluations\"]:\n       if hasattr(results_obj, attr):\n           candidates = getattr(results_obj, attr)\n           break\n   if not candidates and isinstance(results_obj, list):\n       candidates = results_obj\n\n\n   for case_i, case_result in enumerate(candidates or []):\n       item = {\"case_id\": case_i}\n       metrics_list = None\n       for attr in [\"metrics_data\", \"metrics\", \"metric_results\"]:\n           if hasattr(case_result, attr):\n               metrics_list = getattr(case_result, attr)\n               break\n       if isinstance(metrics_list, dict):\n           for k, v in metrics_list.items():\n               item[f\"{k}_score\"] = getattr(v, \"score\", None) if v is not None else None\n               item[f\"{k}_reason\"] = getattr(v, \"reason\", None) if v is not None else None\n       else:\n           for mr in metrics_list or []:\n               name = getattr(mr, \"name\", None) or getattr(getattr(mr, \"metric\", None), \"name\", None)\n               if not name:\n                   name = mr.__class__.__name__\n               item[f\"{name}_score\"] = getattr(mr, \"score\", None)\n               item[f\"{name}_reason\"] = getattr(mr, \"reason\", None)\n       extracted.append(item)\n   return extracted\n\n\ncase_metrics = try_extract_case_metrics(results)\n\n\ndf_base = pd.DataFrame([{\n   \"case_id\": i,\n   \"query\": tc.input,\n   \"actual_output\": 
tc.actual_output,\n   \"expected_output\": tc.expected_output,\n} for i, tc in enumerate(test_cases)])\n\n\ndf_metrics = pd.DataFrame(case_metrics) if case_metrics else pd.DataFrame([])\ndf = df_base.merge(df_metrics, on=\"case_id\", how=\"left\")\n\n\nscore_cols = [c for c in df.columns if c.endswith(\"_score\")]\ncompact = df[[\"case_id\", \"query\"] + score_cols].copy()\n\n\nprint(\"\\n\ud83d\udcca Compact score table:\")\ndisplay(compact)\n\n\nprint(\"\\n\ud83e\uddfe Full details (includes reasons):\")\ndisplay(df)\n\n\nprint(\"\\n\u2705 Done. Tip: if contextual precision\/recall are low, improve retriever ranking\/coverage; if faithfulness is low, tighten generation to only use context.\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We finalize the workflow by executing the evaluate function, which triggers the LLM-as-a-judge process to score each test case against our defined metrics. We then aggregate these scores and their corresponding qualitative reasoning into a centralized DataFrame, providing a granular view of where the RAG pipeline excels or requires further optimization in retrieval and generation.<\/p>\n<p>At last, we conclude by running our comprehensive evaluation suite, in which DeepEval transforms complex linguistic outputs into actionable data using metrics such as Faithfulness, Contextual Precision, and the G-Eval rubric. This systematic approach allows us to diagnose \u201csilent failures\u201d in retrieval and hallucinations in generation with surgical precision, providing the reasoning necessary to justify architectural changes. 
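The aggregation step described above can also be sketched without pandas: given per-case metric dicts shaped like the `*_score` columns extracted earlier, compute a mean per metric and flag sub-threshold cases. The data and the `summarize` helper here are illustrative, not real DeepEval output.

```python
def summarize(case_metrics, threshold=0.5):
    # Accumulate per-metric totals and collect (case, metric, score)
    # triples that fall below the pass threshold.
    totals, counts, failures = {}, {}, []
    for case in case_metrics:
        for key, score in case.items():
            if not key.endswith("_score") or score is None:
                continue
            totals[key] = totals.get(key, 0.0) + score
            counts[key] = counts.get(key, 0) + 1
            if score < threshold:
                failures.append((case["case_id"], key, score))
    means = {k: totals[k] / counts[k] for k in totals}
    return means, failures

demo = [
    {"case_id": 0, "Faithfulness_score": 0.9, "AnswerRelevancy_score": 0.8},
    {"case_id": 1, "Faithfulness_score": 0.4, "AnswerRelevancy_score": 0.7},
]
means, failures = summarize(demo)
# mean faithfulness is about 0.65; case 1 is flagged for faithfulness
```

The same shape scales to the full DataFrame view: means drive the go/no-go decision, while the failure list points at the specific retrieval or generation cases to inspect.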
With these results, we move forward from experimental prototyping to a production-ready RAG system backed by a verifiable, metric-driven safety net.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/rag_deepeval_quality_benchmarking_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.\u00a0Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">100k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/01\/25\/a-coding-implementation-to-automating-llm-quality-assurance-with-deepeval-custom-retrievers-and-llm-as-a-judge-metrics\/\">A Coding Implementation to Automating LLM Quality Assurance with DeepEval, Custom Retrievers, and LLM-as-a-Judge Metrics<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>We initiate this tutorial by c&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-317","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/317","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=317"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/317\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=317"}],"wp:term":[{"taxonomy":"category","embedda
ble":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=317"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=317"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}