{"id":456,"date":"2026-02-23T11:39:16","date_gmt":"2026-02-23T03:39:16","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=456"},"modified":"2026-02-23T11:39:16","modified_gmt":"2026-02-23T03:39:16","slug":"a-coding-guide-to-instrumenting-tracing-and-evaluating-llm-applications-using-trulens-and-openai-models","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=456","title":{"rendered":"A Coding Guide to Instrumenting, Tracing, and Evaluating LLM Applications Using TruLens and OpenAI Models"},"content":{"rendered":"<p>In this tutorial, we focus on building a transparent and measurable evaluation pipeline for large language model applications using <a href=\"https:\/\/github.com\/truera\/trulens\"><strong>TruLens<\/strong><\/a>. Rather than treating LLMs as black boxes, we instrument each stage of an application so that inputs, intermediate steps, and outputs are captured as structured traces. We then attach feedback functions that quantitatively evaluate model behavior along dimensions such as relevance, grounding, and contextual alignment. 
By running multiple application variants under the same evaluation setup, we show how TruLens enables disciplined experimentation, reproducibility, and data-driven improvement of LLM systems.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">!pip -q install trulens trulens-providers-openai chromadb openai\n\n\nimport os, re, getpass\nfrom dataclasses import dataclass\nfrom typing import List, Dict, Any\nimport numpy as np\n\n\nimport chromadb\nfrom chromadb.utils.embedding_functions import OpenAIEmbeddingFunction\n\n\nfrom openai import OpenAI\n\n\nfrom trulens.core import TruSession, Feedback\nfrom trulens.providers.openai import OpenAI as TruOpenAI\nfrom trulens.apps.app import TruApp\nfrom trulens.core.otel.instrument import instrument\nfrom trulens.otel.semconv.trace import SpanAttributes\nfrom trulens.dashboard import run_dashboard\n\n\nif not os.environ.get(\"OPENAI_API_KEY\"):\n   os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter OPENAI_API_KEY (input hidden): \")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We prepare the Colab environment by installing all required libraries and importing the core dependencies used throughout the tutorial. We securely read the OpenAI API key from the terminal to avoid hardcoding sensitive credentials. 
We also initialize the foundational tooling that enables tracing, feedback evaluation, and dashboard visualization.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def normalize_ws(s: str) -&gt; str:\n   # Collapse runs of whitespace into single spaces and trim the ends.\n   return re.sub(r\"\\s+\", \" \", s).strip()\n\n\nRAW_DOCS = [\n   {\n       \"doc_id\": \"trulens_core\",\n       \"title\": \"TruLens core idea\",\n       \"text\": \"TruLens is used to track and evaluate LLM applications. It can log app runs, compute feedback scores, and provide a dashboard to compare versions and investigate traces and results.\"\n   },\n   {\n       \"doc_id\": \"trulens_feedback\",\n       \"title\": \"Feedback functions\",\n       \"text\": \"TruLens feedback functions can score groundedness, context relevance, and answer relevance. They are configured by specifying which parts of an app record should be used as inputs.\"\n   },\n   {\n       \"doc_id\": \"trulens_rag\",\n       \"title\": \"RAG workflow\",\n       \"text\": \"A typical RAG system retrieves relevant chunks from a vector database and then generates an answer using those chunks as context. The quality depends on retrieval, prompt design, and generation behavior.\"\n   },\n   {\n       \"doc_id\": \"trulens_instrumentation\",\n       \"title\": \"Instrumentation\",\n       \"text\": \"Instrumentation adds tracing spans to your app functions (like retrieval and generation). 
This makes it possible to analyze which contexts were retrieved, latency, token usage, and connect feedback evaluations to specific steps.\"\n   },\n   {\n       \"doc_id\": \"vectorstores\",\n       \"title\": \"Vector stores and embeddings\",\n       \"text\": \"Vector stores index embeddings for text chunks, enabling semantic search. OpenAI embedding models can be used to embed chunks and queries, and Chroma can store them locally in memory for a notebook demo.\"\n   },\n   {\n       \"doc_id\": \"prompting\",\n       \"title\": \"Prompting and citations\",\n       \"text\": \"Prompting can encourage careful, citation-grounded answers. A stronger prompt can enforce: answer only from context, be explicit about uncertainty, and provide short citations that map to retrieved chunks.\"\n   },\n]\n\n\n@dataclass\nclass Chunk:\n   chunk_id: str\n   doc_id: str\n   title: str\n   text: str\n   meta: Dict[str, Any]\n\n\ndef chunk_docs(docs, chunk_size=350, overlap=80) -&gt; List[Chunk]:\n   chunks: List[Chunk] = []\n   for d in docs:\n       text = normalize_ws(d[\"text\"])\n       start = 0\n       idx = 0\n       while start &lt; len(text):\n           end = min(len(text), start + chunk_size)\n           chunk_text = text[start:end]\n           chunk_id = f'{d[\"doc_id\"]}_c{idx}'\n           chunks.append(\n               Chunk(\n                   chunk_id=chunk_id,\n                   doc_id=d[\"doc_id\"],\n                   title=d[\"title\"],\n                   text=chunk_text,\n                   meta={\"doc_id\": d[\"doc_id\"], \"title\": d[\"title\"], \"chunk_index\": idx},\n               )\n           )\n           idx += 1\n           start = end - overlap\n           if start &lt; 0:\n               start = 0\n           if end == len(text):\n               break\n   return chunks\n\n\nCHUNKS = chunk_docs(RAW_DOCS)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define the raw knowledge sources and implement a clean, reusable text-chunking pipeline. 
We normalize document text and split it into overlapping chunks to preserve semantic continuity during retrieval. We structure each chunk with metadata so it can later be traced, evaluated, and cited during RAG execution.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">EMBED_MODEL = \"text-embedding-3-small\"\nembedding_function = OpenAIEmbeddingFunction(\n   api_key=os.environ.get(\"OPENAI_API_KEY\"),\n   model_name=EMBED_MODEL,\n)\n\n\nchroma_client = chromadb.Client()\ncollection = chroma_client.get_or_create_collection(\n   name=\"trulens_demo_kb\",\n   embedding_function=embedding_function,\n)\n\n\nids = [c.chunk_id for c in CHUNKS]\ndocs = [c.text for c in CHUNKS]\nmetas = [c.meta for c in CHUNKS]\ncollection.add(ids=ids, documents=docs, metadatas=metas)\n\n\noai_client = OpenAI()\n\n\ndef format_context(hits):\n   # Render each retrieved chunk as a [C#] line the prompt (and citations) can reference.\n   lines = []\n   for i, h in enumerate(hits):\n       meta = h[\"meta\"]\n       lines.append(\n           f\"[C{i}] ({meta.get('title','')}, {meta.get('doc_id','')}, chunk={meta.get('chunk_index','?')}): {h['text']}\"\n       )\n   return \"\\n\".join(lines)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We create the vector database using Chroma and OpenAI embeddings to enable semantic search over the chunked knowledge base. We insert all chunks into the collection and prepare the OpenAI client for downstream generation. 
We also define a context-formatting utility that converts retrieved chunks into a structured prompt-ready format.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">class RAG:\n   def __init__(self, *, gen_model: str, prompt_style: str = \"base\", k: int = 4):\n       self.gen_model = gen_model\n       self.prompt_style = prompt_style\n       self.k = k\n\n\n   @instrument(\n       span_type=SpanAttributes.SpanType.RETRIEVAL,\n       attributes={\n           SpanAttributes.RETRIEVAL.QUERY_TEXT: \"query\",\n           SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: \"return\",\n       },\n   )\n   def retrieve(self, query: str) -&gt; list:\n       res = collection.query(query_texts=[query], n_results=self.k)\n       hits = []\n       for i in range(len(res[\"ids\"][0])):\n           hits.append(\n               {\n                   \"id\": res[\"ids\"][0][i],\n                   \"text\": res[\"documents\"][0][i],\n                   \"meta\": res[\"metadatas\"][0][i],\n               }\n           )\n       return hits\n\n\n   @instrument(span_type=SpanAttributes.SpanType.GENERATION)\n   def generate(self, query: str, hits: list) -&gt; str:\n       if not hits:\n           return \"I don't have enough relevant information in the knowledge base to answer.\"\n\n\n       context = format_context(hits)\n\n\n       if self.prompt_style == \"strict_citations\":\n           system = (\n      
         \"You are a careful assistant. Use ONLY the provided context. \"\n               \"If the context is insufficient, say so. \"\n               \"When you make a claim, cite it with [C#] tags matching the context chunks.\"\n           )\n           user = f\"Context:n{context}nnQuestion: {query}nnAnswer (with [C#] citations):\"\n       else:\n           system = \"You are a helpful assistant.\"\n           user = f\"Context:n{context}nnQuestion: {query}nnAnswer using the context above:\"\n\n\n       resp = oai_client.chat.completions.create(\n           model=self.gen_model,\n           messages=[\n               {\"role\": \"system\", \"content\": system},\n               {\"role\": \"user\", \"content\": user},\n           ],\n       )\n       out = resp.choices[0].message.content\n       return out if out else \"No answer returned.\"\n\n\n   @instrument(\n       span_type=SpanAttributes.SpanType.RECORD_ROOT,\n       attributes={\n           SpanAttributes.RECORD_ROOT.INPUT: \"query\",\n           SpanAttributes.RECORD_ROOT.OUTPUT: \"return\",\n       },\n   )\n   def query(self, query: str) -&gt; str:\n       hits = self.retrieve(query=query)\n       return self.generate(query=query, hits=hits)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement the core RAG application with explicit instrumentation on retrieval, generation, and the request root. We capture queries, retrieved contexts, and generated outputs as traceable spans for later evaluation. 
We also support multiple prompt styles, allowing us to systematically compare different prompting strategies under identical conditions.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">session = TruSession()\nsession.reset_database()\n\n\nEVAL_MODEL = \"gpt-4o-mini\"\nprovider = TruOpenAI(model_engine=EVAL_MODEL)\n\n\nf_groundedness = (\n   Feedback(\n       provider.groundedness_measure_with_cot_reasons_consider_answerability,\n       name=\"Groundedness\",\n   )\n   .on_context(collect_list=True)\n   .on_output()\n   .on_input()\n)\n\n\nf_answer_relevance = (\n   Feedback(provider.relevance_with_cot_reasons, name=\"Answer Relevance\")\n   .on_input()\n   .on_output()\n)\n\n\nf_context_relevance = (\n   Feedback(provider.context_relevance_with_cot_reasons, name=\"Context Relevance\")\n   .on_input()\n   .on_context(collect_list=False)\n   .aggregate(np.mean)\n)\n\n\nGEN_MODEL = \"gpt-4o-mini\"\n\n\nrag_base = RAG(gen_model=GEN_MODEL, prompt_style=\"base\", k=4)\nrag_strict = RAG(gen_model=GEN_MODEL, prompt_style=\"strict_citations\", k=4)\n\n\ntru_base = TruApp(\n   rag_base,\n   app_name=\"TruLens-RAG\",\n   app_version=\"v1_base_prompt\",\n   feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],\n)\n\n\ntru_strict = TruApp(\n   rag_strict,\n   app_name=\"TruLens-RAG\",\n   app_version=\"v2_strict_citations\",\n   feedbacks=[f_groundedness, f_answer_relevance, 
f_context_relevance],\n)\n\n\nEVAL_QUERIES = [\n   \"What is TruLens used for?\",\n   \"What are the three common RAG feedbacks to evaluate?\",\n   \"Why does instrumentation matter in RAG evaluation?\",\n   \"What role do embeddings play in a vector store?\",\n   \"How can prompting encourage grounded answers?\",\n]\n\n\nwith tru_base as recording:\n   for q in EVAL_QUERIES:\n       rag_base.query(q)\n\n\nwith tru_strict as recording:\n   for q in EVAL_QUERIES:\n       rag_strict.query(q)\n\n\nleaderboard = session.get_leaderboard()\nprint(leaderboard)\n\n\nrun_dashboard(session)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We configure the TruLens evaluation session and define feedback functions for groundedness, answer relevance, and context relevance. We run multiple versions of the RAG system across a shared evaluation set to generate comparable records. We then surface the results through the leaderboard and interactive dashboard to analyze performance differences and reasoning quality.<\/p>\n<p>In conclusion, we established a practical workflow for understanding and evaluating LLM behavior beyond surface-level outputs. We demonstrated how instrumentation turns every model call into an inspectable artifact and how feedback functions convert subjective judgments into consistent metrics. Through versioned runs, leaderboards, and dashboards, we can compare design choices with clarity and confidence. 
This tutorial lays the groundwork for building reliable, auditable, and continuously improving LLM applications in real-world settings where trust and explainability matter as much as performance.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Evaluation\/trulens_llm_instrumentation_feedback_evaluation_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes here<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">100k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">Now you can join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/22\/a-coding-guide-to-instrumenting-tracing-and-evaluating-llm-applications-using-trulens-and-openai-models\/\">A Coding Guide to Instrumenting, Tracing, and Evaluating LLM Applications Using TruLens and OpenAI Models<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we focus on &hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-456","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/456","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=456"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/456\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=456"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/co
nnectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=456"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=456"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}