{"id":815,"date":"2026-04-29T10:47:22","date_gmt":"2026-04-29T02:47:22","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=815"},"modified":"2026-04-29T10:47:22","modified_gmt":"2026-04-29T02:47:22","slug":"how-to-build-traceable-and-evaluated-llm-workflows-using-promptflow-prompty-and-openai","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=815","title":{"rendered":"How to Build Traceable and Evaluated LLM Workflows Using Promptflow,\u00a0Prompty, and OpenAI"},"content":{"rendered":"<p>In this tutorial, we build a complete, production-style LLM workflow using <a href=\"https:\/\/github.com\/microsoft\/promptflow\"><strong>Promptflow<\/strong><\/a> within a Colab environment. We begin by setting up a reliable keyring backend to avoid OS dependency issues and securely configure our OpenAI connection. From there, we establish a clean workspace and define a structured Prompty file that acts as the core LLM component of our pipeline. We then design a class-based flex flow that combines deterministic preprocessing with LLM reasoning, allowing us to inject computed hints into model responses. We also enable tracing to monitor each execution step, run both single- and batch-queries, and generate outputs in a structured format. 
Finally, we extend the system with an evaluation pipeline that leverages an LLM-as-a-judge to score responses against expected answers.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">!pip install -q keyrings.alt \"promptflow&gt;=1.13.0\"\n\n\n# Register the file-based keyring before Promptflow stores any connection secret.\nimport keyring\nfrom keyrings.alt.file import PlaintextKeyring\nkeyring.set_keyring(PlaintextKeyring())\n\n\nimport os, getpass\nfrom promptflow.client import PFClient\nfrom promptflow.connections import OpenAIConnection\n\n\n# Ask for the key up front so the lookup below cannot raise KeyError.\nif \"OPENAI_API_KEY\" not in os.environ:\n   os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Paste your OpenAI API key: \")\n\n\npf = PFClient()\nCONN = \"open_ai_connection\"\ntry:\n   pf.connections.get(name=CONN)\n   print(f\"Using existing connection '{CONN}'\")\nexcept Exception:\n   pf.connections.create_or_update(\n       OpenAIConnection(name=CONN, api_key=os.environ[\"OPENAI_API_KEY\"])\n   )\n   print(f\"Created connection '{CONN}'\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by installing a fallback keyring backend to avoid dependency issues in environments like Colab, together with Promptflow itself so that the imports resolve. Note that the plaintext backend stores secrets unencrypted on disk, so it suits throwaway notebook sessions only. We then initialize the Promptflow client and check if an OpenAI connection already exists. 
If not, we create one using the API key from the environment, ensuring a reusable and consistent connection setup.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">!pip install -q \"promptflow&gt;=1.13.0\" \"promptflow-tracing\" \"promptflow-tools\" openai\n\n\nimport os, sys, json, getpass, textwrap, importlib\nfrom pathlib import Path\n\n\nif \"OPENAI_API_KEY\" not in os.environ:\n   os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Paste your OpenAI API key: \")\n\n\nWORK_DIR = Path(\"\/content\/pf_demo\"); WORK_DIR.mkdir(exist_ok=True, parents=True)\nos.chdir(WORK_DIR); sys.path.insert(0, str(WORK_DIR))\n\n\nfrom promptflow.client import PFClient\nfrom promptflow.connections import OpenAIConnection\nfrom promptflow.tracing import start_trace\n\n\npf = PFClient()\nCONN = \"open_ai_connection\"\ntry:\n   pf.connections.get(name=CONN); print(f\"Using existing connection '{CONN}'\")\nexcept Exception:\n   pf.connections.create_or_update(OpenAIConnection(name=CONN, api_key=os.environ[\"OPENAI_API_KEY\"]))\n   print(f\"Created connection '{CONN}'\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We install all required Promptflow libraries and set up the project\u2019s working directory. We securely capture the OpenAI API key if it is not already set and configure the environment accordingly. 
We then reinitialize the Promptflow client and ensure that the connection is properly established for downstream usage.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">(WORK_DIR \/ \"researcher.prompty\").write_text(\"\"\"---\nname: Researcher\ndescription: Concise research assistant.\nmodel:\n api: chat\n configuration:\n   type: openai\n   connection: open_ai_connection\n   model: gpt-4o-mini\n parameters:\n   temperature: 0.2\n   max_tokens: 350\ninputs:\n question: {type: string}\n hint:     {type: string, default: \"\"}\nsample:\n question: \"What is the speed of light in vacuum?\"\n hint: \"\"\n---\nsystem:\nYou are a precise research assistant. Answer in 1-3 sentences. If a `hint` is given, weave it in.\n\n\nuser:\nQ: {{question}}\n{% if hint %}Hint: {{hint}}{% endif %}\n\"\"\")\n\n\n(WORK_DIR \/ \"flow.py\").write_text(textwrap.dedent('''\n   from pathlib import Path\n   from promptflow.tracing import trace\n   from promptflow.core import Prompty\n\n\n   BASE = Path(__file__).parent\n\n\n   @trace\n   def safe_calc(expression: str) -&gt; str:\n       \"\"\"A tiny deterministic 'tool' the assistant can lean on.\"\"\"\n       if not set(expression) &lt;= set(\"0123456789+-*\/(). \"):\n           return \"unsafe\"\n       try: return str(eval(expression))\n       except Exception as e: return f\"error:{e}\"\n\n\n   class ResearchAssistant:\n       \"\"\"Class-based flex flow. __init__ args become flow init parameters.\"\"\"\n       def __init__(self, model: str = \"gpt-4o-mini\"):\n           self.model = model\n           self.llm = Prompty.load(source=BASE \/ \"researcher.prompty\")\n\n\n       @trace\n       def __call__(self, question: str) -&gt; dict:\n           hint = \"\"\n           if \"*\" in question or \"+\" in question:\n               # Keep numeric AND operator tokens so \"21 * 19\" becomes \"21*19\", not \"2119\".\n               tokens = [t for t in question.replace(\"?\",\"\").split() if set(t) &lt;= set(\"0123456789+-*\/().\")]\n               expr = \"\".join(tokens)\n               if expr:\n                   hint = f\"computed: {expr} = {safe_calc(expr)}\"\n\n\n           answer = self.llm(question=question, hint=hint)\n\n\n           return {\"question\": question, \"answer\": str(answer).strip(), \"hint_used\": hint}\n'''))\n\n\n(WORK_DIR \/ \"flow.flex.yaml\").write_text(\n   \"$schema: https:\/\/azuremlschemas.azureedge.net\/promptflow\/latest\/Flow.schema.json\\n\"\n   \"entry: flow:ResearchAssistant\\n\"\n)\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define a Prompty file that structures how the LLM should behave as a concise research assistant. We then create a class-based flow that combines a deterministic calculation tool with an LLM call, enabling hybrid reasoning. 
Finally, we register this flow using a YAML configuration, making it executable within the Promptflow framework.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">try: start_trace()\nexcept Exception as e: print(\"trace ui unavailable on Colab; traces still recorded:\", e)\n\n\nimport flow as _flow; importlib.reload(_flow)\nagent = _flow.ResearchAssistant(model=\"gpt-4o-mini\")\n\n\nprint(\"\\n=== Single call ===\")\nprint(json.dumps(agent(question=\"In one sentence, what is photosynthesis?\"), indent=2))\nprint(json.dumps(agent(question=\"What is 21 * 19 ?\"), indent=2))\n\n\ndata = [\n   {\"question\": \"What is the capital of France?\",          \"expected\": \"Paris\"},\n   {\"question\": \"Chemical symbol for gold?\",               \"expected\": \"Au\"},\n   {\"question\": \"Who wrote the play Hamlet?\",              \"expected\": \"Shakespeare\"},\n   {\"question\": \"What is 12 * 11 ?\",                       \"expected\": \"132\"},\n   {\"question\": \"Boiling point of water at sea level (C)?\",\"expected\": \"100\"},\n   {\"question\": \"Largest planet in our solar system?\",     \"expected\": \"Jupiter\"},\n]\ndata_path = WORK_DIR \/ \"data.jsonl\"\n# One JSON object per line, as the JSONL format requires.\ndata_path.write_text(\"\\n\".join(json.dumps(r) for r in data))\n\n\nprint(\"\\n=== Batch run ===\")\nbase_run = pf.run(\n   flow=str(WORK_DIR \/ \"flow.flex.yaml\"),\n   data=str(data_path),\n   column_mapping={\"question\": 
\"${data.question}\"},\n   stream=True,\n)\nprint(pf.get_details(base_run))<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We enable tracing to capture execution details and instantiate our research assistant flow. We test the system with individual queries to verify both natural language and arithmetic handling. We then prepare a dataset and run a batch job in Promptflow, collecting structured outputs for further evaluation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">(WORK_DIR \/ \"judge.prompty\").write_text(\"\"\"---\nname: Judge\nmodel:\n api: chat\n configuration:\n   type: openai\n   connection: open_ai_connection\n   model: gpt-4o-mini\n parameters:\n   temperature: 0\n   max_tokens: 150\n   response_format: {type: json_object}\ninputs:\n question: {type: string}\n answer:   {type: string}\n expected: {type: string}\n---\nsystem:\nYou are an exacting grader. Decide whether the assistant's answer contains the expected fact (case-insensitive, allowing reasonable phrasing\/synonyms). 
Reply ONLY as JSON: {\"score\": 0 or 1, \"reason\": \"...\"}.\n\n\nuser:\nQuestion: {{question}}\nExpected: {{expected}}\nAnswer:   {{answer}}\n\"\"\")\n\n\n(WORK_DIR \/ \"eval_flow.py\").write_text(textwrap.dedent('''\n   import json\n   from pathlib import Path\n   from promptflow.tracing import trace\n   from promptflow.core import Prompty\n\n\n   BASE = Path(__file__).parent\n\n\n   class Evaluator:\n       def __init__(self):\n           self.judge = Prompty.load(source=BASE \/ \"judge.prompty\")\n\n\n       @trace\n       def __call__(self, question: str, answer: str, expected: str) -&gt; dict:\n           raw = self.judge(question=question, answer=answer, expected=expected)\n           if isinstance(raw, str):\n               try: raw = json.loads(raw)\n               except Exception: raw = {\"score\": 0, \"reason\": f\"unparseable:{raw[:80]}\"}\n           return {\"score\": int(raw.get(\"score\", 0)), \"reason\": str(raw.get(\"reason\",\"\"))}\n\n\n       def __aggregate__(self, line_results):\n           \"\"\"Run-level aggregation. 
Whatever this returns shows up in pf.get_metrics().\"\"\"\n           scores = [r[\"score\"] for r in line_results if r]\n           return {\n               \"accuracy\": (sum(scores) \/ len(scores)) if scores else 0.0,\n               \"passed\":   sum(scores),\n               \"total\":    len(scores),\n           }\n'''))\n\n\n(WORK_DIR \/ \"eval.flex.yaml\").write_text(\n   \"$schema: https:\/\/azuremlschemas.azureedge.net\/promptflow\/latest\/Flow.schema.json\\n\"\n   \"entry: eval_flow:Evaluator\\n\"\n)\n\n\nprint(\"\\n=== Evaluation run ===\")\neval_run = pf.run(\n   flow=str(WORK_DIR \/ \"eval.flex.yaml\"),\n   data=str(data_path),\n   run=base_run,\n   column_mapping={\n       \"question\": \"${data.question}\",\n       \"expected\": \"${data.expected}\",\n       \"answer\":   \"${run.outputs.answer}\",\n   },\n   stream=True,\n)\n\n\neval_details = pf.get_details(eval_run)\nprint(eval_details)\n\n\nprint(\"\\n=== Aggregated metrics (from __aggregate__) ===\")\nprint(json.dumps(pf.get_metrics(eval_run), indent=2))\n\n\nimport pandas as pd\nif \"outputs.score\" in eval_details.columns:\n   s = pd.to_numeric(eval_details[\"outputs.score\"], errors=\"coerce\").fillna(0)\n   print(f\"Manual accuracy: {s.mean():.2%}  ({int(s.sum())}\/{len(s)})\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We create a judging Prompty that evaluates model outputs against expected answers using structured JSON responses. We implement an evaluator class that parses results, computes scores, and defines an aggregation method for overall metrics. Finally, we run the evaluation pipeline, link it to the base run, and compute accuracy both through Promptflow metrics and a manual fallback.<\/p>\n<p>In conclusion, we built a robust, modular LLM pipeline that extends beyond basic prompt-response interactions. We integrated deterministic tools, structured prompting, and reusable flow components to create a system that is both transparent and scalable. 
Through batch execution and linked evaluation runs, we established a clear feedback loop that helps us measure performance using accuracy metrics and detailed reasoning. The inclusion of tracing and aggregation functions enables us to debug, monitor, and improve the system efficiently. Also, this workflow demonstrates how we can design reliable, end-to-end LLM applications with strong foundations in structure, evaluation, and reproducibility.<\/p>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/LLM%20Projects\/promptflow_traceable_llm_workflow_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/28\/how-to-build-traceable-and-evaluated-llm-workflows-using-promptflow-prompty-and-openai\/\">How to Build Traceable and Evaluated LLM Workflows Using Promptflow,\u00a0Prompty, and OpenAI<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build a c&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-815","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/815","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=815"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/815\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/conne
ctword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=815"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=815"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=815"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}