{"id":910,"date":"2026-05-15T11:38:10","date_gmt":"2026-05-15T03:38:10","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=910"},"modified":"2026-05-15T11:38:10","modified_gmt":"2026-05-15T03:38:10","slug":"poetiqs-meta-system-automatically-builds-a-model-agnostic-harness-that-improved-every-llm-tested-on-livecodebench-pro-without-fine-tuning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=910","title":{"rendered":"Poetiq\u2019s Meta-System Automatically Builds a Model-Agnostic Harness That Improved Every LLM Tested on LiveCodeBench Pro Without Fine-Tuning"},"content":{"rendered":"<p>Poetiq has published results showing that its Meta-System reached a new state of the art on LiveCodeBench Pro (LCB Pro), a competitive coding benchmark, by automatically building and optimizing its own inference harness, without fine-tuning any underlying model or accessing model internals.<\/p>\n<p>The result: GPT 5.5 High with Poetiq\u2019s harness scores 93.9% on LCB Pro (25Q2), up from its baseline of 89.6%. 
Gemini 3.1 Pro, the model the harness was specifically optimized on, jumps from 78.6% to 90.9% \u2014 surpassing Google\u2019s own Gemini 3 Deep Think (88.8%), a model that isn\u2019t even accessible via API for external verification.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1660\" height=\"890\" data-attachment-id=\"79874\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/14\/poetiqs-meta-system-automatically-builds-a-model-agnostic-harness-that-improved-every-llm-tested-on-livecodebench-pro-without-fine-tuning\/screenshot-2026-05-14-at-8-33-46-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-14-at-8.33.46-PM-1.png\" data-orig-size=\"1660,890\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-14 at 8.33.46\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-14-at-8.33.46-PM-1-1024x549.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-14-at-8.33.46-PM-1.png\" alt=\"\" class=\"wp-image-79874\" \/><figcaption class=\"wp-element-caption\">https:\/\/poetiq.ai\/posts\/recursive_self_improvement_coding\/<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What is LiveCodeBench Pro?<\/strong><\/h2>\n<p>Before getting into the mechanics, it helps to understand why the benchmark matters. 
LiveCodeBench Pro (LCB Pro) is designed to test AI coding ability in a way that resists two common failure modes in benchmarks: data contamination and overfitting.<\/p>\n<p>LCB Pro pulls problems from major competitive programming competitions and withholds public ground-truth code. Instead, solutions are validated against a comprehensive testing framework. Correct output alone isn\u2019t enough: solutions must also satisfy specific memory and runtime constraints. The benchmark is also updated continuously, which distinguishes it from many standard benchmarks that become stale.<\/p>\n<p>The benchmark focuses on C++ challenges and emphasizes creative coding, testing a model\u2019s capacity for complex problem-solving and high-quality, performant procedural logic. This distinguishes it from datasets like SWE-bench that evaluate tool usage or bug-fixing workflows. Problems are categorized by difficulty (Easy, Medium, or Hard) based on competitive human solve rates.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2246\" height=\"948\" data-attachment-id=\"79876\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/14\/poetiqs-meta-system-automatically-builds-a-model-agnostic-harness-that-improved-every-llm-tested-on-livecodebench-pro-without-fine-tuning\/screenshot-2026-05-14-at-8-34-28-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-14-at-8.34.28-PM-1.png\" data-orig-size=\"2246,948\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-14 at 8.34.28\u202fPM\" data-image-description=\"\" data-image-caption=\"\" 
data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-14-at-8.34.28-PM-1-1024x432.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-14-at-8.34.28-PM-1.png\" alt=\"\" class=\"wp-image-79876\" \/><figcaption class=\"wp-element-caption\">https:\/\/poetiq.ai\/posts\/recursive_self_improvement_coding\/<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Poetiq\u2019s Strategic Framing: Three LLM Task Categories<\/strong><\/h2>\n<p>This is Poetiq\u2019s third publicly reported benchmark, and the choice of LCB Pro was deliberate. The research team frames LLM performance around three distinct task categories: Reasoning challenges (ARC-AGI is their benchmark here), Retrieval challenges (Humanity\u2019s Last Exam, or HLE), and Coding challenges \u2014 which, as the most pervasive commercial application for AI today, meld reasoning and retrieval with the generation of specialized procedural logic.<\/p>\n<p>Their coding initiative had three specific, stated objectives: first, prove that an intelligent harness can boost efficacy without fine-tuning or special model access; second, validate the Meta-System\u2019s capacity for recursive self-improvement in creating that harness automatically; and third, demonstrate that the resulting harness is model-agnostic and can be applied to any model without modification. According to their results, all three were satisfied.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What is a Harness, and Why Does It Matter?<\/strong><\/h2>\n<p>In this context, a harness refers to the infrastructure wrapped around a language model to handle a specific task. Think of it as an orchestration layer \u2014 it controls how the model is prompted, how outputs are structured, how answers are assembled across multiple calls, and how solutions are evaluated.<\/p>\n<p>Traditionally, these harnesses are hand-built by engineers. 
Poetiq claims that its Meta-System builds and optimizes these harnesses automatically, through recursive self-improvement. Internally, the Meta-System works by developing better strategies for determining what to ask, refining sequential chains of questions, and devising new methods for assembling the answers. The system constantly incorporates learnings from previous and current tasks and datasets to create new, custom task-specific harnesses, as well as agents and orchestrators for other task types.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How the Harness was Built<\/strong><\/h2>\n<p>Poetiq\u2019s Meta-System was given the LCB Pro task and constructed a harness from scratch using only Gemini 3.1 Pro as the base model. The Meta-System accounted for all three dimensions LCB Pro tests: accuracy, runtime, and memory constraints. The system built on insights from its previous work on ARC-AGI and HLE when designing the harness. No fine-tuning of the underlying model was performed, and no access to internal model activations was required: standard API access was enough.<\/p>\n<p>Once the harness was built and optimized for Gemini 3.1 Pro, it was applied to a broad set of other models from different providers and generations, both open-weights and proprietary, without any additional optimization. Every model tested improved.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Numbers<\/strong><\/h2>\n<p>The results break down by difficulty tier. On Hard problems, the category where gaps between models are largest, Gemini 3.1 Pro with Poetiq\u2019s harness scores 58.3%, up from its 7.7% baseline. GPT 5.5 High with the harness reaches 75.0% on Hard, up from 50.0%. Across the Easy and Medium categories, the harness also outperforms all base models.<\/p>\n<p>Some of the smaller model results are also notable. 
Gemini 3.0 Flash improves by 10 percentage points, going from 72.3% to 82.3% and overtaking the baseline scores of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High, all larger and more expensive models. This mirrors a pattern Poetiq previously observed on ARC-AGI, where its optimization allowed a smaller, more economical model to surpass a bigger one. Kimi K2.6 sees the largest jump: from 50.0% to 79.9%, a roughly 30 percentage point improvement. Nemotron 3 Super 120B improves by 12.8%.<\/p>\n<p>Accuracy numbers are reported directly from the LCB Pro leaderboard at livecodebenchpro.com (25Q2). For models not featured on the leaderboard, Poetiq conducted its own evaluations, cross-validating its experimental setup by replicating official leaderboard accuracies for baseline models.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Poetiq\u2019s Meta-System automatically builds task-specific harnesses through recursive self-improvement, with no model fine-tuning or internal model access<\/li>\n<li>GPT 5.5 High with the harness reaches 93.9% on LCB Pro (25Q2), up 4.3 percentage points from its 89.6% baseline; Gemini 3.1 Pro gains 12.3 percentage points (78.6% \u2192 90.9%)<\/li>\n<li>The harness is model-agnostic: optimized using only Gemini 3.1 Pro, it improved every other model tested, both open-weights and proprietary, without modification<\/li>\n<li>Gemini 3.0 Flash gains 10 percentage points with the harness (72.3% \u2192 82.3%), surpassing the baseline scores of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High despite being smaller and cheaper<\/li>\n<li>Kimi K2.6 shows the largest gain at ~30 percentage points (50.0% \u2192 79.9%); Nemotron 3 Super 120B improves by 12.8%<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/poetiq.ai\/posts\/recursive_self_improvement_coding\/\" target=\"_blank\" rel=\"noreferrer noopener\">technical details here<\/a><\/strong>.<\/p>\n
<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/14\/poetiqs-meta-system-automatically-builds-a-model-agnostic-harness-that-improved-every-llm-tested-on-livecodebench-pro-without-fine-tuning\/\">Poetiq\u2019s Meta-System Automatically Builds a Model-Agnostic Harness That Improved Every LLM Tested on LiveCodeBench Pro Without Fine-Tuning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Poetiq has just published 
some&hellip;<\/p>\n","protected":false},"author":1,"featured_media":911,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-910","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/910","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=910"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/910\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/911"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=910"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=910"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=910"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}