{"id":669,"date":"2026-04-05T17:21:16","date_gmt":"2026-04-05T09:21:16","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=669"},"modified":"2026-04-05T17:21:16","modified_gmt":"2026-04-05T09:21:16","slug":"meet-autoagent-the-open-source-library-that-lets-an-ai-engineer-and-optimize-its-own-agent-harness-overnight","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=669","title":{"rendered":"Meet \u2018AutoAgent\u2019: The Open-Source Library That Lets an AI Engineer and Optimize Its Own Agent Harness Overnight"},"content":{"rendered":"<p>There\u2019s a particular kind of tedium that every AI engineer knows intimately: the prompt-tuning loop. You write a system prompt, run your agent against a benchmark, read the failure traces, tweak the prompt, add a tool, rerun. Repeat this a few dozen times and you might move the needle. It\u2019s grunt work dressed up in Python files. Now, a new open-source library called <strong>AutoAgent<\/strong>, built by Kevin Gu at <a href=\"https:\/\/github.com\/kevinrgu\">thirdlayer.inc<\/a>, proposes an unsettling alternative \u2014 don\u2019t do that work yourself. Let an AI do it.<\/p>\n<p>AutoAgent is an open source library for autonomously improving an agent on any domain. 
In a 24-hour run, it hit #1 on SpreadsheetBench with a score of 96.5%, and achieved the #1 GPT-5 score on TerminalBench with 55.1%.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"413\" data-attachment-id=\"78804\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/05\/meet-autoagent-the-open-source-library-that-lets-an-ai-engineer-and-optimize-its-own-agent-harness-overnight\/image-410\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-6.png\" data-orig-size=\"1200,413\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-6-300x103.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-6-1024x352.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-6.png\" alt=\"\" class=\"wp-image-78804\" \/><figcaption class=\"wp-element-caption\">https:\/\/x.com\/kevingu\/status\/2039843234760073341<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What Is AutoAgent, Really?<\/strong><\/h3>\n<p>AutoAgent is described as being \u2018like autoresearch but for agent engineering.\u2019 The idea: give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. 
It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.<\/p>\n<p>To understand the analogy: Andrej Karpathy\u2019s <code>autoresearch<\/code> does the same thing for ML training \u2014 it loops through propose-train-evaluate cycles, keeping only changes that improve validation loss. AutoAgent ports that same ratchet loop from ML training into agent engineering. Instead of optimizing a model\u2019s weights or training hyperparameters, it optimizes the <em>harness<\/em> \u2014 the system prompt, tool definitions, routing logic, and orchestration strategy that determine how an agent behaves on a task.<\/p>\n<p>A <strong>harness<\/strong>, in this context, is the scaffolding around an LLM: what system prompt it receives, what tools it can call, how it routes between sub-agents, and how tasks are formatted as inputs. Most agent engineers hand-craft this scaffolding. AutoAgent automates the iteration on that scaffolding itself.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Architecture: Two Agents, One File, One Directive<\/strong><\/h3>\n<p>The <a href=\"https:\/\/github.com\/kevinrgu\/autoagent\/tree\/main\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub repo<\/a> has a deliberately simple structure. <code>agent.py<\/code> is the entire harness under test in a single file \u2014 it contains config, tool definitions, agent registry, routing\/orchestration, and the Harbor adapter boundary. The adapter section is explicitly marked as fixed; the rest is the primary edit surface for the meta-agent. <code>program.md<\/code> contains instructions for the meta-agent plus the directive (what kind of agent to build), and this is the only file the human edits. <\/p>\n<p>Think of it as a separation of concerns between human and machine. The human sets the <em>direction<\/em> inside <code>program.md<\/code>. 
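The keep-or-discard loop described above can be sketched in a few lines of Python. Everything here is illustrative: propose_change, run_benchmark, and the toy scoring rule are hypothetical stand-ins invented for this sketch, not AutoAgent's actual API.

```python
import random

def run_benchmark(harness: dict) -> float:
    """Toy stand-in for a real benchmark: rewards prompts that mention the task."""
    points = 5  # baseline
    prompt = harness["system_prompt"].lower()
    if "spreadsheet" in prompt:
        points += 3
    if "verify your answer" in prompt:
        points += 1
    return points / 10  # score in [0.0, 1.0], like a benchmark reward

def propose_change(harness: dict, rng: random.Random) -> dict:
    """Toy mutation: append one candidate instruction to the system prompt."""
    candidates = [" Focus on the spreadsheet.", " Verify your answer.", " Be brief."]
    changed = dict(harness)
    changed["system_prompt"] += rng.choice(candidates)
    return changed

def hill_climb(harness: dict, iterations: int, seed: int = 0) -> tuple[dict, float]:
    """Propose a change, score it, keep it only if the score improves."""
    rng = random.Random(seed)
    best_score = run_benchmark(harness)
    for _ in range(iterations):
        candidate = propose_change(harness, rng)
        score = run_benchmark(candidate)
        if score > best_score:
            harness, best_score = candidate, score  # keep the improvement
        # otherwise discard: the ratchet never moves backwards
    return harness, best_score

best, best_score = hill_climb({"system_prompt": "You are a helpful agent."}, iterations=20)
```

In the real system the proposal step is an LLM rewriting agent.py and the score comes from the benchmark's test suites, but the control flow is this same ratchet.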
The <strong>meta-agent<\/strong> (a separate, higher-level AI) then reads that directive, inspects <code>agent.py<\/code>, runs the benchmark, diagnoses what failed, rewrites the relevant parts of <code>agent.py<\/code>, and repeats. The human never touches <code>agent.py<\/code> directly.<\/p>\n<p>A critical piece of infrastructure that keeps the loop coherent across iterations is <code>results.tsv<\/code> \u2014 an experiment log automatically created and maintained by the meta-agent. It tracks every experiment run, giving the meta-agent a history to learn from and calibrate what to try next. The full project structure also includes <code>Dockerfile.base<\/code>, an optional <code>.agent\/<\/code> directory for reusable agent workspace artifacts like prompts and skills, a <code>tasks\/<\/code> folder for benchmark payloads (added per benchmark branch), and a <code>jobs\/<\/code> directory for Harbor job outputs.<\/p>\n<p>The metric is total score produced by the benchmark\u2019s task test suites. The meta-agent hill-climbs on this score. Every experiment produces a numeric score: keep if better, discard if not \u2014 the same loop as autoresearch. <\/p>\n<h3 class=\"wp-block-heading\"><strong>The Task Format and Harbor Integration<\/strong><\/h3>\n<p>Benchmarks are expressed as tasks in Harbor format. Each task lives under <code>tasks\/my-task\/<\/code> and includes a <code>task.toml<\/code> for config like timeouts and metadata, an <code>instruction.md<\/code> which is the prompt sent to the agent, a <code>tests\/<\/code> directory with a <code>test.sh<\/code> entry point that writes a score to <code>\/logs\/reward.txt<\/code>, and a <code>test.py<\/code> for verification using either deterministic checks or LLM-as-judge. An <code>environment\/Dockerfile<\/code> defines the task container, and a <code>files\/<\/code> directory holds reference files mounted into the container. Tests write a score between 0.0 and 1.0 to the verifier logs. 
The meta-agent hill-climbs on this. <\/p>\n<p>The <strong>LLM-as-judge<\/strong> pattern here is worth flagging: instead of only checking answers deterministically (like unit tests), the test suite can use another LLM to evaluate whether the agent\u2019s output is \u2018correct enough.\u2019 This is common in agentic benchmarks where correct answers aren\u2019t reducible to string matching.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Autonomous harness engineering works<\/strong> \u2014 AutoAgent shows that a meta-agent can replace the human prompt-tuning loop entirely, iterating on <code>agent.py<\/code> overnight without any human touching the harness files directly.<\/li>\n<li><strong>Benchmark results validate the approach<\/strong> \u2014 In a 24-hour run, AutoAgent hit #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%), beating every other entry that was hand-engineered by humans.<\/li>\n<li><strong>\u2018Model empathy\u2019 may be a real phenomenon<\/strong> \u2014 A Claude meta-agent optimizing a Claude task agent appeared to diagnose failures more accurately than when optimizing a GPT-based agent, suggesting same-family model pairing could matter when designing your AutoAgent loop.<\/li>\n<li><strong>The human\u2019s job shifts from engineer to director<\/strong> \u2014 You don\u2019t write or edit <code>agent.py<\/code>. You write <code>program.md<\/code> \u2014 a plain Markdown directive that steers the meta-agent. The distinction mirrors the broader shift in agentic engineering from writing code to setting goals.<\/li>\n<li><strong>It\u2019s plug-and-play with any benchmark<\/strong> \u2014 Because tasks follow Harbor\u2019s open format and agents run in Docker containers, AutoAgent is domain-agnostic. 
Any scorable task \u2014 spreadsheets, terminal commands, or your own custom domain \u2014 can become a target for autonomous self-optimization.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/github.com\/kevinrgu\/autoagent\/tree\/main\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a><\/strong> and the <strong><a href=\"https:\/\/x.com\/kevingu\/status\/2039843234760073341\" target=\"_blank\" rel=\"noreferrer noopener\">Tweet<\/a><\/strong>.<\/p>\n<p>Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? 
<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\">Connect with us<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/05\/meet-autoagent-the-open-source-library-that-lets-an-ai-engineer-and-optimize-its-own-agent-harness-overnight\/\">Meet \u2018AutoAgent\u2019: The Open-Source Library That Lets an AI Engineer and Optimize Its Own Agent Harness Overnight<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>There\u2019s a particular kind of t&hellip;<\/p>\n","protected":false},"author":1,"featured_media":670,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-669","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/669","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=669"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/669\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/670"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=669"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.p
hp?rest_route=%2Fwp%2Fv2%2Fcategories&post=669"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=669"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}