{"id":250,"date":"2026-01-11T23:12:24","date_gmt":"2026-01-11T15:12:24","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=250"},"modified":"2026-01-11T23:12:24","modified_gmt":"2026-01-11T15:12:24","slug":"meet-seta-open-source-training-reinforcement-learning-environments-for-terminal-agents-with-400-tasks-and-camel-toolkit","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=250","title":{"rendered":"Meet SETA: Open Source Training Reinforcement Learning Environments for Terminal Agents with 400 Tasks and CAMEL Toolkit"},"content":{"rendered":"<p>What does an end to end stack for terminal agents look like when you combine structured toolkits, synthetic RL environments, and benchmark aligned evaluation? A team of researchers from CAMEL AI, Eigent AI and other collaborators have released <strong><a href=\"https:\/\/github.com\/camel-ai\/seta-env\" target=\"_blank\" rel=\"noreferrer noopener\">SETA<\/a><\/strong>, a toolkit and environment stack that focuses on reinforcement learning for terminal agents. The project targets agents that operate inside a Unix style shell and must complete verifiable tasks under a benchmark harness such as Terminal Bench. <\/p>\n<h3 class=\"wp-block-heading\"><strong>Three main contributions:<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li>A state of the art terminal agent on Terminal Bench: They achieve state of the art performance with a Claude Sonnet 4.5 based agent on Terminal Bench 2.0 and with a GPT 4.1 based agent on Terminal Bench 1.0. The comparison is restricted to agents that use the same base model.<\/li>\n<li>Scalable RL training with synthetic terminal environments: The research team release an initial synthetic dataset with 400 terminal tasks that cover a range of difficulty levels. Out of these, 260 tasks are used for RLVR finetuning of a Qwen3-8B model. 
<\/li>\n<li>A clean agent design that generalizes across training and evaluation frameworks: The same agent implementation is used for both local task runs and the official Terminal Bench evaluation harness.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Terminal Toolkit and log structure<\/strong><\/h3>\n<p>The SETA code repository showcases a Terminal Toolkit that turns a language model into an executable terminal agent. For each task run, the framework creates a structured log directory under <code>evaluation\/terminal_bench_run<\/code>. The README shows a concrete layout for a task called <code>play-zork<\/code>. <\/p>\n<p><strong>Key files include:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><code>chatagent.log<\/code>, which records the full history of agent messages and tool calls, including test results.<\/li>\n<li>A <code>sessions<\/code> directory with <code>session_logs<\/code> that capture terminal interactions from the toolkit.<\/li>\n<li>Within <code>session_logs<\/code>, files such as <code>blocking_commands.log<\/code>, <code>session_run_zork_1_correct_path.log<\/code>, <code>session_zork-1.log<\/code>, and <code>session_zork_start.log<\/code> store command output for different sessions and modes.<\/li>\n<li><code>tests.log<\/code> and <code>tests.log.strip<\/code>, which record the test run output, with the latter removing terminal control characters.<\/li>\n<\/ul>\n<p>This structure gives a concrete way to debug an agent. You can trace from high-level chat decisions in <code>chatagent.log<\/code> down to individual shell commands in the session logs and confirm success or failure from the test logs.<\/p>\n<p>For official Terminal Bench evaluation, the GitHub repository provides a separate entry point under <code>evaluation\/terminal_bench_eval<\/code>. A developer moves into that directory and runs <code>run_eval.sh<\/code> for Terminal Bench 1.0 and <code>run_tb2.sh<\/code> for Terminal Bench 2.0. 
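<\/p>\n<p>The <code>tests.log.strip<\/code> file listed above removes terminal control characters from the raw test output. As a minimal illustrative sketch (not the project\u2019s actual implementation; the helper name and regex are assumptions), stripping ANSI escape sequences from a captured log line can look like this:<\/p>

```python
import re

# Hypothetical sketch of how a tests.log.strip style file could be derived
# from raw terminal output; the regex and helper name are illustrative,
# not taken from the SETA codebase.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_controls(raw: str) -> str:
    """Remove ANSI escape sequences and carriage returns from terminal output."""
    return ANSI_ESCAPE.sub("", raw).replace("\r", "")

raw_line = "\x1b[32mPASS\x1b[0m session_zork_start"
print(strip_controls(raw_line))  # prints: PASS session_zork_start
```

<p>The same idea applies to any of the session logs when you want output that is easy to diff or grep. 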
<\/p>\n<p>Results are written into <code>evaluation\/terminal_bench_eval\/run\/{run_id}\/results.json<\/code>. Task-specific session logs are placed under <code>evaluation\/terminal_bench_eval\/logs\/camel_logs\/{task_id}<\/code>. The agent class that binds the CAMEL agent to the benchmark is implemented in <code>tbench_camel_agent.py<\/code>. <\/p>\n<h3 class=\"wp-block-heading\"><strong>Note Taking Toolkit as persistent memory<\/strong><\/h3>\n<p>The research team also introduces a Note Taking Toolkit, described as persistent memory for long-horizon tasks. They show example note-taking tool calls where the agent writes and reads notes in a structured way while solving terminal tasks. The current public material focuses on the existence of this toolkit and examples of its use. It does not yet describe a full training objective for note usage. <\/p>\n<p>The important point is that the agent has an explicit channel where it can externalize intermediate results and hints, separate from the raw terminal buffer.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Understanding the performance<\/strong><\/h3>\n<p>SETA\u2019s agent harness achieves leading results on Terminal Bench. With Claude Sonnet 4.5 as the backbone, the CAMEL terminal agent reaches 46.5% accuracy on Terminal Bench 2.0 across 89 real-world tasks, ranking first and outperforming the second system by 3 percentage points, with especially strong results in git workflows, DevOps automation, and code security tasks. On Terminal Bench 1.0, a GPT 4.1-based agent attains 35% accuracy, which is 4.7 percentage points above the next entry, again within the same model family. 
In comparison, a supervised Qwen3-8B baseline attains 3.4% on Terminal Bench 2.0, and the Qwen3-8B terminal agent trained with the SETA RL pipeline improves over this baseline on the curated synthetic environments.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li>SETA is a joint community project that provides both agent toolkits and synthetic RL environments specifically for terminal agents, aligned with the Terminal Bench evaluation format. <\/li>\n<li>The framework reports state-of-the-art performance for CAMEL terminal agents on Terminal Bench 1.0 and 2.0 when using Claude Sonnet 4.5 and GPT 4.1 as the base models, evaluated against agents built on the same model families. <\/li>\n<li>The SETA RL dataset on Hugging Face contains 400 synthetic terminal tasks, each packaged as <code>task.yaml<\/code>, <code>Dockerfile<\/code>, and <code>run-tests.sh<\/code>, with 260 tasks used for RLVR finetuning of a Qwen3-8B-based agent. <\/li>\n<li>The open-source SETA codebase exposes a Terminal Toolkit with structured logging and a Note Taking Toolkit for long-horizon memory, and integrates directly with Terminal Bench evaluation scripts and logging paths in the <code>seta-env<\/code> GitHub repository. 
<\/li>\n<li>The overall design demonstrates a clean path from synthetic RL environments to benchmark verified agents, giving developers a reproducible stack to train, debug, and evaluate terminal agents rather than relying on ad hoc tool calling examples.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/eigent-ai.notion.site\/SETA-Scaling-Environments-for-Terminal-Agents-2d2511c70ba280a9b7c0fe3e7f1b6ab8\" target=\"_blank\" rel=\"noreferrer noopener\">Blog<\/a>, <a href=\"https:\/\/x.com\/CamelAIOrg\/status\/2009675880503599571\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>, <a href=\"https:\/\/github.com\/camel-ai\/seta-env\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a><\/strong> and <strong><a href=\"https:\/\/huggingface.co\/datasets\/camel-ai\/seta-env\" target=\"_blank\" rel=\"noreferrer noopener\">Dataset<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/01\/11\/meet-seta-open-source-training-reinforcement-learning-environments-for-terminal-agents-with-400-tasks-and-camel-toolkit\/\">Meet SETA: Open Source Training Reinforcement Learning Environments for Terminal Agents with 400 Tasks and CAMEL Toolkit<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>What does an end to end stack 
&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-250","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/250","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=250"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/250\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=250"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}