{"id":622,"date":"2026-03-28T13:38:49","date_gmt":"2026-03-28T05:38:49","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=622"},"modified":"2026-03-28T13:38:49","modified_gmt":"2026-03-28T05:38:49","slug":"nvidia-ai-unveils-prorl-agent-a-decoupled-rollout-as-a-service-infrastructure-for-reinforcement-learning-of-multi-turn-llm-agents-at-scale","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=622","title":{"rendered":"NVIDIA AI Unveils ProRL Agent: A Decoupled Rollout-as-a-Service Infrastructure for Reinforcement Learning of Multi-Turn LLM Agents at Scale"},"content":{"rendered":"<p>NVIDIA researchers introduced <strong>ProRL AGENT<\/strong>, a scalable infrastructure designed for reinforcement learning (RL) training of multi-turn LLM agents. By adopting a \u2018Rollout-as-a-Service\u2019 philosophy, the system decouples agentic rollout orchestration from the training loop. This architectural shift addresses the inherent resource conflicts between I\/O-intensive environment interactions and GPU-intensive policy updates that currently bottleneck agent development.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Core Problem: Tight Coupling<\/strong><\/h3>\n<p>Multi-turn agent tasks involve interacting with external environments, such as code repositories or operating systems, via iterative tool use. Many existing frameworks\u2014including <strong>SkyRL<\/strong>, <strong>VeRL-Tool<\/strong>, <strong>Agent Lightning<\/strong>, <strong>rLLM<\/strong>, and <strong>GEM<\/strong>\u2014embed rollout control directly within the training process.<\/p>\n<p><strong>This tight coupling leads to two primary limitations:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Conflicting System Requirements<\/strong>: Rollouts are I\/O-bound, requiring sandbox creation, long-lived tool sessions, and asynchronous coordination. Training is GPU-intensive, centered on forward\/backward passes and gradient synchronization. 
Running both in one process causes interference and reduces hardware efficiency.<\/li>\n<li><strong>Maintenance Barriers<\/strong>: Embedding rollout logic in the trainer makes it difficult to migrate to different training backends or support new runtime environments without re-implementing the execution pipeline.<\/li>\n<\/ul>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1746\" height=\"924\" data-attachment-id=\"78660\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/27\/nvidia-ai-unveils-prorl-agent-a-decoupled-rollout-as-a-service-infrastructure-for-reinforcement-learning-of-multi-turn-llm-agents-at-scale\/screenshot-2026-03-27-at-10-37-36-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.37.36-PM-1.png\" data-orig-size=\"1746,924\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-27 at 10.37.36\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.37.36-PM-1-300x159.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.37.36-PM-1-1024x542.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.37.36-PM-1.png\" alt=\"\" class=\"wp-image-78660\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.18815<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>System Design: Rollout-as-a-Service<\/strong><\/h3>\n<p><strong>ProRL AGENT<\/strong> operates as a standalone HTTP service that manages 
the full rollout lifecycle. The RL trainer interacts with the server solely through an API, remaining agnostic to the underlying rollout infrastructure.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Three-Stage Asynchronous Pipeline<\/strong><\/h4>\n<p><strong>To maximize throughput, the server orchestrates rollouts through an asynchronous three-stage \u2018assembly line\u2019:<\/strong><\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>INIT<\/strong>: Initialization workers spin up sandbox containers and configure tools.<\/li>\n<li><strong>RUN<\/strong>: Rollout workers drive the multi-turn agent loop and collect trajectories.<\/li>\n<li><strong>EVAL<\/strong>: Evaluation workers score results against ground truth to produce reward signals.<\/li>\n<\/ol>\n<p>By assigning each stage to an independent worker pool, <strong>ProRL AGENT<\/strong> allows phases to overlap across different jobs, preventing slow evaluations (such as full test suite executions) from stalling the rollout process.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1690\" height=\"1146\" data-attachment-id=\"78662\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/27\/nvidia-ai-unveils-prorl-agent-a-decoupled-rollout-as-a-service-infrastructure-for-reinforcement-learning-of-multi-turn-llm-agents-at-scale\/screenshot-2026-03-27-at-10-38-17-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.38.17-PM-1.png\" data-orig-size=\"1690,1146\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-27 at 10.38.17\u202fPM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.38.17-PM-1-300x203.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.38.17-PM-1-1024x694.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-27-at-10.38.17-PM-1.png\" alt=\"\" class=\"wp-image-78662\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.18815<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>HPC-Compatible Sandboxing and Optimized Tools<\/strong><\/h3>\n<p><strong>ProRL AGENT<\/strong> utilizes <strong>Singularity<\/strong> for its sandbox infrastructure. Unlike Docker-based platforms, Singularity allows rootless execution, which is required for deployment on shared HPC clusters managed by Slurm.<\/p>\n<p><strong>The system includes several optimizations to reduce tool execution latency, which often dominates total rollout time:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Efficient Bash<\/strong>: Replaces tmux-based terminal multiplexing with a <strong>ptyprocess<\/strong>-based direct pseudo-terminal, reducing shell command latency from 0.78s to 0.42s.<\/li>\n<li><strong>Direct IPython API<\/strong>: Connects to persistent kernels via an in-process API instead of network gateways, removing networking overhead.<\/li>\n<li><strong>Unix Domain Sockets (UDS)<\/strong>: Replaces TCP loopback for communication between the agent and the execution server inside the container to shave off additional latency.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Advanced Features for Scalable RL<\/strong><\/h3>\n<p><strong>The infrastructure introduces mechanisms to improve training stability and hardware utilization:<\/strong><\/p>\n<h4 class=\"wp-block-heading\"><strong>Load Balancing and Prefix Cache Reuse<\/strong><\/h4>\n<p>The server manages a pool of LLM inference backends 
(e.g., vLLM) using a min-heap keyed by assignment counts. When a task is assigned, all subsequent calls within that task are routed to the same backend. This strategy maximizes <strong>prefix cache reuse<\/strong>, reducing inference time across multiple agent turns.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Token-in\/Token-out Communication<\/strong><\/h4>\n<p>To eliminate <strong>re-tokenization drift<\/strong>\u2014where the token sequence generated during rollout differs from what is used during training\u2014<strong>ProRL AGENT<\/strong> uses token IDs as the canonical representation throughout the entire process. Log-probabilities and token IDs are propagated unchanged from the inference backend to the trainer.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Optimized DAPO Implementation<\/strong><\/h4>\n<p>The system supports <strong>DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)<\/strong>, which filters out \u2018non-informative\u2019 prompts that yield uniform rewards. <strong>ProRL AGENT<\/strong> uses an asynchronous replenishment mechanism to maintain maximum throughput, terminating redundant active jobs early once the target number of informative prompts is reached.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Experimental Results on SWE-Bench Verified<\/strong><\/h3>\n<p>The system was validated using Qwen3 models across multiple scales. 
<strong>ProRL AGENT<\/strong> consistently improved performance compared to reproduced baselines.<\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<td><strong>Model Scale<\/strong><\/td>\n<td><strong>Reproduced Baseline (%)<\/strong><\/td>\n<td><strong>ProRL Agent (RL) (%)<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Qwen3-4B<\/strong><\/td>\n<td>14.8<\/td>\n<td><strong>21.2<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen3-8B<\/strong><\/td>\n<td>9.6<\/td>\n<td><strong>18.0<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen3-14B<\/strong><\/td>\n<td>15.4<\/td>\n<td><strong>23.6<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><strong>Note: The previously reported result for SkyRL-Agent-14B-v0 was 21.6.<\/strong><\/p>\n<p>In addition to software engineering, the system demonstrated generality in <strong>STEM<\/strong>, <strong>Math<\/strong>, and <strong>Code<\/strong> domains, showing steady reward growth during RL training. 
Scalability tests confirmed that rollout throughput increases near-linearly as compute nodes are added.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Architectural Decoupling<\/strong>: ProRL Agent treats the full agentic rollout lifecycle\u2014including environment initialization, tool execution, and reward scoring\u2014as an independent HTTP service, separating I\/O-intensive tasks from GPU-intensive policy training.<\/li>\n<li><strong>Significant Performance Gains<\/strong>: This infrastructure enabled the Qwen3-8B model to nearly double its performance on the SWE-Bench Verified benchmark (from 9.6% to 18.0%), while the Qwen3-14B model improved from 15.4% to 23.6%.<\/li>\n<li><strong>System Latency Reductions<\/strong>: Targeted optimizations, such as replacing tmux with ptyprocess for shell execution, reduced action latency from 0.78s to 0.42s, contributing to near-linear throughput scaling across compute nodes.<\/li>\n<li><strong>Elimination of Tokenization Drift<\/strong>: The framework utilizes a token-in\/token-out communication pipeline, ensuring that the exact token IDs generated during rollout are passed to the trainer without the risk of lossy re-tokenization.<\/li>\n<li><strong>HPC-Native Deployment<\/strong>: By using Singularity instead of Docker, ProRL Agent supports rootless execution and native Slurm integration, allowing large-scale agent training on shared high-performance computing clusters.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2603.18815\" target=\"_blank\" rel=\"noreferrer noopener\">Paper <\/a><\/strong>and<strong>\u00a0<a href=\"https:\/\/github.com\/NVIDIA-NeMo\/ProRL-Agent-Server\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a 
href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/27\/nvidia-ai-unveils-prorl-agent-a-decoupled-rollout-as-a-service-infrastructure-for-reinforcement-learning-of-multi-turn-llm-agents-at-scale\/\">NVIDIA AI Unveils ProRL Agent: A Decoupled Rollout-as-a-Service Infrastructure for Reinforcement Learning of Multi-Turn LLM Agents at Scale<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>NVIDIA researchers introduced 
&hellip;<\/p>\n","protected":false},"author":1,"featured_media":623,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-622","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/622","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=622"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/622\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/623"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=622"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=622"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=622"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}