{"id":689,"date":"2026-04-09T05:54:44","date_gmt":"2026-04-08T21:54:44","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=689"},"modified":"2026-04-09T05:54:44","modified_gmt":"2026-04-08T21:54:44","slug":"meet-osgym-a-new-os-infrastructure-framework-that-manages-1000-replicas-at-0-23-day-for-computer-use-agent-research","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=689","title":{"rendered":"Meet OSGym: A New OS Infrastructure Framework That Manages 1,000+ Replicas at $0.23\/Day for Computer Use Agent Research"},"content":{"rendered":"<p>Training AI agents that can actually use a computer \u2014 opening apps, clicking buttons, browsing the web, writing code \u2014 is one of the hardest infrastructure problems in modern AI. It\u2019s not a data problem. It\u2019s not a model problem. It\u2019s a plumbing problem.<\/p>\n<p>You need to spin up hundreds, potentially thousands, of full operating system environments with actual graphical user interfaces. Each one needs to run real software. Each one needs to handle unpredictable crashes. 
And you need all of them to run simultaneously at a cost that doesn\u2019t bankrupt a university research lab.<\/p>\n<p>That\u2019s the problem \u2018<strong>OSGym<\/strong>\u2019, new research from a team at MIT, UIUC, CMU, USC, UVA, and UC Berkeley, is designed to solve.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1922\" height=\"1142\" data-attachment-id=\"78852\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/08\/meet-osgym-a-new-os-infrastructure-framework-that-manages-1000-replicas-at-0-23-day-for-computer-use-agent-research\/screenshot-2026-04-08-at-2-53-44-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-08-at-2.53.44-PM-1.png\" data-orig-size=\"1922,1142\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-08 at 2.53.44\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-08-at-2.53.44-PM-1-300x178.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-08-at-2.53.44-PM-1-1024x608.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-08-at-2.53.44-PM-1.png\" alt=\"\" class=\"wp-image-78852\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2511.11672<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What is a Computer Use Agent?<\/strong><\/h3>\n<p>Before unpacking the infrastructure, it helps to understand what a computer use agent actually is. 
Unlike a chatbot that responds to text prompts, a computer use agent observes a screenshot of a desktop, decides what to do \u2014 click a button, type text, open a file \u2014 and executes that action through keyboard and mouse inputs. Think of it as an AI that can operate any software the way a human would.<\/p>\n<p>Models like Anthropic\u2019s Claude Computer Use and OpenAI\u2019s Operator are early commercial examples. Research models like UI-TARS, Agent-S2, and CogAgent are pushing the boundaries further. But training any of these systems requires massive amounts of interaction data generated inside real OS environments \u2014 and that\u2019s where things get expensive and complicated fast.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Core Problem: OS Sandboxes at Scale<\/strong><\/h3>\n<p>A coding environment or a web browser sandbox is relatively lightweight to run. A full OS sandbox with a GUI is not. Each virtual machine needs its own bootable disk (around 24 GB), its own CPU and RAM allocation, and its own display stack. Multiply that by hundreds or thousands of parallel instances and you have a resource consumption problem that typical academic compute budgets simply cannot absorb.<\/p>\n<p>On top of resource costs, there\u2019s the reliability problem. Software crashes. Browser sessions time out. Applications freeze. If your training pipeline doesn\u2019t handle these failures gracefully, one bad VM can stall an entire training batch.<\/p>\n<p>OSGym tackles both problems with four distinct architectural optimizations.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Decentralized OS State Management<\/strong><\/h3>\n<p>The first design choice concerns how the system manages the state of each OS replica \u2014 tracking whether it\u2019s healthy, what task it\u2019s running, and how to recover it if something goes wrong.<\/p>\n<p>A naive approach uses a single centralized manager for all replicas. 
This is a classic single point of failure: as replica count grows into the thousands, the central manager becomes overwhelmed, latency increases, and one crash can halt the whole system. OSGym instead gives every OS replica its own dedicated state manager. Each state manager exposes public methods modeled after the OpenAI Gym API \u2014 <code>reset<\/code>, <code>step<\/code>, and <code>shutdown<\/code> \u2014 but handles its own health monitoring and crash recovery internally. A failure in one replica cannot propagate to any other.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Hardware-Aware OS Replica Orchestration<\/strong><\/h3>\n<p>Here\u2019s a non-obvious insight this research surfaces: when you run many OS replicas on a single server, the bottleneck depends on how many replicas you pack per machine. For a small number of replicas per server (low K, where K is the number of replicas per machine), the system is CPU-bound \u2014 most replicas are fighting over processor time. But as you pack more replicas per server (large K), the bottleneck shifts to RAM \u2014 and RAM is dramatically cheaper than CPU.<\/p>\n<p>A 32 GB DDR4 RAM module typically costs 10\u201320% of what a 16-core CPU costs. OSGym runs replicas as Docker containers (using Docker images from OSWorld as a foundation) rather than full virtual machines to reduce per-replica overhead. By choosing servers with higher RAM capacity and running more replicas per machine, the daily cost drops from around $300 for 128 replicas at K=1, to roughly $30 at K=64 \u2014 approximately $0.234 per replica per day, a number that fits comfortably within many academic grant budgets.<\/p>\n<h3 class=\"wp-block-heading\"><strong>KVM Virtualization with Copy-on-Write Disk Management<\/strong><\/h3>\n<p>The disk provisioning problem is solved with a filesystem technique called reflink copy-on-write (CoW). 
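<\/p>\n<p>In shell terms, the technique boils down to a reflink copy (the <code>cp<\/code> flag is the one named in the paper; file names here are illustrative, not OSGym\u2019s actual paths):<\/p>

```shell
# Clone the base image instantly on an XFS (or Btrfs) filesystem.
# The clone shares physical disk blocks with the base image and only
# allocates new blocks when this replica writes to its own disk.
cp --reflink=always base.img replica-001.img
```

<p>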
Normally, spinning up 128 VM instances would mean duplicating a 24 GB base image 128 times \u2014 over 3 TB of storage and 30 seconds of provisioning time per VM.<\/p>\n<p>OSGym instead uses <code>cp --reflink=always<\/code> on XFS-formatted NVMe drives. Each per-VM disk image shares physical disk blocks with the base image and only allocates new blocks when the VM actually writes to them. The result: 128 VMs consume 366 GB of physical disk instead of 3.1 TB \u2014 an 88% reduction \u2014 and disk provisioning time drops from 30 seconds to 0.8 seconds per VM, a 37\u00d7 speedup. Each VM still sees its full 24 GB logical disk with near-native CPU performance.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Robust Container Pool with Multi-Layer Fault Recovery<\/strong><\/h3>\n<p>OSGym maintains a pre-warmed runner pool \u2014 by default, 128 runners per executor node \u2014 initialized before training begins. Rather than creating and destroying VMs on demand, runners are recycled between tasks. Before each VM creation, OSGym reads <code>\/proc\/meminfo<\/code> and <code>\/proc\/loadavg<\/code> to verify the host can safely accommodate another instance, blocking creation if available memory falls below 10% or under 8 GB absolute. Each container is memory-limited to 6 GB to prevent over-provisioning under burst scenarios.<\/p>\n<p>The system also tunes Linux kernel parameters that would otherwise cause silent failures at high concurrency \u2014 for example, <code>fs.aio-max-nr<\/code> is raised from 65,536 to 1,048,576, and <code>fs.inotify.max_user_instances<\/code> from 128 to 8,192. 
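<\/p>\n<p>As a configuration sketch (the limit values are the ones reported in the paper; applying them with <code>sysctl<\/code> at host setup is an assumed deployment detail):<\/p>

```shell
# Raise kernel limits that otherwise fail silently once hundreds of
# containers share one host (requires root).
sysctl -w fs.aio-max-nr=1048576              # async I/O contexts: 65,536 -> 1,048,576
sysctl -w fs.inotify.max_user_instances=8192 # inotify instances per user: 128 -> 8,192
```

<p>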
Fault recovery operates at two levels: at the step level, each action gets up to 10 retries by default; at the task level, if a runner fails permanently, the task is automatically reassigned to a fresh runner.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Unified Task Flow and Centralized Data Server<\/strong><\/h3>\n<p>Two design elements are particularly important for developers integrating OSGym. First, every task follows a four-phase unified execution flow \u2014 Configure, Reset, Operate, Evaluate \u2014 regardless of which software or domain is involved. This standardization makes it straightforward to add new task types without changing the surrounding infrastructure.<\/p>\n<p>Second, above the replica layer, a centralized data server Python class exposes a single-entry batched interface (<code>__next__<\/code> and <code>async_step<\/code>) that hides all the complexity of state manager communication and queuing. The batched step method is asynchronous, meaning the training loop is never blocked while waiting for OS replicas to complete their actions.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What the Numbers Look Like in Practice<\/strong><\/h3>\n<p>Using 1,024 parallel OS replicas, the system collected trajectories across ten task categories \u2014 including LibreOffice Writer, Calc, and Impress, Chrome, Thunderbird, VLC, VS Code, GIMP, OS system configuration, and multi-app workflows \u2014 at approximately 1,420 trajectories per minute; the same collection would have taken roughly 115,654 seconds without parallelization. The entire dataset cost $43 in cloud compute.<\/p>\n<p>The research team then used that data to fine-tune Qwen2.5-VL 32B via supervised fine-tuning, followed by reinforcement learning using a PPO-based semi-online asynchronous pipeline (200 steps, batch size 64, learning rate 1e-6). 
The resulting model achieved a 56.3% success rate on the OSWorld-Verified benchmark \u2014 competitive with existing methods for a 32B parameter base model with no task-specific tuning.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Training computer use agents is an infrastructure problem first<\/strong>: Full OS sandboxes with GUIs are far heavier than coding or browser environments \u2014 each VM needs ~24 GB of disk, dedicated CPU and RAM, and a display stack. Without careful optimization, scaling to hundreds of replicas is simply unaffordable for most academic labs.<\/li>\n<li><strong>RAM is a smarter scaling lever than CPU<\/strong>: OSGym\u2019s hardware-aware orchestration reveals that packing more replicas per server shifts the bottleneck from CPU to RAM \u2014 and RAM is 5\u201310\u00d7 cheaper. This single insight cuts per-replica cost from ~$2.10\/day to as low as $0.23\/day.<\/li>\n<li><strong>Copy-on-write disk management eliminates the storage wall.<\/strong> By using XFS reflink CoW (<code>cp --reflink=always<\/code>), OSGym reduces physical disk consumption by 88% and speeds up VM disk provisioning by 37\u00d7 \u2014 turning a 3.1 TB, 30-second-per-VM problem into a 366 GB, 0.8-second one.<\/li>\n<li><strong>Decentralized state management is the key to robustness at scale.<\/strong> Giving each OS replica its own dedicated state manager means failures stay isolated. 
Even starting from a fully crashed state, OSGym self-recovers all replicas within a short window \u2014 critical for uninterrupted long-running training jobs.<\/li>\n<li><strong>Academic-scale computer use agent research is now financially viable.<\/strong> With 1,024 replicas generating 1,420 trajectories per minute and a full dataset costing just $43 in cloud compute, OSGym brings the infrastructure cost of training general-purpose computer agents within reach of university research budgets.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2511.11672\" target=\"_blank\" rel=\"noreferrer noopener\">Paper here<\/a>. \u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">You can now join us on Telegram as well.<\/a><\/strong><\/p>\n<p>Need to partner with us to promote your GitHub repo, Hugging Face page, product release, webinar, etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/08\/meet-osgym-a-new-os-infrastructure-framework-that-manages-1000-replicas-at-0-23-day-for-computer-use-agent-research\/\">Meet OSGym: A New OS Infrastructure Framework That Manages 1,000+ Replicas at $0.23\/Day for Computer Use Agent Research<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Training AI agents that can ac&hellip;<\/p>\n","protected":false},"author":1,"featured_media":690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-689","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/689","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=689"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/689\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/690"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=689"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=689"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}