{"id":446,"date":"2026-02-21T04:30:46","date_gmt":"2026-02-20T20:30:46","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=446"},"modified":"2026-02-21T04:30:46","modified_gmt":"2026-02-20T20:30:46","slug":"nvidia-releases-dreamdojo-an-open-source-robot-world-model-trained-on-44711-hours-of-real-world-human-video-data","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=446","title":{"rendered":"NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data"},"content":{"rendered":"<p>Building simulators for robots has been a long-term challenge. Traditional engines require manual coding of physics and perfect 3D models. NVIDIA is changing this with <strong>DreamDojo<\/strong>, a fully open-source, generalizable robot world model. Instead of using a physics engine, DreamDojo \u2018dreams\u2019 the results of robot actions directly in pixels. <\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1364\" height=\"766\" data-attachment-id=\"78001\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/20\/nvidia-releases-dreamdojo-an-open-source-robot-world-model-trained-on-44711-hours-of-real-world-human-video-data\/screenshot-2026-02-20-at-12-22-37-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.22.37-PM-1.png\" data-orig-size=\"1364,766\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-20 at 12.22.37\u202fPM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.22.37-PM-1-300x168.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.22.37-PM-1-1024x575.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.22.37-PM-1.png\" alt=\"\" class=\"wp-image-78001\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2602.06949<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Scaling Robotics with 44k+ Hours of Human Experience<\/strong><\/h3>\n<p>The biggest hurdle for AI in robotics is data. Collecting robot-specific data is expensive and slow. DreamDojo addresses this by learning from <strong>44,711 hours<\/strong> of egocentric human videos. This dataset, called <strong>DreamDojo-HV<\/strong>, is the largest of its kind for world model pretraining.<\/p>\n<ul class=\"wp-block-list\">\n<li>It features 6,015 unique tasks across 1M+ trajectories.<\/li>\n<li>The data covers 9,869 unique scenes and 43,237 unique objects. <\/li>\n<li>Pretraining used <strong>100,000 NVIDIA H100 GPU hours<\/strong> to build 2B and 14B model variants. <\/li>\n<\/ul>\n<p>Humans have already mastered complex physics, such as pouring liquids or folding clothes. DreamDojo uses this human data to give robots a \u2018common sense\u2019 understanding of how the world works. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1408\" height=\"890\" data-attachment-id=\"78003\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/20\/nvidia-releases-dreamdojo-an-open-source-robot-world-model-trained-on-44711-hours-of-real-world-human-video-data\/screenshot-2026-02-20-at-12-23-16-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.16-PM-1.png\" data-orig-size=\"1408,890\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-20 at 12.23.16\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.16-PM-1-300x190.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.16-PM-1-1024x647.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.16-PM-1.png\" alt=\"\" class=\"wp-image-78003\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2602.06949<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Bridging the Gap with Latent Actions<\/strong><\/h3>\n<p>Human videos do not have robot motor commands. To make these videos \u2018robot-readable,\u2019 NVIDIA\u2019s research team introduced <strong>continuous latent actions<\/strong>. This system uses a spatiotemporal Transformer VAE to extract actions directly from pixels.<\/p>\n<ul class=\"wp-block-list\">\n<li>The VAE encoder takes 2 consecutive frames and outputs a 32-dimensional latent vector. 
<\/li>\n<li>This vector represents the most critical motion between frames. <\/li>\n<li>The design creates an information bottleneck that disentangles action from visual context. <\/li>\n<li>This allows the model to learn physics from humans and apply it to different robot bodies. <\/li>\n<\/ul>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1398\" height=\"754\" data-attachment-id=\"78005\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/20\/nvidia-releases-dreamdojo-an-open-source-robot-world-model-trained-on-44711-hours-of-real-world-human-video-data\/screenshot-2026-02-20-at-12-23-49-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.49-PM-1.png\" data-orig-size=\"1398,754\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-20 at 12.23.49\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.49-PM-1-300x162.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.49-PM-1-1024x552.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-20-at-12.23.49-PM-1.png\" alt=\"\" class=\"wp-image-78005\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2602.06949<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Better Physics through Architecture<\/strong><\/h3>\n<p>DreamDojo is based on the <strong>Cosmos-Predict2.5<\/strong> latent video diffusion model. 
It uses the <strong>WAN2.2 tokenizer<\/strong>, which has a temporal compression ratio of 4. <strong>The team improved the architecture with 3 key features:<\/strong><\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Relative Actions:<\/strong> The model uses joint deltas instead of absolute poses. This makes it easier for the model to generalize across different trajectories.<\/li>\n<li><strong>Chunked Action Injection:<\/strong> It injects 4 consecutive actions into each latent frame. This aligns the actions with the tokenizer\u2019s compression ratio and fixes causality confusion.<\/li>\n<li><strong>Temporal Consistency Loss:<\/strong> A new loss function matches predicted frame velocities to ground-truth transitions. This reduces visual artifacts and keeps objects physically consistent.<\/li>\n<\/ol>\n<h3 class=\"wp-block-heading\"><strong>Distillation for 10.81 FPS Real-Time Interaction<\/strong><\/h3>\n<p>A simulator is only useful if it is fast. Standard diffusion models require too many denoising steps for real-time use. The NVIDIA team used a <strong>Self Forcing<\/strong> distillation pipeline to solve this. <\/p>\n<ul class=\"wp-block-list\">\n<li>The distillation training was conducted on <strong>64 NVIDIA H100 GPUs<\/strong>. <\/li>\n<li>The \u2018student\u2019 model reduces denoising from 35 steps to 4 steps. <\/li>\n<li>The final model achieves a real-time speed of <strong>10.81 FPS<\/strong>.<\/li>\n<li>It is stable for continuous rollouts of 60 seconds (600 frames). <\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Unlocking Downstream Applications<\/strong><\/h3>\n<p>DreamDojo\u2019s speed and accuracy enable several advanced applications for AI engineers.<\/p>\n<h4 class=\"wp-block-heading\"><strong>1. Reliable Policy Evaluation<\/strong><\/h4>\n<p>Testing robots in the real world is risky. DreamDojo acts as a high-fidelity simulator for benchmarking. 
<\/p>\n<ul class=\"wp-block-list\">\n<li>Its simulated success rates show a Pearson correlation of \ud835\udc5f = 0.995 with real-world results.<\/li>\n<li>The Mean Maximum Rank Violation (MMRV) is only <strong>0.003<\/strong>.<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>2. Model-Based Planning<\/strong><\/h4>\n<p>Robots can use DreamDojo to \u2018look ahead.\u2019 A robot can simulate multiple action sequences and pick the best one.<\/p>\n<ul class=\"wp-block-list\">\n<li>In a fruit-packing task, this improved real-world success rates by <strong>17%<\/strong>. <\/li>\n<li>Compared to random sampling, it provided a 2x increase in success. <\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>3. Live Teleoperation<\/strong><\/h4>\n<p>Developers can teleoperate virtual robots in real time. The NVIDIA team demonstrated this using a <strong>PICO VR controller<\/strong> and a local desktop with an <strong>NVIDIA RTX 5090<\/strong>. This allows for safe and rapid data collection.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Summary of Model Performance<\/strong><\/h3>\n<figure class=\"wp-block-table is-style-stripes\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<td><strong>Metric<\/strong><\/td>\n<td><strong>DREAMDOJO-2B<\/strong><\/td>\n<td><strong>DREAMDOJO-14B<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Physics Correctness<\/strong><\/td>\n<td>62.50%<\/td>\n<td>73.50%<\/td>\n<\/tr>\n<tr>\n<td><strong>Action Following<\/strong><\/td>\n<td>63.45%<\/td>\n<td>72.55%<\/td>\n<\/tr>\n<tr>\n<td><strong>FPS (Distilled)<\/strong><\/td>\n<td>10.81<\/td>\n<td>N\/A<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>NVIDIA has released all weights, training code, and evaluation benchmarks. 
This open-source release allows you to post-train DreamDojo on your own robot data today.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Massive Scale and Diversity<\/strong>: DreamDojo is pretrained on <strong>DreamDojo-HV<\/strong>, the largest egocentric human video dataset to date, featuring <strong>44,711 hours<\/strong> of footage across <strong>6,015 unique tasks<\/strong> and <strong>9,869 scenes<\/strong>.<\/li>\n<li><strong>Unified Latent Action Proxy<\/strong>: To overcome the lack of action labels in human videos, the model uses <strong>continuous latent actions<\/strong> extracted via a spatiotemporal Transformer VAE, which serves as a hardware-agnostic control interface.<\/li>\n<li><strong>Optimized Training and Architecture<\/strong>: The model achieves high-fidelity physics and precise controllability by utilizing <strong>relative action transformations<\/strong>, <strong>chunked action injection<\/strong>, and a specialized <strong>temporal consistency loss<\/strong>.<\/li>\n<li><strong>Real-Time Performance via Distillation<\/strong>: Through a <strong>Self Forcing<\/strong> distillation pipeline, the model is accelerated to <strong>10.81 FPS<\/strong>, enabling interactive applications like live teleoperation and stable, long-horizon simulations for over <strong>1 minute<\/strong>.<\/li>\n<li><strong>Reliable for Downstream Tasks<\/strong>: DreamDojo functions as an accurate simulator for <strong>policy evaluation<\/strong>, showing a <strong>0.995 Pearson correlation<\/strong> with real-world success rates, and can improve real-world performance by <strong>17%<\/strong> when used for <strong>model-based planning<\/strong>.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2602.06949\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> <\/strong>and <strong><a 
href=\"https:\/\/github.com\/NVIDIA\/DreamDojo\" target=\"_blank\" rel=\"noreferrer noopener\">Code<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/20\/nvidia-releases-dreamdojo-an-open-source-robot-world-model-trained-on-44711-hours-of-real-world-human-video-data\/\">NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Building simulators for 
robots&hellip;<\/p>\n","protected":false},"author":1,"featured_media":447,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-446","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/446","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=446"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/446\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/447"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=446"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=446"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=446"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}