{"id":827,"date":"2026-05-01T08:40:01","date_gmt":"2026-05-01T00:40:01","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=827"},"modified":"2026-05-01T08:40:01","modified_gmt":"2026-05-01T00:40:01","slug":"microsoft-researchs-world-r1-uses-flow-grpo-and-3d-aware-rewards-to-inject-geometric-consistency-into-wan-2-1-without-architectural-changes","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=827","title":{"rendered":"Microsoft Research\u2019s World-R1 Uses Flow-GRPO and 3D-Aware Rewards to Inject Geometric Consistency Into Wan 2.1 Without Architectural Changes"},"content":{"rendered":"<p>Video foundation models can paint a beautiful frame. They are still notoriously bad at remembering it. Push the camera through a corridor in Wan 2.1 or CogVideoX and walls warp, objects morph, and details vanish \u2014 the giveaway that these models are fitting 2D pixel correlations rather than simulating a coherent 3D scene.<\/p>\n<p>A team of researchers from Microsoft Research and Zhejiang University introduced <strong>World-R1<\/strong>, a framework that aligns video generation with 3D constraints through reinforcement learning. The research team leans on a recent finding that video foundation models already encode rich 3D geometric information internally. The job, then, is to <em>elicit<\/em> that latent knowledge rather than supervise it with expensive 3D assets. World-R1 does this by post-training an existing text-to-video (T2V) model with reinforcement learning, using rewards derived from pre-trained 3D foundation models and a vision-language critic. 
The base architecture is left untouched and inference cost is unchanged.<\/p>\n<p>Two <strong>World-R1<\/strong> variants are released: <strong>World-R1-Small<\/strong> (built on Wan2.1-T2V-1.3B) and <strong>World-R1-Large<\/strong> (built on Wan2.1-T2V-14B).<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1274\" height=\"892\" data-attachment-id=\"79417\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/30\/microsoft-researchs-world-r1-uses-flow-grpo-and-3d-aware-rewards-to-inject-geometric-consistency-into-wan-2-1-without-architectural-changes\/screenshot-2026-04-30-at-5-32-14-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-30-at-5.32.14-PM-1.png\" data-orig-size=\"1274,892\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-30 at 5.32.14\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-30-at-5.32.14-PM-1-1024x717.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-30-at-5.32.14-PM-1.png\" alt=\"\" class=\"wp-image-79417\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.24764<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The setup: Flow-GRPO on a flow-matching video model<\/strong><\/h3>\n<p>World-R1 uses <strong>Flow-GRPO-Fast<\/strong>, a recent adaptation of GRPO to flow-matching diffusion models. 
Flow-GRPO converts the deterministic ODE sampler into a reverse-time SDE so the policy is stochastic enough for advantage estimation, then optimizes a clipped GRPO surrogate with KL regularization to a reference policy. The Fast variant only injects SDE noise at randomly selected intermediate steps to cut rollout cost.<\/p>\n<p>Training runs at 832\u00d7480 resolution on 48 NVIDIA H200 GPUs for the Small model and 96 H200s for the Large model, with a GRPO group size of G=8 across 48 parallel groups.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The 3D-aware reward: analysis-by-synthesis<\/strong><\/h3>\n<p>The interesting work happens in the reward. For each generated video x, the system reconstructs a 3D Gaussian Splatting (3DGS) representation \u03a6<sub>GS<\/sub> using <strong>Depth Anything 3<\/strong> and recovers an estimated camera trajectory \u00ca. <strong>The composite 3D reward is:<\/strong><\/p>\n<p><strong>R<sub>3D<\/sub> = S<sub>meta<\/sub> + S<sub>recon<\/sub> + S<sub>traj<\/sub><\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>S<sub>meta<\/sub><\/strong> renders \u03a6<sub>GS<\/sub> from a <em>meta-view<\/em> \u2014 a camera pose offset from the generation trajectory \u2014 and asks <strong>Qwen3-VL<\/strong> to score the reconstruction from 0\u20139 as a \u201c3D vision expert,\u201d penalizing floaters, billboard artifacts, and texture stretching that look fine head-on but collapse off-axis.<\/li>\n<li><strong>S<sub>recon<\/sub><\/strong> re-renders the scene along \u00ca and compares against x via 1 \u2212 LPIPS.<\/li>\n<li><strong>S<sub>traj<\/sub><\/strong> measures deviation between the requested trajectory E and the recovered \u00ca using L2 for translation and geodesic distance for rotation, wrapped in a negative exponential.<\/li>\n<\/ul>\n<p>A general aesthetic term <strong>R<sub>gen<\/sub><\/strong>, computed as the mean <strong>HPSv3<\/strong> score across the first K frames, 
is added with \u03bb<sub>gen<\/sub> = 1 to keep visual quality from collapsing under geometric pressure.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Implicit camera conditioning via noise warping<\/strong><\/h3>\n<p>Rather than training a CameraCtrl-style adapter, World-R1 follows the <strong>Go-with-the-Flow<\/strong> paradigm: the prompt is parsed for motion tokens (<code>push_in<\/code>, <code>orbit_left<\/code>, <code>pull_out<\/code>, etc.), a sequence of camera extrinsics is generated, projected into 2D optical flow under a fronto-parallel scene assumption, and used to perform discrete noise transport on the initial latent. The transported noise preserves unit variance via a density-tracker normalization, so the diffusion prior is undisturbed but the latent already encodes the requested trajectory. No new parameters, no architectural change.<\/p>\n<h3 class=\"wp-block-heading\"><strong>A pure text dataset, and periodic decoupling to keep motion alive<\/strong><\/h3>\n<p>Training data is a synthetic <strong>Pure Text Dataset<\/strong> of roughly 3,000 prompts generated by Gemini, organized along the WorldScore camera-trajectory taxonomy (intra-scene, inter-scene, composite, static) and spanning Natural Landscapes, Urban &amp; Architectural, Micro &amp; Still Life, Fantasy &amp; Surrealism, and Artistic Styles. Going text-only decouples 3D learning from the visual biases of any specific video corpus.<\/p>\n<p>Strict 3D rewards have a known failure mode: the model overfits to rigid scenes and stops generating dynamic content. World-R1 mitigates this with <strong>periodic decoupled training<\/strong>. Every 100 steps, R<sub>3D<\/sub> is suspended and the model is fine-tuned with R<sub>gen<\/sub> alone on a roughly 500-prompt <strong>dynamic data subset<\/strong> (waterfalls, crowds, fire, transforming objects). 
Removing this stage actually <em>raises<\/em> reconstruction PSNR but drops VBench AVG from 85.21 to 82.64 \u2014 exactly the reward-hacking degeneracy the research team flags.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Understanding the Results<\/strong><\/h3>\n<p>On a 3DGS-based reconstruction protocol, World-R1-Large hits <strong>27.67 PSNR \/ 0.865 SSIM \/ 0.162 LPIPS<\/strong>, against 19.76 \/ 0.629 \/ 0.405 for Wan2.1-T2V-14B \u2014 a 7.91 dB PSNR gain. World-R1-Small posts a 10.23 dB gain over its 1.3B backbone. On the reconstruction-independent <strong>Multi-View Consistency Score<\/strong> (MVCS) borrowed from GeoVideo, World-R1-Large reaches 0.993, ahead of all 3D-conditioned and camera-control baselines tested (Voyager, ViewCrafter, FlashWorld, ReCamMaster, etc.).<\/p>\n<p>Camera control is competitive with specialized methods: RotErr 1.21, TransErr 1.30, CamMC 2.95 for the Large model, edging out CamCloneMaster and ReCamMaster despite not being a dedicated camera-control architecture. VBench scores improve over the base Wan 2.1 in Aesthetic Quality, Imaging Quality, Motion Smoothness, and Subject Consistency, with only a small regression on Background Consistency.<\/p>\n<p>Two robustness results stand out for AI professionals. A <strong>dataset scaling<\/strong> sweep shows monotonic gains from 1K \u2192 2K \u2192 3K prompts on both 3D consistency and VBench AVG, suggesting the recipe is data-efficient and could scale further. And although training is on short clips, World-R1-Large generalizes to <strong>121-frame<\/strong> generations, lifting PSNR from 18.32 to 26.32 over the Wan2.1-T2V-14B backbone. 
A 25-participant double-blind user study reports win rates of <strong>92% for geometric consistency, 76% for camera control accuracy, and 86% for overall preference<\/strong> versus Wan 2.1.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/world_r1_comparison-1-scaled.png\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1489\" data-attachment-id=\"79419\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/30\/microsoft-researchs-world-r1-uses-flow-grpo-and-3d-aware-rewards-to-inject-geometric-consistency-into-wan-2-1-without-architectural-changes\/world_r1_comparison-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/world_r1_comparison-1-scaled.png\" data-orig-size=\"2560,1489\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"world_r1_comparison\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/world_r1_comparison-1-1024x596.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/world_r1_comparison-1-scaled.png\" alt=\"\" class=\"wp-image-79419\" \/><\/a><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>RL replaces architectural surgery for 3D consistency.<\/strong> World-R1 post-trains Wan2.1 with Flow-GRPO-Fast instead of bolting on 3D modules or training on 3D-supervised datasets. 
The base architecture and inference cost are unchanged.<\/li>\n<li><strong>The reward is analysis-by-synthesis.<\/strong> Each generated video is lifted to a 3D Gaussian Splatting representation via Depth Anything 3, then scored on three axes: meta-view plausibility (judged by Qwen3-VL), reconstruction fidelity (1 \u2212 LPIPS), and trajectory alignment \u2014 combined with an HPSv3 aesthetic reward to prevent quality collapse.<\/li>\n<li><strong>Camera control comes from noise warping, not new parameters.<\/strong> Motion tokens in the prompt are turned into camera extrinsics, projected to 2D optical flow, and used to warp the initial latent via Go-with-the-Flow\u2019s discrete noise transport. No CameraCtrl-style adapter required.<\/li>\n<li><strong>Periodic decoupled training prevents reward hacking.<\/strong> Every 100 steps, the 3D reward is suspended and the model is fine-tuned with the aesthetic reward alone on ~500 dynamic prompts. Removing this stage raises PSNR but tanks VBench \u2014 the model collapses into static, easy-to-reconstruct outputs.<\/li>\n<li><strong>The numbers are large and hold up off-pipeline.<\/strong> World-R1-Large gains 7.91 dB PSNR over Wan2.1-T2V-14B, generalizes to 121-frame videos, and improves the reconstruction-independent MVCS metric \u2014 with an 86% overall preference win rate in a 25-participant blind user study.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2604.24764\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/github.com\/microsoft\/World-R1\" target=\"_blank\" rel=\"noreferrer noopener\">Code<\/a> <\/strong>and<strong> <a href=\"https:\/\/microsoft.github.io\/World-R1\/tech.html\" target=\"_blank\" rel=\"noreferrer noopener\">Project Page<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/30\/microsoft-researchs-world-r1-uses-flow-grpo-and-3d-aware-rewards-to-inject-geometric-consistency-into-wan-2-1-without-architectural-changes\/\">Microsoft Research\u2019s World-R1 Uses Flow-GRPO and 3D-Aware Rewards to Inject Geometric Consistency Into Wan 2.1 Without Architectural Changes<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Video foundation models can 
pa&hellip;<\/p>\n","protected":false},"author":1,"featured_media":828,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-827","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/827","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=827"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/827\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/828"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=827"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=827"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=827"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}