{"id":597,"date":"2026-03-24T13:53:58","date_gmt":"2026-03-24T05:53:58","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=597"},"modified":"2026-03-24T13:53:58","modified_gmt":"2026-03-24T05:53:58","slug":"yann-lecuns-new-leworldmodel-lewm-research-targets-jepa-collapse-in-pixel-based-predictive-world-modeling","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=597","title":{"rendered":"Yann LeCun\u2019s New LeWorldModel (LeWM) Research Targets JEPA Collapse in Pixel-Based Predictive World Modeling"},"content":{"rendered":"<p>World Models (WMs) are a central framework for developing agents that reason and plan in a compact latent space. However, training these models directly from pixel data often leads to \u2018representation collapse,\u2019 where the model produces redundant embeddings to trivially satisfy prediction objectives. Current approaches attempt to prevent this by relying on complex heuristics: they utilize stop-gradient updates, exponential moving averages (EMA), and frozen pre-trained encoders. 
A research team from Mila &amp; Universit\u00e9 de Montr\u00e9al, New York University, Samsung SAIL and Brown University, including <strong>Yann LeCun<\/strong>, introduced <strong>LeWorldModel (LeWM)<\/strong>, the first JEPA (Joint-Embedding Predictive Architecture) that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Technical Architecture and Objective<\/strong><\/h3>\n<p>LeWM consists of two primary components learned jointly: an <strong>Encoder<\/strong> and a <strong>Predictor<\/strong>.<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Encoder (z<sub>t<\/sub> = enc<sub>\u03b8<\/sub>(o<sub>t<\/sub>)):<\/strong> Maps a raw pixel observation into a compact, low-dimensional latent representation. The implementation uses a <strong>ViT-Tiny<\/strong> architecture (~5M parameters).<\/li>\n<li><strong>Predictor (\u1e91<sub>t+1<\/sub> = pred<sub>\u03b8<\/sub>(z<sub>t<\/sub>, a<sub>t<\/sub>)):<\/strong> A transformer (~10M parameters) that models environment dynamics by predicting future latent states conditioned on actions.<\/li>\n<\/ul>\n<p>The model is optimized using a streamlined objective function consisting of only two loss terms:<\/p>\n<div class=\"wp-block-mathml-mathmlblock\">$$\\mathcal{L}_{\\mathrm{LeWM}} \\triangleq \\mathcal{L}_{\\mathrm{pred}} + \\lambda\\,\\mathrm{SIGReg}(Z)$$\n<\/div>\n<p>The <strong>prediction loss (<em>L<\/em><sub>pred<\/sub>)<\/strong> computes the mean-squared error (MSE) between the predicted and actual consecutive embeddings. 
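The two-term objective can be sketched in a few lines. The following NumPy mock-up is illustrative only, not the paper's implementation: the `sigreg` term here is a simple moment-matching stand-in for the sketched Gaussian test described below, and all function and parameter names are hypothetical.

```python
import numpy as np

def sigreg(z, num_directions=64, rng=None):
    """Stand-in for SIGReg: project embeddings onto random unit
    directions and penalize each 1-D projection's deviation from a
    standard Gaussian via its first two moments (a simplification of
    the Epps-Pulley statistic used in the paper)."""
    rng = np.random.default_rng(0) if rng is None else rng
    dirs = rng.normal(size=(z.shape[1], num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = z @ dirs                                  # (batch, num_directions)
    mean_pen = (proj.mean(axis=0) ** 2).mean()       # target mean 0
    var_pen = ((proj.var(axis=0) - 1.0) ** 2).mean() # target variance 1
    return mean_pen + var_pen

def lewm_loss(z_pred, z_next, z_all, lam=1.0):
    """Two-term objective: next-embedding MSE plus lambda * SIGReg."""
    pred_loss = ((z_pred - z_next) ** 2).mean()
    return pred_loss + lam * sigreg(z_all)
```

Note that a collapsed batch (all embeddings identical) drives the prediction loss to zero but is heavily penalized by the regularizer, which is the division of labor the two terms are meant to implement.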
The <strong>SIGReg (Sketched-Isotropic-Gaussian Regularizer)<\/strong> is the anti-collapse term that enforces feature diversity.<\/p>\n<p>As per the research paper, a <strong>dropout rate of 0.1<\/strong> in the predictor and a projection step (a 1-layer MLP with Batch Normalization) after the encoder are critical for stability and downstream performance.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Efficiency via SIGReg and Sparse Tokenization<\/strong><\/h3>\n<p>Assessing normality in high-dimensional latent spaces is a major scaling challenge. LeWM addresses this using <strong>SIGReg<\/strong>, which leverages the <strong>Cram\u00e9r-Wold theorem<\/strong>: a multivariate distribution matches a target (here, an isotropic Gaussian) if and only if all of its one-dimensional projections match the corresponding projections of that target.<\/p>\n<p>SIGReg projects latent embeddings onto <em><strong>M<\/strong><\/em> random directions and applies the <strong>Epps-Pulley test statistic<\/strong> to each resulting one-dimensional projection. 
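Concretely, the Epps-Pulley statistic for a one-dimensional sample has a closed form derived from the weighted L2 distance between the sample's empirical characteristic function and that of N(0, 1). The sketch below (function and parameter names are illustrative, not taken from the paper's code) applies it along random unit directions in the spirit of the Cramér-Wold reduction:

```python
import numpy as np

def epps_pulley(y):
    """Closed-form statistic for testing y ~ N(0, 1): n times the L2
    distance between the empirical characteristic function of y and
    exp(-t^2/2), weighted by a standard normal density over t."""
    n = len(y)
    diff = y[:, None] - y[None, :]
    pair_term = np.exp(-0.5 * diff**2).sum() / n       # double sum over pairs
    single_term = np.sqrt(2.0) * np.exp(-0.25 * y**2).sum()
    return pair_term - single_term + n / np.sqrt(3.0)

def sketched_gaussian_score(z, num_directions=64, seed=0):
    """Average the 1-D statistic over M random unit directions
    (Cramér-Wold style sketching; parameter names are hypothetical)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(z.shape[1], num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = z @ dirs                                    # (batch, M)
    return float(np.mean([epps_pulley(proj[:, m])
                          for m in range(num_directions)]))
```

A Gaussian batch yields a small score while shifted or collapsed embeddings score far higher, which is what makes the statistic usable as a differentiable anti-collapse penalty.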
Because the regularization weight <strong>\u03bb<\/strong> is the only effective hyperparameter to tune, researchers can optimize it using a <strong>bisection search<\/strong> with <strong><em>O<\/em>(log n) complexity<\/strong>, a significant improvement over the polynomial-time search (O(n<sup>6<\/sup>)) required by previous models like PLDM.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Speed Benchmarks<\/strong><\/h4>\n<p><strong>In the reported setup, LeWM demonstrates high computational efficiency:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Token Efficiency:<\/strong> <strong>LeWM encodes observations using ~200\u00d7 fewer tokens than DINO-WM<\/strong>.<\/li>\n<li><strong>Planning Speed:<\/strong> LeWM achieves <strong>planning up to 48\u00d7 faster than DINO-WM<\/strong> (0.98s vs 47s per planning cycle).<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Latent Space Properties and Physical Understanding<\/strong><\/h3>\n<p>LeWM\u2019s latent space <strong>supports probing of physical quantities and detection of physically implausible events<\/strong>.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Violation-of-Expectation (VoE)<\/strong><\/h4>\n<p>Using a VoE framework, the model was evaluated on its ability to detect \u2018surprise\u2019. <strong>It assigned higher surprise to physical perturbations such as teleportation; visual perturbations produced weaker effects, and cube color changes in OGBench-Cube were not significant<\/strong>.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Emergent Path Straightening<\/strong><\/h4>\n<p>LeWM exhibits <strong>Temporal Latent Path Straightening<\/strong>, where latent trajectories naturally become smoother and more linear over the course of training. 
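The degree of straightening can be quantified; one common choice (the paper's exact metric may differ) is the mean cosine similarity between consecutive latent displacements, which equals 1.0 for a perfectly straight path:

```python
import numpy as np

def temporal_straightness(traj):
    """Straightness of a latent trajectory of shape (T, d): mean cosine
    similarity between consecutive displacement vectors. Returns 1.0
    for a perfectly straight path; an illustrative metric, not the
    paper's definition."""
    deltas = np.diff(traj, axis=0)                       # (T-1, d) displacements
    norms = np.linalg.norm(deltas, axis=1, keepdims=True)
    unit = deltas / np.maximum(norms, 1e-12)             # guard zero steps
    return float((unit[1:] * unit[:-1]).sum(axis=1).mean())
```

Under this metric a latent random walk scores near zero, so an increase over training toward 1.0 indicates the emergent straightening behavior described above.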
Notably, LeWM achieves higher temporal straightness than PLDM despite having no explicit regularizer encouraging this behavior.<\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<td><strong>Feature<\/strong><\/td>\n<td><strong>LeWorldModel (LeWM)<\/strong><\/td>\n<td><strong>PLDM<\/strong><\/td>\n<td><strong>DINO-WM<\/strong><\/td>\n<td><strong>Dreamer \/ TD-MPC<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Training Paradigm<\/strong><\/td>\n<td>Stable End-to-End<\/td>\n<td>End-to-End<\/td>\n<td>Frozen Foundation Encoder<\/td>\n<td>Task-Specific<\/td>\n<\/tr>\n<tr>\n<td><strong>Input Type<\/strong><\/td>\n<td>Raw Pixels<\/td>\n<td>Raw Pixels<\/td>\n<td>Pixels (DINOv2 features)<\/td>\n<td>Rewards \/ Privileged State<\/td>\n<\/tr>\n<tr>\n<td><strong>Loss Terms<\/strong><\/td>\n<td><strong>2<\/strong> (Prediction + SIGReg)<\/td>\n<td><strong>7<\/strong> (VICReg-based)<\/td>\n<td><strong>1<\/strong> (MSE on latents)<\/td>\n<td>Multiple (Task-specific)<\/td>\n<\/tr>\n<tr>\n<td><strong>Tunable Hyperparams<\/strong><\/td>\n<td><strong>1<\/strong> (Effective weight \u03bb)<\/td>\n<td><strong>6<\/strong><\/td>\n<td>N\/A (Fixed by pre-training)<\/td>\n<td>Many (Task-dependent)<\/td>\n<\/tr>\n<tr>\n<td><strong>Planning Speed<\/strong><\/td>\n<td><strong>Up to 48\u00d7 Faster<\/strong><\/td>\n<td>Fast (Compact latents)<\/td>\n<td>Slow (~50\u00d7 slower than LeWM)<\/td>\n<td>Varies (often slow generation)<\/td>\n<\/tr>\n<tr>\n<td><strong>Anti-Collapse<\/strong><\/td>\n<td><strong>Provable<\/strong> (Gaussian prior)<\/td>\n<td>Under-specified \/ Unstable<\/td>\n<td>Bounded by pre-training<\/td>\n<td>Heuristic (e.g., reconstruction)<\/td>\n<\/tr>\n<tr>\n<td><strong>Requirement<\/strong><\/td>\n<td>Task-Agnostic \/ Reward-Free<\/td>\n<td>Task-Agnostic \/ Reward-Free<\/td>\n<td>Frozen Pre-trained Encoder<\/td>\n<td>Task Signals \/ 
Rewards<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Stable End-to-End Learning:<\/strong> LeWM is the first Joint-Embedding Predictive Architecture (JEPA) that trains stably end-to-end from raw pixels without needing \u2018hand-holding\u2019 heuristics like stop-gradients, exponential moving averages (EMA), or frozen pre-trained encoders.<\/li>\n<li><strong>A Radical Two-Term Objective:<\/strong> The training process is simplified into just two loss terms\u2014a next-embedding prediction loss and the SIGReg regularizer\u2014reducing the number of tunable hyperparameters from six to one compared to existing end-to-end alternatives.<\/li>\n<li><strong>Built for Real-Time Speed:<\/strong> By representing observations with approximately 200\u00d7 fewer tokens than foundation-model-based counterparts, LeWM plans up to 48\u00d7 faster, completing full trajectory optimizations in under one second.<\/li>\n<li><strong>Provable Anti-Collapse:<\/strong> To prevent the model from learning \u2018garbage\u2019 redundant representations, it uses the SIGReg regularizer; this utilizes the Cram\u00e9r-Wold theorem to ensure high-dimensional latent embeddings stay diverse and Gaussian-distributed.<\/li>\n<li><strong>Intrinsic Physical Logic:<\/strong> The model doesn\u2019t just predict data; it captures meaningful physical structure in its latent space, allowing it to accurately probe physical quantities and detect \u2018impossible\u2019 events like object teleportation through a violation-of-expectation framework.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2603.19312v1\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/le-wm.github.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Website<\/a>\u00a0<\/strong>and<strong>\u00a0<a 
href=\"https:\/\/github.com\/lucas-maes\/le-wm\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/23\/yann-lecuns-new-leworldmodel-lewm-research-targets-jepa-collapse-in-pixel-based-predictive-world-modeling\/\">Yann LeCun\u2019s New LeWorldModel (LeWM) Research Targets JEPA Collapse in Pixel-Based Predictive World Modeling<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>World Models (WMs) are a 
centr&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-597","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/597","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=597"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/597\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=597"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=597"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=597"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}