{"id":989,"date":"2026-05-28T08:51:53","date_gmt":"2026-05-28T00:51:53","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=989"},"modified":"2026-05-28T08:51:53","modified_gmt":"2026-05-28T00:51:53","slug":"sakana-ai-proposes-diffusionblocks-a-block-wise-training-framework-that-converts-residual-networks-into-independently-trainable-denoising-modules","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=989","title":{"rendered":"Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Researchers from Sakana AI and the University of Tokyo propose DiffusionBlocks. It trains transformer-based networks one block at a time. Training memory is reduced by a factor of B, where B is the number of blocks. Performance is maintained across diverse architectures.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Memory Problem in Neural Network Training<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">End-to-end backpropagation requires storing intermediate activations across every layer. Memory consumption grows linearly with network depth. As models grow deeper, this becomes a significant training bottleneck.<\/p>\n<p class=\"wp-block-paragraph\">One existing technique, activation checkpointing, reduces activation memory by recomputing activations on demand. However, it does not reduce memory for parameters, gradients, or optimizer states. With the Adam optimizer, each layer requires memory for parameters, gradients, and two optimizer states (momentum and variance). This totals 4 times the parameter size per layer, unchanged by activation checkpointing.<\/p>\n<p class=\"wp-block-paragraph\">Block-wise training offers a different approach. Partitioning a network into B blocks and training each independently reduces memory to roughly 1\/B. The reduction is proportional to the number of blocks. 
The challenge is defining a principled local objective for each block that still produces a globally coherent model.<\/p>\n<p class=\"wp-block-paragraph\">Prior approaches like Hinton\u2019s Forward-Forward algorithm and greedy layer-wise training rely on ad-hoc local objectives. They consistently underperform end-to-end training and are largely limited to classification tasks.<\/p>\n<p class=\"wp-block-paragraph\">DiffusionBlocks addresses both the theoretical gap and the limited applicability of prior methods.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1348\" height=\"846\" data-attachment-id=\"80149\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/sakana-ai-proposes-diffusionblocks-a-block-wise-training-framework-that-converts-residual-networks-into-independently-trainable-denoising-modules\/screenshot-2026-05-27-at-5-51-29-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-5.51.29-PM-1.png\" data-orig-size=\"1348,846\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-27 at 5.51.29\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-5.51.29-PM-1-1024x643.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-5.51.29-PM-1.png\" alt=\"\" class=\"wp-image-80149\" \/><figcaption 
class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2506.14202<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>The Core Idea: Residual Connections as Euler Steps<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The key insight builds on an established connection in the literature. Residual networks update each layer input via <math data-latex=\"z_\\ell = z_{\\ell-1} + f_{\\theta_\\ell}(z_{\\ell-1})\"><semantics><mrow><msub><mi>z<\/mi><mi>\u2113<\/mi><\/msub><mo>=<\/mo><msub><mi>z<\/mi><mrow><mi>\u2113<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><msub><mi>f<\/mi><msub><mi>\u03b8<\/mi><mi>\u2113<\/mi><\/msub><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>z<\/mi><mrow><mi>\u2113<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">z_\\ell = z_{\\ell-1} + f_{\\theta_\\ell}(z_{\\ell-1})<\/annotation><\/semantics><\/math>. This corresponds to Euler discretization of ordinary differential equations.<\/p>\n<p class=\"wp-block-paragraph\">The research team show these updates correspond specifically to the probability flow ODE in score-based diffusion models. 
In the Variance Exploding (VE) formulation, the reverse diffusion process follows:<\/p>\n<p class=\"wp-block-paragraph\"> <math data-latex=\"\\frac{\\mathrm{d}\\mathbf{z}_\\sigma}{\\mathrm{d}\\sigma} = -\\sigma \\nabla_{\\mathbf{z}} \\log p_\\sigma(\\mathbf{z}_\\sigma)\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><msub><mi>\ud835\udc33<\/mi><mi>\u03c3<\/mi><\/msub><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>\u03c3<\/mi><\/mrow><\/mfrac><mo>=<\/mo><mo form=\"prefix\" stretchy=\"false\">\u2212<\/mo><mi>\u03c3<\/mi><msub><mo>\u2207<\/mo><mi>\ud835\udc33<\/mi><\/msub><mrow><mi>log<\/mi><mo>\u2061<\/mo><mspace width=\"0.1667em\"><\/mspace><\/mrow><msub><mi>p<\/mi><mi>\u03c3<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>\ud835\udc33<\/mi><mi>\u03c3<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\frac{\\mathrm{d}\\mathbf{z}_\\sigma}{\\mathrm{d}\\sigma} = -\\sigma \\nabla_{\\mathbf{z}} \\log p_\\sigma(\\mathbf{z}_\\sigma)<\/annotation><\/semantics><\/math><\/p>\n<p class=\"wp-block-paragraph\">Applying Euler discretization to this equation produces an update rule that structurally matches the residual connection update. A stack of residual blocks can be interpreted as discretized denoising steps. The steps span a noise level range <code>[\ud835\udf82<sub>min<\/sub>, \ud835\udf82<sub>max<\/sub>]<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">In score-based diffusion models, the score matching objective can be optimized independently at each noise level. This means each block can be trained independently, using only its own local objective. 
No inter-block communication is needed during training.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Converting a Network: Three Steps<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Converting a standard residual network to DiffusionBlocks requires three modifications<\/strong>:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Block partitioning<\/strong>: Split the L-layer network into B blocks. Each block contains a contiguous group of layers.<\/li>\n<li><strong>Noise range assignment<\/strong>: Define a noise distribution <em>p<\/em><sub>noise<\/sub> and a noise range <code>[\ud835\udf82<sub>min<\/sub>, \ud835\udf82<sub>max<\/sub>]<\/code>. Partition this range into B intervals and assign one interval to each block. The research team recommend a log-normal distribution for <em>p<\/em><sub>noise<\/sub>.<\/li>\n<li><strong>Noise conditioning<\/strong>: Extend each block\u2019s input to include a noisy version of the target. Add noise-level conditioning via AdaLN (Adaptive Layer Normalization). Each block learns to predict the clean target from its noisy version within its assigned noise range.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">During training, a single block is sampled per iteration. The other blocks are not computed. Memory consumption corresponds to L\/B layers, not all L layers.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Equi-probability Partitioning<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">A naive uniform partition divides <code>[\ud835\udf82<sub>min<\/sub>, \ud835\udf82<sub>max<\/sub>]<\/code> into equal intervals. This ignores the varying difficulty of denoising across noise levels. Intermediate noise levels contribute the most to generation quality under the log-normal training distribution.<\/p>\n<p class=\"wp-block-paragraph\">DiffusionBlocks uses equi-probability partitioning instead. 
Boundaries are chosen so each block handles exactly 1\/B of the total probability mass under <em>p<\/em><sub>noise<\/sub>. Blocks assigned to intermediate noise levels receive narrower intervals. Blocks handling extreme noise regions receive wider intervals.<\/p>\n<p class=\"wp-block-paragraph\">In ablation studies on CIFAR-10 using DiT-S\/2, block overlap was disabled to isolate each component. Equi-probability partitioning achieved FID of 38.03 versus 43.53 for uniform partitioning (lower is better). Both used a uniform layer distribution of [4,4,4] across 3 blocks.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Experimental Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The research team evaluated DiffusionBlocks across five architectures spanning three task categories. All results compare DiffusionBlocks (trained block-wise) against the same architecture trained with end-to-end backpropagation.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Architecture<\/th>\n<th>Dataset<\/th>\n<th>Metric<\/th>\n<th>Baseline<\/th>\n<th>DiffusionBlocks<\/th>\n<th>Memory Reduction<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ViT, 12-layer, B=3<\/td>\n<td>CIFAR-100<\/td>\n<td>Accuracy (higher is better)<\/td>\n<td>60.25%<\/td>\n<td>59.30%<\/td>\n<td>3x<\/td>\n<\/tr>\n<tr>\n<td>DiT-S\/2, 12-layer, B=3<\/td>\n<td>CIFAR-10<\/td>\n<td>FID test (lower is better)<\/td>\n<td>39.83<\/td>\n<td>37.20<\/td>\n<td>3x<\/td>\n<\/tr>\n<tr>\n<td>DiT-L\/2, 24-layer, B=3<\/td>\n<td>ImageNet 256\u00d7256<\/td>\n<td>FID test (lower is better)<\/td>\n<td>12.09<\/td>\n<td>10.63<\/td>\n<td>3x<\/td>\n<\/tr>\n<tr>\n<td>MDM, 12-layer, B=3<\/td>\n<td>text8<\/td>\n<td>BPC (lower is better)<\/td>\n<td>1.56<\/td>\n<td>1.45<\/td>\n<td>3x<\/td>\n<\/tr>\n<tr>\n<td>AR Transformer, 12-layer, B=4<\/td>\n<td>LM1B<\/td>\n<td>MAUVE (higher is better)<\/td>\n<td>0.50<\/td>\n<td>0.71<\/td>\n<td>4x<\/td>\n<\/tr>\n<tr>\n<td>AR Transformer, 12-layer, 
B=4<\/td>\n<td>OpenWebText<\/td>\n<td>MAUVE (higher is better)<\/td>\n<td>0.85<\/td>\n<td>0.82<\/td>\n<td>4x<\/td>\n<\/tr>\n<tr>\n<td>Huginn recurrent-depth<\/td>\n<td>LM1B<\/td>\n<td>MAUVE (higher is better)<\/td>\n<td>0.49<\/td>\n<td>0.70<\/td>\n<td>~10x less training compute<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\"><strong>Forward-Forward comparison<\/strong>: On CIFAR-100, the Forward-Forward algorithm achieved only 7.85% accuracy under the same ViT architecture. This highlights the gap between ad-hoc contrastive objectives and the score matching objective used by DiffusionBlocks.<\/p>\n<p class=\"wp-block-paragraph\"><strong>DiT inference efficiency<\/strong>: For diffusion models, each denoising step during inference activates only one block. A 12-layer DiT with B=3 evaluates only 4 layers per denoising step. This is a 3x inference compute reduction versus running all 12 layers.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Huginn training<\/strong>: Huginn applies the same 4-layer block repeatedly. It uses a stochastic recurrence depth averaging 32 iterations. Training uses 8-step truncated backpropagation through time (BPTT). DiffusionBlocks replaces this with a single forward pass per training step. The K-iteration inference procedure is kept unchanged. DiffusionBlocks trains for 15 epochs versus Huginn\u2019s 5 epochs, a 3x longer schedule. The 32x reduction in iterations per training step outweighs that longer schedule, so total compute is reduced by approximately 10x.<\/p>\n<p class=\"wp-block-paragraph\"><strong>OpenWebText results<\/strong>: On OpenWebText, DiffusionBlocks scored a MAUVE of 0.82 versus 0.85 for the baseline. Generative perplexity under Llama-2 was 14.99 versus 15.05. Results on this dataset were mixed, with some metrics slightly worse than the baseline.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Masked diffusion partitioning<\/strong>: For masked diffusion models, block partitioning targets the masking schedule rather than continuous noise levels. 
Each block handles an equal decrement in the unmasking probability alpha(t), ensuring balanced parameter utilization across blocks.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Comparison with NoProp<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/arxiv.org\/abs\/2503.24322\" target=\"_blank\" rel=\"noreferrer noopener\">NoProp<\/a> is a concurrent work that uses a diffusion framework for backpropagation-free training. It is evaluated only on classification tasks using a custom CNN-based architecture. It does not provide a procedure for applying the method to other architectures or tasks.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Continuous-time<\/th>\n<th>Block-wise<\/th>\n<th>Accuracy on CIFAR-100<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Backpropagation<\/td>\n<td>No<\/td>\n<td>No<\/td>\n<td>47.80%<\/td>\n<\/tr>\n<tr>\n<td>NoProp-DT<\/td>\n<td>No<\/td>\n<td>Yes<\/td>\n<td>46.06%<\/td>\n<\/tr>\n<tr>\n<td>NoProp-CT<\/td>\n<td>Yes<\/td>\n<td>No<\/td>\n<td>21.31%<\/td>\n<\/tr>\n<tr>\n<td>NoProp-FM<\/td>\n<td>Yes<\/td>\n<td>No<\/td>\n<td>37.57%<\/td>\n<\/tr>\n<tr>\n<td>DiffusionBlocks (ours)<\/td>\n<td>Yes<\/td>\n<td>Yes<\/td>\n<td>46.88%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">DiffusionBlocks is the only method combining a continuous-time formulation with block-wise training. 
It stays within 1 percentage point of the end-to-end backpropagation baseline.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and Weaknesses<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Principled theoretical grounding via score matching, not ad-hoc local objectives<\/li>\n<li>Works across five distinct architectures without task-specific modifications<\/li>\n<li>B\u00d7 training memory reduction, proportional to the number of blocks<\/li>\n<li>For diffusion models, inference compute is also reduced by B\u00d7 during generation<\/li>\n<li>Equi-probability partitioning significantly outperforms uniform partitioning (FID 38.03 vs 43.53 on CIFAR-10)<\/li>\n<li>Replaces K-iteration BPTT in recurrent-depth models with a single forward pass<\/li>\n<li>Blocks can be trained in parallel across GPUs with zero communication overhead<\/li>\n<li>Moderate block counts (B=2 or B=3) sometimes improve FID over end-to-end training<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>Weaknesses:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Requires matching input and output dimensions; cannot currently be applied to U-Net-style architectures<\/li>\n<li>Validated only on models trained from scratch; fine-tuning of pretrained models is untested<\/li>\n<li>No principled method for selecting optimal block count for a given architecture and task<\/li>\n<li>Adds noise conditioning overhead: aggregated wall time is 0.0543s versus 0.0507s under standard training<\/li>\n<li>On OpenWebText, some metrics are marginally worse than the autoregressive baseline<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"dbs-wrap\">\n<div class=\"dbs-top\">\n<div class=\"dbs-brand\"><span class=\"dbs-brand-dot\"><\/span><span>DiffusionBlocks \u00b7 Sakana AI<\/span><\/div>\n<div class=\"dbs-meta\">ICLR 2026 \u00b7 Block-wise 
Training<\/div>\n<\/div>\n<div class=\"dbs-viewport\">\n<div class=\"dbs-track\">\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">01 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">A Quick Guide<\/div>\n<h1 class=\"dbs-h1\">Training Transformer Networks One Block at a Time<\/h1>\n<hr class=\"dbs-rule\" \/>\n<p class=\"dbs-lead\">Sakana AI and the University of Tokyo propose <em>DiffusionBlocks<\/em>, a framework that partitions transformer-based networks into independently trainable blocks. Training memory is reduced by a factor of B, where B is the number of blocks.<\/p>\n<ul class=\"dbs-tldr\">\n<li>Each block is trained independently via a score matching objective derived from continuous-time diffusion<\/li>\n<li>Residual connections in transformers map to Euler steps of the reverse diffusion process<\/li>\n<li>Validated on ViT, DiT, masked diffusion, autoregressive, and recurrent-depth transformers<\/li>\n<li>For diffusion models, inference also activates only one block per denoising step<\/li>\n<\/ul>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">02 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">The Problem<\/div>\n<h2 class=\"dbs-h2\">Memory Grows Linearly With Network Depth<\/h2>\n<hr class=\"dbs-rule\" \/>\n<p class=\"dbs-body\">End-to-end backpropagation requires storing intermediate activations across every layer. As models grow deeper, memory consumption grows in step.<\/p>\n<p class=\"dbs-body\">Activation checkpointing reduces activation memory by recomputing on demand. It does not reduce memory for parameters, gradients, or optimizer states.<\/p>\n<p class=\"dbs-body\">With Adam, each layer needs memory for parameters, gradients, and two optimizer states (momentum and variance). 
This totals roughly 4x the parameter size per layer.<\/p>\n<div class=\"dbs-stat-row\">\n<div class=\"dbs-stat\">\n<div class=\"dbs-stat-num\">O(L)<\/div>\n<div class=\"dbs-stat-lbl\">Activation memory under end-to-end backprop<\/div>\n<\/div>\n<div class=\"dbs-stat\">\n<div class=\"dbs-stat-num\">4P<\/div>\n<div class=\"dbs-stat-lbl\">Per-layer memory for parameters, gradients, and optimizer states under Adam<\/div>\n<\/div>\n<div class=\"dbs-stat\">\n<div class=\"dbs-stat-num\">O(L\/B)<\/div>\n<div class=\"dbs-stat-lbl\">Memory footprint under DiffusionBlocks training<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">03 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">The Core Idea<\/div>\n<h2 class=\"dbs-h2\">Residual Connections as Euler Steps of Reverse Diffusion<\/h2>\n<hr class=\"dbs-rule\" \/>\n<p class=\"dbs-body\">Residual networks update each layer input via <code>z_l = z_{l-1} + f_{theta_l}(z_{l-1})<\/code>. This corresponds to Euler discretization of an ordinary differential equation.<\/p>\n<p class=\"dbs-body\">The authors show these updates correspond specifically to the probability flow ODE in score-based diffusion models, under the Variance Exploding formulation.<\/p>\n<div class=\"dbs-eq\">dz_sigma \/ d sigma = -sigma \u00b7 grad_z log p_sigma(z_sigma)<\/div>\n<p class=\"dbs-body\">A stack of residual blocks can therefore be interpreted as discretized denoising steps. 
The score matching objective can be optimized independently at each noise level, so each block trains alone.<\/p>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">04 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">Conversion Recipe<\/div>\n<h2 class=\"dbs-h2\">Three Modifications to Any Residual Network<\/h2>\n<hr class=\"dbs-rule\" \/>\n<div class=\"dbs-grid\">\n<div class=\"dbs-card\">\n<div class=\"dbs-card-step\">Step 01<\/div>\n<div class=\"dbs-card-h\">Block Partitioning<\/div>\n<div class=\"dbs-card-b\">Split the L-layer network into B blocks. Each block contains a contiguous group of layers.<\/div>\n<\/div>\n<div class=\"dbs-card\">\n<div class=\"dbs-card-step\">Step 02<\/div>\n<div class=\"dbs-card-h\">Noise Range Assignment<\/div>\n<div class=\"dbs-card-b\">Define a log-normal noise distribution and partition the range into B intervals. Assign one interval to each block.<\/div>\n<\/div>\n<div class=\"dbs-card\">\n<div class=\"dbs-card-step\">Step 03<\/div>\n<div class=\"dbs-card-h\">Noise Conditioning<\/div>\n<div class=\"dbs-card-b\">Extend each block input with a noisy version of the target. Add noise-level conditioning via AdaLN.<\/div>\n<\/div>\n<\/div>\n<p class=\"dbs-body\">During training, one block is sampled per iteration. Other blocks are not computed. Memory corresponds to L\/B layers, not L.<\/p>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">05 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">Partitioning Strategy<\/div>\n<h2 class=\"dbs-h2\">Equi-Probability, Not Uniform, Intervals<\/h2>\n<hr class=\"dbs-rule\" \/>\n<p class=\"dbs-body\">A uniform partition divides the noise range into equal intervals. 
This ignores that intermediate noise levels contribute the most to generation quality.<\/p>\n<p class=\"dbs-body\">DiffusionBlocks chooses boundaries so each block handles exactly 1\/B of the total probability mass under the log-normal training distribution.<\/p>\n<div class=\"dbs-tbl-wrap\">\n<table class=\"dbs-tbl\">\n<thead>\n<tr>\n<th>Partition Strategy<\/th>\n<th>Layer Distribution<\/th>\n<th class=\"dbs-mono\">FID (CIFAR-10)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Uniform<\/td>\n<td>[4, 4, 4]<\/td>\n<td class=\"dbs-mono\">43.53<\/td>\n<\/tr>\n<tr>\n<td class=\"dbs-pos\">Equi-Probability<\/td>\n<td>[4, 4, 4]<\/td>\n<td class=\"dbs-mono dbs-pos\">38.03<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p class=\"dbs-body\">Ablation on DiT-S\/2 with block overlap disabled. Lower FID is better.<\/p>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">06 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">Experimental Results<\/div>\n<h2 class=\"dbs-h2\">Tested Across Five Architectures, Three Task Categories<\/h2>\n<hr class=\"dbs-rule\" \/>\n<div class=\"dbs-tbl-wrap\">\n<table class=\"dbs-tbl\">\n<thead>\n<tr>\n<th>Architecture<\/th>\n<th>Dataset<\/th>\n<th>Metric<\/th>\n<th class=\"dbs-mono\">Baseline<\/th>\n<th class=\"dbs-mono\">DiffusionBlocks<\/th>\n<th>Memory<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"dbs-arch\">ViT, 12L, B=3<\/td>\n<td>CIFAR-100<\/td>\n<td>Accuracy \u2191<\/td>\n<td class=\"dbs-mono\">60.25%<\/td>\n<td class=\"dbs-mono dbs-pos\">59.30%<\/td>\n<td class=\"dbs-pos\">3x<\/td>\n<\/tr>\n<tr>\n<td class=\"dbs-arch\">DiT-S\/2, 12L, B=3<\/td>\n<td>CIFAR-10<\/td>\n<td>FID test \u2193<\/td>\n<td class=\"dbs-mono\">39.83<\/td>\n<td class=\"dbs-mono dbs-pos\">37.20<\/td>\n<td class=\"dbs-pos\">3x<\/td>\n<\/tr>\n<tr>\n<td class=\"dbs-arch\">DiT-L\/2, 24L, B=3<\/td>\n<td>ImageNet 256<\/td>\n<td>FID test \u2193<\/td>\n<td class=\"dbs-mono\">12.09<\/td>\n<td class=\"dbs-mono dbs-pos\">10.63<\/td>\n<td 
class=\"dbs-pos\">3x<\/td>\n<\/tr>\n<tr>\n<td class=\"dbs-arch\">MDM, 12L, B=3<\/td>\n<td>text8<\/td>\n<td>BPC \u2193<\/td>\n<td class=\"dbs-mono\">1.56<\/td>\n<td class=\"dbs-mono dbs-pos\">1.45<\/td>\n<td class=\"dbs-pos\">3x<\/td>\n<\/tr>\n<tr>\n<td class=\"dbs-arch\">AR Transformer, B=4<\/td>\n<td>LM1B<\/td>\n<td>MAUVE \u2191<\/td>\n<td class=\"dbs-mono\">0.50<\/td>\n<td class=\"dbs-mono dbs-pos\">0.71<\/td>\n<td class=\"dbs-pos\">4x<\/td>\n<\/tr>\n<tr>\n<td class=\"dbs-arch\">AR Transformer, B=4<\/td>\n<td>OpenWebText<\/td>\n<td>MAUVE \u2191<\/td>\n<td class=\"dbs-mono\">0.85<\/td>\n<td class=\"dbs-mono\">0.82<\/td>\n<td class=\"dbs-pos\">4x<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">07 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">Recurrent-Depth Models<\/div>\n<h2 class=\"dbs-h2\">Huginn: K-Iteration BPTT Becomes a Single Forward Pass<\/h2>\n<hr class=\"dbs-rule\" \/>\n<p class=\"dbs-body\">Huginn applies a 4-layer recurrent block with stochastic recurrence depth averaging 32 iterations during training. Standard training uses 8-step truncated backpropagation through time (BPTT).<\/p>\n<p class=\"dbs-body\">Under DiffusionBlocks, training is a single forward pass per step. 
The K-iteration inference procedure is kept unchanged.<\/p>\n<div class=\"dbs-stat-row\">\n<div class=\"dbs-stat\">\n<div class=\"dbs-stat-num\">0.70<\/div>\n<div class=\"dbs-stat-lbl\">MAUVE on LM1B (vs 0.49 baseline)<\/div>\n<\/div>\n<div class=\"dbs-stat\">\n<div class=\"dbs-stat-num\">16.08<\/div>\n<div class=\"dbs-stat-lbl\">Perplexity under Llama-2 (vs 17.04 baseline)<\/div>\n<\/div>\n<div class=\"dbs-stat\">\n<div class=\"dbs-stat-num\">~10x<\/div>\n<div class=\"dbs-stat-lbl\">Less total training compute<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">08 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">Comparison with NoProp<\/div>\n<h2 class=\"dbs-h2\">The Only Continuous-Time, Block-Wise Method in the Comparison<\/h2>\n<hr class=\"dbs-rule\" \/>\n<div class=\"dbs-tbl-wrap\">\n<table class=\"dbs-tbl\">\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Continuous-Time<\/th>\n<th>Block-Wise<\/th>\n<th class=\"dbs-mono\">CIFAR-100 Accuracy<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Backpropagation<\/td>\n<td>No<\/td>\n<td>No<\/td>\n<td class=\"dbs-mono\">47.80%<\/td>\n<\/tr>\n<tr>\n<td>NoProp-DT<\/td>\n<td>No<\/td>\n<td>Yes<\/td>\n<td class=\"dbs-mono\">46.06%<\/td>\n<\/tr>\n<tr>\n<td>NoProp-CT<\/td>\n<td>Yes<\/td>\n<td>No<\/td>\n<td class=\"dbs-mono\">21.31%<\/td>\n<\/tr>\n<tr>\n<td>NoProp-FM<\/td>\n<td>Yes<\/td>\n<td>No<\/td>\n<td class=\"dbs-mono\">37.57%<\/td>\n<\/tr>\n<tr>\n<td class=\"dbs-pos\">DiffusionBlocks<\/td>\n<td class=\"dbs-pos\">Yes<\/td>\n<td class=\"dbs-pos\">Yes<\/td>\n<td class=\"dbs-mono dbs-pos\">46.88%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p class=\"dbs-body\">Run on NoProp\u2019s custom CNN architecture for a fair comparison.<\/p>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">09 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">Trade-offs<\/div>\n<h2 class=\"dbs-h2\">Strengths and Current Limitations<\/h2>\n<hr class=\"dbs-rule\" \/>\n<div class=\"dbs-cols\">\n<div 
class=\"dbs-col\">\n<h3>Strengths<\/h3>\n<ul>\n<li>Principled grounding via score matching, not ad-hoc local objectives<\/li>\n<li>B\u00d7 training memory reduction proportional to block count<\/li>\n<li>Works across five distinct architectures unchanged<\/li>\n<li>Inference cost also reduced B\u00d7 for diffusion models<\/li>\n<li>Replaces K-iteration BPTT in recurrent-depth models with a single forward pass<\/li>\n<li>Blocks train in parallel with zero communication overhead<\/li>\n<\/ul>\n<\/div>\n<div class=\"dbs-col dbs-warn\">\n<h3>Limitations<\/h3>\n<ul>\n<li>Requires matching input and output dimensions, so cannot be applied to U-Net<\/li>\n<li>Validated only on models trained from scratch, not via fine-tuning<\/li>\n<li>No principled rule for selecting optimal block count<\/li>\n<li>Adds noise conditioning overhead in wall time<\/li>\n<li>On OpenWebText, some metrics are marginally lower than the baseline<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"dbs-slide\">\n<div class=\"dbs-num\">10 <span>\/<\/span> 10<\/div>\n<div class=\"dbs-kicker\">Read More<\/div>\n<h2 class=\"dbs-h2\">Paper, Code, and Project Page<\/h2>\n<hr class=\"dbs-rule\" \/>\n<p class=\"dbs-body\">Published at ICLR 2026 by Makoto Shing, Masanori Koyama, and Takuya Akiba. 
Full implementation and experimental configurations are open.<\/p>\n<div class=\"dbs-links\">\n<a class=\"dbs-link\" href=\"https:\/\/arxiv.org\/abs\/2506.14202\" target=\"_blank\" rel=\"noopener\">\n<div class=\"dbs-link-l\">\n<div class=\"dbs-link-lbl\">Paper<\/div>\n<div class=\"dbs-link-url\">arxiv.org\/abs\/2506.14202<\/div>\n<\/div>\n<div class=\"dbs-link-arrow\">\u2192<\/div>\n<p><\/p><\/a><br \/>\n<a class=\"dbs-link\" href=\"https:\/\/github.com\/SakanaAI\/DiffusionBlocks\" target=\"_blank\" rel=\"noopener\">\n<div class=\"dbs-link-l\">\n<div class=\"dbs-link-lbl\">Code<\/div>\n<div class=\"dbs-link-url\">github.com\/SakanaAI\/DiffusionBlocks<\/div>\n<\/div>\n<div class=\"dbs-link-arrow\">\u2192<\/div>\n<p><\/p><\/a><br \/>\n<a class=\"dbs-link\" href=\"https:\/\/pub.sakana.ai\/diffusionblocks\/\" target=\"_blank\" rel=\"noopener\">\n<div class=\"dbs-link-l\">\n<div class=\"dbs-link-lbl\">Project Page<\/div>\n<div class=\"dbs-link-url\">pub.sakana.ai\/diffusionblocks<\/div>\n<\/div>\n<div class=\"dbs-link-arrow\">\u2192<\/div>\n<p><\/p><\/a>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"dbs-controls\">\n<button class=\"dbs-btn dbs-prev\" type=\"button\" aria-label=\"Previous slide\"><span class=\"dbs-btn-arrow\">\u2190<\/span> Previous<\/button>\n<div class=\"dbs-dots\">\n<button class=\"dbs-dot dbs-active\" data-idx=\"0\" type=\"button\" aria-label=\"Slide 1\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"1\" type=\"button\" aria-label=\"Slide 2\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"2\" type=\"button\" aria-label=\"Slide 3\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"3\" type=\"button\" aria-label=\"Slide 4\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"4\" type=\"button\" aria-label=\"Slide 5\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"5\" type=\"button\" aria-label=\"Slide 6\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"6\" type=\"button\" aria-label=\"Slide 
7\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"7\" type=\"button\" aria-label=\"Slide 8\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"8\" type=\"button\" aria-label=\"Slide 9\"><\/button><br \/>\n<button class=\"dbs-dot\" data-idx=\"9\" type=\"button\" aria-label=\"Slide 10\"><\/button>\n<\/div>\n<div>\n<span class=\"dbs-counter\"><span class=\"dbs-cur\">01<\/span> \/ 10<\/span><br \/>\n<button class=\"dbs-btn dbs-next\" type=\"button\" aria-label=\"Next slide\">Next <span class=\"dbs-btn-arrow\">\u2192<\/span><\/button>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>DiffusionBlocks partitions residual networks into B independently trainable blocks, reducing training memory by a factor of B<\/li>\n<li>Residual connections in transformers map to Euler steps of the reverse diffusion process, providing a principled local training objective for each block<\/li>\n<li>Equi-probability partitioning assigns equal probability mass per block, not equal noise intervals, improving image generation FID significantly over uniform partitioning<\/li>\n<li>Validated across five architectures: ViT, DiT, masked diffusion, autoregressive, and recurrent-depth transformers<\/li>\n<li>For recurrent-depth models like Huginn, replaces K-iteration BPTT with a single forward pass, reducing total training compute by approximately 10x<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2506.14202\" target=\"_blank\" rel=\"noreferrer noopener\">Research Paper<\/a>, <a href=\"https:\/\/github.com\/SakanaAI\/DiffusionBlocks\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a> <\/strong>and<strong> <a href=\"https:\/\/pub.sakana.ai\/diffusionblocks\/\" target=\"_blank\" 
rel=\"noreferrer noopener\">Technical details<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/sakana-ai-proposes-diffusionblocks-a-block-wise-training-framework-that-converts-residual-networks-into-independently-trainable-denoising-modules\/\">Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Researchers from Sakana AI 
and&hellip;<\/p>\n","protected":false},"author":1,"featured_media":990,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-989","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/989","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=989"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/989\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/990"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=989"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=989"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=989"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}