{"id":985,"date":"2026-05-27T06:31:02","date_gmt":"2026-05-26T22:31:02","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=985"},"modified":"2026-05-27T06:31:02","modified_gmt":"2026-05-26T22:31:02","slug":"stability-ai-releases-stable-audio-3-a-family-of-fast-latent-diffusion-models-for-audio-generation-and-editing","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=985","title":{"rendered":"Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Stability AI has released open weights for Stable Audio 3 along with a <a href=\"https:\/\/arxiv.org\/pdf\/2605.17991\" target=\"_blank\" rel=\"noreferrer noopener\">technical research paper<\/a>. Stable Audio 3 is a family of latent diffusion models that generate stereo audio at 44.1 kHz. The models support variable-length outputs, inpainting-based editing, and fast inference.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What Is Stable Audio 3?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Stable Audio 3 is a family of three model scales: small, medium, and large. A latent diffusion model generates audio by learning to progressively remove noise from a compressed representation of audio, called a latent. The model learns a mapping from noise to data by training on many (noisy latent, audio) pairs.<\/p>\n<p class=\"wp-block-paragraph\">The three model scales differ in capacity and maximum generation length. All parameter counts below are for the diffusion transformer component only. 
Each model also includes a SAME autoencoder (108M parameters for SAME-S, 852M for SAME-L).<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>small-music<\/strong> \u2014 459M diffusion transformer parameters, up to 2 minutes, music only.<\/li>\n<li><strong>small-sfx<\/strong> \u2014 459M diffusion transformer parameters, up to 2 minutes, sound effects only.<\/li>\n<li><strong>medium<\/strong> \u2014 1.4B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.<\/li>\n<li><strong>large<\/strong> \u2014 2.7B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Open weights for small and medium are available on Hugging Face. Large is available under an enterprise license.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Architecture: Two Components<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Stable Audio 3 has two main components:<\/strong> a semantic-acoustic autoencoder called SAME, and a diffusion transformer that generates latent sequences conditioned on text, duration, and inpainting masks.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2086\" height=\"824\" data-attachment-id=\"80120\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/26\/stability-ai-releases-stable-audio-3-a-family-of-fast-latent-diffusion-models-for-audio-generation-and-editing\/screenshot-2026-05-26-at-3-08-35-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-26-at-3.08.35-PM-1.png\" data-orig-size=\"2086,824\" data-comments-opened=\"0\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-26 at 3.08.35\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-26-at-3.08.35-PM-1-1024x404.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-26-at-3.08.35-PM-1.png\" alt=\"\" class=\"wp-image-80120\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.17991<\/figcaption><\/figure>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>The SAME Autoencoder<\/strong><\/h4>\n<p class=\"wp-block-paragraph\">SAME (Semantically-Aligned Music autoEncoder) converts stereo 44.1 kHz audio into a compact latent representation and back. Its key design parameter is a 4096\u00d7 downsampling ratio \u2014 substantially higher than the 1024\u00d7 to 2048\u00d7 ratios common in prior audio autoencoders. This higher ratio reduces latent sequence lengths enough for long-form generation to run on consumer hardware.<\/p>\n<p class=\"wp-block-paragraph\">SAME achieves its 4096\u00d7 compression through two stages. First, a <strong>patching stage<\/strong> reshapes stereo audio into non-overlapping patches of 256 samples per channel, achieving 256\u00d7 downsampling. Second, a <strong>Transformer Resampling Block (TRB)<\/strong> applies a further 16\u00d7 downsampling using learnable output embeddings interleaved with the input sequence, processed through a transformer. 
The combined output is a 256-dimensional latent sequence at approximately 10.76 Hz for a 44.1 kHz input.<\/p>\n<p class=\"wp-block-paragraph\">The SAME autoencoder is trained with <strong>five loss types<\/strong>: spectral reconstruction, adversarial, diffusion alignment, semantic regression (predicting chroma and interaural level difference), and contrastive latent alignment. These losses push the latent to preserve both acoustic reconstruction quality and semantic structure. A soft-normalisation bottleneck constrains the scale of the latent, providing deterministic encoding.<\/p>\n<p class=\"wp-block-paragraph\">The SAME autoencoder is frozen during diffusion training. Small models use SAME-S (108M parameters, optimized for CPU inference); medium and large use SAME-L (852M parameters).<\/p>\n<h4 class=\"wp-block-heading\"><strong>The Diffusion Transformer<\/strong><\/h4>\n<p class=\"wp-block-paragraph\">The diffusion transformer operates on SAME latents. <strong>Conditioning enters through three pathways:<\/strong><\/p>\n<ol class=\"wp-block-list\">\n<li><strong>Text<\/strong> \u2014 a frozen T5Gemma encoder produces a sequence of 256 embeddings of dimension 768. Short prompts are padded to 256 with a learned embedding; long prompts are truncated.<\/li>\n<li><strong>Duration<\/strong> \u2014 encoded as a Fourier features vector and injected via both Adaptive Layer Normalization (AdaLN) and cross-attention alongside the text prompt.<\/li>\n<li><strong>Inpainting<\/strong> \u2014 a binary mask concatenated with the masked reference audio is projected through a 2-layer MLP and added to the residual stream of each transformer block.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">Each transformer block contains self-attention, cross-attention, local-additive conditioning for inpainting, and a SwiGLU feed-forward network. 
Medium and large use <strong>differential attention<\/strong>, which computes two separate attention maps using two (Q, K) pairs sharing one set of values V, then subtracts one map from the other. This cancels attention patterns that are common to both maps. The transformer prepends <strong>64 learnable memory embeddings<\/strong> before processing each sequence. These provide a global context buffer that every position can attend to, and are removed before computing any loss.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Variable-Length Generation<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Most prior latent diffusion models for audio operate at a fixed maximum sequence length. Generating a short clip still requires running inference at full length, wasting compute on silence. Stable Audio 3 is trained to generate audio at variable lengths natively, using three mechanisms:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Variable-length flash attention and masked loss<\/strong> \u2014 sequences shorter than the batch maximum are right-padded in latent space. Padding positions are excluded from self-attention and from the loss.<\/li>\n<li><strong>Per-element timestep shifts<\/strong> \u2014 longer sequences retain more structure at a given noise level due to redundancy between neighboring elements. To compensate, the noise schedule is shifted toward higher noise levels for longer sequences during training, using a logistic shift parameterized by \u00b5 (interpolating between \u00b5min=0.5 and \u00b5max=1.15 based on sequence length).<\/li>\n<li><strong>Silence augmentation<\/strong> \u2014 the signal region is randomly extended with pre-computed silence embeddings, with the extension length drawn from an exponential distribution averaging 4 seconds. This teaches the model to terminate audio with natural silence.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The practical result is that inference cost scales with output duration. 
Medium generates 20 seconds of audio in approximately 0.62 seconds on an H200. Generating 380 seconds takes 1.31 seconds on the same hardware.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Three-Stage Training Pipeline<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Stage 1 \u2014 Flow Matching Pre-Training.<\/strong> The model learns a velocity field that transports Gaussian noise toward audio latents. Training uses <strong>minibatch optimal transport coupling<\/strong> via Sinkhorn iterations, which pairs each data sample with the closest available noise vector in the batch. This straightens training trajectories and reduces crossing transport paths. Inpainting is trained jointly throughout: at each step, one of three mask types is sampled \u2014 full mask (80%, equivalent to unconditional generation), random segment masks (10%), or a causal prefix mask for continuation (10%).<\/p>\n<p class=\"wp-block-paragraph\"><strong>Stage 2 \u2014 Distillation Warmup.<\/strong> A frozen copy of the flow matching model (teacher) generates 15-step DPM++ trajectories with CFG scale 5. The student is trained for 10,000 steps to map any intermediate noisy state directly to the teacher\u2019s final denoised output in one step, using an MSE loss. This collapses the multi-step ODE into a single-step denoiser. The trade-off is that MSE regression produces outputs that regress toward the conditional mean, reducing fine-grained detail.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Stage 3 \u2014 Adversarial Post-Training.<\/strong> This stage replaces the MSE objective with a relativistic adversarial setup. A discriminator (initialized from the base flow matching model) evaluates the student\u2019s one-step denoised outputs directly against real data. The teacher is discarded entirely at this stage. The generator is trained with two losses: a relativistic adversarial loss (L_R) and a CLAP alignment loss (L_CLAP). 
The discriminator is trained with L_R and a contrastive loss (L_C) that penalizes the discriminator for ignoring text-audio alignment (it is trained to distinguish correctly matched audio-text pairs from shuffled ones). The adversarial setup allows the model to recover the perceptual sharpness that MSE distillation removes.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Inference: Ping-Pong Sampling and No CFG<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The post-trained model can generate audio in a single forward pass. However, single-step generation from pure noise remains difficult. Stable Audio 3 uses <strong>ping-pong sampling<\/strong> at inference: the model denoises to a clean estimate, then adds new noise at a reduced level, then denoises again. This repeats for 8 steps using a logSNR-uniform schedule (N+1 equally-spaced steps in the interval [\u03bbmin, \u03bbmax] = [\u22126.2, 2.0]). The iterative denoise-then-renoise schedule allows each step to correct errors from the previous step.<\/p>\n<p class=\"wp-block-paragraph\">Stable Audio 3 does <strong>not require classifier-free guidance (CFG) at inference<\/strong>. Diffusion models that use CFG run two forward passes per step, one conditional and one unconditional, and extrapolate from the unconditional prediction toward the conditional one. Here, CFG quality gains are internalized during distillation warmup, where the student is trained to match CFG-enhanced teacher trajectories. Text-audio alignment is further reinforced through L_CLAP during adversarial post-training. This eliminates the two-pass-per-step cost of CFG.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Prompt formatting note:<\/strong> All Stable Audio 3 models trained on AudioSparx (small-music, medium, large) require prompt prefixes to function correctly. Music prompts should be prefixed with <code>\"TrackType: Music, VocalType: Instrumental,\"<\/code> and sound effects prompts with <code>\"TrackType: SFX,\"<\/code>. 
<\/p>\n<h2 class=\"wp-block-heading\"><strong>Evaluation Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Instrumental music (Song Describer Dataset, 120s).<\/strong> On FAD (lower is better) and CLAP score (higher is better), large achieves FAD 0.101 \/ CLAP 0.393. Medium achieves FAD 0.107 \/ CLAP 0.390. Stable Audio 2.5 (the internal prior-generation baseline) achieves FAD 0.106 \/ CLAP 0.395. In the listening test, medium and large score higher on musicality (MUS) than Stable Audio 2.5 (4.15 and 4.30 vs. 3.70 out of 5, respectively). Inference time for 120s audio on an H200: 0.45s for small, 0.78s for medium, 0.81s for large. Stable Audio 2.5 takes 0.85s for the same length.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Sound effects (BBC Sound Effects Dataset, 5s).<\/strong> Medium achieves FAD 0.369 \/ CLAP 0.369. The next-best open-weight baselines are Stable Audio Open Small (FAD 0.500 \/ CLAP 0.277) and Stable Audio Open (FAD 0.501 \/ CLAP 0.263). Woosh Flow scores FAD 0.580.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Audio editing (inpainting).<\/strong> The research team evaluates three inpainting settings: single region, two independent regions, and continuation. For music, medium achieves FAD-full of 0.046 on single inpainting and 0.046 on double inpainting. Large achieves 0.047 on both. For continuation, medium achieves FAD-full 0.074 and large achieves 0.071. 
Sound effects results follow a similar pattern; continuation shows higher FAD than inpainting in both domains, which the team attributes to the model having less surrounding audio context to anchor the generation.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Comparison<\/strong><\/h2>\n<div class=\"wrap\">\n<div class=\"tabs\">\n<div class=\"tab active\">Model specs<\/div>\n<div class=\"tab\">Music benchmarks (SDD, 120s)<\/div>\n<div class=\"tab\">SFX benchmarks (BBC, 5s)<\/div>\n<\/div>\n<div class=\"panel active\">\n<div class=\"tscroll\">\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Developer<\/th>\n<th>Released<\/th>\n<th>Architecture<\/th>\n<th>Parameters<\/th>\n<th>Max length<\/th>\n<th>Sample rate<\/th>\n<th>Domain<\/th>\n<th>Open weights<\/th>\n<th>Inpainting<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"sep\">\n<td colspan=\"10\">STABLE AUDIO LINEAGE<\/td>\n<\/tr>\n<tr>\n<td>Stable Audio Open<\/td>\n<td>Stability AI<\/td>\n<td>Jul 2024<\/td>\n<td>Latent diffusion (DiT)<\/td>\n<td>DiT 1057M + AE 156M + T5 109M<\/td>\n<td>47s<\/td>\n<td>44.1kHz stereo<\/td>\n<td>Music + SFX<\/td>\n<td><span class=\"badge\">Yes<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td>Stable Audio Open Small<\/td>\n<td>Stability AI<\/td>\n<td>2024<\/td>\n<td>Latent diffusion (DiT)<\/td>\n<td>Not published<\/td>\n<td>11s<\/td>\n<td>44.1kHz stereo<\/td>\n<td>SFX<\/td>\n<td><span class=\"badge\">Yes<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td>Stable Audio 2.5<\/td>\n<td>Stability AI<\/td>\n<td>Internal<\/td>\n<td>Latent diffusion (DiT)<\/td>\n<td>Not published<\/td>\n<td>190s (3m 10s)<\/td>\n<td>44.1kHz stereo<\/td>\n<td>Music<\/td>\n<td><span class=\"badge-inv\">Not released<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 small-music \u2605<\/b><\/td>\n<td>Stability AI<\/td>\n<td>May 2026<\/td>\n<td>Latent diffusion (SAME + DiT)<\/td>\n<td>DT 459M + SAME-S 108M<\/td>\n<td>2m<\/td>\n<td>44.1kHz stereo<\/td>\n<td>Music only<\/td>\n<td><span 
class=\"badge\">Yes<\/span><\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 small-sfx \u2605<\/b><\/td>\n<td>Stability AI<\/td>\n<td>May 2026<\/td>\n<td>Latent diffusion (SAME + DiT)<\/td>\n<td>DT 459M + SAME-S 108M<\/td>\n<td>2m<\/td>\n<td>44.1kHz stereo<\/td>\n<td>SFX only<\/td>\n<td><span class=\"badge\">Yes<\/span><\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 medium \u2605<\/b><\/td>\n<td>Stability AI<\/td>\n<td>May 2026<\/td>\n<td>Latent diffusion (SAME + DiT)<\/td>\n<td>DT 1.4B + SAME-L 852M<\/td>\n<td>6m 20s<\/td>\n<td>44.1kHz stereo<\/td>\n<td>Music + SFX<\/td>\n<td><span class=\"badge\">Yes<\/span><\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 large \u2605<\/b><\/td>\n<td>Stability AI<\/td>\n<td>May 2026<\/td>\n<td>Latent diffusion (SAME + DiT)<\/td>\n<td>DT 2.7B + SAME-L 852M<\/td>\n<td>6m 20s<\/td>\n<td>44.1kHz stereo<\/td>\n<td>Music + SFX<\/td>\n<td><span class=\"badge-inv\">Enterprise<\/span><\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr class=\"sep\">\n<td colspan=\"10\">COMPETITORS<\/td>\n<\/tr>\n<tr>\n<td>TangoFlux<\/td>\n<td>SUTD \/ NVIDIA \/ Lambda<\/td>\n<td>Dec 2024<\/td>\n<td>Flow matching (DiT + MMDiT)<\/td>\n<td>515M<\/td>\n<td>30s<\/td>\n<td>44.1kHz<\/td>\n<td>SFX<\/td>\n<td><span class=\"badge\">Yes (Apache 2.0)<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td>Woosh Flow<\/td>\n<td>Sony AI<\/td>\n<td>Apr 2026<\/td>\n<td>Flow matching<\/td>\n<td>Not published<\/td>\n<td>5s<\/td>\n<td>Not disclosed<\/td>\n<td>SFX<\/td>\n<td><span class=\"badge\">Yes (MIT)<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td>Woosh DFlow<\/td>\n<td>Sony AI<\/td>\n<td>Apr 2026<\/td>\n<td>Distilled flow matching<\/td>\n<td>Not published<\/td>\n<td>5s<\/td>\n<td>Not disclosed<\/td>\n<td>SFX<\/td>\n<td><span class=\"badge\">Yes (MIT)<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td>DiffRhythm 2<\/td>\n<td>ASLP Lab (NPU)<\/td>\n<td>Oct 2025<\/td>\n<td>Block flow matching (semi-autoregressive)<\/td>\n<td>Not published<\/td>\n<td>210s (3m 
30s)<\/td>\n<td>48kHz output<\/td>\n<td>Music + vocals<\/td>\n<td><span class=\"badge\">Yes<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td>ACE-Step 1.5<\/td>\n<td>ACE Studio \/ StepFun<\/td>\n<td>Jan 2026<\/td>\n<td>Hybrid LM (0.6B\u20134B) + DiT (up to 4B)<\/td>\n<td>LM 0.6B\u20134B + XL DiT 4B<\/td>\n<td>10m<\/td>\n<td>Not disclosed<\/td>\n<td>Music + vocals + lyrics<\/td>\n<td><span class=\"badge\">Yes<\/span><\/td>\n<td>No<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p class=\"note\">\n      <b>\u2605 SA3 rows:<\/b> Parameter counts are for the diffusion transformer (DT) component only; SAME autoencoder params are listed separately. Total model size including SAME: small ~567M, medium ~2.25B, large ~3.55B.<br \/>\n      <b>Stable Audio 2.5<\/b> is an internal Stability AI model not publicly released; included as prior-generation internal baseline from the SA3 paper.<br \/>\n      <b>DiffRhythm 2<\/b> VAE processes 24kHz input audio and reconstructs at 48kHz (arXiv:2510.22950).\n    <\/p>\n<\/div>\n<div class=\"panel\">\n<p class=\"note\"><b>Evaluation setup:<\/b> Song Describer Dataset (SDD), 120s instrumental music generations, H200 GPU. FAD uses LAION-CLAP embeddings (630k-audioset-best.pt). OVL\/REL\/MUS are mean opinion scores (1\u20135) from a 14-participant listening test. Source: SA3 paper Tables 3 and 4. 
<b>Bold + underline<\/b> = best score in column.<\/p>\n<div class=\"tscroll\">\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>FAD \u2193<\/th>\n<th>CLAP \u2191<\/th>\n<th>OVL \u2191 (1\u20135)<\/th>\n<th>REL \u2191 (1\u20135)<\/th>\n<th>MUS \u2191 (1\u20135)<\/th>\n<th>Inference (H200)<\/th>\n<th>Sampler \/ steps<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"sep\">\n<td colspan=\"8\">COMPETITORS<\/td>\n<\/tr>\n<tr>\n<td>DiffRhythm 2<\/td>\n<td>0.293<\/td>\n<td>0.158<\/td>\n<td>3.05 \u00b1 0.94<\/td>\n<td>2.10 \u00b1 1.29<\/td>\n<td>2.60 \u00b1 1.10<\/td>\n<td>3.88s<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<tr>\n<td>ACE-Step 1.5 xl-turbo<\/td>\n<td>0.193<\/td>\n<td>0.321<\/td>\n<td>3.35 \u00b1 1.09<\/td>\n<td>3.30 \u00b1 1.13<\/td>\n<td>3.15 \u00b1 1.31<\/td>\n<td>6.23s<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<tr class=\"sep\">\n<td colspan=\"8\">STABILITY AI \u2014 PRIOR GENERATION<\/td>\n<\/tr>\n<tr>\n<td>Stable Audio 2.5 (internal)<\/td>\n<td>0.106<\/td>\n<td class=\"best\">0.395<\/td>\n<td>3.90 \u00b1 0.79<\/td>\n<td class=\"best\">4.30 \u00b1 0.66<\/td>\n<td>3.70 \u00b1 0.92<\/td>\n<td>0.85s<\/td>\n<td>DPM++ 3M SDE, 8 steps, CFG 6<\/td>\n<\/tr>\n<tr class=\"sep\">\n<td colspan=\"8\">STABLE AUDIO 3 \u2014 POST-TRAINED (8 PING-PONG STEPS, NO CFG)<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 small-music<\/b><\/td>\n<td>0.145<\/td>\n<td>0.393<\/td>\n<td>3.20 \u00b1 0.89<\/td>\n<td>3.60 \u00b1 0.94<\/td>\n<td>3.15 \u00b1 0.81<\/td>\n<td class=\"best\">0.45s<\/td>\n<td>PingPong, 8 steps<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 medium<\/b><\/td>\n<td>0.107<\/td>\n<td>0.390<\/td>\n<td class=\"best\">4.20 \u00b1 0.89<\/td>\n<td>4.25 \u00b1 0.85<\/td>\n<td>4.15 \u00b1 0.93<\/td>\n<td>0.78s<\/td>\n<td>PingPong, 8 steps<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 large<\/b><\/td>\n<td class=\"best\">0.101<\/td>\n<td>0.393<\/td>\n<td>3.95 \u00b1 0.89<\/td>\n<td>3.80 \u00b1 1.11<\/td>\n<td class=\"best\">4.30 \u00b1 0.73<\/td>\n<td>0.81s<\/td>\n<td>PingPong, 8 
steps<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p class=\"note\">\n      <b>FAD:<\/b> Fr\u00e9chet Audio Distance \u2014 lower is better. <b>CLAP:<\/b> cosine similarity between text and audio embeddings \u2014 higher is better.<br \/>\n      <b>OVL<\/b> = overall production quality. <b>REL<\/b> = text relevance. <b>MUS<\/b> = musicality (melody\/harmony coherence).<br \/>\n      ACE-Step 1.5 and DiffRhythm 2 evaluated with instrumental prompts only for fair comparison with SA3 (instrumental-only models). SA3 base flow matching models (50 steps, CFG 7, Euler sampler) are not shown here; see SA3 paper Table 11 for that comparison.\n    <\/p>\n<\/div>\n<div class=\"panel\">\n<p class=\"note\"><b>Evaluation setup:<\/b> BBC Sound Effects Dataset, \u22645s generations matched to reference duration, H200 GPU. FAD uses LAION-CLAP embeddings. OVL\/REL from 14-participant listening test. Source: SA3 paper Table 5. <b>Bold + underline<\/b> = best score in column.<\/p>\n<div class=\"tscroll\">\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>FAD \u2193<\/th>\n<th>CLAP \u2191<\/th>\n<th>OVL \u2191 (1\u20135)<\/th>\n<th>REL \u2191 (1\u20135)<\/th>\n<th>Inference (H200)<\/th>\n<th>Sampler \/ steps<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"sep\">\n<td colspan=\"7\">COMPETITORS<\/td>\n<\/tr>\n<tr>\n<td>TangoFlux<\/td>\n<td>0.760<\/td>\n<td>0.179<\/td>\n<td>2.35 \u00b1 1.04<\/td>\n<td>3.25 \u00b1 1.37<\/td>\n<td>1.90s<\/td>\n<td>Flow matching, 50 steps, CFG 4.5<\/td>\n<\/tr>\n<tr>\n<td>Woosh DFlow<\/td>\n<td>0.619<\/td>\n<td>0.228<\/td>\n<td>3.10 \u00b1 1.25<\/td>\n<td>3.20 \u00b1 1.64<\/td>\n<td class=\"best\">0.06s<\/td>\n<td>Distilled flow, 4 steps<\/td>\n<\/tr>\n<tr>\n<td>Woosh Flow<\/td>\n<td>0.580<\/td>\n<td>0.277<\/td>\n<td>3.45 \u00b1 1.19<\/td>\n<td>3.80 \u00b1 1.28<\/td>\n<td>1.92s<\/td>\n<td>Adaptive ODE (~72 steps avg)<\/td>\n<\/tr>\n<tr class=\"sep\">\n<td colspan=\"7\">STABILITY AI \u2014 PRIOR GENERATION<\/td>\n<\/tr>\n<tr>\n<td>Stable Audio 
Open<\/td>\n<td>0.501<\/td>\n<td>0.263<\/td>\n<td>2.95 \u00b1 1.32<\/td>\n<td>3.30 \u00b1 1.30<\/td>\n<td>12.30s<\/td>\n<td>DPM++ 3M SDE, 100 steps, CFG 7<\/td>\n<\/tr>\n<tr>\n<td>Stable Audio Open Small<\/td>\n<td>0.500<\/td>\n<td>0.277<\/td>\n<td>3.10 \u00b1 1.12<\/td>\n<td>3.55 \u00b1 1.00<\/td>\n<td>0.24s<\/td>\n<td>PingPong, 8 steps<\/td>\n<\/tr>\n<tr class=\"sep\">\n<td colspan=\"7\">STABLE AUDIO 3 \u2014 POST-TRAINED (8 PING-PONG STEPS, NO CFG)<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 small-sfx<\/b><\/td>\n<td>0.395<\/td>\n<td>0.351<\/td>\n<td>3.35 \u00b1 1.39<\/td>\n<td>3.25 \u00b1 1.45<\/td>\n<td>0.41s<\/td>\n<td>PingPong, 8 steps<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 medium<\/b><\/td>\n<td>0.369<\/td>\n<td class=\"best\">0.369<\/td>\n<td class=\"best\">3.65 \u00b1 1.14<\/td>\n<td class=\"best\">3.95 \u00b1 1.23<\/td>\n<td>0.60s<\/td>\n<td>PingPong, 8 steps<\/td>\n<\/tr>\n<tr class=\"sa3\">\n<td><b>SA3 large<\/b><\/td>\n<td class=\"best\">0.358<\/td>\n<td>0.370<\/td>\n<td>3.60 \u00b1 0.94<\/td>\n<td>3.85 \u00b1 1.04<\/td>\n<td>0.64s<\/td>\n<td>PingPong, 8 steps<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p class=\"note\">\n      Woosh DFlow achieves the fastest inference (0.06s) but at a quality cost \u2014 higher FAD than Woosh Flow. SA3 small-sfx, medium, and large all outperform every competitor on FAD and CLAP at the 5s generation length.<br \/>\n      SA3 models do not use classifier-free guidance (CFG) at inference. 
CFG quality gains are internalized during distillation warmup training.\n    <\/p>\n<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Stable Audio 3 is a family of latent diffusion models (small, medium, large) for music and sound effects generation and editing. Small and medium have open weights; large is available under an enterprise license.<\/li>\n<li>A SAME autoencoder with 4096\u00d7 downsampling compresses audio into 256-dimensional latents at ~10.76 Hz, making long-form generation tractable on consumer hardware.<\/li>\n<li>Variable-length generation is natively supported: inference cost scales with requested duration, not a fixed maximum length.<\/li>\n<li>Three-stage training (flow matching \u2192 distillation warmup \u2192 adversarial post-training) enables 8-step inference without classifier-free guidance.<\/li>\n<li>Prompt prefixes (<code>\"TrackType: Music, VocalType: Instrumental,\"<\/code> \/ <code>\"TrackType: SFX,\"<\/code>) are required for AudioSparx-trained model variants.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.17991\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong>, <strong><a href=\"https:\/\/huggingface.co\/collections\/stabilityai\/stable-audio-3\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong>, and <strong><a href=\"https:\/\/github.com\/Stability-AI\/stable-audio-3\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/26\/stability-ai-releases-stable-audio-3-a-family-of-fast-latent-diffusion-models-for-audio-generation-and-editing\/\">Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing<\/a> appeared first on 
<a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Stability AI has released 
open&hellip;<\/p>\n","protected":false},"author":1,"featured_media":986,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-985","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/985","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=985"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/985\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/986"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=985"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=985"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=985"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}